# ask-the-community

Georgi Ivanov

10/10/2023, 4:15 PM
@Eduardo Apolinario (eapolinario) can you tell us why this is not merged to master - https://github.com/flyteorg/flyteplugins/pull/407

Kevin Su

10/10/2023, 4:38 PM
We are using a monorepo now; it has been merged into the Flyte repo

Eduardo Apolinario (eapolinario)

10/10/2023, 5:30 PM
@Georgi Ivanov, this PR was moved to the monorepo in https://github.com/flyteorg/flyte/pull/4132 (which ended up being closed) and finally implemented in https://github.com/flyteorg/flyte/pull/4115

Georgi Ivanov

10/10/2023, 10:58 PM
oh ok
do you guys know when the next Flyte release will be?
is there a chance we can get a build from master to test whether the bug is fixed?
I deployed Flyte 1.9.9 (all components), but the problem with syncing with Databricks is still present
the previous error is no longer there, but now we get a CorruptedPluginState error
Copy code
{"json":{"routine":"databricks-worker-0","src":"cache.go:78"},"level":"debug","msg":"Sync loop - processing resource with cache key [fd06daeb1efbc438390b-n0-0]","ts":"2023-10-11T00:22:53Z"}
{"json":{"routine":"databricks-worker-0","src":"cache.go:101"},"level":"debug","msg":"Querying AsyncPlugin for fd06daeb1efbc438390b-n0-0","ts":"2023-10-11T00:22:53Z"}
{"json":{"routine":"databricks-worker-0","src":"cache.go:104"},"level":"info","msg":"Error retrieving resource [fd06daeb1efbc438390b-n0-0]. Error: [CorruptedPluginState] can't get the job
so I would say that the databricks plugin still does not work

Eduardo Apolinario (eapolinario)

10/11/2023, 12:26 AM
cc: @Kevin Su

Kevin Su

10/11/2023, 12:30 AM
did you see any other errors in the Databricks console or in the propeller pod?

Georgi Ivanov

10/11/2023, 12:47 AM
yes
so in the previous example the Databricks job did not succeed
however I managed to fix this
now the Databricks job succeeds
we still see errors in propeller
for example
Copy code
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "routine": "worker-0",
    "src": "handler.go:181"
  },
  "level": "info",
  "msg": "Processing Workflow.",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:364",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "info",
  "msg": "Handling Workflow [f5661a9a05a7b485b956], id: [project:\"flytesnacks\" domain:\"development\" name:\"f5661a9a05a7b485b956\" ], p [Running]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "start-node",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:1024",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Node has [Succeeded], traversing downstream.",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:826",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Handling downstream Nodes",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:986",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Handling node Status [Running]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:959",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Parallelism criteria not met, Current [0], Max [25]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:740",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Handling Node [n0]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:586",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "node executing, current phase [Running]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:457",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Executing node",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "handler.go:190",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "info",
  "msg": "Dynamic handler.Handle's called with phase 0.",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "handler.go:348",
    "tasktype": "spark",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Plugin [databricks] resolved for Handler type [spark]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "pre_post_execution.go:117",
    "tasktype": "spark",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "info",
  "msg": "Catalog CacheSerializeDisabled: for Task [flytesnacks/development/dbx_simplified_example.print_spark_config/8TWBtAb_p0C_8jXnVr8IyA==]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "secrets.go:41",
    "tasktype": "spark",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "reading secrets from filePath [/vault/secrets/databricks_api_token]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "template.go:87",
    "tasktype": "spark",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Using [f5661a9a05a7b485b956_n0_0] from [f5661a9a05a7b485b956-n0-0]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "launcher.go:23",
    "tasktype": "spark",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "info",
  "msg": "Created Resource Name [f5661a9a05a7b485b956-n0-0] and Meta [&{<nil> xxx.cloud.databricks.com token_xxx}]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "launcher.go:27",
    "tasktype": "spark",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "error",
  "msg": "Failed to check resource status. Error: [SystemError] unknown execution phase [403].",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "handler.go:400",
    "tasktype": "spark",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "warning",
  "msg": "Runtime error from plugin [databricks]. Error: [SystemError] unknown execution phase [403].",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "handler.go:233",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "error",
  "msg": "handling parent node failed with error: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [databricks]: [SystemError] unknown execution phase [403].",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:462",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Node execution round complete",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:595",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "error",
  "msg": "failed Execute for node. Error: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [databricks]: [SystemError] unknown execution phase [403].",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:596",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "node execution completed",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:820",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Completed node [n0]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:395",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "warning",
  "msg": "Error in handling running workflow [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [databricks]: [SystemError] unknown execution phase [403].]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:396",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "info",
  "msg": "Handling Workflow [f5661a9a05a7b485b956] Done",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "handler.go:145",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "error",
  "msg": "Error when trying to reconcile workflow. Error [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [databricks]: [SystemError] unknown execution phase [403].]. Error Type[*errors.NodeErrorWithCause]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "routine": "worker-0",
    "src": "passthrough.go:89"
  },
  "level": "debug",
  "msg": "Observed FlyteWorkflow Update (maybe finalizer)",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "routine": "worker-0",
    "src": "passthrough.go:109"
  },
  "level": "debug",
  "msg": "Updated workflow.",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "src": "controller.go:159"
  },
  "level": "info",
  "msg": "==> Enqueueing workflow [flytesnacks-development/f5661a9a05a7b485b956]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "routine": "worker-0",
    "src": "handler.go:367"
  },
  "level": "info",
  "msg": "Completed processing workflow.",
  "ts": "2023-10-11T00:44:56Z"
}
this is with debug logging
I formatted it with jq
@Kevin Su looks like the error is related to an HTTP 403 returned from Databricks. What kind of permissions does the plugin require to work?

Kevin Su

10/11/2023, 5:16 AM
The token you pass to the propeller should allow you to get the job status, right?
could you help me run this command locally?
Copy code
curl --netrc --request GET --header "Authorization: Bearer $DATABRICKS_TOKEN" \
  'https://<databricks_instance>.cloud.databricks.com/api/2.0/jobs/runs/get?run_id=<job_id>'
use the token you pass to flyte
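(If it is easier to script than to run curl, a rough Go equivalent of the same check is sketched below. The DATABRICKS_HOST and RUN_ID environment variable names are just placeholders chosen for this sketch; the state fields it prints, state.life_cycle_state and state.result_state, are the ones returned by the Jobs runs/get API and discussed later in this thread.)
Copy code
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Placeholders: set these to your workspace URL, a run ID, and the token
	// you configured for the plugin, e.g.
	//   DATABRICKS_HOST=https://xxx.cloud.databricks.com RUN_ID=123 DATABRICKS_TOKEN=... go run main.go
	host := os.Getenv("DATABRICKS_HOST")
	runID := os.Getenv("RUN_ID")
	token := os.Getenv("DATABRICKS_TOKEN")

	req, err := http.NewRequest("GET", host+"/api/2.0/jobs/runs/get?run_id="+runID, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// A 403 here would point at the token/permissions rather than Flyte itself.
	fmt.Println("status:", resp.StatusCode)

	var body struct {
		State struct {
			LifeCycleState string `json:"life_cycle_state"`
			ResultState    string `json:"result_state"`
		} `json:"state"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		panic(err)
	}
	fmt.Println("life_cycle_state:", body.State.LifeCycleState, "result_state:", body.State.ResultState)
}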

Georgi Ivanov

10/11/2023, 9:42 AM
yes, it works
I can use curl or the Databricks CLI and I don't have any issues with those
@Kevin Su if you want you can cut us a build with more verbosity, i.e. dumping all the needed bits from the databricks plugin status method and I can deploy it and see exactly what is broken

Kevin Su

10/11/2023, 4:46 PM
yes, I’ll create a PR today, and share it with you.

Georgi Ivanov

10/11/2023, 9:32 PM
thank you

Kevin Su

10/12/2023, 6:40 AM
I feel like the token becomes empty for some reason.
I just built an image for it: pingsutw/flytepropeller:debug-dbx

Georgi Ivanov

10/12/2023, 8:50 AM
thanks
I will deploy it now

Kevin Su

10/12/2023, 8:59 AM
tyty

Georgi Ivanov

10/12/2023, 10:21 AM
hi Kevin
I just deployed it
I think I may have found the problem
there are several errors in it though
one is related to databricks.ResourceMetaWrapper
however the actual error with the job state I believe is this one
Copy code
{
  "json": {
    "routine": "databricks-worker-0",
    "src": "plugin.go:163"
  },
  "level": "info",
  "msg": "Get databricks job requestreq&{GET <https://xxx.cloud.databricks.com/api/2.0/jobs/runs/get?run_id=2.57406661136039e+14> HTTP/1.1 1 1 map[Authorization:[Bearer xxx] Content-Type:[application/json]] <nil> <nil> 0 [] false <http://xxx.cloud.databricks.com|xxx.cloud.databricks.com> map[] map[] <nil> map[]   <nil> <nil> <nil> 0xc000060020}",
  "ts": "2023-10-12T10:11:26Z"
}
notice how the run ID is passed as a float
the run ID is correct, but it should be passed as a string or an integer, not as a float
this is the response from Databricks
Copy code
{
  "json": {
    "routine": "databricks-worker-4",
    "src": "plugin.go:165"
  },
  "level": "info",
  "msg": "Get databricks job responseresp&{400 Bad Request 400 HTTP/2.0 2 0 map[Content-Type:[application/json] Date:[Thu, 12 Oct 2023 10:10:56 GMT] Server:[databricks] Strict-Transport-Security:[max-age=31536000; includeSubDomains; preload] Vary:[Accept-Encoding] X-Content-Type-Options:[nosniff] X-Databricks-Org-Id:[xxx]] 0xc002455290 -1 [] false true map[] 0xc00538a600 0xc00c5c5ad0}",
  "ts": "2023-10-12T10:10:56Z"
}
and right after I get this
Copy code
{
  "json": {
    "routine": "databricks-worker-4",
    "src": "cache.go:114"
  },
  "level": "info",
  "msg": "Error retrieving resource [f96d0eebbeef841c0ac3-n0-0]. Error: [CorruptedPluginState] can't get the job state",
  "ts": "2023-10-12T10:10:56Z"
}
Copy code
{
  "json": {
    "routine": "databricks-worker-0",
    "src": "plugin.go:155"
  },
  "level": "info",
  "msg": "Get databricks job statusrunID2.57406661136039e+14",
  "ts": "2023-10-12T10:11:26Z"
}
even though exec.RunID is declared as a string in the struct, somehow it is stored as a float
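(For reference, a minimal self-contained Go sketch, not the plugin's actual code, of the likely mechanism: decoding JSON into an untyped map[string]interface{} turns every number into a float64, which %v prints in scientific notation exactly like the 2.57406661136039e+14 in the logs above, whereas json.Number or a typed struct field keeps the digits intact.)
Copy code
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

func main() {
	// A response fragment with a numeric run_id, as Databricks returns it.
	payload := []byte(`{"run_id": 257406661136039}`)

	// Decoding into an untyped map turns every JSON number into a float64,
	// and %v prints large float64 values in scientific notation.
	var untyped map[string]interface{}
	_ = json.Unmarshal(payload, &untyped)
	fmt.Printf("untyped:     %v\n", untyped["run_id"]) // 2.57406661136039e+14

	// Using json.Number (or a struct field typed as int64/string) keeps the digits.
	dec := json.NewDecoder(bytes.NewReader(payload))
	dec.UseNumber()
	var typed map[string]interface{}
	_ = dec.Decode(&typed)
	fmt.Printf("json.Number: %v\n", typed["run_id"]) // 257406661136039
}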

Kevin Su

10/12/2023, 10:30 AM
Ohh, I see. I know how to fix it, will create a PR tomorrow morning.
This is really helpful, thank you so much

Georgi Ivanov

10/12/2023, 10:30 AM
sure
there are also some other errors related to the resource wrapper
do you want me to send them to you?
by email, I guess?

Kevin Su

10/12/2023, 10:31 AM
Sure

Georgi Ivanov

10/12/2023, 10:31 AM
which one shall I use?
or perhaps I can raise a GitHub issue on the Flyte repo
whatever you prefer

Kevin Su

10/12/2023, 10:33 AM
Yes, creating an issue is better, thanks
[flyte-bug]

Georgi Ivanov

10/12/2023, 11:00 AM
ok, I archived some old Databricks workflows and restarted propeller, and I don't see the error messages. I will monitor though, since we are actively testing, and if I spot anything I will raise a bug report
@Kevin Su can you ping me when you are ready for me to test it?

Kevin Su

10/13/2023, 5:58 PM
I just updated the PR, could you try it again? https://github.com/flyteorg/flyte/pull/4206
image: pingsutw/flytepropeller:debug-dbx-v1

Georgi Ivanov

10/13/2023, 7:08 PM
will do in an hour. will ping back to you in this thread.

Kevin Su

10/13/2023, 7:12 PM
tyty

Georgi Ivanov

10/13/2023, 10:47 PM
@Kevin Su are you still there?

Kevin Su

10/13/2023, 10:47 PM
yes

Georgi Ivanov

10/13/2023, 10:47 PM
the plugin can now fetch the job status correctly, however I don't see that it's syncing properly
the run ID appears correct

Kevin Su

10/13/2023, 10:48 PM
the status shown on the UI is not correct?

Georgi Ivanov

10/13/2023, 10:48 PM
what I mean is that the job has failed in Databricks (not due to Flyte), but it appears as running in Flyte
yes
Screenshot 2023-10-13 at 23.46.52.png, Screenshot 2023-10-13 at 23.47.00.png

Kevin Su

10/13/2023, 10:49 PM
do you know the status code returned from Databricks?

Georgi Ivanov

10/13/2023, 10:49 PM
the status code is 200
let me sift through the mountain of logs 😉

Kevin Su

10/13/2023, 10:50 PM
what are the resultState and lifeCycleState?
you could use the CLI to check it
Copy code
curl --netrc --request GET --header "Authorization: Bearer $DATABRICKS_TOKEN" \
  'https://<databricks_instance>.cloud.databricks.com/api/2.0/jobs/runs/get?run_id=<job_id>'

Georgi Ivanov

10/13/2023, 10:51 PM
this is from the logs
checking with curl
I can fetch the job just fine
Copy code
"number_in_job": 1028557138752855,
  "state": {
    "life_cycle_state": "INTERNAL_ERROR",
    "result_state": "FAILED",
do you need more logs?
Untitled
seems like the plugin is not updating the state of the task
No state change for Task, previously observed same transition. Short circuiting
@Kevin Su are you still there?
Untitled
this is the problem
since the job state is INTERNAL_ERROR, we return core.PhaseInfoRunning, i.e. the task is still treated as running. We don't capture this state
I think the http.StatusOK: case should be something like this, because TERMINATED, SKIPPED and INTERNAL_ERROR are the terminal states.
or even like this, to be more concise
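(The snippets attached here are not visible in this export. As a rough, self-contained illustration of the idea, not the actual plugin code or the flyteplugins core API, the sketch below maps the Databricks life-cycle state to a task phase, treating TERMINATED, SKIPPED and INTERNAL_ERROR as terminal when the HTTP status is 200 and using result_state to decide success or failure.)
Copy code
package main

import "fmt"

// Phase is a stand-in for the plugin's phase type in this sketch.
type Phase int

const (
	PhaseRunning Phase = iota
	PhaseSuccess
	PhaseFailure
)

// mapJobState sketches how the http.StatusOK branch could translate the
// Databricks life_cycle_state/result_state pair into a task phase instead
// of reporting "running" for every 200 response.
func mapJobState(lifeCycleState, resultState string) Phase {
	switch lifeCycleState {
	case "TERMINATED", "SKIPPED", "INTERNAL_ERROR": // terminal states
		if resultState == "SUCCESS" {
			return PhaseSuccess
		}
		return PhaseFailure
	default: // PENDING, RUNNING, TERMINATING, ...
		return PhaseRunning
	}
}

func main() {
	fmt.Println(mapJobState("INTERNAL_ERROR", "FAILED")) // 2 (failure), not running
	fmt.Println(mapJobState("RUNNING", ""))              // 0 (still running)
}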