# ask-the-community
g
@Eduardo Apolinario (eapolinario) can you tell us why this is not merged to master - https://github.com/flyteorg/flyteplugins/pull/407
k
We are using a monorepo now; we have merged it into the Flyte repo
e
@Georgi Ivanov, this PR was moved to the monorepo in https://github.com/flyteorg/flyte/pull/4132 (which ended up being closed) and was finally implemented in https://github.com/flyteorg/flyte/pull/4115
g
oh ok
do you guys know when the next flyte release will be
is there a chance we can get a build from master to test if the bug is fixed
I deployed flyte 1.9.9 (all components) but the problem with syncing with databricks is still present
the previous error is no longer there, but we get a CorruptedPluginState error
```
{"json":{"routine":"databricks-worker-0","src":"cache.go:78"},"level":"debug","msg":"Sync loop - processing resource with cache key [fd06daeb1efbc438390b-n0-0]","ts":"2023-10-11T00:22:53Z"}
{"json":{"routine":"databricks-worker-0","src":"cache.go:101"},"level":"debug","msg":"Querying AsyncPlugin for fd06daeb1efbc438390b-n0-0","ts":"2023-10-11T00:22:53Z"}
{"json":{"routine":"databricks-worker-0","src":"cache.go:104"},"level":"info","msg":"Error retrieving resource [fd06daeb1efbc438390b-n0-0]. Error: [CorruptedPluginState] can't get the job
so I would say that the databricks plugin still does not work
e
cc: @Kevin Su
k
did you see any other errors in the Databricks console or in the propeller pod?
g
yes
so in the previous example the databricks job did not succeed
however I managed to fix this
now the databricks job succeeds
we still see errors in propeller
for example
```
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "routine": "worker-0",
    "src": "handler.go:181"
  },
  "level": "info",
  "msg": "Processing Workflow.",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:364",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "info",
  "msg": "Handling Workflow [f5661a9a05a7b485b956], id: [project:\"flytesnacks\" domain:\"development\" name:\"f5661a9a05a7b485b956\" ], p [Running]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "start-node",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:1024",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Node has [Succeeded], traversing downstream.",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:826",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Handling downstream Nodes",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:986",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Handling node Status [Running]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:959",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Parallelism criteria not met, Current [0], Max [25]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:740",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Handling Node [n0]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:586",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "node executing, current phase [Running]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:457",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Executing node",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "handler.go:190",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "info",
  "msg": "Dynamic handler.Handle's called with phase 0.",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "handler.go:348",
    "tasktype": "spark",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Plugin [databricks] resolved for Handler type [spark]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "pre_post_execution.go:117",
    "tasktype": "spark",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "info",
  "msg": "Catalog CacheSerializeDisabled: for Task [flytesnacks/development/dbx_simplified_example.print_spark_config/8TWBtAb_p0C_8jXnVr8IyA==]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "secrets.go:41",
    "tasktype": "spark",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "reading secrets from filePath [/vault/secrets/databricks_api_token]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "template.go:87",
    "tasktype": "spark",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Using [f5661a9a05a7b485b956_n0_0] from [f5661a9a05a7b485b956-n0-0]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "launcher.go:23",
    "tasktype": "spark",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "info",
  "msg": "Created Resource Name [f5661a9a05a7b485b956-n0-0] and Meta [&{<nil> xxx.cloud.databricks.com token_xxx}]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "launcher.go:27",
    "tasktype": "spark",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "error",
  "msg": "Failed to check resource status. Error: [SystemError] unknown execution phase [403].",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "handler.go:400",
    "tasktype": "spark",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "warning",
  "msg": "Runtime error from plugin [databricks]. Error: [SystemError] unknown execution phase [403].",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "handler.go:233",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "error",
  "msg": "handling parent node failed with error: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [databricks]: [SystemError] unknown execution phase [403].",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:462",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Node execution round complete",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:595",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "error",
  "msg": "failed Execute for node. Error: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [databricks]: [SystemError] unknown execution phase [403].",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:596",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "node execution completed",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "node": "n0",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:820",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "debug",
  "msg": "Completed node [n0]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:395",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "warning",
  "msg": "Error in handling running workflow [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [databricks]: [SystemError] unknown execution phase [403].]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "executor.go:396",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "info",
  "msg": "Handling Workflow [f5661a9a05a7b485b956] Done",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "res_ver": "527200620",
    "routine": "worker-0",
    "src": "handler.go:145",
    "wf": "flytesnacks:development:dbx_simplified_example.my_databricks_job"
  },
  "level": "error",
  "msg": "Error when trying to reconcile workflow. Error [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [databricks]: [SystemError] unknown execution phase [403].]. Error Type[*errors.NodeErrorWithCause]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "routine": "worker-0",
    "src": "passthrough.go:89"
  },
  "level": "debug",
  "msg": "Observed FlyteWorkflow Update (maybe finalizer)",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "routine": "worker-0",
    "src": "passthrough.go:109"
  },
  "level": "debug",
  "msg": "Updated workflow.",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "src": "controller.go:159"
  },
  "level": "info",
  "msg": "==> Enqueueing workflow [flytesnacks-development/f5661a9a05a7b485b956]",
  "ts": "2023-10-11T00:44:56Z"
}
{
  "json": {
    "exec_id": "f5661a9a05a7b485b956",
    "ns": "flytesnacks-development",
    "routine": "worker-0",
    "src": "handler.go:367"
  },
  "level": "info",
  "msg": "Completed processing workflow.",
  "ts": "2023-10-11T00:44:56Z"
}
```
this is with debug logging
I formatted it with jq
@Kevin Su it looks like the error is related to an HTTP 403 returned from Databricks. What kind of permissions does the plugin require to work?
k
The token you pass to the propeller should allow you to get the job status, right?
could you help me run this command locally
```
curl --netrc --request GET --header "Authorization: Bearer $DATABRICKS_TOKEN" \
'https://<databricks_instance>.cloud.databricks.com/api/2.0/jobs/runs/get?run_id=<job_id>'
```
use the token you pass to flyte
g
yes, it works
i can use curl or the databricks cli and I don’t have any issues with those
@Kevin Su if you want, you can cut us a build with more verbosity, i.e. one that dumps all the needed bits from the databricks plugin status method, and I can deploy it and see exactly what is broken
k
yes, I’ll create a PR today, and share it with you.
g
thank you
k
I feel like the token becomes empty for some reason.
I just built an image for it: pingsutw/flytepropeller:debug-dbx
g
thanks
i will deploy it now
k
tyty
g
hi Kevin
I just deployed it
i think i may have found the problem
there are several errors in it though
one is related to databricks.ResourceMetaWrapper
however the actual error with the job state I believe is this one
```
{
  "json": {
    "routine": "databricks-worker-0",
    "src": "plugin.go:163"
  },
  "level": "info",
  "msg": "Get databricks job requestreq&{GET <https://xxx.cloud.databricks.com/api/2.0/jobs/runs/get?run_id=2.57406661136039e+14> HTTP/1.1 1 1 map[Authorization:[Bearer xxx] Content-Type:[application/json]] <nil> <nil> 0 [] false <http://xxx.cloud.databricks.com|xxx.cloud.databricks.com> map[] map[] <nil> map[]   <nil> <nil> <nil> 0xc000060020}",
  "ts": "2023-10-12T10:11:26Z"
}
```
notice how the run id is passed as a float
the run id is correct, but it should be passed as a string or an integer, not as a float
this is the response from databricks
```
{
  "json": {
    "routine": "databricks-worker-4",
    "src": "plugin.go:165"
  },
  "level": "info",
  "msg": "Get databricks job responseresp&{400 Bad Request 400 HTTP/2.0 2 0 map[Content-Type:[application/json] Date:[Thu, 12 Oct 2023 10:10:56 GMT] Server:[databricks] Strict-Transport-Security:[max-age=31536000; includeSubDomains; preload] Vary:[Accept-Encoding] X-Content-Type-Options:[nosniff] X-Databricks-Org-Id:[xxx]] 0xc002455290 -1 [] false true map[] 0xc00538a600 0xc00c5c5ad0}",
  "ts": "2023-10-12T10:10:56Z"
}
```
and right after I get this
```
{
  "json": {
    "routine": "databricks-worker-4",
    "src": "cache.go:114"
  },
  "level": "info",
  "msg": "Error retrieving resource [f96d0eebbeef841c0ac3-n0-0]. Error: [CorruptedPluginState] can't get the job state",
  "ts": "2023-10-12T10:10:56Z"
}
```
```
{
  "json": {
    "routine": "databricks-worker-0",
    "src": "plugin.go:155"
  },
  "level": "info",
  "msg": "Get databricks job statusrunID2.57406661136039e+14",
  "ts": "2023-10-12T10:11:26Z"
}
```
even though exec.RunID is declared as a string in the struct, somehow it is stored as a float
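For context, this is standard Go encoding/json behaviour: decoding a JSON response into interface{} turns every number into a float64, and %v formats large float64 values in scientific notation, which matches the run_id above. Below is a minimal, self-contained sketch (not the plugin's actual code) of the effect and two ways to keep the digits intact.
```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

func main() {
	payload := []byte(`{"run_id": 257406661136039}`)

	// Decoding into interface{} turns the JSON number into a float64,
	// and %v prints large float64 values in scientific notation.
	var generic map[string]interface{}
	_ = json.Unmarshal(payload, &generic)
	fmt.Printf("%v\n", generic["run_id"]) // 2.57406661136039e+14

	// json.Decoder.UseNumber keeps the original digits as a json.Number.
	dec := json.NewDecoder(bytes.NewReader(payload))
	dec.UseNumber()
	var precise map[string]interface{}
	_ = dec.Decode(&precise)
	fmt.Println(precise["run_id"].(json.Number)) // 257406661136039

	// A typed struct with an int64 (or string) field is the most direct fix.
	var typed struct {
		RunID int64 `json:"run_id"`
	}
	_ = json.Unmarshal(payload, &typed)
	fmt.Println(typed.RunID) // 257406661136039
}
```
If the plugin round-trips its resource metadata through a generic decode like this before building the request URL, that would explain why a string-typed RunID ends up holding a float-formatted value.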
k
Ohh, I see. I know how to fix it, will create a pr tomorrow morning.
This is really helpful, thank you so much
g
sure
there are also some other errors related to the ResourceMetaWrapper
do you want me to send them to you?
by email, I guess?
k
Sure
g
which one shall I use?
or perhaps I can raise a GitHub issue on the flyte repo
whatever you prefer
k
Yes, creating an issue is better, thanks
[flyte-bug]
g
ok, I archived some old databricks workflows and restarted propeller, and I no longer see the error messages. I will keep monitoring though, since we are actively testing, and if I spot anything I will raise a bug report
@Kevin Su can you ping me when you are ready for me to test it?
k
I just updated the PR, could you try it again. https://github.com/flyteorg/flyte/pull/4206
image: pingsutw/flytepropeller:debug-dbx-v1
g
will do in an hour. will ping back to you in this thread.
k
tyty
g
@Kevin Su are you still there
k
yes
g
the plugin can now fetch the job status correctly, however I don't see the state syncing properly
the run id appears correct
k
the status shown in the UI is not correct?
g
what I mean is that the job has failed in databricks (not due to flyte), but it appears as running in flyte
yes
Screenshot 2023-10-13 at 23.46.52.png, Screenshot 2023-10-13 at 23.47.00.png
k
do you know the status code returned from databricks
g
the status code is 200
let me sift through the mountain of logs 😉
k
what are the resultState and lifeCycleState?
you could use the CLI to check it
```
curl --netrc --request GET --header "Authorization: Bearer $DATABRICKS_TOKEN" \
'https://<databricks_instance>.cloud.databricks.com/api/2.0/jobs/runs/get?run_id=<job_id>'
```
g
this is from the logs
checking with curl
I can fetch the job just fine
```
"number_in_job": 1028557138752855,
  "state": {
    "life_cycle_state": "INTERNAL_ERROR",
    "result_state": "FAILED",
do you need more logs ?
Untitled
seems like the plugin is not updating the state of the task
No state change for Task, previously observed same transition. Short circuiting
@Kevin Su are you still there ?
Untitled
this is the problem
since the job state is INTERNAL_ERROR, we return core.PhaseInfoRunning, i.e. the task is treated as still running. We don't handle this state
I think the http.StatusOK case should handle the terminal states explicitly, since TERMINATED, SKIPPED and INTERNAL_ERROR are the terminal life-cycle states, or even fold them into a single, more concise mapping.
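For illustration, here is a minimal standalone sketch of such a mapping. It uses placeholder types rather than the actual flyteplugins API (all names below are hypothetical), treating TERMINATED, SKIPPED and INTERNAL_ERROR as terminal and everything else as still in flight.
```go
package main

import "fmt"

// Phase is a placeholder standing in for flyteplugins' core.PhaseInfo.
type Phase string

const (
	PhaseRunning Phase = "RUNNING"
	PhaseSuccess Phase = "SUCCEEDED"
	PhaseFailure Phase = "FAILED"
)

// mapDatabricksState is a hypothetical helper: it maps the life_cycle_state /
// result_state pair from /api/2.0/jobs/runs/get to a task phase, treating
// TERMINATED, SKIPPED and INTERNAL_ERROR as terminal states.
func mapDatabricksState(lifeCycleState, resultState string) Phase {
	switch lifeCycleState {
	case "TERMINATED":
		if resultState == "SUCCESS" {
			return PhaseSuccess
		}
		return PhaseFailure
	case "SKIPPED", "INTERNAL_ERROR":
		return PhaseFailure
	default:
		// PENDING, RUNNING, TERMINATING, ... are still in flight.
		return PhaseRunning
	}
}

func main() {
	fmt.Println(mapDatabricksState("INTERNAL_ERROR", "FAILED")) // FAILED
	fmt.Println(mapDatabricksState("TERMINATED", "SUCCESS"))    // SUCCEEDED
	fmt.Println(mapDatabricksState("RUNNING", ""))              // RUNNING
}
```
With a mapping along these lines, an INTERNAL_ERROR run would surface as a failure in Flyte instead of sitting in Running indefinitely.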