#ask-the-community

Evan Sadler

03/14/2023, 10:01 PM
Hello! I am running a Databricks task inside of a dynamic workflow. These have worked for about a week, but for some reason I am now experiencing this issue:
The Databricks Flyte tasks launch successfully, but hang indefinitely after the DB job finishes successfully.
So far I have tried looking at flytepropeller, but I haven't found any logs relating to the execution id in question. Any tips on ways to debug are much appreciated 🙏

Yee

03/14/2023, 10:03 PM
what does the status on the flyte crd say?
yaml dump a hung workflow crd

Kevin Su

03/14/2023, 10:05 PM
maybe the API token has expired?

Evan Sadler

03/14/2023, 10:11 PM
I am testing with a new API token. The job launches, but Flyte just does not register when it finishes. Always good to check 😆 . @Yee how do I yaml dump a hung workflow crd and what exactly does that mean?

Kevin Su

03/14/2023, 10:16 PM
kubectl get flyteworkflow -n flytesnacks-development # list the FlyteWorkflow custom resources
kubectl get flyteworkflow atvx5kdcbgfzq862djj4 -n flytesnacks-development -o yaml # dump one of them as YAML

Evan Sadler

03/14/2023, 10:27 PM
It seems like there is a data issue on my end because one of the output structured datasets has zero rows. I am going to try a different day and see.
I tested on a much simpler example and the job is still hanging.
kubectl get flyteworkflow ff3263f587ca646a2a33 -n flytesnacks-development -o yaml
crd.txt
This is challenging because there are so many moving parts. I am going to test with a demo cluster to see if it works. That should give me a bit more of an isolated environment to work with. It very well could be a change on the DB side.
Okay, I couldn't get that set up given my permissions, but I was able to see that the problem extends to failed jobs:
• Flyte launches DB task -> running on DB -> fails on DB -> does not update state on flyte
• Flyte launches DB task -> running on DB -> succeeds on DB -> does not update state on flyte

Kevin Su

03/15/2023, 9:10 PM
@Evan Sadler could you share the output of the get request for the DB job run, using curl?
curl --netrc --request GET --header "Authorization: Bearer $DATABRICKS_TOKEN" \
'https://dbc-a53b7a3c-614c.cloud.databricks.com/api/2.0/jobs/runs/get?run_id=70446'
Databricks might have changed their JSON response format, in which case propeller couldn't parse the response and transition to the correct state.

Yee

03/15/2023, 9:12 PM
wouldn’t that log an error though?
he’s not seeing anything in propeller logs

Kevin Su

03/15/2023, 9:15 PM
yeah, it should log an error. not sure why propeller didn't show it.

Evan Sadler

03/15/2023, 9:54 PM
Well funny enough the api token cannot find the databricks jobs. It just says "job_id not found" even though it exists in the platform and was kicked off using the API token. Seems like a permissions issue with the api token. I will reach out to the team. Thanks again!
I had an error in my get request, but I fixed it and got the results!
curl --netrc --request GET --header "Authorization: Bearer $DATABRICKS_TOKEN" \
'https://wbd-dcp-cd-dev.cloud.databricks.com/api/2.0/jobs/runs/get?run_id=1077214'
It looks like the job is returning a success. I am looking at the Go code and it seems like life_cycle_state and result_state are correct. I don't see any changes.
{
  "job_id": 1060228574312365,
  "run_id": 1077214,
  "creator_user_name": "evan.sadler@warnermedia.com",
  "number_in_job": 1077214,
  "state": {
    "life_cycle_state": "TERMINATED",
    "result_state": "SUCCESS",
    "state_message": "",
    "user_cancelled_or_timedout": false
  },
  "task": {
    "spark_python_task": {
      "python_file": "dbfs:///FileStore/tables/entrypoint.py",
      "parameters": [
        "pyflyte-fast-execute",
        "--additional-distribution",
        "s3://p13n-flyte-artifacts/flytesnacks/development/MV7ISU63ZHQQZ7ZYKJRF3VXRWI======/scriptmode.tar.gz",
        "--dest-dir",
        ".",
        "--",
        "pyflyte-execute",
        "--inputs",
        "s3://p13n-flyte-artifacts/metadata/propeller/flytesnacks-development-alvq9nr4z86dx92nshfk/n0/data/inputs.pb",
        "--output-prefix",
        "s3://p13n-flyte-artifacts/metadata/propeller/flytesnacks-development-alvq9nr4z86dx92nshfk/n0/data/0",
        "--raw-output-data-prefix",
        "s3://p13n-flyte-artifacts/c7/alvq9nr4z86dx92nshfk-n0-0",
        "--checkpoint-path",
        "s3://p13n-flyte-artifacts/c7/alvq9nr4z86dx92nshfk-n0-0/_flytecheckpoints",
        "--prev-checkpoint",
        "\"\"",
        "--resolver",
        "flytekit.core.python_auto_container.default_task_resolver",
        "--",
        "task-module",
        "wf_tests.simple_db",
        "task-name",
        "test_task"
      ]
    }
  },
  "cluster_spec": {
    "existing_cluster_id": "0315-172340-xc0uhob5"
  },
  "cluster_instance": {
    "cluster_id": "0315-172340-xc0uhob5",
    "spark_context_id": "4678757461727715067"
  },
  "start_time": 1678901558860,
  "setup_duration": 1000,
  "execution_duration": 22000,
  "cleanup_duration": 0,
  "end_time": 1678901582290,
  "run_name": "test_db",
  "run_page_url": "https://wbd-dcp-cd-dev.cloud.databricks.com/?o=6475167273468992#job/1060228574312365/run/1077214",
  "run_type": "SUBMIT_RUN",
  "attempt_number": 0,
  "format": "SINGLE_TASK"
}
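A minimal sketch of how the two fields checked above could be read from this response in Go. This is not the flytepropeller plugin code: the struct names and the `summarize` helper are hypothetical, and only `TERMINATED` is treated as finished here, which ignores other end states the Jobs API can report. Keeping `State` as a pointer makes a response without a `state` object show up as an error instead of as empty strings.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runState mirrors the "state" object in the runs/get response shown above.
type runState struct {
	LifeCycleState string `json:"life_cycle_state"`
	ResultState    string `json:"result_state"`
	StateMessage   string `json:"state_message"`
}

// runResponse keeps State as a pointer so a missing "state" key is detectable.
type runResponse struct {
	RunID int64     `json:"run_id"`
	State *runState `json:"state"`
}

// summarize is a hypothetical helper. It reports whether the run has finished
// and whether it succeeded, or an error if the body cannot be decoded or has
// no "state" object.
func summarize(body []byte) (bool, bool, error) {
	var r runResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return false, false, fmt.Errorf("decode response: %w", err)
	}
	if r.State == nil {
		return false, false, fmt.Errorf("response has no \"state\" object")
	}
	done := r.State.LifeCycleState == "TERMINATED"
	succeeded := done && r.State.ResultState == "SUCCESS"
	return done, succeeded, nil
}

func main() {
	// Trimmed copy of the response above; the other fields are not needed here.
	body := []byte(`{"run_id": 1077214, "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS", "state_message": ""}}`)
	done, succeeded, err := summarize(body)
	fmt.Println(done, succeeded, err) // prints: true true <nil>
}
```

For the response above this reports a finished, successful run, which matches the observation that the Databricks side looks correct.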
I am seeing a potential issue in flyte-propeller:
{"json":{"routine":"databricks-worker-0"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is nil, not map[string]interface {}","ts":"2023-03-16T15:23:45Z"}
{"json":{"routine":"databricks-worker-0"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is nil, not map[string]interface {}","ts":"2023-03-16T15:23:45Z"}
{"json":{"routine":"databricks-worker-2"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is nil, not map[string]interface {}","ts":"2023-03-16T15:24:15Z"}
{"json":{"routine":"databricks-worker-2"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is nil, not map[string]interface {}","ts":"2023-03-16T15:24:15Z"}

Kevin Su

03/16/2023, 6:01 PM
I think for some reason data[state] becomes nil
has anyone deleted that job?
I can add a nil check shortly.
I created a small PR here and built a new propeller image (pingsutw/flytepropeller:dd8fefb904bb0560d74b1d398ad3f2e672e9e2e9). @Evan Sadler could you give it a try?
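For readers following along, the kind of nil check discussed above can be sketched as follows. This is an illustration, not the contents of the PR: it assumes the response is decoded into a generic `map[string]interface{}` and that the panic comes from a single-value type assertion on the `state` key, which is what the `interface conversion: interface {} is nil` message suggests. The `stateOf` helper and the `error_code` body are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// stateOf returns the "state" object of a decoded response, or an error
// when the key is missing, null, or not a JSON object.
func stateOf(data map[string]interface{}) (map[string]interface{}, error) {
	// A single-value assertion, data["state"].(map[string]interface{}),
	// panics when the value is nil. The two-value form reports the failure
	// through ok instead of panicking.
	state, ok := data["state"].(map[string]interface{})
	if !ok {
		return nil, errors.New(`response has no usable "state" object`)
	}
	return state, nil
}

func main() {
	good := map[string]interface{}{
		"state": map[string]interface{}{"life_cycle_state": "TERMINATED"},
	}
	// Hypothetical body without a "state" key, for example an error payload.
	bad := map[string]interface{}{"error_code": "EXAMPLE"}

	s, err := stateOf(good)
	fmt.Println(s["life_cycle_state"], err)

	_, err = stateOf(bad)
	fmt.Println(err)
}
```

Returning an error lets the caller log it and retry or fail the task, rather than having the worker goroutine shut down as in the log lines above.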

Evan Sadler

03/16/2023, 6:25 PM
Thank you sooo much @Kevin Su. The job still exists and it returns JSON when I use the curl command. I will see about using the new image!