Evan Sadler
03/14/2023, 10:01 PM
The Databricks Flyte tasks launch successfully, but they hang indefinitely after the Databricks job finishes. So far I have tried looking at flytepropeller, but I haven't found any logs relating to the execution id in question. Any tips on how to debug this are much appreciated 🙏
Yee
Kevin Su
03/14/2023, 10:05 PM
Evan Sadler
03/14/2023, 10:11 PM
Kevin Su
03/14/2023, 10:16 PM
kubectl get flyteworkflow -n flytesnacks-development   # list the workflow CRs
kubectl get flyteworkflow atvx5kdcbgfzq862djj4 -n flytesnacks-development -o yaml   # dump the CR for one execution id
Evan Sadler
03/14/2023, 10:27 PM
kubectl get flyteworkflow ff3263f587ca646a2a33 -n flytesnacks-development -o yaml
Kevin Su
03/15/2023, 9:10 PM
curl --netrc --request GET --header "Authorization: Bearer $DATABRICKS_TOKEN" \
  'https://dbc-a53b7a3c-614c.cloud.databricks.com/api/2.0/jobs/runs/get?run_id=70446'
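For anyone who would rather script that check, a minimal Go sketch of the same runs/get status call. The workspace host, token environment variable, and run id here are placeholders, not values confirmed in this thread; the response fields mirror the JSON shown later in the thread.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// runState captures just the status portion of the Databricks
// /api/2.0/jobs/runs/get response.
type runState struct {
	State struct {
		LifeCycleState string `json:"life_cycle_state"`
		ResultState    string `json:"result_state"`
		StateMessage   string `json:"state_message"`
	} `json:"state"`
}

func main() {
	// Placeholders: set DATABRICKS_HOST (e.g. https://dbc-xxxx.cloud.databricks.com)
	// and DATABRICKS_TOKEN, and substitute the run id you are debugging.
	url := os.Getenv("DATABRICKS_HOST") + "/api/2.0/jobs/runs/get?run_id=70446"

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("DATABRICKS_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var rs runState
	if err := json.NewDecoder(resp.Body).Decode(&rs); err != nil {
		panic(err)
	}
	fmt.Printf("life_cycle_state=%s result_state=%s\n",
		rs.State.LifeCycleState, rs.State.ResultState)
}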
Yee
Kevin Su
03/15/2023, 9:15 PM
Evan Sadler
03/15/2023, 9:54 PM
curl --netrc --request GET --header "Authorization: Bearer $DATABRICKS_TOKEN" \
  'https://wbd-dcp-cd-dev.cloud.databricks.com/api/2.0/jobs/runs/get?run_id=1077214'
It looks like the job is returning a success. I am looking at the Go code and it seems like life_cycle_state and result_state are correct. I don't see any changes.
{
  "job_id": 1060228574312365,
  "run_id": 1077214,
  "creator_user_name": "evan.sadler@warnermedia.com",
  "number_in_job": 1077214,
  "state": {
    "life_cycle_state": "TERMINATED",
    "result_state": "SUCCESS",
    "state_message": "",
    "user_cancelled_or_timedout": false
  },
  "task": {
    "spark_python_task": {
      "python_file": "dbfs:///FileStore/tables/entrypoint.py",
      "parameters": [
        "pyflyte-fast-execute",
        "--additional-distribution",
        "s3://p13n-flyte-artifacts/flytesnacks/development/MV7ISU63ZHQQZ7ZYKJRF3VXRWI======/scriptmode.tar.gz",
        "--dest-dir",
        ".",
        "--",
        "pyflyte-execute",
        "--inputs",
        "s3://p13n-flyte-artifacts/metadata/propeller/flytesnacks-development-alvq9nr4z86dx92nshfk/n0/data/inputs.pb",
        "--output-prefix",
        "s3://p13n-flyte-artifacts/metadata/propeller/flytesnacks-development-alvq9nr4z86dx92nshfk/n0/data/0",
        "--raw-output-data-prefix",
        "s3://p13n-flyte-artifacts/c7/alvq9nr4z86dx92nshfk-n0-0",
        "--checkpoint-path",
        "s3://p13n-flyte-artifacts/c7/alvq9nr4z86dx92nshfk-n0-0/_flytecheckpoints",
        "--prev-checkpoint",
        "\"\"",
        "--resolver",
        "flytekit.core.python_auto_container.default_task_resolver",
        "--",
        "task-module",
        "wf_tests.simple_db",
        "task-name",
        "test_task"
      ]
    }
  },
  "cluster_spec": {
    "existing_cluster_id": "0315-172340-xc0uhob5"
  },
  "cluster_instance": {
    "cluster_id": "0315-172340-xc0uhob5",
    "spark_context_id": "4678757461727715067"
  },
  "start_time": 1678901558860,
  "setup_duration": 1000,
  "execution_duration": 22000,
  "cleanup_duration": 0,
  "end_time": 1678901582290,
  "run_name": "test_db",
  "run_page_url": "https://wbd-dcp-cd-dev.cloud.databricks.com/?o=6475167273468992#job/1060228574312365/run/1077214",
  "run_type": "SUBMIT_RUN",
  "attempt_number": 0,
  "format": "SINGLE_TASK"
}
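For context on the life_cycle_state / result_state check being described, a small self-contained Go sketch of how a poller could interpret those two fields from the response above. This is illustrative only, based on the documented Databricks run states, and is not a copy of the flytepropeller Databricks plugin code.

package main

import "fmt"

// isTerminal reports whether a Databricks run has finished, based on the
// terminal life-cycle states.
func isTerminal(lifeCycleState string) bool {
	switch lifeCycleState {
	case "TERMINATED", "SKIPPED", "INTERNAL_ERROR":
		return true
	}
	return false
}

// succeeded reports whether a finished run completed successfully.
func succeeded(lifeCycleState, resultState string) bool {
	return lifeCycleState == "TERMINATED" && resultState == "SUCCESS"
}

func main() {
	// Values taken from the run shown above.
	lc, rs := "TERMINATED", "SUCCESS"
	fmt.Println("terminal:", isTerminal(lc), "succeeded:", succeeded(lc, rs))
	// With these values the poller should mark the task as succeeded rather
	// than keep waiting, which is why the hang points at the propeller/plugin
	// side rather than at Databricks.
}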
I am seeing a potential issue in flytepropeller:
{"json":{"routine":"databricks-worker-0"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is nil, not map[string]interface {}","ts":"2023-03-16T15:23:45Z"}
{"json":{"routine":"databricks-worker-0"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is nil, not map[string]interface {}","ts":"2023-03-16T15:23:45Z"}
{"json":{"routine":"databricks-worker-2"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is nil, not map[string]interface {}","ts":"2023-03-16T15:24:15Z"}
{"json":{"routine":"databricks-worker-2"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is nil, not map[string]interface {}","ts":"2023-03-16T15:24:15Z"}
Kevin Su
03/16/2023, 6:01 PM
Evan Sadler
03/16/2023, 6:25 PM
curl command. I will see about using the new image!