Robert Ambrus
07/10/2023, 3:57 PM
Yee
Frank Shen
07/10/2023, 7:54 PM
Kevin Su
07/11/2023, 1:08 AM
curl --netrc --request GET --header "Authorization: Bearer $DATABRICKS_TOKEN" \
'https://dbc-32fcad04-13c2.cloud.databricks.com/api/2.0/jobs/runs/get?run_id=306'
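For reference, the `jobs/runs/get` request above returns a JSON document whose `state` object tells whether the run has finished. The following is a minimal Go sketch of how a client could interpret that response. The type and function names (`runInfo`, `runState`, `parseRun`, `isFinished`) are made up for illustration and are not taken from the Flyte plugin, and the list of terminal `life_cycle_state` values is an assumption based on the Databricks Jobs API documentation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runState mirrors the "state" object of a jobs/runs/get response.
type runState struct {
	LifeCycleState string `json:"life_cycle_state"`
	ResultState    string `json:"result_state"`
	StateMessage   string `json:"state_message"`
}

// runInfo holds only the fields this sketch needs (hypothetical type).
type runInfo struct {
	RunID int64    `json:"run_id"`
	State runState `json:"state"`
}

// parseRun decodes the body of a jobs/runs/get response.
func parseRun(body []byte) (runInfo, error) {
	var r runInfo
	err := json.Unmarshal(body, &r)
	return r, err
}

// isFinished reports whether the run has stopped and, if so, whether it succeeded.
// Assumption: TERMINATED, SKIPPED and INTERNAL_ERROR are the terminal states.
func isFinished(s runState) (done bool, succeeded bool) {
	switch s.LifeCycleState {
	case "TERMINATED", "SKIPPED", "INTERNAL_ERROR":
		return true, s.ResultState == "SUCCESS"
	default:
		return false, false
	}
}

func main() {
	// Trimmed-down sample body with the same state values as the response in this thread.
	body := []byte(`{"run_id": 574539, "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS", "state_message": ""}}`)
	r, err := parseRun(body)
	if err != nil {
		panic(err)
	}
	done, succeeded := isFinished(r.State)
	fmt.Println(r.RunID, done, succeeded)
}
```

With a response like the one posted later in this thread (`TERMINATED` / `SUCCESS`), a poller along these lines would treat the run as finished, so a Flyte task that keeps running suggests the status is not being read or propagated on the Flyte side.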
Robert Ambrus
07/11/2023, 7:21 AM
Kevin Su
07/11/2023, 10:22 AM
loader must define exec_module() when running Databricks task
which version of python are you using
#3855 [BUG] Flyte task keeps running forever when running a Databricks job
btw, Does the databricks job succeed or fail?
Robert Ambrus
07/11/2023, 11:10 AM
"btw, Does the databricks job succeed or fail?"
Databricks job succeeded
"which version of python are you using"
I'm running this job on DBR 11.3 LTS (in both cases), it has Python 3.9.5 (added this info to the ticket also)
"btw, could you try to send a get request to dbx by using curl?"
{
  "attempt_number": 0,
  "cleanup_duration": 0,
  "cluster_instance": {
    "cluster_id": "<my-cluster-id>",
    "spark_context_id": "<my-spark-context-id>"
  },
  "cluster_spec": {
    "existing_cluster_id": "<my-cluster-id>"
  },
  "creator_user_name": "<my-username>",
  "end_time": 1688987784820,
  "execution_duration": 223000,
  "format": "SINGLE_TASK",
  "job_id": 1060720031042619,
  "number_in_job": 574539,
  "run_id": 574539,
  "run_name": "dbx simplified example",
  "run_page_url": "<my-run-page-url>",
  "run_type": "SUBMIT_RUN",
  "setup_duration": 41000,
  "start_time": 1688987520036,
  "state": {
    "life_cycle_state": "TERMINATED",
    "result_state": "SUCCESS",
    "state_message": "",
    "user_cancelled_or_timedout": false
  },
  "task": {
    "spark_python_task": {
      "parameters": [
        "pyflyte-fast-execute",
        "--additional-distribution",
        "s3://<my-s3-bucket>/flytesnacks/development/UMZ6XPNM4L6KL4YALV56QDMSX4======/script_mode.tar.gz",
        "--dest-dir",
        ".",
        "--",
        "pyflyte-execute",
        "--inputs",
        "s3://<my-s3-bucket>/metadata/propeller/flytesnacks-development-ff83ea058624d44ddbe9/n0/data/inputs.pb",
        "--output-prefix",
        "s3://<my-s3-bucket>/metadata/propeller/flytesnacks-development-ff83ea058624d44ddbe9/n0/data/0",
        "--raw-output-data-prefix",
        "s3://<my-s3-bucket>/raw_data/sh/ff83ea058624d44ddbe9-n0-0",
        "--checkpoint-path",
        "s3://<my-s3-bucket>/raw_data/sh/ff83ea058624d44ddbe9-n0-0/_flytecheckpoints",
        "--prev-checkpoint",
        "\"\"",
        "--resolver",
        "flytekit.core.python_auto_container.default_task_resolver",
        "--",
        "task-module",
        "dbx_simplified_example",
        "task-name",
        "print_spark_config"
      ],
      "python_file": "dbfs:/tmp/flyte/entrypoint.py"
    }
  }
}
(added to the ticket also)
"Did you see any error in the propeller pod while running databricks task?"
No, I didn't. It's pretty weird, I also expected some error logs, but haven't seen any - let me double-check
"Did you see any error in the propeller pod while running databricks task?"
It's weird - I've just triggered a run (11/07/2023), but can't see any new logs in flytepropeller. The latest logs are 4 days old.
Kevin Su
07/11/2023, 4:02 PM
"I’ve just triggered a run (11/07/2023)"
so the task is still running? and the databricks job is already completed.
Robert Ambrus
07/12/2023, 7:17 AM
Kevin Su
07/12/2023, 7:24 AM
Robert Ambrus
07/12/2023, 7:24 AM
flytepropeller is responsible for the task management. is there a way to monitor the HTTP traffic between flytepropeller and Databricks?
Kevin Su
07/12/2023, 7:28 AM
"is there a way to monitor the HTTP traffic between"
need to add more logs to the plugin
Robert Ambrus
07/12/2023, 7:30 AM
Kevin Su
07/12/2023, 7:30 AM
Robert Ambrus
07/12/2023, 7:32 AM
flyteadmin logs
Kevin Su
07/12/2023, 7:34 AM
Robert Ambrus
07/12/2023, 7:34 AM
flyteadmin log
Kevin Su
07/12/2023, 8:11 AM
pingsutw/flytepropeller:c04b9260a4f1fe17f30283b470525807357a01ec
Robert Ambrus
07/12/2023, 8:13 AM
flytepropeller image reference in our setup with the one you shared
flytepropeller:
  enabled: true
  manager: false
  # -- Whether to install the flyteworkflows CRD with helm
  createCRDs: true
  # -- Replicas count for Flytepropeller deployment
  replicaCount: 1
  image:
    # -- Docker image for Flytepropeller deployment
    repository: pingsutw/flytepropeller # FLYTEPROPELLER_IMAGE
    tag: c04b9260a4f1fe17f30283b470525807357a01ec # FLYTEPROPELLER_TAG
    pullPolicy: IfNotPresent
Kevin Su
07/12/2023, 8:25 AM
Robert Ambrus
07/12/2023, 8:26 AM
taskCtx.ResourceMeta is initialized
The POST request that creates the job completes successfully, so we can probably presume that this part is fine:
resp, err := p.client.Do(req)
if err != nil {
    return nil, nil, err
}
Probably something goes wrong here, right?
data, err := buildResponse(resp)
if err != nil {
    return nil, nil, err
}
if data["run_id"] == "" {
    return nil, nil, pluginErrors.Wrapf(pluginErrors.RuntimeFailure, err,
        "Unable to fetch statementHandle from http response")
}
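One thing worth noting about the `data["run_id"] == ""` check quoted above: if `data` is a `map[string]interface{}` produced by `encoding/json` (an assumption, since the return type of `buildResponse` is not shown in this thread), the comparison is false both when the key is missing and when `run_id` is a number, so the error branch would not be taken in either case. The sketch below illustrates this with made-up response bodies. `decode` and `hasRunID` are hypothetical helpers, not part of the plugin.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decode unmarshals a JSON object into a generic map, as a stand-in
// for whatever buildResponse does (assumption).
func decode(body string) (map[string]interface{}, error) {
	var data map[string]interface{}
	err := json.Unmarshal([]byte(body), &data)
	return data, err
}

// hasRunID uses the comma-ok lookup so that a missing or null run_id
// is actually detected (hypothetical helper).
func hasRunID(data map[string]interface{}) bool {
	v, ok := data["run_id"]
	return ok && v != nil
}

func main() {
	withID, err := decode(`{"run_id": 574539}`)
	if err != nil {
		panic(err)
	}
	// Made-up body without a run_id, for illustration only.
	withoutID, err := decode(`{"message": "no run id in this body"}`)
	if err != nil {
		panic(err)
	}

	// A missing key yields a nil interface value, which is not equal to "".
	fmt.Println(withoutID["run_id"] == "")
	// A JSON number decodes to float64, which is not equal to "" either.
	fmt.Println(withID["run_id"] == "")
	// The comma-ok form distinguishes the two cases.
	fmt.Println(hasRunID(withID), hasRunID(withoutID))
}
```

Under that assumption, a response without a `run_id` would pass the check silently, which could be one reason no error shows up in the logs at this step.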
It is quite strange that we do not have any errors in the logs - my guess is that these errors should be propagated to the flytepropeller logs. Right?
Georgi Ivanov
07/12/2023, 10:14 AM
flytepropeller-8574c869bb-d8bzv 0/1 Error 2 (23s ago) 33s
flytepropeller-8574c869bb-srwqd 0/1 CrashLoopBackOff 1 (16s ago) 24s
k logs flytepropeller-8574c869bb-srwqd -n flyte
exec /bin/flytepropeller: exec format error
ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), statically linked, Go BuildID=yrkx_WEsTHfXYES1W5qM/G0DQz0Khz26svL-2chDO/BgntAhD_JOMfoTcNhtoP/JaCUuM92wwK3w0KL3Pzo, with debug_info, not stripped
Kevin Su
07/12/2023, 10:16 AM
Georgi Ivanov
07/12/2023, 10:16 AM
Kevin Su
07/12/2023, 10:31 AM
Georgi Ivanov
07/12/2023, 11:11 AM
Robert Ambrus
07/12/2023, 11:33 AM
flytepropeller from the image, we got this in the logs again:
time="2023-07-12T11:24:12Z" level=info msg=------------------------------------------------------------------------
time="2023-07-12T11:24:12Z" level=info msg="App [flytepropeller], Version [unknown], BuildSHA [unknown], BuildTS [2023-07-12 11:24:12.690099108 +0000 UTC m=+0.023105387]"
time="2023-07-12T11:24:12Z" level=info msg=------------------------------------------------------------------------
time="2023-07-12T11:24:12Z" level=info msg="Detected: 8 CPU's\n"
{"json":{},"level":"warning","msg":"defaulting max ttl for workflows to 23 hours, since configured duration is larger than 23 [23]","ts":"2023-07-12T11:24:12Z"}
{"json":{},"level":"warning","msg":"stow configuration section missing, defaulting to legacy s3/minio connection config","ts":"2023-07-12T11:24:12Z"}
I0712 11:24:13.017134 1 leaderelection.go:248] attempting to acquire leader lease flyte/propeller-leader...
I0712 11:24:29.591775 1 leaderelection.go:258] successfully acquired lease flyte/propeller-leader
{"json":{"routine":"databricks-worker-1"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-1"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-2"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-2"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-4"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-4"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-0"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-0"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-6"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-6"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-8"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-8"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-5"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-5"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-9"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-9"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-3"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-3"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-7"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-7"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
flytepropeller logs
flytepropeller is trying to refresh the status of these tasks
Kevin Su
07/12/2023, 2:11 PM
Robert Ambrus
07/12/2023, 2:18 PM
Georgi Ivanov
07/13/2023, 9:56 AM
{"json":{"routine":"databricks-worker-0"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-13T09:51:46Z"}
{"json":{"routine":"databricks-worker-9"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-13T09:51:46Z"}
{"json":{"routine":"databricks-worker-9"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-13T09:51:46Z"}
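The panic message in the logs above is what Go prints when a value stored in an `interface{}` is a struct value but the code asserts it as a pointer to that struct. The following is a minimal sketch of the mismatch and of a non-panicking way to handle both forms. `ResourceMetaWrapper` here is a stand-in with a hypothetical field, and `asWrapper` is an illustrative helper, not the plugin's actual code.

```go
package main

import "fmt"

// ResourceMetaWrapper is a stand-in for the plugin's type; the field is hypothetical.
type ResourceMetaWrapper struct {
	RunID string
}

// asWrapper accepts both the value and the pointer form and returns an
// error instead of panicking on any other type (illustrative helper).
func asWrapper(meta interface{}) (*ResourceMetaWrapper, error) {
	switch m := meta.(type) {
	case *ResourceMetaWrapper:
		return m, nil
	case ResourceMetaWrapper:
		return &m, nil
	default:
		return nil, fmt.Errorf("unexpected resource meta type %T", meta)
	}
}

func main() {
	// The wrapper is stored by value, as the panic message suggests.
	var meta interface{} = ResourceMetaWrapper{RunID: "574539"}

	// A bare meta.(*ResourceMetaWrapper) would panic here with
	// "interface conversion: interface {} is main.ResourceMetaWrapper, not *main.ResourceMetaWrapper".
	// The comma-ok form reports the mismatch instead.
	_, ok := meta.(*ResourceMetaWrapper)
	fmt.Println(ok)

	w, err := asWrapper(meta)
	if err != nil {
		panic(err)
	}
	fmt.Println(w.RunID)
}
```

Whether the right fix is to store a pointer when the resource meta is created or to assert the value type when it is read back depends on the plugin code, which is not shown in this thread.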
Robert Ambrus
07/25/2023, 3:24 PM
worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper is happening?
Kevin Su
09/26/2023, 8:07 PM