Robert Ambrus
07/10/2023, 3:57 PMYee
Yee
Frank Shen
07/10/2023, 7:54 PMKevin Su
07/11/2023, 1:08 AMKevin Su
07/11/2023, 4:21 AMKevin Su
07/11/2023, 6:18 AM
curl --netrc --request GET --header "Authorization: Bearer $DATABRICKS_TOKEN" \
'https://dbc-32fcad04-13c2.cloud.databricks.com/api/2.0/jobs/runs/get?run_id=306'
Kevin Su
07/11/2023, 6:19 AMRobert Ambrus
07/11/2023, 7:21 AMKevin Su
07/11/2023, 10:22 AM
loader must define exec_module() when running Databricks task
Which version of Python are you using?
Kevin Su
07/11/2023, 10:22 AMKevin Su
07/11/2023, 10:25 AM
#3855 [BUG] Flyte task keeps running forever when running a Databricks job
btw, does the Databricks job succeed or fail?
Robert Ambrus
07/11/2023, 11:10 AM
> btw, does the Databricks job succeed or fail?
The Databricks job succeeded.
Robert Ambrus
07/11/2023, 11:12 AM
> which version of Python are you using
I'm running this job on DBR 11.3 LTS (in both cases); it has Python 3.9.5 (I also added this info to the ticket).
Robert Ambrus
07/11/2023, 11:47 AM
> btw, could you try to send a GET request to dbx by using curl?
{
  "attempt_number": 0,
  "cleanup_duration": 0,
  "cluster_instance": {
    "cluster_id": "<my-cluster-id>",
    "spark_context_id": "<my-spark-context-id>"
  },
  "cluster_spec": {
    "existing_cluster_id": "<my-cluster-id>"
  },
  "creator_user_name": "<my-username>",
  "end_time": 1688987784820,
  "execution_duration": 223000,
  "format": "SINGLE_TASK",
  "job_id": 1060720031042619,
  "number_in_job": 574539,
  "run_id": 574539,
  "run_name": "dbx simplified example",
  "run_page_url": "<my-run-page-url>",
  "run_type": "SUBMIT_RUN",
  "setup_duration": 41000,
  "start_time": 1688987520036,
  "state": {
    "life_cycle_state": "TERMINATED",
    "result_state": "SUCCESS",
    "state_message": "",
    "user_cancelled_or_timedout": false
  },
  "task": {
    "spark_python_task": {
      "parameters": [
        "pyflyte-fast-execute",
        "--additional-distribution",
        "s3://<my-s3-bucket>/flytesnacks/development/UMZ6XPNM4L6KL4YALV56QDMSX4======/script_mode.tar.gz",
        "--dest-dir",
        ".",
        "--",
        "pyflyte-execute",
        "--inputs",
        "s3://<my-s3-bucket>/metadata/propeller/flytesnacks-development-ff83ea058624d44ddbe9/n0/data/inputs.pb",
        "--output-prefix",
        "s3://<my-s3-bucket>/metadata/propeller/flytesnacks-development-ff83ea058624d44ddbe9/n0/data/0",
        "--raw-output-data-prefix",
        "s3://<my-s3-bucket>/raw_data/sh/ff83ea058624d44ddbe9-n0-0",
        "--checkpoint-path",
        "s3://<my-s3-bucket>/raw_data/sh/ff83ea058624d44ddbe9-n0-0/_flytecheckpoints",
        "--prev-checkpoint",
        "\"\"",
        "--resolver",
        "flytekit.core.python_auto_container.default_task_resolver",
        "--",
        "task-module",
        "dbx_simplified_example",
        "task-name",
        "print_spark_config"
      ],
      "python_file": "dbfs:/tmp/flyte/entrypoint.py"
    }
  }
}
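For reference, the `state` object in a response like the one above is what a caller would inspect to decide whether the run has finished. The following is a minimal Go sketch that relies only on the field names visible in the JSON. `runResponse`, `runState`, and `classify` are hypothetical names, and the mapping to "succeeded", "failed", and "running" is an illustrative assumption, not the plugin's actual logic.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runState mirrors the "state" object of the runs/get response shown above.
type runState struct {
	LifeCycleState string `json:"life_cycle_state"`
	ResultState    string `json:"result_state"`
	StateMessage   string `json:"state_message"`
}

// runResponse keeps only the fields needed for this sketch (hypothetical type).
type runResponse struct {
	RunID int64    `json:"run_id"`
	State runState `json:"state"`
}

// classify maps a runs/get response body to a coarse status.
// The three status strings are an assumption made for illustration.
func classify(body []byte) (string, error) {
	var r runResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return "", fmt.Errorf("decode runs/get response: %w", err)
	}
	switch r.State.LifeCycleState {
	case "TERMINATED":
		if r.State.ResultState == "SUCCESS" {
			return "succeeded", nil
		}
		return "failed", nil
	case "SKIPPED", "INTERNAL_ERROR":
		return "failed", nil
	default:
		// PENDING, RUNNING, TERMINATING, and anything unrecognized.
		return "running", nil
	}
}

func main() {
	body := []byte(`{"run_id": 574539, "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS", "state_message": ""}}`)
	status, err := classify(body)
	fmt.Println(status, err)
}
```

With the response pasted above, a check like this would report the run as finished, which is consistent with the Databricks side being fine and the problem sitting in how the status is polled or stored.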
(also added to the ticket)
Robert Ambrus
07/11/2023, 11:49 AM
> Did you see any error in the propeller pod while running the Databricks task?
No, I didn't. It's pretty weird: I also expected some error logs but haven't seen any. Let me double-check.
Robert Ambrus
07/11/2023, 3:40 PM
> Did you see any error in the propeller pod while running the Databricks task?
It's weird: I've just triggered a run (11/07/2023), but can't see any new logs in flytepropeller. The latest logs are 4 days old.
Robert Ambrus
07/11/2023, 3:41 PMKevin Su
07/11/2023, 4:02 PM
> I've just triggered a run (11/07/2023)
So the task is still running, and the Databricks job has already completed?
Robert Ambrus
07/12/2023, 7:17 AMKevin Su
07/12/2023, 7:24 AMRobert Ambrus
07/12/2023, 7:24 AM
flytepropeller is responsible for the task management. Is there a way to monitor the HTTP traffic between flytepropeller and Databricks?
Robert Ambrus
07/12/2023, 7:25 AMKevin Su
07/12/2023, 7:28 AM
> is there a way to monitor the HTTP traffic between
We need to add more logs to the plugin.
Kevin Su
07/12/2023, 7:29 AMRobert Ambrus
07/12/2023, 7:30 AMKevin Su
07/12/2023, 7:30 AMRobert Ambrus
07/12/2023, 7:32 AMRobert Ambrus
07/12/2023, 7:32 AMRobert Ambrus
07/12/2023, 7:33 AMRobert Ambrus
07/12/2023, 7:33 AMRobert Ambrus
07/12/2023, 7:33 AM
flyteadmin logs
Robert Ambrus
07/12/2023, 7:34 AMKevin Su
07/12/2023, 7:34 AMRobert Ambrus
07/12/2023, 7:34 AMRobert Ambrus
07/12/2023, 7:35 AM
flyteadmin log
Kevin Su
07/12/2023, 8:11 AM
pingsutw/flytepropeller:c04b9260a4f1fe17f30283b470525807357a01ec
Robert Ambrus
07/12/2023, 8:13 AM
flytepropeller image reference in our setup with the one you shared
Robert Ambrus
07/12/2023, 8:18 AM
flytepropeller:
  enabled: true
  manager: false
  # -- Whether to install the flyteworkflows CRD with helm
  createCRDs: true
  # -- Replicas count for Flytepropeller deployment
  replicaCount: 1
  image:
    # -- Docker image for Flytepropeller deployment
    repository: pingsutw/flytepropeller # FLYTEPROPELLER_IMAGE
    tag: c04b9260a4f1fe17f30283b470525807357a01ec # FLYTEPROPELLER_TAG
    pullPolicy: IfNotPresent
Kevin Su
07/12/2023, 8:25 AMKevin Su
07/12/2023, 8:26 AMRobert Ambrus
07/12/2023, 8:26 AMRobert Ambrus
07/12/2023, 8:55 AM
taskCtx.ResourceMeta is initialized
The POST request that creates the job completes successfully, so we can probably presume that this part works:
resp, err := p.client.Do(req)
if err != nil {
    return nil, nil, err
}
Probably something goes wrong here, right?
data, err := buildResponse(resp)
if err != nil {
    return nil, nil, err
}
if data["run_id"] == "" {
    return nil, nil, pluginErrors.Wrapf(pluginErrors.RuntimeFailure, err,
        "Unable to fetch statementHandle from http response")
}
It is quite strange that we do not have any errors in the logs. My guess is that these errors should be propagated to the flytepropeller logs, right?
Robert Ambrus
07/12/2023, 10:12 AMGeorgi Ivanov
07/12/2023, 10:14 AMGeorgi Ivanov
07/12/2023, 10:14 AMGeorgi Ivanov
07/12/2023, 10:14 AM
flytepropeller-8574c869bb-d8bzv   0/1   Error              2 (23s ago)   33s
flytepropeller-8574c869bb-srwqd   0/1   CrashLoopBackOff   1 (16s ago)   24s
Georgi Ivanov
07/12/2023, 10:15 AM
k logs flytepropeller-8574c869bb-srwqd -n flyte
exec /bin/flytepropeller: exec format error
Georgi Ivanov
07/12/2023, 10:15 AMGeorgi Ivanov
07/12/2023, 10:15 AM
ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), statically linked, Go BuildID=yrkx_WEsTHfXYES1W5qM/G0DQz0Khz26svL-2chDO/BgntAhD_JOMfoTcNhtoP/JaCUuM92wwK3w0KL3Pzo, with debug_info, not stripped
Georgi Ivanov
07/12/2023, 10:15 AMGeorgi Ivanov
07/12/2023, 10:15 AMKevin Su
07/12/2023, 10:16 AMGeorgi Ivanov
07/12/2023, 10:16 AMKevin Su
07/12/2023, 10:31 AMGeorgi Ivanov
07/12/2023, 11:11 AMRobert Ambrus
07/12/2023, 11:33 AM
flytepropeller from the image, we got this in the logs again:
time="2023-07-12T11:24:12Z" level=info msg=------------------------------------------------------------------------
time="2023-07-12T11:24:12Z" level=info msg="App [flytepropeller], Version [unknown], BuildSHA [unknown], BuildTS [2023-07-12 11:24:12.690099108 +0000 UTC m=+0.023105387]"
time="2023-07-12T11:24:12Z" level=info msg=------------------------------------------------------------------------
time="2023-07-12T11:24:12Z" level=info msg="Detected: 8 CPU's\n"
{"json":{},"level":"warning","msg":"defaulting max ttl for workflows to 23 hours, since configured duration is larger than 23 [23]","ts":"2023-07-12T11:24:12Z"}
{"json":{},"level":"warning","msg":"stow configuration section missing, defaulting to legacy s3/minio connection config","ts":"2023-07-12T11:24:12Z"}
I0712 11:24:13.017134 1 leaderelection.go:248] attempting to acquire leader lease flyte/propeller-leader...
I0712 11:24:29.591775 1 leaderelection.go:258] successfully acquired lease flyte/propeller-leader
{"json":{"routine":"databricks-worker-1"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-1"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-2"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-2"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-4"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-4"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-0"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-0"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-6"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-6"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-8"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-8"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-5"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-5"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-9"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-9"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-3"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-3"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-7"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-7"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
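The panic message in these logs is what Go reports when a value stored in an `interface{}` is asserted as a pointer to the same type. The following minimal Go sketch reproduces the same class of error and shows a tolerant alternative. `ResourceMetaWrapper` here is a hypothetical stand-in with an illustrative field, not the plugin's real type, and `asWrapper` is an assumed helper, not the fix that was applied upstream.

```go
package main

import "fmt"

// ResourceMetaWrapper is a hypothetical stand-in for the plugin's type.
type ResourceMetaWrapper struct {
	RunID string
}

// unsafeAssert uses the single-value assertion form, which panics when the
// dynamic type is the value type rather than the pointer type. The panic is
// recovered and returned as an error so it can be printed.
func unsafeAssert(meta interface{}) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%v", r)
		}
	}()
	_ = meta.(*ResourceMetaWrapper)
	return nil
}

// asWrapper accepts either the value or the pointer form and never panics.
func asWrapper(meta interface{}) (*ResourceMetaWrapper, error) {
	switch m := meta.(type) {
	case *ResourceMetaWrapper:
		if m == nil {
			return nil, fmt.Errorf("nil resource meta")
		}
		return m, nil
	case ResourceMetaWrapper:
		return &m, nil
	default:
		return nil, fmt.Errorf("unexpected resource meta type %T", meta)
	}
}

func main() {
	// Stored as a value, like the state the workers appear to be reading back.
	var meta interface{} = ResourceMetaWrapper{RunID: "574539"}

	fmt.Println("unsafe:", unsafeAssert(meta))

	w, err := asWrapper(meta)
	if err != nil {
		fmt.Println("safe:", err)
		return
	}
	fmt.Println("safe:", w.RunID)
}
```

If the state was written by one build as a value and read by another build as a pointer, previously cached tasks could keep triggering this on every sync, which would match the repeated worker errors above.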
Robert Ambrus
07/12/2023, 11:33 AMRobert Ambrus
07/12/2023, 11:35 AMRobert Ambrus
07/12/2023, 11:38 AM
flytepropeller logs
Robert Ambrus
07/12/2023, 11:48 AMRobert Ambrus
07/12/2023, 11:48 AM
flytepropeller is trying to refresh the status of these tasks
Robert Ambrus
07/12/2023, 11:48 AMRobert Ambrus
07/12/2023, 11:49 AMRobert Ambrus
07/12/2023, 12:09 PMRobert Ambrus
07/12/2023, 12:09 PMRobert Ambrus
07/12/2023, 12:10 PMKevin Su
07/12/2023, 2:11 PMKevin Su
07/12/2023, 2:17 PMRobert Ambrus
07/12/2023, 2:18 PMRobert Ambrus
07/12/2023, 2:18 PMGeorgi Ivanov
07/13/2023, 9:56 AMGeorgi Ivanov
07/13/2023, 9:56 AMGeorgi Ivanov
07/13/2023, 9:57 AMGeorgi Ivanov
07/13/2023, 9:57 AM
{"json":{"routine":"databricks-worker-0"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-13T09:51:46Z"}
{"json":{"routine":"databricks-worker-9"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-13T09:51:46Z"}
{"json":{"routine":"databricks-worker-9"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-13T09:51:46Z"}
Georgi Ivanov
07/13/2023, 9:57 AMRobert Ambrus
07/25/2023, 3:24 PM
worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper
is happening?
Kevin Su
09/26/2023, 8:07 PM