# ask-the-community
r
Hi All, Our team's working on Flyte + Databricks integration in an Enterprise workspace, struggling with these issues: https://github.com/flyteorg/flyte/issues/3855 https://github.com/flyteorg/flyte/issues/3853 We are happy to contribute, just need some guidance on how to localize the issue.
y
@Kevin Su when you’re back up, any ideas?
@Frank Shen do you have any ideas? you’ve used this at least briefly right?
f
Hi @Yee, I haven’t used databricks tasks in flyte.
k
looking
@Robert Ambrus Did you see any error in the propeller pod while running databricks task?
btw, could you try to send a get request to dbx by using curl? like
Copy code
curl --netrc --request GET --header "Authorization: Bearer $DATABRICKS_TOKEN" \
'https://dbc-32fcad04-13c2.cloud.databricks.com/api/2.0/jobs/runs/get?run_id=306'
I want to see the json response. databricks probably updated their response spec, so flyte can’t properly parse the response.
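(Aside: a minimal Go sketch of the same check, in case it's easier to script — it reuses the host and run_id from the curl example above and reads the token from DATABRICKS_TOKEN, and it decodes the response the generic way a plugin typically would; this is illustrative, not the actual flyteplugins code.)
Copy code
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Host and run_id taken from the curl example above; the token comes from
	// the DATABRICKS_TOKEN environment variable.
	host := "https://dbc-32fcad04-13c2.cloud.databricks.com"
	token := os.Getenv("DATABRICKS_TOKEN")

	req, err := http.NewRequest(http.MethodGet, host+"/api/2.0/jobs/runs/get?run_id=306", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}

	// Decode into a generic map, so a renamed or missing field in the response
	// spec shows up as a nil lookup rather than a decode error.
	var data map[string]interface{}
	if err := json.Unmarshal(body, &data); err != nil {
		panic(err)
	}
	fmt.Println("run_id:", data["run_id"])
	if state, ok := data["state"].(map[string]interface{}); ok {
		fmt.Println("life_cycle_state:", state["life_cycle_state"], "result_state:", state["result_state"])
	}
}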
r
@Kevin Su sure, let me check
k
loader must define exec_module() when running Databricks task
which version of python are you using
did you see the same error when running workflow locally?
#3855 [BUG] Flyte task keeps running forever when running a Databricks job
btw, does the databricks job succeed or fail?
r
btw, does the databricks job succeed or fail?
Databricks job succeeded
which version of python are you using
I'm running this job on DBR 11.3 LTS (in both cases), it has Python 3.9.5 (added this info to the ticket also)
btw, could you try to send a get request to dbx by using curl?
Copy code
{
  "attempt_number": 0,
  "cleanup_duration": 0,
  "cluster_instance": {
    "cluster_id": "<my-cluster-id>",
    "spark_context_id": "<my-spark-context-id>"
  },
  "cluster_spec": {
    "existing_cluster_id": "<my-cluster-id>"
  },
  "creator_user_name": "<my-username>",
  "end_time": 1688987784820,
  "execution_duration": 223000,
  "format": "SINGLE_TASK",
  "job_id": 1060720031042619,
  "number_in_job": 574539,
  "run_id": 574539,
  "run_name": "dbx simplified example",
  "run_page_url": "<my-run-page-url>",
  "run_type": "SUBMIT_RUN",
  "setup_duration": 41000,
  "start_time": 1688987520036,
  "state": {
    "life_cycle_state": "TERMINATED",
    "result_state": "SUCCESS",
    "state_message": "",
    "user_cancelled_or_timedout": false
  },
  "task": {
    "spark_python_task": {
      "parameters": [
        "pyflyte-fast-execute",
        "--additional-distribution",
        "s3://<my-s3-bucket>/flytesnacks/development/UMZ6XPNM4L6KL4YALV56QDMSX4======/script_mode.tar.gz",
        "--dest-dir",
        ".",
        "--",
        "pyflyte-execute",
        "--inputs",
        "s3://<my-s3-bucket>/metadata/propeller/flytesnacks-development-ff83ea058624d44ddbe9/n0/data/inputs.pb",
        "--output-prefix",
        "s3://<my-s3-bucket>/metadata/propeller/flytesnacks-development-ff83ea058624d44ddbe9/n0/data/0",
        "--raw-output-data-prefix",
        "s3://<my-s3-bucket>/raw_data/sh/ff83ea058624d44ddbe9-n0-0",
        "--checkpoint-path",
        "s3://<my-s3-bucket>/raw_data/sh/ff83ea058624d44ddbe9-n0-0/_flytecheckpoints",
        "--prev-checkpoint",
        "\"\"",
        "--resolver",
        "flytekit.core.python_auto_container.default_task_resolver",
        "--",
        "task-module",
        "dbx_simplified_example",
        "task-name",
        "print_spark_config"
      ],
      "python_file": "dbfs:/tmp/flyte/entrypoint.py"
    }
  }
}
(added to the ticket also)
Did you see any error in the propeller pod while running databricks task?
No, I didn't. It's pretty weird, I also expected some error logs, but haven't seen any - let me double-check
Did you see any error in the propeller pod while running databricks task?
It's weird - I've just triggered a run (11/07/2023), but can't see any new logs in flytepropeller. The latest logs are 4 days old.
could it be a plugin initialization error, or is it something completely different?
k
I’ve just triggered a run (11/07/2023)
so the task is still running, even though the databricks job is already completed?
r
yes, exactly
k
@Robert Ambrus could you check the propeller log again? I want to know if there are any new error messages
r
as I understand, flytepropeller is responsible for the task management. is there a way to monitor the HTTP traffic between flytepropeller and Databricks?
yeah, just triggered a new run, let me check
k
is there a way to monitor the HTTP traffic between
need to add more logs to the plugin
do you remember when you saw this error? was it when you were trying to abort the task? do you see the same error in the new run?
r
hmm, yeah. anyway, based on this code, we should see errors in the logs if the communication is broken
k
yes, you’re right.
r
here's the status of the recent run:
image.png
image.png
no change in flytepropeller logs
in flyteadmin logs I can see log entries related to the run
k
oh, I see. propeller didn’t create the resource metadata for some reason. investigating
r
image.png
this is the flyteadmin log
k
I created a small PR and built an image. could you give it a try?
Copy code
pingsutw/flytepropeller:c04b9260a4f1fe17f30283b470525807357a01ec
r
sure, let us take a look. I guess we just need to override the flytepropeller image reference in our setup with the one you shared
like this, right?
Copy code
flytepropeller:
  enabled: true
  manager: false
  # -- Whether to install the flyteworkflows CRD with helm
  createCRDs: true
  # -- Replicas count for Flytepropeller deployment
  replicaCount: 1
  image:
    # -- Docker image for Flytepropeller deployment
    repository: pingsutw/flytepropeller # FLYTEPROPELLER_IMAGE
    tag: c04b9260a4f1fe17f30283b470525807357a01ec # FLYTEPROPELLER_TAG
    pullPolicy: IfNotPresent
k
yes, correct
or you can directly update the image in the flytepropeller deployment
r
all right
thank you, working on the deployment with my teammates (I don't have the required access). so, this change will basically tell us whether taskCtx.ResourceMeta is initialized. The POST request that creates the job is completed successfully, so we can probably presume that this part works:
Copy code
resp, err := p.client.Do(req)
if err != nil {
	return nil, nil, err
}
Probably something goes wrong here, right?
Copy code
data, err := buildResponse(resp)
if err != nil {
	return nil, nil, err
}
if data["run_id"] == "" {
	return nil, nil, pluginErrors.Wrapf(pluginErrors.RuntimeFailure, err,
		"Unable to fetch statementHandle from http response")
}
It is quite strange that we do not have any errors in the logs - my guess is that these errors should be propagated to the flytepropeller logs. Right?
cc @Georgi Ivanov
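(Aside: buildResponse itself isn't shown above; a rough sketch of what such a helper typically does — an assumption about its shape, not the actual flyteplugins source — which also suggests why a changed response schema could fail silently rather than log an error:)
Copy code
package databrickssketch

import (
	"encoding/json"
	"io"
	"net/http"
)

// Sketch of a buildResponse-style helper: read the HTTP body and decode it
// into a generic map. If Databricks renamed or dropped a field, the later
// data["run_id"] lookup just comes back as a nil entry instead of producing
// an error, so nothing would show up in the propeller logs.
func buildResponse(resp *http.Response) (map[string]interface{}, error) {
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}

	var data map[string]interface{}
	if err := json.Unmarshal(body, &data); err != nil {
		return nil, err
	}
	return data, nil
}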
g
hi all
we tried to deploy the image but we got this
Copy code
flytepropeller-8574c869bb-d8bzv      0/1     Error              2 (23s ago)   33s
flytepropeller-8574c869bb-srwqd      0/1     CrashLoopBackOff   1 (16s ago)   24s
Copy code
k logs flytepropeller-8574c869bb-srwqd -n flyte
exec /bin/flytepropeller: exec format error
the reason is this
Copy code
ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), statically linked, Go BuildID=yrkx_WEsTHfXYES1W5qM/G0DQz0Khz26svL-2chDO/BgntAhD_JOMfoTcNhtoP/JaCUuM92wwK3w0KL3Pzo, with debug_info, not stripped
we are running flyte on EKS and this image is ARM64 (aarch64)
can we get an x86_64 image or perhaps a multi-arch one to test it?
k
Oh sorry, let me build a new one
g
thank you
k
pingsutw/flytepropeller:e37bdfeffac589273f58e024ed9ef8bd5c48dbc6
g
checking
r
@Kevin Su we deployed flytepropeller from the image, and we got this in the logs again:
Copy code
time="2023-07-12T11:24:12Z" level=info msg=------------------------------------------------------------------------
time="2023-07-12T11:24:12Z" level=info msg="App [flytepropeller], Version [unknown], BuildSHA [unknown], BuildTS [2023-07-12 11:24:12.690099108 +0000 UTC m=+0.023105387]"
time="2023-07-12T11:24:12Z" level=info msg=------------------------------------------------------------------------
time="2023-07-12T11:24:12Z" level=info msg="Detected: 8 CPU's\n"
{"json":{},"level":"warning","msg":"defaulting max ttl for workflows to 23 hours, since configured duration is larger than 23 [23]","ts":"2023-07-12T11:24:12Z"}
{"json":{},"level":"warning","msg":"stow configuration section missing, defaulting to legacy s3/minio connection config","ts":"2023-07-12T11:24:12Z"}
I0712 11:24:13.017134       1 leaderelection.go:248] attempting to acquire leader lease flyte/propeller-leader...
I0712 11:24:29.591775       1 leaderelection.go:258] successfully acquired lease flyte/propeller-leader
{"json":{"routine":"databricks-worker-1"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-1"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-2"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-2"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-4"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-4"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-0"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-0"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-6"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-6"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-8"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-8"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-5"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-5"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-9"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-9"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-3"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-3"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-7"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
{"json":{"routine":"databricks-worker-7"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-12T11:24:59Z"}
so, this error occurs at container start
hmm, what is this error about? "Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper"
after this error, I can't see any flytepropeller logs
hmm, we have a bunch of workflows / tasks deployed in our Flyte workspace
I suspect flytepropeller is trying to refresh the status of these tasks
and that's why we see these errors right after the container starts
does it make sense @Kevin Su?
as far as I can see, the error is coming from here
and this is the interface:
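(Aside: a minimal, self-contained Go reproduction of this exact class of panic — a value stored in an interface but asserted as a pointer; the type name here is a placeholder, not the plugin's real definition:)
Copy code
package main

import "fmt"

// Placeholder standing in for the plugin's resource metadata type.
type ResourceMetaWrapper struct {
	RunID string
}

func main() {
	// The metadata is stored in the interface as a value...
	var meta interface{} = ResourceMetaWrapper{RunID: "574539"}

	// ...but asserted as a pointer. The single-return form of the assertion
	// panics with: interface conversion: interface {} is main.ResourceMetaWrapper,
	// not *main.ResourceMetaWrapper
	wrapper := meta.(*ResourceMetaWrapper)
	fmt.Println(wrapper.RunID)
}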
k
@Robert Ambrus are you able to jump on a call, and share screen?
just added one more change, could you test it again? pingsutw/flytepropeller:55fa6a247e659f629d535c022c84d8985054368f
r
sorry, I'm in another call
ok, let us check
g
hi @Kevin Su
we deployed this version
we still see these errors in the flytepropeller logs
Copy code
{"json":{"routine":"databricks-worker-0"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-13T09:51:46Z"}
{"json":{"routine":"databricks-worker-9"},"level":"error","msg":"worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-13T09:51:46Z"}
{"json":{"routine":"databricks-worker-9"},"level":"error","msg":"Failed to sync. Error: worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper","ts":"2023-07-13T09:51:46Z"}
but as @Robert Ambrus said, we are able to trigger a job on Databricks from flyte; it’s just that this error pops up, and it looks like the communication with the flyte plugin for Databricks is not fully working
r
@Kevin Su any update on this? we tested the patched images - unfortunately we're facing the same issue (see above). any idea why the "worker panic'd and is shutting down. Error: interface conversion: interface {} is databricks.ResourceMetaWrapper, not *databricks.ResourceMetaWrapper" error is happening?
k
we just found the main issue, and fixed it. https://github.com/flyteorg/flyteplugins/pull/407
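(For reference, the usual way to make that conversion safe is either to store and assert the same form, or to use a comma-ok assertion / type switch that accepts both shapes — a general sketch of the pattern, not necessarily the exact change in the PR:)
Copy code
package main

import "fmt"

// Placeholder type, as in the sketch above.
type ResourceMetaWrapper struct {
	RunID string
}

// asWrapper handles both the value and the pointer shape instead of panicking
// when the stored form doesn't match the asserted one.
func asWrapper(meta interface{}) (*ResourceMetaWrapper, bool) {
	switch m := meta.(type) {
	case *ResourceMetaWrapper:
		return m, true
	case ResourceMetaWrapper:
		return &m, true
	default:
		return nil, false
	}
}

func main() {
	var stored interface{} = ResourceMetaWrapper{RunID: "574539"}
	if w, ok := asWrapper(stored); ok {
		fmt.Println("run id:", w.RunID)
	}
}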