# ask-the-community
e
I'm working through setting up the Databricks plugin on the demo cluster here: https://github.com/flyteorg/flyte/blob/master/CHANGELOG/CHANGELOG-v1.3.0-b5.md. Could use some debugging help. See 🧵
I am using an existing cluster where I manually installed the packages required for my tasks and for the entrypoint.py.
```python
@task(
    task_config=Databricks(
        databricks_conf={
            "run_name": "test databricks",
            "existing_cluster_id": "1220-215617-43ri4502",
            "timeout_seconds": 3600,
            "max_retries": 1,
        }
    ),
```
The entry point is having an issue, though, and I was hoping to get some help. The task is just the one from the examples.
Also my AWS account and DB account are personal, so it is fine to share.
I’m using an existing cluster also cc @Tanmay Mathur for visibility
k
@Evan Sadler sorry, could you change line 28 in the entrypoint to
```python
_execute_task_cmd.callback(test=False, **args)
```
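For context on why this fix works: `_execute_task_cmd` in flytekit's entrypoint is a Click command, and a Click command's `.callback` attribute is the original undecorated function, so calling it directly bypasses CLI argument parsing. A generic sketch with a hypothetical command (not the actual flytekit code), assuming `click` is installed:

```python
import click

# Hypothetical command standing in for flytekit's _execute_task_cmd.
@click.command()
@click.option("--test", is_flag=True)
@click.option("--name")
def greet(test, name):
    return f"test={test}, name={name}"

# `greet` is now a click.Command; `greet.callback` is the original function,
# so another script can invoke it directly with keyword arguments instead of
# going through Click's CLI parsing:
args = {"name": "flyte"}
result = greet.callback(test=False, **args)
print(result)  # test=False, name=flyte
```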
e
All good! Checking now
@Kevin Su I got one step further 🙂. I am trying to avoid a custom docker image and rely on fast register, but it isn't picking up the top-level module.
```
ModuleNotFoundError: No module named 'flyte_cookiecutter'
```
This is my folder structure, and I have __init__.py files…
```
flyte_cookiecutter
    __init__.py
    workflows
        __init__.py
        databricks.py
```
This is probably just a general fast register help question
I think I just need to figure out the correct dest directory
k
did you use "." for the dest directory?
```
pyflyte register --destination-dir .
```
e
Oh perfect! I see what that does now 😆
@Kevin Su success!
k
are you running the task on Databricks?
e
I am!
k
Nice, awesome!!!
e
THANK YOU @Kevin Su and @Yee.
I am going to take off for the day, but I will share notes on what I did tomorrow
t
Thanks for the support!
Happy Holidays to you guys!
k
Merry Christmas!
k
hey folks, please help spread the word
@Kevin Su you are a rockstar!
f
@Kevin Su, @Tanmay Mathur, and @Evan Sadler, Happy New Year! Thanks for sharing your knowledge of the Databricks task. Suppose I want to run a Databricks Spark job as a Flyte task that reads multiple parquet files stored in a DBFS folder and returns the data as a pandas DataFrame, and the next Flyte task takes that pandas DataFrame as input but is executed in a Flyte-managed EKS pod/node outside of Databricks. Will that task get the data automatically (transferred from the Databricks system to the Flyte system)? Or is this scenario not supported by the new Databricks plugin?
k
yes, the downstream task will get the data automatically. By default, Flyte writes the dataframe to an S3 bucket. Your EKS pod should have access to that bucket as well, so flytekit will download the data before running the job.
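The handoff Kevin describes can be sketched in plain Python, with no flytekit and a local temp directory standing in for the S3 bucket (all names here are illustrative, and CSV stands in for parquet to keep it stdlib-only):

```python
import csv
import tempfile
from pathlib import Path

# A local temp directory stands in for the S3 bucket Flyte uses
# to pass intermediate outputs between tasks.
bucket = Path(tempfile.mkdtemp())

def upstream_task() -> str:
    """Writes its tabular output to the shared 'bucket'. Flyte hands the
    downstream task a reference to the stored data, not the bytes."""
    out = bucket / "output.csv"
    with out.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value"])
        writer.writerows([[1, 10], [2, 20]])
    return str(out)

def downstream_task(ref: str) -> list[dict]:
    """Downloads (here: just reads) the upstream output before running
    its own logic, mirroring what flytekit does in the EKS pod."""
    with open(ref, newline="") as f:
        return list(csv.DictReader(f))

rows = downstream_task(upstream_task())
print(rows)  # [{'id': '1', 'value': '10'}, {'id': '2', 'value': '20'}]
```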
f
@Kevin Su, RE: "so flytekit will download the data before running the job". What is the mechanism for the download? Is it done in a single process/thread, or in parallel? And how efficient is it compared to a Spark read from S3?
k
Flyte writes intermediate data (upstream outputs) to the S3 bucket, and the downstream task downloads it in a single process. We use the AWS CLI to download the parquet file and pandas to read it. However, you can install flytekitplugins-fsspec to get better performance: fsspec reads the data directly from S3 instead of downloading it to local disk first. We plan to replace the default persistence plugin (AWS CLI) with fsspec in the next release. IIUC, there is little performance difference between fsspec and Spark, since they both use Arrow under the hood. Downloading data with the AWS CLI is really slow, though, which is why we want to replace it.
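A minimal sketch of the fsspec API the plugin builds on, assuming `fsspec` is installed. fsspec exposes one `open()` interface across backends (local disk, `s3://` via s3fs, etc.), which is what lets data be streamed straight from object storage instead of being copied to local disk first; the built-in `memory://` backend is used here so the sketch runs without S3 credentials:

```python
import fsspec  # third-party: pip install fsspec

# The same fsspec.open() call works for "s3://bucket/key" (with s3fs
# installed and credentials configured); "memory://" is an in-process
# stand-in used here purely for illustration.
with fsspec.open("memory://demo/data.txt", "w") as f:
    f.write("hello from fsspec")

with fsspec.open("memory://demo/data.txt", "r") as f:
    print(f.read())  # hello from fsspec
```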
y
how much data are we talking about, btw, Frank?
just wondering
e
@Kevin Su For the spark plugin, it seems like it saves and loads the datasets directly from s3: https://github.com/flyteorg/flytekit/blob/master/plugins/flytekit-spark/flytekitplugins/spark/sd_transformers.py#L41. This should work great if that is the case.
Hopefully the output of the spark tasks is small enough to work with python tasks
@Yee I might need 20 GB of data
f
@Yee, I have a job that uses up to 25 GB of input data.
@Kevin Su, I see what you are saying. I will combine the Spark read and the ML operation on the data in one Flyte task.
k
> For the spark plugin, it seems like it saves and loads the datasets directly from s3
That transformer only works for the Spark dataframe. Yes, you can directly return a Spark dataframe from the task as well.