I working through setting up the databricks plugin...
# ask-the-community
I working through setting up the databricks plugin on the demo cluster here: https://github.com/flyteorg/flyte/blob/master/CHANGELOG/CHANGELOG-v1.3.0-b5.md. Could use some debugging help. See 🧵
I am using an existing cluster where I manually installed the packages required for my tasks and for the entrypoint.py.
Copy code
           "run_name": "test databricks",
           "existing_cluster_id": "1220-215617-43ri4502",
           "timeout_seconds": 3600,
           "max_retries": 1,
The entry point is having an issue though and I was hoping to get some help. The task is just the one from the examples.
Also my AWS account and DB account are personal, so it is fine to share.
I’m using an existing cluster also cc @Tanmay Mathur for visibility
@Evan Sadler sorry, could you change line 28 in entrypoint to
_execute_task_cmd.callback(test=False, **args)
All good! Checking now
@Kevin Su I got one step futher 🙂. I am trying to not use a custom docker image and rely on fast register, but it isn’t picking up on the top level module.
Copy code
ModuleNotFoundError: No module named 'flyte_cookiecutter'
This is my folder structure and I have init…
Copy code
This is probably just a general fast register help question
I think I just need to figure out the correct dest directory
did you use
for dest directory?
pyflyte register --destination-dir .
Oh perfect! I see what that does now 😆
@Kevin Su success!
are you running the task on Databricks?
I am!
Nice, awesome!!!
THANK YOU @Kevin Su and @Yee.
I am going to take off for the day, but I will share notes on what I did tomorrow
Thanks for the support!
Happy Holidays to you guys!
Merry Christmas!
hey folks, please help spread the workd
@Kevin Su you are a rockstar!
@Kevin Su, @Tanmay Mathur, &@Evan Sadler, Happy New Year! Thanks for sharing the knowledge of Databricks task. If I want to run a databricks spark job as flyte task to read multiple parquet files stored in a dbfs folder, and return the data as a pandas DF. And the next flyte task is going to use the pandas DF as input. This flyte task should be executed in the flyte managed EKS pod/node outside of Databricks. Will this task get the data automatically (transferred from databricks system to flyte system? Or this scenario is not supported by the new databricks plugin?
yes, the downstream task will get the data automatically. By default, flyte will write dataframe to s3 bucket. Your EKS pod should have access to that bucket as well, so flytekit will download the data before running the job.
@Kevin Su, RE: so flytekit will download the data before running the job. What is the mechanism for the downlaod? Is it done in a single process/thread, or is it parallel? And how efficient is it compared to spark read from S3?
flyte write intermediate data (upstream output) to the s3 bucket, and download stream task will download it in a single process. we use awscli to download the parquet file, and use pandas to read it. However, you can install flytekitplugin-fsspec to get better performance. fsspec will directly read the data from s3 instead of downloading to local disk first. We plan to replace default persistence plugin (awscli) with fsspec in the next release. IIUC, there is little difference in performance between fsspec and Spark, since they both use Arrow under the hood. However, downloading data by using awscli is really slow, that’s why we want to replace it.
how much data are we talking about btw frank?
just wondering
@Kevin Su For the spark plugin, it seems like it save sand loads the datasets directly from s3: https://github.com/flyteorg/flytekit/blob/master/plugins/flytekit-spark/flytekitplugins/spark/sd_transformers.py#L41. This should work great if that is the case.
Hopefully the output of spark tasks is small enough to work with python tasks
@Yee I might need 20GBs of data
@Yee, I have a job that uses upto 25 GB of input data.
@Kevin Su, I see what you are saying. I will combine the spark read and ML operation on the data in one flyte task.
For the spark plugin, it seems like it save sand loads the datasets directly from s3:
That transformer only works for spark.dataframe. yes, you can directly return spark dataframe in the task as well.