# ask-the-community

Evan Sadler

12/22/2022, 3:42 PM
I’m working through setting up the databricks plugin on the demo cluster here: https://github.com/flyteorg/flyte/blob/master/CHANGELOG/CHANGELOG-v1.3.0-b5.md. Could use some debugging help. See 🧵
I am using an existing cluster where I manually installed the packages required for my tasks and for the entrypoint.py.
from flytekit import task
from flytekitplugins.spark import Databricks  # ships with flytekitplugins-spark

@task(
    task_config=Databricks(
        databricks_conf={
            "run_name": "test databricks",
            "existing_cluster_id": "1220-215617-43ri4502",
            "timeout_seconds": 3600,
            "max_retries": 1,
        }
    ),
)
def hello_spark() -> None:
    ...  # body elided in the original message; it is the task from the examples
The entrypoint is having an issue, though, and I was hoping to get some help. The task is just the one from the examples.
Also my AWS account and DB account are personal, so it is fine to share.
I’m using an existing cluster as well. cc @Tanmay Mathur for visibility

Kevin Su

12/22/2022, 6:45 PM
@Evan Sadler sorry, could you change line 28 in the entrypoint to:
_execute_task_cmd.callback(test=False, **args)
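For context, a minimal sketch of where that call sits, assuming the entrypoint wraps flytekit’s pyflyte-execute click command (the argument handling below is a hypothetical placeholder, not the actual file):

import sys

from flytekit.bin.entrypoint import execute_task_cmd as _execute_task_cmd

# hypothetical arg handling: pair "--flag value" tokens from the
# Databricks job parameters into keyword arguments
argv = sys.argv[1:]
args = {k.lstrip("-").replace("-", "_"): v for k, v in zip(argv[::2], argv[1::2])}

# the fix from the thread: call the click command's wrapped function
# directly instead of going through click's CLI parsing
_execute_task_cmd.callback(test=False, **args)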

Evan Sadler

12/22/2022, 6:59 PM
All good! Checking now
@Kevin Su I got one step further 🙂. I am trying to avoid a custom docker image and rely on fast register instead, but it isn’t picking up the top-level module.
ModuleNotFoundError: No module named 'flyte_cookiecutter'
This is my folder structure, and I have __init__.py files…
flyte_cookiecutter
   __init__.py
   workflows
       __init__.py
       databricks.py
This is probably just a general fast register help question
I think I just need to figure out the correct dest directory

Kevin Su

12/22/2022, 8:39 PM
did you use "." for the dest directory?
pyflyte register --destination-dir .
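For anyone hitting the same ModuleNotFoundError: fast registration uploads the local source and extracts it into --destination-dir inside the task container, so the code needs to land where the top-level module (here flyte_cookiecutter) is importable, which is typically the image’s working directory. A minimal invocation for the layout above, run from the project root (the package argument is illustrative):

pyflyte register --destination-dir . flyte_cookiecutter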

Evan Sadler

12/22/2022, 9:00 PM
Oh perfect! I see what that does now 😆
@Kevin Su success!

Kevin Su

12/22/2022, 9:07 PM
are you running the task on Databricks?

Evan Sadler

12/22/2022, 9:10 PM
I am!

Kevin Su

12/22/2022, 9:18 PM
Nice, awesome!!!

Evan Sadler

12/22/2022, 9:28 PM
THANK YOU @Kevin Su and @Yee.
I am going to take off for the day, but I will share notes on what I did tomorrow

Tanmay Mathur

12/22/2022, 9:33 PM
Thanks for the support!
Happy Holidays to you guys!

Kevin Su

12/22/2022, 9:40 PM
Merry Christmas!

Ketan (kumare3)

12/22/2022, 11:12 PM
hey folks, please help spread the word
@Kevin Su you are a rockstar!

Frank Shen

01/05/2023, 7:33 PM
@Kevin Su, @Tanmay Mathur, & @Evan Sadler, Happy New Year! Thanks for sharing the knowledge of the Databricks task. Say I want to run a databricks spark job as a flyte task that reads multiple parquet files stored in a dbfs folder and returns the data as a pandas DF, and the next flyte task is going to use that pandas DF as input. That next task would be executed in a flyte-managed EKS pod/node outside of Databricks. Will it get the data automatically (transferred from the databricks system to the flyte system)? Or is this scenario not supported by the new databricks plugin?

Kevin Su

01/05/2023, 7:39 PM
yes, the downstream task will get the data automatically. By default, flyte writes the dataframe to an s3 bucket. Your EKS pod should have access to that bucket as well, so flytekit will download the data before running the job.
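A minimal sketch of the pattern being described (task names and data are hypothetical; in the real setup the producer would carry the Databricks task_config shown earlier in the thread):

import pandas as pd

from flytekit import task, workflow

@task  # in practice: @task(task_config=Databricks(...))
def read_parquet_data() -> pd.DataFrame:
    # The returned DataFrame is written to the Flyte blob store
    # (S3 by default) as parquet when the task finishes.
    return pd.DataFrame({"user_id": [1, 2, 3], "score": [0.2, 0.5, 0.9]})

@task
def train_model(df: pd.DataFrame) -> int:
    # Runs on the Flyte-managed EKS pod; flytekit fetches the upstream
    # output from S3 before this body executes.
    return len(df)

@workflow
def pipeline() -> int:
    return train_model(df=read_parquet_data())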

Frank Shen

01/05/2023, 7:42 PM
@Kevin Su, RE: "so flytekit will download the data before running the job": what is the mechanism for the download? Is it done in a single process/thread, or is it parallel? And how efficient is it compared to a spark read from S3?

Kevin Su

01/05/2023, 8:00 PM
flyte writes intermediate data (upstream outputs) to the s3 bucket, and the downstream task will download it in a single process. we use the awscli to download the parquet file, and use pandas to read it. However, you can install flytekitplugins-fsspec to get better performance. fsspec will read the data directly from s3 instead of downloading it to local disk first. We plan to replace the default persistence plugin (awscli) with fsspec in the next release. IIUC, there is little difference in performance between fsspec and a spark read, since they both use Arrow under the hood. However, downloading data with the awscli is really slow, which is why we want to replace it.
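For reference, the plugin mentioned here ships as a separate pip package; installing it in the task image alongside flytekit should be enough for it to be picked up:

pip install flytekitplugins-fsspec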

Yee

01/05/2023, 8:01 PM
how much data are we talking about btw, Frank?
just wondering

Evan Sadler

01/05/2023, 8:14 PM
@Kevin Su For the spark plugin, it seems like it saves and loads the datasets directly from S3: https://github.com/flyteorg/flytekit/blob/master/plugins/flytekit-spark/flytekitplugins/spark/sd_transformers.py#L41. This should work great if that is the case.
Hopefully the output of spark tasks is small enough to work with python tasks
@Yee I might need 20GBs of data

Frank Shen

01/05/2023, 8:26 PM
@Yee, I have a job that uses up to 25 GB of input data.
@Kevin Su, I see what you are saying. I will combine the spark read and ML operation on the data in one flyte task.

Kevin Su

01/05/2023, 8:56 PM
For the spark plugin, it seems like it saves and loads the datasets directly from s3:
That transformer only works for spark dataframes. But yes, you can directly return a spark dataframe from the task as well.
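A minimal sketch of that last point, returning a spark dataframe from one task and consuming it as pandas in the next (function names and data are hypothetical; the conversion works because both types round-trip through parquet in the blob store):

import pandas as pd

from flytekit import current_context, task, workflow
from pyspark.sql import DataFrame

@task  # in practice: @task(task_config=Databricks(...)) or Spark(...)
def make_df() -> DataFrame:
    # The Spark plugin injects a session into the task context.
    spark = current_context().spark_session
    return spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

@task
def summarize(df: pd.DataFrame) -> int:
    # The stored parquet is read back here as a pandas DataFrame.
    return len(df)

@workflow
def wf() -> int:
    return summarize(df=make_df())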