# flyte-support
b
I'm currently running into a lot of issues with Spark on Flyte. Which integration works best for big data processing on Flyte? Ray, Dask ...
a
Hi Julian! What issues are you seeing? Personally, I find the Dask integration to work well (also in production settings) - disclaimer that I authored it a while back.
b
I find dependency resolution very hard. Basically, outputting a Spark dataframe does not work for me currently. I'll try Dask.
a
Feel free to reach out in case you have any questions!
f
Hmm @brainy-carpenter-31280 that’s odd
Can you share an example
We have large-scale deployments of Spark as well, so would love to learn about the problems
b
https://github.com/julianStreibel/flyte-spark This is a minimal example. I added jars fixing this particular issue (see readme). But it's cascading.
What is your Spark setup? Are you using custom images with pre-loaded dependencies? How are you caching dependencies in the cluster? Do you have a shared PVC?
f
Maybe this repo is private
Yes, using custom images - why do you need a PVC - why not build a new image?
b
yes sorry, it's public now
f
Is this not working ?
b
yes, it raises the exception in the readme
f
One problem I see is your cpu/mem is too low - spark is hungry
b
I know but that is not the problem here
f
I am AFK, will try later today and share the output
b
Thanks 🙂
f
@brainy-carpenter-31280 it works
I did change the image a bit
custom_image = ImageSpec(
-    registry="211125663991.dkr.ecr.us-west-2.amazonaws.com",
-    name="flytekit",
+    # registry="211125663991.dkr.ecr.us-west-2.amazonaws.com",
+    registry="ghcr.io/unionai-oss",
+    name="spark",
     packages=[
         # Needed for IPv6 support
         "flytekit",
         "flytekitplugins-spark",
-        "pyspark==3.4.0",
+        # "pyspark==3.4.0",
     ],
-    pip_extra_args="--pre",
-    platform="linux/arm64",
+    # pip_extra_args="--pre",
+    # platform="linux/arm64",
 )
Not much, I removed pyspark 3.4.0
flytekitplugins-spark brings pyspark in as a dependency
But I can try with 3.4.0 too
Trying with 3.4.0
It must be ARM
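For reference, a sketch of how the ImageSpec above could be wired into a Spark task with flytekitplugins-spark; the task name and the `spark_conf` resource numbers are illustrative, not from the thread.

```python
# Hypothetical Spark task using the custom_image defined above; resource
# values and the task name are illustrative assumptions.
import flytekit
from flytekit import task
from flytekitplugins.spark import Spark

@task(
    task_config=Spark(
        spark_conf={
            "spark.driver.memory": "2g",
            "spark.executor.memory": "4g",  # Spark is memory-hungry
            "spark.executor.instances": "2",
        }
    ),
    container_image=custom_image,  # the ImageSpec from the snippet above
)
def count_rows() -> int:
    # The Spark plugin exposes the session on the execution context.
    sess = flytekit.current_context().spark_session
    return sess.range(1000).count()
```

This is a config sketch under the stated assumptions, not a verified reproduction of the thread's setup.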
b
Thanks for testing! I currently only have ARM nodes in the cluster :/
I have the same problem with reading StructuredDatasets. Can I somehow get the S3 URI and just read the parquets from there? I tried returning a pandas DataFrame in a StructuredDataset and reading it in the Spark task as a StructuredDataset from the URI, but it is None.