# flyte-support
b
I'm currently running into a lot of issues with Spark on Flyte. Which integration works best for big data processing on Flyte? Ray, Dask ...
a
Hi Julian! What issues are you seeing? Personally, I find the Dask integration to work well (also in production settings) - disclaimer that I authored it a while back.
b
I find dependency resolution very hard. Basically, outputting a Spark dataframe does not work for me currently. I'll try Dask.
a
Feel free to reach out in case you have any questions!
f
Hmm @brainy-carpenter-31280 that’s odd
Can you share an example
We have large-scale deployments of Spark as well, so would love to learn about the problems
b
https://github.com/julianStreibel/flyte-spark This is a minimal example. I added jars fixing this particular issue (see readme). But it's cascading.
What is your Spark setup? Are you using custom images with pre-loaded dependencies? How are you caching dependencies in the cluster? Do you have a shared PVC?
f
Maybe this repo is private
Yes, using custom images - why do you need a PVC - why not build a new image?
b
yes sorry, it's public now
f
Is this not working ?
b
yes, it raises the exception in the readme
f
One problem I see is your cpu/mem is too low - spark is hungry
b
I know but that is not the problem here
f
I am AFK, will try later today and share the output
b
Thanks 🙂
f
@brainy-carpenter-31280 it works
I did change the image a bit
custom_image = ImageSpec(
-    registry="211125663991.dkr.ecr.us-west-2.amazonaws.com",
-    name="flytekit",
+    # registry="211125663991.dkr.ecr.us-west-2.amazonaws.com",
+    registry="ghcr.io/unionai-oss",
+    name="spark",
     packages=[
         # Needed for IPv6 support
         "flytekit",
         "flytekitplugins-spark",
-        "pyspark==3.4.0",
+        # "pyspark==3.4.0",
     ],
-    pip_extra_args="--pre",
-    platform="linux/arm64",
+    # pip_extra_args="--pre",
+    # platform="linux/arm64",
 )
Not much, I removed pyspark 3.4.0
flytekitplugins-spark brings pyspark in as a dependency
But I can try with 3.4.0 too
Trying with 3.4.0
It must be ARM
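For reference, a sketch of how the ImageSpec above could be wired into a Spark task with flytekitplugins-spark; the task name and the `spark_conf` resource numbers are illustrative, not from the thread.

```python
# Hypothetical Spark task using the custom_image defined above; resource
# values and the task name are illustrative assumptions.
import flytekit
from flytekit import task
from flytekitplugins.spark import Spark

@task(
    task_config=Spark(
        spark_conf={
            "spark.driver.memory": "2g",
            "spark.executor.memory": "4g",  # Spark is memory-hungry
            "spark.executor.instances": "2",
        }
    ),
    container_image=custom_image,  # the ImageSpec from the snippet above
)
def count_rows() -> int:
    # The Spark plugin exposes the session on the execution context.
    sess = flytekit.current_context().spark_session
    return sess.range(1000).count()
```

This is a config sketch under the stated assumptions, not a verified reproduction of the thread's setup.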
b
Thanks for testing! I currently only have ARM nodes in the cluster :/
I have the same problem with reading StructuredDatasets. Can I somehow get the S3 URI and just read the parquets from there? I tried returning a pandas DataFrame in a StructuredDataset and reading it in the Spark task as a StructuredDataset from the URI, but it is None.