# flyte-deployment
We are trying to execute Spark jobs written in Scala. One strategy we considered is to run ContainerTasks and spark-submit pointing to a jar file. This hasn't worked out yet. I see that support for Scala is coming soon (https://docs.flyte.org/projects/cookbook/en/stable/auto/integrations/kubernetes/k8s_spark/pyspark_pi.html). We were wondering if anyone uses Flyte with Spark jobs written in Scala and what their setup looks like? Maybe we could use the Java Flytekit and annotate the Spark jobs directly? (We just set up a deployment on GKE using the Helm chart and are able to successfully run various flytesnacks examples - thanks for all the help with setup on this channel!)
I'm interested in this too, but don't have any concrete answers for you, just some ideas. Could you include jars built from your Scala code in the container image that will be used for the Spark task, add those jars to the classpath (using `spark.driver.extraClassPath`/`spark.executor.extraClassPath` in the task's Spark conf), and then call into your Scala Spark driver code from the PySpark Python code? Something like
```python
import flytekit
from flytekit import task
from flytekitplugins.spark import Spark

@task(task_config=Spark(
    # this configuration is applied to the spark cluster
    spark_conf={
        "spark.driver.extraClassPath": ...,
        "spark.executor.extraClassPath": ...,
    }
))
def spark_task() -> float:
    sess = flytekit.current_context().spark_session
    return sess.sparkContext._jvm.com.my.scala.package.ScalaDriver.go()
```
I think `sess.sparkContext._jsc` returns the Java `JavaSparkContext` object, which you could pass to the Scala side too. There are probably also thread-local or static references within Spark (e.g. `SparkContext.getOrCreate`) that you could use to get the Spark context on the Scala side.
and to be clear, `ScalaDriver` is a class name in your Scala code and `go()` is a static method. I'm using the Java names for things here because I'm not super familiar with Scala.
We actually had this at Lyft
Using a simple Python wrapper, with the jars baked into the Docker image
Or, are folks open to contributing to the Java SDK?
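As a rough illustration of the "simple Python wrapper" idea above: the jar lives in the task's Docker image and the wrapper just shells out to spark-submit. This is only a sketch - the jar path, class name, and `--input`/`--output` flags are all hypothetical.

```python
import subprocess

# Hypothetical locations: the jar is baked into the task's docker image.
JAR = "/opt/app/my-spark-job.jar"
MAIN_CLASS = "com.my.scala.package.ScalaDriver"

def spark_submit_args(input_uri: str, output_uri: str) -> list[str]:
    # Build the spark-submit invocation; application args follow the jar.
    return [
        "spark-submit",
        "--class", MAIN_CLASS,
        JAR,
        "--input", input_uri,
        "--output", output_uri,
    ]

def run_job(input_uri: str, output_uri: str) -> None:
    # check=True surfaces a non-zero spark-submit exit code as a task failure
    subprocess.run(spark_submit_args(input_uri, output_uri), check=True)
```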
Thanks for the input - good to hear we are pushing the boulder in the right direction. Will look into your suggestions today.
We have a hard requirement that the Spark job must be run as a shell-type task triggering "spark-submit". It cannot be triggered "in-code". So we're trying to make this work without the Spark wrapper.
are you also trying to get Flyte task inputs into this spark job and return values from the spark job as task outputs?
Yeah, we pass input and output as arguments (--input/--output) in the spark-submit command, as string URLs to gs://… files. Still figuring out how task inputs and outputs will map to that pattern.
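One way those gs:// string URLs could map onto Flyte task inputs is a raw-container (shell-type) task whose command templates them in - Flyte substitutes `{{.inputs.*}}` with the task's declared string inputs before the container starts. A sketch only; the input names, jar path, and main class are assumptions:

```python
# Hypothetical raw-container command: input_uri/output_uri would be declared
# as str inputs on the task, and arrive at spark-submit as plain arguments.
SPARK_SUBMIT_COMMAND = [
    "spark-submit",
    "--class", "com.my.scala.package.ScalaDriver",  # assumed main class
    "local:///opt/app/my-spark-job.jar",            # jar baked into the image
    "--input", "{{.inputs.input_uri}}",
    "--output", "{{.inputs.output_uri}}",
]
```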