We are trying to execute SparkJobs written in scala One stra Flyte #flyte-deployment

wooden-sandwich-59360

11/30/2022, 12:12 PM

We are trying to execute SparkJobs written in scala. One strategy we considered is to run ContainerTasks and spark-submit pointing to a jar file. This hasn’t worked out yet. I see that support for scala is coming soon (https://docs.flyte.org/projects/cookbook/en/stable/auto/integrations/kubernetes/k8s_spark/pyspark_pi.html). We were wondering if anyone uses flyte with spark written in scala and what their setup looks like? Maybe we could use the Java Flytekit and annotate the spark jobs directly? (We just set up deployment on GKE using helm chart and are able to successfully run various flytesnacks examples - thanks for all the help with setup on this channel!)

nutritious-london-39005

11/30/2022, 2:16 PM

I'm interested in this too, but don't have any concrete answers for you, just some ideas. Could you include jars built from your scala code in the container image that will be used for the spark task, add those jars on the classpath (using `spark.driver.extraClassPath` and `spark.executor.extraClassPath` in the task's spark conf), and then call into your scala spark driver code from the pyspark python code? Something like:

```python
import flytekit
from flytekit import task
from flytekitplugins.spark import Spark


@task(
    task_config=Spark(
        # this configuration is applied to the spark cluster
        spark_conf={
            "spark.driver.extraClassPath": ...,
            "spark.executor.extraClassPath": ...,
        }
    ),
)
def spark_task() -> float:
    sess = flytekit.current_context().spark_session
    # ScalaDriver.go() is reached through the py4j JVM gateway of the session
    return sess.sparkContext._jvm.com.my.scala.package.ScalaDriver.go()
```
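The two classpath values are left open in the snippet above. A minimal sketch of how they could be filled in, assuming the Scala jar is baked into the task image; the helper name `classpath_conf` and the path `/opt/jars/my-scala-job.jar` are hypothetical, not part of flytekit or Spark:

```python
from typing import Dict, List


def classpath_conf(jar_paths: List[str]) -> Dict[str, str]:
    """Build both extraClassPath entries from jar paths inside the image."""
    if not jar_paths:
        raise ValueError("at least one jar path is required")
    # Classpath entries are separated by ":" on Linux containers.
    joined = ":".join(jar_paths)
    return {
        "spark.driver.extraClassPath": joined,
        "spark.executor.extraClassPath": joined,
    }


# Hypothetical location of the Scala jar inside the task image.
SPARK_CONF = classpath_conf(["/opt/jars/my-scala-job.jar"])
```

The resulting dictionary could then be passed as `spark_conf` in the task config, merged with whatever other Spark settings the job needs.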

nutritious-london-39005

11/30/2022, 2:18 PM

I think `sess.sparkContext._jsc.sc()` returns the Java `SparkContext` object which you could pass to the scala side too. There's probably thread-local or static references within spark that you could use to get the spark context on the scala side too.

nutritious-london-39005

11/30/2022, 2:22 PM

and to be clear, the `com.my.scala.package.ScalaDriver` in `sess.sparkContext._jvm.com.my.scala.package.ScalaDriver.go()` is a class name in your scala code and `go` is a static method. I'm using the java names for things here because I'm not super familiar with scala.

freezing-airport-6809

11/30/2022, 3:32 PM

We actually had this at lyft

freezing-airport-6809

11/30/2022, 3:32 PM

Using a simple python wrapper and jar in docker image

freezing-airport-6809

11/30/2022, 3:42 PM

Or folks open to contribute to Java sdk

freezing-airport-6809

11/30/2022, 3:46 PM

So the backend already supports Scala and Java; just this needs to be set: https://github.com/flyteorg/flytekit/blob/46abe919156bd3b5756498e4924182a375be98f5/plugins/flytekit-spark/flytekitplugins/spark/task.py#L101

wooden-sandwich-59360

12/01/2022, 7:49 AM

Thanks for the input - good to hear we are pushing the boulder in the right direction. Will look into your suggestions today.

wooden-sandwich-59360

12/01/2022, 9:41 AM

We have a hard requirement that the SparkJob must be run as a shell-type task triggering “spark-submit”. It cannot be triggered “in-code”. So trying to make this work without the Spark wrapper.

nutritious-london-39005

12/01/2022, 7:40 PM

are you also trying to get Flyte task inputs into this spark job and return values from the spark job as task outputs?

wooden-sandwich-59360

12/02/2022, 10:08 AM

Yeah, we pass input and output as arguments (--input/--output) in the spark-submit command as string urls to gs://…. files. Still figuring out how task inputs and outputs will map to that pattern.
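The `--input`/`--output` pattern described here can be sketched as a small command builder. This is a hedged illustration, not a Flyte API: the function name `spark_submit_argv`, the jar path, the main class and the `gs://` URLs are all hypothetical. Only `spark-submit`, `--class` and `--conf` are standard Spark CLI pieces; `--input` and `--output` are the application's own arguments and therefore come after the jar.

```python
import shlex
from typing import Dict, List, Optional


def spark_submit_argv(
    jar: str,
    main_class: str,
    input_url: str,
    output_url: str,
    conf: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble a spark-submit command line as an argv list."""
    argv = ["spark-submit", "--class", main_class]
    # Spark options must come before the application jar.
    for key, value in sorted((conf or {}).items()):
        argv += ["--conf", f"{key}={value}"]
    # Everything after the jar is passed to the Scala main() as-is.
    argv += [jar, "--input", input_url, "--output", output_url]
    return argv


# Hypothetical values; in a task these would come from the task inputs.
ARGV = spark_submit_argv(
    jar="local:///opt/jars/my-scala-job.jar",
    main_class="com.example.ScalaDriver",
    input_url="gs://my-bucket/in",
    output_url="gs://my-bucket/out",
    conf={"spark.executor.instances": "2"},
)
print(shlex.join(ARGV))
```

Keeping the command as a list rather than a single string avoids quoting problems when a URL or config value contains spaces, and a list of this shape is the kind of value a shell-type or container task would take as its command.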
