• Tao He
    1 week ago
    Hi folks, I have read [the excellent tutorial about distributed training](https://docs.flyte.org/projects/cookbook/en/stable/auto/case_studies/ml_training/spark_horovod/keras_spark_rossmann_estimator.html#). I'm wondering whether the sentence "'SparkBackend' uses `horovod.spark.run` to execute the distributed training function" means that Flyte will launch `num_proc` Spark workers across `num_proc` Flyte workers, or whether it just launches them on one Spark worker?
  • Ketan (kumare3)
    1 week ago
    hey @Tao He, welcome to the community. There are no "Flyte workers" today. Flyte launches ephemeral Spark clusters using the Spark operator or Spark on K8s. In the Horovod case, it simply uses Horovod's Spark integration; just make sure the configuration is correct.
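    (For illustration, a minimal sketch of what such a task might look like, assuming flytekitplugins-spark and horovod are installed; the Spark conf values, `train_fn`, and `num_proc=2` are placeholder choices, not taken from the linked tutorial:)

    ```python
    import horovod.spark
    from flytekit import task
    from flytekitplugins.spark import Spark


    def train_fn() -> int:
        # Runs once inside each of the `num_proc` Horovod workers, which
        # horovod.spark.run schedules as tasks on the Spark executors.
        import horovod.tensorflow.keras as hvd

        hvd.init()
        # ... build and fit the model here ...
        return hvd.rank()


    @task(
        task_config=Spark(
            spark_conf={
                "spark.executor.instances": "2",  # executors in the ephemeral cluster
                "spark.executor.cores": "1",
            }
        )
    )
    def horovod_spark_task() -> int:
        # Flyte stands up an ephemeral Spark cluster just for this task;
        # horovod.spark.run then fans the training function out across it.
        results = horovod.spark.run(train_fn, num_proc=2)
        return results[0]
    ```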
  • Tao He
    1 week ago
    "Flyte launches ephemeral Spark clusters using the Spark operator or Spark on K8s." I see, thanks!
  • Hi @Ketan (kumare3), I have a follow-up question. Under such a setting, does the Spark worker still fetch the data (the Flyte literals) from the underlying remote object store (S3) via the Flyte API? If so, I'm wondering how Flyte's local cache works, since the Spark worker is not managed by Flyte and the dataframe passed to Spark shouldn't be present in the local cache. Or should I understand it this way: the local cache avoids fetching the data (the dataframe passed to Spark) from S3 more than once, but the first fetch is still needed?
  • Ketan (kumare3)
    1 week ago
    I do not follow: what is the local cache? Do you mean the Flyte cache for tasks? That is only used to memoize a previous execution and prevent a re-run. There is no optimization for how Spark handles the data; Spark will have to download it from the object store. Flyte only tries to avoid starting the Spark cluster itself if the task is cached.
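    (A minimal sketch of task-level caching, assuming flytekit and the Spark plugin; the task body, inputs, and `cache_version` value are illustrative:)

    ```python
    from flytekit import task
    from flytekitplugins.spark import Spark


    @task(
        cache=True,
        cache_version="1.0",  # bump this to invalidate previously cached outputs
        task_config=Spark(spark_conf={"spark.executor.instances": "2"}),
    )
    def spark_training_task(epochs: int) -> float:
        # On a cache hit (same inputs and cache_version), Flyte returns the
        # stored output and never starts the ephemeral Spark cluster; on a
        # miss, the cluster is launched and the task runs as usual.
        loss = 1.0 / epochs  # placeholder for the real training logic
        return loss
    ```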
  • Tao He
    1 week ago
    "Flyte only tries to avoid starting the Spark cluster itself if the task is cached." Now I understand that the Spark job (launching the Spark cluster using the spark-operator and submitting a job to it) is a single task in a Flyte workflow.
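    (To make that concrete: a sketch of a workflow in which the whole Spark job is one node, reusing the hypothetical `horovod_spark_task` from the sketch above:)

    ```python
    from flytekit import workflow


    @workflow
    def training_wf() -> int:
        # The entire Spark job (cluster launch, Horovod run, teardown)
        # executes as this single task node in the workflow DAG.
        return horovod_spark_task()
    ```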
  • I have another follow-up question. Will Flyte's scheduler consider co-locality between two tasks inside a workflow? After some experiments, I can see the answer should be "it won't", as Flyte Propeller launches separate pods for each task execution. Correct me if I am wrong.
  • Ketan (kumare3)
    1 week ago
    Ya, not today, but this feature is planned.
  • Tao He
    1 week ago
    Happy to know that. Can I find any related docs or an enhancement proposal draft on this topic somewhere? Thanks!
  • Ketan (kumare3)
    1 week ago
    Yes, I will share the enhancement proposal.
  • Tao He
    1 week ago
    Looking forward to it!
  • Ketan (kumare3)
    1 week ago
    @Tao He this is phase 2 of this RFC. cc @Kevin Su, can we please split this into 2 RFCs and add them to the official RFCs? https://docs.google.com/document/d/1-695lxz8a-GFz4cFamGkF1NhspKMkDGIDdkY9EDLMz8/edit
  • Tao He
    1 week ago
    Thanks for sharing ❤️
  • Ketan (kumare3)
    1 week ago
    Absolutely
  • Ya, Ray is almost integrated, and we will eventually work on cluster reuse.