# announcements
**Tao He:** Hi folks, I have read [the excellent tutorial about distributed training](https://docs.flyte.org/projects/cookbook/en/stable/auto/case_studies/ml_training/spark_horovod/keras_spark_rossmann_estimator.html#). I'm wondering whether the sentence "`SparkBackend` uses `horovod.spark.run` to execute the distributed training function" means that Flyte will launch `num_proc` Spark workers on `num_proc` Flyte workers. Or does it just launch them on one Spark worker?
**Ketan (kumare3):** Hey @Tao He, welcome to the community. There are no "Flyte workers" today. Flyte launches ephemeral Spark clusters using the Spark operator or Spark on K8s. In the Horovod case, it simply uses Horovod's Spark integration; just make sure the configuration is correct.
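As a rough illustration of the setup being discussed, a Flyte task can declare a Spark task config (via the `flytekitplugins-spark` plugin) so that Flyte spins up an ephemeral Spark cluster for that task, and the task body can then call `horovod.spark.run` on that cluster. This is a hedged sketch, not the tutorial's actual code: the `spark_conf` values, the task name, and the training function body are placeholders.

```python
# Sketch only: assumes flytekit, flytekitplugins-spark, and horovod[spark]
# are installed, and the K8s cluster has the Spark operator configured.
import horovod.spark
from flytekit import task
from flytekitplugins.spark import Spark


@task(
    task_config=Spark(
        # Placeholder executor settings; tune for your cluster.
        spark_conf={
            "spark.executor.instances": "2",
            "spark.executor.memory": "4g",
        }
    )
)
def train_model() -> None:
    def train_fn():
        # Distributed Keras/Horovod training body goes here.
        ...

    # horovod.spark.run launches num_proc Horovod processes on the
    # executors of the ephemeral Spark cluster Flyte created for this task.
    horovod.spark.run(train_fn, num_proc=2)
```

The point of the sketch is the division of labor: Flyte (via the Spark operator) owns the cluster lifecycle, while Horovod's Spark integration owns process placement across the executors.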
**Tao He:**
> Flyte launches ephemeral Spark clusters using the Spark operator or Spark on K8s.

I see. Thanks!
Hi @Ketan (kumare3), I have a follow-up question. Under such a setting, does the Spark worker still fetch data (the Flyte literals) from the underlying remote object store (S3) using the Flyte API? If so, I'm wondering how Flyte's local cache works, as the Spark worker is not managed by Flyte, and the dataframe passed to Spark shouldn't be present in the local cache. Or should I understand it as: the local cache avoids repeatedly fetching the data (the dataframe passed to Spark) from S3, but the first fetch is still needed?
**Ketan (kumare3):** I do not follow: what is the "local cache"? Do you mean the Flyte cache for tasks? That is only used to memoize a previous execution and prevent a re-run. There is no optimization for Spark's handling of the data; Spark will have to download it from the object store. Flyte tries to prevent starting the Spark cluster itself if the task is cached.
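Conceptually, the task cache described above memoizes on a key derived from the task's identity, a declared cache version, and its inputs; a cache hit skips re-executing the task entirely, which is why the Spark cluster is never started. A minimal stand-alone sketch of that idea (the key scheme and helper names here are illustrative, not Flyte's actual implementation):

```python
import hashlib
import json

# Illustrative in-memory cache; Flyte's real cache is a remote data catalog.
_cache: dict = {}


def cache_key(task_name: str, cache_version: str, inputs: dict) -> str:
    """Key on task identity, declared cache version, and hashed inputs."""
    payload = json.dumps([task_name, cache_version, inputs], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def run_cached(task_name: str, cache_version: str, inputs: dict, fn):
    """Return a memoized result, executing fn only on a cache miss."""
    key = cache_key(task_name, cache_version, inputs)
    if key not in _cache:
        _cache[key] = fn(**inputs)  # e.g. the expensive Spark job
    return _cache[key]


calls = []


def expensive(x: int) -> int:
    calls.append(x)
    return x * 2


run_cached("train", "1.0", {"x": 3}, expensive)  # miss: executes
run_cached("train", "1.0", {"x": 3}, expensive)  # hit: fn is skipped
assert len(calls) == 1
```

Bumping `cache_version` (or changing any input) produces a new key and forces re-execution, which mirrors how invalidation works for cached tasks.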
**Tao He:**
> Flyte tries to prevent starting the Spark cluster itself if the task is cached.

Now I understand: the Spark job (launching the Spark cluster using the Spark operator and submitting a job to it) is a single task in a Flyte workflow.

I have a follow-up question. Will Flyte's scheduler consider co-locality between two tasks inside a workflow? After some experiments, I can see the answer should be "it won't", as FlytePropeller launches pods for each execution. Correct me if I'm wrong.
**Ketan (kumare3):** Ya, not today, but this feature is planned.
**Tao He:** Happy to hear that. Could I find any related docs or enhancement proposal drafts on this topic somewhere? Thanks!
**Ketan (kumare3):** Yes, I will share the enhancement proposal.
**Tao He:** Looking forward to it!
**Ketan (kumare3):** @Tao He, phase 2 of this RFC. cc @Kevin Su, can we please split this into 2 RFCs and add them to the official RFCs? https://docs.google.com/document/d/1-695lxz8a-GFz4cFamGkF1NhspKMkDGIDdkY9EDLMz8/edit
**Tao He:** Thanks for sharing ❤️
**Ketan (kumare3):** Absolutely. Ya, Ray is almost integrated, and we will eventually work on cluster reuse.