Thread
#announcements
    Tao He
    1 month ago
    Hi folks, I have read [the excellent tutorial about distributed training](https://docs.flyte.org/projects/cookbook/en/stable/auto/case_studies/ml_training/spark_horovod/keras_spark_rossmann_estimator.html#). I'm wondering whether the sentence "`SparkBackend` uses `horovod.spark.run` to execute the distributed training function" means that Flyte will launch `num_proc` Spark workers on `num_proc` Flyte workers, or does it just launch them on one Spark worker?
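    For context: `horovod.spark.run` takes the training function plus a `num_proc` argument and schedules that many Horovod processes onto the executors of the current Spark session. A minimal sketch, with an illustrative function body:

    ```python
    import horovod.spark

    def train():
        # Runs inside each of the num_proc Horovod processes that
        # horovod.spark.run schedules onto the Spark executors.
        import horovod.tensorflow.keras as hvd
        hvd.init()
        print(f"rank {hvd.rank()} of {hvd.size()}")

    # Launches num_proc copies of train() as tasks on the current
    # Spark session's executors.
    horovod.spark.run(train, num_proc=4)
    ```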
    Ketan (kumare3)
    1 month ago
    hey @Tao He welcome to the community. There are no "Flyte workers" today. Flyte launches ephemeral Spark clusters using the Spark operator or Spark on K8s. In the Horovod case, it simply uses the Horovod-with-Spark integration; just make sure the configuration is correct
    Tao He
    1 month ago
    > Flyte launches ephemeral Spark clusters using the Spark operator or Spark on K8s.
    I see. Thanks!
    Hi @Ketan (kumare3), I have a follow-up question. Under such a setting, does the Spark worker still fetch data (the Flyte literals) through the Flyte API from the underlying remote object store (S3)? If so, I'm wondering how Flyte's local cache works, since the Spark worker is not managed by Flyte and the dataframe passed to Spark shouldn't be present in the local cache. Or should I understand it as: the local cache avoids repeatedly fetching the data (the dataframe passed to Spark) from S3, but the first fetch is still needed?
    Ketan (kumare3)
    1 month ago
    I do not follow - what is "local cache"? Do you mean the Flyte cache for tasks? That is only used to memoize a previous execution and prevent a re-run. There is no optimization for how Spark handles the data; Spark will have to download it from the object store. Flyte tries to avoid starting the Spark cluster itself if the task is cached
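    For reference, the task cache being described is opt-in in flytekit; a minimal sketch, with a placeholder task body:

    ```python
    from flytekit import task

    # With cache=True, Flyte memoizes the execution: if the same task
    # is called again with identical inputs (and the same
    # cache_version), Flyte returns the stored outputs instead of
    # launching a new Spark cluster and re-running the job.
    @task(cache=True, cache_version="1.0")
    def cached_spark_job(input_path: str) -> str:
        # Placeholder for the actual Spark/Horovod work.
        return input_path + "/model"
    ```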
    Tao He
    1 month ago
    > Flyte tries to avoid starting the Spark cluster itself if the task is cached
    Now I understand that the Spark job (launching the Spark cluster using the spark-operator and submitting a job to it) is "a single task" in a Flyte workflow.
    I have a follow-up question. Will Flyte's scheduler consider co-locality between two tasks inside a workflow? After some experiments, it seems the answer is "no", since FlytePropeller launches pods for each execution. Correct me if I'm wrong.
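    To make the co-locality question concrete, a minimal sketch of two dependent tasks (hypothetical names); with default scheduling, each task execution gets its own pod:

    ```python
    from flytekit import task, workflow

    @task
    def preprocess(n: int) -> int:
        # Runs in its own pod, launched by FlytePropeller.
        return n * 2

    @task
    def train(n: int) -> int:
        # Runs in a separate pod; Kubernetes may place it on a
        # different node, so there is no co-location guarantee.
        return n + 1

    @workflow
    def pipeline(n: int = 1) -> int:
        return train(n=preprocess(n=n))
    ```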
    Ketan (kumare3)
    1 month ago
    Ya, not today, but this feature is planned
    Tao He
    1 month ago
    Happy to know that. Are there any related docs or an enhancement proposal draft on this topic? Thanks!
    Ketan (kumare3)
    1 month ago
    Yes, will share the enhancement proposal
    Tao He
    1 month ago
    Looking forward to it!
    Ketan (kumare3)
    1 month ago
    @Tao He this is phase 2 of this RFC. cc @Kevin Su, can we please split this into 2 RFCs and add them to the official RFCs? https://docs.google.com/document/d/1-695lxz8a-GFz4cFamGkF1NhspKMkDGIDdkY9EDLMz8/edit
    Tao He
    1 month ago
    Thanks for sharing ❤️
    Ketan (kumare3)
    1 month ago
    Absolutely
    Ya, Ray is almost integrated, and we will eventually work on cluster reuse