• Tao He
    1 week ago
    Hi folks, I have read [the excellent tutorial about distributed training](https://docs.flyte.org/projects/cookbook/en/stable/auto/case_studies/ml_training/spark_horovod/keras_spark_rossmann_estimator.html#). I'm wondering whether the sentence "'SparkBackend' uses `horovod.spark.run` to execute the distributed training function" means that Flyte will launch `num_proc` Spark workers across `num_proc` Flyte workers, or whether it just launches them on one Spark worker?
  • Ketan (kumare3)
    1 week ago
    hey @Tao He, welcome to the community. There are no "Flyte workers" today. Flyte launches ephemeral Spark clusters using the Spark operator or Spark on K8s. In the Horovod case, it simply uses Horovod's Spark integration; just make sure the configuration is correct.
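    (For illustration, a minimal sketch of what such a task might look like, assuming flytekitplugins-spark and horovod are installed; the Spark conf values, `train_fn`, and `num_proc=2` are placeholder choices, not taken from the linked tutorial:)

    ```python
    import horovod.spark
    from flytekit import task
    from flytekitplugins.spark import Spark


    def train_fn() -> int:
        # Runs once inside each of the `num_proc` Horovod workers, which
        # horovod.spark.run schedules as tasks on the Spark executors.
        import horovod.tensorflow.keras as hvd

        hvd.init()
        # ... build and fit the model here ...
        return hvd.rank()


    @task(
        task_config=Spark(
            spark_conf={
                "spark.executor.instances": "2",  # executors in the ephemeral cluster
                "spark.executor.cores": "1",
            }
        )
    )
    def horovod_spark_task() -> int:
        # Flyte stands up an ephemeral Spark cluster just for this task;
        # horovod.spark.run then fans the training function out across it.
        results = horovod.spark.run(train_fn, num_proc=2)
        return results[0]
    ```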
  • Tao He
    1 week ago
    "Flyte launches ephemeral Spark clusters using the Spark operator or Spark on K8s." I see, thanks!
  • Hi @Ketan (kumare3), I have a follow-up question. Under such a setting, does the Spark worker still fetch the data (the Flyte literals) from the underlying remote object store (S3) via the Flyte API? If so, I'm wondering how Flyte's local cache works, since the Spark worker is not managed by Flyte and the dataframe passed to Spark shouldn't be present in the local cache. Or should I understand it this way: the local cache avoids fetching the data (the dataframe passed to Spark) from S3 more than once, but the first fetch is still needed?
  • Ketan (kumare3)
    1 week ago
    I do not follow: what is the local cache? Do you mean the Flyte cache for tasks? That is only used to memoize a previous execution and prevent a re-run. There is no optimization for how Spark handles the data; Spark will have to download it from the object store. Flyte only tries to avoid starting the Spark cluster itself if the task is cached.
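    (A minimal sketch of task-level caching, assuming flytekit and the Spark plugin; the task body, inputs, and `cache_version` value are illustrative:)

    ```python
    from flytekit import task
    from flytekitplugins.spark import Spark


    @task(
        cache=True,
        cache_version="1.0",  # bump this to invalidate previously cached outputs
        task_config=Spark(spark_conf={"spark.executor.instances": "2"}),
    )
    def spark_training_task(epochs: int) -> float:
        # On a cache hit (same inputs and cache_version), Flyte returns the
        # stored output and never starts the ephemeral Spark cluster; on a
        # miss, the cluster is launched and the task runs as usual.
        loss = 1.0 / epochs  # placeholder for the real training logic
        return loss
    ```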
  • Tao He
    1 week ago
    "Flyte only tries to avoid starting the Spark cluster itself if the task is cached." Now I understand that the Spark job (launching the Spark cluster using the spark-operator and submitting a job to it) is a single task in a Flyte workflow.
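    (To make that concrete: a sketch of a workflow in which the whole Spark job is one node, reusing the hypothetical `horovod_spark_task` from the sketch above:)

    ```python
    from flytekit import workflow


    @workflow
    def training_wf() -> int:
        # The entire Spark job (cluster launch, Horovod run, teardown)
        # executes as this single task node in the workflow DAG.
        return horovod_spark_task()
    ```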
  • I have another follow-up question. Will Flyte's scheduler consider co-locality between two tasks inside a workflow? After some experiments, I can see the answer should be "it won't", as Flyte Propeller launches separate pods for each task execution. Correct me if I am wrong.
  • Ketan (kumare3)
    1 week ago
    Ya, not today, but this feature is planned.
  • Tao He
    1 week ago
    Happy to know that. Can I find any related docs or an enhancement proposal draft on this topic somewhere? Thanks!
  • Ketan (kumare3)
    1 week ago
    Yes, I will share the enhancement proposal.
  • Tao He
    1 week ago
    Looking forward to it!
  • Ketan (kumare3)
    1 week ago
    @Tao He this is phase 2 of this RFC. cc @Kevin Su, can we please split this into 2 RFCs and add them to the official RFCs? https://docs.google.com/document/d/1-695lxz8a-GFz4cFamGkF1NhspKMkDGIDdkY9EDLMz8/edit
  • Tao He
    1 week ago
    Thanks for sharing ❤️
  • Ketan (kumare3)
    1 week ago
    Absolutely
  • Ya, Ray is almost integrated, and we will eventually work on cluster reuse.