# flyte-support
c
At a high level, I’m curious why we need to use the Kubeflow Training Operator to start a PyTorchJob for multi-node multi-GPU support in Flyte, rather than supporting this directly as a multi-node “Python Task” managed by Flyte. Is this something that can be handled by the new JobSet API? https://kubernetes.io/blog/2025/03/23/introducing-jobset/
f
The training operator predates the JobSet API
We intend to add support for jobset soon
Contributions welcome
Cc @cool-lifeguard-49380 would love to collaborate maybe early summer on this
c
Hey @curved-whale-1505, I agree that it would be nice if JobSet was supported in Flyte, and in theory I'm also happy to help with the integration 🙂 Before we go there, I have some questions about how you as a user would expect to use JobSet in Flyte. Let's take your example of torch distributed: when using `@task(task_config=PyTorch(...))` or `@task(task_config=Elastic(...))`, which under the hood today translate to a Kubeflow `PyTorchJob`, the env vars required for torch distributed like `MASTER_ADDR`, `RANK`, or `WORLD_SIZE` are set automatically by the Kubeflow training operator. Since JobSet is not PyTorch-specific, in this JobSet example these env vars are configured "manually". (Being framework-agnostic is of course one of JobSet's main selling points.) If we translate this example to an imaginary Flyte JobSet plugin in the canonical way the other k8s backend plugins are built, it could look roughly like this:
```python
@task(
    task_config=JobSet(
        jobs=[
            ReplicatedJob(
                parallelism=4,
                worker=Worker(
                    env={...}  # Configure torch distributed env vars "manually".
                               # It's not immediately clear how `MASTER_ADDR` could be set,
                               # since it depends on the resulting pod name, which is not
                               # known at registration time.
                )
            )
        ]
    )
)
```
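One thing that might help with the `MASTER_ADDR` question: JobSet pods get stable hostnames of roughly the form `<jobset>-<replicatedJob>-<jobIndex>-<podIndex>`, resolvable through the JobSet's headless service, so the rendezvous address could in principle be derived at registration time. That said, this is exactly the kind of detail a plugin would have to own. A purely illustrative sketch (the helper name and port below are made up):
```python
# Hypothetical sketch only: derive torch distributed env vars from JobSet's
# stable pod hostnames, resolvable via the JobSet's headless service.
def torch_distributed_env(jobset_name: str, replicated_job: str,
                          parallelism: int, nproc_per_node: int) -> dict:
    master_addr = f"{jobset_name}-{replicated_job}-0-0.{jobset_name}"
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": "29500",  # common torch default; would need to be configurable
        "WORLD_SIZE": str(parallelism * nproc_per_node),
        # RANK can only be computed at runtime, e.g. from the JOB_COMPLETION_INDEX
        # env var that indexed Jobs inject into each pod, plus the local rank.
    }
```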
The main question I'm looking for feedback on is whether your expectation is that Flyte provides a generic `JobSet` task config via a plugin, and that you as the user configure this `JobSet` task config so that it works for the ML framework you want to use; in this case that means being responsible for setting the right torch distributed env vars. Or would you rather expect that Flyte has a torch distributed (+ tf, mpi, ...) task type that under the hood uses JobSet instead of Kubeflow? (Considering how other job CRDs are integrated into Flyte, I'm not sure I personally would like this idea.)
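For reference, the second option would presumably keep today's user-facing API and only swap the CRD created under the hood. A rough sketch of what that API looks like today with flytekitplugins-kfpytorch (exact kwargs depend on the plugin version):
```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic

# Today's kubeflow-backed torch elastic task; MASTER_ADDR, RANK, WORLD_SIZE, ...
# are injected by the training operator, not by the user.
@task(task_config=Elastic(nnodes=2, nproc_per_node=4))
def train() -> None:
    ...
```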
Curious what you think about this 🙂
f
@cool-lifeguard-49380 from my pov the latter. I am not interested in the first option, as that's basically a k8s wrapper where you write some YAML in Python without simplifying users' lives
c
Does this mean your goal is that the flytekit plugins like kf-mpi, kf-pytorch, and others stay as-is, but that in the propeller config you can switch the backend plugin executing the respective task type to jobset?
```yaml
enabled_plugins:
    tasks:
      task-plugins:
        enabled-plugins:
          - jobset
          - ...
        default-for-task-types:
          container: container
          sidecar: sidecar
          container_array: k8s-array
          pytorch: jobset                # replace backend plugin
```
Or would you have a jobset option in the existing flytekit plugins? Or would you have a new jobset flytekit plugin that has task configs for the different ml frameworks?
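To make that last variant concrete, a purely hypothetical sketch (none of these classes exist today; the names and fields are only illustrative):
```python
from dataclasses import dataclass

# Hypothetical flytekitplugins-jobset task configs that a backend plugin would
# lower to a JobSet CR; invented here purely to illustrate the idea.
@dataclass
class TorchJobSet:
    nnodes: int = 1
    nproc_per_node: int = 1
    max_restarts: int = 0

@dataclass
class MPIJobSet:
    num_workers: int = 1
    slots_per_worker: int = 1

# Usage would then mirror the existing kf plugins, e.g.
# @task(task_config=TorchJobSet(nnodes=2, nproc_per_node=8))
# def train() -> None: ...
```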
f
I would ideally want to drop support for the training operator
and move to a k8s-native JobSet with Flyte-native management of the "set"