# flyte-support
c
At a high level, I’m curious why we need to use the Kubeflow Training Operator to start a PyTorchJob for multi-node multi-GPU support in Flyte, rather than supporting this directly as a multi-node “Python Task” managed by Flyte. Is this something that can be handled by the new JobSet API? https://kubernetes.io/blog/2025/03/23/introducing-jobset/
f
The training operator predates the JobSet API
We intend to add support for jobset soon
Contributions welcome
Cc @cool-lifeguard-49380 would love to collaborate maybe early summer on this
c
Hey @curved-whale-1505, I agree that it would be nice if JobSet was supported in Flyte, and in theory I'm also happy to help with the integration 🙂 Before we go there, I have some questions about how you as a user would expect to use JobSet in Flyte. Let's take your example of torch distributed: when using `@task(task_config=PyTorch(...))` or `@task(task_config=Elastic(...))`, which under the hood today translate to a Kubeflow `PyTorchJob`, the env vars required for torch distributed like `MASTER_ADDR`, `RANK`, or `WORLD_SIZE` are set automatically by the Kubeflow training operator. Since JobSet is not PyTorch-specific, in this JobSet example these env vars are configured "manually". (Being framework-agnostic is of course one of JobSet's main selling points.) If we translate this example to an imaginary Flyte JobSet plugin in the canonical way the other k8s backend plugins are built, it could look roughly like this:
```python
@task(
    task_config=JobSet(
        jobs=[
            ReplicatedJob(
                parallelism=4,
                worker=Worker(
                    env={...}  # Configure torch distributed env vars "manually".
                               # It's not immediately clear how `MASTER_ADDR` could be set,
                               # since it depends on the resulting pod name, which is not
                               # known at registration time.
                )
            )
        ]
    )
)
```
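One thing that might help with the `MASTER_ADDR` question: JobSet pods get stable hostnames of roughly the form `<jobset>-<replicatedJob>-<jobIndex>-<podIndex>`, resolvable through the JobSet's headless service, so the rendezvous address could in principle be derived at registration time. That said, this is exactly the kind of detail a plugin would have to own. A purely illustrative sketch (the helper name and port below are made up):
```python
# Hypothetical sketch only: derive torch distributed env vars from JobSet's
# stable pod hostnames, resolvable via the JobSet's headless service.
def torch_distributed_env(jobset_name: str, replicated_job: str,
                          parallelism: int, nproc_per_node: int) -> dict:
    master_addr = f"{jobset_name}-{replicated_job}-0-0.{jobset_name}"
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": "29500",  # common torch default; would need to be configurable
        "WORLD_SIZE": str(parallelism * nproc_per_node),
        # RANK can only be computed at runtime, e.g. from the JOB_COMPLETION_INDEX
        # env var that indexed Jobs inject into each pod, plus the local rank.
    }
```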
The main question I'm looking for feedback on is whether your expectation is that Flyte provides a generic `JobSet` task config via a plugin, and that you as the user configure this `JobSet` task config so that it works for the ML framework you want to use; in this case that means being responsible for setting the right torch distributed env vars. Or would you rather expect that Flyte has a torch distributed (+ tf, mpi, ...) task type that under the hood uses JobSet instead of Kubeflow? (Considering how other job CRDs are integrated into Flyte, I'm not sure I personally would like this idea.)
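For reference, the second option would presumably keep today's user-facing API and only swap the CRD created under the hood. A rough sketch of what that API looks like today with flytekitplugins-kfpytorch (exact kwargs depend on the plugin version):
```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic

# Today's kubeflow-backed torch elastic task; MASTER_ADDR, RANK, WORLD_SIZE, ...
# are injected by the training operator, not by the user.
@task(task_config=Elastic(nnodes=2, nproc_per_node=4))
def train() -> None:
    ...
```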
Curious what you think about this 🙂
f
@cool-lifeguard-49380 from my pov the latter. I am not interested in the first option, as that's basically a k8s wrapper where you write some YAML in Python without simplifying users' lives
c
Does this mean your goal is that the flytekit plugins like kf-mpi, kf-pytorch, and others stay as-is, but that in the propeller config you can switch the backend plugin executing the respective task type to jobset?
```yaml
enabled_plugins:
    tasks:
      task-plugins:
        enabled-plugins:
          - jobset
          - ...
        default-for-task-types:
          container: container
          sidecar: sidecar
          container_array: k8s-array
          pytorch: jobset                # replace backend plugin
```
Or would you have a jobset option in the existing flytekit plugins? Or would you have a new jobset flytekit plugin that has task configs for the different ml frameworks?
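To make that last variant concrete, a purely hypothetical sketch (none of these classes exist today; the names and fields are only illustrative):
```python
from dataclasses import dataclass

# Hypothetical flytekitplugins-jobset task configs that a backend plugin would
# lower to a JobSet CR; invented here purely to illustrate the idea.
@dataclass
class TorchJobSet:
    nnodes: int = 1
    nproc_per_node: int = 1
    max_restarts: int = 0

@dataclass
class MPIJobSet:
    num_workers: int = 1
    slots_per_worker: int = 1

# Usage would then mirror the existing kf plugins, e.g.
# @task(task_config=TorchJobSet(nnodes=2, nproc_per_node=8))
# def train() -> None: ...
```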
f
I would ideally want to drop support for the training operator
and move to a k8s-native JobSet with Flyte-native management of the "set"