curved-whale-1505
04/25/2025, 1:21 AM
freezing-airport-6809
freezing-airport-6809
freezing-airport-6809
freezing-airport-6809
cool-lifeguard-49380
04/27/2025, 11:57 AM
With `@task(task_config=PyTorch(...))` or `@task(task_config=Elastic(...))`, which under the hood today translate to a Kubeflow `PyTorchJob`, the required env vars for torch distributed like `MASTER_ADDR`, `RANK` or `WORLD_SIZE`, ... will be set automatically by the Kubeflow training operator.
Since JobSet is not PyTorch-specific, in this JobSet example these env vars are configured "manually". (Being framework-agnostic is of course one of JobSet's main selling points.)
If we translate this example to an imaginary Flyte JobSet plugin in the canonical way the other k8s backend plugins are built, it could look roughly like this:
@task(
    task_config=JobSet(
        jobs=[
            ReplicatedJob(
                parallelism=4,
                worker=Worker(
                    # Configure torch distributed env vars "manually".
                    # It's not immediately clear how `MASTER_ADDR` could be set, since it
                    # depends on the resulting pod name, which is not known at registration time.
                    env={...},
                ),
            )
        ]
    )
)
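To make the `MASTER_ADDR` concern concrete, here is a minimal sketch of how the env vars could be derived once the JobSet name is known. It assumes JobSet's DNS hostname convention of `<jobset-name>-<replicated-job-name>-<job-index>-<pod-index>.<subdomain>`, with the subdomain defaulting to the JobSet name. The helper `torch_env_for_jobset` and its parameters are hypothetical and not part of flytekit or JobSet. In Flyte the name would only exist at execution time, so something like this would have to run in the backend plugin rather than at registration.

```python
from typing import Dict, Optional


def torch_env_for_jobset(
    jobset_name: str,
    replicated_job: str = "workers",
    num_workers: int = 4,
    master_port: int = 29500,
    subdomain: Optional[str] = None,
) -> Dict[str, str]:
    """Build torch distributed env vars shared by all pods (hypothetical helper).

    Assumes pods are reachable as
    <jobset>-<replicated job>-<job index>-<pod index>.<subdomain>.
    """
    # Assumption: the headless service subdomain defaults to the JobSet name.
    subdomain = subdomain or jobset_name
    # The first pod of the first job acts as the rendezvous master.
    master_addr = f"{jobset_name}-{replicated_job}-0-0.{subdomain}"
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(num_workers),
        # RANK differs per pod, so it is not set here; it would have to come
        # from the pod's job completion index.
    }


env = torch_env_for_jobset("my-execution")
print(env["MASTER_ADDR"])
```

`RANK` is left out on purpose because it cannot be a static value in the task config; each pod would need to read its own index.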
The main question I'm looking for feedback on is whether you expect Flyte to provide a generic `JobSet` task config via a plugin, which you as the user would configure so that it works for the ML framework you want to use. In this case, that means being responsible for setting the right torch distributed env vars.
Or would you rather expect Flyte to have a torch distributed (+ tf, mpi, ...) task type that uses JobSet under the hood instead of Kubeflow? (Considering how other job CRDs are integrated into Flyte, I'm not sure I personally would like this idea.)
cool-lifeguard-49380
04/27/2025, 11:59 AM
freezing-airport-6809
cool-lifeguard-49380
04/28/2025, 5:34 PM
enabled_plugins:
  tasks:
    task-plugins:
      enabled-plugins:
        - jobset
        - ...
      default-for-task-types:
        container: container
        sidecar: sidecar
        container_array: k8s-array
        pytorch: jobset  # replace backend plugin
Or would you have a jobset option in the existing flytekit plugins?
Or would you have a new jobset flytekit plugin that has task configs for the different ML frameworks?
freezing-airport-6809
freezing-airport-6809