Fabio Grätz
04/20/2023, 4:17 PM
torchrun allows the user to set --nnodes, which could e.g. be 2 but also be "1:2", which means min 1, max 2. Currently this is what our new task_config=Elastic() exposes as well.
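To make the two accepted forms of nnodes concrete, here is a minimal sketch of how such a value could be split into a min and a max. The helper name parse_nnodes is hypothetical and is not taken from torchrun or flytekit; it only illustrates the "N" versus "MIN:MAX" convention described above.

```python
from typing import Tuple, Union


def parse_nnodes(nnodes: Union[int, str]) -> Tuple[int, int]:
    """Return (min_nodes, max_nodes) for a torchrun-style nnodes value.

    Hypothetical helper for illustration only.
    """
    text = str(nnodes)
    if ":" in text:
        # Range form, e.g. "1:2" means min 1, max 2.
        low, high = text.split(":", 1)
        min_nodes, max_nodes = int(low), int(high)
    else:
        # A single number means a fixed size, so min equals max.
        min_nodes = max_nodes = int(text)
    if not 0 < min_nodes <= max_nodes:
        raise ValueError(f"invalid nnodes value: {nnodes!r}")
    return min_nodes, max_nodes


print(parse_nnodes(2))      # (2, 2)
print(parse_nnodes("1:2"))  # (1, 2)
```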
The kubeflow PytorchJob allows setting minReplicas, maxReplicas (which by default are both None), and replicas (see here). In theory you could say min 2, max 4, replicas 3 (without going into how much sense this makes).
If a user specifies 2:3, we currently set min to 2 and both max and replicas to 3.
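As a rough sketch of the difference between the current mapping and the fully explicit form: the ReplicaSettings class and from_node_range function below are hypothetical names made up for illustration, not the actual flytekit or PyTorchJob API. The field names simply mirror the ones mentioned in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplicaSettings:
    # Hypothetical container; min/max default to None as described above.
    replicas: int
    min_replicas: Optional[int] = None
    max_replicas: Optional[int] = None


def from_node_range(min_nodes: int, max_nodes: int) -> ReplicaSettings:
    # Mapping described above: replicas follows the upper bound of the range.
    return ReplicaSettings(
        replicas=max_nodes,
        min_replicas=min_nodes,
        max_replicas=max_nodes,
    )


# "2:3" under the current mapping: min 2, max 3, replicas 3.
current = from_node_range(2, 3)

# The explicit form can express combinations the range form cannot,
# such as min 2, max 4, replicas 3.
explicit = ReplicaSettings(replicas=3, min_replicas=2, max_replicas=4)

print(current)
print(explicit)
```

The range form keeps the user-facing surface identical to torchrun, while the explicit form allows replicas to differ from the max.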
To summarize: Should we expose nnodes like torchrun, or min_replicas, max_replicas, and replicas like the pytorchjob to the user?
Ketan (kumare3)
Fabio Grätz
04/21/2023, 7:14 AM
3:5, we set maxReplicas but also Replicas to 5. In theory this doesn’t have to be the case in the pytorchjob manifest.
I’ll change it to the more explicit version 👍
Ketan (kumare3)