# torch-elastic
f
One other thing about which I’m interested in your opinion: `torchrun` allows the user to set `--nnodes`, which could e.g. be `2` but also `"1:2"`, which means min 1, max 2. Currently this is what our new `task_config=Elastic()` exposes as well. The Kubeflow PyTorchJob allows setting `minReplicas`, `maxReplicas` (which by default are both None), and `replicas` (see here). In theory you could say min 2, max 4, replicas 3 (without going into how much sense this makes). If a user specifies `2:3`, we currently set min to 2 and max and replicas to 3. To summarize: should we expose `nnodes` like torchrun, or `min_replicas`, `max_replicas`, and `replicas` like the PyTorchJob, to the user?
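For illustration, the mapping described in this message could be sketched roughly as below. `ReplicaSpec` and `parse_nnodes` are hypothetical names made up for this sketch, not part of torchrun, the plugin, or the Kubeflow API, and treating a fixed node count as min == max is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaSpec:
    # Hypothetical container for the three PyTorchJob-style values.
    min_replicas: int
    max_replicas: int
    replicas: int


def parse_nnodes(nnodes) -> ReplicaSpec:
    """Map a torchrun-style nnodes value (2 or "1:2") to min/max/replicas."""
    parts = str(nnodes).split(":")
    if len(parts) == 1:
        # Assumption: a fixed node count is treated as min == max.
        lo = hi = int(parts[0])
    elif len(parts) == 2:
        lo, hi = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"expected N or MIN:MAX, got {nnodes!r}")
    if not 1 <= lo <= hi:
        raise ValueError(f"need 1 <= min <= max, got {nnodes!r}")
    # Behaviour described in the message: replicas follows the maximum.
    return ReplicaSpec(min_replicas=lo, max_replicas=hi, replicas=hi)


print(parse_nnodes("2:3"))
```

With this mapping, "min 2, max 4, replicas 3" cannot be expressed through `nnodes` alone, which is the limitation the question is about.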
k
ohh is that a question?
i like min and max
isnt it the same? but more explicit?
f
Currently we make the assumption that when the user specifies `3:5`, we set `maxReplicas` but also `replicas` to 5. In theory this doesn’t have to be the case in the PyTorchJob manifest. I’ll change it to the more explicit version 👍
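A minimal sketch of what a more explicit config might look like, assuming three separate fields. `ExplicitReplicas` is a hypothetical class for illustration only, and the `min <= replicas <= max` check is an assumption of this sketch, not something confirmed for the PyTorchJob spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExplicitReplicas:
    # Hypothetical explicit config; min/max default to None like the PyTorchJob fields.
    replicas: int
    min_replicas: Optional[int] = None
    max_replicas: Optional[int] = None

    def __post_init__(self):
        # Assumption: unset bounds fall back to `replicas` for validation only.
        lo = self.replicas if self.min_replicas is None else self.min_replicas
        hi = self.replicas if self.max_replicas is None else self.max_replicas
        if not 1 <= lo <= self.replicas <= hi:
            raise ValueError(
                f"need 1 <= min <= replicas <= max, got "
                f"min={lo}, replicas={self.replicas}, max={hi}"
            )


# min 2, max 4, replicas 3 is now expressible.
cfg = ExplicitReplicas(replicas=3, min_replicas=2, max_replicas=4)
print(cfg)
```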
k
Aah got it