cool-lifeguard-49380
04/20/2023, 4:17 PM
torchrun allows the user to set --nnodes, which could be e.g. 2 but also "1:2", meaning min 1, max 2. Currently this is what our new task_config=Elastic() exposes as well.
The Kubeflow PyTorchJob allows setting minReplicas, maxReplicas (both None by default), and replicas. In theory you could say min 2, max 4, replicas 3 (without going into how much sense this makes).
If a user specifies 2:3, we currently set min to 2 and both max and replicas to 3.
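To make that mapping concrete, here is a rough sketch of what it amounts to. This is not the actual plugin code: `ReplicaSpec` and `parse_nnodes` are made-up names for illustration, and leaving min/max as None for a plain integer is an assumption that simply mirrors the PyTorchJob defaults mentioned above.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ReplicaSpec:
    # Hypothetical container for illustration; not a flytekit or Kubeflow type.
    min_replicas: Optional[int]
    max_replicas: Optional[int]
    replicas: int


def parse_nnodes(nnodes: Union[int, str]) -> ReplicaSpec:
    """Map a torchrun-style nnodes value (2 or "1:2") to replica settings."""
    text = str(nnodes).strip()
    if ":" in text:
        min_str, max_str = text.split(":", 1)
        min_nodes, max_nodes = int(min_str), int(max_str)
        if min_nodes < 1 or min_nodes > max_nodes:
            raise ValueError(f"invalid nnodes range: {text!r}")
        # Behaviour described in the thread: replicas follows the max.
        return ReplicaSpec(min_nodes, max_nodes, max_nodes)
    fixed = int(text)
    if fixed < 1:
        raise ValueError(f"nnodes must be at least 1, got {fixed}")
    # Assumption: a fixed node count leaves min/max unset.
    return ReplicaSpec(None, None, fixed)
```

With this, "2:3" gives min 2, max 3, replicas 3. Something like min 2, max 4, replicas 3 cannot be expressed through nnodes alone, which is the trade-off in question.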
To summarize: Should we expose nnodes like torchrun, or min_replicas, max_replicas, and replicas like the PyTorchJob, to the user?
freezing-airport-6809
cool-lifeguard-49380
04/21/2023, 7:14 AM
With 3:5, we set maxReplicas but also replicas to 5. In theory this doesn’t have to be the case in the PyTorchJob manifest.
I’ll change it to the more explicit version 👍
freezing-airport-6809