# torch-elastic
I haven’t used it much, but ~1.5 years ago “I got an example to train with it” on k8s (which the docs don’t explicitly mention as supported). Ultimately, under the hood it also just uses `torch.distributed.init_process_group()`, see here. Back then I just created a Kubeflow PyTorchJob to run it, which worked. The image needed `nvidia-cuda-toolkit`. To summarize: as of ~1.5 years ago, I think it was already supported.
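For illustration, here is a rough sketch of what "it just uses `torch.distributed.init_process_group()`" can look like inside a training script. It is not taken from my original setup. It assumes the launcher (or the PyTorchJob pod spec) exports the usual `RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT` variables, and optionally `LOCAL_RANK`. The names `RendezvousEnv`, `read_rendezvous_env` and `init_distributed` are hypothetical helpers made up for this example.

```python
import os
from dataclasses import dataclass


@dataclass
class RendezvousEnv:
    """Hypothetical container for the variables the env:// init method relies on."""
    rank: int
    world_size: int
    local_rank: int
    master_addr: str
    master_port: int


def read_rendezvous_env(env=None) -> RendezvousEnv:
    # Assumption: the launcher exported these variables for every worker process.
    env = os.environ if env is None else env
    return RendezvousEnv(
        rank=int(env["RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        local_rank=int(env.get("LOCAL_RANK", 0)),  # falls back to 0 if unset
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
    )


def init_distributed(backend: str = "gloo") -> RendezvousEnv:
    # Imported lazily so read_rendezvous_env() can be used without PyTorch installed.
    import torch.distributed as dist

    cfg = read_rendezvous_env()
    # With init_method="env://", rank, world size and the master address/port
    # are taken from the environment variables read above.
    dist.init_process_group(backend=backend, init_method="env://")
    return cfg
```

In a GPU image you would probably pass `backend="nccl"` instead of the CPU-friendly `"gloo"` default used here, and call `init_distributed()` once at the start of the training script.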