Nan Qin
06/30/2023, 5:37 PM
RendezvousTimeoutError when launching DDP on EKS. It happens when some workers started running while others are waiting for resources to be available. After investigating the logs and the PyTorch code, we believe it is due to the join_timeout parameter, which defaults to 600s, as the RendezvousTimeoutError shows up exactly 600s after the pod starts running.
Not sure what the best workaround is, but it seems that adding something like
rdzv_configs={'join_timeout': int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))},
to the LaunchConfig could probably solve it.
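A minimal sketch of the workaround proposed above, assuming the timeout is read from the PET_RDZV_JOIN_TIMEOUT environment variable named in the message. The helper build_rdzv_configs and the constant DEFAULT_JOIN_TIMEOUT_S are hypothetical names used only for illustration; they are not part of PyTorch or flytekit.

```python
import os
from typing import Any, Dict, Mapping, Optional

# Default taken from the thread: the error shows up 600s after the pod starts.
DEFAULT_JOIN_TIMEOUT_S = 600


def build_rdzv_configs(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Build a rendezvous config dict with join_timeout read from the environment.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    with a plain dict for testing.
    """
    env = os.environ if env is None else env
    raw = env.get("PET_RDZV_JOIN_TIMEOUT", str(DEFAULT_JOIN_TIMEOUT_S))
    try:
        join_timeout = int(raw)
    except ValueError:
        raise ValueError(
            f"PET_RDZV_JOIN_TIMEOUT must be an integer number of seconds, got {raw!r}"
        ) from None
    if join_timeout <= 0:
        raise ValueError(f"PET_RDZV_JOIN_TIMEOUT must be positive, got {join_timeout}")
    return {"join_timeout": join_timeout}


# Assumed usage: the result would be passed as the rdzv_configs argument
# when the plugin constructs its LaunchConfig, e.g.
#   LaunchConfig(..., rdzv_configs=build_rdzv_configs())
```

Validating the value up front means a malformed setting fails at launch rather than 600s into a rendezvous.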
Please lmk if this is the right approach. Would love to contribute.
Fabio Grätz
06/30/2023, 5:59 PM
torch.distributed.init_process_group() allows to change this (docs)?
"It happens when some workers started running while others are waiting for resources to be available."
Yes, this is expected. Because this is very annoying, we actually configured the Kubeflow training operator to use the scheduling plugins scheduler to do gang scheduling. This has the effect that the pods only start if all of them can start.
torch.distributed.init_process_group(timeout=) doesn’t do the trick.
Nan Qin
06/30/2023, 6:04 PM
torch.distributed.init_process_group is different (defaults to 30 mins instead of 10 mins)
Fabio Grätz
06/30/2023, 6:10 PM
/manager --gang-scheduler-name=scheduler-plugins
• Your Flyte tasks need this scheduler name: schedulerName: scheduler-plugins-schedule
"would love to contribute"
If you want to explore this, I’m happy to help/spar.
Nan Qin
06/30/2023, 6:15 PM
Niels Bantilan
07/01/2023, 1:14 AM
Ketan (kumare3)
Fabio Grätz
07/03/2023, 7:00 AM
Exposing
rdzv_configs={'join_timeout': int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))},
as suggested by @Nan Qin, to the user via the task_config is still reasonable, I’d say. Not everyone wants to run a different scheduler in order to do distributed training.
Nan Qin
07/03/2023, 1:24 PM
@flytekit.task(
    task_config=kfpytorch.Elastic(nnodes=NNODES, nproc_per_node=NPROC_PER_NODE),
    cache=CACHE,
    cache_version=CACHE_VERSION,
    requests=flytekit.Resources(gpu=GPU, cpu=CPU, mem=MEM),
)
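The idea discussed earlier in the thread, letting the user set rdzv_configs through the task_config, could look roughly like the sketch below. ElasticSketch and launch_kwargs are hypothetical stand-ins written for illustration only; they are not the flytekit kfpytorch.Elastic class or its plugin code, and the real field names may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ElasticSketch:
    """Hypothetical stand-in for an elastic task config (not the flytekit class)."""

    nnodes: int = 1
    nproc_per_node: int = 1
    # User-supplied rendezvous settings, e.g. {"join_timeout": 1800}.
    rdzv_configs: Dict[str, Any] = field(default_factory=dict)


def launch_kwargs(cfg: ElasticSketch) -> Dict[str, Any]:
    """Collect the keyword arguments a plugin might forward to the launcher."""
    return {
        "min_nodes": cfg.nnodes,
        "max_nodes": cfg.nnodes,
        "nproc_per_node": cfg.nproc_per_node,
        # Copy so later changes to the task config do not leak into the launch.
        "rdzv_configs": dict(cfg.rdzv_configs),
    }


# Usage: raise the join timeout to 30 minutes for a 2-node, 4-process-per-node job.
cfg = ElasticSketch(nnodes=2, nproc_per_node=4, rdzv_configs={"join_timeout": 1800})
kwargs = launch_kwargs(cfg)
```

With an empty default, tasks that do not set rdzv_configs would keep the launcher's own defaults, so the change would be opt-in.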
Fabio Grätz
07/03/2023, 2:57 PM
Nan Qin
07/12/2023, 3:58 PM
Fabio Grätz
07/13/2023, 7:03 AM