Nan Qin
06/30/2023, 5:37 PM
We see a RendezvousTimeoutError when launching DDP on EKS. It happens when some workers have started running while others are still waiting for resources to become available. After investigating the logs and the PyTorch code, we believe it is due to the join_timeout parameter, which defaults to 600s, as the RendezvousTimeoutError shows up exactly 600s after the pod starts running.
Not sure what the best workaround is, but it seems adding something like
rdzv_configs={'join_timeout': int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))},
to the LaunchConfig could probably solve it.
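[Editor's note] A minimal sketch of the proposed workaround, assuming the env-var override described above; the helper name is illustrative, only PET_RDZV_JOIN_TIMEOUT and the 600s default come from the thread:

```python
import os

def build_rdzv_configs() -> dict:
    # join_timeout: how long each worker waits at rendezvous for the full
    # group to assemble; 600 s is the torchelastic default per the thread.
    # An env var lets clusters with slow pod scheduling raise it without
    # code changes. (Helper name is hypothetical, not flytekit's API.)
    return {"join_timeout": int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))}

# e.g. pass build_rdzv_configs() as rdzv_configs= when building the LaunchConfig
```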
Please let me know if this is the right approach. Would love to contribute.
Fabio Grätz
06/30/2023, 5:59 PM
torch.distributed.init_process_group() allows changing this (docs)?
Fabio Grätz
06/30/2023, 6:01 PM
> It happens when some workers started running while others are waiting for resources to be available.
Yes, this is expected. Because this is very annoying, we actually configured the Kubeflow training operator to use the scheduler-plugins scheduler to do gang scheduling. This has the effect that the pods only start if all of them can start.
Fabio Grätz
06/30/2023, 6:02 PMtorch.distributed.init_process_group(timeout=)
doesn’t do the trick.Fabio Grätz
06/30/2023, 6:02 PMNan Qin
06/30/2023, 6:04 PMtorch.distributed.init_process_group
is different (defaults to 30mins instead of 10mins)Nan Qin
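[Editor's note] The two timeouts being distinguished here can be sketched as follows; the values are the defaults named in the thread, the constant names are illustrative, not PyTorch's:

```python
from datetime import timedelta

# The two distinct timeouts discussed above (constant names are mine):
RDZV_JOIN_TIMEOUT = timedelta(seconds=600)  # torchelastic rendezvous join_timeout default
PG_OP_TIMEOUT = timedelta(minutes=30)       # init_process_group(timeout=...) default

# init_process_group(timeout=...) bounds collective operations on an already
# formed process group; it does not extend the rendezvous join wait, which is
# why changing it "doesn't do the trick" for the error described above.
```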
Fabio Grätz
06/30/2023, 6:10 PM/manager --gang-scheduler-name=scheduler-plugins
• Your flyte tasks need this scheduler name: schedulerName: scheduler-plugins-schedule
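[Editor's note] Assuming a default scheduler-plugins installation, the two pieces of configuration above would look roughly like this; these manifest fragments are illustrative, not taken from the thread:

```yaml
# Training operator deployment: pass the gang-scheduler flag to the manager
# binary (illustrative fragment of the container spec).
spec:
  containers:
    - name: training-operator
      command:
        - /manager
        - --gang-scheduler-name=scheduler-plugins
---
# Pod template of the distributed training task: opt into the secondary
# scheduler so all worker pods are scheduled as a gang, or not at all.
spec:
  schedulerName: scheduler-plugins-scheduler
```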
Fabio Grätz
06/30/2023, 6:14 PM
> would love to contribute
If you want to explore this, I'm happy to help/spar.
Fabio Grätz
07/03/2023, 7:00 AM
Exposing rdzv_configs={'join_timeout': int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))}, as suggested by @Nan Qin, to the user via the task_config is still reasonable, I'd say. Not everyone wants to run a different scheduler in order to do distributed training.
Nan Qin
07/03/2023, 1:41 PM
@flytekit.task(
    task_config=kfpytorch.Elastic(nnodes=NNODES, nproc_per_node=NPROC_PER_NODE),
    cache=CACHE,
    cache_version=CACHE_VERSION,
    requests=flytekit.Resources(gpu=GPU, cpu=CPU, mem=MEM),
)
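[Editor's note] The proposal in this thread amounts to threading a rendezvous join_timeout through the task config down into the torchrun launcher settings. A rough, hypothetical sketch of that flow, with a plain dataclass standing in for the plugin's Elastic config; the class, field, and helper names are illustrative, not flytekit's real implementation:

```python
import os
from dataclasses import dataclass, field

# Hypothetical Elastic-style task config carrying rdzv_configs, per the
# thread's suggestion. Not the real flytekitplugins.kfpytorch code.
@dataclass
class ElasticConfig:
    nnodes: int
    nproc_per_node: int
    rdzv_configs: dict = field(
        default_factory=lambda: {
            "join_timeout": int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))
        }
    )

def launch_config_kwargs(cfg: ElasticConfig) -> dict:
    # Keyword arguments a plugin could forward to
    # torch.distributed.launcher.api.LaunchConfig.
    return {
        "min_nodes": cfg.nnodes,
        "max_nodes": cfg.nnodes,
        "nproc_per_node": cfg.nproc_per_node,
        "rdzv_configs": cfg.rdzv_configs,
    }
```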