shy-accountant-549
06/30/2023, 5:37 PM
We are hitting a RendezvousTimeoutError when launching DDP on EKS. It happens when some workers have started running while others are still waiting for resources to become available. After investigating the logs and the PyTorch code, we believe it is due to the join_timeout parameter, which defaults to 600s: the RendezvousTimeoutError shows up exactly 600s after the pod starts running.
Not sure what the best workaround is, but adding something like rdzv_configs={'join_timeout': int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))} to the LaunchConfig could probably solve it.
Please lmk if this is the right approach. Would love to contribute.
cool-lifeguard-49380
06/30/2023, 5:59 PM
Does torch.distributed.init_process_group() allow changing this (docs)?
cool-lifeguard-49380
06/30/2023, 6:01 PM
"It happens when some workers started running while others are waiting for resources to be available."
Yes, this is expected. Because this is very annoying, we actually configured the Kubeflow training operator to use the scheduler-plugins scheduler to do gang scheduling. This has the effect that the pods only start if all of them can start.
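A rough way to see why staggered pod starts trigger the error and gang scheduling avoids it: under a simplified model (an assumption for illustration, not PyTorch's actual rendezvous code), the earliest worker waits at most join_timeout seconds for the others, so the rendezvous fails whenever the last pod starts more than join_timeout after the first. The function name and the pod start times below are hypothetical.

```python
from typing import Sequence


def rendezvous_times_out(pod_start_times: Sequence[float], join_timeout: float = 600.0) -> bool:
    """Simplified model: the earliest worker gives up if the last one
    starts more than `join_timeout` seconds after it."""
    first, last = min(pod_start_times), max(pod_start_times)
    return (last - first) > join_timeout


# Hypothetical start times in seconds since the job was submitted.
staggered = [0.0, 30.0, 900.0]   # one pod waits 15 minutes for resources
gang = [120.0, 121.0, 122.0]     # gang scheduling: all pods start together

print(rendezvous_times_out(staggered))  # True
print(rendezvous_times_out(gang))       # False
```

Under this model, raising join_timeout and gang scheduling attack the same gap from two sides: one tolerates a longer wait, the other removes the wait.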
cool-lifeguard-49380
06/30/2023, 6:02 PM
torch.distributed.init_process_group(timeout=) doesn’t do the trick.
cool-lifeguard-49380
06/30/2023, 6:02 PM
shy-accountant-549
06/30/2023, 6:04 PM
The timeout in torch.distributed.init_process_group is different (it defaults to 30 min instead of 10 min).
shy-accountant-549
06/30/2023, 6:05 PM
cool-lifeguard-49380
06/30/2023, 6:10 PM
/manager --gang-scheduler-name=scheduler-plugins
• Your Flyte tasks need this scheduler name: schedulerName: scheduler-plugins-schedule
cool-lifeguard-49380
06/30/2023, 6:12 PM
cool-lifeguard-49380
06/30/2023, 6:12 PM
cool-lifeguard-49380
06/30/2023, 6:12 PM
cool-lifeguard-49380
06/30/2023, 6:13 PM
cool-lifeguard-49380
06/30/2023, 6:14 PM
"would love to contribute"
If you want to explore this, I’m happy to help/spar.
cool-lifeguard-49380
06/30/2023, 6:14 PM
shy-accountant-549
06/30/2023, 6:15 PM
shy-accountant-549
06/30/2023, 11:29 PM
broad-monitor-993
07/01/2023, 1:14 AM
freezing-airport-6809
cool-lifeguard-49380
07/03/2023, 7:00 AM
Exposing rdzv_configs={'join_timeout': int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))}, as suggested by @shy-accountant-549, to the user via the task_config is still reasonable, I’d say. Not everyone wants to run a different scheduler in order to do distributed training.
cool-lifeguard-49380
07/03/2023, 7:01 AM
shy-accountant-549
07/03/2023, 1:24 PM
shy-accountant-549
07/03/2023, 1:41 PM
@flytekit.task(
    task_config=kfpytorch.Elastic(nnodes=NNODES, nproc_per_node=NPROC_PER_NODE),
    cache=CACHE,
    cache_version=CACHE_VERSION,
    requests=flytekit.Resources(gpu=GPU, cpu=CPU, mem=MEM),
)
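To make the proposal from earlier in the thread concrete, here is a minimal sketch of a task config that carries rdzv_configs and falls back to the PET_RDZV_JOIN_TIMEOUT environment variable. ElasticSketch and default_rdzv_configs are hypothetical names used only for illustration; this is not the real kfpytorch.Elastic class, and the thread does not confirm that Elastic accepts such a field.

```python
import os
from dataclasses import dataclass, field
from typing import Any, Dict


def default_rdzv_configs() -> Dict[str, Any]:
    # PET_RDZV_JOIN_TIMEOUT is the variable name suggested in the thread;
    # 600 is the join_timeout default mentioned there.
    return {"join_timeout": int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))}


@dataclass
class ElasticSketch:
    """Hypothetical stand-in for a task config; NOT the real kfpytorch.Elastic."""

    nnodes: int = 1
    nproc_per_node: int = 1
    # Evaluated at instantiation, so the env var is read when the config is built.
    rdzv_configs: Dict[str, Any] = field(default_factory=default_rdzv_configs)


# An explicit value set by the user wins over the environment default.
cfg = ElasticSketch(nnodes=2, nproc_per_node=4, rdzv_configs={"join_timeout": 1800})
print(cfg.rdzv_configs)
```

The resulting dict would then be forwarded as the rdzv_configs argument of the LaunchConfig, as in the original suggestion.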
cool-lifeguard-49380
07/03/2023, 2:57 PM
cool-lifeguard-49380
07/12/2023, 9:41 AM
shy-accountant-549
07/12/2023, 3:58 PM
cool-lifeguard-49380
07/13/2023, 7:03 AM