we are getting `RendezvousTimeoutError` when launching ddp o Flyte #torch-elastic

shy-accountant-549

06/30/2023, 5:37 PM

we are getting `RendezvousTimeoutError` when launching ddp on eks. It happens when some workers have started running while others are still waiting for resources to become available. After investigating the logs and the pytorch code, we believe it is due to the `join_timeout` parameter, which defaults to 600s, as the `RendezvousTimeoutError` shows up exactly 600s after the pod starts running. Not sure what the best workaround is, but it seems adding something like

rdzv_configs={'join_timeout': int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))},

to the `LaunchConfig` could probably solve it. Please lmk if this is the right approach. Would love to contribute.
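A minimal sketch of the idea in the message above, assuming the timeout is read from the environment variable proposed there. `PET_RDZV_JOIN_TIMEOUT` is the name suggested in the message, not something confirmed to be read by torch itself, and the helper `build_rdzv_configs` is hypothetical. The resulting dict is what would be passed as `rdzv_configs` when constructing the `LaunchConfig`.

```python
import os


def build_rdzv_configs(default_join_timeout: int = 600) -> dict:
    """Build the rendezvous config dict, letting an env var override join_timeout.

    The env var name comes from the proposal in the thread (an assumption,
    not an official torch setting). The value is in seconds.
    """
    raw = os.getenv("PET_RDZV_JOIN_TIMEOUT", str(default_join_timeout))
    join_timeout = int(raw)
    if join_timeout <= 0:
        raise ValueError(f"join_timeout must be positive, got {join_timeout}")
    return {"join_timeout": join_timeout}


# Intended use (not executed here, requires torch):
#   from torch.distributed.launcher.api import LaunchConfig
#   config = LaunchConfig(
#       min_nodes=..., max_nodes=..., nproc_per_node=...,
#       rdzv_configs=build_rdzv_configs(),
#   )
```

With the variable unset this keeps the 600s default; setting it to e.g. `1800` would give slow-starting workers 30 minutes to join.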

cool-lifeguard-49380

06/30/2023, 5:59 PM

Could you please check whether the timeout arg for `torch.distributed.init_process_group()` allows changing this (docs)?

cool-lifeguard-49380

06/30/2023, 6:01 PM

"It happens when some workers started running while others are waiting for resources to be available."

Yes, this is expected. Because this is very annoying we actually configured the kubeflow training operator to use the scheduling plugins scheduler to do gang scheduling. This has the effect that the pods only start if all of them can start.

cool-lifeguard-49380

06/30/2023, 6:02 PM

Your approach looks like a good solution if `torch.distributed.init_process_group(timeout=)` doesn't do the trick.

cool-lifeguard-49380

06/30/2023, 6:02 PM

If you would like to use gang scheduling, I can help you get this to work.

shy-accountant-549

06/30/2023, 6:04 PM

yeah the timeout in `torch.distributed.init_process_group` is different (defaults to 30mins instead of 10mins)

shy-accountant-549

06/30/2023, 6:05 PM

I like the idea of gang scheduling. do you have a working example?

cool-lifeguard-49380

06/30/2023, 6:10 PM

• You need to install this helm chart. Use VERSION='v0.24.9' because they changed the api version in a CRD and the kubeflow training operator hasn't been updated yet
• The cmd of the kubeflow training operator deployment needs to be modified with this command:

/manager --gang-scheduler-name=scheduler-plugins

• Your flyte tasks need this scheduler name:

schedulerName: scheduler-plugins-scheduler

cool-lifeguard-49380

06/30/2023, 6:12 PM

In case you want to go for this, I'm happy to help get this to work/answer questions. Especially when we are close to our GPU quotas this helps a lot, because it often happens to us that our quotas allow for e.g. 18 GPUs and we have two 16-worker pytorch tasks. If both of them take 8 GPUs, neither of them can start. With gang scheduling, at least 1 of them can start.
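The quota example above can be worked through with a toy model. This is only an illustration under simplifying assumptions (one GPU per worker, pods admitted round-robin across tasks); the function names are made up and it does not model the real Kubernetes scheduler or the scheduler-plugins implementation.

```python
from typing import List


def schedule_pod_by_pod(quota: int, jobs: List[int]) -> List[int]:
    """Admit one pod at a time, round-robin across jobs, until the quota is used up."""
    running = [0] * len(jobs)
    free = quota
    progress = True
    while free > 0 and progress:
        progress = False
        for i, needed in enumerate(jobs):
            if free > 0 and running[i] < needed:
                running[i] += 1
                free -= 1
                progress = True
    return running


def schedule_gang(quota: int, jobs: List[int]) -> List[int]:
    """Admit a job only if all of its pods fit into the remaining quota."""
    running = []
    free = quota
    for needed in jobs:
        if needed <= free:
            running.append(needed)
            free -= needed
        else:
            running.append(0)
    return running


def complete_jobs(running: List[int], jobs: List[int]) -> int:
    """Count jobs whose full set of workers is running, i.e. that can rendezvous."""
    return sum(1 for got, needed in zip(running, jobs) if got == needed)


jobs = [16, 16]  # two tasks with 16 workers each
quota = 18       # GPUs allowed by the quota

print(schedule_pod_by_pod(quota, jobs))  # each task gets part of the quota
print(schedule_gang(quota, jobs))        # first task gets all 16, second waits
```

In this model, pod-by-pod admission splits the 18 GPUs between the two tasks so that neither reaches 16 workers, and both would eventually hit the rendezvous timeout. Gang admission lets exactly one task start with all of its workers.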

cool-lifeguard-49380

06/30/2023, 6:12 PM

This is an alternative to your solution though, yours would work as well I guess.

cool-lifeguard-49380

06/30/2023, 6:12 PM

I assume you want to add the timeout here.

cool-lifeguard-49380

06/30/2023, 6:13 PM

And expose to the user here. For me the main question would be: can we manage to expose this to the user via the dataclass but avoid sending it to flytepropeller/adding to flyteidl?
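One way to picture the question above, as a hedged sketch only: the class `ElasticSketch` and its methods are hypothetical stand-ins, not the actual flytekit dataclass or its serialization code. The idea is that `rdzv_configs` lives on the user-facing dataclass and is consumed when launching locally inside the task, but is left out of whatever gets serialized for the backend.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class ElasticSketch:
    """Hypothetical user-facing task config (not the real flytekit class)."""

    nnodes: int = 1
    nproc_per_node: int = 1
    # User-tunable rendezvous settings, e.g. {"join_timeout": 1800}.
    rdzv_configs: Dict[str, Any] = field(default_factory=dict)

    def backend_fields(self) -> Dict[str, Any]:
        """Fields that would be sent to the backend; rdzv_configs is omitted."""
        data = asdict(self)
        data.pop("rdzv_configs")
        return data

    def launcher_kwargs(self) -> Dict[str, Any]:
        """Extra keyword arguments that would be passed to the launch config at runtime."""
        return {"rdzv_configs": dict(self.rdzv_configs)}


cfg = ElasticSketch(nnodes=2, nproc_per_node=8, rdzv_configs={"join_timeout": 1800})
print(cfg.backend_fields())
print(cfg.launcher_kwargs())
```

Keeping the two views separate means the setting would only need to exist on the Python side, which is the property asked about in the message above.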

cool-lifeguard-49380

06/30/2023, 6:14 PM

"would love to contribute"

If you want to explore this, I'm happy to help/spar.

cool-lifeguard-49380

06/30/2023, 6:14 PM

Going to sign off though now, 8pm here 🙂 have a nice weekend

shy-accountant-549

06/30/2023, 6:15 PM

thank you

shy-accountant-549

06/30/2023, 11:29 PM

gang-scheduling seems to be working as expected. Thanks! @cool-lifeguard-49380

broad-monitor-993

07/01/2023, 1:14 AM

For the record, I believe this is the error I was seeing @cool-lifeguard-49380. @shy-accountant-549 what does your Elastic config look like?

freezing-airport-6809

07/01/2023, 9:00 AM

@broad-monitor-993 you need to use a kubernetes scheduler that can launch a gang

cool-lifeguard-49380

07/03/2023, 7:00 AM

Exposing something like

rdzv_configs={'join_timeout': int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))},

as suggested by @shy-accountant-549, to the user via the task_config is still reasonable I’d say. Not everyone wants to run a different scheduler in order to do distributed training.

cool-lifeguard-49380

07/03/2023, 7:01 AM

@shy-accountant-549 would you like to work on this with my help? Otherwise I’m also happy to pick this up

shy-accountant-549

07/03/2023, 1:24 PM

Yeah I would love to work on it. I am on vacation this week, will have more bandwidth next week.

shy-accountant-549

07/03/2023, 1:41 PM

@broad-monitor-993

@flytekit.task(
    task_config=kfpytorch.Elastic(nnodes=NNODES, nproc_per_node=NPROC_PER_NODE),
    cache=CACHE,
    cache_version=CACHE_VERSION,
    requests=flytekit.Resources(gpu=GPU, cpu=CPU, mem=MEM),
)

cool-lifeguard-49380

07/03/2023, 2:57 PM

Cool, wish you a nice vacation then! Ping me next week when you are back and maybe we can have a short call to discuss how to implement this?

cool-lifeguard-49380

07/12/2023, 9:41 AM

Hey @shy-accountant-549, hope you had a nice vacation! Do you have a few min to talk about this this week or next?

shy-accountant-549

07/12/2023, 3:58 PM

sure. Are you available tomorrow or Friday 2-4pm your time?

cool-lifeguard-49380

07/13/2023, 7:03 AM

Yes, today and tomorrow both work
