# torch-elastic

Nan Qin

06/30/2023, 5:37 PM
we are getting RendezvousTimeoutError when launching DDP on EKS. It happens when some workers have started running while others are still waiting for resources to become available. After investigating the logs and the PyTorch code, we believe it is due to the join_timeout parameter, which defaults to 600s, as the RendezvousTimeoutError shows up exactly 600s after the pod starts running. not sure what the best workaround is, but it seems that adding something like
rdzv_configs={'join_timeout': int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))},
to the LaunchConfig could probably solve it. Please lmk if this is the right approach. would love to contribute
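
A minimal sketch of the workaround proposed in this message, assuming the rendezvous options are collected into a plain dict before being handed to LaunchConfig. The helper name build_rdzv_configs and the positive-value check are assumptions for illustration; PET_RDZV_JOIN_TIMEOUT and the 600s default come from the message. torch is not imported, so the LaunchConfig call appears only as a comment.

```python
import os
from typing import Any, Dict, Mapping, Optional

# Default that the thread reports for join_timeout, in seconds.
DEFAULT_JOIN_TIMEOUT_S = 600


def build_rdzv_configs(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Hypothetical helper: read the join timeout (seconds) from the environment."""
    env = os.environ if env is None else env
    raw = env.get("PET_RDZV_JOIN_TIMEOUT", str(DEFAULT_JOIN_TIMEOUT_S))
    join_timeout = int(raw)  # raises ValueError on non-numeric input
    if join_timeout <= 0:
        raise ValueError(f"join_timeout must be positive, got {join_timeout}")
    return {"join_timeout": join_timeout}


# Assumed usage (not executed here):
#   LaunchConfig(..., rdzv_configs=build_rdzv_configs())
```

Setting PET_RDZV_JOIN_TIMEOUT=1800 on the pods would then give late workers 30 minutes to join instead of 10.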

Fabio Grätz

06/30/2023, 5:59 PM
Could you please check whether the timeout arg for torch.distributed.init_process_group() allows changing this (docs)?
"It happens when some workers started running while others are waiting for resources to be available."
Yes, this is expected. Because this is very annoying, we actually configured the kubeflow training operator to use the scheduler-plugins scheduler to do gang scheduling. This has the effect that the pods only start if all of them can start.
Your approach looks like a good solution if torch.distributed.init_process_group(timeout=) doesn’t do the trick.
If you would like to use gang scheduling, I can help you get this to work.

Nan Qin

06/30/2023, 6:04 PM
yeah the timeout in torch.distributed.init_process_group is different (defaults to 30 mins instead of 10 mins)
I like the idea of gang scheduling. do you have a working example?

Fabio Grätz

06/30/2023, 6:10 PM
• You need to install this helm chart. Use VERSION='v0.24.9' because they changed the API version in a CRD and the kubeflow training operator hasn’t been updated yet.
• The cmd of the kubeflow training operator deployment needs to be modified with this command:
/manager --gang-scheduler-name=scheduler-plugins
• Your flyte tasks need this scheduler name:
schedulerName: scheduler-plugins-schedule
In case you want to go for this, I’m happy to help get this to work or answer questions. Especially when we are close to our GPU quotas this helps a lot, because it often happens to us that our quotas allow for e.g. 18 GPUs and we have two 16-worker pytorch tasks. If both of them take 8 GPUs, neither of them can start. With gang scheduling, at least 1 of them can start.
This is an alternative to your solution though, yours would work as well I guess.
I assume you want to add the timeout here.
And expose to the user here. For me the main question would be: can we manage to expose this to the user via the dataclass but avoid sending it to flytepropeller/adding to flyteidl?
"would love to contribute"
If you want to explore this, I’m happy to help/spar.
Going to sign off though now, 8pm here 🙂 have a nice weekend
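
The GPU quota example in this message (18 GPUs, two 16-worker tasks) can be reproduced with a toy simulation. This is only a sketch under stated assumptions: the function names are hypothetical, the round-robin interleaving is one possible ordering, and neither function reflects how the Kubernetes scheduler or scheduler-plugins is actually implemented.

```python
from typing import Dict, List


def schedule_pod_by_pod(capacity: int, jobs: Dict[str, int]) -> Dict[str, int]:
    """Place one pod at a time, round-robin across jobs, until the quota is used up."""
    placed = {name: 0 for name in jobs}
    free = capacity
    progress = True
    while free > 0 and progress:
        progress = False
        for name, wanted in jobs.items():
            if free > 0 and placed[name] < wanted:
                placed[name] += 1
                free -= 1
                progress = True
    return placed


def schedule_gang(capacity: int, jobs: Dict[str, int]) -> Dict[str, int]:
    """Admit a job only if all of its pods fit at once (all-or-nothing)."""
    placed: Dict[str, int] = {}
    free = capacity
    for name, wanted in jobs.items():
        if wanted <= free:
            placed[name] = wanted
            free -= wanted
        else:
            placed[name] = 0
    return placed


def runnable(placed: Dict[str, int], jobs: Dict[str, int]) -> List[str]:
    """A job can start training only when every one of its workers is placed."""
    return [name for name, wanted in jobs.items() if placed[name] == wanted]


jobs = {"task_a": 16, "task_b": 16}
print(runnable(schedule_pod_by_pod(18, jobs), jobs))  # no task has all workers
print(runnable(schedule_gang(18, jobs), jobs))        # one task gets all 16
```

With this interleaving each task ends up holding 9 of its 16 pods, so both sit waiting in rendezvous until the join timeout fires, while the all-or-nothing variant lets one task run to completion first.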

Nan Qin

06/30/2023, 6:15 PM
gang-scheduling seems to be working as expected. Thanks! @Fabio Grätz

Niels Bantilan

07/01/2023, 1:14 AM
For the record, I believe this is the error I was seeing @Fabio Grätz. @Nan Qin what does your Elastic config look like?

Ketan (kumare3)

07/01/2023, 9:00 AM
@Niels Bantilan you need to use a Kubernetes scheduler that supports launching a gang

Fabio Grätz

07/03/2023, 7:00 AM
Exposing something like
rdzv_configs={'join_timeout': int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))},
as suggested by @Nan Qin, to the user via the task_config is still reasonable I’d say. Not everyone wants to run a different scheduler in order to do distributed training.
@Nan Qin would you like to work on this with my help? Otherwise I’m also happy to pick this up
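
One way the idea could look, as a rough sketch only: the rendezvous options live on a user-facing dataclass but are left out of whatever gets serialized for the backend, which is the question raised earlier in the thread. ElasticSketch, launch_kwargs and backend_spec are hypothetical names for illustration and are not the flytekit plugin's API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ElasticSketch:
    """Hypothetical stand-in for the task config; not the flytekit class."""

    nnodes: int = 1
    nproc_per_node: int = 1
    # Extra rendezvous options, e.g. {"join_timeout": 1800} (seconds).
    rdzv_configs: Dict[str, Any] = field(default_factory=dict)

    def launch_kwargs(self) -> Dict[str, Any]:
        """Options used only inside the worker when building the launcher config."""
        return {
            "nproc_per_node": self.nproc_per_node,
            "rdzv_configs": dict(self.rdzv_configs),
        }

    def backend_spec(self) -> Dict[str, Any]:
        """Fields that would be serialized for the backend; rdzv_configs is left out."""
        return {"nnodes": self.nnodes, "nproc_per_node": self.nproc_per_node}


cfg = ElasticSketch(nnodes=2, nproc_per_node=8, rdzv_configs={"join_timeout": 1800})
print(cfg.launch_kwargs())
print(cfg.backend_spec())
```

Keeping the option on the Python side only would mean no change to the serialized task spec, at the cost of the backend not knowing about the timeout.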

Nan Qin

07/03/2023, 1:24 PM
Yeah I would love to work on it. I am on vacation this week, will have more bandwidth next week.
@Niels Bantilan
@flytekit.task(
    task_config=kfpytorch.Elastic(nnodes=NNODES, nproc_per_node=NPROC_PER_NODE),
    cache=CACHE,
    cache_version=CACHE_VERSION,
    requests=flytekit.Resources(gpu=GPU, cpu=CPU, mem=MEM),
)

Fabio Grätz

07/03/2023, 2:57 PM
Cool, wish you a nice vacation then! Ping me next week when you are back and maybe we can have a short call to discuss how to implement this?
Hey @Nan Qin, hope you had a nice vacation! Do you have a few min to talk about this this week or next?

Nan Qin

07/12/2023, 3:58 PM
sure. Are you available tomorrow or Friday 2-4pm your time?

Fabio Grätz

07/13/2023, 7:03 AM
Yes, today and tomorrow both work