flyte-org #torch-elastic

I haven’t used it much, but ~1.5 years ago “I got an example to train with it” on k8s (which they don’t explicitly mention as supported in the docs). Ultimately, under the hood it also just uses

torch.distributed.init_process_group()

, see here. Back then I just created a kubeflow PytorchJob to run it, which worked. The image needed

nvidia-cuda-toolkit

. To summarize, as of ~1.5 years ago I think it would already have been supported.

Fabio Grätz

04/17/2023, 8:00 AM

Still draft PRs because I will add more tests and docs:
• https://github.com/flyteorg/flytekit/pull/1583
• https://github.com/flyteorg/flyteplugins/pull/343
• https://github.com/flyteorg/flyteidl/pull/394
But torch elastic task now works for me when executing locally, with

nnodes=1

in a single pod, and with

nnodes>1

with the pytorch operator. I think we could try with alpaca now 🦙 The problems with rendezvous flakiness I mentioned in the call on Thursday were actually related to the network config on my notebook (IPv6 not enabled).

[W socket.cpp:601] [c10d] The IPv6 network addresses of (1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa, 49651) cannot be retrieved (gai error: 8 - nodename nor servname provided, or not known).
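A rough way to check for the condition behind the warning above is to test whether the local machine can open an IPv6 loopback socket at all. This is only a sketch using the Python standard library; it does not reproduce what c10d does internally, and the helper name `ipv6_loopback_usable` is made up for illustration.

```python
import socket


def ipv6_loopback_usable() -> bool:
    """Return True if an IPv6 socket can be bound to the loopback address.

    Hypothetical helper: a coarse local check, not the lookup c10d performs.
    """
    # Python was built without IPv6 support at all.
    if not socket.has_ipv6:
        return False
    try:
        # Port 0 lets the OS pick any free port; the socket is closed on exit.
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as sock:
            sock.bind(("::1", 0))
        return True
    except OSError:
        # Raised when IPv6 is disabled or the loopback address is unavailable.
        return False


if __name__ == "__main__":
    print("IPv6 loopback usable:", ipv6_loopback_usable())
```

If this prints False on the machine where the rendezvous is flaky, the network configuration is a plausible place to look first.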

I have one question about the

execute

method I copied from

PythonFunctionTask

: We don’t need the else case here for dynamic even though the original docstring hints one should implement it as well, right?

Ketan (kumare3)

04/20/2023, 4:11 AM

cc @Niels Bantilan did you end up trying alpaca?

Ketan (kumare3)

04/20/2023, 4:12 AM

@Fabio Grätz what do you think, should we merge the idl PR?

Ketan (kumare3)

04/20/2023, 4:12 AM

how can i help take them over the finish line?

Fabio Grätz

04/20/2023, 9:17 AM

I will add tests and documentation on the weekend. Then I’ll request PR reviews. You could help by testing it; so far I have only run minimal working examples (e.g. this one) that don’t do much more than make sure that the process group can be initialized.

Niels Bantilan

04/20/2023, 12:54 PM

the code works, successfully ran the workflow on

facebook/opt-125m

, currently trying to get it to work on a pre-existing llama model on huggingface

Niels Bantilan

04/20/2023, 12:55 PM

also still need to test it on multiple cpus/gpus

Ketan (kumare3)

04/20/2023, 1:58 PM

@Niels Bantilan I can help with multi core example

Niels Bantilan

04/20/2023, 2:02 PM

cool, I just updated our fork/branch with my changes: https://github.com/unionai-oss/stanford_alpaca/tree/flytekit-alpaca

Fabio Grätz

04/20/2023, 4:17 PM

One other thing about which I’m interested in your opinion:

torchrun

allows the user to set

--nnodes

which could e.g. be

but also be

"1:2"

which means min 1 max 2. Currently this is what our new

task_config=Elastic()

exposes as well. The kubeflow PytorchJob allows setting

minReplicas

maxReplicas

(which by default are both None), and

replicas

(see here). In theory you could say min 2, max 4, replicas 3 (without going into how much sense this makes). If a user specifies

2:3

we currently set min to 2, and both max and replicas to 3. To summarize: Should we expose

nnodes

like torchrun or

min_replicas

max_replicas

, and

replicas

like the pytorchjob to the user?
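The mapping described above, from a torchrun-style `nnodes` value to the three PytorchJob fields, can be sketched as follows. This is an illustration, not the plugin's actual code: `ReplicaSpec` and `parse_nnodes` are hypothetical names, and treating a fixed count as min = max = replicas is an assumption (the thread only spells out the range case).

```python
from typing import NamedTuple, Union


class ReplicaSpec(NamedTuple):
    """Hypothetical container for the three PytorchJob replica fields."""

    min_replicas: int
    max_replicas: int
    replicas: int


def parse_nnodes(nnodes: Union[int, str]) -> ReplicaSpec:
    """Map a torchrun-style nnodes value onto min/max/replicas.

    Accepts a fixed count (2 or "2") or a "MIN:MAX" range ("2:3").
    As described in the thread, a range sets replicas to MAX.
    Assumption: a fixed count sets all three fields to that count.
    """
    parts = str(nnodes).split(":")
    if len(parts) == 1:
        low = high = int(parts[0])
    elif len(parts) == 2:
        low, high = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"expected N or MIN:MAX, got {nnodes!r}")
    if low < 1 or high < low:
        raise ValueError(f"invalid node range: {nnodes!r}")
    return ReplicaSpec(min_replicas=low, max_replicas=high, replicas=high)


if __name__ == "__main__":
    print(parse_nnodes("2:3"))
```

Exposing a single `nnodes` value keeps the interface close to torchrun, at the cost of not being able to express a replicas value strictly between min and max.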

Fabio Grätz

04/23/2023, 12:11 PM

Ready for review from my side:

Fabio Grätz

04/23/2023, 12:12 PM

• https://github.com/flyteorg/flyte/issues/3614
• https://github.com/flyteorg/flytekit/pull/1603
• https://github.com/flyteorg/flyteidl/pull/394
• https://github.com/flyteorg/flyteplugins/pull/343
• https://github.com/flyteorg/flytesnacks/pull/987

Fabio Grätz

04/23/2023, 12:13 PM

What does the merge process typically look like when idl is changed? Tests in flytekit and flyteplugins fail since the idl changes are not there yet

Ketan (kumare3)

04/23/2023, 6:13 PM

cc +@Byron Hsu

Ketan (kumare3)

04/23/2023, 6:13 PM

@Byron Hsu we are enabling torch-elastic in flytekit now

Ketan (kumare3)

04/23/2023, 6:14 PM

@Fabio Grätz / @Byron Hsu seems these instructions are no longer valid - https://docs.flyte.org/en/latest/deployment/plugins/k8s/index.html#deployment-plugin-setup-k8s - as we have one training operator now. cc @Kevin Su / @Yuvraj

Ketan (kumare3)

04/24/2023, 4:36 AM

Also, @Fabio Grätz do you folks use - https://github.com/libffcv/ffcv?

Ketan (kumare3)

04/24/2023, 4:36 AM

@Niels Bantilan / @James Sutton / @Evan Sadler

Fabio Grätz

05/03/2023, 6:45 AM

Thanks for finishing the PR and merging 🚀

Fabio Grätz

06/19/2023, 7:25 AM

https://github.com/flyteorg/flytekit/pull/1677 Need feedback on this fix, thx 🙂 Maybe @Kevin Su @Niels Bantilan @Eduardo Apolinario (eapolinario)? No time pressure

Nan Qin

06/30/2023, 5:37 PM

we are getting

RendezvousTimeoutError

when launching ddp on eks. It happens when some workers have started running while others are still waiting for resources to become available. After investigating the logs and pytorch code we believe it is due to the join_timeout parameter, which defaults to 600s, as the

RendezvousTimeoutError

shows up exactly 600s after the pod starts running. Not sure what the best workaround is, but it seems adding something like

rdzv_configs={'join_timeout': int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))},

to the LaunchConfig could probably solve it. Please lmk if this is the right approach. Would love to contribute
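The proposed workaround can be sketched as a small helper that builds the `rdzv_configs` dict from the environment, with a little validation on top of the one-liner above. The helper name `rdzv_configs_from_env` is hypothetical, and `PET_RDZV_JOIN_TIMEOUT` is simply the variable name suggested in the message; whether torch itself reads it is not confirmed here.

```python
import os
from typing import Any, Dict

# Default reported in the message above (seconds).
DEFAULT_JOIN_TIMEOUT_S = 600


def rdzv_configs_from_env(env_var: str = "PET_RDZV_JOIN_TIMEOUT") -> Dict[str, Any]:
    """Build a rendezvous config dict with join_timeout taken from env_var.

    Hypothetical helper: falls back to 600s when the variable is unset and
    rejects values that are not positive integers.
    """
    raw = os.getenv(env_var, str(DEFAULT_JOIN_TIMEOUT_S))
    try:
        timeout = int(raw)
    except ValueError:
        raise ValueError(
            f"{env_var} must be an integer number of seconds, got {raw!r}"
        ) from None
    if timeout <= 0:
        raise ValueError(f"{env_var} must be positive, got {timeout}")
    return {"join_timeout": timeout}


if __name__ == "__main__":
    print(rdzv_configs_from_env())
```

The returned dict would then be passed as `rdzv_configs=...` when constructing the LaunchConfig, as suggested in the message.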