Fabio Grätz
04/04/2023, 4:31 PM
Ketan (kumare3)
Ketan (kumare3)
Ketan (kumare3)
Fabio Grätz
04/05/2023, 7:27 AM
Ketan (kumare3)
Ketan (kumare3)
Ketan (kumare3)
Fabio Grätz
04/13/2023, 6:06 PM
torch.distributed.init_process_group(), see here. Back then I just created a kubeflow PytorchJob to run it, which worked. The image needed nvidia-cuda-toolkit. To summarize, as of ~1.5 years ago I think it would already have been supported.
Fabio Grätz
04/17/2023, 8:00 AM
nnodes=1 in a single pod, and with nnodes>1 with the pytorch operator.
I think we could try with alpaca now 🦙
The problems with rendezvous flakiness I mentioned in the call on Thursday were actually related to the network config on my notebook (no IPv6 enabled):
[W socket.cpp:601] [c10d] The IPv6 network addresses of (1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa, 49651) cannot be retrieved (gai error: 8 - nodename nor servname provided, or not known).
I have one question about the execute method I copied from PythonFunctionTask: We don't need the else case here for dynamic even though the original docstring hints one should implement it as well, right?
Ketan (kumare3)
Ketan (kumare3)
Ketan (kumare3)
Fabio Grätz
04/20/2023, 9:17 AM
Niels Bantilan
04/20/2023, 12:54 PM
facebook/opt-125m, currently trying to get it to work on a pre-existing llama model on huggingface
Niels Bantilan
04/20/2023, 12:55 PM
Ketan (kumare3)
Niels Bantilan
04/20/2023, 2:02 PM
Fabio Grätz
04/20/2023, 4:17 PM
torchrun allows the user to set --nnodes, which could e.g. be 2 but also "1:2", which means min 1, max 2. Currently this is what our new task_config=Elastic() exposes as well.
The kubeflow PytorchJob allows setting minReplicas, maxReplicas (which by default are both None), and replicas (see here). In theory you could say min 2, max 4, replicas 3 (without going into how much sense this makes).
If a user specifies 2:3, we currently set min to 2 and both max and replicas to 3.
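Editorial sketch of the mapping described in this message: a torchrun-style nnodes value is turned into min, max, and replicas, with "2:3" giving min 2 and max and replicas 3. The names ReplicaSpec and parse_nnodes are hypothetical and not taken from flytekit or the pytorch operator. How a plain integer is handled here (min = max = replicas) is an assumption, since the message only covers the "min:max" case.

```python
from typing import NamedTuple, Union


class ReplicaSpec(NamedTuple):
    # Hypothetical container; the fields mirror the options discussed above.
    min_replicas: int
    max_replicas: int
    replicas: int


def parse_nnodes(nnodes: Union[int, str]) -> ReplicaSpec:
    """Map a torchrun-style nnodes value ("2" or "2:3") to a ReplicaSpec."""
    parts = str(nnodes).split(":")
    if len(parts) == 1:
        # Assumption: a fixed node count means min == max == replicas.
        low = high = int(parts[0])
    elif len(parts) == 2:
        low, high = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"expected 'N' or 'MIN:MAX', got {nnodes!r}")
    if low < 1 or high < low:
        raise ValueError(f"invalid node range: {nnodes!r}")
    # As described in the message: replicas is set to the upper bound.
    return ReplicaSpec(min_replicas=low, max_replicas=high, replicas=high)


if __name__ == "__main__":
    print(parse_nnodes("2:3"))
    print(parse_nnodes(2))
```

Exposing a single nnodes value keeps the interface close to torchrun, at the cost of not being able to express a replicas count that differs from the maximum.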
To summarize: Should we expose nnodes like torchrun, or min_replicas, max_replicas, and replicas like the PytorchJob, to the user?
Fabio Grätz
04/23/2023, 12:11 PM
Fabio Grätz
04/23/2023, 12:12 PM
Fabio Grätz
04/23/2023, 12:13 PM
Ketan (kumare3)
Ketan (kumare3)
Ketan (kumare3)
Ketan (kumare3)
Ketan (kumare3)
Fabio Grätz
05/03/2023, 6:45 AM
Fabio Grätz
06/19/2023, 7:25 AM
Nan Qin
06/30/2023, 5:37 PM
RendezvousTimeoutError when launching DDP on EKS. It happens when some workers have started running while others are still waiting for resources to become available. After investigating the logs and the PyTorch code, we believe it is due to the join_timeout parameter, which defaults to 600s, as the RendezvousTimeoutError shows up exactly 600s after the pod starts running.
Not sure what the best workaround is, but it seems that adding something like rdzv_configs={'join_timeout': int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))}, to the LaunchConfig could probably solve it.
Please lmk if this is the right approach. Would love to contribute.
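Editorial sketch of the workaround proposed in this message: read the join timeout from the PET_RDZV_JOIN_TIMEOUT environment variable and fall back to the 600s default mentioned above. The helper build_rdzv_configs is hypothetical; whether any other component reads this variable, and whether this is the right fix, is left open as in the message. The returned dict is meant to be passed as the rdzv_configs argument of LaunchConfig.

```python
import os
from typing import Any, Dict, Mapping, Optional

DEFAULT_JOIN_TIMEOUT_S = 600  # default join_timeout cited in the message above


def build_rdzv_configs(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Build rendezvous settings, letting an env var override join_timeout."""
    env = os.environ if env is None else env
    raw = env.get("PET_RDZV_JOIN_TIMEOUT", str(DEFAULT_JOIN_TIMEOUT_S))
    timeout = int(raw)  # raises ValueError on non-numeric input
    if timeout <= 0:
        raise ValueError(f"join timeout must be positive, got {timeout}")
    return {"join_timeout": timeout}


if __name__ == "__main__":
    # Intended use (not executed here): LaunchConfig(..., rdzv_configs=build_rdzv_configs())
    print(build_rdzv_configs())
```

Setting the variable on the pod to a value larger than the expected scheduling delay would then give late workers more time to join before the rendezvous fails.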