Ketan (kumare3)
04/04/2023, 1:41 PM
Ketan (kumare3)
04/04/2023, 1:41 PM
Fabio Grätz
04/04/2023, 4:31 PM
Ketan (kumare3)
04/04/2023, 5:56 PM
Ketan (kumare3)
04/05/2023, 12:38 AM
Ketan (kumare3)
04/05/2023, 12:39 AM
Fabio Grätz
04/05/2023, 7:27 AM
Ketan (kumare3)
04/10/2023, 4:00 AM
Ketan (kumare3)
04/12/2023, 4:07 PM
Ketan (kumare3)
04/12/2023, 4:07 PM
Fabio Grätz
04/13/2023, 6:06 PM
`torch.distributed.init_process_group()`, see here. Back then I just created a kubeflow PytorchJob to run it, which worked. The image needed `nvidia-cuda-toolkit`. To summarize, as of ~1.5 years ago I think it would already have been supported.
Fabio Grätz
04/17/2023, 8:00 AM
`nnodes=1` in a single pod, and with `nnodes>1` with the pytorch operator.
I think we could try with alpaca now 🦙
The problems with rendezvous flakiness I mentioned in the call on Thursday were actually related to network config on my notebook (no ipv6 enabled).
[W socket.cpp:601] [c10d] The IPv6 network addresses of (1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa, 49651) cannot be retrieved (gai error: 8 - nodename nor servname provided, or not known).
I have one question about the `execute` method I copied from `PythonFunctionTask`: We don't need the else case here for dynamic even though the original docstring hints one should implement it as well, right?
Ketan (kumare3)
04/20/2023, 4:11 AM
Ketan (kumare3)
04/20/2023, 4:12 AM
Ketan (kumare3)
04/20/2023, 4:12 AM
Fabio Grätz
04/20/2023, 9:17 AM
Niels Bantilan
04/20/2023, 12:54 PM
`facebook/opt-125m`, currently trying to get it to work on a pre-existing llama model on huggingface
Niels Bantilan
04/20/2023, 12:55 PM
Ketan (kumare3)
04/20/2023, 1:58 PM
Niels Bantilan
04/20/2023, 2:02 PM
Fabio Grätz
04/20/2023, 4:17 PM
`torchrun` allows the user to set `--nnodes`, which could e.g. be `2` but also be `"1:2"`, which means min 1, max 2. Currently this is what our new `task_config=Elastic()` exposes as well.
The kubeflow PytorchJob allows setting `minReplicas`, `maxReplicas` (which by default are both None), and `replicas` (see here). In theory you could say min 2, max 4, replicas 3 (without going into how much sense this makes).
If a user specifies `2:3`, we currently set min to 2 and max and replicas to 3.
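For illustration only, the `nnodes` to replicas mapping described above could be sketched roughly like this. The names `ReplicaSpec` and `nnodes_to_replicas` are hypothetical and not taken from the plugin code. Treating a fixed value such as `2` as min = max = replicas is an assumption; the thread only states the behaviour for a range like `2:3`.

```python
from typing import NamedTuple, Union


class ReplicaSpec(NamedTuple):
    """Hypothetical container for the three PytorchJob-style settings."""

    min_replicas: int
    max_replicas: int
    replicas: int


def nnodes_to_replicas(nnodes: Union[int, str]) -> ReplicaSpec:
    """Translate a torchrun-style nnodes value ("2" or "2:3") into replica settings."""
    text = str(nnodes)
    if ":" in text:
        # Range form "MIN:MAX", e.g. "2:3".
        low_text, high_text = text.split(":", 1)
        low, high = int(low_text), int(high_text)
    else:
        # Fixed form, e.g. 2. Assumption: min and max both equal the fixed value.
        low = high = int(text)
    if low < 1 or high < low:
        raise ValueError(f"invalid nnodes value: {nnodes!r}")
    # As described in the thread: min stays as given, max and replicas both take the upper bound.
    return ReplicaSpec(min_replicas=low, max_replicas=high, replicas=high)


if __name__ == "__main__":
    print(nnodes_to_replicas("2:3"))
    print(nnodes_to_replicas(2))
```

With this shape, a user only ever writes the torchrun-style value, and the three replica settings are derived from it. Exposing `min_replicas`, `max_replicas`, and `replicas` directly would instead allow combinations such as min 2, max 4, replicas 3.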
To summarize: Should we expose `nnodes` like torchrun or `min_replicas`, `max_replicas`, and `replicas` like the pytorchjob to the user?
Fabio Grätz
04/23/2023, 12:11 PM
Fabio Grätz
04/23/2023, 12:12 PM
Fabio Grätz
04/23/2023, 12:13 PM
Ketan (kumare3)
04/23/2023, 6:13 PM
Ketan (kumare3)
04/23/2023, 6:13 PM
Ketan (kumare3)
04/23/2023, 6:14 PM
Ketan (kumare3)
04/24/2023, 4:36 AM
Ketan (kumare3)
04/24/2023, 4:36 AM
Fabio Grätz
05/03/2023, 6:45 AM