Ketan (kumare3)
pickle instead of cloudpickle
this works
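from flytekit import task
import cloudpickle

@task(cache=True, cache_version="1.0")
def foo(i: int) -> int:
    return i

foo(i=10)  # call the task locally
# Round-trip the task object through cloudpickle, then call the restored copy.
p = cloudpickle.dumps(foo)
f = cloudpickle.loads(p)
f(i=10)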
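For context on why plain pickle comes up as a problem here: the standard library pickles functions by reference (module plus qualified name), so any callable that cannot be looked up that way fails to serialize, while cloudpickle serializes such callables by value. The following is a minimal stdlib-only sketch of that limitation; `make_adder` and `can_pickle` are hypothetical helper names used only for illustration and are not part of flytekit or the snippet above.

```python
import math
import pickle


def make_adder(n: int):
    # Returns a closure. It has no importable module-level name,
    # so stdlib pickle cannot serialize it by reference.
    def add(i: int) -> int:
        return i + n

    return add


def can_pickle(obj) -> bool:
    # Hypothetical helper: report whether stdlib pickle accepts the object.
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


print(can_pickle(math.sqrt))      # importable by name
print(can_pickle(make_adder(1)))  # local closure
```

A function that is importable by name round-trips fine, while the closure does not. That is the gap cloudpickle (or dill) is meant to cover.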
Ketan (kumare3)
dill or even cloudpickle
Ketan (kumare3)
Ketan (kumare3)
cloudpickle for tasks works
Ketan (kumare3)
Ketan (kumare3)
Ketan (kumare3)
Fabio Grätz
04/04/2023, 7:03 AM
import multiprocessing

def start_processes(fn, args=(), nprocs=1, join=True, daemon=False, start_method='spawn'):
    mp = multiprocessing.get_context(start_method)
    ...
    process.start()
Or do you mean using my trick but with cloudpickle?
Ketan (kumare3)
Ketan (kumare3)
Fabio Grätz
04/04/2023, 4:31 PM
Ketan (kumare3)
Ketan (kumare3)
Ketan (kumare3)
Fabio Grätz
04/05/2023, 7:27 AM
Ketan (kumare3)
Ketan (kumare3)
Ketan (kumare3)
Fabio Grätz
04/13/2023, 6:06 PM
torch.distributed.init_process_group(), see here. Back then I just created a Kubeflow PyTorchJob to run it, which worked. The image needed nvidia-cuda-toolkit. To summarize, as of ~1.5 years ago I think it would already have been supported.
Fabio Grätz
04/17/2023, 8:00 AM
nnodes=1 in a single pod, and with nnodes>1 with the pytorch operator.
I think we could try with alpaca now 🦙
The problems with rendezvous flakiness I mentioned in the call on Thursday were actually related to the network config on my notebook (no IPv6 enabled).
[W socket.cpp:601] [c10d] The IPv6 network addresses of (1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa, 49651) cannot be retrieved (gai error: 8 - nodename nor servname provided, or not known).
I have one question about the execute method I copied from PythonFunctionTask: We don’t need the else case here for dynamic, even though the original docstring hints one should implement it as well, right?
Ketan (kumare3)
Ketan (kumare3)
Ketan (kumare3)
Fabio Grätz
04/20/2023, 9:17 AM
Niels Bantilan
04/20/2023, 12:54 PM
facebook/opt-125m, currently trying to get to work on a pre-existing llama model on huggingface
Niels Bantilan
04/20/2023, 12:55 PM
Ketan (kumare3)
Niels Bantilan
04/20/2023, 2:02 PM
Fabio Grätz
04/20/2023, 4:17 PM
torchrun allows the user to set --nnodes, which could e.g. be 2 but also "1:2", which means min 1, max 2. Currently this is what our new task_config=Elastic() exposes as well.
The Kubeflow PyTorchJob allows setting minReplicas, maxReplicas (which by default are both None), and replicas (see here). In theory you could say min 2, max 4, replicas 3 (without going into how much sense this makes).
If a user specifies 2:3, we currently set min to 2 and both max and replicas to 3.
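The mapping described here could be sketched roughly as follows. This is an illustration only, not the plugin's actual code: `parse_nnodes` and `ReplicaSpec` are hypothetical names, and treating a single fixed value such as 2 as min = max = replicas = 2 is an assumption, since only the "2:3" case is spelled out above.

```python
from typing import NamedTuple, Union


class ReplicaSpec(NamedTuple):
    # Hypothetical container mirroring the three PyTorchJob settings.
    min_replicas: int
    max_replicas: int
    replicas: int


def parse_nnodes(nnodes: Union[int, str]) -> ReplicaSpec:
    """Map a torchrun-style nnodes value (2 or "2:3") to min/max/replicas."""
    parts = str(nnodes).split(":")
    if len(parts) == 1:
        # Assumption: a fixed node count means min == max == replicas.
        lo = hi = int(parts[0])
    elif len(parts) == 2:
        lo, hi = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"expected N or MIN:MAX, got {nnodes!r}")
    if lo < 1 or hi < lo:
        raise ValueError(f"invalid node range: {nnodes!r}")
    # As described above: replicas follows the maximum.
    return ReplicaSpec(min_replicas=lo, max_replicas=hi, replicas=hi)


print(parse_nnodes("2:3"))
```

With a single nnodes value, a combination such as min 2, max 4, replicas 3 cannot be expressed at all, which is the trade-off behind the question.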
To summarize: Should we expose nnodes like torchrun, or min_replicas, max_replicas, and replicas like the PyTorchJob to the user?
Fabio Grätz
04/23/2023, 12:11 PM