# torch-elastic
Still draft PRs because I will add more tests and docs:
• https://github.com/flyteorg/flytekit/pull/1583
• https://github.com/flyteorg/flyteplugins/pull/343
• https://github.com/flyteorg/flyteidl/pull/394
But torch elastic task now works for me when executing locally, with `nnodes=1` in a single pod, and with `nnodes>1` with the pytorch operator. I think we could try with alpaca now 🦙
The problems with rendezvous flakiness I mentioned in the call on Thursday were actually related to network config on my notebook (no IPv6 enabled).
[W socket.cpp:601] [c10d] The IPv6 network addresses of (1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa, 49651) cannot be retrieved (gai error: 8 - nodename nor servname provided, or not known).
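In case someone else runs into the same warning: a rough way to check whether a host name resolves over IPv6 on your machine, using only the Python standard library. This is just a sketch; `ipv6_resolves` is a made-up helper name, and it is not what c10d does internally, it only mirrors the kind of `getaddrinfo` lookup that fails in the log line above.

```python
import socket


def ipv6_resolves(host: str) -> bool:
    """Return True if `host` can be resolved to at least one IPv6 address.

    Hypothetical helper for local debugging only.
    """
    if not socket.has_ipv6:
        # This Python build has no IPv6 support at all.
        return False
    try:
        # Ask the resolver for IPv6 (AF_INET6) results only.
        results = socket.getaddrinfo(host, None, socket.AF_INET6)
    except socket.gaierror:
        # Same family of error as the "gai error" in the c10d warning.
        return False
    return len(results) > 0


if __name__ == "__main__":
    name = socket.gethostname()
    print(f"{name} resolves over IPv6: {ipv6_resolves(name)}")
```

If this prints `False` for your own host name, the rendezvous warning is probably a local network config issue rather than something in the plugin.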
I have one question about the `execute` method I copied from `PythonFunctionTask`: We don’t need the else case here for dynamic, even though the original docstring hints that one should implement it as well, right?