#ask-the-community

Kyle Mylonakis

06/27/2023, 8:31 PM
Hello everyone. I am trying to use Flyte to launch a dynamic workflow of PyTorch Elastic tasks, and I keep running into the following error when using more than a single node:
Workflow[...] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[dn0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [pytorch]: number of worker should be more then 0
Is this something people have seen before? Is there a more appropriate place to ask about this?

Niels Bantilan

06/27/2023, 9:55 PM
Hi Kyle, can you share your task specification?

Kyle Mylonakis

06/27/2023, 10:03 PM
I'm not sure what that is. I am new to using Flyte
Can you be a bit more specific or give an example?

Nan Qin

06/27/2023, 10:29 PM
@Kyle Mylonakis I believe Niels is asking about how the task function is decorated with flytekit.task
@Niels Bantilan I found the flyteagent image is still on 1.6.2b1 although I deployed the 1.7.0 helm chart and flyte-binary-release is on v1.7.0, which is correct. Could it be related?

Niels Bantilan

06/27/2023, 11:22 PM
To rephrase, can you share a snippet of your task/workflow code?

Kyle Mylonakis

06/27/2023, 11:27 PM
It looks roughly something like this:
@flytekit.dynamic(cache=CACHE, cache_version=CACHE_VERSION)
def dynamic_task(
    redacted_args
) -> WORKFLOW_RETURN_TYPE:
    results: WORKFLOW_RETURN_TYPE = {}  # type: ignore
    for hyper_params in [HyperParams(*args) for args in itertools.product(redacted_args lists)]:
        best_paths = train(various_args=various_args)

        key = "some_arg_dependent_string"
        results[key] = best_paths  # type: ignore
    return results
Consider train to be a task with an Elastic config.

Ketan (kumare3)

06/28/2023, 12:41 AM
We are in fact interested in the Elastic config.

Kyle Mylonakis

06/28/2023, 12:45 AM
@flytekit.task(
    task_config=kfpytorch.Elastic(nnodes=NNODES, nproc_per_node=NPROC_PER_NODE),
    cache=CACHE,
    cache_version=CACHE_VERSION,
    requests=flytekit.Resources(gpu=GPU, cpu=CPU, mem=MEM),
)

Nan Qin

06/28/2023, 3:25 AM
@Kyle Mylonakis to confirm, the error only happens when the following two conditions are both true, right? 1. The Elastic task is inside a dynamic task. 2. nnodes > 1

Kyle Mylonakis

06/28/2023, 3:25 AM
I am seeing the same error right now for a single task
or rather a single task inside a workflow of two tasks (not dynamic or map)

Ketan (kumare3)

06/28/2023, 3:26 AM
Hmm does not make sense
What is nnodes?

Kyle Mylonakis

06/28/2023, 3:27 AM
an integer greater than 1
For example it fails with the error on 2

Ketan (kumare3)

06/28/2023, 3:27 AM
When it’s one does it work?

Kyle Mylonakis

06/28/2023, 3:27 AM
it works with 1

Nan Qin

06/28/2023, 3:28 AM
you were able to run nnodes>1 when we have flyte 1.6.2 deployed, right?

Kyle Mylonakis

06/28/2023, 3:28 AM
yes
Should we downgrade?

Ketan (kumare3)

06/28/2023, 4:03 AM
Ohh is that right - I know @Fabio Grätz made a change?
Downgrade flytekit

Kyle Mylonakis

06/28/2023, 10:51 AM
I would love to talk with them in detail about the bug or the change they made.

Ketan (kumare3)

06/28/2023, 2:01 PM
@Fabio Grätz can you tal

Fabio Grätz

06/28/2023, 2:27 PM
Workflow[...] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[dn0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [pytorch]: number of worker should be more then 0
I think this might be related to the recent refactoring of the kubeflow plugin by @Yubo Wang. Are you running a new version of propeller? I think Yubo fixed this here if I’m not mistaken. Can you please also check whether you use a flytekit version that contains this change or not?
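One way to do the flytekit version check is from Python itself. The following is only a minimal sketch: it uses the standard-library importlib.metadata, the helper name installed_version is made up for illustration, and the distribution name flytekitplugins-kfpytorch is an assumption about which plugin package provides kfpytorch.Elastic in your environment. It reports what is installed; it does not tell you which release contains the fix.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to match your environment.
    for name in ("flytekit", "flytekitplugins-kfpytorch"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

Run it inside the same image or virtual environment that executes the task, since the version on a laptop may differ from the one in the task container.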

Ketan (kumare3)

06/28/2023, 2:50 PM
Uh oh.

Kyle Mylonakis

06/28/2023, 3:39 PM
@Nan Qin Can you look into this?

Yubo Wang

06/28/2023, 7:30 PM
Yeah, sorry for the mess. Fabio is correct: the PR that fixes it is not yet included in that version of the Flyte release. If you can, upgrading both flytekit and flytepropeller to the latest master branch should solve the issue.

Nan Qin

06/28/2023, 7:32 PM
does the latest flyte-binary helm chart include the fix?

Yubo Wang

06/28/2023, 7:33 PM
let me check
no, it does not