Hello everyone. I am trying to use Flyte to launch...
# ask-the-community
k
Hello everyone. I am trying to use Flyte to launch a dynamic workflow of PyTorch Elastic tasks and keep running into the following error when using more than a single node:
Workflow[...] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[dn0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [pytorch]: number of worker should be more then 0
Is this something people have seen before? Is there a more appropriate place to ask about this?
n
Hi Kyle, can you share your task specification?
k
I'm not sure what that is. I am new to using Flyte
Can you be a bit more specific or give an example?
n
@Kyle Mylonakis I believe Niels is asking about how the task function is decorated with flytekit.task
@Niels Bantilan I found the flyteagent image is still on 1.6.2b1 although I deployed the 1.7.0 helm chart and flyte-binary-release is on v1.7.0 which is correct. Could it be related?
n
To rephrase, can you share a snippet of your task/workflow code?
k
It looks roughly something like this
@flytekit.dynamic(cache=CACHE, cache_version=CACHE_VERSION)
def dynamic_task(
    redacted_args
) -> WORKFLOW_RETURN_TYPE:
    results: WORKFLOW_RETURN_TYPE = {}  # type: ignore
    for hyper_params in [HyperParams(*args) for args in itertools.product(redacted_args lists)]:
        best_paths = train(various_args=various_args)

        key = "some_arg_dependent_string"
        results[key] = best_paths  # type: ignore
    return results
Consider train a task with an Elastic config
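The loop in the snippet above builds one set of hyperparameters per combination of the input lists. A minimal, standalone sketch of that expansion follows. The HyperParams fields, the key format, and the helper names expand_grid and result_key are hypothetical, since the original code redacts them, and the call to train is replaced by a placeholder string.

```python
import itertools
from typing import Dict, List, NamedTuple


class HyperParams(NamedTuple):
    # Hypothetical fields; the original post redacts the real ones.
    learning_rate: float
    batch_size: int


def expand_grid(learning_rates: List[float], batch_sizes: List[int]) -> List[HyperParams]:
    """Return one HyperParams per combination of the input lists."""
    return [
        HyperParams(*args)
        for args in itertools.product(learning_rates, batch_sizes)
    ]


def result_key(hp: HyperParams) -> str:
    """Build a dictionary key that depends on the hyperparameters (assumed format)."""
    return f"lr={hp.learning_rate}_bs={hp.batch_size}"


grid = expand_grid([0.1, 0.01], [32, 64])

# In the dynamic workflow each entry would hold the output of train(...);
# a placeholder string stands in for it here.
results: Dict[str, str] = {result_key(hp): "placeholder" for hp in grid}
```

Two lists of two values each give four combinations, so a dynamic workflow built this way would launch four train tasks.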
k
We are in fact interested in the Elastic config
k
@flytekit.task(
    task_config=kfpytorch.Elastic(nnodes=NNODES, nproc_per_node=NPROC_PER_NODE),
    cache=CACHE,
    cache_version=CACHE_VERSION,
    requests=flytekit.Resources(gpu=GPU, cpu=CPU, mem=MEM),
)
n
@Kyle Mylonakis to confirm, the error only happens when the following two conditions are both true, right? 1. The Elastic task is inside a dynamic task. 2. nnodes > 1
k
I am seeing the same error right now for a single task
or rather a single task inside a workflow of two tasks, (not dynamic or map)
k
Hmm does not make sense
What is nnodes?
k
an integer greater than 1
For example it fails with the error on 2
k
When it’s one does it work?
k
it works with 1
n
You were able to run nnodes > 1 when we had Flyte 1.6.2 deployed, right?
k
yes
Should we downgrade?
k
Ohh is that right - I know @Fabio Grätz made a change?
Downgrade flytekit
k
Would love to talk with them in detail about the bug or the change they made.
k
@Fabio Grätz can you tal
f
Workflow[...] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[dn0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [pytorch]: number of worker should be more then 0
I think this might be related to the recent refactoring of the kubeflow plugin by @Yubo Wang. Are you running a new version of propeller? I think Yubo fixed this here if I’m not mistaken. Can you please also check whether you use a flytekit version that contains this change or not?
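One way to check which flytekit version is installed on the client side is to query the package metadata. This is a minimal sketch using only the standard library. The helper names installed_version and report_versions are hypothetical, and the package names passed in are assumed to be the PyPI distribution names. It does not tell you which propeller version runs on the cluster, which has to be checked on the deployment itself.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    # Assumed distribution names for flytekit and its Kubeflow PyTorch plugin.
    for name, version in report_versions(
        ["flytekit", "flytekitplugins-kfpytorch"]
    ).items():
        print(f"{name}: {version or 'not installed'}")
```

The printed versions can then be compared against the release that contains the fix.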
k
Uh oh,
k
@Nan Qin Can you look into this?
y
Yeah, sorry for the mess. Fabio is correct: the PR that fixes it is not yet included in that version of the Flyte release. If you can, upgrading both flytekit and flytepropeller to the latest master branch should solve the issue.
n
does the latest flyte-binary helm chart include the fix?
y
let me check
no, it does not