Kyle Mylonakis
06/27/2023, 8:31 PM
Workflow[...] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[dn0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [pytorch]: number of worker should be more then 0
Is this something people have seen before? Is there a more appropriate place to ask about this?
Niels Bantilan
06/27/2023, 9:55 PM
Kyle Mylonakis
06/27/2023, 10:03 PM
Nan Qin
06/27/2023, 10:29 PM
flytekit.task
flyteagent image is still on 1.6.2b1 although I deployed the 1.7.0 helm chart, and flyte-binary-release is on v1.7.0, which is correct. Could it be related?
Niels Bantilan
06/27/2023, 11:22 PM
Kyle Mylonakis
06/27/2023, 11:27 PM
@flytekit.dynamic(cache=CACHE, cache_version=CACHE_VERSION)
def dynamic_task(
    redacted_args
) -> WORKFLOW_RETURN_TYPE:
    results: WORKFLOW_RETURN_TYPE = {}  # type: ignore
    for hyper_params in [HyperParams(*args) for args in itertools.product(redacted_args lists)]:
        best_paths = train(various_args=various_args)
        key = "some_arg_dependent_string"
        results[key] = best_paths  # type: ignore
    return results
train: a task with an Elastic config
Ketan (kumare3)
Elastic config
Kyle Mylonakis
06/28/2023, 12:45 AM
@flytekit.task(
    task_config=kfpytorch.Elastic(nnodes=NNODES, nproc_per_node=NPROC_PER_NODE),
    cache=CACHE,
    cache_version=CACHE_VERSION,
    requests=flytekit.Resources(gpu=GPU, cpu=CPU, mem=MEM),
)
Nan Qin
06/28/2023, 3:25 AM
Kyle Mylonakis
06/28/2023, 3:25 AM
Ketan (kumare3)
Kyle Mylonakis
06/28/2023, 3:27 AM
Ketan (kumare3)
Kyle Mylonakis
06/28/2023, 3:27 AM
Nan Qin
06/28/2023, 3:28 AM
Kyle Mylonakis
06/28/2023, 3:28 AM
Ketan (kumare3)
Kyle Mylonakis
06/28/2023, 10:51 AM
Ketan (kumare3)
Fabio Grätz
06/28/2023, 2:27 PM
I think this might be related to the recent refactoring of the kubeflow plugin by @Yubo Wang. Are you running a new version of propeller? I think Yubo fixed this here if I’m not mistaken. Can you please also check whether you use a flytekit version that contains this change or not?
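For the flytekit side of that check, a minimal sketch that prints the installed versions from inside the environment or image the task runs in. It assumes the packages are installed under their PyPI distribution names, flytekit and flytekitplugins-kfpytorch; the helper name installed_versions is only illustrative. It says nothing about the propeller / flyte-binary version, which has to be checked on the deployment side.

```python
from importlib import metadata

# Assumed PyPI distribution names; adjust if your image installs different ones.
PACKAGES = ("flytekit", "flytekitplugins-kfpytorch")


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in both the local registration environment and the task container makes it easier to spot a mismatch between the two.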
Ketan (kumare3)
Kyle Mylonakis
06/28/2023, 3:39 PM
Yubo Wang
06/28/2023, 7:30 PM
Nan Qin
06/28/2023, 7:32 PM
Yubo Wang
06/28/2023, 7:33 PM