# flyte-github
## #1739 Fix: Improve error handling in elastic tasks

Pull request opened by fg91

### TL;DR
• This PR fixes a small bug that made it harder to find, in the stack trace, the original exception that caused an elastic task worker process to crash.
• The PR also adds a warning that users should use the retry mechanism of torch elastic launch instead of Flyte's retry mechanism for exceptions raised within the worker processes.

### Type
☑︎ Bug Fix
☐ Feature
☐ Plugin

### Are all requirements met?
☑︎ Code completed
☑︎ Smoke tested
☐ Unit tests added
☐ Code documentation added
☑︎ Any pending items have an associated Issue

### Complete description

Let's consider this minimal task:
```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic

# Stand-in for any user-defined exception raised in the worker process
class FooException(Exception):
    ...

@task(
    task_config=Elastic(
        nnodes=1,
        nproc_per_node=1,
    ),
)
def train():
    raise FooException("foo")
```
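The traceback below refers to a workflow named `wf`; a minimal wrapper like the following (an illustrative assumption, not part of the PR) presumably produces it when run locally:

```python
from flytekit import workflow

# Hypothetical wrapper; the name `wf` is taken from the traceback below.
@workflow
def wf():
    train()
```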
This currently results in this stack trace:
```
FlyteScopedUserException:
…
      raise FooException("foo")
  FooException: foo

============================================================

During handling of the above exception, another exception occurred:



…
 729  return exception_scopes.user_entry_point(self._workflow_function)(**kwargs) 

 /home/fabiogratz/miniconda3/envs/flyte-dev/lib/python3.10/site-packages/flytekit/exceptions/scopes.py:202 in user_entry_point     

 202  raise exc.type(f"Error encountered while executing '{fn_name}':\n  {exc.                               

TypeError: Encountered error while executing workflow 'wf':
  ChildFailedError.__init__() missing 1 required positional argument: 'failures'
```

The original `FooException` is not the last exception the user sees! Reason:

• The worker process crashes with `FooException`.
• `elastic_launch` in the elastic task then raises `ChildFailedError`.
• Since all of this is executed within `exception_scopes.user_entry_point(self._execute)(**kwargs)` here, we re-raise the `ChildFailedError` using `raise exc.type(f"Error encountered while executing '{fn_name}':\n {exc.value}") from exc` here.
• However, `ChildFailedError` requires an additional constructor argument called `failures`, so `exc.type("some message")` fails with a `TypeError` (illustrated in the first sketch below).

In this PR, I therefore catch the `ChildFailedError` and re-raise a `RuntimeError` with the message of the original `ChildFailedError`; a sketch of this pattern follows below.
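To make the failure mode concrete, here is a small self-contained illustration (not flytekit code; `NeedsFailures` is a hypothetical stand-in for `ChildFailedError`) of why re-raising via `exc.type(message)` breaks for exception classes with extra required constructor arguments:

```python
# Hypothetical stand-in for torch elastic's ChildFailedError, whose constructor
# also requires a `failures` argument.
class NeedsFailures(Exception):
    def __init__(self, name: str, failures: dict):
        super().__init__(f"{name}: {failures}")
        self.failures = failures

try:
    try:
        raise NeedsFailures("worker crashed", {0: "FooException: foo"})
    except Exception as exc:
        # Generic re-raise pattern: build a new exception of the same type
        # from a message only.
        raise type(exc)(f"Error encountered while executing 'wf':\n  {exc}") from exc
except TypeError as e:
    # Prints something like: "... missing 1 required positional argument: 'failures'"
    print(e)
```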
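And here is a minimal sketch of the error-handling pattern described above. This is not the actual plugin code: the function name, the warning wording, and where the warning is emitted are assumptions.

```python
import logging

from torch.distributed.elastic.multiprocessing.errors import ChildFailedError
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

logger = logging.getLogger(__name__)


def launch_elastic_entrypoint(entrypoint, config: LaunchConfig):
    """Illustrative wrapper around torch's elastic_launch."""
    try:
        return elastic_launch(config=config, entrypoint=entrypoint)()
    except ChildFailedError as e:
        # torch elastic only surfaces the worker failure as a string message,
        # so we cannot tell whether the original exception was recoverable.
        logger.warning(
            "A worker process failed. Flyte cannot determine whether the error is "
            "recoverable; configure worker retries via the Elastic task config "
            "instead of Flyte task retries."
        )
        # ChildFailedError cannot be re-instantiated from a message alone
        # (its constructor also requires `failures`), so re-raise a plain
        # RuntimeError carrying the original failure message instead.
        raise RuntimeError(str(e)) from e
```

With this pattern, the formatted worker failure (including the original `FooException` message) is the last exception the user sees. The exact wording and placement of the warning, and the exact retry parameter name on `Elastic`, would need to be checked against the actual diff.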
* * *

Unfortunately, torch elastic launch gives us access to the exception in the child process only as a message in string format. We don't know the type of the original exception in the child process (or would have to try to parse it from the string). This means we don't know whether the exception in the child process is recoverable. Therefore I add a warning to users that they should use the `Elastic(..., max_retries)` argument to control retries for elastic tasks. (This means that not the pod but the worker processes within the elastic task are restarted, while the main agent process doesn't crash.)

### Tracking Issue
NA

### Follow-up issue
NA