acoustic-carpenter-78188
07/13/2023, 11:26 AM
@task(
    task_config=Elastic(
        nnodes=1,
        nproc_per_node=1,
    ),
)
def train():
    raise FooException("foo")
This currently results in this stack trace:
FlyteScopedUserException:
…
raise FooException("foo")
FooException: foo
============================================================
During handling of the above exception, another exception occurred:
…
729 return exception_scopes.user_entry_point(self._workflow_function)(**kwargs)
/home/fabiogratz/miniconda3/envs/flyte-dev/lib/python3.10/site-packages/flytekit/exceptions/scopes.py:202 in user_entry_point
202 raise exc.type(f"Error encountered while executing '{fn_name}':\n {exc.
TypeError: Encountered error while executing workflow 'wf':
ChildFailedError.__init__() missing 1 required positional argument: 'failures'
The original FooException is not the last exception the user sees!
Reason:
• The worker process crashes with FooException.
• `elastic_launch` in the elastic task raises ChildFailedError.
• Since all of this is executed within `exception_scopes.user_entry_point(self._execute)(**kwargs)`, we re-raise the ChildFailedError using `raise exc.type(f"Error encountered while executing '{fn_name}':\n {exc.value}") from exc`.
• However, ChildFailedError needs an additional argument called `failures`, so `exc.type("some message")` fails.
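The failure mode described above can be reproduced without torch or flytekit. The following is a minimal sketch under stated assumptions: the `ChildFailedError` class here is a stand-in that only mirrors the two required constructor arguments, and `reraise_with_context` is a hypothetical helper that imitates the `raise exc.type(...) from exc` pattern. Neither is the actual library code.

```python
class ChildFailedError(Exception):
    """Stand-in (assumption) for torch's ChildFailedError: it requires
    two positional arguments, `name` and `failures`."""

    def __init__(self, name, failures):
        super().__init__(f"{name} failed: {failures}")
        self.name = name
        self.failures = failures


def reraise_with_context(exc, fn_name):
    """Hypothetical helper imitating the re-raise pattern: rebuild an
    exception of the same type from a single message string."""
    raise type(exc)(f"Error encountered while executing '{fn_name}':\n {exc}") from exc


def demo():
    """Return the message of the TypeError that masks the original error."""
    try:
        try:
            raise ChildFailedError("train", {0: "FooException: foo"})
        except ChildFailedError as exc:
            # Constructing ChildFailedError from one string fails here,
            # because the `failures` argument is missing.
            reraise_with_context(exc, "wf")
    except TypeError as err:
        return str(err)
    return None


message = demo()
print(message)
```

The printed message is a TypeError about the missing `failures` argument, so the user no longer sees the original error as the final exception.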
In this PR, I therefore catch the ChildFailedError and re-raise a RuntimeError with the message of the original ChildFailedError.
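A rough sketch of this catch-and-re-raise approach, using stand-ins only: `ChildFailedError`, `elastic_launch_stub`, `execute`, and `reraise_with_context` are hypothetical names for illustration and are not the code in the PR. The point is that a RuntimeError can be rebuilt from a single message string, so the outer re-raise no longer fails.

```python
class ChildFailedError(Exception):
    """Stand-in (assumption) for torch's ChildFailedError."""

    def __init__(self, name, failures):
        super().__init__(f"{name} failed: {failures}")
        self.name = name
        self.failures = failures


def elastic_launch_stub():
    """Hypothetical stand-in for the elastic launch: a worker crashed."""
    raise ChildFailedError("train", {0: "FooException: foo"})


def execute():
    """Catch the ChildFailedError and re-raise a RuntimeError carrying its message."""
    try:
        elastic_launch_stub()
    except ChildFailedError as e:
        raise RuntimeError(str(e)) from e


def reraise_with_context(exc, fn_name):
    """Hypothetical helper imitating the outer `raise exc.type(...) from exc`."""
    raise type(exc)(f"Error encountered while executing '{fn_name}':\n {exc}") from exc


def run():
    """Return the final exception the user would see."""
    try:
        try:
            execute()
        except Exception as exc:
            reraise_with_context(exc, "wf")
    except RuntimeError as err:
        return err
    return None


result = run()
print(type(result).__name__)
print(result)
```

With this change the final exception is a RuntimeError whose message still contains the text of the child failure.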
* * *
Unfortunately, torch elastic launch gives us access to the exception in the child process only as a string message. We don't know the type of the original exception in the child process (or would have to try to parse it from the string), which means we don't know whether the exception in the child process is recoverable. Therefore, I add a warning telling the user to use the Elastic(..., max_retries) argument to control retries for elastic tasks. (With this argument, it is the worker processes within the elastic task that are restarted, not the pod, and the main agent process doesn't crash.)
Tracking Issue
NA
Follow-up issue
NA
flyteorg/flytekit
✅ All checks have passed
30/30 successful checks
acoustic-carpenter-78188
07/14/2023, 3:09 PM