# flyte-github
## #1739 Fix: Improve error handling in elastic tasks

Pull request opened by fg91

### TL;DR
• This PR fixes a small bug that made it harder to find, in the stack trace, the original exception that caused an elastic task worker process to crash.
• The PR also adds a warning that users should use the retry mechanism of torch elastic launch instead of Flyte's retry mechanism for exceptions raised within the worker processes.

### Type
☑︎ Bug Fix
☐ Feature
☐ Plugin

### Are all requirements met?
☑︎ Code completed
☑︎ Smoke tested
☐ Unit tests added
☐ Code documentation added
☑︎ Any pending items have an associated Issue

### Complete description

Let's consider this minimal task:
```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic

# Stand-in for any user-defined exception raised in the worker process
class FooException(Exception):
    ...

@task(
    task_config=Elastic(
        nnodes=1,
        nproc_per_node=1,
    ),
)
def train():
    raise FooException("foo")
```
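The traceback below refers to a workflow named `wf`; a minimal wrapper like the following (an illustrative assumption, not part of the PR) presumably produces it when run locally:

```python
from flytekit import workflow

# Hypothetical wrapper; the name `wf` is taken from the traceback below.
@workflow
def wf():
    train()
```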
This currently results in this stack trace:
```
FlyteScopedUserException:
…
      raise FooException("foo")
  FooException: foo

============================================================

During handling of the above exception, another exception occurred:



…
 729  return exception_scopes.user_entry_point(self._workflow_function)(**kwargs) 

 /home/fabiogratz/miniconda3/envs/flyte-dev/lib/python3.10/site-packages/flytekit/exceptions/scopes.py:202 in user_entry_point     

 202  raise exc.type(f"Error encountered while executing '{fn_name}':\n  {exc.                               

TypeError: Encountered error while executing workflow 'wf':
  ChildFailedError.__init__() missing 1 required positional argument: 'failures'
```

The original `FooException` is not the last exception the user sees! Reason:

• The worker process crashes with `FooException`.
• `elastic_launch` in the elastic task then raises `ChildFailedError`.
• Since all of this is executed within `exception_scopes.user_entry_point(self._execute)(**kwargs)` here, we re-raise the `ChildFailedError` using `raise exc.type(f"Error encountered while executing '{fn_name}':\n {exc.value}") from exc` here.
• However, `ChildFailedError` requires an additional constructor argument called `failures`, so `exc.type("some message")` fails with a `TypeError` (illustrated in the first sketch below).

In this PR, I therefore catch the `ChildFailedError` and re-raise a `RuntimeError` with the message of the original `ChildFailedError`; a sketch of this pattern follows below.
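To make the failure mode concrete, here is a small self-contained illustration (not flytekit code; `NeedsFailures` is a hypothetical stand-in for `ChildFailedError`) of why re-raising via `exc.type(message)` breaks for exception classes with extra required constructor arguments:

```python
# Hypothetical stand-in for torch elastic's ChildFailedError, whose constructor
# also requires a `failures` argument.
class NeedsFailures(Exception):
    def __init__(self, name: str, failures: dict):
        super().__init__(f"{name}: {failures}")
        self.failures = failures

try:
    try:
        raise NeedsFailures("worker crashed", {0: "FooException: foo"})
    except Exception as exc:
        # Generic re-raise pattern: build a new exception of the same type
        # from a message only.
        raise type(exc)(f"Error encountered while executing 'wf':\n  {exc}") from exc
except TypeError as e:
    # Prints something like: "... missing 1 required positional argument: 'failures'"
    print(e)
```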
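And here is a minimal sketch of the error-handling pattern described above. This is not the actual plugin code: the function name, the warning wording, and where the warning is emitted are assumptions.

```python
import logging

from torch.distributed.elastic.multiprocessing.errors import ChildFailedError
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

logger = logging.getLogger(__name__)


def launch_elastic_entrypoint(entrypoint, config: LaunchConfig):
    """Illustrative wrapper around torch's elastic_launch."""
    try:
        return elastic_launch(config=config, entrypoint=entrypoint)()
    except ChildFailedError as e:
        # torch elastic only surfaces the worker failure as a string message,
        # so we cannot tell whether the original exception was recoverable.
        logger.warning(
            "A worker process failed. Flyte cannot determine whether the error is "
            "recoverable; configure worker retries via the Elastic task config "
            "instead of Flyte task retries."
        )
        # ChildFailedError cannot be re-instantiated from a message alone
        # (its constructor also requires `failures`), so re-raise a plain
        # RuntimeError carrying the original failure message instead.
        raise RuntimeError(str(e)) from e
```

With this pattern, the formatted worker failure (including the original `FooException` message) is the last exception the user sees. The exact wording and placement of the warning, and the exact retry parameter name on `Elastic`, would need to be checked against the actual diff.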
* * *

Unfortunately, torch elastic launch gives us access to the exception in the child process only as a message in string format. We don't know the type of the original exception in the child process (or would have to try to parse it from the string). This means we don't know whether the exception in the child process is recoverable. Therefore I add a warning to users that they should use the `Elastic(..., max_retries)` argument to control retries for elastic tasks. (This means that not the pod but the worker processes within the elastic task are restarted, while the main agent process doesn't crash.)

### Tracking Issue
NA

### Follow-up issue
NA