Brandon Segal05/24/2023, 2:07 AM
Brandon Segal05/24/2023, 11:41 AM
in a way styx can identify. This will result in workflows erroneously being labeled as having unknown errors when the team may be raising error codes that are known to the styx service but are not recognized because of the "RetriesExhausted" string prepended to them. A possible remediation, which would allow dynamic workflows to raise specific styx errors, is to strip the "RetriesExhausted|" prefix prior to matching against any of the known error codes.
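That remediation could be sketched roughly like this (a hypothetical helper, not actual styx code; the registered error codes and the repeated-prefix handling are assumptions for illustration):

```java
// Hypothetical sketch of the proposed fix: strip the "RetriesExhausted|"
// prefix before matching against known styx error codes. Not actual styx code.
import java.util.Set;

public class RetriesExhaustedStripper {
    private static final String PREFIX = "RetriesExhausted|";

    // Made-up examples of codes a team might register with styx.
    private static final Set<String> KNOWN_CODES =
            Set.of("USER:BadInput", "USER:UpstreamDown");

    static String stripPrefix(String code) {
        // Assumption: the prefix could appear more than once (e.g. nested
        // dynamic workflows), so strip it repeatedly rather than once.
        while (code.startsWith(PREFIX)) {
            code = code.substring(PREFIX.length());
        }
        return code;
    }

    static boolean isKnown(String code) {
        return KNOWN_CODES.contains(stripPrefix(code));
    }
}
```

With something like this, "RetriesExhausted|USER:BadInput" would match the same registered code as a bare "USER:BadInput".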
Sonja Ericsson05/25/2023, 9:38 AM
Or is this enough https://github.com/spotify/styx/pull/1084/files ? Our scheduler service is identifying
final String flyteCode = execution.getClosure().getError().getCode();
final String[] flyteCodesSplitted = flyteCode.split("\\|");
final String lastCode = flyteCodesSplitted[flyteCodesSplitted.length - 1];
flyte errors to translate them into a code that makes our scheduler retry the execution
Dan Rammer (hamersaw)05/25/2023, 12:45 PM
state it means that some task in the subworkflow has already exhausted all of its retries. I think the only retryable failures here should be when dynamic tasks fail internally (i.e. copying inputs / writing outputs, etc.). I am going to run a ton of tests today to validate this to myself.
prefix would be the best way to identify the actual error code.
Sonja Ericsson05/25/2023, 1:00 PM
Brandon Segal05/25/2023, 1:02 PM
Sonja Ericsson05/25/2023, 1:09 PM
Dan Rammer (hamersaw)05/25/2023, 1:17 PM
both the dynamic workflow and the tasks spun up have retries=0. So fixing the issue I linked should fix this problem, correct? If the dynamic task does not return a RetryableFailure then the RetriesExhausted prefix
will not be prepended to the error code.
Is it possible to configure a task to retry only on certain errors and not others raised by the task?
This is a point of confusion in a lot of places. Flyte differentiates between retryable and non-retryable errors (called recoverable in the flytekit API). When throwing an error from flytekit, I believe they are non-recoverable by default (@Yee can you confirm?), but you can use a try / catch in flytekit and wrap the error in a recoverable exception to ensure retries. So in this case, it seems the non-recoverable error from flytekit is retried because of the bug in the dynamic task implementation (will fix in the next few days). The initial reason for having retries on tasks was for system-level failures. For example, k8s deleting and cleaning up a Pod in the background, or similarly preemptible instances, or blobstore read / write failures that aren't part of actual task execution. All of these are "recoverable" errors. So TL;DR: if the error is non-recoverable, it will not be retried regardless of the number of retries specified on the task.
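Those semantics can be sketched generically (the exception type and runner here are hypothetical illustrations, not Flyte APIs): only recoverable failures consume the retry budget, while a non-recoverable error propagates immediately no matter how many retries are configured.

```java
// Generic sketch of the retry semantics described above. RecoverableException
// and runWithRetries are made-up names, not flytepropeller or flytekit APIs.
import java.util.function.Supplier;

public class RetrySemantics {
    /** Marks system-level failures (pod eviction, blobstore hiccups, ...). */
    static class RecoverableException extends RuntimeException {
        RecoverableException(String message) { super(message); }
    }

    static <T> T runWithRetries(Supplier<T> task, int retries) {
        for (int attempt = 0; ; attempt++) {
            try {
                return task.get();
            } catch (RecoverableException e) {
                // Recoverable failures consume the retry budget.
                if (attempt >= retries) {
                    throw e;
                }
                // Otherwise loop and attempt again.
            }
            // Any other RuntimeException is treated as non-recoverable and
            // propagates immediately, regardless of the configured retries.
        }
    }
}
```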
Sonja Ericsson05/25/2023, 1:31 PM
Brandon Segal05/25/2023, 3:28 PM
error code, which is exactly what we need and is probably the expected behavior