Brandon Segal
05/24/2023, 2:07 AM
Ketan (kumare3)
Brandon Segal
05/24/2023, 11:41 AM
User:NotReady in a way styx can identify. As a result, workflows are erroneously labeled as having unknown errors even when the team raises error codes known to the styx service, because the prepended "RetriesExhausted|" string keeps those codes from being recognized.
A possible remediation that would let dynamic workflows raise specific styx errors is to remove the "RetriesExhausted|" string before matching against the known error codes.
Samhita Alla
Sonja Ericsson
05/25/2023, 9:38 AM
final String flyteCode = execution.getClosure().getError().getCode();
final String[] flyteCodesSplitted = flyteCode.split("\\|");
final String lastCode = flyteCodesSplitted[flyteCodesSplitted.length-1];
Or is this enough https://github.com/spotify/styx/pull/1084/files ?
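A minimal sketch of the prefix-stripping idea discussed above, assuming Flyte joins nested error codes with "|" as in RetriesExhausted|USER:NotReady. The class name FlyteErrorCodes and the method lastCode are hypothetical and not part of styx or Flyte. Unlike the split-based snippet, this version returns an empty Optional when nothing follows the final "|", so a bare "RetriesExhausted|" is not mistaken for a real code.

```java
import java.util.Optional;

// Hypothetical helper: the class and method names are illustrative only,
// not part of the styx or Flyte codebases.
final class FlyteErrorCodes {

  private FlyteErrorCodes() {}

  /**
   * Returns the last "|"-separated segment of a Flyte error code, so that
   * "RetriesExhausted|USER:NotReady" yields "USER:NotReady".
   * Returns empty for null, empty, or trailing-"|" input.
   */
  static Optional<String> lastCode(String flyteCode) {
    if (flyteCode == null || flyteCode.isEmpty()) {
      return Optional.empty();
    }
    // lastIndexOf returns -1 when there is no "|", so substring(0) keeps the whole code.
    String last = flyteCode.substring(flyteCode.lastIndexOf('|') + 1);
    return last.isEmpty() ? Optional.empty() : Optional.of(last);
  }
}
```

The caller could then match the returned value against its known error codes and fall back to the unknown-error path when the Optional is empty.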
Our scheduler service identifies USER:NotReady Flyte errors and translates them to a code that makes our scheduler retry the execution.
Dan Rammer (hamersaw)
05/25/2023, 12:45 PM
Failed state it means that some task in the subworkflow already exhausted all of its retries. I think the only retryable failures here should be when dynamic tasks fail internally (i.e., copying inputs, writing outputs, etc.). I am going to run a ton of tests today to validate this to myself.
RetriesExhausted| prefix would be the best way to identify the actual error code.
Sonja Ericsson
05/25/2023, 1:00 PM
Brandon Segal
05/25/2023, 1:02 PM
Sonja Ericsson
05/25/2023, 1:09 PM
@task(retries=3)
Dan Rammer (hamersaw)
05/25/2023, 1:17 PM
both the dynamic workflow and the tasks spun up have retries=0
So fixing the issue I linked should fix this problem, correct? If the dynamic task does not return a RetryableFailure, then the RetriesExhausted| prefix will not be prepended to the error code.
Is it possible to configure a task to retry only on certain errors and not others raised by the task?
This is a point of confusion in a lot of places. Flyte differentiates between retryable and non-retryable errors (called recoverable in the flytekit API). So when throwing an error from flytekit, I believe it is non-recoverable by default (@Yee can you confirm?), but you can use a try / except in flytekit and wrap the error in a recoverable flag to ensure retries. So in this case, it seems the non-recoverable error from flytekit is retried because of the bug in the dynamic task implementation (will fix in the next few days).
The initial reason for having retries on tasks was system-level failures. For example, k8s deleting and cleaning up a Pod in the background, or similarly preemptible instances, or things like blobstore read / write failures that aren't involved in actual task executions. All of these are "recoverable" errors. So TL;DR: if the error is non-recoverable, it will not retry regardless of the number of retries specified on the task.
Sonja Ericsson
05/25/2023, 1:31 PM
Brandon Segal
05/25/2023, 3:28 PM
User:NotReady error code which is exactly what we need and probably is the expected behavior
Yee