Is there a way to have my dynamic workflow raise the last er Flyte #flyte-support

bored-laptop-29637

05/24/2023, 2:07 AM

Is there a way to have my dynamic workflow raise the last error from one of its invoked tasks directly? I want to make sure that the correct error and error code are the outer most errors coming from the workflow

freezing-airport-6809

05/24/2023, 3:07 AM

What do you mean by last error?

freezing-airport-6809

05/24/2023, 3:07 AM

Ohh, you have a failure and keep going?

bored-laptop-29637

05/24/2023, 11:41 AM

Oh no, sorry. I have an error with a specific exit code raised from one of the tasks started by the dynamic workflow. If I look at the error on the workflow, will it be that same error raised by the dynamic workflow?

bored-laptop-29637

05/24/2023, 12:27 PM

The last error has an error code of User:NotReady, but I see that the error raised by the workflow containing the dynamic workflow is RetriesExhausted|User:NotReady. Ultimately, I want to know whether I can raise User:NotReady as the error code for the workflow.

bored-laptop-29637

05/24/2023, 12:54 PM

Or is it possible to try/catch the errors from the tasks and re-raise them from the dynamic workflow?

bored-laptop-29637

05/24/2023, 8:34 PM

Ahh, I think I found the cause of the prefix on the error code, in the flytepropeller code: flytepropeller marks a dynamic workflow with a retryable failure if it finds that one of the dynamic nodes failed. It then prepends "RetriesExhausted|" to the dynamic node's original error code. The impact is that a dynamic workflow cannot raise User:NotReady in a way styx can identify. Workflows end up erroneously labeled as having unknown errors: the team may be raising error codes known to the styx service, but they are not recognized because of the prepended "RetriesExhausted|" string. A possible remediation, to let dynamic workflows raise specific styx errors, is to strip the "RetriesExhausted|" string before matching the code against the known error codes.

tall-lock-23197

05/25/2023, 6:10 AM

cc @hallowed-mouse-14616

colossal-solstice-11091

05/25/2023, 9:38 AM

How should we treat this, @hallowed-mouse-14616? Can this prefixing happen in cases other than RetriesExhausted, so that it makes sense to do:

final String flyteCode = execution.getClosure().getError().getCode();
final String[] flyteCodesSplitted = flyteCode.split("\\|");
final String lastCode = flyteCodesSplitted[flyteCodesSplitted.length-1];

Or is this enough: https://github.com/spotify/styx/pull/1084/files ? Our scheduler service identifies USER:NotReady flyte errors and translates them to a code that makes our scheduler retry the execution.

hallowed-mouse-14616

05/25/2023, 12:45 PM

Oh, this is a difficult one 😅. I'm actually working on an issue right now about dynamic tasks being retried when they shouldn't be - https://github.com/flyteorg/flyte/issues/3606. So my first question is: are you expecting the retries? If the user returns a non-recoverable error, the dynamic should just propagate that up, which is what the fix to the aforementioned issue will do. Basically, I think the section you linked should be returning permanent failures in almost every case, because if the call to RecursiveNodeHandler returns a Failed state, it means that some task in the subworkflow has already exhausted all of its retries. I think the only retryable failures here should be when dynamic tasks fail internally (i.e. copying inputs / writing outputs, etc.). I am going to run a ton of tests today to validate this to myself.

hallowed-mouse-14616

05/25/2023, 12:47 PM

If the retries are expected, it does seem that removing the RetriesExhausted| prefix would be the best way to identify the actual error code.
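A minimal Python sketch of that prefix removal, for illustration only. The names `normalize_error_code`, `is_known` and `KNOWN_CODES` are hypothetical and are not taken from styx or flytepropeller; the only code mentioned in this thread is User:NotReady, and the real set of codes the scheduler recognizes is an assumption here.

```python
# Prefix that flytepropeller is described as prepending in this thread.
RETRIES_EXHAUSTED_PREFIX = "RetriesExhausted|"

# Hypothetical set of codes the scheduler knows how to act on.
# Only User:NotReady appears in the discussion above.
KNOWN_CODES = {"User:NotReady"}


def normalize_error_code(code: str) -> str:
    """Return the code without a leading 'RetriesExhausted|' wrapper."""
    if code.startswith(RETRIES_EXHAUSTED_PREFIX):
        return code[len(RETRIES_EXHAUSTED_PREFIX):]
    return code


def is_known(code: str) -> bool:
    """Check the normalized code against the known codes."""
    return normalize_error_code(code) in KNOWN_CODES
```

Unlike splitting on "|" and taking the last segment, this only strips the one known prefix, so a code that legitimately contains "|" elsewhere is left alone. Which of the two behaviors is wanted depends on whether other prefixes can occur, which is the open question above.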

colossal-solstice-11091

05/25/2023, 1:00 PM

Yes. @bored-laptop-29637, I don’t think we expect retries, right? It says Attempt 01 for the node execution that failed.

bored-laptop-29637

05/25/2023, 1:02 PM

That is correct, @colossal-solstice-11091. Both the dynamic workflow and the tasks it spins up have retries=0.

colossal-solstice-11091

05/25/2023, 1:09 PM

@hallowed-mouse-14616 Is it possible to configure a task to retry only on certain errors and not others raised by the task?

@task(retries=3)

hallowed-mouse-14616

05/25/2023, 1:17 PM

> both the dynamic workflow and the tasks spun up have retries=0

So fixing the issue I linked should fix this problem, correct? If the dynamic task does not return a RetryableFailure, then RetriesExhausted| will not be prepended to the error code.

> Is it possible to configure a task to retry only on certain errors and not others raised by the task?

This is a point of confusion in a lot of places. Flyte differentiates between retryable and non-retryable errors (called recoverable in the flytekit API). When you throw an error from flytekit, I believe it is non-recoverable by default (@thankful-minister-83577 can you confirm?), but you can use a try / catch in flytekit and wrap the error with a recoverable flag to ensure retries. So in this case, it seems the non-recoverable error from flytekit is retried because of the bug in the dynamic task implementation (will fix in the next few days). The initial reason for having retries on tasks was system-level failures: for example, k8s deleting and cleaning up a Pod in the background, preemptible instances, or blobstore read / write failures that aren't part of the actual task execution. All of these are "recoverable" errors. So TL;DR: if the error is non-recoverable, it will not be retried regardless of the number of retries specified on the task.
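The rule in that TL;DR can be modeled with a small self-contained Python sketch. This is a simulation of the described semantics, not Flyte's implementation: `RecoverableError` and `run_with_retries` are hypothetical stand-ins, and the assumption that `retries=N` means up to N extra attempts after the first is mine. In flytekit itself, the recoverable marker is, as far as I know, `FlyteRecoverableException` in `flytekit.exceptions.user`.

```python
class RecoverableError(Exception):
    """Stand-in for an error the task author has marked as safe to retry."""


def run_with_retries(fn, retries):
    """Call fn, retrying only when it raises RecoverableError.

    Returns (result, attempts). Any other exception propagates on the
    first attempt, no matter how many retries are configured.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return fn(), attempts
        except RecoverableError:
            # Assumption: 'retries' counts extra attempts after the first.
            if attempts > retries:
                raise


# Two toy "tasks" that count how often they are invoked.
calls = {"flaky": 0, "fatal": 0}


def flaky():
    """Fails twice with a recoverable error, then succeeds."""
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise RecoverableError("blobstore read failed")
    return "ok"


def fatal():
    """Always fails with a plain (non-recoverable) error."""
    calls["fatal"] += 1
    raise ValueError("NotReady")
```

With `retries=3`, `flaky` succeeds on its third attempt, while `fatal` is invoked exactly once and its error surfaces unchanged, which is the behavior the thread expects for User:NotReady once the dynamic task bug is fixed.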

🙏 2

colossal-solstice-11091

05/25/2023, 1:31 PM

I see. Thanks for the great explanation! We’ll look out for that fix.

bored-laptop-29637

05/25/2023, 3:28 PM

Aw, that would be great, Dan. That would definitely fix the issue we are seeing: since none of the generated tasks return recoverable errors, the workflow would then just fail with a User:NotReady error code, which is exactly what we need and is probably the expected behavior.

thankful-minister-83577

05/25/2023, 3:29 PM

yeah there’s an open request to also add a switch to change the error handling default behavior as well.

thankful-minister-83577

05/25/2023, 3:29 PM

by default it’s non-recoverable, you’ll be able to select by-default-recoverable
