Is there a way to have my dynamic workflow raise t...
# ask-the-community
b
Is there a way to have my dynamic workflow raise the last error from one of its invoked tasks directly? I want to make sure that the correct error and error code are the outermost errors coming from the workflow.
k
Wdym last error
Ohh you have a failure and continue going
b
Oh no sorry. I have an error with a specific exit code raised from one of the tasks started by the dynamic workflow. Will the dynamic workflow raise that same error in the workflow if I was to look at the error from the workflow?
The last error has an error code of User:NotReady, but I see that the error being raised by the workflow containing the dynamic workflow is RetriesExhausted|User:NotReady. Ultimately I want to know if I can raise the error code User:NotReady as the error for the workflow?
Or if it is possible to try catch the errors from the task and raise them from the dynamic workflow?
Ahh, I think I found the cause of the prefix on the error code in the flytepropeller code: flytepropeller marks a dynamic workflow with a retryable failure if it finds that one of the dynamic nodes failed. It then prepends "RetriesExhausted|" to the dynamic node's original error code. The impact is that a dynamic workflow cannot raise User:NotReady in a way styx can identify. Workflows end up erroneously labeled as having unknown errors: the team may be raising error codes known to the styx service, but they are not recognized because of the prepended RetriesExhausted string. A possible remediation that would let dynamic workflows raise specific styx errors is to remove the "RetriesExhausted|" string before matching against the known error codes.
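The remediation described above could look roughly like the sketch below. It is written in Python purely for illustration (styx itself is Java), and the names `RETRIES_EXHAUSTED_PREFIX`, `KNOWN_ERROR_CODES`, `normalize_error_code`, and `is_known_error` are hypothetical; only the "RetriesExhausted|" prefix and the User:NotReady code come from this thread.

```python
# Prefix that flytepropeller prepends, as described in this thread.
RETRIES_EXHAUSTED_PREFIX = "RetriesExhausted|"

# Hypothetical set of codes the scheduler recognizes.
# Only User:NotReady is taken from the thread.
KNOWN_ERROR_CODES = {"User:NotReady"}


def normalize_error_code(code: str) -> str:
    """Return the code without a leading "RetriesExhausted|" prefix."""
    if code.startswith(RETRIES_EXHAUSTED_PREFIX):
        return code[len(RETRIES_EXHAUSTED_PREFIX):]
    return code


def is_known_error(code: str) -> bool:
    """Match the normalized code against the known error codes."""
    return normalize_error_code(code) in KNOWN_ERROR_CODES
```

Stripping only this one known prefix keeps the match strict: a code with some other, unexpected prefix would still be reported as unknown rather than silently accepted.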
s
cc @Dan Rammer (hamersaw)
s
How should we treat this @Dan Rammer (hamersaw)? Can prefixes other than RetriesExhausted be prepended, so that it makes sense to take the last segment of the code like this:
final String flyteCode = execution.getClosure().getError().getCode();
final String[] flyteCodesSplitted = flyteCode.split("\\|");
final String lastCode = flyteCodesSplitted[flyteCodesSplitted.length-1];
Or is this enough: https://github.com/spotify/styx/pull/1084/files ? Our scheduler service identifies USER:NotReady Flyte errors and translates them to a code that makes our scheduler retry the execution.
d
Oh this is a difficult one 😅. I'm actually working on an issue right now about dynamic tasks being retried when they shouldn't be: https://github.com/flyteorg/flyte/issues/3606. So my first question is, are you expecting the retries? If the user returns a non-recoverable error, the dynamic task should just propagate that up, which is what the fix for that issue will do. Basically, I think the section you linked should be returning permanent failures in almost every case, because if the call to RecursiveNodeHandler returns a Failed state it means that some task in the subworkflow has already exhausted all of its retries. I think the only retryable failures here should be when dynamic tasks fail internally (i.e. copying inputs, writing outputs, etc.). I am going to run a ton of tests today to validate this for myself.
If the retries are expected, it does seem that removing the RetriesExhausted| prefix would be the best way to identify the actual error code.
s
yes @Brandon Segal I don’t think we expect retries right? It says Attempt 01 for the node execution that failed.
b
That is correct @Sonja Ericsson both the dynamic workflow and the tasks spun up have retries=0
s
@Dan Rammer (hamersaw) Is it possible to configure a task to retry only on certain errors and not others raised by the task?
@task(retries=3)
d
> both the dynamic workflow and the tasks spun up have retries=0
So fixing the issue I linked should fix this problem, correct? If the dynamic task does not return a RetryableFailure, then the RetriesExhausted| prefix will not be prepended to the error code.
> Is it possible to configure a task to retry only on certain errors and not others raised by the task?
This is a point of confusion in a lot of places. Flyte differentiates between retryable and non-retryable errors (called recoverable in the flytekit API). When you throw an error from flytekit, I believe it is non-recoverable by default (@Yee can you confirm?), but you can use a try / catch in flytekit and wrap the error with a recoverable flag to ensure retries. So in this case, it seems the non-recoverable error from flytekit is retried because of the bug in the dynamic task implementation (will fix in the next few days). The initial reason for having retries on tasks was system-level failures: for example, k8s deleting and cleaning up a Pod in the background, preemptible instances, or blobstore read / write failures that aren't part of the actual task execution. All of these are "recoverable" errors. So TL;DR: if the error is non-recoverable, it will not be retried regardless of the number of retries specified on the task.
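The rule in the TL;DR can be written down as a small toy model. This is only a sketch of the decision as described in this message, not flytepropeller's or flytekit's actual implementation; `TaskError` and `should_retry` are hypothetical names, and the 0-based `attempt` counter is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TaskError:
    """Hypothetical stand-in for an error raised by a task."""
    code: str
    # Non-recoverable by default, matching the default described above.
    recoverable: bool = False


def should_retry(error: TaskError, attempt: int, max_retries: int) -> bool:
    """Retry only recoverable errors, and only while retries remain.

    `attempt` is the number of retries already used (assumed 0-based).
    """
    return error.recoverable and attempt < max_retries
```

Under this model a non-recoverable User:NotReady error is never retried, even on a task declared with retries=3, while a recoverable error is retried until the retry budget is used up.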
s
I see. Thanks for the great explanation! We’ll look out for that fix.
b
Aw, that would be great Dan. That would definitely fix the issue we are seeing: since none of the generated tasks return recoverable errors, the workflow would then just fail with a User:NotReady error code, which is exactly what we need and probably the expected behavior.
y
yeah there’s an open request to also add a switch to change the error handling default behavior as well.
by default it’s non-recoverable, you’ll be able to select by-default-recoverable