Hello, Is Flyte capable of MANUALLY resuming the w...
# ask-the-community
f
Hello, is Flyte capable of MANUALLY resuming a workflow from the failed task forward? By manually I mean either from the UI or the CLI. I see a rerun button in the UI to the right of the tasks. Will clicking the rerun button resume the workflow or just rerun the failed task? What about manually marking a failed task as SUCCEEDED and resuming the rest of the tasks from there?
d
Hi @Frank Shen. Regarding resuming after failure, intratask checkpoints could help you achieve that goal: https://docs.flyte.org/projects/cookbook/en/latest/auto/core/control_flow/checkpoint.html#intratask-checkpoints As for manually changing the result of a task, I'm not sure.
f
Sorry @David Espejo (he/him), I forgot to add MANUALLY. I just amended my question. Please advise. Thanks!
Sometimes the task has failed after the configured retries. Then I fixed the bug in that task and I need to resume the flow. I think I have to manually rerun the task to achieve that, correct?
d
Well, if you Relaunch the workflow, it should run through the remainder of the DAG, starting from the failed task
Will clicking the rerun button resume the workflow or just rerun the failed task?
cc @Eduardo Apolinario (eapolinario) / @Dan Rammer (hamersaw) will this require the use of Caching?
d
Hey @Frank Shen, so fixing the failed task and re-registering the workflow will currently require a completely new execution, since the versions of everything change. This is an unfortunate byproduct of Flyte's very opinionated stance on execution reproducibility, data lineage, etc. As @David Espejo (he/him) mentioned, you can use caching of the previous tasks as a workaround here, but obviously this means that any execution with the same cache_version and input values will use the same cache. The "rerun" button on the task should only execute the individual task, and not resume the workflow. I should make sure you're aware of the "recover" functionality as well, although it may not directly apply here (i.e. task updates). Recovery means that Flyte relaunches the workflow, but all of the nodes that previously completed are reused. Say you have a workflow with 3 tasks, where the first and second succeeded and the third failed due to pods being OOM killed or something. You could recover the workflow, and Flyte would recover task executions 1 and 2 before reattempting the 3rd task execution. So this is basically what you want, but it will use the version of the task associated with the initial workflow rather than the updated one.
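To make the relaunch versus recover distinction above concrete, here is a toy model in plain Python. It is only a sketch of the behavior described in this thread, not Flyte's implementation: `Execution`, `run_workflow`, and the tasks `t1`, `t2`, `t3` are hypothetical names invented for illustration, and the "OOM kill" is simulated by raising an exception on the first attempt.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Execution:
    """Toy record of one workflow execution (hypothetical, not a Flyte type)."""
    outputs: Dict[str, int] = field(default_factory=dict)  # outputs of succeeded tasks
    failed: Optional[str] = None                           # name of the task that failed
    ran: List[str] = field(default_factory=list)           # tasks actually executed


def run_workflow(
    tasks: List[Tuple[str, Callable[[int], int]]],
    previous: Optional[Execution] = None,
) -> Execution:
    """Run tasks in order, feeding each output into the next task.

    With previous=None this models a relaunch: every task executes again.
    With a previous execution this models recover: outputs of tasks that
    already succeeded are reused, and only the remaining tasks execute.
    """
    result = Execution()
    value = 0
    for name, fn in tasks:
        if previous is not None and name in previous.outputs:
            value = previous.outputs[name]  # reuse, do not re-execute
        else:
            try:
                value = fn(value)
            except RuntimeError:
                result.failed = name
                return result
            result.ran.append(name)
        result.outputs[name] = value
    return result


attempts = {"t3": 0}


def t1(x: int) -> int:
    return x + 1


def t2(x: int) -> int:
    return x * 10


def t3(x: int) -> int:
    # Fails on the first attempt only, standing in for a transient OOM kill.
    attempts["t3"] += 1
    if attempts["t3"] == 1:
        raise RuntimeError("simulated OOM kill")
    return x + 5


TASKS = [("t1", t1), ("t2", t2), ("t3", t3)]

first = run_workflow(TASKS)                      # t1 and t2 succeed, t3 fails
recovered = run_workflow(TASKS, previous=first)  # recover: only t3 executes
relaunched = run_workflow(TASKS)                 # relaunch: all three execute
```

In this sketch `recovered.ran` contains only `t3`, while `relaunched.ran` contains all three tasks. The toy model reuses the same task functions for the recovered run, which mirrors the point above that recover uses the task version from the initial workflow rather than an updated one.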
f
@Dan Rammer (hamersaw), “recover” functionality is exactly what I wanted. Could you tell / show me how to perform “recover”? I only see “relaunch” button on the workflow level.
Also @Dan Rammer (hamersaw), if the rerun button only reruns the failed task, but does not resume the rest of the workflow, then what is the purpose of rerunning the failed task alone? I think we are missing an important feature here.
f
Thanks a lot @David Espejo (he/him)!
d
Thanks to Dan for the great explanation! @Frank Shen please let us know anything else you may need
f
Ah... I also didn't realize this until now. So the recover behavior is what I expected to happen when I rerun a failed task, i.e. the dependent tasks in the workflow will also run. This is a bit confusing in the UI... If we could also manually abort specific tasks and then recover the whole workflow, that would be exactly what my colleagues are asking for.
And I don't quite understand the difference between clone and relaunch...
s
I believe they're the same.
This is a bit confusing in the UI...
Would you let us know why it isn't clear?
If we could also manually abort specific tasks and then recover the whole workflow, that would be exactly what my colleagues are asking for.
You can terminate a workflow, not the tasks within it. @Yee / @Eduardo Apolinario (eapolinario), is this something we want to support?
f
When you look at a failed task, there is a "rerun" button, and to me it seemed the natural place to, well, rerun that task and, if it succeeds, recover the rest of the workflow. But this is not what the rerun button does (i.e. it does not launch the downstream tasks that depend on it). My colleagues also didn't intuitively understand what rerun does. IMHO it would be much clearer if relaunch and rerun were renamed to clone, as this is much closer to what they actually do.