# ask-the-community
l
Why does Recover rerun map tasks that succeeded? Shouldn't it only rerun the tasks that failed?
e
This sounds like a bug. What version of flytepropeller and admin are you running, Laura? cc: @Dan Rammer (hamersaw)
l
how do I check? it's installed via the helm chart:
flyte-core-v1.2.0
j
did you specify caching in TaskMetadata? if you are relying on the underlying task's cache being on, that will be overridden by TaskMetadata, I think
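For context, a minimal sketch of what relying on the underlying task's cache looks like (the task and workflow names here are illustrative, not from Laura's code):
```python
from typing import List

from flytekit import map_task, task, workflow

# Cache settings declared on the underlying @task that the map task wraps.
@task(cache=True, cache_version="1.0")
def double(a: int) -> int:
    return a * 2

@workflow
def wf(xs: List[int]) -> List[int]:
    return map_task(double)(a=xs)
```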
d
@Laura Lin are you saying that the entire map task is successful and recovery fails, or that some of the subtasks are successful and those subtasks are not recovered?
l
map task with ~450 tasks; everything succeeded except 2 that were stuck in Pending for over an hour, so I terminated the workflow. If I run recover, what should the expected behavior be? Ideally, it would just rerun the two failed tasks, right? No caching specified.
d
In the former case, the entire map task's data should be recoverable. I just ran a test and verified this works, so if you're seeing something else it's a bug. In the latter case, it's a consequence of how map tasks are currently implemented. Right now, these run as a separate plugin, so the recovery process is unable to parse through them. This issue relates to implementing map tasks as a separate node in the Flyte DAG, which is on our roadmap for Q1 '23. That fixes the scenario you mention (recovering subtasks) in addition to enabling map tasks over arbitrary task types (i.e. Spark, Ray, etc.).
So yes, you are right. The expected behavior is to run only the 2 tasks, but that is not how it works right now. With the switch to ArrayNode it will work that way.
l
ah, is there a better way to handle these pending tasks without needing to rerun everything in the meantime?
d
do you know why they were stuck in pending?
l
```
(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "29dc9ef6819b16e613927333fbc2a069b819c5346d69628d580e817e9b1cf8d0": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
```
it happens sometimes when Karpenter spins up new nodes; haven't quite figured it out yet either. so the pod gets stuck in ContainerCreating
d
Interesting, so the pods get stuck in a pending state? It would be interesting to see the pod status for these tasks. If the Pod is just stuck and creating a new Pod will fix the issue, you might be able to trigger a retry from the Flyte side. You just need to manually delete the stuck Pod and Flyte will attempt to retry by creating a new one (default 3 retries).
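A sketch of the manual deletion Dan describes, using the official Kubernetes Python client; the pod name and namespace below are placeholders, you'd look up the stuck subtask's actual pod first:
```python
from kubernetes import client, config

# Load the local kubeconfig (use config.load_incluster_config() when running inside the cluster).
config.load_kube_config()
v1 = client.CoreV1Api()

# Placeholder name/namespace: find the stuck subtask's pod first, e.g. via
# v1.list_namespaced_pod("flytesnacks-development") filtered on a Pending status.
v1.delete_namespaced_pod(
    name="f0632d6d37de845a7937-n1-3-0",
    namespace="flytesnacks-development",
)
```
After the pod is deleted, Flyte should create a new attempt as long as the task has retries left.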
l
I think I've tried that before and the workflow fails with
code:"UnexpectedObjectDeletion" message:"object
or something similar. I don't remember exactly, but it was some kind of deleted-object error.
d
If it exhausts the number of retries it should fail with:
```
SYSTEM::object [flytesnacks-development/f0632d6d37de845a7937-n1-3] terminated in the background, manually
```
Does that look familiar?
But it should retry a configurable number of times.
l
no, is the retry for a system failure configured by default, or do I have to set it to some non-zero value?
d
Ok, so I think flytekit defaults everything to 0 retries, so to allow multiple attempts you have to explicitly specify it. For map tasks this can be done in a few places:
```python
from typing import List

from flytekit import map_task, task, workflow

# Retries can be set on the underlying task...
@task(retries=3)
def foo(a: int) -> int:
    # omitted
    return a

# ...and/or overridden on the mapped node in the workflow.
@workflow
def bar(a: List[int]) -> List[int]:
    mapped_out = map_task(foo)(a=a).with_overrides(retries=3)
    return mapped_out
```
e
Just to confirm, tasks get 0 retries by default in flytekit.
l
would something like the pod getting deleted be considered a recoverable failure? saw this in the docs:
> Recoverable vs. Non-Recoverable failures: Recoverable failures will be retried and counted against the task's retry count. Non-recoverable failures will just fail, i.e., the task isn't retried irrespective of user/system retry configurations. All user exceptions are considered non-recoverable unless the exception is a subclass of FlyteRecoverableException.
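For completeness, a small hedged example of raising a recoverable error so retries kick in; the failure condition here is just a stand-in for a transient infrastructure issue:
```python
import random

from flytekit import task
from flytekit.exceptions.user import FlyteRecoverableException

@task(retries=3)
def flaky(a: int) -> int:
    # Stand-in for a transient problem (e.g. a pod that never got an IP assigned).
    if random.random() < 0.2:
        # Recoverable failures are retried and counted against the task's retry budget;
        # a plain exception here would be treated as non-recoverable.
        raise FlyteRecoverableException("transient failure, please retry")
    return a * 2
```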
d
yes, pods getting deleted are recoverable failures.
d
Hi @Laura Lin, did this workflow finally work for you? Please let us know if any help is needed.
l
yea the retry count for tasks worked.