# ask-the-community
l
Why does Recover rerun map tasks that succeeded? Shouldn't it only rerun the tasks that failed?
e
This sounds like a bug. What version of flytepropeller and admin are you running, Laura? cc: @Dan Rammer (hamersaw)
l
how do I check? it's installed via the helm chart:
flyte-core-v1.2.0
j
did you specify caching in TaskMetadata? if you are relying on the underlying task's cache being on, that will be overridden by TaskMetadata, I think
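For context, a minimal sketch of what relying on the underlying task's cache looks like (the task and workflow names here are illustrative, not from Laura's code):
```python
from typing import List

from flytekit import map_task, task, workflow

# Cache settings declared on the underlying @task that the map task wraps.
@task(cache=True, cache_version="1.0")
def double(a: int) -> int:
    return a * 2

@workflow
def wf(xs: List[int]) -> List[int]:
    return map_task(double)(a=xs)
```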
d
@Laura Lin are you saying that the entire map task is successful and recovery fails, or that some of the subtasks are successful and those subtasks are not recovered?
l
map task with ~450 tasks; everything succeeded except 2 that were stuck in Pending for over an hour, so I terminated the workflow. If I run recover, what should the expected behavior be? Ideally, it would just rerun the two failed tasks, right? No caching specified.
d
In the former case, the entire map task's data should be recoverable. I just ran a test and verified this works, so if you're seeing something else it's a bug. In the latter case, it's a consequence of how map tasks are currently implemented. Right now, these run as a separate plugin, so the recovery process is unable to parse through them. This issue relates to implementing map tasks as a separate node in the Flyte DAG, which is on our roadmap for Q1 '23. That fixes the scenario you mention (recovering subtasks) in addition to enabling map tasks over arbitrary task types (i.e. Spark, Ray, etc.).
So yes, you are right. The expected behavior is to run only the 2 tasks, but that is not how it works right now. With the switch to ArrayNode it will work that way.
l
ah, is there a better way to handle these pending tasks without needing to rerun everything in the meantime?
d
do you know why they were stuck in pending?
l
```
(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "29dc9ef6819b16e613927333fbc2a069b819c5346d69628d580e817e9b1cf8d0": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
```
it happens sometimes when Karpenter spins up new nodes; haven't quite figured it out yet either. so the pod gets stuck in ContainerCreating
d
Interesting, so the pods get stuck in a pending state? It would be interesting to see the pod status for these tasks. If the Pod is just stuck and creating a new Pod will fix the issue, you might be able to trigger a retry from the Flyte side. You just need to manually delete the stuck Pod and Flyte will attempt to retry by creating a new one (default 3 retries).
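A sketch of the manual deletion Dan describes, using the official Kubernetes Python client; the pod name and namespace below are placeholders, you'd look up the stuck subtask's actual pod first:
```python
from kubernetes import client, config

# Load the local kubeconfig (use config.load_incluster_config() when running inside the cluster).
config.load_kube_config()
v1 = client.CoreV1Api()

# Placeholder name/namespace: find the stuck subtask's pod first, e.g. via
# v1.list_namespaced_pod("flytesnacks-development") filtered on a Pending status.
v1.delete_namespaced_pod(
    name="f0632d6d37de845a7937-n1-3-0",
    namespace="flytesnacks-development",
)
```
After the pod is deleted, Flyte should create a new attempt as long as the task has retries left.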
l
I think I've tried that before and the workflow fails with
code:"UnexpectedObjectDeletion" message:"object
or something similar. I don't remember exactly, but it was some kind of deleted-object error.
d
If it exhausts the number of retries it should fail with:
```
SYSTEM::object [flytesnacks-development/f0632d6d37de845a7937-n1-3] terminated in the background, manually
```
Does that look familiar?
But it should retry a configurable number of times.
l
no, is the retry for a system failure configured by default, or do I have to set it to some non-zero value?
d
Ok, so I think flytekit defaults everything to 0 retries, so to allow multiple attempts you have to explicitly specify it. For map tasks this can be done in a few places:
```python
from typing import List

from flytekit import map_task, task, workflow

# Retries can be set on the underlying task...
@task(retries=3)
def foo(a: int) -> int:
    # omitted
    return a

# ...and/or overridden on the mapped node in the workflow.
@workflow
def bar(a: List[int]) -> List[int]:
    mapped_out = map_task(foo)(a=a).with_overrides(retries=3)
    return mapped_out
```
e
Just to confirm, tasks get 0 retries by default in flytekit.
l
would something like the pod getting deleted be considered a recoverable failure? saw this in the docs:
> Recoverable vs. Non-Recoverable failures: Recoverable failures will be retried and counted against the task's retry count. Non-recoverable failures will just fail, i.e., the task isn't retried irrespective of user/system retry configurations. All user exceptions are considered non-recoverable unless the exception is a subclass of FlyteRecoverableException.
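For completeness, a small hedged example of raising a recoverable error so retries kick in; the failure condition here is just a stand-in for a transient infrastructure issue:
```python
import random

from flytekit import task
from flytekit.exceptions.user import FlyteRecoverableException

@task(retries=3)
def flaky(a: int) -> int:
    # Stand-in for a transient problem (e.g. a pod that never got an IP assigned).
    if random.random() < 0.2:
        # Recoverable failures are retried and counted against the task's retry budget;
        # a plain exception here would be treated as non-recoverable.
        raise FlyteRecoverableException("transient failure, please retry")
    return a * 2
```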
d
yes, pods getting deleted are recoverable failures.
d
Hi @Laura Lin, did this workflow finally work for you? Please let us know if any help is needed.
l
yea the retry count for tasks worked.