# ask-the-community
v
Hey all. I'm back with a quick question. I have a situation where a workflow fans out to bunch of nodes, 100+, at some point we're hitting the image pull qps limit and since it takes too long for the pod to start, flyte marks the node as failed and nukes the workflow. So my question is, is there a config I can set to have flyte wait longer for the pod to spin up before giving up?
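For context, the fan-out is roughly this shape (a minimal sketch only; the task, names, and image are made up and not our actual code):
```python
from typing import List

from flytekit import dynamic, task, workflow


@task
def work(i: int) -> int:
    return i * i


@dynamic
def fan_out(n: int) -> List[int]:
    # Each iteration becomes its own pod, so 100+ pods try to pull the
    # task image at roughly the same time.
    return [work(i=i) for i in range(n)]


@workflow
def wf(n: int = 100) -> List[int]:
    return fan_out(n=n)
```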
s
cc @Dan Rammer (hamersaw)
d
@Viljem Skornik do you know what the specific error is? `ImagePullBackoff`? Or something like `task active timeout [%s] expired`?
v
Sure, it's an `ImagePullBackoff` error when the pull QPS is exceeded. What I see in Flyte is:
```
[1/1] currentAttempt done. Last Error: USER::[1/1] currentAttempt done. Last Error: USER::containers with unready status: [primary]|Back-off pulling image ...
```
At this point the node has failed and the whole workflow has failed. But the actual pod will eventually start and do whatever it was supposed to do.
d
Yeah, so we're fixing the Pod deletion on `ImagePullBackoff` errors in this PR. Basically, if Flyte fails a task because of `ImagePullBackoff`, it will then delete the Pod. This was particularly troublesome when the `ImagePullBackoff` was because the image did not exist - then the Pod would never start (and therefore never complete) and would just stick around taking up resources.
Right now the easiest way to get this working is to add retries on the task: attempt 1 fails and Flyte will attempt to execute it again. Other solutions may require code updates. For example, we could have a configurable threshold, similar to the `CreateContainerErrorGracePeriod`, that would apply when failing on `ImagePullBackoff`.
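As a minimal sketch, adding retries with flytekit could look like this (the task body and image name are illustrative, not the actual workflow):
```python
from flytekit import task


@task(retries=3, container_image="ghcr.io/example/worker:latest")
def process_chunk(chunk_id: int) -> int:
    # If attempt 1 fails on ImagePullBackoff, Flyte schedules another
    # attempt instead of failing the whole workflow.
    return chunk_id * 2
```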
v
Interesting, thanks for the link, will follow. If I add retries, won't I basically be spawning multiple pods for the same task?
d
Currently yes. When the linked PR is merged, the first attempt (that failed) will be aborted (so the Pod deleted) and then a new Pod will be created for the 2nd attempt. I think the proposal for a grace-period configuration is a cleaner solution, but given the lower priority it would likely need to come as a community contribution.
v
Excellent, thanks!
`create-container-error-grace-period` is 3 min by default. If I increase it to 15, that should help, right?
d
I don't believe increasing that configuration will currently do anything to fix this issue. I brought it up because we could add another configuration value, something like `image-pull-backoff-grace-period`, that is applied similarly. If this is something you would like to see (or better yet, contribute!), feel free to create a GitHub issue and we can track it.
v
Aha, ok, makes sense.
So, no fix right now; when that PR lands, we can up the retries, which will fix the problem. Adding a config for that would be a fun side quest; hopefully I can find some time to make it happen. Thanks so much! Really appreciated!
s
We think on our end the issue was related to kubelet `registryPullQPS` and `registryBurst` being set too low.