# ask-the-community
v
Hey all. I'm back with a quick question. I have a situation where a workflow fans out to bunch of nodes, 100+, at some point we're hitting the image pull qps limit and since it takes too long for the pod to start, flyte marks the node as failed and nukes the workflow. So my question is, is there a config I can set to have flyte wait longer for the pod to spin up before giving up?
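For context, the fan-out is roughly this shape (a minimal sketch only; the task, names, and image are made up and not our actual code):
```python
from typing import List

from flytekit import dynamic, task, workflow


@task
def work(i: int) -> int:
    return i * i


@dynamic
def fan_out(n: int) -> List[int]:
    # Each iteration becomes its own pod, so 100+ pods try to pull the
    # task image at roughly the same time.
    return [work(i=i) for i in range(n)]


@workflow
def wf(n: int = 100) -> List[int]:
    return fan_out(n=n)
```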
s
cc @Dan Rammer (hamersaw)
d
@Viljem Skornik do you know what the specific error is? `ImagePullBackoff`? Or something like `task active timeout [%s] expired`?
v
Sure, it's an `ImagePullBackoff` error when the pull QPS is exceeded. What I see in Flyte is:
```
[1/1] currentAttempt done. Last Error: USER::[1/1] currentAttempt done. Last Error: USER::containers with unready status: [primary]|Back-off pulling image ...
```
At this point the node has failed and the whole workflow has failed. But the actual pod will eventually start and do whatever it was supposed to do.
d
Yeah, so we're fixing the Pod deletion on `ImagePullBackoff` errors in this PR. Basically, if Flyte fails a task because of `ImagePullBackoff`, it will then delete the Pod. This was particularly troublesome when the `ImagePullBackoff` was because the image did not exist - then the Pod would never start (and therefore never complete) and would just stick around taking up resources.
Right now the easiest way to get this working is to add retries on the task: attempt 1 fails and Flyte will attempt to execute it again. Other solutions may require code updates. For example, we could have a configurable threshold, similar to the `CreateContainerErrorGracePeriod`, that would apply when failing on `ImagePullBackoff`.
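As a minimal sketch, adding retries with flytekit could look like this (the task body and image name are illustrative, not the actual workflow):
```python
from flytekit import task


@task(retries=3, container_image="ghcr.io/example/worker:latest")
def process_chunk(chunk_id: int) -> int:
    # If attempt 1 fails on ImagePullBackoff, Flyte schedules another
    # attempt instead of failing the whole workflow.
    return chunk_id * 2
```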
v
Interesting, thanks for the link, will follow. If I add retries, won't I basically be spawning multiple pods for the same task?
d
Currently yes. When the linked PR is merged, the first attempt (that failed) will be aborted (so the Pod deleted) and then a new Pod will be created for the 2nd attempt. I think the proposal for a grace-period configuration is a cleaner solution, but given the lower priority it would likely need to come as a community contribution.
v
Excellent, thanks!
`create-container-error-grace-period` is 3 min by default. If I increase it to 15, that should help, right?
d
I don't believe increasing that configuration will currently do anything to fix this issue. I brought it up because we could add another configuration value, something like `image-pull-backoff-grace-period`, that is applied similarly. If this is something you would like to see (or better yet, contribute!), feel free to create a GitHub issue and we can track it.
v
Aha, ok, makes sense.
So, no fix right now; when that PR lands, we can up the retries, which will fix the problem. Adding a config for that would be a fun side quest; hopefully I can find some time to make it happen. Thanks so much! Really appreciated!
s
We think on our end the issue was related to kubelet `registryPullQPS` and `registryBurst` being set too low.