Mike Ossareh

05/19/2023, 4:16 PM
We are regularly hitting the issue mentioned in this resolved ticket: In our case we have a
that uses map `@task`s and spins up 10-100s of pods. Our EKS setup is very elastic and sometimes will bin-pack these pods onto the same node. When it does that there's a high chance that one of the pods in the map
will fail with an error similar to the issue linked above. Our kubelet is currently set up to do serial image pulls. So, in theory, once one of the `@task`s pulls the image, it should then be available to all of the other pods on that node. But it seems that's not the case. Initially I thought the fact that Flyte is setting
imagePullPolicy: Always
was a problem, but reading the docs more closely it seems that's not the case.
every time the kubelet launches a container, the kubelet queries the container image registry to resolve the name to an image digest. *If the kubelet has a container image with that exact digest cached locally, the kubelet uses its cached image*; otherwise, the kubelet pulls the image with the resolved digest, and uses that image to launch the container.
(emphasis mine) Has anyone observed this issue? Any recommendations?
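
A minimal sketch of the behaviour described in the quoted docs, assuming a simplified model: the registry is a dict from image reference to digest and the node cache is a set of digests. This is not kubelet source code, and the image name `example/flyte-plaster:23.5.16` and digest are hypothetical placeholders. It only illustrates that with `imagePullPolicy: Always` the registry is queried on every container start, while the pull itself is skipped when the digest is already cached.

```python
# Simplified model of the documented "imagePullPolicy: Always" decision.
# Assumption: registry and cache are plain Python stand-ins, not real kubelet APIs.
def resolve_image(image_ref, registry, local_cache):
    """Return (digest, pulled) for one container start.

    registry:    dict mapping image reference -> digest (stands in for the registry query)
    local_cache: set of digests already present on the node (mutated on pull)
    """
    digest = registry[image_ref]  # the registry is queried on every start
    if digest in local_cache:
        return digest, False      # exact digest cached locally: reuse it, no pull
    local_cache.add(digest)       # otherwise simulate pulling the image
    return digest, True


registry = {"example/flyte-plaster:23.5.16": "sha256:abc"}  # hypothetical values
cache = set()
first = resolve_image("example/flyte-plaster:23.5.16", registry, cache)
second = resolve_image("example/flyte-plaster:23.5.16", registry, cache)
print(first)   # first pod on the node pulls the image
print(second)  # second pod reuses the cached digest
```

Under this model the second pod on a node does not pull again, but it still depends on the registry lookup succeeding.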
example error message:
currentAttempt done. Last Error: UNKNOWN::[34]: code:"ContainersNotReady|ImagePullBackOff" message:"containers with unready status: [f2208032e4b24406b9f9-n1-0-dn0-0-dn10-0-34]|Back-off pulling image \"<redacted>/flyte-plaster:23.5.16\""
If I press the recover button any time we hit a failure like the one above, it does eventually succeed. So might this be a case of the CreateContainerErrorGracePeriod being too tight (3m) for our workloads?
OK, I've bumped the grace period to 10m. Since this is a relatively random occurrence, I'll observe this over time.

Ketan (kumare3)

05/19/2023, 6:17 PM
you have to ensure that requests == limit
otherwise kube will kick the container
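
One reading of this advice is that it refers to Kubernetes QoS classes: a pod whose containers all have CPU and memory requests equal to their limits is classed as Guaranteed, and Guaranteed pods are the last candidates for eviction under node pressure. The sketch below is a simplified, hypothetical helper (`qos_class` is not a Kubernetes or Flyte API) that classifies a pod from plain dicts. It compares quantities as strings, so `"1"` and `"1000m"` would be treated as different values.

```python
# Simplified classification of a pod into a Kubernetes QoS class.
# Assumption: quantities are compared as plain strings, not parsed.
def qos_class(containers):
    """containers: list of dicts with optional 'requests' and 'limits' dicts
    keyed by 'cpu' and 'memory'."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit when a limit is set.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


equal = [{"requests": {"cpu": "1", "memory": "1Gi"},
          "limits": {"cpu": "1", "memory": "1Gi"}}]
higher_limit = [{"requests": {"cpu": "1", "memory": "1Gi"},
                 "limits": {"cpu": "2", "memory": "1Gi"}}]
print(qos_class(equal))         # Guaranteed
print(qos_class(higher_limit))  # Burstable
```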

Mike Ossareh

05/19/2023, 6:33 PM
Interesting; in terms of these tasks, they are
. Is your suggestion that if
limit > requests
that the workloads compete for the underlying resources and take longer to start, and so have a greater chance of hitting the grace period?
I ask, because we have other tasks that are not
Another thought: I observed that some of our nodes were taking approx 2m to get into a Ready state (various daemonsets needed to start up). So if the counter starts from the moment the pod is requested, we were using up most of the 3m grace period just waiting for elastically provisioned nodes.
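
The arithmetic behind that last thought, as a small sketch. It assumes, as hypothesised above and not confirmed in this thread, that the grace period timer starts when the pod is requested, so node startup time is subtracted from it. The helper name `remaining_pull_budget` is made up for illustration.

```python
from datetime import timedelta


# Assumption: the grace period counts from pod request, so time spent waiting
# for the node to become Ready is taken out of the image-pull budget.
def remaining_pull_budget(grace_period, node_ready_delay):
    """Time left for image pull and container start once the node is Ready."""
    remaining = grace_period - node_ready_delay
    return max(remaining, timedelta(0))  # never report a negative budget


node_delay = timedelta(minutes=2)  # observed node startup time from the thread
old_budget = remaining_pull_budget(timedelta(minutes=3), node_delay)
new_budget = remaining_pull_budget(timedelta(minutes=10), node_delay)
print(old_budget)  # 0:01:00
print(new_budget)  # 0:08:00
```

With the original 3m setting only about 1m would be left for serial image pulls across many bin-packed pods; the 10m setting would leave about 8m.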