Mike Ossareh 05/19/2023, 4:16 PM
that uses map `@task`s and spins up 10s-100s of pods. Our EKS setup is very elastic and will sometimes bin-pack these pods onto the same node. When it does, there's a high chance that one of the pods in the map will fail with an error similar to the issue linked above. Our kubelet is currently set up to do serial image pulls, so in theory, once one of the `@task`s pulls the image, it should then be available to all of the other pods. But it seems that's not the case. Initially I thought the fact that Flyte is setting
was a problem, but reading the docs more closely it seems that's not the case.
From the Kubernetes docs:

> every time the kubelet launches a container, the kubelet queries the container image registry to resolve the name to an image digest. *If the kubelet has a container image with that exact digest cached locally, the kubelet uses its cached image*; otherwise, the kubelet pulls the image with the resolved digest, and uses that image to launch the container.

(emphasis mine) Has anyone observed this issue? Any recommendations?
```
currentAttempt done. Last Error: UNKNOWN::: code:"ContainersNotReady|ImagePullBackOff" message:"containers with unready status: [f2208032e4b24406b9f9-n1-0-dn0-0-dn10-0-34]|Back-off pulling image \"<redacted>/flyte-plaster:23.5.16\""
```
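For reference, this is a minimal sketch of the two knobs the quoted doc behavior hinges on: the kubelet's `serializeImagePulls` setting (whose default is `true`, i.e. one pull at a time per node) and the pod-level `imagePullPolicy`. The pod name here is hypothetical; the image tag is the one from the error above.

```yaml
# Node-level: KubeletConfiguration sketch (serializeImagePulls: true is the default).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: true
---
# Pod-level: with a fixed (non-:latest) tag, IfNotPresent lets later pods on the
# node reuse the image that the first pull cached. Hypothetical pod, for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: flyte-task-example
spec:
  containers:
    - name: task
      image: <redacted>/flyte-plaster:23.5.16
      imagePullPolicy: IfNotPresent
```

Note that with `imagePullPolicy: Always` (or a `:latest` tag, which defaults to `Always`), the kubelet re-resolves the digest against the registry on every container launch, which matches the quoted doc passage.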
Mike Ossareh 05/19/2023, 6:33 PM
. Is your suggestion that, if `limit > requests`, the workloads compete for the underlying resource and take longer to start, thus having more chance of hitting the grace period?
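That `limit > requests` scenario can be sketched as a pod spec (hypothetical values, not from the workflow above): the scheduler bin-packs on `requests`, so pods packed onto one node can contend for the headroom between request and limit while they all start at once.

```yaml
# Hypothetical Burstable pod: scheduled against requests, may burst to limits.
apiVersion: v1
kind: Pod
metadata:
  name: burstable-example
spec:
  containers:
    - name: task
      image: <redacted>/flyte-plaster:23.5.16
      resources:
        requests:        # what the scheduler reserves on the node
          cpu: "500m"
          memory: 512Mi
        limits:          # what the container may actually consume
          cpu: "2"
          memory: 2Gi
```

When many such pods land on one node, their summed requests fit, but their summed bursts may not, so startup (including image pull and container init) slows under contention.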