Mike Ossareh 05/19/2023, 4:16 PM
that uses map `@task`s and spins up 10s-100s of pods. Our EKS setup is very elastic and will sometimes bin-pack these pods onto the same node. When it does, there's a high chance that one of the pods in the map will fail with an error similar to the issue linked above. Our kubelet is currently set up to do serial image pulls, so in theory, once one of the `@task`s pulls the image, it should then be available to all of the other pods. But it seems that's not the case. Initially I thought the fact that Flyte is setting
was a problem, but reading the docs more closely it seems that's not the case.
From the Kubernetes docs:

> every time the kubelet launches a container, the kubelet queries the container image registry to resolve the name to an image digest. *If the kubelet has a container image with that exact digest cached locally, the kubelet uses its cached image*; otherwise, the kubelet pulls the image with the resolved digest, and uses that image to launch the container.

(emphasis mine) Has anyone observed this issue? Any recommendations?
```
currentAttempt done. Last Error: UNKNOWN::: code:"ContainersNotReady|ImagePullBackOff" message:"containers with unready status: [f2208032e4b24406b9f9-n1-0-dn0-0-dn10-0-34]|Back-off pulling image \"<redacted>/flyte-plaster:23.5.16\""
```
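For reference, this is a minimal sketch of the two knobs the quoted doc behavior hinges on: the kubelet's `serializeImagePulls` setting (whose default is `true`, i.e. one pull at a time per node) and the pod-level `imagePullPolicy`. The pod name here is hypothetical; the image tag is the one from the error above.

```yaml
# Node-level: KubeletConfiguration sketch (serializeImagePulls: true is the default).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: true
---
# Pod-level: with a fixed (non-:latest) tag, IfNotPresent lets later pods on the
# node reuse the image that the first pull cached. Hypothetical pod, for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: flyte-task-example
spec:
  containers:
    - name: task
      image: <redacted>/flyte-plaster:23.5.16
      imagePullPolicy: IfNotPresent
```

Note that with `imagePullPolicy: Always` (or a `:latest` tag, which defaults to `Always`), the kubelet re-resolves the digest against the registry on every container launch, which matches the quoted doc passage.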
Mike Ossareh 05/19/2023, 6:33 PM
. Is your suggestion that, if `limit > requests`, the workloads compete for the underlying resource and take longer to start, thus having more chance of hitting the grace period?
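That `limit > requests` scenario can be sketched as a pod spec (hypothetical values, not from the workflow above): the scheduler bin-packs on `requests`, so pods packed onto one node can contend for the headroom between request and limit while they all start at once.

```yaml
# Hypothetical Burstable pod: scheduled against requests, may burst to limits.
apiVersion: v1
kind: Pod
metadata:
  name: burstable-example
spec:
  containers:
    - name: task
      image: <redacted>/flyte-plaster:23.5.16
      resources:
        requests:        # what the scheduler reserves on the node
          cpu: "500m"
          memory: 512Mi
        limits:          # what the container may actually consume
          cpu: "2"
          memory: 2Gi
```

When many such pods land on one node, their summed requests fit, but their summed bursts may not, so startup (including image pull and container init) slows under contention.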