# ask-the-community
m
We are regularly hitting the issue mentioned in this resolved ticket: https://github.com/flyteorg/flyte/issues/1234

In our case we have a `@workflow` that uses map `@task`s and spins up 10-100s of pods. Our EKS setup is very elastic and sometimes will bin-pack these pods onto the same node. When it does that, there's a high chance that one of the pods in the map `@task` will fail with an error similar to the issue linked above. Our kubelet is currently set up to do serial image pulls, so in theory, once one of the `@task`s pulls the image it should then be available for all of the other pods. But it seems that's not the case. Initially I thought the fact that Flyte sets `imagePullPolicy: Always` was the problem, but reading the docs more closely it seems that's not the case:

> `Always`: every time the kubelet launches a container, the kubelet queries the container image registry to resolve the name to an image digest. *If the kubelet has a container image with that exact digest cached locally, the kubelet uses its cached image*; otherwise, the kubelet pulls the image with the resolved digest, and uses that image to launch the container.

(emphasis mine) Has anyone observed this issue? Any recommendations?
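For concreteness, a minimal sketch of the kind of setup being described, assuming flytekit's `map_task` API; the task name, input, and container image are placeholders, not the actual code:

```python
from typing import List

from flytekit import Resources, map_task, task, workflow


# Placeholder task and image: each map sub-task runs in its own pod, all
# pulling the same container image through the kubelet.
@task(container_image="my-registry/my-image:1.0.0")
def process_chunk(chunk_id: int) -> int:
    # When EKS bin-packs many of these pods onto one freshly provisioned
    # node, they all queue behind the kubelet's serial image pull.
    return chunk_id


@workflow
def fan_out(chunk_ids: List[int]) -> List[int]:
    # 10-100s of pods, one per element of chunk_ids.
    return map_task(process_chunk)(chunk_id=chunk_ids)
```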
Example error message:
```
currentAttempt done. Last Error: UNKNOWN::[34]: code:"ContainersNotReady|ImagePullBackOff" message:"containers with unready status: [f2208032e4b24406b9f9-n1-0-dn0-0-dn10-0-34]|Back-off pulling image \"<redacted>/flyte-plaster:23.5.16\""
```
If I press the recover button any time we hit a failure like the above, it does eventually succeed. So might this be a case of the `CreateContainerErrorGracePeriod` being too tight (3m) for our workloads?
OK, I've bumped the grace period to 10m. Since this is a relatively random occurrence, I'll observe it over time.
k
you have to ensure that `requests == limits`
otherwise kube will kick the container
m
Interesting; in terms of these tasks, they are `==`. Is your suggestion that if `limits > requests` the workloads compete for the underlying resources and take longer to start, and thus have more chance of hitting the grace period? I ask because we have other tasks that are not `==`.
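For reference, a minimal sketch of what `requests == limits` looks like on a flytekit task (values are placeholders). When requests and limits match for every container in the pod, Kubernetes assigns it the Guaranteed QoS class, which makes it less likely to be evicted under node pressure:

```python
from flytekit import Resources, task


# Placeholder resource values; the point is that requests and limits match
# exactly, so the resulting pod qualifies for the Guaranteed QoS class.
@task(
    requests=Resources(cpu="2", mem="4Gi"),
    limits=Resources(cpu="2", mem="4Gi"),
)
def guaranteed_qos_task(x: int) -> int:
    return x * 2
```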
Another thought: I observed that some of our nodes were taking approx. 2m to reach the Ready state (various DaemonSets needed to start up). So if the counter starts from the moment the pod is requested, we were eating up most of the grace period just elastically provisioning nodes.