Thomas Blom
05/25/2023, 4:45 PM
[85]: code:"ContainersNotReady|ImagePullBackOff" message:"containers with unready status: [f0ff4b2cb0d6d44d1907-n1-0-dn0-0-dn10-0-85]|Back-off pulling image \"(sanitized).dkr.ecr.us-east-1.amazonaws.com/flyte-plaster:23.5.19\""
This is part of a workflow that employs a mappable task for one section. The workflow will complete successfully maybe half of the time. The other half, this error occurs specifically for the map task. I can see from logs that some elements of the mapped task are executing successfully, and in this case it is number 85 (out of 100) that failed.
Victor Gustavo da Silva Oliveira
05/25/2023, 7:39 PM
Thomas Blom
05/25/2023, 8:29 PM
Sam Eckert
06/05/2023, 7:10 PM
Thomas Blom
06/07/2023, 2:06 PM
Victor Gustavo da Silva Oliveira
06/07/2023, 2:07 PM
Ketan (kumare3)
Thomas Blom
06/07/2023, 2:15 PM
Ketan (kumare3)
Dan Rammer (hamersaw)
06/08/2023, 1:08 PM
ImagePullBackoff (as indicated by your message), then Flyte marks the task as a retryable failure, then between that time and when Flyte actually cleans up the failed task it starts. The 20 second runtime seems suspect, but it could be the case.
If this is the case we could explore adding a grace period configuration option similar to the one for CreateContainerError. The idea is that Flyte wouldn't fail immediately; rather, it would ensure that N seconds have passed since the task began.
Ketan (kumare3)
Thomas Blom
06/08/2023, 2:11 PM
Dan Rammer (hamersaw)
07/05/2023, 7:49 PM
ImagePullBackoff resources. So the task would fail, but Flyte would leave the Pod until completion. I'm wondering if this is the issue here.
Thomas Blom
07/05/2023, 9:09 PM
Dan Rammer (hamersaw)
07/05/2023, 9:12 PM
"which to my understanding has the propeller component integrated"
correct. the single binary has all of the components bundled.
"which we've since bumped to 1.6.2"
so it is likely that this fix was not in the previous deployment.
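To make the grace-period idea raised earlier in the thread concrete, here is a minimal Python sketch. It is not Flyte's implementation: the function name, the 3-minute value, and the timestamps are hypothetical and only illustrate "don't fail until N seconds have passed since the task began".

```python
from datetime import datetime, timedelta


def should_fail_on_image_pull_backoff(
    started_at: datetime, now: datetime, grace_period: timedelta
) -> bool:
    """Return True only once the grace period has elapsed since the task began.

    Hypothetical helper: while the Pod is still inside the grace period,
    an ImagePullBackoff is treated as transient and the task is not failed.
    """
    return now - started_at >= grace_period


# Illustrative values only (assumptions, not Flyte defaults).
start = datetime(2023, 7, 6, 21, 0, 0)
grace = timedelta(minutes=3)

# 20 seconds in: still within the grace period, so keep waiting.
print(should_fail_on_image_pull_backoff(start, start + timedelta(seconds=20), grace))  # False

# 5 minutes in: the grace period has elapsed, so declare the failure.
print(should_fail_on_image_pull_backoff(start, start + timedelta(minutes=5), grace))  # True
```

With a check like this, a short-lived pull throttle would not immediately consume a retry; only a back-off that outlasts the configured window would fail the task.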
Thomas Blom
07/05/2023, 9:13 PM
Dan Rammer (hamersaw)
07/05/2023, 9:20 PM
We are certainly going to add a grace period for the ImagePullBackoff failure as well (issue pending). Because right now, the first time Flyte sees the ImagePullBackoff it will fail the task, deleting the Pod, and then if retries are configured launch a new Pod. It probably makes more sense to wait a configurable amount of time on ImagePullBackoff before declaring a task a failure.
Thomas Blom
07/06/2023, 8:41 PM
registryPullQPS and/or registryBurst config params for kubelet to fix the initial "Pull QPS exceeded" which seems to be the root cause.
Dan Rammer (hamersaw)
07/06/2023, 9:36 PM
Thomas Blom
07/06/2023, 9:37 PM
"We are certainly going to add a grace period for the ImagePullBackoff failure as well (issue pending)."
Do you have a feel for when the above might make it in? In the meantime I'm going to explore configuring kubelet to allow more throughput on image pulls.
Dan Rammer (hamersaw)
07/06/2023, 9:43 PM
ImagePullBackoff will not work on maptask because the abort is handled differently. but it should fix this issue in dynamics / etc. i'm currently wrapping up a huge effort completely updating how maptasks are executed internally with the ArrayNode work. once that's in (hopefully next few weeks) the Pod deletion on ImagePullBackoff will work in maptasks. as you suggested though, maybe not the exact behavior you want 😄, but a big relief for a lot of maptask issues.
Thomas Blom
07/06/2023, 9:48 PM
Dan Rammer (hamersaw)
07/06/2023, 9:54 PM
Thomas Blom
07/07/2023, 2:02 PM
Dan Rammer (hamersaw)
07/08/2023, 12:31 AM
Sam Eckert
07/10/2023, 4:39 PM
Thomas Blom
07/10/2023, 9:46 PM
Version 0.1.10, AppVersion 1.16.0 and flyteagent.deployment.image.tag version of 1.62b1, among others.
I'm not finding any clear-cut "release/version history" for the flyte-binary as I might expect. Can you shed any light on this? I'm trying to understand how to look out for what you refer to above, the "next flyte single binary release".
Thanks!
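For the kubelet-side workaround mentioned earlier in the thread, here is a minimal sketch that builds a KubeletConfiguration fragment in Python and prints it as JSON (the kubelet config file accepts JSON or YAML). `registryPullQPS` and `registryBurst` are existing KubeletConfiguration fields; the helper name and the values 20 and 40 are illustrative assumptions, not recommendations, and how the file reaches the nodes depends on the cluster setup and is not covered here.

```python
import json


def kubelet_image_pull_config(pull_qps: int, burst: int) -> dict:
    """Build a KubeletConfiguration fragment that raises image pull limits.

    Hypothetical helper: only the two registry-related fields are set;
    a real config file would be merged with the node's other settings.
    """
    if pull_qps < 0 or burst < 0:
        raise ValueError("registryPullQPS and registryBurst must be non-negative")
    return {
        "apiVersion": "kubelet.config.k8s.io/v1beta1",
        "kind": "KubeletConfiguration",
        "registryPullQPS": pull_qps,  # sustained image pulls per second
        "registryBurst": burst,       # short-term burst allowance
    }


# Illustrative values only (assumptions).
config = kubelet_image_pull_config(pull_qps=20, burst=40)
print(json.dumps(config, indent=2))
```

Raising these limits targets the "Pull QPS exceeded" throttle on a single node; it does not change how Flyte reacts once an ImagePullBackoff is reported.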