microscopic-furniture-57275
05/25/2023, 4:45 PM
[85]: code:"ContainersNotReady|ImagePullBackOff" message:"containers with unready status: [f0ff4b2cb0d6d44d1907-n1-0-dn0-0-dn10-0-85]|Back-off pulling image \"(sanitized).dkr.ecr.us-east-1.amazonaws.com/flyte-plaster:23.5.19\""
This is part of a workflow that employs a mappable task for one section. The workflow will complete successfully maybe half of the time. The other half, this error occurs specifically for the map task. I can see from logs that some elements of the mapped task are executing successfully, and in this case it is number 85 (out of 100) that failed.
microscopic-furniture-57275
05/25/2023, 7:27 PM
dry-ability-69144
05/25/2023, 7:39 PM
dry-ability-69144
05/25/2023, 7:39 PM
microscopic-furniture-57275
05/25/2023, 8:29 PM
crooked-artist-67935
06/05/2023, 7:10 PM
microscopic-furniture-57275
06/07/2023, 2:06 PM
dry-ability-69144
06/07/2023, 2:07 PM
freezing-airport-6809
microscopic-furniture-57275
06/07/2023, 2:15 PM
microscopic-furniture-57275
06/07/2023, 2:16 PM
microscopic-furniture-57275
06/07/2023, 2:18 PM
microscopic-furniture-57275
06/07/2023, 2:21 PM
freezing-airport-6809
hallowed-mouse-14616
06/08/2023, 1:08 PM
[...] ImagePullBackoff (as indicated by your message), then Flyte marks the task as a retryable failure, and then, between that time and when Flyte actually cleans up the failed task, it starts. The 20-second runtime seems suspect, but it could be the case.
If this is the case, we could explore adding a grace period configuration option similar to the one for CreateContainerError. The idea is that Flyte wouldn't fail immediately; rather, it would ensure that N seconds have passed since the task began.
freezing-airport-6809
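The grace period idea described above can be sketched in a few lines. This is only an illustration of the proposed behavior, not Flyte's code: the function name `should_fail_on_image_pull_backoff`, its parameters, and the 180-second value are all hypothetical.

```python
from datetime import datetime, timedelta


def should_fail_on_image_pull_backoff(pod_started_at, now, grace_period_seconds):
    """Hypothetical check: treat ImagePullBackoff as a failure only after
    the grace period has elapsed since the task's Pod was started."""
    return now - pod_started_at >= timedelta(seconds=grace_period_seconds)


started = datetime(2023, 6, 8, 13, 0, 0)

# 20 seconds in (the runtime mentioned above): still inside a 180 s grace period.
print(should_fail_on_image_pull_backoff(started, started + timedelta(seconds=20), 180))   # False

# 200 seconds in: the grace period has passed, so the task would be failed.
print(should_fail_on_image_pull_backoff(started, started + timedelta(seconds=200), 180))  # True
```

With a check like this, a Pod that recovers from a transient pull error within the window would never be marked as failed.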
microscopic-furniture-57275
06/08/2023, 2:11 PM
microscopic-furniture-57275
07/03/2023, 8:35 PM
hallowed-mouse-14616
07/05/2023, 7:49 PM
[...] ImagePullBackoff resources. So the task would fail, but Flyte would leave the Pod until completion. I'm wondering if this is the issue here.
microscopic-furniture-57275
07/05/2023, 9:09 PM
hallowed-mouse-14616
07/05/2023, 9:12 PM
> which to my understanding has the propeller component integrated
correct. the single binary has all of the components bundled.
> which we've since bumped to 1.6.2
so it is likely that this fix was not in the previous deployment.
microscopic-furniture-57275
07/05/2023, 9:13 PM
hallowed-mouse-14616
07/05/2023, 9:20 PM
We are certainly going to add a grace period for the ImagePullBackoff failure as well (issue pending). Because right now, the first time Flyte sees the ImagePullBackoff it will fail the task, deleting the Pod, and then if retries are configured launch a new Pod. It probably makes more sense to wait a configurable amount of time on ImagePullBackoff before declaring a task a failure.
microscopic-furniture-57275
07/06/2023, 8:41 PM
[...] registryPullQPS and/or registryBurst config params for kubelet to fix the initial "Pull QPS exceeded" which seems to be the root cause.
hallowed-mouse-14616
07/06/2023, 9:36 PM
microscopic-furniture-57275
07/06/2023, 9:37 PM
microscopic-furniture-57275
07/06/2023, 9:38 PM
microscopic-furniture-57275
07/06/2023, 9:40 PM
> We are certainly going to add a grace period for the ImagePullBackoff failure as well (issue pending).
Do you have a feel for when the above might make it in? In the meantime I'm going to explore configuring kubelet to allow more throughput on image pulls.
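To get a feel for why raising registryPullQPS and/or registryBurst might help, here is a rough simulation. It assumes the kubelet's pull limit behaves like a simple token bucket and that all 100 map task Pods hit one node at the same instant, which is a worst case and probably not what happens on a real cluster. The `TokenBucket` class and `count_rejected` helper are illustrative names, and the 50/100 values are arbitrary; 5 and 10 are the kubelet defaults for registryPullQPS and registryBurst.

```python
class TokenBucket:
    """Simplified token bucket: refills at `qps` tokens/second, holds at most `burst`."""

    def __init__(self, qps, burst):
        self.qps = qps
        self.burst = burst
        self.tokens = float(burst)  # bucket starts full
        self.last = 0.0             # time of the previous request, in seconds

    def try_accept(self, now):
        # Refill for the time elapsed since the last request, capped at `burst`.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.qps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # roughly where a "pull QPS exceeded" style rejection would occur


def count_rejected(num_pulls, qps, burst, at=0.0):
    """Count how many of `num_pulls` simultaneous pull requests are rejected."""
    bucket = TokenBucket(qps, burst)
    return sum(0 if bucket.try_accept(at) else 1 for _ in range(num_pulls))


print(count_rejected(100, qps=5, burst=10))    # 90
print(count_rejected(100, qps=50, burst=100))  # 0
```

In this model only the burst size matters for pulls arriving at the same instant; the QPS value governs how quickly capacity recovers afterwards.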
hallowed-mouse-14616
07/06/2023, 9:43 PM
hallowed-mouse-14616
07/06/2023, 9:47 PM
[...] ImagePullBackoff will not work on maptask because the abort is handled differently. but it should fix this issue in dynamics / etc. i'm currently wrapping up a huge effort completely updating how maptasks are executed internally with the ArrayNode work. once that's in (hopefully next few weeks) the Pod deletion on ImagePullBackoff will work in maptasks. as you suggested though, maybe not the exact behavior you want 😄, but a big relief for a lot of maptask issues.
microscopic-furniture-57275
07/06/2023, 9:48 PM
hallowed-mouse-14616
07/06/2023, 9:54 PM
hallowed-mouse-14616
07/07/2023, 7:07 AM
microscopic-furniture-57275
07/07/2023, 2:02 PM
hallowed-mouse-14616
07/08/2023, 12:31 AM
crooked-artist-67935
07/10/2023, 4:39 PM
microscopic-furniture-57275
07/10/2023, 9:46 PM
[...] Version 0.1.10, AppVersion 1.16.0 and flyteagent.deployment.image.tag version of 1.62b1, among others.
I'm not finding any clear-cut "release/version history" for the flyte-binary as I might expect. Can you shed any light on this? I'm trying to understand how to look out for what you refer to above, the "next flyte single binary release".
Thanks!