Thomas Blom 05/25/2023, 4:45 PM
This is part of a workflow that uses a map task for one section. The workflow completes successfully maybe half of the time; the other half, this error occurs, specifically on the map task. I can see from the logs that some elements of the mapped task execute successfully; in this case it is element 85 (out of 100) that failed.
: code:"ContainersNotReady|ImagePullBackOff" message:"containers with unready status: [f0ff4b2cb0d6d44d1907-n1-0-dn0-0-dn10-0-85]|Back-off pulling image \"(sanitized).<http://dkr.ecr.us-east-1.amazonaws.com/flyte-plaster:23.5.19\%22%22|dkr.ecr.us-east-1.amazonaws.com/flyte-plaster:23.5.19\"">
Victor Gustavo da Silva Oliveira 05/25/2023, 7:39 PM
Thomas Blom 05/25/2023, 8:29 PM
Sam Eckert 06/05/2023, 7:10 PM
Thomas Blom 06/07/2023, 2:06 PM
Victor Gustavo da Silva Oliveira 06/07/2023, 2:07 PM
Thomas Blom 06/07/2023, 2:15 PM
Dan Rammer (hamersaw) 06/08/2023, 1:08 PM
So if the Pod fails with ImagePullBackOff (as indicated by your message), then Flyte marks the task as a retryable failure, and between that time and when Flyte actually cleans up the failed task, it starts. The 20 second runtime seems suspect, but it could be the case. If so, we could explore adding a grace period configuration option similar to the CreateContainerError one. The idea is that Flyte wouldn't fail immediately; rather, it would ensure that N seconds have passed since the task began before declaring failure.
Thomas Blom 06/08/2023, 2:11 PM
Dan Rammer (hamersaw) 07/05/2023, 7:49 PM
In earlier releases there was an issue where failing a task did not clean up the underlying k8s resources. So the task would fail, but Flyte would leave the Pod running until completion. I'm wondering if this is the issue here.
Thomas Blom 07/05/2023, 9:09 PM
Dan Rammer (hamersaw) 07/05/2023, 9:12 PM
> which to my understanding has the propeller component integrated
Correct, the single binary has all of the components bundled.
> which we've since bumped to 1.6.2
So it is likely that this fix was not in the previous deployment.
Thomas Blom 07/05/2023, 9:13 PM
Dan Rammer (hamersaw) 07/05/2023, 9:20 PM
We are certainly going to add a grace period for the ImagePullBackOff failure as well (issue pending). Because right now, the first time Flyte sees the ImagePullBackOff it will fail the task, deleting the Pod, and then, if retries are configured, launch a new Pod. It probably makes more sense to wait a configurable amount of time on ImagePullBackOff before declaring the task a failure.
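For reference, the existing grace period lives in the k8s plugin config; here is a sketch of what the ImagePullBackOff analogue might look like (the second key is assumed, pending the issue above):
```yaml
# FlytePropeller k8s plugin config (sketch)
plugins:
  k8s:
    # exists today: don't fail on CreateContainerError until this much
    # time has passed since the task started
    create-container-error-grace-period: 3m0s
    # hypothetical analogue for ImagePullBackOff -- name and default
    # assumed here, pending the issue mentioned above
    image-pull-backoff-grace-period: 3m0s
```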
Thomas Blom 07/06/2023, 8:41 PM
I'm going to look at kubelet config params for image pulls to fix the initial "Pull QPS exceeded" error, which seems to be the root cause.
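Presumably something like this in the KubeletConfiguration, if I'm reading the kubelet docs right (values are illustrative, not tuned):
```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# the limits behind "pull QPS exceeded" when a map task fans out many
# pods onto one node; kubelet defaults are 5 and 10 respectively
registryPullQPS: 10
registryBurst: 20
# optionally allow parallel image pulls (serialized by default)
serializeImagePulls: false
```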
Dan Rammer (hamersaw) 07/06/2023, 9:36 PM
Thomas Blom 07/06/2023, 9:37 PM
> We are certainly going to add a grace period for the ImagePullBackOff failure as well (issue pending).
Do you have a feel for when the above might make it in? In the meantime I'm going to explore configuring kubelet to allow more throughput on image pulls.
Dan Rammer (hamersaw) 07/06/2023, 9:43 PM
the Pod deletion fix will not work on maptask because the abort is handled differently. but it should fix this issue in dynamics / etc. i'm currently wrapping up a huge effort completely updating how maptasks are executed internally with the ArrayNode work. once that's in (hopefully next few weeks) the Pod deletion on ImagePullBackOff will work in maptasks. as you suggested though, maybe not the exact behavior you want 😄, but a big relief for a lot of maptask issues.
Thomas Blom 07/06/2023, 9:48 PM
Dan Rammer (hamersaw) 07/06/2023, 9:54 PM
Thomas Blom 07/07/2023, 2:02 PM
Dan Rammer (hamersaw) 07/08/2023, 12:31 AM
Sam Eckert 07/10/2023, 4:39 PM
Thomas Blom 07/10/2023, 9:46 PM
…among others. I'm not finding any clear-cut "release/version history" for the flyte-binary as I might expect. Can you shed any light on this? I'm trying to understand how to look out for what you refer to above, the "next flyte single binary release". Thanks!