# ask-the-community

Thomas Blom

05/25/2023, 4:45 PM
I'm not sure if my question belongs here or in #flytekit ... I'm seeing map-tasks commonly fail with the error:
[85]: code:"ContainersNotReady|ImagePullBackOff" message:"containers with unready status: [f0ff4b2cb0d6d44d1907-n1-0-dn0-0-dn10-0-85]|Back-off pulling image \"(sanitized).dkr.ecr.us-east-1.amazonaws.com/flyte-plaster:23.5.19\""
This is part of a workflow that employs a mappable task for one section. The workflow completes successfully maybe half of the time. The other half, this error occurs specifically for the map task. I can see from logs that some elements of the mapped task are executing successfully, and in this case it is number 85 (out of 100) that failed. These are all executing on the same node (this is by design for mapped tasks, yes?), and they all use the same image, which is clearly already on the node since other elements from the mapped task "list" are executing.
Edit: The above was incorrect. I misread/misunderstood the "node" aspect: a map task runs within a single "workflow node", not a single compute node (e.g. an EC2 instance). In fact, the 100 elements of the map task are being executed across 5 different machines. Still, the puzzle remains (see the reply): the task that supposedly failed due to "ContainersNotReady|ImagePullBackOff" was in fact running when the job failed. So how could the container not be ready, or the image need pulling? Thoughts?
An additional perplexing bit of information is that based on a review of logs, the map-task "element" 85 pod was actually running for 20 seconds doing computations before things were halted due to this error.

Victor Gustavo da Silva Oliveira

05/25/2023, 7:39 PM
When you register the tasks, which image tag are you using? If it does not exactly match an image in ECR, you receive this error.
At least it works this way for me.

Thomas Blom

05/25/2023, 8:29 PM
Hi @Victor Gustavo da Silva Oliveira - we use a single image for all tasks, so when a new image is built, a new registration pass is done for all tasks/workflows. This workflow described above works about half the time, so I don't think it is a mismatch in image tag spec, otherwise it would never work.

Sam Eckert

06/05/2023, 7:10 PM
We're seeing this exact behavior as well with a single custom image. We've seen it with dynamic tasks that spin up a bunch of child tasks as well as in map tasks. Did you happen to identify any root causes here?

Thomas Blom

06/07/2023, 2:06 PM
We have not; I'm hoping this will catch the attention of some Union engineers. 🙂 I also continue to see this -- just as you report -- in map tasks, and also sometimes in @dynamic workflows that execute a number (e.g. 15-20) of "child" tasks.

Victor Gustavo da Silva Oliveira

06/07/2023, 2:07 PM
@Ketan (kumare3)

Ketan (kumare3)

06/07/2023, 2:15 PM
Is this for map tasks only?

Thomas Blom

06/07/2023, 2:15 PM
No, as indicated above, we both see it for non-map tasks as well.
It would appear to be an issue when a "large" number of tasks (I've seen it with as few as 15) get started one after the other.
In exploring this error message "ContainersNotReady|ImagePullBackOff" I thought perhaps there was an issue with node startup timing out, or image-download-throttling, or something causing a timeout. But in my investigations I've seen it fail in situations where the image MUST already be on the node. In the case I mentioned in the OP, based on log files, it even seemed the task was in fact already running!
Unfortunately I have been busy on 10 other things, so have not constructed a simple repro case; these were observed in our production workflows. In the meantime, we've mitigated by stopping the use of the map task, and in the case of the @dynamic that I saw fail, this is a FixedRate task, so the failure case will get re-run anyway. So we're living with it for the moment.

Ketan (kumare3)

06/08/2023, 5:12 AM
We have not really seen this, but let me cc @Dan Rammer (hamersaw) on this. He is actually completely reworking maptasks in the backend - it's a massive refactor and makes map tasks extremely powerful. Goal 1: make it the same (just less buggy). Goal 2: make it more powerful - extend the paradigm to many other task types, etc.

Dan Rammer (hamersaw)

06/08/2023, 1:08 PM
@Thomas Blom can you provide a little more information? Specifically: (1) which logs are you referring to that indicate the task is executing? k8s? flyte? and do you have the messages? (2) do you have the pod yaml dump from k8s? this would help immensely. We handle transitioning k8s pod state to a flyte phase in this function. Basically, there is a mapping of values. What I'm wondering is if there is something like an ImagePullBackoff (as indicated by your message), then Flyte marks the task as a retryable failure, and then, between that time and when Flyte actually cleans up the failed task, the container starts. The 20 second runtime seems suspect, but it could be the case. If this is the case we could explore adding a grace period configuration option similar to the one for CreateContainerError. The idea is that Flyte wouldn't immediately fail; rather, it would ensure that N seconds have passed since the task began.
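The grace-period idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Flyte's code (flytepropeller is written in Go): the names `Phase`, `ContainerStatus`, and `pod_phase` are hypothetical, and the pod state is reduced to a single container with an optional waiting reason.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Phase(Enum):
    """Hypothetical, simplified task phases (not Flyte's real phase enum)."""
    INITIALIZING = "initializing"
    RUNNING = "running"
    RETRYABLE_FAILURE = "retryable_failure"


@dataclass
class ContainerStatus:
    """Hypothetical, reduced view of one container's state in a pod."""
    running: bool
    waiting_reason: Optional[str] = None  # e.g. "ImagePullBackOff"


def pod_phase(status: ContainerStatus, started_at: datetime, now: datetime,
              grace_period: timedelta) -> Phase:
    """Map container state to a phase, tolerating image pull back-off for a while."""
    if status.running:
        return Phase.RUNNING
    if status.waiting_reason == "ImagePullBackOff":
        # Inside the grace period the back-off is treated as transient,
        # so the task is reported as still initializing rather than failed.
        if now - started_at < grace_period:
            return Phase.INITIALIZING
        return Phase.RETRYABLE_FAILURE
    # Any other waiting state is treated as ordinary start-up in this sketch.
    return Phase.INITIALIZING
```

With `grace_period=timedelta(0)` the sketch reproduces the behavior described in this thread (fail on the first observed back-off); a value of a few minutes would give kubelet time to retry the pull before the task is declared failed.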

Ketan (kumare3)

06/08/2023, 1:11 PM
The image pull back-off could be because of throttling - @Dan Rammer (hamersaw) - great point. I think you might be right

Thomas Blom

06/08/2023, 2:11 PM
@Dan Rammer (hamersaw) thanks very much for your thoughtful reply. I'm on a deadline for some unrelated work, so may not be able to provide this information for a few days -- but we need to solve this and I will absolutely get back to you with more logs and investigation results. Your theory sounds plausible and a grace period sounds like a good solution in this case. Also note that @Sam Eckert reported seeing the same things as I have, so perhaps Sam can provide more log/repro info as well. I'll need to try to repro to get the log information you want -- I went back to the execution in the Flyte Console that I ref'd in the OP, but whereas previously the maptask could be unfolded, and I could see execution/log info on each element of the mapped task, it now only shows a summary of the tasks - I'm not sure if this is some UI change, or the behavior once a task is much older and the pods no longer exist, or whether I'm misremembering. 🙂 More later.

Dan Rammer (hamersaw)

07/05/2023, 7:49 PM
@Thomas Blom do you know what version of flytepropeller you're running? There was an issue, that I fixed, where Flyte didn't correctly clean up ImagePullBackoff resources. So the task would fail, but Flyte would leave the Pod until completion. I'm wondering if this is the issue here.

Thomas Blom

07/05/2023, 9:09 PM
@Dan Rammer (hamersaw) we are using the flyte binary backend, which to my understanding has the propeller component integrated? The flyte-binary version as of those logs above was 1.5.0, which we've since bumped to 1.6.2 a few weeks ago. I haven't run any tests since the last upgrade.

Dan Rammer (hamersaw)

07/05/2023, 9:12 PM
"which to my understanding has the propeller component integrated"
correct. the single binary has all of the components bundled.
"which we've since bumped to 1.6.2"
so it is likely that this fix was not in the previous deployment.

Thomas Blom

07/05/2023, 9:13 PM
ok, let me do some testing with our current versions and create a simple repro case if I see the problem still...

Dan Rammer (hamersaw)

07/05/2023, 9:20 PM
We are certainly going to add a grace period for the ImagePullBackoff failure as well (issue pending). Because right now, the first time Flyte sees the ImagePullBackoff it will fail the task, deleting the Pod, and then if retries are configured launch a new Pod. It probably makes more sense to wait a configurable amount of time on ImagePullBackoff before declaring a task a failure.

Thomas Blom

07/06/2023, 8:41 PM
Hey @Dan Rammer (hamersaw), I did a test with flytekit 1.7 and I still see the issue of map tasks being reported as failed even though they actually end up running successfully. See the attached image indicating two failed map tasks (out of 100), and see the logging from one of these that Flyte reports as failed (though it actually completed successfully). It seems Flyte is seeing the initial ImagePull error and marking it as failed, but a retry causes the image to get fetched and the task runs to completion. But my workflow aborted as soon as Flyte decided one of them failed. 😞 I'm not sure what the right answer here is: this could be fixed either by Flyte waiting a bit before calling it failed, or (from what I read) by us bumping up our registryPullQPS and/or registryBurst config params for kubelet to fix the initial "Pull QPS exceeded" which seems to be the root cause.
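Why a burst of pods can hit "Pull QPS exceeded" can be illustrated with a small token-bucket simulation. This is a rough model, not kubelet's implementation: the function name `pulls_throttled` is hypothetical, and it assumes every pod start issues one pull request, that requests arrive at a fixed interval, and that the bucket starts full. The values 5 and 10 used below are, to my knowledge, the kubelet defaults for `registryPullQPS` and `registryBurst`; treat them as an assumption and check your own cluster.

```python
def pulls_throttled(num_pulls: int, qps: float, burst: int,
                    interval_s: float = 0.0) -> int:
    """Count pull requests rejected by a simple token-bucket limiter.

    The bucket starts full with `burst` tokens and refills at `qps` tokens
    per second. Requests arrive `interval_s` seconds apart; a request that
    finds no token is rejected immediately instead of waiting.
    """
    tokens = float(burst)
    rejected = 0
    for i in range(num_pulls):
        if i > 0:
            # Refill for the time elapsed since the previous request.
            tokens = min(float(burst), tokens + qps * interval_s)
        if tokens >= 1.0:
            tokens -= 1.0
        else:
            rejected += 1
    return rejected


# About 20 map-task pods landing on one node at the same instant
# (100 elements spread over 5 machines, as described earlier in the thread).
print(pulls_throttled(20, qps=5, burst=10))   # assumed default limits
print(pulls_throttled(20, qps=20, burst=40))  # raised limits
```

Under these assumptions, half of the 20 simultaneous requests are rejected with the assumed defaults and none with the raised limits, which matches the intuition that raising `registryPullQPS` and `registryBurst` in the kubelet configuration, or spreading pod starts out over time, should reduce the initial pull errors.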

Dan Rammer (hamersaw)

07/06/2023, 9:36 PM
Did you just upgrade flytekit? The mentioned fix will require flytepropeller to be updated.

Thomas Blom

07/06/2023, 9:37 PM
We are running v1.6.2 of the flyte single binary.
And, though it would be good for the failure status to be correct, really what we want is for it not to fail. 🙂
"We are certainly going to add a grace period for the ImagePullBackoff failure as well (issue pending)."
Do you have a feel for when the above might make it in? In the meantime I'm going to explore configuring kubelet to allow more throughput on image pulls.

Dan Rammer (hamersaw)

07/06/2023, 9:43 PM
re ^ I'll get a PR submitted by tomorrow. will update this thread accordingly.
and now I see what is happening here. the propeller fix for deleting pods on ImagePullBackoff will not work on maptask because the abort is handled differently. but it should fix this issue in dynamics / etc. i'm currently wrapping up a huge effort completely updating how maptasks are executed internally with the ArrayNode work. once that's in (hopefully next few weeks) the Pod deletion on ImagePullBackoff will work in maptasks. as you suggested though, maybe not the exact behavior you want 😄, but a big relief for a lot of maptask issues.

Thomas Blom

07/06/2023, 9:48 PM
Great, thanks for the update!

Dan Rammer (hamersaw)

07/06/2023, 9:54 PM

Thomas Blom

07/07/2023, 2:02 PM
Thanks @Dan Rammer (hamersaw)! I'll look for this to show up in a subsequent release. Presumably this will be a release of flyte-propeller, in our case bundled in the single flyte binary, since I'd assume all logic related to marking pods as failed exists in that backend code.

Dan Rammer (hamersaw)

07/08/2023, 12:31 AM
Exactly, so this was merged in propeller in the v1.1.105 release which should be included in the next flyte single binary release.

Sam Eckert

07/10/2023, 4:39 PM
Sorry all, I was out last week. It seems like there is a solution proposed above, which is great. On our end we tuned some k8s params to decrease the amount of node consolidation that was happening and increased our EC2 pull limits which seems to have alleviated the issue for now.

Thomas Blom

07/10/2023, 9:46 PM
@Dan Rammer (hamersaw) Above you mention that your grace period for image-pull timeouts will be included in the next flyte single binary release. I've looked around and can't seem to find a release history (or version numbers I understand) for the single-binary. I found the page that documents how to install it via a Helm chart, and the README.md in the helm charts folder refers to various version numbers, including Version 0.1.10, AppVersion 1.16.0 and flyteagent.deployment.image.tag version of 1.62b1, among others. I'm not finding any clear-cut "release/version history" for the flyte-binary as I might expect. Can you shed any light on this? I'm trying to understand how to look out for what you refer to above, the "next flyte single binary release". Thanks!