I'm not sure if my question belongs here or in #CREL4QVAQ | Flyte #flyte-support

microscopic-furniture-57275

05/25/2023, 4:45 PM

I'm not sure if my question belongs here or in #CREL4QVAQ ... I'm seeing map-tasks commonly fail with the error:

[85]: code:"ContainersNotReady|ImagePullBackOff" message:"containers with unready status: [f0ff4b2cb0d6d44d1907-n1-0-dn0-0-dn10-0-85]|Back-off pulling image \"(sanitized).dkr.ecr.us-east-1.amazonaws.com/flyte-plaster:23.5.19\""

This is part of a workflow that uses a mappable task for one section. The workflow completes successfully about half of the time. The other half, this error occurs specifically for the map task. I can see from the logs that some elements of the mapped task are executing successfully, and in this case it is number 85 (out of 100) that failed. These are all executing on the same node (this is by design for mapped tasks, yes?), and they all use the same image, which is clearly already on the node since other elements from the mapped task "list" are executing.

Edit: The above was incorrect. I misread/misunderstood the "node" aspect: a map task runs within a single "workflow node", not a single compute node (e.g. an EC2 instance). In fact, the 100 elements of the map task are being executed across 5 different machines. Still, the puzzle remains (see the reply): the task that supposedly failed due to "ContainersNotReady|ImagePullBackOff" was in fact running when the job failed. So how could the container not be ready, or the image need pulling? Thoughts?

microscopic-furniture-57275

05/25/2023, 7:27 PM

An additional perplexing bit of information is that based on a review of logs, the map-task "element" 85 pod was actually running for 20 seconds doing computations before things were halted due to this error.

dry-ability-69144

05/25/2023, 7:39 PM

When you register the tasks, which image tag are you using? If it does not exactly match an image in ECR, you receive this error.

dry-ability-69144

05/25/2023, 7:39 PM

At least it works this way for me.

microscopic-furniture-57275

05/25/2023, 8:29 PM

Hi @dry-ability-69144 - we use a single image for all tasks, so when a new image is built, a new registration pass is done for all tasks/workflows. This workflow described above works about half the time, so I don't think it is a mismatch in image tag spec, otherwise it would never work.

crooked-artist-67935

06/05/2023, 7:10 PM

We're seeing this exact behavior as well with a single custom image. We've seen it with dynamic tasks that spin up a bunch of child tasks as well as in map tasks. Did you happen to identify any root causes here?

microscopic-furniture-57275

06/07/2023, 2:06 PM

We have not, I'm hoping this will catch the attention of some Union engineers. 🙂. I also continue to see this -- just as you report - in map tasks, and also sometimes in @dynamic workflows that execute a number (e.g. 15-20) of "child" tasks.

dry-ability-69144

06/07/2023, 2:07 PM

@freezing-airport-6809

freezing-airport-6809

06/07/2023, 2:15 PM

Is this for map tasks only?

microscopic-furniture-57275

06/07/2023, 2:15 PM

No, as indicated above, we both see it for non-map tasks as well.

microscopic-furniture-57275

06/07/2023, 2:16 PM

It would appear to be an issue when a "large" number of tasks (I've seen it with as few as 15) get started one after the other.

microscopic-furniture-57275

06/07/2023, 2:18 PM

In exploring this error message "ContainersNotReady|ImagePullBackOff" I thought perhaps there was an issue with node startup timing out, or image-download-throttling, or something causing a timeout. But in my investigations I've found it fail in situations where the image MUST already be on the node. In the case I mentioned in the OP, based on log files, it even seemed the task was in fact already running!

microscopic-furniture-57275

06/07/2023, 2:21 PM

Unfortunately I have been busy on 10 other things, so have not constructed a simple repro case; these were observed in our production workflows. In the meantime, we've mitigated by stopping the use of the map task, and in the case of the @dynamic that I saw fail, this is a FixedRate task, so the failure case will get re-run anyway. So we're living with it for the moment.

freezing-airport-6809

06/08/2023, 5:12 AM

We have not really seen this, but let me cc @hallowed-mouse-14616 on this. He is actually completely reworking map tasks in the backend; it's a massive refactor and makes map tasks extremely powerful. Goal 1: make it the same (just less buggy). Goal 2: make it more powerful, extending the paradigm to many other task types, etc.

hallowed-mouse-14616

06/08/2023, 1:08 PM

@microscopic-furniture-57275 can you provide a little more information? Specifically: (1) which logs are you referring to that indicate the task is executing? k8s? flyte? and do you have the messages? (2) do you have the pod yaml dump from k8s? this would help immensely. We handle transitioning k8s pod state to a flyte phase in this function. Basically, there is a mapping of values. What I'm wondering is whether there is something like an ImagePullBackoff (as indicated by your message), Flyte marks the task as a retryable failure, and then, between that time and when Flyte actually cleans up the failed task, the task actually starts. The 20 second runtime seems suspect, but it could be the case. If this is the case we could explore adding a grace period configuration option similar to the one for CreateContainerError. The idea is that Flyte wouldn't immediately fail; rather, it would ensure that N seconds have passed since beginning the task.
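The grace-period idea described above can be sketched roughly as follows. This is a minimal illustration only, not Flyte's actual implementation: the `Phase` enum, the `GRACE_REASONS` set, and the function `phase_for_waiting_container` are hypothetical names invented for this sketch, and the 3-minute grace period is an arbitrary example value.

```python
from datetime import datetime, timedelta
from enum import Enum


class Phase(Enum):
    """Hypothetical task phases, simplified for this sketch."""
    INITIALIZING = "initializing"            # keep waiting for the pod
    RETRYABLE_FAILURE = "retryable_failure"  # fail the attempt, allow a retry


# Hypothetical set of container "waiting" reasons that get a grace period.
GRACE_REASONS = {"ImagePullBackOff", "ErrImagePull", "CreateContainerError"}


def phase_for_waiting_container(reason, pod_created_at, now, grace_period):
    """Map a container waiting reason to a phase, tolerating transient errors.

    The attempt is only failed once `grace_period` has elapsed since the pod
    was created; before that the error is treated as possibly transient.
    """
    if reason not in GRACE_REASONS:
        return Phase.INITIALIZING
    if now - pod_created_at < grace_period:
        return Phase.INITIALIZING
    return Phase.RETRYABLE_FAILURE


created = datetime(2023, 6, 8, 13, 0, 0)
grace = timedelta(minutes=3)

# 20 seconds after creation: still inside the grace period, so keep waiting.
print(phase_for_waiting_container(
    "ImagePullBackOff", created, created + timedelta(seconds=20), grace))

# 5 minutes after creation: the grace period has expired, so fail the attempt.
print(phase_for_waiting_container(
    "ImagePullBackOff", created, created + timedelta(minutes=5), grace))
```

With a rule like this, a pod that hits a brief pull error and then starts running 20 seconds later would not be marked as failed, which is the scenario reported in this thread.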

freezing-airport-6809

06/08/2023, 1:11 PM

The image pullback off could be because of throttling - @hallowed-mouse-14616 - great point. I think you might be right

microscopic-furniture-57275

06/08/2023, 2:11 PM

@hallowed-mouse-14616 thanks very much for your thoughtful reply. I'm on a deadline for some unrelated work, so may not be able to provide this information for a few days -- but we need to solve this and I will absolutely get back to you with more logs and investigation results. Your theory sounds plausible and a grace period sounds like a good solution in this case. Also note that @crooked-artist-67935 reported seeing the same things as I have, so perhaps Sam can as well provide more log/repro info. I'll need to try to repro to get the log information you want -- I went back to the execution in the Flyte Console that I ref'd in the OP, but whereas previously the maptask could be unfolded, and I could see execution/log info on each element of the mapped task, it now only shows a summary of the tasks - I'm not sure if this is some UI change or this is behavior after a task is much older and the pods no longer exist. Or that I'm misremembering. 🙂. More later.

microscopic-furniture-57275

07/03/2023, 8:35 PM

This message contains interactive elements.

hallowed-mouse-14616

07/05/2023, 7:49 PM

@microscopic-furniture-57275 do you know what version of flytepropeller you're running? There was an issue, that I fixed, where Flyte didn't correctly clean up ImagePullBackoff resources. So the task would fail, but Flyte would leave the Pod until completion. I'm wondering if this is the issue here.

microscopic-furniture-57275

07/05/2023, 9:09 PM

@hallowed-mouse-14616 we are using the flyte binary backend, which to my understanding has the propeller component integrated? The flyte-binary version as of those logs above was 1.5.0, which we've since bumped to 1.6.2 a few weeks ago. I haven't made tests since the last upgrade.

hallowed-mouse-14616

07/05/2023, 9:12 PM

"which to my understanding has the propeller component integrated"

correct. the single binary has all of the components bundled.

"which we've since bumped to 1.6.2"

so it is likely that this fix was not in the previous deployment.

microscopic-furniture-57275

07/05/2023, 9:13 PM

ok, let me do some testing with our current versions and create a simple repro case if I see the problem still...

🙏 1

hallowed-mouse-14616

07/05/2023, 9:20 PM

We are certainly going to add a grace period for the ImagePullBackoff failure as well (issue pending). Because right now, the first time Flyte sees the ImagePullBackoff it will fail the task, deleting the Pod, and then if retries are configured launch a new Pod. It probably makes more sense to wait a configurable amount of time on ImagePullBackoff before declaring a task a failure.

👍 1

microscopic-furniture-57275

07/06/2023, 8:41 PM

Hey @hallowed-mouse-14616, I did a test with flytekit 1.7 and I still see the issue of map-task elements being reported as failed that actually end up running successfully. See attached image indicating two failed map tasks (out of 100) and see logging from one of these that Flyte reports as failed (though it actually completed successfully). It seems Flyte is seeing the initial ImagePull error and marking it as failed, but a retry causes the image to get fetched and the task runs to completion. But my workflow aborted as soon as Flyte decided one of them failed. 😞 I'm not sure what the right answer here is: this could be fixed either by Flyte waiting a bit before calling it failed, or (from what I read) by us bumping up our registryPullQPS and/or registryBurst config params for kubelet to fix the initial "Pull QPS exceeded" which seems to be the root cause.
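To illustrate why many pods starting at once could trip a "Pull QPS exceeded" error, here is a small simulation of a token-bucket rate limiter parameterized the same way as the two kubelet settings mentioned above. It is a sketch under stated assumptions, not kubelet code: the `TokenBucket` class and `count_rejected` helper are hypothetical, the values 5 and 10 are what the kubelet defaults for registryPullQPS and registryBurst are understood to be, the tuned values 20 and 40 are arbitrary, and it assumes every pod start on a node results in a pull request.

```python
class TokenBucket:
    """Token bucket: holds up to `burst` tokens, refilled at `qps` per second."""

    def __init__(self, qps, burst):
        self.qps = qps
        self.burst = burst
        self.tokens = float(burst)  # the bucket starts full
        self.last = 0.0             # time of the previous request, in seconds

    def try_accept(self, now):
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.qps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the request is rejected, not queued


def count_rejected(num_pulls, qps, burst, at_time=0.0):
    """Count rejected pulls when `num_pulls` requests arrive at the same instant."""
    bucket = TokenBucket(qps, burst)
    return sum(not bucket.try_accept(at_time) for _ in range(num_pulls))


# 100 map-task elements spread over 5 machines is roughly 20 pulls per node.
rejected_default = count_rejected(20, qps=5, burst=10)
rejected_tuned = count_rejected(20, qps=20, burst=40)
print(rejected_default, rejected_tuned)  # prints "10 0"
```

Under these assumptions, half of the simultaneous pulls on a node are rejected with the default-like limits, while a larger burst allowance absorbs all of them. Rejected pulls are retried with back-off, which would be consistent with pods that report ImagePullBackOff briefly and then run normally.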

hallowed-mouse-14616

07/06/2023, 9:36 PM

Did you just upgrade flytekit? The mentioned fix will require flytepropeller to be updated.

microscopic-furniture-57275

07/06/2023, 9:37 PM

We are running v1.6.2 of the flyte single binary.

microscopic-furniture-57275

07/06/2023, 9:38 PM

And, though it would be good for the failure status to be correct, really what we want is for it not to fail. 🙂

microscopic-furniture-57275

07/06/2023, 9:40 PM

"We are certainly going to add a grace period for the ImagePullBackoff failure as well (issue pending)."

Do you have a feel for when the above might make it in? In the meantime I'm going to explore configuring kubelet to allow more throughput on image pulls.

hallowed-mouse-14616

07/06/2023, 9:43 PM

re ^ I'll get a PR submitted by tomorrow. will update this thread accordingly.

🙏 1

hallowed-mouse-14616

07/06/2023, 9:47 PM

and now I see what is happening here. the propeller fix for deleting pods on ImagePullBackoff will not work on maptask because the abort is handled differently. but it should fix this issue in dynamics / etc. i'm currently wrapping up a huge effort completely updating how maptasks are executed internally with the ArrayNode work. once that's in (hopefully next few weeks) the Pod deletion on ImagePullBackoff will work in maptasks. as you suggested though, maybe not the exact behavior you want 😄, but a big relief for a lot of maptask issues.

microscopic-furniture-57275

07/06/2023, 9:48 PM

Great, thanks for the update!

hallowed-mouse-14616

07/06/2023, 9:54 PM

just created an issue - https://github.com/flyteorg/flyte/issues/3843

👍 1

hallowed-mouse-14616

07/07/2023, 7:07 AM

and here's the fix - https://github.com/flyteorg/flyteplugins/pull/370

microscopic-furniture-57275

07/07/2023, 2:02 PM

Thanks @hallowed-mouse-14616! I'll look for this to show up in a subsequent release. Presumably this will be a release of flyte-propeller, in our case bundled in the single flyte binary, since I'd assume all logic related to marking pods as failed exists in that backend code.

hallowed-mouse-14616

07/08/2023, 12:31 AM

Exactly, so this was merged in propeller in the v1.1.105 release which should be included in the next flyte single binary release.

crooked-artist-67935

07/10/2023, 4:39 PM

Sorry all, I was out last week. It seems like there is a solution proposed above, which is great. On our end we tuned some k8s params to decrease the amount of node consolidation that was happening and increased our EC2 pull limits which seems to have alleviated the issue for now.

👍 2

microscopic-furniture-57275

07/10/2023, 9:46 PM

@hallowed-mouse-14616 Above you mention that your grace period for image-pull timeouts will be included in the next flyte single binary release. I've looked around and can't seem to find a release history (or version numbers I understand) for the single-binary. I found the page that documents how to install it via a Helm chart, and the README.md in the helm charts folder refers to various version numbers, including "Version 0.1.10", "AppVersion 1.16.0", and a flyteagent.deployment.image.tag version of 1.62b1, among others. I'm not finding any clear-cut "release/version history" for the flyte-binary as I might expect. Can you shed any light on this? I'm trying to understand how to look out for what you refer to above, the "next flyte single binary release". Thanks!
