Thomas Blom

05/25/2023, 4:45 PM
I'm not sure if my question belongs here or in #flytekit ... I'm seeing map-tasks commonly fail with the error:
[85]: code:"ContainersNotReady|ImagePullBackOff" message:"containers with unready status: [f0ff4b2cb0d6d44d1907-n1-0-dn0-0-dn10-0-85]|Back-off pulling image \"(sanitized).dkr.ecr.us-east-1.amazonaws.com/flyte-plaster:23.5.19\""
This is part of a workflow that uses a mappable task for one section. The workflow completes successfully maybe half of the time. The other half, this error occurs specifically for the map task. I can see from the logs that some elements of the mapped task are executing successfully, and in this case it is number 85 (out of 100) that failed. These are all executing on the same node (this is by design for mapped tasks, yes?), and they all use the same image, which is clearly already on the node since other elements from the mapped task "list" are executing.
Edit: The above was incorrect. I misread/misunderstood the "node" aspect: a map task runs within a single "workflow node", not a single compute node (e.g. an EC2 instance). In fact, the 100 elements of the map task are being executed across 5 different machines. Still, the puzzle remains (see the reply): the task that supposedly failed due to "ContainersNotReady|ImagePullBackOff" was in fact running when the job failed. So how could the container not be ready, or the image need pulling? Thoughts?
One more perplexing detail: according to the logs, the pod for map-task "element" 85 had actually been running for 20 seconds, doing computations, before things were halted by this error.

Victor Gustavo da Silva Oliveira

05/25/2023, 7:39 PM
When you register the tasks, which image tag are you using? If it is not exactly equal to the tag of an image in ECR, you receive this error.
At least, that is how it works for me.
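A minimal sketch of the exact-match check described above, assuming you already have the list of tags published for the repository (for example, copied from the ECR console). The helper names `split_image_ref` and `tag_is_published` are hypothetical and are not part of Flyte or any AWS library; the sketch only compares strings and does not contact a registry.

```python
def split_image_ref(image_ref: str) -> tuple[str, str]:
    """Split 'registry/repo:tag' into (repository, tag).

    Assumption: the reference uses a tag, not an '@sha256:...' digest.
    A missing tag is treated as 'latest'.
    """
    # Only a colon after the last slash separates the tag, so a registry
    # port such as 'localhost:5000/repo' is not mistaken for a tag.
    last_slash = image_ref.rfind("/")
    colon = image_ref.rfind(":")
    if colon > last_slash:
        return image_ref[:colon], image_ref[colon + 1:]
    return image_ref, "latest"


def tag_is_published(image_ref: str, published_tags: set[str]) -> bool:
    """Return True only if the reference's tag exactly matches a published tag."""
    _, tag = split_image_ref(image_ref)
    return tag in published_tags


if __name__ == "__main__":
    # Hypothetical tag list; in practice this would come from the registry.
    published = {"23.5.18", "23.5.19"}
    ref = "dkr.ecr.us-east-1.amazonaws.com/flyte-plaster:23.5.19"
    print(tag_is_published(ref, published))
```

A tag that differs by even one character fails the membership test, which is the situation in which the image pull would be expected to fail every time rather than intermittently.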

Thomas Blom

05/25/2023, 8:29 PM
Hi @Victor Gustavo da Silva Oliveira - we use a single image for all tasks, so when a new image is built, a new registration pass is done for all tasks/workflows. This workflow described above works about half the time, so I don't think it is a mismatch in image tag spec, otherwise it would never work.