Hello community,
Has anyone seen a similar error?
resource not found, name [bap-development/f138b272e4b3341c282d-n3-0-n3-n1-0-n2-0]. reason: Pod "f138b272e4b3341c282d-n3-0-n3-n1-0-n2-0" not found
Some context: I have a workflow that I run on some videos. In 95% of cases it runs without an issue, but sometimes I get errors like this one, saying that a pod cannot be found.
abundant-laptop-64153
06/10/2024, 3:32 PM
if the pod is gone then the node might have scaled down
abundant-laptop-64153
06/10/2024, 3:32 PM
we found this would happen because of AZ rebalancing in AWS and had to disable that for the node groups
bored-needle-72209
06/10/2024, 3:42 PM
Does that mean too many pods were running and some of them got deleted? I can imagine that.
Is there a way to restore them? If I simply relaunch them, they will end up in the same failure.
abundant-laptop-64153
06/10/2024, 4:11 PM
see AZRebalance in https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html
if the error is consistently reproducible with the same pods then it's likely not a rebalance issue. perhaps the job is crashing the node? at one point we had crashes because of too little disk space, fixed by increasing the disk space. the point is, if the pod is completely gone then the node is likely gone, so you will need to debug why the node was removed, crashed, etc.
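as a starting point for that debugging, here is a minimal sketch (not anything built into the workflow tooling) that pulls the namespace and pod name out of the error text quoted above and prints standard kubectl commands to check the pod, its events, and the current nodes. the helper names `parse_missing_pod` and `debug_commands` are made up for illustration, and the regex assumes the error always has the `name [namespace/pod]` shape seen in this thread.

```python
import re

# Assumption: the error always looks like the one quoted in this thread,
# i.e. "resource not found, name [<namespace>/<pod>]".
ERROR_RE = re.compile(r"resource not found, name \[([^/\]]+)/([^\]]+)\]")


def parse_missing_pod(message):
    """Return (namespace, pod) from a 'resource not found' error, or None."""
    match = ERROR_RE.search(message)
    return (match.group(1), match.group(2)) if match else None


def debug_commands(namespace, pod):
    """Build kubectl commands for checking what happened to the pod and nodes."""
    return [
        # Is the pod really gone, and if not, which node is it on?
        f"kubectl get pod {pod} -n {namespace} -o wide",
        # Recent events that mention the pod (evictions, kills, scheduling).
        f"kubectl get events -n {namespace} --field-selector involvedObject.name={pod}",
        # Compare the current node list with the one from before the failure.
        "kubectl get nodes",
    ]


if __name__ == "__main__":
    error = (
        "resource not found, name "
        "[bap-development/f138b272e4b3341c282d-n3-0-n3-n1-0-n2-0]. "
        'reason: Pod "f138b272e4b3341c282d-n3-0-n3-n1-0-n2-0" not found'
    )
    parsed = parse_missing_pod(error)
    if parsed:
        for command in debug_commands(*parsed):
            print(command)
```

run the printed commands soon after the failure, since cluster events are only kept for a limited time.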