# flyte-support
b
Hello community, have you seen a similar error? resource not found, name [bap-development/f138b272e4b3341c282d-n3-0-n3-n1-0-n2-0]. reason: Pod "f138b272e4b3341c282d-n3-0-n3-n1-0-n2-0" not found. Some context: I have a workflow that I run on videos. In 95% of cases it runs without an issue, but sometimes I get errors like this one, saying that a pod cannot be found.
a
If the pod is gone, then the node might have been scaled down.
We found this would happen because of AZ rebalancing in AWS and had to disable that for the node groups.
b
Does that mean too many pods were running and some of them got deleted? I can imagine that. Is there a way to restore them? If I simply relaunch them, they will end up in the same failure.
a
See AZRebalance in https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html. If the error is consistently reproducible with the same pods, then it's likely not a rebalance issue. Perhaps the job is crashing the node? At one point we had crashes because of too little disk space, fixed by increasing the disk space. The point is, if the pod is completely gone then the node is likely gone, so you will need to debug why the node was removed, crashed, etc.
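As a starting point for that debugging, here is a minimal sketch (not something from Flyte itself) that pulls the namespace and pod name out of the error text quoted above and prints a few standard `kubectl` commands to look for traces of the pod and its node. The helper names `parse_missing_pod` and `diagnostic_commands` are hypothetical, and the regex assumes the error always has the `name [<namespace>/<pod>]` shape seen in this thread.

```python
import re
import shlex

# Assumed error shape, taken from the message in this thread:
#   resource not found, name [<namespace>/<pod>]. reason: Pod "<pod>" not found
_ERROR_RE = re.compile(r"name \[(?P<namespace>[^/\]]+)/(?P<pod>[^\]]+)\]")


def parse_missing_pod(error_text):
    """Return (namespace, pod) from the error text, or None if it does not match."""
    match = _ERROR_RE.search(error_text)
    if match is None:
        return None
    return match.group("namespace"), match.group("pod")


def diagnostic_commands(namespace, pod):
    """Build kubectl commands (as argument lists) for inspecting the pod and nodes."""
    return [
        # Events that reference the missing pod (evictions, kills, scheduling).
        ["kubectl", "get", "events", "-n", namespace,
         "--field-selector", f"involvedObject.name={pod}"],
        # All events in the namespace, sorted by their last timestamp.
        ["kubectl", "get", "events", "-n", namespace, "--sort-by=.lastTimestamp"],
        # Current nodes with their ages, to spot recently replaced nodes.
        ["kubectl", "get", "nodes", "-o", "wide"],
    ]


if __name__ == "__main__":
    error = (
        "resource not found, name "
        "[bap-development/f138b272e4b3341c282d-n3-0-n3-n1-0-n2-0]. "
        'reason: Pod "f138b272e4b3341c282d-n3-0-n3-n1-0-n2-0" not found'
    )
    parsed = parse_missing_pod(error)
    if parsed is not None:
        for cmd in diagnostic_commands(*parsed):
            print(shlex.join(cmd))
```

Kubernetes only keeps events for a limited time, so these commands are most useful shortly after the failure. If the events are already gone, the node group's scaling activity on the AWS side is the next place to look.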
b
thanks, I will look into it