Flyte enables production-grade orchestration for machine learning workflows and data processing created to accelerate local workflows to production.


Hello community,

Has anyone seen a similar error?

_resource not found, name [bap-development/f138b272e4b3341c282d-n3-0-n3-n1-0-n2-0]. reason: Pod "f138b272e4b3341c282d-n3-0-n3-n1-0-n2-0" not found_

Some context: I have a workflow that I run on a set of videos. In 95% of cases it runs without an issue, but sometimes I get errors like this one, saying that a pod cannot be found for some reason.

If the pod is gone, then the node it was running on might have been scaled down.

We found this would happen because of pod AZ rebalancing in AWS, and we had to disable that for the node groups.
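For anyone wanting to try the same thing, here is a minimal sketch of suspending only the `AZRebalance` process on one Auto Scaling group. It assumes the node group is backed by an Auto Scaling group whose name you already know, and that `boto3` is installed with AWS credentials configured. The helper name `suspend_az_rebalance` and the group name in the usage comment are hypothetical, not something from this thread.

```python
def suspend_az_rebalance(autoscaling_client, asg_name):
    """Suspend only the AZRebalance process on a single Auto Scaling group.

    `autoscaling_client` is expected to be a boto3 Auto Scaling client,
    i.e. the object returned by boto3.client("autoscaling").
    """
    autoscaling_client.suspend_processes(
        AutoScalingGroupName=asg_name,
        ScalingProcesses=["AZRebalance"],
    )


# Usage (assumes boto3 is installed; the group name is a placeholder):
#
#   import boto3
#   client = boto3.client("autoscaling")
#   suspend_az_rebalance(client, "my-flyte-nodegroup-asg")
```

Passing the client in as an argument keeps the helper easy to try against a stub before pointing it at a real account.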

Does that mean too many pods were running and some of them got deleted? I can imagine that.

Is there a way to restore them? If I simply relaunch them, they will end up in the same failure.

See `AZRebalance` in <https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html>
If the error is consistently reproducible with the same pods, then it's likely not a rebalance issue. Perhaps the job is crashing the node? At one point we had crashes because of too little disk space, which we fixed by increasing the disk space. The point is, if the pod is completely gone then the node is likely gone too, so you will need to debug why the node was removed, crashed, etc.
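As a starting point for that debugging, here is a rough sketch that pulls the namespace and pod name out of the error message and prints a few `kubectl` commands to look for related events. The regular expression assumes the message always has the `name [namespace/pod]` shape seen in this thread, and the helper names are made up for illustration.

```python
import re

# Assumes the error always contains "name [<namespace>/<pod>]" as in the
# message quoted earlier in this thread.
ERROR_PATTERN = re.compile(r"name \[(?P<namespace>[^/\]]+)/(?P<pod>[^\]]+)\]")


def parse_missing_pod(message):
    """Return (namespace, pod) from a 'resource not found' error, or None."""
    match = ERROR_PATTERN.search(message)
    if match is None:
        return None
    return match.group("namespace"), match.group("pod")


def debug_commands(namespace, pod):
    """Build kubectl commands that may show why the pod or its node vanished."""
    return [
        # Events that mention the missing pod (evictions, kills, scheduling).
        f"kubectl get events -n {namespace} "
        f"--field-selector involvedObject.name={pod}",
        # Events that mention nodes (removal, NotReady, disk pressure).
        "kubectl get events --all-namespaces "
        "--field-selector involvedObject.kind=Node",
        # Current nodes, to compare ages against the time of the failure.
        "kubectl get nodes",
    ]


if __name__ == "__main__":
    error = (
        "resource not found, name "
        "[bap-development/f138b272e4b3341c282d-n3-0-n3-n1-0-n2-0]. "
        'reason: Pod "f138b272e4b3341c282d-n3-0-n3-n1-0-n2-0" not found'
    )
    parsed = parse_missing_pod(error)
    if parsed is not None:
        for command in debug_commands(*parsed):
            print(command)
```

Kubernetes typically keeps events for only a limited time (often about an hour), so these commands are most useful when run soon after the failure.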