# ask-the-community
e
Hi everyone… still getting these errors:
resource not found, name [nlp-development/a8vs2x8x8g6wktkbghtn-n0-0]. reason: pods "a8vs2x8x8g6wktkbghtn-n0-0" not found
even after setting inject-finalizer: true in propeller
flytepropeller:
  replicaCount: 2
  inject-finalizer: true
  manager: false
Any ideas?
d
@Eduardo Matus are these tasks interruptible? And / or running on spot / preemptible instances?
e
@Dan Rammer (hamersaw) interruptible was not set, I will set it to false. As for spot/preemptible, the current config is spot_allocation_strategy = "capacity-optimized", so this is probably the issue. The task that we want to run takes 1-2 hours to complete (sometimes more).
d
Sure, if the Pod is running on a reclaimed spot instance, it will be deleted regardless of finalizers. You do have system retries set, so the task in question will just retry and succeed down the line, right? You can also use intra-task checkpointing to pick up from a midpoint.
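To make the checkpointing idea concrete, here is a minimal sketch in plain Python. It does not use Flyte's own checkpointing API; the `process_with_checkpoint` function, the JSON file layout, and the `fail_at` parameter (which only simulates a reclaimed instance) are all assumptions for illustration. The point is just that progress is persisted after each unit of work, so a retry resumes instead of starting over.

```python
import json
import os


def process_with_checkpoint(items, checkpoint_path, fail_at=None):
    """Sum `items`, saving progress after each one so a retry can resume.

    `fail_at` is a hypothetical knob that raises at a given index to
    simulate the Pod being deleted mid-run.
    """
    state = {"next_index": 0, "total": 0}
    # Resume from the previous attempt's checkpoint if one exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated spot reclamation")
        state["total"] += items[i]
        state["next_index"] = i + 1
        # Write to a temp file and rename, so a crash never leaves
        # a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state["total"]
```

The first attempt can die partway through; the second attempt, pointed at the same checkpoint path, only processes the remaining items.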
e
I have something implemented to recover, but it still sucks. What I did was reduce the batch size, so now I have more pods working, but each one takes less time to complete (no pods being deleted for now).
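For illustration only, the batch-size change amounts to something like the sketch below. The `split_into_batches` helper is hypothetical and not the actual code in use; it just shows how a smaller `batch_size` yields more, shorter-running units of work, each of which would run in its own pod.

```python
def split_into_batches(items, batch_size):
    """Split `items` into consecutive chunks of at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

Halving `batch_size` roughly doubles the number of pods and halves how long each one is exposed to a spot reclamation.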
d
> have something implemented to recover
can you elaborate here? I'm not sure I'm following. Maybe a breakdown of your use-case would help? It sounds like you're processing a collection of input data and batching into multiple Pods?