#ask-the-community

Eduardo Matus

10/30/2023, 2:07 AM
Hi everyone… still getting these errors:
resource not found, name [nlp-development/a8vs2x8x8g6wktkbghtn-n0-0]. reason: pods "a8vs2x8x8g6wktkbghtn-n0-0" not found
even though I have set inject-finalizer: true in propeller:
flytepropeller:
  replicaCount: 2
  inject-finalizer: true
  manager: false
Any ideas?

Dan Rammer (hamersaw)

10/30/2023, 1:09 PM
@Eduardo Matus are these tasks interruptible? And/or running on spot/preemptible instances?

Eduardo Matus

10/30/2023, 3:02 PM
@Dan Rammer (hamersaw) interruptible was not set; I will set it to false. As for spot/preemptible, the current config is spot_allocation_strategy = "capacity-optimized", so this is probably the issue. The task that we want to run takes 1-2 hours to complete (sometimes more).

Dan Rammer (hamersaw)

10/30/2023, 5:23 PM
Sure, if the Pod is running on a reclaimed spot instance then it will be deleted regardless of finalizers. You do have system retries set, so the task in question will just retry and succeed down the line, right? You can also use intra-task checkpointing to pick up from a midpoint.
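To illustrate the retry-plus-checkpoint idea mentioned above, here is a minimal plain-Python sketch. It does not use the Flyte API. The file-based checkpoint, the `Preempted` exception, and the `fail_at` knob are all hypothetical stand-ins for a pod being deleted mid-task and a later attempt resuming from saved progress.

```python
import json
import os
import tempfile


class Preempted(Exception):
    """Hypothetical stand-in for the pod being deleted on spot reclamation."""


def process_with_checkpoint(items, checkpoint_path, fail_at=None):
    """Square each item, persisting progress after every item.

    fail_at is an illustrative knob that simulates a preemption just
    before the item at that index is processed.
    """
    # Resume from saved state if a previous attempt left a checkpoint behind.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"next_index": 0, "results": []}

    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise Preempted(f"interrupted before item {i}")
        state["results"].append(items[i] ** 2)
        state["next_index"] = i + 1
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a corrupted checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["results"]


def run_with_retries(items, checkpoint_path, max_attempts=3, fail_at=None):
    """Retry the work after a simulated preemption, reusing the checkpoint."""
    for attempt in range(1, max_attempts + 1):
        try:
            # Only the first attempt is interrupted in this simulation.
            simulated = fail_at if attempt == 1 else None
            return process_with_checkpoint(items, checkpoint_path, simulated)
        except Preempted:
            if attempt == max_attempts:
                raise


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    print(run_with_retries([1, 2, 3, 4], path, fail_at=2))
```

In this sketch the second attempt starts at index 2 rather than 0, so only the unfinished items are recomputed. In a real Flyte task the checkpoint would need to live somewhere that survives pod deletion, which a local file does not.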

Eduardo Matus

10/31/2023, 2:06 AM
I have something implemented to recover, but it still sucks. What I did was reduce the batch size, so now I have more pods working but each one takes less time to complete (no pods being deleted for now).
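As a rough illustration of the batch-size workaround described above, the sketch below splits a collection into fixed-size batches. The function name and the sample sizes are hypothetical. The thread does not say how the actual workload is partitioned.

```python
def split_into_batches(items, batch_size):
    """Split items into consecutive batches of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


if __name__ == "__main__":
    data = list(range(10))
    # Larger batches: fewer, longer-running units of work.
    print(len(split_into_batches(data, 4)))
    # Smaller batches: more, shorter-running units of work.
    print(len(split_into_batches(data, 2)))
```

Smaller batches mean each pod runs for less time, so a spot reclamation is less likely to land during any one pod and less work is lost when it does.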

Dan Rammer (hamersaw)

10/31/2023, 2:17 PM
"have something implemented to recover"
can you elaborate here? I'm not sure I'm following. Maybe a breakdown of your use-case would help? It sounds like you're processing a collection of input data and batching into multiple Pods?