@Glime What's going on here, we have finalizers enabled and keep getting messages like this: "ResourceDeletedExternally" message:"resource not found, name [balrog-production/fnmwxgmyisp46y-n0-0-dn6-0-dn0-0-dn1-0-5]. reason: pods \"fnmwxgmyisp46y-n0-0-dn6-0-dn0-0-dn1-0-5\" not found"
02/06/2024, 8:53 PM
Based on the context provided, it seems you are encountering an issue where, despite having finalizers enabled to prevent premature deletion of resources, you are still receiving messages indicating that resources were deleted externally or not found. This can be a complex issue involving interactions between Flyte, Kubernetes, and possibly cloud-provider-specific mechanisms (such as AWS EKS node management and Auto Scaling Group (ASG) behavior).
1. *Finalizers and Kubernetes Behavior*: Finalizers are designed to prevent Kubernetes resources from being deleted immediately, allowing cleanup or other actions to be taken before the deletion is finalized. However, as noted in the discussions, there are scenarios where Kubernetes or external systems do not behave as expected around finalizers. For example, if a pod is OOM (Out of Memory) killed, Kubernetes marks the pod for deletion, and even with a finalizer in place, the pod's terminal status can confuse Flyte when it tries to determine the task state.
2. *External Deletions and Auto Scaling*: One specific scenario mentioned involves interaction with AWS EKS node management and ASG Availability Zone rebalancing. It was observed that eviction requests made by AWS-managed lambda functions for AZ rebalancing could lead to pod evictions that seem to bypass the protection offered by finalizers. This suggests that certain cloud-provider-level operations might not fully honor Kubernetes finalizers in the way one might expect.
3. *Investigation and Mitigation Steps*:
• *Review Kubernetes and Cloud Provider Documentation*: Understanding the specifics of how finalizers are treated by Kubernetes and any cloud-provider-specific behaviors (like EKS's node management and ASG rebalancing) is crucial.
• *Check for Configuration Issues*: Ensure that finalizers are correctly configured and applied to your resources. It might also be worth exploring if there are specific configurations within EKS or the ASG that could be adjusted to reduce the likelihood of these issues.
• *Contact Cloud Provider Support*: If the issue seems to be related to cloud-provider-specific behaviors (like the lambda function evictions during AZ rebalancing), reaching out to AWS support for more insights or potential workarounds could be beneficial.
• *Monitor and Adjust Resource Limits*: For cases where pods are being OOM killed, reviewing and possibly adjusting resource requests and limits might help reduce the frequency of such events.
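As a concrete illustration of that last bullet, memory requests and limits are set per container in the pod spec, and it is the memory *limit* whose breach triggers an OOM kill. The names and values below are purely illustrative, not taken from your cluster:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-map-task-pod      # illustrative name
spec:
  containers:
    - name: task
      image: example/task:latest  # illustrative image
      resources:
        requests:
          memory: "2Gi"           # what the scheduler reserves for the container
          cpu: "1"
        limits:
          memory: "4Gi"           # exceeding this is what triggers an OOM kill
```

In Flyte these values usually come from the task's declared resources (e.g. flytekit's `Resources` on the `@task` decorator) rather than a hand-written pod spec, so raising the limit there is the typical fix.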
In summary, while finalizers are a powerful feature for managing the lifecycle of Kubernetes resources, their effectiveness can be influenced by a range of factors, including Kubernetes internals, cloud-provider-specific behaviors, and resource configuration issues.
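The deletion mechanics described in point 1 can be sketched in a small, self-contained Python simulation. This is a toy model of the API server's behavior, not Flyte or Kubernetes code, and `example.com/task-finalizer` is a placeholder key, not Flyte's real finalizer name:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FakePod:
    """Toy model of a Kubernetes object's deletion-related metadata."""
    name: str
    finalizers: List[str] = field(default_factory=list)
    deletion_timestamp: Optional[str] = None

def api_delete(pod: FakePod) -> bool:
    """Simulate a DELETE: with finalizers present, the API server only marks
    the object by setting deletionTimestamp. Returns True only if the object
    can actually be removed (i.e. the finalizer list is already empty)."""
    pod.deletion_timestamp = "2024-02-06T20:53:00Z"
    return not pod.finalizers

def remove_finalizer(pod: FakePod, finalizer: str) -> bool:
    """Simulate a controller patching a finalizer away. If the object was
    already marked for deletion, emptying the list lets GC collect it."""
    if finalizer in pod.finalizers:
        pod.finalizers.remove(finalizer)
    return pod.deletion_timestamp is not None and not pod.finalizers

pod = FakePod("dn1-0-5", finalizers=["example.com/task-finalizer"])
assert api_delete(pod) is False                             # deletion blocked
assert remove_finalizer(pod, "example.com/task-finalizer")  # now collectable
```

The key point the sketch illustrates: a finalizer only defers garbage collection until whoever owns the finalizer removes it; it does not stop the controller itself (or anything else with patch permissions) from releasing the object early.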
This is an AI-generated response, take its recommendations with caution.
Please help us improve the model accuracy by giving this response a thumbs up/thumbs down reaction.
thanks David and @Glime (do you do follow ups?) I think I've got something different, from the GKE audit logs I see flytepropeller did an "update" to remove the finalizer and a "delete" on this pod (a map task) before calling "get" a couple times 10 seconds later
02/06/2024, 10:26 PM
Based on the information provided and the context from the Flyte community discussions, it seems like the issue you're encountering is related to the behavior of FlytePropeller in handling finalizers for pods, particularly in the context of map tasks. From the GKE audit logs you mentioned, it appears that FlytePropeller performed an "update" operation to remove the finalizer from a pod and then issued a "delete" command before attempting to "get" the pod's status multiple times shortly after. This sequence of actions suggests a few potential areas to investigate:
1. *Finalizer Removal Timing*: The removal of a finalizer is a critical step that signals to Kubernetes that it is safe to garbage-collect a resource (in this case, a pod). If FlytePropeller removes the finalizer and then immediately deletes the pod, it may be doing so before any necessary cleanup or final state checks are performed. This can lead to situations where FlytePropeller later attempts to "get" the pod's status for final processing or logging, only to find that the pod has already been removed from the cluster, which would produce exactly the "resource not found" message you quoted.
2. *Map Task Specific Behavior*: Given that this issue occurred with a map task, it's worth considering whether there are specific behaviors or configurations related to map tasks that could influence how finalizers are managed. Map tasks, which execute multiple instances of a task in parallel, might have different timing or synchronization considerations compared to other task types.
3. *Audit Log Insights*: The sequence of actions captured in the GKE audit logs (update to remove finalizer, delete, then get) provides a useful trail to follow. It would be beneficial to closely examine the timestamps and any associated log messages from both FlytePropeller and Kubernetes to understand the exact sequence of events and any errors or warnings that were logged.
4. *Configuration and Version Check*: Ensure that your Flyte and FlytePropeller configurations are up to date and correctly set up for handling finalizers with map tasks. It's also a good idea to verify that you're using a version of Flyte and FlytePropeller that includes any recent fixes or improvements related to finalizer handling and map tasks.
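To make point 1 concrete, here is a small Python sketch contrasting the ordering your audit logs suggest (finalizer removal and delete before the final read) with a read-before-release ordering. `FakeCluster` and both controller functions are hypothetical stand-ins for illustration, not FlytePropeller's actual implementation:

```python
class NotFound(Exception):
    pass

class FakeCluster:
    """Toy API server: pods keyed by name, each a dict with a 'phase'."""
    def __init__(self):
        self.pods = {}

    def get(self, name):
        if name not in self.pods:
            raise NotFound(f'pods "{name}" not found')
        return self.pods[name]

    def remove_finalizer_and_delete(self, name):
        # Once the finalizer is gone, delete removes the object immediately.
        self.pods.pop(name, None)

def unsafe_controller(cluster, name):
    """Ordering seen in the audit logs: release the pod, then read it back."""
    cluster.remove_finalizer_and_delete(name)
    return cluster.get(name)["phase"]   # raises NotFound

def safe_controller(cluster, name):
    """Record the terminal state *before* releasing the finalizer."""
    phase = cluster.get(name)["phase"]
    cluster.remove_finalizer_and_delete(name)
    return phase

cluster = FakeCluster()
cluster.pods["map-task-5"] = {"phase": "Succeeded"}
assert safe_controller(cluster, "map-task-5") == "Succeeded"

cluster.pods["map-task-6"] = {"phase": "Succeeded"}
try:
    unsafe_controller(cluster, "map-task-6")
except NotFound as err:
    assert "not found" in str(err)
```

The unsafe path reproduces the shape of the error in your first message: any "get" issued after the finalizer update and delete can only fail with "not found", regardless of how the task actually finished.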
To address this issue, you may want to:
• Review the Flyte and FlytePropeller documentation and configuration related to finalizers and map tasks.
• Check for any recent updates or patches to Flyte and FlytePropeller that might address this behavior.
• Consider reaching out to the Flyte community with detailed information about your observations, including the specific sequence of audit-log events (the update removing the finalizer, the delete, and the failing gets) with their timestamps.