#ask-ai

Olivier Sevin

02/13/2024, 7:04 PM
@Glime can you explain why we sometimes get "reported success despite being OOMKilled" errors?

Glime

02/13/2024, 7:05 PM
Tasks being reported as successful despite being OOMKilled (Out of Memory Killed) can be attributed to a few potential causes, based on the context provided:

1. Incorrect Handling of OOMKilled Status: There might be a problem in how the system handles and reports the status of tasks that are OOMKilled. For instance, the system might mark subtask executions as retryable failures due to OOMKilled but still report the overall task as a success. This behavior was noted as odd and under investigation in one of the discussions, indicating that there might be a bug or an oversight in the handling of such errors (source).

2. Resource Configuration Issues: Another cause could be the configuration of resources for the tasks. Users have encountered OOMKilled errors even after setting what they believed were sufficient resources for their tasks. This suggests a mismatch between the resources allocated and the actual requirements of the tasks (source).

3. Issues with Flyte's Handling of OOMKilled Tasks: Flyte's issue tracker acknowledges the need to correctly mark tasks as OOMKilled. This indicates that there might be known issues or improvements needed in how Flyte detects and reports OOMKilled tasks (source).

4. Underlying Kubernetes Behavior: The behavior could also be influenced by how Kubernetes handles OOMKilled containers. For example, if a pod is OOMKilled, Kubernetes might restart it, and depending on how Flyte interprets these restarts, this could lead to incorrect reporting of task success.

In summary, tasks being reported as successful despite being OOMKilled could stem from a combination of incorrect error handling within Flyte, resource configuration issues, and potentially the underlying behavior of Kubernetes. It's important for the Flyte team to investigate these incidents to identify and rectify the root causes, ensuring accurate reporting of task statuses.
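To illustrate the Kubernetes restart scenario described above, here is a minimal sketch of how an OOM kill can be hidden behind a successful final state. It is not Flyte's actual detection logic. The field names (`containerStatuses`, `state`, `lastState`, `terminated`, `reason`, `restartCount`) follow the Kubernetes pod status schema, while the `find_oom_kills` helper and the `example_status` data are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, List


def find_oom_kills(pod_status: Dict[str, Any]) -> List[str]:
    """Return names of containers that were OOMKilled at any point.

    Both the current state and lastState are checked, because after a
    restart the OOM kill is only visible in lastState.
    """
    oom_containers = []
    for cs in pod_status.get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                oom_containers.append(cs["name"])
                break  # one match per container is enough
    return oom_containers


# Hypothetical pod status (assumption, not real cluster output): the
# container was OOMKilled once, restarted, and then exited cleanly.
example_status = {
    "phase": "Succeeded",
    "containerStatuses": [
        {
            "name": "primary",
            "restartCount": 1,
            "state": {"terminated": {"reason": "Completed", "exitCode": 0}},
            "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
        }
    ],
}

# A check that looks only at the pod phase reports success...
naive_success = example_status["phase"] == "Succeeded"
# ...while inspecting lastState reveals the earlier OOM kill.
oom_killed = find_oom_kills(example_status)
print(naive_success, oom_killed)  # True ['primary']
```

A status check that reads only the final phase or the current container state would treat this pod as a plain success, which is one way the "reported success despite being OOMKilled" symptom could arise.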