Maciej Kopczyński 02/12/2024, 9:31 PM
After checking k8s audit logs, it seems that Flyte attempts to update the pod after it has already completed (status phase of the pod is
Workflow[redacted] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[dn0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [k8s-array]:
Operation cannot be fulfilled on pods "a4pdb2frgszrfnfw95xd-n0-0-n1-0-dn0-0-238": the object has been modified; please apply your changes to the latest version and try again
). What can be the reason?
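Glime 02/12/2024, 9:32 PM
The error appears to originate in the `k8s-array` plugin. This issue arises when Flyte tries to update or otherwise interact with a Kubernetes pod whose state has changed outside of its control or expectation. The API server then rejects the write as a conflict because it was based on an outdated version of the object.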
This problem can occur due to several reasons:
1. *Concurrency and Timing Issues*: In highly concurrent environments, especially with workflows that involve many parallel tasks (like those using the `k8s-array` plugin), timing issues can arise. Flyte might attempt to update or check the status of a pod that has already completed its execution and changed state, leading to the error you see.
2. *Retries and Kubernetes Eventual Consistency Model*: Kubernetes operates on an eventual consistency model, where the state of resources like pods might not be immediately reflected across the system. If Flyte's actions are based on stale state information, it might attempt operations that are no longer valid, such as updating a succeeded pod.
3. *Plugin-Specific Behavior*: The `k8s-array` plugin, which is used for executing array jobs in Kubernetes through Flyte, might have specific behaviors or bugs that lead to this issue. For example, it might not correctly handle the lifecycle events of pods, leading to attempts to update pods after their completion.
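To make the first two reasons concrete, here is a minimal sketch of how a version conflict like the one in the error can arise, and how re-reading the object before writing avoids it. It is an in-memory simulation only: `FakePodStore`, `ConflictError`, and `update_with_retry` are hypothetical names made up for illustration, not Flyte or Kubernetes APIs, and this is not how the plugin is actually implemented.

```python
import copy


class ConflictError(Exception):
    """Stands in for the 409 Conflict the API server returns on a stale write."""


class FakePodStore:
    """Hypothetical in-memory stand-in for the API server (not a real client)."""

    def __init__(self):
        self._obj = {"phase": "Running", "annotations": {}, "resource_version": 1}

    def get(self):
        # Return a snapshot, including the version it was read at.
        return copy.deepcopy(self._obj)

    def update(self, obj):
        # Reject any write that was based on an outdated snapshot.
        if obj["resource_version"] != self._obj["resource_version"]:
            raise ConflictError("the object has been modified")
        new_obj = copy.deepcopy(obj)
        new_obj["resource_version"] += 1
        self._obj = new_obj
        return copy.deepcopy(new_obj)


def update_with_retry(store, mutate, max_attempts=3):
    """Re-read the latest object and reapply `mutate` whenever a conflict occurs."""
    for attempt in range(1, max_attempts + 1):
        obj = store.get()
        mutate(obj)
        try:
            return store.update(obj), attempt
        except ConflictError:
            continue
    raise ConflictError(f"still conflicting after {max_attempts} attempts")


store = FakePodStore()
stale = store.get()  # the controller caches the pod while it is still Running

# Meanwhile, another writer marks the pod as completed.
fresh = store.get()
fresh["phase"] = "Succeeded"
store.update(fresh)

# The controller now writes from its stale snapshot and is rejected.
stale["annotations"]["checked"] = "true"
try:
    store.update(stale)
    stale_write_result = "ok"
except ConflictError as exc:
    stale_write_result = f"conflict: {exc}"


def add_annotation(obj):
    obj["annotations"]["checked"] = "true"


# Reading the latest version first lets the same change go through.
updated, attempts = update_with_retry(store, add_annotation)
print(stale_write_result, updated["phase"], attempts)
```

The point of the sketch is that the conflict itself is normal and recoverable. It becomes a workflow failure only if the writer keeps reusing the stale snapshot until its retry budget runs out.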
To troubleshoot and potentially resolve this issue, consider the following steps:
• *Ensure Up-to-Date Versions*: Make sure you are using the latest versions of Flyte, flyteplugins, and any related Kubernetes components. Updates often include bug fixes and improvements that might address this issue.
• *Review FlytePropeller and Plugin Configuration*: Check the configuration of FlytePropeller and the `k8s-array` plugin. There might be settings related to retries, concurrency, or synchronization that can be adjusted to mitigate this issue.
• *Check for Known Issues*: Look into the Flyte GitHub repositories and community discussions for any known issues or similar reports. This problem might have been encountered and addressed by others in the community.
• *Contact Flyte Support or Community*: If the issue persists, consider reaching out to the Flyte support team or the community forums with detailed information about your workflow, the Flyte and plugin versions you're using, and any logs or additional details about the error. The community or the developers might provide specific advice or solutions based on the latest developments.
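When gathering details for a report, it can help to pull the rejected pod writes out of the audit log. Below is a rough sketch, assuming the audit log is available as one JSON event per line and uses the standard audit event fields (`verb`, `objectRef`, `responseStatus`, `user`, `requestReceivedTimestamp`). The function name `find_pod_write_conflicts`, the sample events, the pod name, and the service account name are all made up for illustration.

```python
import json


def find_pod_write_conflicts(lines, pod_name_prefix=""):
    """Return (timestamp, username, verb, pod name) for pod writes rejected with 409."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON audit event
        ref = event.get("objectRef") or {}
        status = event.get("responseStatus") or {}
        name = ref.get("name") or ""
        if (
            ref.get("resource") == "pods"
            and event.get("verb") in {"update", "patch"}
            and status.get("code") == 409
            and name.startswith(pod_name_prefix)
        ):
            user = (event.get("user") or {}).get("username")
            hits.append((event.get("requestReceivedTimestamp"), user, event["verb"], name))
    return hits


# Hypothetical sample events; real audit logs carry many more fields.
sample_lines = [
    json.dumps({
        "verb": "update",
        "objectRef": {"resource": "pods", "name": "example-pod-0"},
        "responseStatus": {"code": 409},
        "user": {"username": "system:serviceaccount:flyte:flytepropeller"},
        "requestReceivedTimestamp": "2024-02-12T21:30:00.000000Z",
    }),
    json.dumps({
        "verb": "get",
        "objectRef": {"resource": "pods", "name": "example-pod-0"},
        "responseStatus": {"code": 200},
        "user": {"username": "system:serviceaccount:flyte:flytepropeller"},
        "requestReceivedTimestamp": "2024-02-12T21:30:01.000000Z",
    }),
    json.dumps({
        "verb": "update",
        "objectRef": {"resource": "configmaps", "name": "example-config"},
        "responseStatus": {"code": 409},
        "user": {"username": "someone-else"},
        "requestReceivedTimestamp": "2024-02-12T21:30:02.000000Z",
    }),
    "not a json line",
]

hits = find_pod_write_conflicts(sample_lines, pod_name_prefix="example-pod")
for hit in hits:
    print(hit)
```

Comparing the timestamps of these conflicts with the time the pod reached its final phase should show whether the writes really happen after completion.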
Given the nature of this error, it's also beneficial to monitor the Flyte and Kubernetes project repositories for any updates or patches that address race conditions or synchronization issues with the `k8s-array` plugin or similar components.
This is an AI-generated response, take its recommendations with caution.
https://discuss.flyte.org/t/2404507/got-an-interesting-panic-in-flyte-propeller-k8s-array-plugin/