jolly-florist-66728
12/03/2024, 6:37 PM
`map_task` failed out with a very long error message (it hit a SQL error on a large insert), but the containing `map_task` (Array Node on the UI) and the overarching workflow were just stuck in a Running state until I manually went in and terminated the workflow.
I'm guessing the cause of this issue was the size of the error message, but is there something I should be looking for in the logs to confirm this? Should Flyte be handling this sort of situation better (assuming it is the issue that I'm guessing)?
# Answer
A workflow that stays stuck in a "Running" state after a task fails with a very large error message is a known problem in Flyte. According to Flyte Issue #1613, large error messages can cause failures in FlytePropeller, so the workflow stops progressing. The error message can exceed the gRPC message size limit, and when that happens the system fails to update the workflow status properly.
To confirm if this is the issue, you should check the FlytePropeller logs for messages indicating a `ResourceExhausted` error or similar, which would suggest that the error message size exceeded the allowed limit. Flyte should ideally handle such situations by truncating error messages to prevent them from exceeding size limits, as discussed in Flyte Issue #5224.
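As a rough illustration of both points, here is a minimal Python sketch. It assumes you have already exported the FlytePropeller log lines to text and that the gRPC status name `ResourceExhausted` appears verbatim in them. The helper names `find_resource_exhausted` and `truncate_message` and the 1024-byte default are hypothetical. They are not part of Flyte or flytekit.

```python
from typing import Iterable, List

# Assumption: the gRPC status name shows up as plain text in the log lines.
MARKER = "ResourceExhausted"


def find_resource_exhausted(log_lines: Iterable[str], marker: str = MARKER) -> List[str]:
    """Return the log lines that mention the marker (case-insensitive)."""
    needle = marker.lower()
    return [line.rstrip("\n") for line in log_lines if needle in line.lower()]


def truncate_message(
    message: str,
    max_bytes: int = 1024,  # arbitrary illustrative cap, not a Flyte setting
    suffix: str = "... [truncated]",
) -> str:
    """Cap a message at max_bytes of UTF-8, cutting on a character boundary."""
    raw = message.encode("utf-8")
    if len(raw) <= max_bytes:
        return message
    suffix_raw = suffix.encode("utf-8")
    if max_bytes < len(suffix_raw):
        raise ValueError("max_bytes is smaller than the truncation suffix")
    budget = max_bytes - len(suffix_raw)
    # errors="ignore" drops a partial multi-byte character at the cut point.
    head = raw[:budget].decode("utf-8", errors="ignore")
    return head + suffix
```

Until Flyte truncates messages itself, one possible workaround is to catch the failure in your own task code and re-raise it with a shortened message, for example `raise RuntimeError(truncate_message(str(exc))) from None`. Whether that is enough to keep the reported error under the limit depends on how the full traceback is reported, so treat it as something to test rather than a guaranteed fix.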
Would you like more detailed information on how to address this issue or how Flyte handles large error messages?
Sources:
- Flyte Issue #1613
- Flyte Issue #5224
Slack Link
https://flyte.slack.com/archives/CP2HDHKE1/p1733251055.872239