New RunLLM response generated >>> Question Hi all I ran into Flyte #runllm-flyte-trial

jolly-florist-66728

12/03/2024, 6:37 PM

New RunLLM response generated:

*# Question*
Hi all -- I ran into an issue this morning in our Flyte instance where an underlying (individual) task in a `map_task` failed out with a very long error message (it hit a SQL error on a large insert), but the containing `map_task` (`Array Node` on the UI) and the overarching workflow were just stuck in a `Running` state until I manually went in and terminated the workflow. I'm guessing the cause of this issue was the size of the error message, but is there something I should be looking for in the logs to confirm this? Should Flyte be handling this sort of situation better (assuming it is the issue that I'm guessing)?

*# Answer*
The issue you're experiencing with the workflow being stuck in a "Running" state due to a large error message is a known problem in Flyte. According to Flyte Issue #1613, large error messages can cause failures in FlytePropeller, leading to workflows not progressing as expected. This is because the error message size can exceed the gRPC message size limit, causing the system to fail to update the workflow status properly.

To confirm if this is the issue, you should check the FlytePropeller logs for messages indicating a `ResourceExhausted` error or similar, which would suggest that the error message size exceeded the allowed limit. Flyte should ideally handle such situations by truncating error messages to prevent them from exceeding size limits, as discussed in Flyte Issue #5224.

Would you like more detailed information on how to address this issue or how Flyte handles large error messages?

Sources:
- Flyte Issue #1613
- Flyte Issue #5224

Slack Link: https://flyte.slack.com/archives/CP2HDHKE1/p1733251055.872239
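
Until truncation is handled on the Flyte side, one possible user-side workaround is to cap the size of the exception message before it leaves the task. The sketch below is only an illustration of that idea, not Flyte's own mechanism: `MAX_ERROR_BYTES`, `truncate_message`, and `with_truncated_errors` are hypothetical names, and the 8 KiB budget is an arbitrary assumption that you would tune against whatever gRPC limit your deployment actually uses.

```python
import functools

# Hypothetical budget, NOT a Flyte setting: choose a value well below the
# gRPC message size limit configured in your own deployment.
MAX_ERROR_BYTES = 8 * 1024


def truncate_message(message: str, max_bytes: int = MAX_ERROR_BYTES) -> str:
    """Return `message` cut down to roughly `max_bytes` of UTF-8."""
    encoded = message.encode("utf-8")
    if len(encoded) <= max_bytes:
        return message
    marker = "... [truncated]"
    keep = max(max_bytes - len(marker.encode("utf-8")), 0)
    # errors="ignore" drops a multi-byte character split at the cut point.
    head = encoded[:keep].decode("utf-8", errors="ignore")
    return head + marker


def with_truncated_errors(fn):
    """Re-raise any exception from `fn` with a size-capped message."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            short = truncate_message(f"{type(exc).__name__}: {exc}")
            # `from None` keeps the oversized original out of the chained traceback.
            raise RuntimeError(short) from None

    return wrapper
```

In a flytekit project, a decorator like this would presumably sit between `@task` and the function that performs the large insert. Whether that is enough to keep FlytePropeller from stalling depends on how your Flyte version captures task errors, so treat it as something to test rather than a confirmed fix.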
