New RunLLM response generated:
# Question
Hello! I am investigating some odd behavior in Flyte propeller that seems to happen at high scale and I'm looking for clarification about what is going on. It seems that when we have a large number of concurrent tasks and the propeller queue fills up it gets into a weird state where it doesn't seem to process anything. I have attached some screenshots below.
As pressure on propeller increases, you can see the free worker count decrease until it hits 0. At around 23:00 the free workers jump back up near the maximum and the queue depth apparently hits 0, but workflow acceptance begins climbing steadily, and at this point nothing is processing.
# Answer
It sounds like you are experiencing a resource saturation issue with FlytePropeller. When the free worker count hits zero, it indicates that all available workers are occupied, which can lead to increased latency and processing delays. This is often due to the number of concurrent workflows exceeding the capacity of the configured workers.
According to the Flyte documentation, increasing the number of workers can help alleviate this issue. The number of workers should be balanced against the available CPU cores to avoid excessive context switching. Additionally, monitoring the `round_latency` metric can provide insight into performance bottlenecks. If the queue depth is zero but workflows are not processing, it might indicate a problem with the workflow evaluation loop or a potential bug in the system.
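As a rough starting point for raising worker capacity, the worker count lives in the FlytePropeller configuration. The snippet below is a minimal sketch, assuming a standard ConfigMap-style propeller config; the specific values shown are illustrative placeholders, not tuned recommendations for your cluster:

```yaml
# FlytePropeller configuration sketch (illustrative values, not a recommendation)
propeller:
  # Number of worker goroutines pulling workflows off the queue.
  # Balance against available CPU cores to avoid excessive context switching.
  workers: 100
  # How often a workflow is re-enqueued for evaluation even without new events;
  # lowering this increases responsiveness at the cost of more rounds.
  workflow-reeval-duration: 30s
```

If you adjust `workers`, it is worth watching free worker count and `round_latency` together after the change to confirm the bottleneck actually moved.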
Would you like more detailed steps on how to address this issue or further troubleshooting steps?
Sources:
- Flyte Documentation on Performance
- Slack Link: https://flyte.slack.com/archives/CP2HDHKE1/p1732131434.815739
Hint: Mention @runllm in the thread for followups.