# flyte-support
c
Hello! I am investigating some odd behavior in Flyte propeller and I'm looking for clarification about what is going on. It seems that when we have a large number of concurrent tasks and the propeller queue fills up, it gets into a weird state where it doesn't seem to process anything. I have attached some screenshots below. As pressure on propeller increases you can see the free worker count decrease until it hits 0 at around 21:00. At around 23:00 the free workers jump back up near the maximum, the queue depth apparently hits 0, nothing seems to be added to the queue, but workflow acceptance latency begins climbing steadily, and at this point nothing is processing. During this whole time the number of running workflows also steadily increases, while CPU usage of flytepropeller drops. From looking at the logs there are just insane amounts of
Enqueueing workflow
and not much else
Volume of
Enqueueing workflow
logs (98 million)
And here is the volume of logs that are not
Enqueueing workflow
logs. (We rebooted flytepropeller to try and recover things towards the end.) The point being that once everything breaks, all that is logged is
Enqueueing workflow
and nothing else.
b
Hey @clean-glass-36808, we had somewhat of a similar issue. What flytepropeller version are you running?
You could have the same issue we were having. On top of that, a rate limit was added to the cache, and the defaults can be somewhat low depending on what you're running. This is what ultimately fixed our issue, plus bumping that rate limit: https://github.com/flyteorg/flyte/pull/5788
Whenever we rebooted propeller things came back to normal since it would clear the cache queue.
c
We're running a fork branched off of v1.13.0
b
Do you have graphs of the requests made to flyteadmin? What we also saw was a steady increase in the number of requests to flyteadmin.
Then it could be that and/or the rate limit. I would recommend you move to the latest release.
c
I can pull the metrics in a moment
I see a little spike about an hour after I see things break, but I'm not sure it's related.
b
And it dropped after you restarted?
I'm not entirely sure without looking at the logs, but the issue does seem somewhat similar. I would give those changes a shot 🙂
c
In the graph above we did not reboot. We rebooted at around 8 AM, which is outside the graph's time range. When we rebooted, some workflows got processed but it got into a broken state again. Then after a second reboot it seems to have recovered.
b
Ok, I'll check tomorrow. It could just be a matter of tweaking the rate limiting config; I don't know it by heart.
c
Which rate limit are you referring to?
b
This one
c
I don't think that would be an issue for us. We configured ours when we did initial scale testing at ~25k concurrent tasks.
There are also barely any logs that indicate that propeller node execution is even running in our case
b
Did you do it before or after this change? This was "somewhat recent". What was happening in our case was that our workflow executions were failing too fast (we use a lot of subworkflows) and sometimes the executions in admin would never leave the
Aborting
state. This
Aborting
state item would stay in the workqueue until there was no more space in the queue, so we would observe exactly what you're describing: no executions could run. The queue was full, with many calls to flyteadmin trying to update that queue item from
Aborting
to aborted, but it could never update.
I can explain better tomorrow, gotta drop.
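To make that failure mode concrete, here is a minimal sketch using a plain client-go rate-limited workqueue. It is not FlytePropeller's actual code: the execution key and the updateAdmin helper are invented stand-ins for the real sync path. A handler that can never finish (an execution stuck in Aborting) keeps putting its item back in the queue, so the item never leaves and every retry is one more request to flyteadmin.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// updateAdmin is a hypothetical stand-in for the flyteadmin call that tries
// to move an execution out of the Aborting state; for a wedged execution it
// never succeeds.
func updateAdmin(key string) error {
	return fmt.Errorf("execution %s is still Aborting", key)
}

func main() {
	// Rate-limited workqueue with client-go's default per-item exponential
	// backoff plus an overall token bucket.
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	q.Add("proj/domain/exec-123")

	for attempt := 1; attempt <= 8; attempt++ {
		key, _ := q.Get()
		err := updateAdmin(key.(string))
		q.Done(key)
		if err != nil {
			// The item goes straight back into the queue; it never leaves,
			// and each retry is another admin request.
			q.AddRateLimited(key)
			fmt.Printf("attempt %d failed, requeues so far: %d\n", attempt, q.NumRequeues(key))
		}
	}
}
```

With enough of these wedged items they crowd out everything else, which matches the behavior described above: the queue fills up and nothing else can run.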
c
Looking at the logs there was nothing related to aborting in flytepropeller until we tried to recover the system
f
@clean-glass-36808 we have a solution for this in Union today. We are running more than 100k workflows concurrently.
c
@freezing-airport-6809 Not sure what you mean. I think there is some sort of bug here. The concurrent workflow count in this case was < 2000, and we have scale tested 25k concurrent workflows before without issue.
f
No, what I mean is that we have a few things at Union that help us scale beyond 100k; we can review and let you know.
c
I'll have to look into the source code, but I don't see how we can have 98 million logs about enqueueing workflows while the metrics indicate the unprocessed queue depth is 0, the workers are seemingly doing nothing, and acceptance latency grows forever.
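One way those observations can coexist, assuming the enqueue path ends in a rate-limited add: with a client-go rate-limiting workqueue, an item added via AddRateLimited is parked in a delaying queue until the rate limiter allows it, and only then counts toward the depth that Len() reports, so workers have nothing to pull in the meantime. A minimal sketch, with a deliberately extreme limiter (one token per hour) standing in for a limiter whose schedule has been pushed far out, and an invented execution key:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// A bucket limiter with one token per hour stands in for a limiter whose
	// schedule has been pushed far into the future.
	rl := &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Every(time.Hour), 1)}
	q := workqueue.NewRateLimitingQueue(rl)

	rl.When("warm-up") // use up the single burst token

	// An "Enqueueing workflow" style log would be emitted around here...
	q.AddRateLimited("proj/domain/exec-456")

	// ...but the item is parked in the delaying queue until the limiter
	// allows it, so the visible depth stays 0 and workers sit idle.
	time.Sleep(100 * time.Millisecond)
	fmt.Printf("visible queue depth: %d\n", q.Len()) // prints 0
}
```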
f
I think it’s a bug in kubeclient libs
Ohh wait - maybe you just have undeleted workflows in your cluster
Can you check kubectl get fly
The GC may not be working
That's why you see 98 million
You will have to manually clean things up by increasing the timeout
c
We rebooted propeller and that recovered things. I would have to reproduce the issue again.
b
Yeah, if it were a bunch of undeleted workflows, rebooting wouldn't help. I'm still betting on some problem with the cache; it's exactly what we've seen on our end, and rebooting the propellers makes it run again. It's easy to prove, @clean-glass-36808, by upgrading Flyte 🙂
g
cc @white-painting-22485
c
I can definitely upgrade Flyte. It's a little more involved since we run a fork and I've got a lot going on at work 🙂
I was generally curious whether anyone knew what could cause this, since I don't have a mental model for how propeller churns through nodes yet. I'm also not sure if I'm misinterpreting the metrics.
f
Hmm it will recover for a bit? If you have too many objects in etcd
c
So we rebooted propeller once and it recovered most of the executions but got into a weird state with a subset of executions. I had specific logs of an execution that was kicked off and then never synced (but tons of
Enqueueing workflow
logs) until we rebooted a second and final time.
So if every execution is a CRD, that seems to imply some issue with the kube client like you said, or an issue with the controller that is receiving webhook calls? Presumably
Enqueueing workflow
writes a CRD, but if it doesn't show up in the queue it's not being picked up correctly.
FWIW, etcd write latencies did increase but I'm not sure how bad these values are. The time frame between the two humps is when things were pretty broken. Second hump is when I rebooted propeller.
w
@clean-glass-36808 it sounds like you may be running into this issue I just fixed: https://github.com/flyteorg/flyte/commit/ed87fa17b356a7ddac3c2b7180c7e800fbf8ad90. Whenever there are a lot of outstanding executions, propeller will repeatedly add them to the workqueue and end up pushing the rate limiter far out into the future.
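A rough sketch of that mechanism, using the bucket parameters from client-go's default controller rate limiter (10 qps, burst 100); the 5,000-workflow count and key names are invented. Each pass over the outstanding workflows takes another reservation per item, so the bucket's schedule keeps moving further out even though nothing new is actually processed:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The overall token bucket used by client-go's default controller
	// rate limiter: 10 qps with a burst of 100.
	bucket := &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)}

	// Pretend 5,000 workflows are outstanding and every evaluation pass
	// re-adds all of them through the rate limiter.
	const outstanding = 5000
	var delay time.Duration
	for pass := 1; pass <= 3; pass++ {
		for i := 0; i < outstanding; i++ {
			delay = bucket.When(fmt.Sprintf("wf-%d", i))
		}
		fmt.Printf("after pass %d, newly added work is scheduled ~%v out\n", pass, delay.Round(time.Second))
	}
}
```

After a few passes the limiter is already scheduling new additions tens of minutes into the future, which lines up with an idle propeller that does little besides log enqueues.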
c
That seems to line up with what we've been seeing. Thanks so much!