# flyte-support
c
Hello! I am investigating some odd behavior in Flyte propeller and I'm looking for clarification about what is going on. It seems that when we have a large number of concurrent tasks and the propeller queue fills up, it gets into a weird state where it doesn't seem to process anything. I have attached some screenshots below. As pressure on propeller increases you can see the free worker count decrease until it hits 0 at around 21:00. At around 23:00 the free workers jump back up near the maximum, the queue depth apparently hits 0, nothing seems to be added to the queue, but workflow acceptance latency begins climbing steadily, and at this point nothing is processing. During this whole time the number of running workflows also steadily increases, while CPU usage of flytepropeller drops. From looking at the logs there are just insane amounts of
Enqueueing workflow
and not much else
Volume of
Enqueueing workflow
logs (98 million)
And here is the volume of logs that are not
Enqueueing workflow
logs. (We rebooted flytepropeller to try and recover things towards the end.) The point being that once everything breaks, all that is logged is
Enqueueing workflow
and nothing else.
b
Hey @clean-glass-36808, we had somewhat of a similar issue. What flytepropeller version are you running?
You could have the same issue we were having. On top of that, a rate limit was added to the cache, and the defaults can be somewhat low depending on what you're running. This is what ultimately fixed our issue, plus bumping that rate limit: https://github.com/flyteorg/flyte/pull/5788
Whenever we rebooted propeller things came back to normal since it would clear the cache queue.
c
We're running a fork branched off of v1.13.0
b
Do you have graphs of the requests made to flyteadmin? What we also saw was a steady increase in the number of requests to flyteadmin.
Then it could be that and/or the rate limit. I would recommend you move to the latest release.
c
I can pull the metrics in a moment
I see a little spike about an hour after I see things break, but I'm not sure it's related.
b
And it dropped after you restarted?
I'm not entirely sure without looking at the logs, but the issue does seem somewhat similar. I would give those changes a shot 🙂
c
In the graph above we did not reboot. We rebooted at around 8 AM, which is outside the graph's time range. When we rebooted, some workflows got processed but it got into a broken state again. Then after a second reboot it seems to have recovered.
b
Ok, I'll check tomorrow. It could just be a matter of tweaking the rate limiting config; I don't know it by heart.
c
Which rate limit are you referring to?
b
This one
c
I don't think that would be an issue for us. We configured ours when we did initial scale testing at ~25k concurrent tasks.
There are also barely any logs that indicate that propeller node execution is even running in our case
b
Did you do it before or after this change? This was "somewhat recent". What was happening in our case was that our workflow executions were failing too fast (we use a lot of subworkflows) and sometimes the executions in admin would never leave the
Aborting
state. This
Aborting
state item would stay in the workqueue until there was no more space in the queue, so we would observe exactly what you're describing: no executions could run. The queue was full, with many calls to flyteadmin trying to update that queue item from
Aborting
to aborted, but it could never update.
I can explain better tomorrow, gotta drop.
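To make that failure mode concrete, here is a minimal sketch using a plain client-go rate-limited workqueue. It is not FlytePropeller's actual code: the execution key and the updateAdmin helper are invented stand-ins for the real sync path. A handler that can never finish (an execution stuck in Aborting) keeps putting its item back in the queue, so the item never leaves and every retry is one more request to flyteadmin.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// updateAdmin is a hypothetical stand-in for the flyteadmin call that tries
// to move an execution out of the Aborting state; for a wedged execution it
// never succeeds.
func updateAdmin(key string) error {
	return fmt.Errorf("execution %s is still Aborting", key)
}

func main() {
	// Rate-limited workqueue with client-go's default per-item exponential
	// backoff plus an overall token bucket.
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	q.Add("proj/domain/exec-123")

	for attempt := 1; attempt <= 8; attempt++ {
		key, _ := q.Get()
		err := updateAdmin(key.(string))
		q.Done(key)
		if err != nil {
			// The item goes straight back into the queue; it never leaves,
			// and each retry is another admin request.
			q.AddRateLimited(key)
			fmt.Printf("attempt %d failed, requeues so far: %d\n", attempt, q.NumRequeues(key))
		}
	}
}
```

With enough of these wedged items they crowd out everything else, which matches the behavior described above: the queue fills up and nothing else can run.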
c
Looking at the logs there was nothing related to aborting in flytepropeller until we tried to recover the system
f
@clean-glass-36808 we have a solution for this in Union today. We are running more than 100k workflows concurrently.
c
@freezing-airport-6809 Not sure what you mean. I think there is some sort of bug here. The concurrent workflow count in this case was < 2000, and we have scale tested 25k concurrent workflows before without issue.
f
No, what I mean is that we have a few things at Union that help us scale beyond 100k; we can review and let you know.
c
I'll have to look into the source code, but I don't see how we can have 98 million logs about enqueueing workflows while the metrics indicate the unprocessed queue depth is 0, the workers are seemingly doing nothing, and acceptance latency grows forever.
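One way those observations can coexist, assuming the enqueue path ends in a rate-limited add: with a client-go rate-limiting workqueue, an item added via AddRateLimited is parked in a delaying queue until the rate limiter allows it, and only then counts toward the depth that Len() reports, so workers have nothing to pull in the meantime. A minimal sketch, with a deliberately extreme limiter (one token per hour) standing in for a limiter whose schedule has been pushed far out, and an invented execution key:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// A bucket limiter with one token per hour stands in for a limiter whose
	// schedule has been pushed far into the future.
	rl := &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Every(time.Hour), 1)}
	q := workqueue.NewRateLimitingQueue(rl)

	rl.When("warm-up") // use up the single burst token

	// An "Enqueueing workflow" style log would be emitted around here...
	q.AddRateLimited("proj/domain/exec-456")

	// ...but the item is parked in the delaying queue until the limiter
	// allows it, so the visible depth stays 0 and workers sit idle.
	time.Sleep(100 * time.Millisecond)
	fmt.Printf("visible queue depth: %d\n", q.Len()) // prints 0
}
```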
f
I think it’s a bug in kubeclient libs
Ohh wait - maybe you just have undeleted workflows in your cluster
Can you check kubectl get fly
The GC may not be working
That's why you see 98 million
You will have to manually clean things up by increasing the timeout
c
We rebooted propeller and that recovered things. I would have to reproduce the issue again.
b
Yeah, if it were a bunch of undeleted workflows, rebooting wouldn't help. I'm still betting on some problem with the cache; it's exactly what we've seen on our end, and rebooting the propellers makes it run again. It's easy to prove, @clean-glass-36808, by upgrading Flyte 🙂
g
cc @white-painting-22485
c
I can definitely upgrade Flyte. It's a little more involved since we run a fork and I've got a lot going on at work 🙂
I was generally curious whether anyone knew what could cause this, since I don't have a mental model for how propeller churns through nodes yet. I'm also not sure if I'm misinterpreting the metrics.
f
Hmm it will recover for a bit? If you have too many objects in etcd
c
So we rebooted propeller once and it recovered most of the executions but got into a weird state with a subset of executions. I had specific logs of an execution that was kicked off and then never synced (but tons of
Enqueueing workflow
logs) until we rebooted a second and final time.
So if every execution is a CRD, that seems to imply some issue with the kube client like you said, or an issue with the controller that is receiving webhook calls? Presumably
Enqueueing workflow
writes a CRD, but if it doesn't show up in the queue it's not being picked up correctly.
FWIW, etcd write latencies did increase but I'm not sure how bad these values are. The time frame between the two humps is when things were pretty broken. Second hump is when I rebooted propeller.
w
@clean-glass-36808 it sounds like you may be running into this issue I just fixed: https://github.com/flyteorg/flyte/commit/ed87fa17b356a7ddac3c2b7180c7e800fbf8ad90. Whenever there are a lot of outstanding executions, propeller will repeatedly add them to the workqueue and end up pushing the rate limiter far out into the future.
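A rough sketch of that mechanism, using the bucket parameters from client-go's default controller rate limiter (10 qps, burst 100); the 5,000-workflow count and key names are invented. Each pass over the outstanding workflows takes another reservation per item, so the bucket's schedule keeps moving further out even though nothing new is actually processed:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The overall token bucket used by client-go's default controller
	// rate limiter: 10 qps with a burst of 100.
	bucket := &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)}

	// Pretend 5,000 workflows are outstanding and every evaluation pass
	// re-adds all of them through the rate limiter.
	const outstanding = 5000
	var delay time.Duration
	for pass := 1; pass <= 3; pass++ {
		for i := 0; i < outstanding; i++ {
			delay = bucket.When(fmt.Sprintf("wf-%d", i))
		}
		fmt.Printf("after pass %d, newly added work is scheduled ~%v out\n", pass, delay.Round(time.Second))
	}
}
```

After a few passes the limiter is already scheduling new additions tens of minutes into the future, which lines up with an idle propeller that does little besides log enqueues.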
c
That seems to line up with what we've been seeing. Thanks so much!