# ask-the-community
Hello. Another question from me 🙂. I think we've solved some of the problems with our crazy fanout workflows, and so far they are not failing or getting totally stuck anymore. However, it's still not running smoothly. It seems like flytepropeller gets slower as the workflows progress. We've noticed multiple hours between pods completing and the status updating on Flyte. Also, new workflows submitted at this time often remain in an unknown state for hours. Looking at the flytepropeller dashboard, there are some big spikes in latencies to a few seconds, but nothing that looks like it could be causing delays of hours. Interestingly, restarting flytepropeller seemed to help a little bit. I'm quite suspicious of something going wrong with how flytepropeller is trying to watch pod and/or flyteworkflow statuses.
If we refer to the different latencies described in https://docs.flyte.org/en/latest/concepts/execution_timeline.html#divedeep-execution-timeline, I have seen clear examples where Acceptance Latency, Transition Latency and Completion Latency are very slow, on the order of hours. In the case of slow acceptance latency, I can see that the workflow CRD has been created, but I think flytepropeller has yet to run the first round. The CRD has acceptedAt: "2023-11-21T15:54:14Z" but was not started until startedAt: "2023-11-21T18:10:00Z". In the case of transition latency, the state had not updated in flyteadmin or the CRD, so I think it must all be k8s observation latency.
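For reference, the acceptance latency implied by those two timestamps can be worked out directly. This is a minimal Go sketch that uses only the values quoted above; the helper name latencyBetween is made up for illustration and is not part of Flyte.

```go
package main

import (
	"fmt"
	"time"
)

// latencyBetween parses two RFC 3339 timestamps (the format of the CRD
// fields quoted above) and returns the time elapsed between them.
// The function name is illustrative, not a Flyte API.
func latencyBetween(from, to string) (time.Duration, error) {
	start, err := time.Parse(time.RFC3339, from)
	if err != nil {
		return 0, err
	}
	end, err := time.Parse(time.RFC3339, to)
	if err != nil {
		return 0, err
	}
	return end.Sub(start), nil
}

func main() {
	d, err := latencyBetween("2023-11-21T15:54:14Z", "2023-11-21T18:10:00Z")
	if err != nil {
		panic(err)
	}
	fmt.Println(d) // 2h15m46s
}
```

So the workflow sat for a little over two and a quarter hours between being accepted and being started.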
cc @Dan Rammer (hamersaw), i'd like to keep this on your radar. ptal after you're back!
I got some metrics which make the problem quite clear
I think I might have worked it out, at least partially. I think it's because of a bug in garbage collection on flytepropeller 1.10.4. It seems like this release changes the garbage collector to use a timer instead of a ticker, so garbage collection runs only once each time flytepropeller restarts. That means we have 90,000 old flyteworkflows sitting around, and that seems to cause major problems for watching the flyteworkflow CRs.
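The difference between the two primitives is easy to reproduce with Go's standard library: a time.Timer delivers once, while a time.Ticker keeps delivering every interval. The sketch below is not flytepropeller's garbage collector (which is not shown in this thread); the function names and the millisecond values are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// countFires stands in for a GC loop: each receive on c would be one GC pass.
// It returns how many passes happen before the window elapses.
func countFires(c <-chan time.Time, windowMs int) int {
	deadline := time.After(time.Duration(windowMs) * time.Millisecond)
	fires := 0
	for {
		select {
		case <-c:
			fires++
		case <-deadline:
			return fires
		}
	}
}

// countTimerFires uses a one-shot time.Timer as the trigger.
func countTimerFires(intervalMs, windowMs int) int {
	timer := time.NewTimer(time.Duration(intervalMs) * time.Millisecond)
	defer timer.Stop()
	return countFires(timer.C, windowMs)
}

// countTickerFires uses a repeating time.Ticker as the trigger.
func countTickerFires(intervalMs, windowMs int) int {
	ticker := time.NewTicker(time.Duration(intervalMs) * time.Millisecond)
	defer ticker.Stop()
	return countFires(ticker.C, windowMs)
}

func main() {
	fmt.Println("timer passes:", countTimerFires(10, 105))   // a single pass
	fmt.Println("ticker passes:", countTickerFires(10, 105)) // roughly one pass per interval
}
```

A loop driven by a timer that is never reset behaves like the first case: one pass after startup, then nothing until the process restarts.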
$ kubectl get flyteworkflows -o yaml --watch --namespace flyteexamples-development > watch_crds.yaml
Error from server (Expired): The provided continue parameter is too old to display a consistent list result. You can start a new list without the continue parameter, or use the continue token in this response to retrieve the remainder of the results. Continuing with the provided token results in an inconsistent list - objects that were created, modified, or deleted between the time the first chunk was returned and now may show up in the list.
I get the feeling I should not be using flytepropeller 1.10.4. I think I'll go back to 1.10.0.
I made a comment about this on the PR that made the change https://github.com/flyteorg/flyte/pull/4199#discussion_r1406729862
thank you, @Thomas Newton. Looking at this right now.
@Thomas Newton this would explain why the etcd latencies are extremely large as well. All the times you mentioned (ie. acceptance, transition, completion) indicate that propeller could make progress and it's not. The garbage collection issue you found (btw AMAZING!) is likely the culprit, but I would be interested in seeing some of the other metrics. The two that come to mind are the "free workers" and "round latency", it seems that free workers should be 0 and round latencies should be extremely large?!?!
My Prometheus deployment is OOMing (again) so I can't get screenshots right now, but there were definitely plenty of occasions where this was happening and there were 100s of free workers. Round latencies were mostly ~100ms, occasionally spiking to ~10 seconds.
Ahhh OK. This makes sense to me. Thanks!
Also, I have it on my TODO list to look at the node collapse and remove etcd errors PRs you linked elsewhere, thanks so much!
It seems like fixing garbage collection was not the whole story. I think we were also hitting the rate limits on enqueueing workflows. Changing this line from
seems to solve the acceptance latency problems. Probably we will need to increase the rate limit quite significantly. Please could someone explain the motivation for the rate limits on adding to these queues? I don't really understand why we would want to delay stuff being added to the queue. As I understand it, all the same things will get added to the queue eventually, so to me it seems more logical to just add everything to the queue immediately and let the workers run through it as fast as they are able to.
Do you know what your queue configuration is? It should be in the propeller config, something like this. We recommend relaxing these significantly, and I know we made a change to update the defaults that I now have to double-check on. The motivation behind relaxing these is to mitigate throttling of other operations. Workflows are added to the queue in many scenarios, for example every N seconds or every time a Pod pertaining to the workflow is updated. If we start, for example, 1k pods, then we can be evaluating the workflow every 10ms or so, which obviously is not ideal. Depending on the task plugin type, we can be hitting external APIs, sending Flyte events, or making k8s apiserver calls every time the workflow is evaluated. In this scenario, it obviously makes sense to only evaluate the workflow at a less frequent interval.
Updating the queue configuration in propeller can give you very fine control here.
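To get a feel for how an enqueue rate limit can turn into acceptance latency, here is a rough Go sketch. It assumes a plain token bucket that starts full and a batch of items that all arrive at the same moment; the function name and the numbers (1000 items, capacity 100, 10 items per second) are hypothetical and are not Flyte defaults.

```go
package main

import (
	"fmt"
	"time"
)

// enqueueDelay returns how long the i-th item (0-based) of a simultaneous
// batch waits before a token bucket admits it. The bucket holds `burst`
// tokens, starts full, and refills at `qps` tokens per second.
// This models a generic token bucket, not propeller's exact limiter.
func enqueueDelay(i, burst int, qps float64) time.Duration {
	if i < burst {
		return 0 // served straight from the initial tokens
	}
	waitSeconds := float64(i-burst+1) / qps
	return time.Duration(waitSeconds * float64(time.Second))
}

func main() {
	// Hypothetical numbers, chosen only for illustration.
	const items, burst, qps = 1000, 100, 10.0
	fmt.Println(enqueueDelay(0, burst, qps))       // 0s
	fmt.Println(enqueueDelay(items-1, burst, qps)) // 1m30s
}
```

Under these assumptions the last item waits 90 seconds before it even reaches the work queue, no matter how many workers are idle.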
We increased the number of workers to 800 (200X increase compared to the default) so probably we also should have applied a similar adjustment factor to the rate limit.
Yeah, looks like I closed this PR before merging because we changed the deployment model ... and then we changed it right back. I'll revisit this to get it cleaned up.
We increased the number of workers to 800
🙏 Each worker is just a goroutine, so this should be simple to handle. CPU utilization is still quite low on the propeller Pod, right?
Perfect, workers are just goroutines as you probably know, so it should be able to scale pretty large. Also, each worker processes a single workflow at a time, so increasing is only useful if the number of workflows is very large.
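The "one workflow at a time per worker" model can be sketched as a plain goroutine pool reading from a shared channel. This is a toy under that assumption, not propeller's worker loop; runWorkers and the key names are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// runWorkers starts `workers` goroutines that each take one workflow key at a
// time from a shared channel. It returns how many keys were processed and how
// many workers handled at least one key.
func runWorkers(workers int, keys []string) (processed, busy int) {
	queue := make(chan string)
	var mu sync.Mutex
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			handled := 0
			for range queue { // a worker holds a single key at a time
				handled++
			}
			mu.Lock()
			processed += handled
			if handled > 0 {
				busy++
			}
			mu.Unlock()
		}()
	}
	for _, k := range keys {
		queue <- k
	}
	close(queue)
	wg.Wait()
	return processed, busy
}

func main() {
	keys := []string{"wf-a", "wf-b", "wf-c", "wf-d", "wf-e"} // hypothetical keys
	processed, busy := runWorkers(800, keys)
	fmt.Println(processed, busy <= len(keys)) // 5 true
}
```

With 800 workers and 5 keys, at most 5 workers ever do anything, which is the point about extra workers only helping when many workflows are queued.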
In my current test which fully disables the rate limiting I see the number of active workers is high much more often.
I can't find much in the way of docs about
. All I can see is
// AddRateLimited adds an item to the workqueue after the rate limiter says it's ok
This seems to suggest that it will just put stuff into a different queue that feeds into the main workqueue as rate limiting allows. If my interpretation is correct, then I don't understand why this rate limiting helps. It's just going to lead to an ever-increasing number of workflows that are waiting to be re-enqueued.
I would be interested to see if end to end workflow evaluations are any faster. Even though active workers is much higher, in our benchmarks they're usually not doing anything useful. At least there are certainly diminishing returns that just chip away at other throttles.
But will
actually drop workflows when the rate limit is hit? My interpretation of the docs is that it won't. That means the same number of workflow updates need to be evaluated regardless. With the rate limiting they will just be spread out over a longer period.
But will
actually drop workflows when the rate limit is hit?
It's been a long time since I've been down the rabbit hole of our queue system. I would have to look.
That means the same number of workflow updates need to be evaluated regardless.
I don't think this is necessarily true. If a workflow is re-enqueued every 10ms and there is no rate limit, it will be evaluated every 10ms (as long as there are free workers). However, if it is rate-limited, I believe only a single instance can be in the ether. So if the 2nd and 3rd re-enqueues are rate-limited, it will only be added to the queue once. Again, I would have to double-check, but I think this is how it worked the last time I checked.
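The behaviour described here, where repeated re-enqueues of the same workflow collapse into a single pending entry, can be sketched with a small deduplicating queue. This toy is only an illustration of that idea, assuming it holds; dedupQueue is not flytepropeller's or client-go's implementation, and the workflow key is hypothetical.

```go
package main

import "fmt"

// dedupQueue is a toy FIFO that ignores a key that is already waiting.
type dedupQueue struct {
	pending map[string]bool
	order   []string
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{pending: make(map[string]bool)}
}

// Add enqueues key unless it is already pending and reports whether it was added.
func (q *dedupQueue) Add(key string) bool {
	if q.pending[key] {
		return false
	}
	q.pending[key] = true
	q.order = append(q.order, key)
	return true
}

// Get pops the oldest pending key; ok is false when the queue is empty.
func (q *dedupQueue) Get() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// Len returns the number of keys waiting to be processed.
func (q *dedupQueue) Len() int { return len(q.order) }

func main() {
	q := newDedupQueue()
	for i := 0; i < 3; i++ {
		q.Add("flyteexamples-development/wf-1") // hypothetical workflow key
	}
	fmt.Println(q.Len()) // 1
}
```

Three updates arriving while the key is still waiting cost one evaluation instead of three, which is how delaying the enqueue can reduce total work rather than just postponing it.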
Ah, that would make a lot of sense
I think I will increase rate limits on main and sub queues by a large factor (maybe 100X) and probably also increase
to about 10 minutes. That way we reduce unnecessary re-queues and rely more on updates propagating through the informer queues. The other rate limit I can think of is the k8s one. Given that flyte uses the watch API with informer queues for monitoring pod status I think faster rounds would not make too much difference to the rate of API calls to k8s.
Is there a way to have different queues for different workflows or projects? In case we have workflows with plugins hitting external APIs as you mentioned. Or do we need to find a config that works for all workflows, including high throughput use cases like this one?
One way to implement that would be to use sharded propellers for each namespace https://docs.flyte.org/en/latest/deployment/configuration/performance.html#manual-scale-out
Note on the sharded propeller: the docs use propeller manager to automatically handle scale-out based on a sharding configuration. But if you want more fine-grained control (e.g. a different configuration for each), you may need to manually manage sharded propeller instances using the configuration options. Basically, the propeller manager is a lightweight k8s orchestrator (think deployment / replica set / etc.) that injects combinations of these options to start multiple propellers with restricted scopes.