calm-pilot-2010
11/21/2023, 6:11 PM

calm-pilot-2010
11/21/2023, 6:19 PM
acceptedAt: "2023-11-21T15:54:14Z", but n0 was not started until startedAt: "2023-11-21T18:10:00Z".
In the case of transition latency the state had not updated in flyteadmin or the CRD, so I think it must all be k8s observation latency.

tall-lock-23197

calm-pilot-2010
11/23/2023, 1:10 PM

calm-pilot-2010
11/27/2023, 8:50 PM
$ kubectl get flyteworkflows -o yaml --watch --namespace flyteexamples-development > watch_crds.yaml
Error from server (Expired): The provided continue parameter is too old to display a consistent list result. You can start a new list without the continue parameter, or use the continue token in this response to retrieve the remainder of the results. Continuing with the provided token results in an inconsistent list - objects that were created, modified, or deleted between the time the first chunk was returned and now may show up in the list.
I get the feeling I should not be using flytepropeller 1.10.4. I think I'll go back to 1.10.0.

calm-pilot-2010
11/27/2023, 8:56 PM

high-accountant-32689
11/27/2023, 9:39 PM

hallowed-mouse-14616
11/27/2023, 9:45 PM

calm-pilot-2010
11/27/2023, 9:53 PM

hallowed-mouse-14616
11/27/2023, 9:54 PM

hallowed-mouse-14616
11/27/2023, 9:55 PM

calm-pilot-2010
11/28/2023, 6:33 PM
Changing AddRateLimited to Add seems to solve the acceptance latency problems. Probably we will need to increase the rate limit quite significantly.
Please could someone explain the motivation for the rate limits on adding to these queues? I don't really understand why we would want to delay items being added to the queue. As I understand it, all the same things will get added to the queue eventually, so to me it seems more logical to just add everything to the queue immediately and let the workers run through it as fast as they are able to.
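For context, a minimal sketch of the two calls being discussed, using the standard client-go workqueue package; the keys are made up and this is not flytepropeller's actual wiring:

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// DefaultControllerRateLimiter combines a per-item exponential backoff
	// (5ms base, 1000s cap) with an overall token bucket of 10 items/sec
	// (burst 100) in the client-go versions current at the time.
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer q.ShutDown()

	// AddRateLimited only makes the item visible to workers once the rate
	// limiter allows it; with a large backlog this wait shows up as
	// acceptance latency.
	q.AddRateLimited("flyteexamples-development/wf-a")

	// Add makes the item available immediately (still deduplicated if the
	// same key is already pending).
	q.Add("flyteexamples-development/wf-b")

	time.Sleep(50 * time.Millisecond)
	fmt.Println("pending:", q.Len())
}
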
hallowed-mouse-14616
11/28/2023, 6:39 PM

hallowed-mouse-14616
11/28/2023, 6:40 PM

calm-pilot-2010
11/28/2023, 6:42 PM
master
https://github.com/flyteorg/flyte/blob/4ce0bf0b617f179b3147706f0f43f71687acab03/charts/flyte-core/values.yaml#L687C6-L700C24

calm-pilot-2010
11/28/2023, 6:44 PM

hallowed-mouse-14616
11/28/2023, 6:46 PM

hallowed-mouse-14616
11/28/2023, 6:47 PM
> We increased the number of workers to 800 🙏
Each worker is just a goroutine, so this should be simple to handle. CPU utilization is still quite low on the propeller Pod right?
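
As a rough illustration of the "each worker is just a goroutine" point, here is a toy sketch of n consumer goroutines draining one shared workqueue; the handler and key are invented and this is not propeller's actual worker loop:

package main

import (
	"fmt"
	"sync"

	"k8s.io/client-go/util/workqueue"
)

// startWorkers launches n goroutines that pull items off a shared workqueue.
// Raising the worker count only adds cheap goroutines, not OS threads.
func startWorkers(q workqueue.Interface, n int, handle func(string)) *sync.WaitGroup {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				item, shutdown := q.Get()
				if shutdown {
					return
				}
				handle(item.(string))
				q.Done(item)
			}
		}()
	}
	return &wg
}

func main() {
	q := workqueue.New()
	wg := startWorkers(q, 800, func(key string) { fmt.Println("evaluating", key) })

	q.Add("flyteexamples-development/wf-1")
	q.ShutDownWithDrain()
	wg.Wait()
}
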
calm-pilot-2010
11/28/2023, 6:48 PM

hallowed-mouse-14616
11/28/2023, 6:49 PM

calm-pilot-2010
11/28/2023, 6:50 PM

calm-pilot-2010
11/28/2023, 6:50 PM
Regarding AddRateLimited, all I can see is:
// AddRateLimited adds an item to the workqueue after the rate limiter says it's ok
This seems to suggest that it will just put stuff into a different queue that feeds into the main workqueue as rate limiting allows. If my interpretation is correct then I don't understand why this rate limiting helps. It's just going to lead to an ever-increasing number of workflows that are waiting to be re-enqueued.
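
That reading does match how client-go wires it up; roughly, paraphrased rather than the verbatim upstream source:

package sketch

import "k8s.io/client-go/util/workqueue"

// A rate-limiting queue is a delaying queue plus a rate limiter.
type rateLimitingQueue struct {
	workqueue.DelayingInterface // the "different queue" that feeds the main workqueue
	rateLimiter workqueue.RateLimiter
}

// AddRateLimited asks the limiter how long this item must wait, then hands it
// to the delaying queue; nothing is dropped, only deferred.
func (q *rateLimitingQueue) AddRateLimited(item interface{}) {
	q.DelayingInterface.AddAfter(item, q.rateLimiter.When(item))
}
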
hallowed-mouse-14616
11/28/2023, 6:52 PM

calm-pilot-2010
11/28/2023, 6:56 PM
But will AddRateLimited actually drop workflows when the rate limit is hit? My interpretation of the docs is that it won't. That means the same number of workflow updates need to be evaluated regardless. With the rate limiting they will just be spread out over a longer period.
hallowed-mouse-14616
11/28/2023, 6:59 PM
> But will AddRateLimited actually drop workflows when the rate limit is hit?
It's been a long time since I've been down the rabbit hole of our queue system. I would have to look.
> That means the same number of workflow updates need to be evaluated regardless.
I don't think this is necessarily true. If a workflow is re-enqueued every 10ms with no rate limit, it will be evaluated every 10ms (as long as there are free workers). However, if it is rate-limited I believe only a single instance can be in the ether. So if the 2nd and 3rd re-enqueues are rate-limited, it will only be added to the queue once. Again, I would have to double-check, but I think this is how it worked the last time I checked.
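
The collapsing behaviour described above can also be seen in the base workqueue itself: identical keys added while an item is still pending deduplicate into a single entry. A toy example, with a made-up key:

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.New()
	defer q.ShutDown()

	// Simulate a workflow key being re-enqueued three times before any worker
	// picks it up; the queue deduplicates pending items.
	key := "flyteexamples-development/wf-1"
	q.Add(key)
	q.Add(key)
	q.Add(key)

	fmt.Println("pending:", q.Len()) // 1, not 3
}
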
calm-pilot-2010
11/28/2023, 7:02 PM

calm-pilot-2010
11/28/2023, 7:25 PM
We could increase workflow-reeval-duration and downstream-eval-duration to about 10 minutes. That way we reduce unnecessary re-queues and rely more on updates propagating through the informer queues.
The other rate limit I can think of is the k8s one. Given that flyte uses the watch API with informer queues for monitoring pod status, I think faster rounds would not make too much difference to the rate of API calls to k8s.
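
To make the watch-API point concrete, here is a hypothetical informer sketch: pod changes arrive as watch events and enqueue the owning workflow, so long re-evaluation intervals mostly act as a backstop rather than the primary update path. The label and the pod-to-workflow mapping below are invented, not flytepropeller's actual code:

package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// workflowKeyForPod is a stand-in for "map pod -> owning workflow key".
func workflowKeyForPod(pod *corev1.Pod) string {
	return pod.Namespace + "/" + pod.Labels["execution-id"]
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	queue := workqueue.New()

	// A shared informer keeps one watch stream open; every pod change lands
	// here as an event rather than requiring the controller to poll the API.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			if pod, ok := newObj.(*corev1.Pod); ok {
				queue.Add(workflowKeyForPod(pod))
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // worker goroutines draining `queue` omitted for brevity
}
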
flat-exabyte-79377
11/28/2023, 11:32 PM

calm-pilot-2010
11/28/2023, 11:34 PM

hallowed-mouse-14616
11/29/2023, 2:02 PM