# ask-the-community
b
We are using flyte helm chart `v1.10.0` and are seeing flyteadmin CPU hover around 99% consistently. We've increased the CPU multiple times, each time resulting in the same behavior. Currently we have the CPU set to `3`, but what's strange is that before (`v1.9.0`) we were running on less than `500m`. Does anyone have any thoughts as to what could be causing this? Should we look into anything in particular to performance tune? Is there a recommended resources config someone could point me to?
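(If it helps, resources for the admin can be pinned through the chart values; below is a minimal sketch assuming the flyte-core chart and its `flyteadmin.resources` values path. The release name, namespace, and exact keys are assumptions, so check your chart's values.yaml.)
```sh
# Sketch: set explicit CPU/memory requests and limits for flyteadmin.
# Release name, namespace, and the flyteadmin.resources path are
# assumptions based on the flyte-core chart layout.
helm upgrade flyte flyteorg/flyte-core -n flyte --reuse-values \
  --set flyteadmin.resources.requests.cpu=1 \
  --set flyteadmin.resources.requests.memory=512Mi \
  --set flyteadmin.resources.limits.cpu=3 \
  --set flyteadmin.resources.limits.memory=1Gi
```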
Another thing of note is that we can autoscale the admin out to > 1 replica, but only one is ever doing any work. That can be seen via `kubectl top`:
```
NAME                                      CPU(cores)   MEMORY(bytes)
flyteadmin-c75645575-qprvl                3229m        421Mi
flyteadmin-c75645575-xcq8b                2m           67Mi
```
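(One way to keep sampling that imbalance over time; the namespace and label selector are assumptions that depend on how your chart labels the admin pods.)
```sh
# Sketch: re-sample per-pod usage every 30s to confirm only one
# replica stays hot. Namespace and label selector are assumptions.
watch -n 30 kubectl -n flyte top pod -l app.kubernetes.io/name=flyteadmin
```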
k
That is odd; there has to be something else. Cc @Eduardo Apolinario (eapolinario)
b
I’ll get the pprof output tomorrow
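(For anyone following along, a rough sketch of collecting that profile; `10254` is flyteadmin's default profiler port as far as I know, but treat the port, namespace, and pod name as assumptions to adjust for your deployment.)
```sh
# Sketch: port-forward the admin pod's profiler port, then take a
# 30s CPU profile and open pprof's web UI (includes a flame graph).
# Port 10254, namespace, and pod name are assumptions.
kubectl -n flyte port-forward pod/flyteadmin-c75645575-qprvl 10254:10254 &
go tool pprof -http=:8080 "http://localhost:10254/debug/pprof/profile?seconds=30"
```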
k
It should barely use CPU
e
@Blake Jackson, this is not expected. Can you also collect the logs for the instance of flyteadmin that's doing work? The fact that only one instance is doing work might indicate it's busy running a database migration, which in itself is a bit odd, since we haven't shipped any migrations in the latest version.
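(A sketch of pulling those logs from the busy pod; the namespace is an assumption and the grep patterns are only illustrative.)
```sh
# Sketch: pull recent logs from the hot instance and scan for
# migration activity or slow-query warnings (patterns illustrative).
kubectl -n flyte logs flyteadmin-c75645575-qprvl --since=2h \
  | grep -Ei 'migrat|SLOW SQL'
```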
b
I shouldn't have worded it as the other instance not doing ANY work; it actually seemed to be running as expected, and we still saw activity in its logs. The difference is that the slow instance continuously showed the `SLOW SQL >= 200ms` warning, and its logs were much more populated because of that. During this same period the DB performed nominally, showing hardly any load, and the top waits were minimal, mostly on `SELECT * FROM "tags" WHERE ("tags"."artifact_id","tags"."dataset_uuid") IN (($1,$2))`
The strangest thing is that we recreated the pods multiple times, and every time the new pod still used 100% CPU. It gets stranger: as I was profiling, all the CPU usage finally dropped (see image). I do have one CPU profile from before and one from this morning that I can share, but unfortunately nothing stands out to me. I'm attaching the flame graphs as images as well.
k
Cc @Yee
y
would you happen to have the queries that consistently showed >200ms?
b
Yea, let me grab a couple
Is it enough to just give you `UPDATE "executions" ...`, or do you need to see the entire SQL statement? If the former, I can grab a few more. If the latter, I need to confirm there's nothing in there I can't share
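(Side note: if sharing literal values is the concern, the normalized statements in `pg_stat_statements` are already parameterized; a sketch, assuming that extension is enabled and `$FLYTE_DB_URL` is a connection string for the admin database. `mean_exec_time` is the PG13+ column name; older versions call it `mean_time`.)
```sh
# Sketch: top normalized statements by mean execution time (no literal
# values appear). Assumes pg_stat_statements is enabled; FLYTE_DB_URL
# is a placeholder connection string.
psql "$FLYTE_DB_URL" <<'SQL'
SELECT mean_exec_time, calls, left(query, 120) AS query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
SQL
```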
```
SELECT * FROM "task_executions" WHERE "task_executions"."project" = ...

UPDATE "node_executions" SET ...

SELECT * FROM "executions" WHERE "executions"."execution_project"...
```
It really doesn't look like any one query stands out; it seems more like a side effect of the app being slow
I ran the query planner on these as well, and nothing unexpected came back
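(For completeness, the kind of planner check described; the IN-list values are placeholders and `$FLYTE_DB_URL` is an assumed connection string.)
```sh
# Sketch: run one of the slow queries through the planner with timing.
# The IN-list values are placeholders; FLYTE_DB_URL is an assumption.
psql "$FLYTE_DB_URL" <<'SQL'
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM "tags"
WHERE ("tags"."artifact_id", "tags"."dataset_uuid") IN (('artifact-1', 'uuid-1'));
SQL
```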
Doing more digging, we do see that the high CPU coincided with higher call rates to `{grpc_method="GetExecutionData", grpc_service="flyteidl.service.AdminService"}` and `{grpc_method="GetExecution", grpc_service="flyteidl.service.AdminService"}`
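(A sketch of pulling that correlation from Prometheus; `grpc_server_handled_total` is the standard go-grpc-prometheus metric name, which is an assumption about this deployment's instrumentation, and `$PROM_URL` is a placeholder.)
```sh
# Sketch: per-method call rate for the two suspect AdminService RPCs.
# Metric name assumes go-grpc-prometheus defaults; PROM_URL is a
# placeholder for the Prometheus base URL.
curl -sG "$PROM_URL/api/v1/query" --data-urlencode \
  'query=sum by (grpc_method) (rate(grpc_server_handled_total{
      grpc_service="flyteidl.service.AdminService",
      grpc_method=~"GetExecution|GetExecutionData"}[5m]))'
```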