# ask-the-community
b
We are using flyte helm chart `v1.10.0` and are seeing flyteadmin CPU hover around 99% consistently. We've increased the CPU multiple times, each time resulting in the same behavior. Currently we have the CPU set to `3`, but what's strange is that before (`v1.9.0`) we were running on less than `500m`. Does anyone have any thoughts as to what could be causing this? Should we look into anything in particular to performance tune? Is there a recommended resources config someone could point me to?
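(If it helps, resources for the admin can be pinned through the chart values; below is a minimal sketch assuming the flyte-core chart and its `flyteadmin.resources` values path. The release name, namespace, and exact keys are assumptions, so check your chart's values.yaml.)
```sh
# Sketch: set explicit CPU/memory requests and limits for flyteadmin.
# Release name, namespace, and the flyteadmin.resources path are
# assumptions based on the flyte-core chart layout.
helm upgrade flyte flyteorg/flyte-core -n flyte --reuse-values \
  --set flyteadmin.resources.requests.cpu=1 \
  --set flyteadmin.resources.requests.memory=512Mi \
  --set flyteadmin.resources.limits.cpu=3 \
  --set flyteadmin.resources.limits.memory=1Gi
```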
Another thing of note is that we can autoscale the admin out to > 1 replica, but only one is ever doing any work. That can be seen via `kubectl top`:
```
NAME                                      CPU(cores)   MEMORY(bytes)
flyteadmin-c75645575-qprvl                3229m        421Mi
flyteadmin-c75645575-xcq8b                2m           67Mi
```
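(One way to keep sampling that imbalance over time; the namespace and label selector are assumptions that depend on how your chart labels the admin pods.)
```sh
# Sketch: re-sample per-pod usage every 30s to confirm only one
# replica stays hot. Namespace and label selector are assumptions.
watch -n 30 kubectl -n flyte top pod -l app.kubernetes.io/name=flyteadmin
```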
k
That is odd; there has to be something else. Cc @Eduardo Apolinario (eapolinario)
b
I’ll get the pprof output tomorrow
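(For anyone following along, a rough sketch of collecting that profile; `10254` is flyteadmin's default profiler port as far as I know, but treat the port, namespace, and pod name as assumptions to adjust for your deployment.)
```sh
# Sketch: port-forward the admin pod's profiler port, then take a
# 30s CPU profile and open pprof's web UI (includes a flame graph).
# Port 10254, namespace, and pod name are assumptions.
kubectl -n flyte port-forward pod/flyteadmin-c75645575-qprvl 10254:10254 &
go tool pprof -http=:8080 "http://localhost:10254/debug/pprof/profile?seconds=30"
```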
k
It should barely use CPU
e
@Blake Jackson, this is not expected. Can you also collect the logs for the instance of flyteadmin that's doing work? The fact that only one instance is doing work might indicate it's busy running a database migration, which in itself is a bit odd, since we haven't shipped any migrations in the latest version.
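(A sketch of pulling those logs from the busy pod; the namespace is an assumption and the grep patterns are only illustrative.)
```sh
# Sketch: pull recent logs from the hot instance and scan for
# migration activity or slow-query warnings (patterns illustrative).
kubectl -n flyte logs flyteadmin-c75645575-qprvl --since=2h \
  | grep -Ei 'migrat|SLOW SQL'
```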
b
I shouldn't have worded it as the other instance not doing ANY work; it actually seemed to be running as expected, and we still saw activity in its logs. The difference is that the slow instance continuously showed the `SLOW SQL >= 200ms` warning, and its logs were much more populated because of that. During this same period the DB performed nominally, showing hardly any load, and the top waits were minimal, mostly on `SELECT * FROM "tags" WHERE ("tags"."artifact_id","tags"."dataset_uuid") IN (($1,$2))`
The strangest thing is that we recreated the pods multiple times, and every time the new pod still used 100% CPU. It gets stranger: as I was profiling, all the CPU usage finally dropped (see image). I do have one CPU profile from before and one from this morning that I can share, but unfortunately nothing stands out to me. I'm attaching the flame graphs as images as well.
k
Cc @Yee
y
would you happen to have the queries that consistently showed >200ms?
b
Yea, let me grab a couple
Is it enough to just give you `UPDATE "executions" ...`, or do you need to see the entire SQL statement? If the former, I can grab a few more. If the latter, I need to confirm there's nothing in there I can't share
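(Side note: if sharing literal values is the concern, the normalized statements in `pg_stat_statements` are already parameterized; a sketch, assuming that extension is enabled and `$FLYTE_DB_URL` is a connection string for the admin database. `mean_exec_time` is the PG13+ column name; older versions call it `mean_time`.)
```sh
# Sketch: top normalized statements by mean execution time (no literal
# values appear). Assumes pg_stat_statements is enabled; FLYTE_DB_URL
# is a placeholder connection string.
psql "$FLYTE_DB_URL" <<'SQL'
SELECT mean_exec_time, calls, left(query, 120) AS query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
SQL
```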
```
SELECT * FROM "task_executions" WHERE "task_executions"."project" = ...

UPDATE "node_executions" SET ...

SELECT * FROM "executions" WHERE "executions"."execution_project"...
```
It really doesn't look like any one query stands out; it seems more like a side effect of the app being slow
I ran the query planner on these as well, and nothing unexpected came back
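(For completeness, the kind of planner check described; the IN-list values are placeholders and `$FLYTE_DB_URL` is an assumed connection string.)
```sh
# Sketch: run one of the slow queries through the planner with timing.
# The IN-list values are placeholders; FLYTE_DB_URL is an assumption.
psql "$FLYTE_DB_URL" <<'SQL'
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM "tags"
WHERE ("tags"."artifact_id", "tags"."dataset_uuid") IN (('artifact-1', 'uuid-1'));
SQL
```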
Doing more digging, we do see that the high CPU coincided with higher call rates to `{grpc_method="GetExecutionData", grpc_service="flyteidl.service.AdminService"}` and `{grpc_method="GetExecution", grpc_service="flyteidl.service.AdminService"}`
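(A sketch of pulling that correlation from Prometheus; `grpc_server_handled_total` is the standard go-grpc-prometheus metric name, which is an assumption about this deployment's instrumentation, and `$PROM_URL` is a placeholder.)
```sh
# Sketch: per-method call rate for the two suspect AdminService RPCs.
# Metric name assumes go-grpc-prometheus defaults; PROM_URL is a
# placeholder for the Prometheus base URL.
curl -sG "$PROM_URL/api/v1/query" --data-urlencode \
  'query=sum by (grpc_method) (rate(grpc_server_handled_total{
      grpc_service="flyteidl.service.AdminService",
      grpc_method=~"GetExecution|GetExecutionData"}[5m]))'
```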