# ask-the-community
j
Hi flyte folks, was there any significant performance change in Flyte console? We recently switched to the flyte 1.2 milestone version and its corresponding flyte backend. Unfortunately it now takes a lot of time to display the workflow tasks compared to before, and the browser log shows:
vendor-a9fbc36b.js:2          GET https://.../fizlxcwy-n4-0-dn1-0/n1?limit=10000 net::ERR_INSUFFICIENT_RESOURCES
It might have been showing previously as well but never seemed noticeable. It takes a solid minute or two to even show what tasks were executed. Previous versions were flyteconsole 1.1.6, propeller 1.1.15 and admin 1.1.29. Any idea what has changed?
j
Hey @Jay Ganbat - could you try pulling the latest from flyteconsole? https://github.com/flyteorg/flyteconsole/releases/tag/v1.4.0 We were in the process of a refactor on those views, and v1.4.0 includes the full refactor and should load faster than the release version. I should note the refactor does add a slight increase in the initial load of the node list, as we combined all the requests into one; the benefit/purpose of this refactor is that the other views (graph, timeline) are now much faster, as they no longer require separate requests.
j
hmm i see, yeah it's hard to test the latest version on our shared cluster, and the dev cluster doesn't have the above issue 😅 but i will try to test it out
I think it also might be flyteadmin, I am seeing these logs:
2022/11/07 17:36:05 /go/src/github.com/flyteorg/flyteadmin/pkg/repositories/gormimpl/task_execution_repo.go:121 SLOW SQL >= 200ms
[204.218ms] [rows:1] SELECT "task_executions"."id","task_executions"."created_at","task_executions"."updated_at","task_executions"."deleted_at","task_executions"."project","task_executions"."domain","task_executions"."name","task_executions"."version","task_executions"."execution_project","task_executions"."execution_domain","task_executions"."execution_name","task_executions"."node_id","task_executions"."retry_attempt","task_executions"."phase","task_executions"."phase_version","task_executions"."input_uri","task_executions"."closure","task_executions"."started_at","task_executions"."task_execution_created_at","task_executions"."task_execution_updated_at","task_executions"."duration" FROM "task_executions" LEFT JOIN tasks ON task_executions.project = tasks.project AND task_executions.domain = tasks.domain AND task_executions.name = tasks.name AND task_executions.version = tasks.version INNER JOIN node_executions ON task_executions.node_id = node_executions.node_id AND task_executions.execution_project = node_executions.execution_project AND task_executions.execution_domain = node_executions.execution_domain AND task_executions.execution_name = node_executions.execution_name INNER JOIN executions ON node_executions.execution_project = executions.execution_project AND node_executions.execution_domain = executions.execution_domain AND node_executions.execution_name = executions.execution_name WHERE executions.execution_project = 'balrog' AND executions.execution_domain = 'production' AND executions.execution_name = 'f54bcr4c4l3v5s' AND node_executions.node_id = 'n0' LIMIT 10000
maybe how the SQL query is generated changed in a recent version
could be related to this issue https://github.com/flyteorg/flyte/issues/2812 Hi @katrina do you know when the issue described there started happening in flyteadmin?
k
hey @Jay Ganbat I'm not sure, it's something I observed on our internal deployment - are you running into this now?
j
yeah we have started noticing a significant slowdown in flyteconsole when showing task executions. I looked through the flyteadmin logs and am seeing a bunch of these SLOW SQL >= 200ms entries. Previously we were on flyteadmin 1.1.29 and upgraded to 1.1.46
k
oof cc @Eduardo Apolinario (eapolinario) - I can try taking a look at this at some point but not sure if anyone on OSS has bandwidth
j
i see, do you recognize this version:
v1.1.29-hotfix@sha256:4de1d9e93cbb9d93a659b49bf51d7d9cc0ffe8fad8408d4cc16378930bb1de4b
we were using it before but couldn't figure out where it came from. its base was
name: cr.flyte.org/flyteorg/flyteadmin
which points to flyte's image registry
k
not sure, https://github.com/flyteorg/flyteadmin/releases/tag/v1.1.29 has to do with execution abort logic and https://github.com/flyteorg/flyteadmin/releases/tag/v1.1.30 has a fix for Dockerfile vulnerabilities - maybe it was a security patch but doesn't appear to be related to any sql changes
hey @Jay Ganbat can you share which queries exactly are slow on your side? is it just the list task executions one you pasted above?
j
yeah pretty much all of the list execution or SELECT queries. this one is for node execution:
2022/11/07 23:19:24 /go/src/github.com/flyteorg/flyteadmin/pkg/repositories/gormimpl/node_execution_repo.go:47 SLOW SQL >= 200ms
[587.680ms] [rows:1] SELECT * FROM "node_executions" WHERE "node_executions"."execution_project" = 'pineapple' AND "node_executions"."execution_domain" = 'development' AND "node_executions"."execution_name" = 'fpz5pyuqbaiuuo' AND "node_executions"."node_id" = 'n0-0-dn1-1-dn82' LIMIT 1
which is odd, this query looks relatively simple
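To see which queries are slow overall, the SLOW SQL entries can be tallied by the repository file that issued them. Below is a minimal Python sketch that assumes the two-line format seen in the logs pasted in this thread (a header line containing `SLOW SQL`, followed by a `[<ms>ms] [rows:N] <query>` line). `summarize_slow_sql` and the regular expressions are illustrative names, not part of flyteadmin.

```python
import re
from collections import defaultdict

# Header line: "<date> <time> <source file>:<line> SLOW SQL >= 200ms"
HEADER = re.compile(r"^\S+ \S+ (?P<src>\S+):(?P<line>\d+) SLOW SQL")
# Detail line: "[587.680ms] [rows:1] SELECT ..."
DETAIL = re.compile(r"^\[(?P<ms>[\d.]+)ms\] \[rows:(?P<rows>-?\d+)\] (?P<sql>.*)$")


def summarize_slow_sql(lines):
    """Group slow-query durations (in ms) by the source location that issued them."""
    durations = defaultdict(list)
    current = None  # source location from the most recent header line
    for raw in lines:
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            # Keep only the file name, not the full build path.
            src = header.group("src").rsplit("/", 1)[-1]
            current = f"{src}:{header.group('line')}"
            continue
        detail = DETAIL.match(line)
        if detail and current is not None:
            durations[current].append(float(detail.group("ms")))
            current = None
    return {
        loc: {"count": len(ms), "max_ms": max(ms), "avg_ms": sum(ms) / len(ms)}
        for loc, ms in durations.items()
    }
```

Passing it the lines of a saved flyteadmin log returns one entry per source location (for example `node_execution_repo.go:47`), which makes it easier to tell whether the slowness is spread across every SELECT or concentrated in one repository call.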
e
this is interesting. Is this database instance shared in any way? Can you double-check that it's not under some sort of resource crunch at the time you're running those queries?
j
hmm there could be something wrong with the backend database 🤔 i just reverted to our old version and started seeing the slowness 😬 i'll let you know if i figure it out
e
Ok, let us know what you find out.
j
does Datacatalog have the DB stuff in it? i'm not sure what DB flyteadmin uses in the backend
ok i think the issue is coming from flyteconsole. I downgraded flyteconsole to 1.1.6 and no longer have the issue. I think the newer flyteconsole tries to pull everything in one request, so if the workflow has a lot of dynamic tasks it slows down considerably. i observed this before when trying to fetch an execution using FlyteRemote with the sync operation.
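A rough illustration of the difference between one large `limit=10000` request and loading nodes page by page. This is only a sketch: `fetch_page`, its `(limit, token)` signature, and the in-memory `make_fake_fetcher` are hypothetical stand-ins, not flyteconsole or flyteadmin code, and it assumes the list endpoint supports token-based pagination.

```python
from typing import Callable, Iterator, List, Tuple

# Hypothetical page fetcher: takes (limit, token), returns (items, next_token).
# An empty next_token means there are no more pages.
PageFetcher = Callable[[int, str], Tuple[List[str], str]]


def iter_node_executions(fetch_page: PageFetcher, page_size: int = 100) -> Iterator[str]:
    """Yield items one page at a time instead of requesting everything up front."""
    token = ""
    while True:
        items, token = fetch_page(page_size, token)
        yield from items
        if not token:
            break


def make_fake_fetcher(total: int):
    """In-memory stand-in for a paginated list endpoint, for illustration only."""
    calls = []  # records every (limit, token) request that was made

    def fetch_page(limit: int, token: str):
        start = int(token) if token else 0
        end = min(start + limit, total)
        calls.append((limit, token))
        items = [f"n{i}" for i in range(start, end)]
        next_token = str(end) if end < total else ""
        return items, next_token

    return fetch_page, calls
```

With 250 nodes and a page size of 100, reading the first few items triggers a single fetch; the remaining pages are only requested if the consumer keeps iterating, which is the behaviour a workflow with many dynamic tasks would benefit from.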
e
Cool, thanks for your patience, @Jay Ganbat. Let me talk to the team about this.
j
thanks, i think it started with v1.2.6; i tested v1.2.5 and that doesn't have the problem. i think the PR below might have introduced the issue, going by its description: https://github.com/flyteorg/flyteconsole/commit/9b10e5f58363e3ece29748677c65bfab7f7812ed
e
thank you, @Jay Ganbat. First of all, sorry for the bad experience. The team had a meeting this morning and we'll prioritize the lazy loading of the nodes going forward. The work to enable this is slated to happen early next week, I'll ping here once that lands.
j
whoohoo 🎉 thank you for the quick turnaround. btw was my guess correct? 😅 in the meantime we could use 1.2.5
e
yes, you're absolutely on point.
thank you for understanding. We'll get this fixed as soon as possible.
j
awesome, thank you very much 🙏
btw would you recommend we use 1.2.5, or should we revert back to 1.1.6?
e
Your pick. Looking at https://github.com/flyteorg/flyteconsole/compare/v1.1.6...v1.2.5 you will probably not see a huge difference. So how about you give 1.2.5 a spin?
j
cool, will do