# ask-the-community
I am seeing a performance bottleneck with the Flyte database. With my workload, the query
SELECT * FROM "tags" WHERE ("tags"."artifact_id","tags"."dataset_uuid") IN (($1,$2))
is getting executed frequently (380,000 times in the last day) against the
database. On average, it takes 2 seconds to complete and returns 0 rows. I believe this query is executed as part of the task cache, which I do have enabled (but I expect all cache misses). I am using v1.0.1 of Datacatalog and v1.1.47 of FlytePropeller. The database has 32 vCPUs, 64 GB of memory, and 200 GB of storage (~50 GB used). Please let me know if there are any other details I can provide that would be helpful. cc: @Anna Cunningham
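To put the reported numbers in perspective, a quick back-of-envelope (using only the figures above: 380,000 queries/day at ~2 s each):

```python
# Back-of-envelope: how much aggregate DB time this one query consumes per day.
queries_per_day = 380_000
avg_seconds = 2.0

total_db_seconds = queries_per_day * avg_seconds  # 760,000 s of DB time per day
# Spread over a 86,400-second day, that is roughly how many connections
# would need to be busy running this query continuously:
busy_connections = total_db_seconds / 86_400

print(round(busy_connections, 1))  # ~8.8 connections saturated all day
```

In other words, this single query pattern alone keeps the equivalent of nearly nine connections busy around the clock, which is why it shows up as a bottleneck.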
cc @Dan Rammer (hamersaw)
Ideally I want to simplify datacatalog; I think we can make it a key-value lookup. The problem is we will have a period of cache misses.
@Andrew Epstein you are correct, this query is used to determine whether a task execution has been cached. Can you elaborate a little on your workload? If you have executed 380k cached tasks, this is expected. The 2s runtime sounds quite slow, though; I just want to better understand whether this is a bug or a performance issue.
Imo it's not a bug per se, just a performance issue. The workload I'm executing does have ~380k tasks, so the number of queries makes sense.
If I add an index (
CREATE INDEX tags_dataset_uuid_artifact_id_idx ON tags (dataset_uuid, artifact_id);
), the query gets significantly faster (1000x in my local testing). The Postgres planner seems to vary how it actually executes the query. These screenshots are from a different DB instance under slightly lower load, but I'm adding them to show the 3 different ways I've seen the planner choose to execute the query.
@Andrew Epstein this is very interesting, and thanks for diving in here! To elaborate on what Ketan hinted at ^^: the datacatalog microservice was originally engineered to support much more robust data-tracing use-cases, hence the differentiation of dataset, artifact, and tag. When used for a simple cache lookup, this means there are multiple RPCs that each require DB queries, so overall the system is much slower than it needs to be. In Q1 2024 we're planning on investing in a simplification of the datacatalog API that reduces this to a single RPC and DB query, which should significantly speed up cache operations. The hope is that we can ship this using a fallback mechanism to make the transition transparent to users - just all of a sudden caches are fast 😆.
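To make the "key-value lookup" idea concrete, here is a hypothetical sketch (not the actual datacatalog API; all names are invented for illustration). It shows the dataset/artifact/tag round-trips collapsed into a single composite-key get, which is what a single-RPC cache lookup amounts to:

```python
from typing import Any, Optional, Tuple


class TaskCache:
    """In-memory stand-in for a hypothetical single-RPC cache service."""

    def __init__(self) -> None:
        self._store: dict = {}

    @staticmethod
    def _key(task_id: str, cache_version: str, input_hash: str) -> Tuple[str, str, str]:
        # Everything that today identifies a dataset + tag, folded into one key.
        return (task_id, cache_version, input_hash)

    def put(self, task_id: str, cache_version: str, input_hash: str, outputs: Any) -> None:
        self._store[self._key(task_id, cache_version, input_hash)] = outputs

    def get(self, task_id: str, cache_version: str, input_hash: str) -> Optional[Any]:
        # One lookup instead of several RPC round-trips, each with its own DB query.
        return self._store.get(self._key(task_id, cache_version, input_hash))


cache = TaskCache()
cache.put("my_task", "v1", "abc123", {"output_uri": "s3://bucket/outputs"})
print(cache.get("my_task", "v1", "abc123"))  # cache hit
print(cache.get("my_task", "v1", "zzz999"))  # cache miss -> None
```

A real service would of course back this with a table keyed on the composite key, but the point is that one keyed read replaces the current multi-step resolution.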
That's all a very long way of saying "this sounds very reasonable, and I hope it won't be an issue in the near future" 😄. That being said, would you mind creating a GitHub issue for this to gather a little bit of input? I think it makes sense, but DB schema updates (yes, even indexes 😅) can be contentious depending on table size, etc.
For the short term, can you think of any settings or configuration changes we could make that might help here? Or are we kind of just stuck with it? (But yes, I can open a GitHub issue.)
Unfortunately, I think the new DB index may be your best mitigation right now. As I mentioned, a more performant implementation is certainly on the roadmap for the coming months.