Andrew Epstein
12/05/2023, 5:37 PMSELECT * FROM "tags" WHERE ("tags"."artifact_id","tags"."dataset_uuid") IN (($1,$2))
is getting executed frequently (380,000 times in the last day) against the datacatalog
database. On average, it takes 2 seconds to complete and returns 0 rows. I believe this query is executed as part of the task cache, which I do have enabled (but I expect all cache misses). I am using v1.0.1 of Datacatalog, and v1.1.47 of FlytePropeller. This is a database with 32 vCPU and 64GB of memory, 200GB storage with ~50GB of that used. Please let me know if there are any other details I can provide that would be helpful. cc: @Anna CunninghamSamhita Alla
Ketan (kumare3)
Dan Rammer (hamersaw)
12/06/2023, 6:40 PMAndrew Epstein
12/06/2023, 7:01 PMAndrew Epstein
12/06/2023, 7:05 PMCREATE INDEX tags_dataset_uuid_artifact_id_idx ON tags (dataset_uuid, artifact_id);
), the query gets significantly faster (1000x in my local testing). The postgres planner seems to vary the way it actually executes the query. These screenshots are from a different DB instance under slightly lesser load, but I'm adding them to show the 3 different ways I've seen the planner choose to execute the query.Dan Rammer (hamersaw)
12/07/2023, 3:50 PMv2
of the datacatalog API, that simplifies this to a single RPC and DB queries which should significantly speed up cache operations. The hope is that we can ship this using a fallback mechanism to make the transition transparent to users - just all of the sudden caches are fast 😆.Dan Rammer (hamersaw)
12/07/2023, 3:52 PMAndrew Epstein
12/07/2023, 5:48 PMDan Rammer (hamersaw)
12/07/2023, 11:42 PMAndrew Epstein
12/12/2023, 5:52 PM