# ask-the-community
Hi, we’re running a bunch of large jobs, each with thousands of tasks, and noticed our Postgres database was at 100% CPU for many hours. The DB only has 4 vCPUs and 15 GB of memory at the moment. We also saw many of these in the datacatalog log:
"textPayload": "2023/03/07 15:38:00 \u001b[32m/go/src/github.com/flyteorg/datacatalog/pkg/repositories/gormimpl/artifact.go:64 \u001b[33mSLOW SQL >= 200ms",
Do you have a recommended size for postgres for large production workloads? I’m also going through this doc to make sure we’re following best practices.
Also seeing many of
2023-03-07 16:14:58.742 UTC [1644473]: [4-1] db=datacatalog,user=flyteadmin ERROR:  duplicate key value violates unique constraint "tags_pkey"
2023-03-07 16:14:58.874 UTC [1644473]: [7-1] db=datacatalog,user=flyteadmin ERROR:  duplicate key value violates unique constraint "datasets_pkey"
in our Postgres logs. Any idea why those are showing up?
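To see which constraints are being hit most often, the Postgres log lines above could be tallied by constraint name. This is only a minimal sketch, assuming the log format shown in the two sample lines; the function name `count_duplicate_key_errors` is hypothetical and not part of Flyte or Postgres.

```python
import re
from collections import Counter

# Captures the constraint name from a Postgres "duplicate key" error line.
DUP_KEY = re.compile(r'duplicate key value violates unique constraint "([^"]+)"')


def count_duplicate_key_errors(lines):
    """Return a Counter mapping constraint name -> number of violations seen."""
    counts = Counter()
    for line in lines:
        match = DUP_KEY.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The two log lines quoted in this thread, used as sample input.
sample = [
    '2023-03-07 16:14:58.742 UTC [1644473]: [4-1] db=datacatalog,user=flyteadmin '
    'ERROR:  duplicate key value violates unique constraint "tags_pkey"',
    '2023-03-07 16:14:58.874 UTC [1644473]: [7-1] db=datacatalog,user=flyteadmin '
    'ERROR:  duplicate key value violates unique constraint "datasets_pkey"',
]

for constraint, n in count_duplicate_key_errors(sample).most_common():
    print(constraint, n)
```

Running this over a full log file (for example, passing an open file handle as `lines`) would show whether the violations are concentrated on one table or spread across several.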
@Nicholas LoFaso is this in datacatalog? We have not seen these issues
ya for larger workloads the DB can run hot - but 100% CPU for datacatalog seems odd
can you share more
happy to hop on a call to help
Yes that was a datacatalog error. The 100% CPU was for our managed postgres instance not the data manager itself
I’m on the east coast available for a call any time tomorrow
are you using same postgres for both datacatalog and admin?
yes same for both
We significantly increased the size of Postgres and that resolved the current bottleneck. I will be digging into our Flyte performance over the next couple of weeks, so I will likely have additional questions, but for now I plan to set up Prometheus to gather the data plane metrics
but it still seems odd that postgres was hammered
we would love to understand your workflow pattern
We would love to share and improve our pattern / configuration. Maybe we can set up a call late next week after I’ve had a chance to gather some metrics?