# ask-the-community
n
Hi, we’re running a bunch of large jobs, each with thousands of tasks, and noticed our Postgres database was at 100% CPU for many hours. The DB only has 4 vCPU and 15 GB of memory at the moment. We also saw many of these in the datacatalog log:
"textPayload": "2023/03/07 15:38:00 \u001b[32m/go/src/github.com/flyteorg/datacatalog/pkg/repositories/gormimpl/artifact.go:64 \u001b[33mSLOW SQL >= 200ms",
Do you have a recommended size for postgres for large production workloads? I’m also going through this doc to make sure we’re following best practices.
Also seeing many of
2023-03-07 16:14:58.742 UTC [1644473]: [4-1] db=datacatalog,user=flyteadmin ERROR:  duplicate key value violates unique constraint "tags_pkey"
And
2023-03-07 16:14:58.874 UTC [1644473]: [7-1] db=datacatalog,user=flyteadmin ERROR:  duplicate key value violates unique constraint "datasets_pkey"
In our postgres logs. Any idea why those are showing up?
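To see which constraint is hit most often, the Postgres log lines can be tallied by constraint name. This is a minimal sketch, assuming log lines in the format shown above; the function name `count_duplicate_key_errors` and the `sample` list are hypothetical, and it only counts errors without explaining their cause.

```python
import re
from collections import Counter

# Matches the constraint name in a Postgres unique-violation log line.
DUPLICATE_KEY = re.compile(
    r'duplicate key value violates unique constraint "([^"]+)"'
)


def count_duplicate_key_errors(lines):
    """Return a Counter mapping constraint name -> number of violations."""
    counts = Counter()
    for line in lines:
        match = DUPLICATE_KEY.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample: the two log lines quoted in this thread.
sample = [
    '2023-03-07 16:14:58.742 UTC [1644473]: [4-1] db=datacatalog,user=flyteadmin '
    'ERROR:  duplicate key value violates unique constraint "tags_pkey"',
    '2023-03-07 16:14:58.874 UTC [1644473]: [7-1] db=datacatalog,user=flyteadmin '
    'ERROR:  duplicate key value violates unique constraint "datasets_pkey"',
]

for constraint, n in count_duplicate_key_errors(sample).most_common():
    print(constraint, n)
```

Feeding it a real log file (for example `count_duplicate_key_errors(open(path))`, with `path` pointing at your own log) would show whether the errors cluster on one table.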
k
@Nicholas LoFaso is this in datacatalog? We have not seen these issues.
Ya, for larger workloads the DB can run hot - but 100% CPU for datacatalog seems odd
can you share more
happy to hop on a call to help
n
Yes that was a datacatalog error. The 100% CPU was for our managed postgres instance not the data manager itself
I’m on the east coast available for a call any time tomorrow
k
are you using same postgres for both datacatalog and admin?
n
yes same for both
We significantly increased the size of Postgres and that resolved the current bottleneck. I will be digging into our Flyte performance over the next couple of weeks, so I will likely have additional questions, but for now I plan to set up Prometheus to gather the data plane metrics
k
yup
but it still seems odd that postgres was hammered
we would love to understand your workflow pattern
n
We would love to share and improve our pattern / configuration. Maybe we can set up a call late next week after I’ve had a chance to gather some metrics?