# flyte-support
c
Hi Team. I observed a behaviour of the flyte-binary pod: when the system was under heavy load, the memory consumption of the pod kept increasing from 500 MB to 2 GB, after which it got OOM killed. I am not sure if this is due to internal caching / dataCatalog or if it loads all workflows into memory. If all workflows are kept in memory, is there a feature to delete the completed workflows?
f
@clever-exabyte-82294 you can control cache behavior - please configure accordingly
It will load things into memory to make things go faster
Why not give it 4 CPUs and 4+ GB RAM? The default is like a Raspberry Pi (@average-finland-92144 this has affected a lot of people; maybe we should make it higher, we currently have it at 0.1 core)
Also, the OOM killer does not only kill if memory goes higher; there are many other reasons
c
I can increase to 4 GB, but then what if that is also not sufficient? We are expecting around 10,000 pipeline triggers per day. Will look into cache control and whether it controls memory scaling.
a
@clever-exabyte-82294 are you setting cache.max_size_mbs or similar parameters? This page should give you some hints on where to look and what to tweak. For such a load, monitoring will be very important (check out the Grafana dashboards). I'm also proposing more generous defaults for flyte-binary (#5602), as Ketan said.
❤️ 2
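For context, a minimal sketch of where those cache knobs can live in the flyte-binary Helm values. The configuration.inline nesting is an assumption about the chart layout and should be checked against the chart version in use; the keys under storage.cache match the values shown later in this thread.
Copy code
# Hedged sketch: one possible place to set the storage cache limits for flyte-binary.
# `configuration.inline` is an assumed extension point of the Helm chart; verify it
# against the chart you actually deploy.
configuration:
  inline:
    storage:
      cache:
        max_size_mbs: 1024      # cap the in-memory blob cache at ~1 GiB
        target_gc_percent: 70   # Go GC target; lower values collect more aggressively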
c
Yes, Grafana dashboards are a great idea. Created a PodMonitor to scrape port 10254 for metrics. The cache values were:
Copy code
storage:
  cache:
    max_size_mbs: 10
    target_gc_percent: 100
Not sure about the maths that makes it a GB.
Something is getting cached or there is a memory leak. I have set the cache to 0. With the current settings, memory increases to 8 GB, then there is an OOM kill and it restarts from 500 MB. Gladly, no workload has failed.
image.png
@glamorous-rainbow-77959
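Regarding the PodMonitor mentioned above, here is a minimal sketch of what it could look like; the namespace, pod labels, scrape path, and interval are assumptions for illustration and need to match the actual flyte-binary Deployment.
Copy code
# Hedged sketch of a Prometheus Operator PodMonitor scraping flyte-binary metrics
# on port 10254. Namespace, labels, path, and interval are assumptions; adjust them
# to your deployment.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: flyte-binary-metrics
  namespace: flyte
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: flyte-binary   # assumed pod label
  podMetricsEndpoints:
    - targetPort: 10254                      # Flyte's Prometheus metrics port
      path: /metrics
      interval: 30s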
a
@clever-exabyte-82294 what version of flyte are you running?
c
v1.13.0 @average-finland-92144
f
There are a lot of caches all over the place in Flyte.
It is definitely getting cached.
Can you tell me why you want to restrict it to 0.1-0.5 CPU? You are running a prod service - what is the problem with running with more CPU / memory? Anything blocking you folks? Is this some restricted environment?
c
Copy code
deployment:
  resources:
    requests:
      cpu: "4"
      memory: "4G"
    limits:
      cpu: "8"
      memory: "10G"
current settings
https://github.com/flyteorg/flyte/issues/5606 - I see there is an issue with similar symptoms; however, I haven't been able to confirm if it is the same.
a
@clever-exabyte-82294 that is also possible. We could confirm if you downgrade to 1.11
h
@clever-exabyte-82294, please take a look at https://github.com/flyteorg/flyte/issues/3991#issuecomment-2317974771.
🙌 1
🙌🏽 1
👍 1
c
Nice work, man. I see that #3991 got reported a year ago. Quite a story.
🙇 1
f
great
c
An update on this problem: under our production workload, it is no longer crashing every few hours. However, it is still crashing every 2 days.
Memory here is around 8 GB.
@high-accountant-32689 I added an observation above; it might be useful.
a
@clever-exabyte-82294 do you have metrics on resource usage from the Pod? How many executions?
c
I can get that.
image.png
Let me know if any other metric is needed.
a
With the number of executions you expect (around 10k triggers a day), I'd say you should consider moving to flyte-core, as there are additional mechanisms available there to scale out.
c
OK, thanks. We have a temporary solution, and we will start building a POC of flyte-core.
h
@clever-exabyte-82294, Flyte 1.13.2 is out and it contains a fix for this memory issue you're seeing in flyte-binary.
g
@clever-exabyte-82294 could we plan an update on Monday?
c
Ohh, my bad. I thought the fix got released in 1.13.1.