# flyte-support
c
Hi Team. I observed a behaviour of the flyte-binary pod: when the system was under heavy load, the memory consumption of the pod kept increasing from 500 MB to 2 GB, after which it got OOM killed. I am not sure if this is due to internal caching / dataCatalog or if it loads all workflows into memory. If all workflows are kept in memory, is there a feature to delete the completed workflows?
f
@clever-exabyte-82294 you can control cache behavior - please configure accordingly
It will load things into memory to make things go faster
Why not give it 4 CPUs and 4+ GB RAM? The default is like a Raspberry Pi (@average-finland-92144 this has affected a lot of people; maybe we should make it higher, we currently have it at 0.1 core)
Also, the OOM killer does not only kill if memory goes higher; there are many other reasons
c
I can increase to 4 GB, but then what if that is also not sufficient? We are expecting around 10,000 pipeline triggers per day. Will look into cache control and whether it controls memory scaling.
a
@clever-exabyte-82294 are you setting cache.max_size_mbs or similar parameters? This page should give you some hints on where to look and what to tweak. For such a load, monitoring will be very important (check out the Grafana dashboards). I'm also proposing more generous defaults for flyte-binary (#5602), as Ketan said.
❤️ 2
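For context, a minimal sketch of where those cache knobs can live in the flyte-binary Helm values. The configuration.inline nesting is an assumption about the chart layout and should be checked against the chart version in use; the keys under storage.cache match the values shown later in this thread.
Copy code
# Hedged sketch: one possible place to set the storage cache limits for flyte-binary.
# `configuration.inline` is an assumed extension point of the Helm chart; verify it
# against the chart you actually deploy.
configuration:
  inline:
    storage:
      cache:
        max_size_mbs: 1024      # cap the in-memory blob cache at ~1 GiB
        target_gc_percent: 70   # Go GC target; lower values collect more aggressively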
c
Yes, Grafana dashboards are a great idea. Created a PodMonitor to scrape port 10254 for metrics. The cache values were:
Copy code
storage:
  cache:
    max_size_mbs: 10
    target_gc_percent: 100
Not sure about the maths that makes it a GB.
Something is getting cached or there is a memory leak. I have set the cache to 0. With the current settings, memory increases to 8 GB, then there is an OOM kill and it restarts from 500 MB. Gladly, no workload has failed.
image.png
@glamorous-rainbow-77959
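Regarding the PodMonitor mentioned above, here is a minimal sketch of what it could look like; the namespace, pod labels, scrape path, and interval are assumptions for illustration and need to match the actual flyte-binary Deployment.
Copy code
# Hedged sketch of a Prometheus Operator PodMonitor scraping flyte-binary metrics
# on port 10254. Namespace, labels, path, and interval are assumptions; adjust them
# to your deployment.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: flyte-binary-metrics
  namespace: flyte
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: flyte-binary   # assumed pod label
  podMetricsEndpoints:
    - targetPort: 10254                      # Flyte's Prometheus metrics port
      path: /metrics
      interval: 30s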
a
@clever-exabyte-82294 what version of flyte are you running?
c
v1.13.0 @average-finland-92144
f
There are a lot of caches all over the place in Flyte.
It is definitely getting cached.
Can you tell me why you want to restrict it to 0.1-0.5 CPU? You are running a prod service - what is the problem with running with more CPU / memory? Anything blocking you folks? Is this some restricted environment?
c
Copy code
deployment:
  resources:
    requests:
      cpu: "4"
      memory: "4G"
    limits:
      cpu: "8"
      memory: "10G"
current settings
https://github.com/flyteorg/flyte/issues/5606 - I see there is an issue with similar symptoms; however, I haven't been able to confirm if it is the same.
a
@clever-exabyte-82294 that is also possible. We could confirm if you downgrade to 1.11
h
@clever-exabyte-82294, please take a look at https://github.com/flyteorg/flyte/issues/3991#issuecomment-2317974771.
🙌 1
🙌🏽 1
👍 1
c
Nice work, man. I see that #3991 got reported a year ago. Quite a story.
🙇 1
f
great
c
An update on this problem: under our production workload, it is no longer crashing every few hours. However, it is still crashing every 2 days.
Memory here is around 8 GB.
@high-accountant-32689 I added an observation above; it might be useful.
a
@clever-exabyte-82294 do you have metrics on resource usage from the Pod? How many executions?
c
I can get that.
image.png
Let me know if any other metric is needed.
a
With the number of executions you expect (around 10k triggers a day), I'd say you should consider moving to flyte-core, as there are additional mechanisms available there to scale out.
c
OK, thanks. We have a temporary solution, and we will start building a POC of flyte-core.
h
@clever-exabyte-82294, Flyte 1.13.2 is out and it contains a fix for this memory issue you're seeing in flyte-binary.
g
@clever-exabyte-82294 could we plan an update on Monday?
c
Ohh, my bad. I thought the fix got released in 1.13.1.