# flyte-deployment
s
Hello, I'm trying to get a better understanding of the Flyte propeller. (I've read a bit into this doc already, but not in full detail yet.) I would like to better understand the following case: we have Flyte 1.9 (yes, updating soon) deployed in AWS EKS with the flyte-core Helm charts, with the control plane and data plane in separate clusters. On this particular data plane, a simple workflow is executed every 15 minutes. What we've noticed is that the propeller container on a fresh cluster grows in memory usage over the first 24 hours, and in the long run it starts getting OOM-killed and restarted. Sometimes it's also not able to recover on its own and immediately runs into OOM again after each restart. I'm aware that I could just increase the memory limits to something like 500MB, but I feel like the growth would just continue, as I'm missing another crucial part here. Any hints on which feature of the propeller might cause this, that I should look into more? (In the second image, the dashed line shows the restarts of the container, with its axis on the right.)
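To check whether raising the limit only postpones the next OOM, one option is to fit a straight line to exported memory samples and extrapolate. This is only a minimal sketch: the function name `hours_until_limit` and the sample numbers are hypothetical, and it assumes roughly linear growth, which may not match how propeller actually behaves.

```python
def hours_until_limit(samples, limit_mib):
    """Least-squares line through (hours, MiB) samples.

    Returns (slope in MiB per hour, hours left after the last sample until
    the fitted line reaches limit_mib). The second value is None when the
    fitted memory usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / var_t
    intercept = mean_m - slope * mean_t
    if slope <= 0:
        return slope, None
    last_t = max(t for t, _ in samples)
    hit_t = (limit_mib - intercept) / slope
    return slope, max(0.0, hit_t - last_t)


# Hypothetical samples: (hours since pod start, memory in MiB)
samples = [(0, 100), (6, 130), (12, 160), (18, 190)]
slope, remaining = hours_until_limit(samples, limit_mib=250)
print(f"growth: {slope:.1f} MiB/h, limit reached in about {remaining:.1f} h")
```

If the slope stays about the same after raising the limit, the extra memory is only buying time and the cause of the growth is still open.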
a
@some-grass-84903 what type of workflow is running on that data plane? (I mean, map tasks, dynamic, etc) Could you get logs from the propeller pod?
s
Workflow with plain tasks only, resulting in 21 nodes. Highest resource limits are 4 CPUs and 4 Gi for one task. No cache. For the logs I need a bit of time to filter out some internal stuff 😅
With all the internal stuff filtered out, this one looks pretty boring tbh. Let me check if I can get more insights that I can share, also around the container.
Those should be all the logs from one container, grouped into patterns.
Describe output of the propeller pod; the main hint is the Last State bit with OOMKilled, which is what I'm starting from.
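The same Last State information is also in the pod's JSON, e.g. from `kubectl get pod <pod> -n <namespace> -o json`, under `status.containerStatuses[].lastState.terminated.reason`. A minimal sketch for pulling it out programmatically; the helper name `oom_killed_containers` and the sample pod data below are made up for illustration.

```python
import json


def oom_killed_containers(pod):
    """Return (container name, restart count) for every container in the
    pod whose last termination reason was OOMKilled."""
    results = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            results.append((status["name"], status.get("restartCount", 0)))
    return results


# Hypothetical, trimmed-down pod JSON (not taken from the attached files)
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "flytepropeller",
        "restartCount": 7,
        "lastState": {
          "terminated": {"reason": "OOMKilled", "exitCode": 137}
        }
      }
    ]
  }
}
""")

print(oom_killed_containers(sample_pod))
```

In practice you would feed it the real output with `json.load(sys.stdin)` after piping the kubectl command into the script.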
Interesting 🤔 I just set the replicaset of propeller to 2, and with the same memory limit both spawned pods run into OOM and CrashLoopBackOff. Obviously I will go with a memory increase next. On second thought: I guess they just have to do an initial fetch of past executions or something, and therefore both crash. (Side note: with that I also updated that data plane from 1.9 to 1.12.)
image.png
Another small addition: We are running it now with a replicaset of 2 and 400Mi memory limits and the pod where the memory goes down rapidly ran into OOM.
extract-2024-10-17T08_32_15.812Z.csv,flytepropeller_pod_info.yml
a
@some-grass-84903 what happens with flyteadmin when propeller crashes? Does it work OK? From the error, it looks like at some point propeller isn't able to post events to flyteadmin via the EventSink. Using the Grafana dashboard, one could observe whether there's any pattern that leads to the OOM (this bug, for example, was isolated using that dashboard).
There are some settings available for the EventSink, but I'm trying to refrain from changing things arbitrarily.
s
Nothing special in flyteadmin, I would say. We currently don't have Grafana set up and are only using Datadog. I will have to check that out a bit later, unfortunately. I will also double-check the bug later.
@average-finland-92144 but based on what you've seen so far, this is not expected behavior?
a
Definitely, it's not expected behavior.