Hello I m trying to get a better understanding of the flyte Flyte #flyte-deployment

some-grass-84903

10/15/2024, 3:34 PM

Hello, I'm trying to get a better understanding of the Flyte propeller. (I've read a bit of this doc already, but not in full detail yet.) I would like to better understand the following case: we have Flyte 1.9 (yes, updating soon) deployed in AWS EKS with the flyte-core Helm charts, with control plane and data plane in separate clusters. On this particular data plane, a simple workflow is executed every 15 minutes. What we've noticed is that the propeller container of a fresh cluster grows in memory usage over the first 24 hours, then starts running into OOM and restarts in the long run. Sometimes it's also unable to recover on its own and immediately runs into OOM again and again. I'm aware that I could just increase the memory limits to something like 500MB, but I feel like the growth would just continue, as I'm missing another crucial part here. Any hints on which feature of the propeller might cause this, and that I should look into more? (In the second image, the dashed line shows the container restarts, with its axis on the right.)
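One rough way to check whether a higher limit would only postpone the OOM is to fit a straight line to memory samples exported from the monitoring tool and extrapolate to the limit. The sketch below is a minimal illustration, not anything Flyte provides: the function names and the sample numbers are hypothetical, and it assumes the growth is roughly linear.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (hours since container start, memory in MiB)


def linear_fit(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Least-squares line through the samples; returns (slope, intercept)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("all samples share the same timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    slope = cov / var_t
    return slope, mean_m - slope * mean_t


def hours_until_limit(samples: Sequence[Sample], limit_mib: float) -> Optional[float]:
    """Hours after the last sample until the fitted line reaches the limit.

    Returns None when memory is flat or shrinking (no projected OOM).
    """
    slope, intercept = linear_fit(samples)
    if slope <= 0:
        return None
    last_t = samples[-1][0]
    return max(0.0, (limit_mib - intercept) / slope - last_t)


# Hypothetical samples, NOT taken from the thread: 5 MiB/hour of growth.
samples = [(0.0, 100.0), (6.0, 130.0), (12.0, 160.0), (18.0, 190.0)]
print(hours_until_limit(samples, 200.0))  # 2.0
print(hours_until_limit(samples, 500.0))  # 62.0
```

If the projected time simply scales with the limit, as in this made-up data, a higher limit only delays the restart; if usage flattens out after startup, the limit was just too tight.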

average-finland-92144

10/16/2024, 10:45 AM

@some-grass-84903 what type of workflow is running on that data plane? (I mean, map tasks, dynamic, etc) Could you get logs from the propeller pod?

some-grass-84903

10/16/2024, 11:24 AM

A workflow with plain tasks only, resulting in 21 nodes. The highest resource limits are 4 CPUs and 4 Gi for one task. No cache. For the logs I need a bit of time to filter out some internal stuff 😅

some-grass-84903

10/16/2024, 11:33 AM

With all the internal stuff plainly filtered out, this one looks pretty boring tbh. Let me check if I can get more insights that I can share, also around the container.

extract-2024-10-16T11_21_02.562Z.csv

some-grass-84903

10/16/2024, 11:40 AM

These should be all the logs of one container, grouped by pattern.

extract-2024-10-16T11_35_31.811Z.csv

some-grass-84903

10/16/2024, 11:49 AM

Describe of the propeller pod; the main hint is the Last State bit with OOMKilled, which is what I'm starting from.

flytepropeller_pod_info.yml

some-grass-84903

10/16/2024, 1:13 PM

Interesting 🤔 I just set the number of propeller replicas to 2, and with the same memory both spawned pods run into OOM and CrashLoopBackOff. Obviously I will go with a memory increase next. On second thought: I guess they just have to do an initial fetch of past executions or something, and therefore both crash. (Side note: with that change I also updated that data plane from 1.9 to 1.12.)

some-grass-84903

10/16/2024, 2:10 PM

image.png

some-grass-84903

10/17/2024, 8:29 AM

Another small addition: we are now running it with 2 replicas and 400Mi memory limits, and the pod where the memory goes down rapidly ran into OOM.

some-grass-84903

10/17/2024, 8:35 AM

extract-2024-10-17T08_32_15.812Z.csv,flytepropeller_pod_info.yml

average-finland-92144

10/17/2024, 8:14 PM

@some-grass-84903 what happens with flyteadmin when propeller crashes? Does it work OK? From the error, it looks like at some point propeller isn't able to post events to flyteadmin via the EventSink. Using the Grafana dashboard, one could observe whether there's any pattern that leads to the OOM (this bug, for example, was isolated using that dashboard).

average-finland-92144

10/17/2024, 8:16 PM

There are some settings available for the EventSink, but I'm trying to refrain from changing things arbitrarily.

some-grass-84903

10/18/2024, 5:51 AM

Nothing special in flyteadmin, I would say. We currently don't have Grafana set up and are only using Datadog. Unfortunately, I will have to check that out a bit later. I will also double-check the bug later.

extract-2024-10-18T05_47_57.186Z.csv

some-grass-84903

10/18/2024, 5:52 AM

@average-finland-92144 but based on what you've seen so far, this is not expected behavior?

average-finland-92144

10/22/2024, 12:02 AM

It's definitely not expected behavior.
