# ask-the-community
t
Hey community, for people running the single flyte-binary, what is your experience with memory utilization? We're seeing our flyte-binary pod get killed about once a week due to ever-increasing memory utilization (we're running 1.8.1). As you can see on the attached graph, occasionally something like a garbage collection kicks in and utilization drops dramatically before climbing again, but mostly the pod gets evicted once a week as it climbs beyond 5-6GB and runs out of memory on the 8GB node. Thoughts? Config options? Thanks! (The lines in the image are all the flyte-binary pod -- when you see one go up and not come down, it was killed, and the next line in a slightly different shade is a new flyte-binary pod getting scheduled to replace the evicted one.)
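(For readers hitting the same eviction pattern: below is a minimal sketch of the kind of memory request/limit being discussed, written as a plain Kubernetes container resources block. Where this block lives for flyte-binary -- Helm values, a raw Deployment manifest, or a Pulumi wrapper -- is an assumption, and the numbers are placeholders sized against the 8GB node described above.)

```yaml
# Sketch only: requests/limits for the flyte-binary container, sized for an 8GB node.
# The surrounding structure (Helm values vs. raw Deployment vs. Pulumi args) is an assumption.
resources:
  requests:
    memory: 5Gi      # placeholder; roughly the request discussed later in this thread
  limits:
    memory: 7Gi      # placeholder; leaves headroom below the node's 8GB so other pods aren't starved
```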
k
hmm seems like a memory leak
cc @Eduardo Apolinario (eapolinario)?
e
very interesting. Definitely has the hallmarks of a memory leak. Will open an issue to track this.
@Thomas Blom, can you give more details about the rest of your setup? Are you running on aws? Can you give a sense of workloads (# of executions / min, etc)? A quick look at one of our deployments doesn't show this, but I'm def. intrigued.
t
Hey @Eduardo Apolinario (eapolinario), yes we are running on AWS. Our core services are provisioned with Karpenter and we use a nodeSelector to land these services on smallish machines (8GB memory). We previously had no memory requests or limits for the flyte-binary, but have now installed a request of ~5Gi per the goldilocks suggestion. Our execution throughput is probably small in the flyte world - we execute tens of workflows per day, each of which has a handful of python tasks. They typically run signal-processing on lots of cores using considerable memory, and run for 10-30 minutes. Occasionally a workflow employs a map_task running tens of tasks in parallel. We use dynamic workflows quite a lot, do dynamic resource allocation via `with_overrides`, and have recently been working to add caching wherever possible. As a developer I use the Flyte Console all day long to view executions, inputs/outputs, etc. Many times a day I register new versions of all workflows/tasks (~100 total) to my dev project/domain for testing on the k8s/flyte cluster, using pyflyte to serialize and then flytectl to register against a new image; this also happens on the production project/domain several times a week as part of CI/CD. (Our wheel-based distribution, which employs compiled C extensions, means we haven't managed to use "fast registration" yet.) Probably many of those details don't matter to flyte-binary operation, just giving as much context as possible.
e
Great, thanks for the details, @Thomas Blom. Tracking this investigation in https://github.com/flyteorg/flyte/issues/3991.
k
What are the settings / config - especially the storage config and cache?
To also explain - propeller is memory bound
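(For context on that question: the storage and cache section of a flyte-binary configuration typically looks roughly like the sketch below, using the flytestdlib storage config keys. The bucket name, region, and sizes are placeholders, and the exact nesting inside the Helm/Pulumi values file is an assumption.)

```yaml
# Illustrative storage/cache config as consumed by flyte-binary (propeller and admin share it).
# Bucket, region, and sizes are placeholders; verify the nesting against your own values file.
storage:
  type: s3
  container: my-flyte-metadata-bucket   # placeholder bucket
  connection:
    region: us-east-1                   # placeholder region
    auth-type: iam
  cache:
    max_size_mbs: 512        # in-memory write-through cache; 0 (or unset) disables it
    target_gc_percent: 70    # GC target percentage for that cache
```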
t
@Ketan (kumare3) @Eduardo Apolinario (eapolinario) Here is the config for our flyte-binary as deployed via Pulumi. I am not super-familiar with flyte-binary configuration but am learning. :)
Elsewhere we create Roles/Policies/Attachments for various AWS resources including S3, but I do not see any special configuration there with respect to storage/cache config. We also configure routes for the http server and grpc, but again no special configuration that I see.
@Ketan (kumare3) @Eduardo Apolinario (eapolinario) FYI - we've moved our flyte-binary deployment to a dedicated node with more memory, but as you can see, over the last 10 days it has just kept consuming more and more memory. It's the same pattern as shown above in the OP; it just hasn't been killed yet because it has an 8G request, no limit, and the node to itself. We'll see what happens in 10 more days as it exhausts the 16G on the machine. I don't think there is anything special about our workflows. As stated above, I don't think our situation is extreme or unusual.
e
@Thomas Blom, thanks for the update. It's on our list of things to investigate and will be prioritized shortly. We have a few suspects (the first one being the prometheus integration), but we haven't been able to confirm yet. We'll update this thread when we find out more.
t
Thanks, FWIW we are using DataDog, but I didn't do the integration so am not certain what is involved there.
@Eduardo Apolinario (eapolinario) just another update -- that flyte binary running on a dedicated 16G node did die, and you can see memory continued to climb. There are also some "interesting" allocation patterns -- see the big spike where utilization goes from ~6.5G to over 13 in a short period, and then immediately drops back. If this kind of request occurs when the utilization is already higher, the pod will be OOM killed.
This may just be a coincidence, but going from 6.5 => 13 is exactly a doubling, as if there is an internal "heap" managed by flyte that doubles its size when it nears exhaustion?
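(That doubling pattern is consistent with how a Go process grows its heap between garbage collections, rather than anything Flyte-specific. If runtime behavior turns out to be part of the picture, one generic knob is the Go 1.19+ GOMEMLIMIT soft memory limit, set as an environment variable on the container; the sketch below assumes a plain Kubernetes env block and an image built with a recent enough Go toolchain. It only bounds runtime heap growth and won't help if memory is held by a genuine leak.)

```yaml
# Sketch only: a Go runtime soft memory limit for the flyte-binary container.
# Assumes the binary is built with Go 1.19+ and that env vars can be injected via your deployment tooling.
env:
  - name: GOMEMLIMIT
    value: "6GiB"    # soft cap; the runtime collects more aggressively as live heap approaches this
  - name: GOGC
    value: "75"      # optional: trigger GC before the default 100% heap growth
```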
k
And this is only in single binary?
e
I have evidence that it's caused by Prometheus metrics. We will get this sorted out next week.
t
@Ketan (kumare3) Yes, this is in single-binary. This is the only version we run, so I'm not sure if it's exclusively a single-binary problem or not.