# ask-the-community
t
Hey community, for people running the single flyte-binary, what is your experience with memory utilization? We're seeing our flyte-binary pod get killed about once a week due to ever-increasing memory utilization (we're running 1.8.1). As you can see on the attached graph, occasionally something like a garbage collection kicks in and utilization drops dramatically before climbing again, but mostly the pod gets evicted once a week as it climbs beyond 5-6GB and runs out of memory on the 8GB node. Thoughts? Config options? Thanks! (The lines in the image are all the flyte-binary pod -- when you see one go up and not come down, it was killed, and the next line in a slightly different shade is a new flyte-binary pod getting scheduled to replace the evicted one.)
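(For readers hitting the same eviction pattern: below is a minimal sketch of the kind of memory request/limit being discussed, written as a plain Kubernetes container resources block. Where this block lives for flyte-binary -- Helm values, a raw Deployment manifest, or a Pulumi wrapper -- is an assumption, and the numbers are placeholders sized against the 8GB node described above.)

```yaml
# Sketch only: requests/limits for the flyte-binary container, sized for an 8GB node.
# The surrounding structure (Helm values vs. raw Deployment vs. Pulumi args) is an assumption.
resources:
  requests:
    memory: 5Gi      # placeholder; roughly the request discussed later in this thread
  limits:
    memory: 7Gi      # placeholder; leaves headroom below the node's 8GB so other pods aren't starved
```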
k
hmm seems like a memory leak
cc @Eduardo Apolinario (eapolinario)?
e
very interesting. Definitely has the hallmarks of a memory leak. Will open an issue to track this.
@Thomas Blom, can you give more details about the rest of your setup? Are you running on aws? Can you give a sense of workloads (# of executions / min, etc)? A quick look at one of our deployments doesn't show this, but I'm def. intrigued.
t
Hey @Eduardo Apolinario (eapolinario), yes we are running on AWS. Our core services are provisioned with Karpenter and we use a nodeSelector to land these services on smallish machines (8GB memory). We previously had no memory requests or limits for the flyte-binary, but have now installed a request of ~5Gi per the goldilocks suggestion. Our execution throughput is probably small in the flyte world - we execute tens of workflows per day, each of which has a handful of python tasks. They typically run signal-processing on lots of cores using considerable memory, and run for 10-30 minutes. Occasionally a workflow employs a map_task running tens of tasks in parallel. We use dynamic workflows quite a lot, do dynamic resource allocation via `with_overrides`, and have recently been working to add caching wherever possible. As a developer I use the Flyte Console all day long to view executions, inputs/outputs, etc. Many times a day I register new versions of all workflows/tasks (~100 total) to my dev project/domain for testing on the k8s/flyte cluster, using pyflyte to serialize and then flytectl to register against a new image; this also happens on the production project/domain several times a week as part of CI/CD. (Our wheel-based distribution, which employs compiled C extensions, means we haven't managed to use "fast registration" yet.) Probably many of those details don't matter to flyte-binary operation, just giving as much context as possible.
e
Great, thanks for the details, @Thomas Blom. Tracking this investigation in https://github.com/flyteorg/flyte/issues/3991.
k
What are the settings / config - especially the storage config and cache?
To also explain - propeller is memory bound
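(For context on that question: the storage and cache section of a flyte-binary configuration typically looks roughly like the sketch below, using the flytestdlib storage config keys. The bucket name, region, and sizes are placeholders, and the exact nesting inside the Helm/Pulumi values file is an assumption.)

```yaml
# Illustrative storage/cache config as consumed by flyte-binary (propeller and admin share it).
# Bucket, region, and sizes are placeholders; verify the nesting against your own values file.
storage:
  type: s3
  container: my-flyte-metadata-bucket   # placeholder bucket
  connection:
    region: us-east-1                   # placeholder region
    auth-type: iam
  cache:
    max_size_mbs: 512        # in-memory write-through cache; 0 (or unset) disables it
    target_gc_percent: 70    # GC target percentage for that cache
```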
t
@Ketan (kumare3) @Eduardo Apolinario (eapolinario) Here is the config for our flyte-binary as deployed via Pulumi. I am not super-familiar with flyte-binary configuration but am learning. :)
Elsewhere we create Roles/Policies/Attachments for various AWS resources including S3, but I do not see any special configuration there with respect to storage/cache config. We also configure routes for the http server and grpc, but again no special configuration that I see.
@Ketan (kumare3) @Eduardo Apolinario (eapolinario) FYI - we've moved our flyte-binary deployment to a dedicated node with more memory, but as you can see, over the last 10 days it has just kept consuming more and more memory. It's the same pattern as shown above in the OP; it just hasn't been killed yet because it has an 8G request, no limit, and the node to itself. We'll see what happens in 10 more days as it exhausts the 16G on the machine. I don't think there is anything special about our workflows. As stated above, I don't think our situation is extreme or unusual.
e
@Thomas Blom, thanks for the update. It's on our list of things to investigate and will be prioritized shortly. We have a few suspects (the first one being the prometheus integration), but we haven't been able to confirm yet. We'll update this thread when we find out more.
t
Thanks, FWIW we are using DataDog, but I didn't do the integration so am not certain what is involved there.
@Eduardo Apolinario (eapolinario) just another update -- that flyte binary running on a dedicated 16G node did die, and you can see memory continued to climb. There are also some "interesting" allocation patterns -- see the big spike where utilization goes from ~6.5G to over 13 in a short period, and then immediately drops back. If this kind of request occurs when the utilization is already higher, the pod will be OOM killed.
This may just be a coincidence, but going from 6.5 => 13 is exactly a doubling, as if there is an internal "heap" managed by flyte that doubles its size when it nears exhaustion?
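(That doubling pattern is consistent with how a Go process grows its heap between garbage collections, rather than anything Flyte-specific. If runtime behavior turns out to be part of the picture, one generic knob is the Go 1.19+ GOMEMLIMIT soft memory limit, set as an environment variable on the container; the sketch below assumes a plain Kubernetes env block and an image built with a recent enough Go toolchain. It only bounds runtime heap growth and won't help if memory is held by a genuine leak.)

```yaml
# Sketch only: a Go runtime soft memory limit for the flyte-binary container.
# Assumes the binary is built with Go 1.19+ and that env vars can be injected via your deployment tooling.
env:
  - name: GOMEMLIMIT
    value: "6GiB"    # soft cap; the runtime collects more aggressively as live heap approaches this
  - name: GOGC
    value: "75"      # optional: trigger GC before the default 100% heap growth
```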
k
And this is only in single binary?
e
I have evidence that it's caused by Prometheus metrics. We will get this sorted out next week.
t
@Ketan (kumare3) Yes, this is in single-binary. This is the only version we run, so I'm not sure if it's exclusively a single-binary problem or not.