We have some workflows that seem to be triggering a memory l Flyte #flyte-support

hallowed-camera-82098

04/01/2024, 4:02 PM

We have some workflows that seem to be triggering a memory leak in flytepropeller. Within a dynamic task there's a map task sometimes with up to ~5000 subtasks, and even though the subtasks all say succeeded, the map task doesn't for a very long time (sometimes it eventually does). In the meantime memory in flytepropeller keeps rising and even if the map task eventually succeeds or is aborted the memory usage remains. Possibly we're making some connections that aren't there between the memory usage and the workflows, but just wondering if someone has seen anything like that or has any ideas what could be happening?

flat-area-42876

04/01/2024, 4:45 PM

are you using the legacy/old map tasks or the array node map task implementation?

flat-area-42876

04/01/2024, 4:46 PM

does the memory usage stay elevated after the task and/or workflow succeeds?

hallowed-camera-82098

04/01/2024, 4:57 PM

Legacy map tasks, and the memory usage does remain elevated after the map task succeeds or we abort it (though it stops going up)

hallowed-camera-82098

04/01/2024, 6:49 PM

will update with heap profiles this afternoon

🙏 1

freezing-airport-6809

04/02/2024, 12:01 AM

@hallowed-camera-82098 Flytepropeller has a big cache that is greedy and will go up with usage. It GCs at 70% of usage by default, and you can adjust this. This is usually not a leak but by design

freezing-airport-6809

04/02/2024, 12:02 AM

but 5000 map tasks, with very very large outputs per task may cause propeller to really memory thrash

👍 1

hallowed-camera-82098

04/02/2024, 12:02 AM

We think we mitigated this by disabling cache in the Flyte workflow. Not sure if this is helpful, but here's a heap diagram (diff_based against 35 minutes earlier, while the memory usage was steadily rising)

diff_35_minutes.svg

freezing-airport-6809

04/02/2024, 12:05 AM

close to 5GB in task handler

famous-flag-22960

04/02/2024, 12:05 AM

The total size of inputs.pb to these map tasks weighs in at ~7MB. Outputs are even smaller, O(100 bytes) per task, just a GCS path

freezing-airport-6809

04/02/2024, 12:05 AM

this is not good, would love to see an example of the workflow

freezing-airport-6809

04/02/2024, 12:05 AM

hmm 7MB per task? so 5k * 7MB?

freezing-airport-6809

04/02/2024, 12:06 AM

disabling cache does not sound good

famous-flag-22960

04/02/2024, 12:06 AM

7MB is the total inputs.pb to the map task

famous-flag-22960

04/02/2024, 12:07 AM

I would like to think it doesn't need a separate copy of the inputs for each of the map tasks. We did hypothesize that briefly, but at the very least we ruled out that it's loading that inputs.pb 5k times from object storage (though I suppose the fetch from object storage could be cached and it just unmarshals it once for each task inside the map task)
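For scale, here is a rough sketch of what that hypothesis would imply if every subtask really did hold its own unmarshalled copy of the inputs. The ~5000 subtask count and the ~7MB figure come from this thread; treating 1MB as 10^6 bytes and ignoring any protobuf unmarshalling overhead are assumptions, and `replicated_input_bytes` is a hypothetical helper name.

```python
def replicated_input_bytes(subtasks: int, input_bytes: int) -> int:
    """Worst case: every subtask keeps its own in-memory copy of the full inputs."""
    return subtasks * input_bytes


SUBTASKS = 5_000         # "~5000 subtasks" as reported in the thread
INPUT_BYTES = 7 * 10**6  # "~7MB" inputs.pb; MB taken as 10^6 bytes (assumption)

total = replicated_input_bytes(SUBTASKS, INPUT_BYTES)
total_gib = total / 2**30  # convert bytes to GiB for comparison with pod limits

print(f"{total:,} bytes ~= {total_gib:.1f} GiB")
```

Under these assumptions the worst case lands in the tens of GiB. That is large, but on its own it would still sit well below the 128Gi at which propeller was reported to be OOMing.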

freezing-airport-6809

04/02/2024, 12:08 AM

it does not need it

freezing-airport-6809

04/02/2024, 12:08 AM

but this is an interesting usecase, of how you reached 4GB of usage

famous-flag-22960

04/02/2024, 12:12 AM

The workflow is really rather simple

famous-flag-22960

04/02/2024, 12:12 AM

famous-flag-22960

04/02/2024, 12:13 AM

hallowed-camera-82098

04/02/2024, 12:13 AM

Oh just noticed the 70% gc comment, forgot to mention we were actually OOMing, even with 128Gi

famous-flag-22960

04/02/2024, 12:17 AM

outputs.pb of that create_map_inputs run is 7MB, corresponds to 4.7k tasks in the map task

freezing-airport-6809

04/02/2024, 12:31 AM

I think it’s not about simple or complex; we will have to see the actual values

famous-flag-22960

04/02/2024, 12:32 AM

Mind being specific about what additional things it'd be useful to see here? We're happy to pull together whatever we can

freezing-airport-6809

04/02/2024, 12:46 AM

i would love to see a representative workflow, so that i can reproduce

freezing-airport-6809

04/02/2024, 12:46 AM

it might be a leak, but would love to redo it

freezing-airport-6809

04/02/2024, 12:46 AM

it seems like the 5k concurrent tasks is what caused it

famous-flag-22960

04/02/2024, 1:04 AM

I unfortunately cannot share the full workflow, lots of company code in there. Couple of options: we're always happy to hop on a call and do some live debugging, or I could try to make a version of the workflow with stubbed implementations that have the same inputs/outputs to see if we can reproduce that way. The inputs and outputs are generally not themselves sensitive.

famous-flag-22960

04/02/2024, 4:40 PM

We're picking this up this morning and actively continuing debugging today as we have workloads pending. My current plan is to see if I can minimize this test case: use identical inputs and outputs from all the tasks in the workflow, run locally in a sandbox and see if I can reproduce the high memory usage or a memory leak there. Happy to run any other more specific tests if y'all have any hypotheses you'd want to test or specific data to capture

flat-area-42876

04/02/2024, 9:04 PM

@famous-flag-22960 a repro would be great. Thank you. Looking forward to figuring this one out.

flat-area-42876

04/08/2024, 5:27 PM

cc: @hallowed-mouse-14616

hallowed-mouse-14616

04/08/2024, 7:40 PM

@hallowed-camera-82098 / @famous-flag-22960 flyte single binary (and each individual component) exposes a golang pprof endpoint on port 10254 by default. Using this we can see exactly where in the binary we're storing tons of data. You can use:

wget -O heap.out http://localhost:10254/debug/pprof/heap

if the flyte binary is at localhost:10254 to retrieve a dump of the heap, and then something like:

go tool pprof -no_browser -http :8080 heap.out

to start a webserver displaying the results, which should show something like the image below. It would be great if we could even get a dump of the heap; we would be happy to look through the issues there.
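Because the growth in this thread was observed over time, it can help to capture two labelled snapshots and compare them. Below is a minimal sketch using only the Python standard library. The endpoint path and port are the ones quoted above. The helper names (`heap_url`, `save_heap_snapshot`, `diff_command`) and the `heap-<label>.out` file naming are hypothetical. `-diff_base` is pprof's flag for subtracting a baseline profile.

```python
import urllib.request
from pathlib import Path
from typing import List


def heap_url(host: str = "localhost", port: int = 10254) -> str:
    """Build the pprof heap endpoint URL (port and path as quoted in the thread)."""
    return f"http://{host}:{port}/debug/pprof/heap"


def save_heap_snapshot(url: str, out_dir, label: str,
                       opener=urllib.request.urlopen) -> Path:
    """Download one heap profile and store it as heap-<label>.out.

    `opener` is injectable so the function can be exercised without a live
    propeller; by default it performs a real HTTP GET.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"heap-{label}.out"
    with opener(url) as resp:
        path.write_bytes(resp.read())
    return path


def diff_command(before: Path, after: Path) -> List[str]:
    """Return a pprof invocation showing growth in `after` relative to `before`."""
    return [
        "go", "tool", "pprof",
        "-no_browser", "-http", ":8080",
        f"-diff_base={before}",
        str(after),
    ]
```

A possible usage: call `save_heap_snapshot(heap_url(), "profiles", "before")` while memory is rising, call it again with the label `"after"` some minutes later, then run the list returned by `diff_command(...)` (for example via `subprocess.run`) to view only what grew between the two snapshots.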

famous-flag-22960

04/08/2024, 8:46 PM

There’s a diff of the heap growing posted earlier in this thread

famous-flag-22960

04/08/2024, 8:47 PM

https://flyte-org.slack.com/archives/CP2HDHKE1/p1712016156497859?thread_ts=1711987346.995029&channel=CP2HDHKE1&message_ts=1712016156.497859

famous-flag-22960

04/08/2024, 10:39 PM

@hallowed-mouse-14616 we have the raw before/after that we used to generate the diff as well that we can share.

famous-flag-22960

04/08/2024, 10:40 PM

I've not had luck reproducing locally yet, but sometimes work gets busy. Making one more pass at reproducing this week.
