# ask-the-community
s
👋 We've noticed that fast registered workflow tasks consume a lot more memory when they start up (before running user code) and hit OOMKilled errors a lot more frequently. If the workflow is fully registered, memory usage is a lot more predictable. We tried digging into this but couldn't find any logs except for this one, which happens right around the time the task uses a lot of memory:
```
tar: Removing leading `/' from member names
```
Does fast registration run any scripts when the task runs that could be using a lot of memory and/or emitting this log? Any pointers would be much appreciated!
also, we mostly use dynamic workflows so I'm not sure if this issue is specific to that
y
this sounds very odd. fast register doesn’t do anything special except download a targz file and extract it.
even if the file is very large it shouldn’t matter.
are you saying if you add a log line to your task body, it ooms before that prints?
s
yup! that's right - this tar log line is often the last log line we see before the task OOMs
y
how many resources are you giving the pod?
s
we monitor the pod memory usage on prometheus, and I noticed that right around when this log line is emitted, the memory utilization spikes up quite high
in our user's code it's 2 Gi of memory request and limit
y
that’s pretty high
could you download the zip file it’s trying to download manually to your local computer and monitor memory usage there?
s
here's an example - with the same exact task and same exact code first image is fast registration and second is full registration (blue line is memory usage, yellow is request/limit)
in the first case the pod was killed with the OOMKilled status
> could you download the zip file it’s trying to download manually to your local computer and monitor memory usage there?
hmm that sounds like a good idea - how would I go about doing that? I could try to pull the image and run it locally, would the zip file be located in the container?
if it tries to download it I assume I'd have to run the container first
> download a targz file and extract it
hm actually - what does the targz file contain, and is the container downloading it from flyte admin?
we noticed that the bigger the task (I think in terms of dependencies), the higher the memory consumption here so it feels possible this targz file is blowing up the memory consumption
y
when you fast register flytekit zips up your code and ships it off to the configured blob store
when the task runs it downloads and extracts that code first
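For context, a minimal sketch of that startup step, assuming the .tar.gz has already been fetched from the configured blob store to a local path (this is not flytekit's actual implementation; as noted further down, the real download_distribution shells out to tar via subprocess, which is where the "tar: Removing leading `/'" message comes from):
```python
import tarfile

# Sketch only: extract the code archive that fast registration uploaded at
# registration time into the working directory the task will run from.
def extract_code_archive(local_tarball: str, dest_dir: str) -> None:
    # Assumes the .tar.gz was already downloaded from blob storage.
    with tarfile.open(local_tarball, "r:gz") as tf:
        tf.extractall(path=dest_dir)
```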
s
gotcha, I'll take a look at the tars at the upload path
strange that the download would take up a large chunk of memory though - could you perhaps point me to the code where it does this?
hm any suggestions on how I can trace down the exact tar file it's downloading? we configure the `storagePrefix` on flyte admin, but the rest of the s3 path seems to be a randomly generated string
ah found it in propeller's logs and yeah the tar file is 2.2 GiB in size
y
sorry, was afk
what do you mean… propeller logs?
the location of it? you should be able to see that by clicking on the “Task” tab in the UI
it’s not pretty, but it’s a json dump effectively of the definition of the task
and the inputs/outputs are templatized but the location of the code remains fixed.
i didn’t realize un-taring was that memory inefficient.
actually not sure what’s the inefficient part, the ungzing or the untaring
s
gotcha, yeah I can confirm that the tar file shown in the flyte console UI is the same 2.2 GiB file - I'll try to see if I can download this and inspect memory when ungzing / untarring
any chance you can point me to the code where flyte does this? would be nice to replicate in the same way it's done in the task
y
yeah
that’s the zipping up
and `download_distribution` in the same file
also fast register should respect gitignores.
s
perfect, will analyze and follow up - thank you for the pointers!!
k
Thank you. Also, there is a rust entrypoint we are working on - perf and lower memory footprint
s
thanks ketan and yee! the rust entrypoint would be awesome. in the meantime, do you have any suggestions on how to implement some memory footprint profiling that includes the entrypoint? the best idea I had was a flytekit plugin, but I think even those run after this untarring step
y
btw were you able to shrink the size of the file?
that’s a separate question of course. but wrt the first question, i think it should be profile-able locally, assuming you have a way of downloading the flyte package tgz file locally
s
not yet no 😞 we didn't set up the storage prefix properly, so I'm still working on getting access perms to download it. I'll definitely try to profile this locally, but I guess I was thinking about long-term - basically where/how we could add a memory profiler in production. the motivation is that we see a lot of OOMs that seem unrelated to user code, but we have a hard time tracking down exactly what stage in the task setup causes it
quick update: I enabled access to the tarballs, and profiled download_distribution locally - the max memory usage was around 600 MiB (using `resource.getrusage(resource.RUSAGE_SELF).ru_maxrss`), so it's pretty strange that the task OOMs
it OOMs slightly after the tar log line, so it's possible this memory usage was only part of the reason it OOMed. it's quite difficult to track down where exactly it peaked in memory usage - hence why a way to profile the entrypoint running in the flyte environment would be extremely useful 😅. I'm wondering if we can maybe wrap the command that executes the entrypoint in a profiler. but I can confirm that we didn't reach user code yet when we OOMed
y
i’m surprised the download is that high tbh.
but just to confirm… what is that 600 number?
the download distribution function ends up calling subprocess.
s
yeah me too tbh - to clarify, I basically ran a python file that just calls `download_distribution` and printed the value of `ru_maxrss` (which I think shows the peak memory usage), and it was 600 MiB. also for more clarity, this is on flytekit==1.3.2, so part of this usage could be due to things like the tensorflow import, which was removed in a later version
y
does the ru_maxrss account for the subprocess as well?
s
I don't think it does looking at the docs
```
resource.RUSAGE_SELF
Pass to getrusage() to request resources consumed by the calling process, which is the sum of resources used by all threads in the process.

resource.RUSAGE_CHILDREN
Pass to getrusage() to request resources consumed by child processes of the calling process which have been terminated and waited for.

resource.RUSAGE_BOTH
Pass to getrusage() to request resources consumed by both the current process and child processes. May not be available on all systems.

resource.RUSAGE_THREAD
Pass to getrusage() to request resources consumed by the current thread. May not be available on all systems.
```
but I printed both SELF and CHILDREN after calling `download_distribution`:
```python
max_mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
max_sub = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
```
and I got 604744 kB for SELF and 322152 kB for CHILDREN
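For reference, because download_distribution shells out, a child process's peak memory only shows up under RUSAGE_CHILDREN after that child has exited and been waited for. A minimal standalone illustration (on Linux, where ru_maxrss is reported in kilobytes):
```python
import resource
import subprocess
import sys

# Spawn a child that allocates ~200 MiB and exits; subprocess.run() waits for
# it, so its peak RSS is folded into the RUSAGE_CHILDREN counters.
subprocess.run([sys.executable, "-c", "x = bytearray(200 * 1024 * 1024)"], check=True)

print("self    :", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
print("children:", resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss)
```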
y
and this is the 2.2gb file?
s
yup!
it's about 2.27 GiB to be more precise
y
and the pod oomed with 2gb right?
and we’re only seeing 1 locally
and the image arch matches right?
s
do you mean the image environment it's being run on? the python packages do match, but for the local testing I ran it on a host directly
y
how are you calling/using resource?
yeah i mean when it’s being run for real…
i assume it is
s
this is what I'm running locally
```python
import resource
from flytekit.tools.fast_registration import download_distribution

download_distribution(...)

max_mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
max_sub = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss

print(f"Max memory used: {max_mem} kb")
print(f"Max memory children used: {max_sub} kb")
```
> yeah i mean when it’s being run for real…
yeah, the same python universe is used and I copied the s3 path from flyte console, so I do think everything important matches up 🤔
y
so what’s the quickest thing we can do?
we can add that blurb to a branch of flytekit.
that little snippet of code
and then run on a pod with 4gb.
but that doesn’t really get the profile we need.
s
I'm not too sure tbh - but yeah, some way to profile memory here line by line when running it in flyte would be great. it does peak in memory and OOM ~20s after the tar log line (there are no log lines after this one), so it's possible it OOMs somewhere in the entrypoint a bit after untarring
or maybe even a way to get the current stacktrace when it OOMs
y
i think i’ve used memray before
i think first best to isolate what’s causing the oom.
can you first run with `FLYTE_SDK_LOGGING_LEVEL=10` set as an environment variable?
that will at the very least produce more logs.
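One possible way to get that variable into the task container, assuming a flytekit version where the @task decorator accepts an environment mapping (the task name below is just a placeholder):
```python
from flytekit import task

# FLYTE_SDK_LOGGING_LEVEL=10 corresponds to logging.DEBUG, so flytekit emits
# debug-level logs inside the container for this task.
@task(environment={"FLYTE_SDK_LOGGING_LEVEL": "10"})
def my_task() -> None:
    ...
```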
s
did that work in the flyte context btw? I might be using it wrong but I tried it locally with the script I showed you and this is what I got
will try the env var
y
would be curious to change the command to “echo hello” - the task will fail, but i wonder if it’ll oom.
and if it doesn’t oom, then we can add a memray to the front of that command, and a sleep somewhere.
run it on a bigger box, then exec in and generate the flame graph after
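A rough Python-level alternative to prefixing the whole command: memray also exposes a tracker API, so a sketch like the one below (assuming memray is installed in the task image, and keeping the download_distribution arguments elided as in the local repro) could capture just the download/extract step; the capture file can then be rendered with memray's flamegraph subcommand after exec-ing into the pod:
```python
from memray import Tracker
from flytekit.tools.fast_registration import download_distribution

# Sketch: write the memray capture somewhere it can be copied off the pod
# (pair this with a sleep, as suggested above, to keep the pod around).
with Tracker("/tmp/fast_register_profile.bin"):
    download_distribution(...)  # arguments elided, as in the local repro script
```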
s
hmm are you saying that if it OOMs with echo hello (or has high memory usage) that means the problem is before the entrypoint? also, will think about how to best do this since I'm not sure if there's a straightforward way for me to use a modified flytekit - our standard process is to fork, modify, publish to internal artifactory, then use that patched version
y
that works as well.
or is that a slow process?
but yes, before or after the second command
s
yeah that would be quite a slow process for our iterative debugging purposes
y
you want something to hook into?
the flytekit entrypoint loads a task resolver (in the default case it loads the default task resolver) (and the task resolver is actually the thing that calls importlib on your task)
you can write a custom one that just raises an exception.
set a task to use that task resolver, that would get around the hello world bit.
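A rough sketch of such a resolver, assuming flytekit's DefaultTaskResolver lives in flytekit.core.python_auto_container (class names and import paths may differ across flytekit versions):
```python
import resource
from typing import List

from flytekit.core.python_auto_container import DefaultTaskResolver

class BailOutResolver(DefaultTaskResolver):
    """Hypothetical resolver that raises before any user module is imported,
    so the peak RSS of the entrypoint up to that point can be observed."""

    def load_task(self, loader_args: List[str]):
        peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        raise RuntimeError(f"bailing out before user code; peak RSS so far: {peak_kb} kB")
```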
s
I might not be understanding this correctly but the resolver gets called after some setup in the entrypoint including downloading the tarball code right? I assume it's the point where it's trying to execute the task
also curious if having a custom resolver would override things like flytekit plugins (which should be fine, but I'm curious about the ordering here)
y
it won’t override plugins.
and yes, it gets called after downloading the tarball, and after loading flytekit, and after loading flytekit plugins that are automatically loaded, but before user code is loaded
as the resolver is what loads user code.
just wanted to binary search the problem space a bit.
s
makes sense, thanks for the suggestions yee! I'll work with the user to do this - but to summarize, I think these are the steps, lmk if this captures your suggestions:
- set `FLYTE_SDK_LOGGING_LEVEL=10` and see if we get more logs in the task init phase when it OOMs
- create a task resolver that will print resource usage, or just call echo instead of calling `pyflyte-execute`
  - if the OOM-type memory behavior is observed, then we conclude the problem is before this call, otherwise it's after
  - if it's after, we can debug this via `memray` wrapping `pyflyte-execute`
  - if it's before, not clear yet how we could debug this further atm
y
yeah. if it’s before you can also play around with trying to exec into the container before
i don’t know how much leeway you guys have, but basically run the pod manually on the cluster, and see if that resource module lines up with your local test.
s
makes sense - I think that should be doable, I'll look into it more if it comes to that thanks again for all the suggestions 🙂 will report back here with what I find
quick update here:
- I set `FLYTE_SDK_LOGGING_LEVEL=10`, which generated a few more logs but nothing that feels too relevant
- I also emitted an exception on `load_task` in the task resolver, and I can confirm the excessive memory usage does occur before user code is loaded

these are the logs near the peak memory, which was at around 22:25:00 to 22:25:20:
```
{"asctime": "2024-04-19 22:24:44,690", "name": "flytekit", "levelname": "INFO", "message": "Exiting timed context: Copying (<s3 path to .tar.gz file> -> <local path>) [Wall Time: 27.35208011994837s, Process Time: 0.050577096000001376s]"}
tar: Removing leading `/' from member names
{"asctime": "2024-04-19 22:25:10,655", "name": "flytekit", "levelname": "INFO", "message": "Setting protocol to file"}
2024-04-19 22:25:23.048676: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
```
the log before this was the tar download, which did report the file was 2.9 GiB, which kinda corresponds to the peak usage:
```
...Completed 2.9 GiB/2.9 GiB (119.4 MiB/s)...
```
but that'd be strange if the tar was the cause, since memory doesn't go up much locally for the `download_distribution` call
I guess next step is I'll try to profile the entrypoint on a pod on the cluster. not sure how exactly I'd do this:
- maybe spin up an empty pod and profile just the `download_distribution` code
- or somehow spin up a pod with the flyte env without automatically running the entrypoint, SSH into the pod, and then profile while running the entrypoint, or perhaps even use a PDB or something
hey yee, sorry I forgot to report back my findings on this thread - so we confirmed it is due to the tarball sizes; it looks like it needs enough memory to load the entire tarball. I was able to reduce our tarball sizes (basically by ignoring bazel-generated jar files), which makes fast registration not usable for spark job changes, but that's fine since our user changes are mostly python and it works well there.

on an unrelated note: ideally we could have flyte support VPAs (flyte/issues/2234), but not sure what the effort there would be - I didn't get any responses on my comment there yet 🥲

thanks again for all your help! hope the findings here are useful to you in any way
y
that sounds like something that should be brought up with the broader community… any chance you can join the next contributor sync?
and thanks for testing…
we’re working on replacing the i/o layer that flytekit uses, this problem with the tar files will be obviated hopefully in the medium term.
s
no problem! I'm definitely happy to join the contributor sync - I briefly went over the hackmd doc, should I start by posting in #contribute or just bring this up in the meeting? I'm kinda new to contributing so not sure how to prepare 😅
y
cc @David Espejo when’s the next one again?
can we add this to the agenda?
s
I noticed the next meeting on the ical invite in the hackmd was scheduled for tomorrow morning so bumping this! cc @David Espejo (he/him) (seems like there are two @.David Espejo's here 👀)
d
@Swarup Srinivasan thanks!
> seems like there are two @.David Espejo's here 👀
Yeah, I think the other one is an account I created long ago with another email. I'll remove it to avoid confusion
and yeah, you're so welcome to join us tomorrow at 7:00am PT