# ask-the-community
s
👋 We've noticed that fast registered workflow tasks consume a lot more memory when they start up (before running user code) and hit OOMKilled errors a lot more frequently. If the workflow is fully registered, memory usage is a lot more predictable. We tried digging into this but couldn't find any logs except for this one, which happens right around the time the task uses a lot of memory:
```
tar: Removing leading `/' from member names
```
Does fast registration run any scripts when the task runs that could be using a lot of memory and/or emitting this log? Any pointers would be much appreciated!
also, we mostly use dynamic workflows so I'm not sure if this issue is specific to that
y
this sounds very odd. fast register doesn’t do anything special except download a targz file and extract it.
even if the file is very large it shouldn’t matter.
are you saying if you add a log line to your task body, it ooms before that prints?
s
yup! that's right - this tar log line is often the last log line we see before the task OOMs
y
how many resources are you giving the pod?
s
we monitor the pod memory usage on prometheus, and I noticed that right around when this log line is emitted, the memory utilization spikes up quite high
in our user's code it's 2 Gi of memory request and limit
y
that’s pretty high
could you download the zip file it’s trying to download manually to your local computer and monitor memory usage there?
s
here's an example - with the same exact task and same exact code first image is fast registration and second is full registration (blue line is memory usage, yellow is request/limit)
in the first case the pod was killed with the OOMKilled status
> could you download the zip file it’s trying to download manually to your local computer and monitor memory usage there?
hmm that sounds like a good idea - how would I go about doing that? I could try to pull the image and run it locally, would the zip file be located in the container?
if it tries to download it I assume I'd have to run the container first
> download a targz file and extract it
hm actually - what does the targz file contain, and is the container downloading it from flyte admin?
we noticed that the bigger the task (I think in terms of dependencies), the higher the memory consumption here so it feels possible this targz file is blowing up the memory consumption
y
when you fast register flytekit zips up your code and ships it off to the configured blob store
when the task runs it downloads and extracts that code first
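For context, a minimal sketch of that startup step, assuming the .tar.gz has already been fetched from the configured blob store to a local path (this is not flytekit's actual implementation; as noted further down, the real download_distribution shells out to tar via subprocess, which is where the "tar: Removing leading `/'" message comes from):
```python
import tarfile

# Sketch only: extract the code archive that fast registration uploaded at
# registration time into the working directory the task will run from.
def extract_code_archive(local_tarball: str, dest_dir: str) -> None:
    # Assumes the .tar.gz was already downloaded from blob storage.
    with tarfile.open(local_tarball, "r:gz") as tf:
        tf.extractall(path=dest_dir)
```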
s
gotcha, I'll take a look at the tars at the upload path
strange that the download would take up a large chunk of memory though - could you perhaps point me to the code where it does this?
hm any suggestions on how I can trace down the exact tar file it's downloading? we configure the `storagePrefix` on flyte admin, but the rest of the s3 path seems to be a randomly generated string
ah found it in propeller's logs and yeah the tar file is 2.2 GiB in size
y
sorry, was afk
what do you mean… propeller logs?
the location of it? you should be able to see that by clicking on the “Task” tab in the UI
it’s not pretty, but it’s a json dump effectively of the definition of the task
and the inputs/outputs are templatized but the location of the code remains fixed.
i didn’t realize un-taring was that memory inefficient.
actually not sure what’s the inefficient part, the ungzing or the untaring
s
gotcha, yeah I can confirm that the tar file shown in the flyte console UI is the same 2.2 GiB file - I'll try to see if I can download this and inspect memory when ungzing / untarring
any chance you can point me to the code where flyte does this? would be nice to replicate in the same way it's done in the task
y
yeah
that’s the zipping up
and `download_distribution` in the same file
also fast register should respect gitignores.
s
perfect, will analyze and follow up - thank you for the pointers!!
k
Thank you. Also, there is a rust entrypoint we are working on - perf and lower memory footprint
s
thanks ketan and yee! the rust entrypoint would be awesome. in the meantime, do you have any suggestions on how to implement some memory footprint profiling that includes the entrypoint? the best idea I had was a flytekit plugin, but I think even those run after this untarring step
y
btw were you able to shrink the size of the file?
that’s a separate question of course. but wrt the first question, i think it should be profile-able locally, assuming you have a way of downloading the flyte package tgz file locally
s
not yet no 😞 we didn't set up the storage prefix properly, so I'm still working on getting access perms to download it. I'll definitely try to profile this locally, but I guess I was thinking about long-term - basically where/how we could add a memory profiler in production. the motivation is that we see a lot of OOMs that seem unrelated to user code, but we have a hard time tracking down exactly what stage in the task setup causes it
quick update: I enabled access to the tarballs, and profiled download_distribution locally - the max memory usage was around 600 MiB (using `resource.getrusage(resource.RUSAGE_SELF).ru_maxrss`), so it's pretty strange that the task OOMs
it OOMs slightly after the tar log line, so it's possible this memory usage was only part of the reason it OOMed. it's quite difficult to track down where exactly it peaked in memory usage - hence why a way to profile the entrypoint running in the flyte environment would be extremely useful 😅. I'm wondering if we can maybe wrap the command that executes the entrypoint in a profiler. but I can confirm that we didn't reach user code yet when we OOMed
y
i’m surprised the download is that high tbh.
but just to confirm… what is that 600 number?
the download distribution function ends up calling subprocess.
s
yeah me too tbh - to clarify, I basically ran a python file that just calls `download_distribution` and printed the value of `ru_maxrss` (which I think shows the peak memory usage), and it was 600 MiB. also for more clarity, this is on flytekit==1.3.2, so part of this usage could be due to things like the tensorflow import, which was removed in a later version
y
does the ru_maxrss account for the subprocess as well?
s
I don't think it does looking at the docs
```
resource.RUSAGE_SELF
Pass to getrusage() to request resources consumed by the calling process, which is the sum of resources used by all threads in the process.

resource.RUSAGE_CHILDREN
Pass to getrusage() to request resources consumed by child processes of the calling process which have been terminated and waited for.

resource.RUSAGE_BOTH
Pass to getrusage() to request resources consumed by both the current process and child processes. May not be available on all systems.

resource.RUSAGE_THREAD
Pass to getrusage() to request resources consumed by the current thread. May not be available on all systems.
```
but I printed both SELF and CHILDREN after calling `download_distribution`:
```python
max_mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
max_sub = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
```
and I got 604744 kB for SELF and 322152 kB for CHILDREN
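For reference, because download_distribution shells out, a child process's peak memory only shows up under RUSAGE_CHILDREN after that child has exited and been waited for. A minimal standalone illustration (on Linux, where ru_maxrss is reported in kilobytes):
```python
import resource
import subprocess
import sys

# Spawn a child that allocates ~200 MiB and exits; subprocess.run() waits for
# it, so its peak RSS is folded into the RUSAGE_CHILDREN counters.
subprocess.run([sys.executable, "-c", "x = bytearray(200 * 1024 * 1024)"], check=True)

print("self    :", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
print("children:", resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss)
```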
y
and this is the 2.2gb file?
s
yup!
it's about 2.27 GiB to be more precise
y
and the pod oomed with 2gb right?
and we’re only seeing 1 locally
and the image arch matches right?
s
do you mean the image environment it's being run on? the python packages do match, but for the local testing I ran it on a host directly
y
how are you calling/using resource?
yeah i mean when it’s being run for real…
i assume it is
s
this is what I'm running locally
```python
import resource
from flytekit.tools.fast_registration import download_distribution

download_distribution(...)

max_mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
max_sub = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss

print(f"Max memory used: {max_mem} kb")
print(f"Max memory children used: {max_sub} kb")
```
> yeah i mean when it’s being run for real…
yeah, the same python universe is used and I copied the s3 path from flyte console, so I do think everything important matches up 🤔
y
so what’s the quickest thing we can do?
we can add that blurb to a branch of flytekit.
that little snippet of code
and then run on a pod with 4gb.
but that doesn’t really get the profile we need.
s
I'm not too sure tbh - but yeah, some way to profile memory here line by line when running it in flyte would be great. it does peak in memory and OOM ~20s after the tar log line (there are no log lines after this one), so it's possible it OOMs somewhere in the entrypoint a bit after untarring
or maybe even a way to get the current stacktrace when it OOMs
y
i think i’ve used memray before
i think first best to isolate what’s causing the oom.
can you first run with `FLYTE_SDK_LOGGING_LEVEL=10` set as an environment variable?
that will at the very least produce more logs.
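One possible way to get that variable into the task container, assuming a flytekit version where the @task decorator accepts an environment mapping (the task name below is just a placeholder):
```python
from flytekit import task

# FLYTE_SDK_LOGGING_LEVEL=10 corresponds to logging.DEBUG, so flytekit emits
# debug-level logs inside the container for this task.
@task(environment={"FLYTE_SDK_LOGGING_LEVEL": "10"})
def my_task() -> None:
    ...
```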
s
did that work in the flyte context btw? I might be using it wrong but I tried it locally with the script I showed you and this is what I got
will try the env var
y
would be curious to change the command to “echo hello” - the task will fail, but i wonder if it’ll oom.
and if it doesn’t oom, then we can add a memray to the front of that command, and a sleep somewhere.
run it on a bigger box, then exec in and generate the flame graph after
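A rough Python-level alternative to prefixing the whole command: memray also exposes a tracker API, so a sketch like the one below (assuming memray is installed in the task image, and keeping the download_distribution arguments elided as in the local repro) could capture just the download/extract step; the capture file can then be rendered with memray's flamegraph subcommand after exec-ing into the pod:
```python
from memray import Tracker
from flytekit.tools.fast_registration import download_distribution

# Sketch: write the memray capture somewhere it can be copied off the pod
# (pair this with a sleep, as suggested above, to keep the pod around).
with Tracker("/tmp/fast_register_profile.bin"):
    download_distribution(...)  # arguments elided, as in the local repro script
```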
s
hmm are you saying that if it OOMs with echo hello (or has high memory usage) that means the problem is before the entrypoint? also, will think about how to best do this since I'm not sure if there's a straightforward way for me to use a modified flytekit - our standard process is to fork, modify, publish to internal artifactory, then use that patched version
y
that works as well.
or is that a slow process?
but yes, before or after the second command
s
yeah that would be quite a slow process for our iterative debugging purposes
y
you want something to hook into?
the flytekit entrypoint loads a task resolver (in the default case it loads the default task resolver) (and the task resolver is actually the thing that calls importlib on your task)
you can write a custom one that just raises an exception.
set a task to use that task resolver, that would get around the hello world bit.
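A rough sketch of such a resolver, assuming flytekit's DefaultTaskResolver lives in flytekit.core.python_auto_container (class names and import paths may differ across flytekit versions):
```python
import resource
from typing import List

from flytekit.core.python_auto_container import DefaultTaskResolver

class BailOutResolver(DefaultTaskResolver):
    """Hypothetical resolver that raises before any user module is imported,
    so the peak RSS of the entrypoint up to that point can be observed."""

    def load_task(self, loader_args: List[str]):
        peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        raise RuntimeError(f"bailing out before user code; peak RSS so far: {peak_kb} kB")
```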
s
I might not be understanding this correctly but the resolver gets called after some setup in the entrypoint including downloading the tarball code right? I assume it's the point where it's trying to execute the task
also curious if having a custom resolver would override things like flytekit plugins (which should be fine, but I'm curious about the ordering here)
y
it won’t override plugins.
and yes, it gets called after downloading the tarball, and after loading flytekit, and after loading flytekit plugins that are automatically loaded, but before user code is loaded
as the resolver is what loads user code.
just wanted to binary search the problem space a bit.
s
makes sense, thanks for the suggestions yee! I'll work with the user to do this - but to summarize, I think these are the steps, lmk if this captures your suggestions:
- set `FLYTE_SDK_LOGGING_LEVEL=10` and see if we get more logs in the task init phase when it OOMs
- create a task resolver that will print resource usage, or just call echo instead of calling `pyflyte-execute`
  - if the OOM-type memory behavior is observed, then we conclude the problem is before this call, otherwise it's after
  - if it's after, we can debug this via `memray` wrapping `pyflyte-execute`
  - if it's before, not clear yet how we could debug this further atm
y
yeah. if it’s before you can also play around with trying to exec into the container before
i don’t know how much leeway you guys have, but basically run the pod manually on the cluster, and see if that resource module lines up with your local test.
s
makes sense - I think that should be doable, I'll look into it more if it comes to that thanks again for all the suggestions 🙂 will report back here with what I find
quick update here:
- I set `FLYTE_SDK_LOGGING_LEVEL=10`, which generated a few more logs but nothing that feels too relevant
- I also emitted an exception on `load_task` in the task resolver, and I can confirm the excessive memory usage does occur before user code is loaded

these are the logs near the peak memory, which was at around 22:25:00 to 22:25:20:
```
{"asctime": "2024-04-19 22:24:44,690", "name": "flytekit", "levelname": "INFO", "message": "Exiting timed context: Copying (<s3 path to .tar.gz file> -> <local path>) [Wall Time: 27.35208011994837s, Process Time: 0.050577096000001376s]"}
tar: Removing leading `/' from member names
{"asctime": "2024-04-19 22:25:10,655", "name": "flytekit", "levelname": "INFO", "message": "Setting protocol to file"}
2024-04-19 22:25:23.048676: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
```
the log before this was the tar download, which did report the file was 2.9 GiB, which kinda corresponds to the peak usage:
```
...Completed 2.9 GiB/2.9 GiB (119.4 MiB/s)...
```
but that'd be strange if the tar was the cause, since memory doesn't go up much locally for the `download_distribution` call
I guess next step is I'll try to profile the entrypoint on a pod on the cluster. not sure how exactly I'd do this:
- maybe spin up an empty pod and profile just the `download_distribution` code
- or somehow spin up a pod with the flyte env without automatically running the entrypoint, SSH into the pod, and then profile while running the entrypoint, or perhaps even use a PDB or something
hey yee, sorry I forgot to report back my findings on this thread - so we confirmed it is due to the tarball sizes; it looks like it needs enough memory to load the entire tarball. I was able to reduce our tarball sizes (basically by ignoring bazel-generated jar files), which makes fast registration not usable for spark job changes, but that's fine since our user changes are mostly python and it works well there.

on an unrelated note: ideally we could have flyte support VPAs (flyte/issues/2234), but not sure what the effort there would be - I didn't get any responses on my comment there yet 🥲

thanks again for all your help! hope the findings here are useful to you in any way
y
that sounds like something that should be brought up with the broader community… any chance you can join the next contributor sync?
and thanks for testing…
we’re working on replacing the i/o layer that flytekit uses, this problem with the tar files will be obviated hopefully in the medium term.
s
no problem! I'm definitely happy to join the contributor sync - I briefly went over the hackmd doc, should I start by posting in #contribute or just bring this up in the meeting? I'm kinda new to contributing so not sure how to prepare 😅
y
cc @David Espejo when’s the next one again?
can we add this to the agenda?
s
I noticed the next meeting on the ical invite in the hackmd was scheduled for tomorrow morning so bumping this! cc @David Espejo (he/him) (seems like there are two @.David Espejo's here 👀)
d
@Swarup Srinivasan thanks!
> seems like there are two @.David Espejo's here 👀
Yeah, I think the other one is an account I created long ago with another email. I'll remove it to avoid confusion
and yeah, you're so welcome to join us tomorrow at 7:00am PT