Swarup Srinivasan
04/11/2024, 4:41 PM
tar: Removing leading `/' from member names
Does fast registration run any scripts when the task runs that could be using a lot of memory and/or emitting this log? Any pointers would be much appreciated!
Swarup Srinivasan
04/11/2024, 4:43 PM
Yee
Yee
Yee
Swarup Srinivasan
04/11/2024, 5:18 PM
Yee
Swarup Srinivasan
04/11/2024, 5:18 PM
Swarup Srinivasan
04/11/2024, 5:18 PM
Yee
Yee
Swarup Srinivasan
04/11/2024, 5:20 PM
Swarup Srinivasan
04/11/2024, 5:21 PM
Swarup Srinivasan
04/11/2024, 5:22 PM
"could you download the zip file it’s trying to download manually to your local computer and monitor memory usage there?"
hmm that sounds like a good idea - how would I go about doing that? I could try to pull the image and run it locally, would the zip file be located in the container?
Swarup Srinivasan
04/11/2024, 5:22 PM
Swarup Srinivasan
04/11/2024, 5:23 PM
"download a targz file and extract it"
hm actually - what does the targz file contain, and is the container downloading it from flyte admin?
Swarup Srinivasan
04/11/2024, 5:24 PM
Yee
Yee
Swarup Srinivasan
04/11/2024, 5:38 PM
Swarup Srinivasan
04/11/2024, 5:39 PM
Swarup Srinivasan
04/11/2024, 5:42 PM
storagePrefix on flyte admin, but the rest of the s3 path seems to be a randomly generated string
Swarup Srinivasan
04/11/2024, 5:47 PM
Yee
Yee
Yee
Yee
Yee
Yee
Yee
Swarup Srinivasan
04/11/2024, 5:55 PM
Swarup Srinivasan
04/11/2024, 5:56 PM
Yee
Yee
Yee
download_distribution in the same file
Yee
Swarup Srinivasan
04/11/2024, 5:59 PM
Ketan (kumare3)
Swarup Srinivasan
04/16/2024, 7:19 PM
Yee
Yee
Swarup Srinivasan
04/17/2024, 4:27 PM
Swarup Srinivasan
04/18/2024, 6:10 PM
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) so it's pretty strange that the task OOMs
Swarup Srinivasan
04/18/2024, 6:12 PM
Yee
Yee
Yee
Swarup Srinivasan
04/18/2024, 7:09 PM
download_distribution and printed the value of ru_maxrss, which I think shows the peak memory usage, which was 600 MiB
also for more clarity this is on flytekit==1.3.2 so part of this usage could be due to things like the tensorflow import which was removed in a later version
Yee
Swarup Srinivasan
04/18/2024, 7:17 PM
resource.RUSAGE_SELF
Pass to getrusage() to request resources consumed by the calling process, which is the sum of resources used by all threads in the process.
resource.RUSAGE_CHILDREN
Pass to getrusage() to request resources consumed by child processes of the calling process which have been terminated and waited for.
resource.RUSAGE_BOTH
Pass to getrusage() to request resources consumed by both the current process and child processes. May not be available on all systems.
resource.RUSAGE_THREAD
Pass to getrusage() to request resources consumed by the current thread. May not be available on all systems.
but I printed both SELF and CHILDREN after calling download_distribution
max_mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
max_sub = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
and I got 604744 kb for SELF and 322152 kb for CHILDREN
Yee
Swarup Srinivasan
04/18/2024, 7:21 PM
Swarup Srinivasan
04/18/2024, 7:22 PM
Yee
Yee
Yee
Swarup Srinivasan
04/18/2024, 7:34 PM
Yee
Yee
Yee
Swarup Srinivasan
04/18/2024, 7:35 PM
import resource

from flytekit.tools.fast_registration import download_distribution

# Download and extract the fast-registration tarball (arguments elided in the thread)
download_distribution(...)

# ru_maxrss is the peak resident set size; Linux reports it in kilobytes
max_mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
max_sub = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"Max memory used: {max_mem} kb")
print(f"Max memory children used: {max_sub} kb")
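A note on units for the snippet above, as a minimal sketch: on Linux ru_maxrss is reported in kilobytes, while macOS reports it in bytes, so numbers from a local macOS run and from the Linux container are not directly comparable without converting. The helper name maxrss_mib is hypothetical (it is not part of flytekit); it only normalizes the raw value to MiB.

```python
import resource
import sys


def maxrss_mib(raw: int, platform: str = sys.platform) -> float:
    """Convert a raw ru_maxrss value to MiB.

    Linux reports ru_maxrss in kilobytes; macOS ("darwin") reports bytes.
    """
    if platform == "darwin":
        return raw / (1024 * 1024)
    return raw / 1024


if __name__ == "__main__":
    # Peak RSS of this process, and of the largest waited-for child process.
    self_peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    child_peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    print(f"Peak RSS (self): {maxrss_mib(self_peak):.1f} MiB")
    print(f"Peak RSS (children): {maxrss_mib(child_peak):.1f} MiB")
```

With the Linux numbers from the thread, 604744 kb for SELF is about 590 MiB and 322152 kb for CHILDREN is about 315 MiB. Also worth keeping in mind: for RUSAGE_CHILDREN on Linux, ru_maxrss is the peak of the largest child, not a sum over the process tree, so the two values are not meant to be added together.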
Swarup Srinivasan
04/18/2024, 7:36 PM
"yeah i mean when it’s being run for real…"
yeah, the same python universe is used and I copied the s3 path from flyte console, so I do think everything important matches up 🤔
Yee
Yee
Yee
Yee
Yee
Swarup Srinivasan
04/18/2024, 7:38 PM
Swarup Srinivasan
04/18/2024, 7:39 PM
Yee
Yee
Yee
FLYTE_SDK_LOGGING_LEVEL=10 set in the environment variable?
Yee
Swarup Srinivasan
04/18/2024, 7:41 PM
Swarup Srinivasan
04/18/2024, 7:42 PM
Yee
Yee
Yee
Yee
Swarup Srinivasan
04/18/2024, 7:47 PM
Yee
Yee
Yee
Swarup Srinivasan
04/18/2024, 7:49 PM
Yee
Yee
Yee
Yee
Swarup Srinivasan
04/18/2024, 8:00 PM
Swarup Srinivasan
04/18/2024, 8:01 PM
Yee
Yee
Yee
Yee
Swarup Srinivasan
04/18/2024, 8:52 PM
FLYTE_SDK_LOGGING_LEVEL=10 and see if we get more logs in the task init phase when it OOMs
• create a task resolver that will print resource usage or just call echo instead of calling pyflyte-execute
◦ if the OOM-type memory behavior is observed, then we conclude the problem is before this call, otherwise it's after
◦ if it's after, we can debug this via memray wrapping the pyflyte-execute
◦ if it's before, not clear yet how we could debug this further atm
Yee
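The "print resource usage around the call" idea from the plan above could be sketched as a small wrapper that runs an arbitrary command as a child process and reports the child's peak RSS once it exits. This is only a sketch under assumptions: run_and_report is a hypothetical helper, not a flytekit API, and how it would be wired in as the container entrypoint in front of pyflyte-execute is left open.

```python
import resource
import subprocess
import sys
from typing import List, Tuple


def run_and_report(cmd: List[str]) -> Tuple[int, int]:
    """Run cmd as a child process and return (exit code, peak child RSS).

    RUSAGE_CHILDREN only covers children that have terminated and been
    waited for; subprocess.run waits for the child before returning.
    On Linux the ru_maxrss value is in kilobytes.
    """
    completed = subprocess.run(cmd)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    print(
        f"exit code: {completed.returncode}, peak child RSS: {peak}",
        file=sys.stderr,
    )
    return completed.returncode, peak


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python wrapper.py <command> <args...>
    exit_code, _ = run_and_report(sys.argv[1:])
    sys.exit(exit_code)
```

One limitation to keep in mind: the report is only printed if the wrapper itself survives. If the whole container is killed on OOM, nothing is printed, which would itself be a (weak) signal about where the memory went.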
Yee
Swarup Srinivasan
04/18/2024, 9:01 PM
Swarup Srinivasan
04/23/2024, 10:46 PM
FLYTE_SDK_LOGGING_LEVEL=10 which generated a few more logs but nothing that feels too relevant
• I also emitted an exception on load_task in the task resolver and I can confirm the excessive memory usage does occur before that point
these are the logs near the peak memory which was at 222500 to 222520 ish
{"asctime": "2024-04-19 22:24:44,690", "name": "flytekit", "levelname": "INFO", "message": "Exiting timed context: Copying (<s3 path to .tar.gz file> -> <local path>) [Wall Time: 27.35208011994837s, Process Time: 0.050577096000001376s]"}
tar: Removing leading `/' from member names
{"asctime": "2024-04-19 22:25:10,655", "name": "flytekit", "levelname": "INFO", "message": "Setting protocol to file"}
2024-04-19 22:25:23.048676: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Swarup Srinivasan
04/23/2024, 10:50 PM
...Completed 2.9 GiB/2.9 GiB (119.4 MiB/s)...
but that'd be strange if the tar was the cause since memory doesn't go up much locally for the download_distribution call
Swarup Srinivasan
04/23/2024, 10:54 PM
download_distribution code
• or somehow spin up a pod with the flyte env without automatically running the entrypoint, SSHing into the pod, and then profiling while running the entry point or perhaps even a PDB or something
Swarup Srinivasan
05/21/2024, 7:27 PM
Yee
Yee
Yee
Swarup Srinivasan
05/21/2024, 8:31 PM
Yee
Yee
Swarup Srinivasan
05/22/2024, 7:34 PM
David Espejo (he/him)
05/22/2024, 8:35 PM
"seems like there are two @David Espejo's here 👀)"
Yeah, I think the other one is an account I created long ago with another email. I'll remove it to avoid confusion
David Espejo (he/him)
05/22/2024, 8:35 PM