
Mike Ossareh

05/10/2023, 2:45 PM
We’ve isolated that workloads under flytekit 1.5.0 have much worse memory profiles than workloads under flytekit 1.4.2. Tasks which work just fine with
requests.memory=2Gi, limits.memory=2Gi
under flytekit 1.4.2 fail under 1.5.0. We bumped these tasks to
requests.memory=64Gi, limits.memory=64Gi
and they succeed under 1.5.0. Here are two graphs that illustrate the difference in RAM usage. The k8s request differences are listed on the graphs. Everything else (inputs, etc.) is the same. The only difference is flytekit 1.4.2 vs 1.5.0. What changed?
An important observation here is that the data being fetched is approx. 12Gb - so the best I can come up with is that s3fs is holding the data in memory, whereas in the past the fetch mechanism simply downloaded the data and made it available on the filesystem.
If that turns out to be the case, it would be helpful to know whether there are controls over this behavior. Our data is in the Tb’s, but we’re able to process it in a way that only requires loading Gb’s into RAM.
I’ve opened a bug on github: https://github.com/flyteorg/flyte/issues/3665
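For reference, the requests/limits above map onto per-task Resources in flytekit; a minimal sketch (the task body here is hypothetical):

from flytekit import task, Resources

@task(requests=Resources(mem="2Gi"), limits=Resources(mem="2Gi"))
def fetch_and_process() -> int:
    # hypothetical body: fetch ~12Gb of inputs and process them
    # (this memory budget is enough under flytekit 1.4.2, OOMs under 1.5.0)
    return 0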

Ketan (kumare3)

05/10/2023, 3:02 PM
Hmm this is interesting
Can you try using FlyteFile streaming
Or write to a file
Cc @Kevin Su / @Yee
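A minimal sketch of what FlyteFile streaming might look like here, assuming the FlyteFile.open() streaming interface available in flytekit 1.5+ (the local path and chunk size are hypothetical):

from flytekit import task
from flytekit.types.file import FlyteFile

@task
def stream_copy(ff: FlyteFile) -> int:
    # Read the remote file in chunks instead of materializing it all in memory.
    total = 0
    with ff.open("rb") as src, open("/tmp/local_copy", "wb") as dst:
        for chunk in iter(lambda: src.read(16 * 1024 * 1024), b""):
            dst.write(chunk)
            total += len(chunk)
    return total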

Mike Ossareh

05/10/2023, 3:04 PM
The devs that caught this were up until the wee hours of the morning investigating the issue - I’ll ask them to look into this when they’re online.
It’s unlikely to be a “quick” test for us due to the path we’ve taken to go from our legacy pipeline solution to flyte.

Ketan (kumare3)

05/10/2023, 3:05 PM
Sure, but I don’t follow the problem - we have not seen it and will have to reproduce it - happy to suggest a fix

Mike Ossareh

05/10/2023, 3:06 PM
understood

Thomas Blom

05/10/2023, 3:07 PM
@Ketan (kumare3) - for some context, the task that is consuming so much memory (and under 1.4.2 does not) is setting the folder on a FlyteDirectory. In a downstream task, this FlyteDirectory will be used to download the data to the compute node. But we aren't even getting to the downstream task. Setting the folder on the FlyteDirectory appears to be causing something, presumably the large amount of data, to get loaded into memory.

Ketan (kumare3)

05/10/2023, 3:07 PM
Interesting it’s the directory
Ideally it should be consuming low memory, please add a code sample to the issue

Thomas Blom

05/10/2023, 3:09 PM
Maybe it's clearest for me to create a super-simple repro case to isolate this to FlyteDirectory. It's just a theory at present. 🙂
Below is the workflow that employs two tasks. The only thing that is done:
1. In task1, create a FlyteDirectory object, passing it a folder on a networked (EFS) filesystem.
2. In task2, consume the FlyteDirectory and download the folder.
The question is, why does flyte require so much memory for task1? The folder in question is 19G of data. Creating the FlyteDirectory object consumes this much memory in flytekit 1.5. On the other hand, downloading the data only consumes ~2G.
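Roughly, the two tasks look like this - a sketch only; the EFS path, task names, and file-count return value are hypothetical:

import os

from flytekit import task, workflow, Resources
from flytekit.types.directory import FlyteDirectory

EFS_DATA_DIR = "/mnt/efs/dataset"  # hypothetical folder on the networked filesystem, ~19G of data

@task(requests=Resources(mem="2Gi"), limits=Resources(mem="2Gi"))
def make_directory() -> FlyteDirectory:
    # Only wraps the existing folder; flytekit uploads its contents to blob
    # storage when the task returns.
    return FlyteDirectory(EFS_DATA_DIR)

@task(requests=Resources(mem="2Gi"), limits=Resources(mem="2Gi"))
def consume_directory(d: FlyteDirectory) -> int:
    local_path = d.download()  # materialize the folder on this compute node
    return sum(len(files) for _, _, files in os.walk(local_path))

@workflow
def repro() -> int:
    return consume_directory(d=make_directory())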
Blue line is task 1 (only creates a FlyteDirectory object, causing the data to be uploaded to S3); purple is task 2 (downloads the data to the node).
I have a more specific conclusion to add here based on another test:
1. flytekit 1.4.2 will also consume more memory (not as much as above, but still ~12G in this example) if that memory is available to the pod as specified in a task-based Resource request.
2. However, on this same 19G dataset, flytekit 1.4.2 will happily create the FlyteDirectory object and upload this data even with a 2Gi memory request and limit (see attached).
3. flytekit 1.5.0 will not -- it will OOM unless the pod has memory available on par with the size of the data folder you are trying to upload.
@Ketan (kumare3) Let me know if you can verify the above. We're planning to roll back to flytekit 1.4.2 until this is resolved.

Yee

05/10/2023, 9:21 PM
hey @Thomas Blom can you confirm which versions of fsspec and s3fs you have?
also when you were on 1.4.2 did you have the flytekit-data-fsspec plugin installed as well?
we just tried locally (on an arm mac admittedly) uploading 10G and 1G files. we are unable to repro.

Thomas Blom

05/10/2023, 9:38 PM
From the image/container I'm running this test in:
>>> import fsspec
>>> fsspec.__version__
'2023.5.0'
I'm not sure how to get the version of s3fs, and I don't know if we are using flytekit-data-fsspec - I don't see this in our dependencies.
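For what it's worth, a quick way to check all three from inside the container (a sketch; s3fs may simply not be installed in the 1.4.2 image):

import importlib.metadata as md

for pkg in ("flytekit", "fsspec", "s3fs"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")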

Mike Ossareh

05/10/2023, 9:39 PM
root@078cd129a8a3:/app# pip list | grep -E '(fsspec|flytekit|s3fs)'
flytekit                 1.4.2
flytekitplugins-pod      1.4.2
fsspec                   2023.5.0
(same container that Thomas is in fwiw)

Yee

05/10/2023, 9:40 PM
pip show s3fs too?
just to be sure

Mike Ossareh

05/10/2023, 9:40 PM
root@078cd129a8a3:/app# pip show s3fs
WARNING: Package(s) not found: s3fs

Yee

05/10/2023, 9:41 PM
can you run `which aws`? (in the container)

Thomas Blom

05/10/2023, 9:41 PM
here is from the 1.5.0 image:
root@e28c57ef87a7:/app# pip list | grep -E '(fsspec|flytekit|s3fs)'
flytekit                 1.5.0
flytekitplugins-pod      1.5.0
fsspec                   2023.5.0
s3fs                     2023.5.0

Mike Ossareh

05/10/2023, 9:42 PM
1.4.2 image:
root@078cd129a8a3:/app# which aws
/app/venv/bin/aws
root@078cd129a8a3:/app# aws --version
aws-cli/1.27.132 Python/3.9.16 Linux/5.10.178-162.673.amzn2.x86_64 botocore/1.29.132

Yee

05/10/2023, 9:42 PM
so in 1.4.2, if you didn’t have the flytekit-data-fsspec plugin, you would default to the aws cli.
in 1.5, we defaulted to using fsspec (and added it as a dependency)
the reason for this is that the default flytekit image has had the flytekit-data-fsspec plugin for a while, so we’ve been testing it for quite some time
and we’ve not seen any issues.
yeah testing with 2023.5.0 isn’t showing more than 270MB of memory usage
but still on the arm machine. i think we need to move to amd

Thomas Blom

05/10/2023, 9:45 PM
We can do some more testing here. I'm curious, though - you said you'd uploaded 1G and 10G files -- is this using FlyteFile, or FlyteDirectory?

Yee

05/10/2023, 9:45 PM
it was done with FlyteDirectory, and directly with fsspec
yeah I still can’t repro - now using an amd/eks cluster
ran this workflow:
import time
import subprocess
from flytekit import task, workflow, Resources
from flytekit.types.directory import FlyteDirectory


@task(requests=Resources(mem="1Gi"), limits=Resources(mem="1Gi"))
def waiter_task(a: int) -> str:
    if a == 0:
        time.sleep(86400)
    else:
        time.sleep(a)
    return "hello world"


@task(requests=Resources(mem="1Gi"), limits=Resources(mem="1Gi"))
def dd_and_upload() -> FlyteDirectory:
    # count=0 with seek=10G creates a sparse 10GB file rather than writing 10GB of random bytes
    command = ["dd", "if=/dev/random", "of=/root/temp_10GB_file", "bs=1", "count=0", "seek=10G"]
    subprocess.run(command)
    # note: the path passed to FlyteDirectory here is a single file, not a directory
    return FlyteDirectory("/root/temp_10GB_file")


@workflow
def waiter(a: int = 0) -> str:
    return waiter_task(a=a)


@workflow
def uploader() -> FlyteDirectory:
    return dd_and_upload()
first time i exec’ed in, created the file, and uploaded it via a separate script
second attempt was with the second workflow - confirmed the 10gb file now sitting in s3
monitoring it on the side, memory usage never went above 250MB

Thomas Blom

05/10/2023, 10:28 PM
Ok, thanks for your help. I'll see what more I can find out over here.

Yee

05/10/2023, 10:28 PM
🙏
let us know

Thomas Blom

05/10/2023, 10:42 PM
@Yee to be clear -- this last test you did with flytekit 1.5.0? Or with 1.4.2 with fsspec installed?

Yee

05/10/2023, 10:42 PM
neither
1.6.0b4
🙂

Thomas Blom

05/10/2023, 10:44 PM
lol

Mike Ossareh

05/11/2023, 12:06 AM
Thanks for testing @Yee 👍🏼

Thomas Blom

05/11/2023, 2:16 AM
Hey @Yee, sadly I haven't come to any clear-cut demonstration of the problem after many hours of testing. It seems to be the case that pods running together on a node put memory pressure on each other, and the result is hard to characterize. Mostly, I've had LOTS of file copies fail in my testing, but it has happened with both flytekit 1.4.2 and 1.5.0. The one common trend I have noticed, when trying to replicate the pattern from our real workflows, is that uploading via a FlyteDirectory fails much more consistently due to OOM when operating on LOTS of subfolders/files. So instead of a single 16G file, I did tests with several subfolders containing 1000 files each. I'll let you know if we find anything further.
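A sketch of the kind of directory tree used for that test - the subfolder counts, file counts, and file sizes here are illustrative, not the exact values from our runs:

import os

def make_tree(root: str, subdirs: int = 10, files_per_dir: int = 1000, size_mb: int = 2) -> None:
    # Build a tree of many small files, which is closer to our real data than
    # one large file and is where the OOMs show up most consistently.
    payload = os.urandom(size_mb * 1024 * 1024)
    for d in range(subdirs):
        sub = os.path.join(root, f"part_{d:03d}")
        os.makedirs(sub, exist_ok=True)
        for f in range(files_per_dir):
            with open(os.path.join(sub, f"file_{f:05d}.bin"), "wb") as fh:
                fh.write(payload)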

Ketan (kumare3)

05/11/2023, 3:15 AM
I think it might be the parallelism in fsspec now, whereas the older path was serial. This is a speed vs. memory trade-off, maybe.
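If that's the cause, one knob that may help is capping how many files fsspec transfers concurrently. A sketch only, under the assumption that fsspec/s3fs 2023.5.0 bulk transfers accept a batch_size argument (worth verifying against the installed version); the bucket and path names are hypothetical:

import s3fs

fs = s3fs.S3FileSystem()
# batch_size (assumed kwarg) limits how many files are uploaded at once,
# trading throughput for a smaller memory footprint
fs.put("/mnt/efs/dataset", "s3://my-bucket/dataset/", recursive=True, batch_size=4)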

Thomas Blom

05/11/2023, 2:49 PM
This sounds right, thanks @Ketan (kumare3)