*Improving download speed of large FlyteFiles*: he...
# ask-the-community
q
Improving download speed of large FlyteFiles: hello dear community 👋 😄 For some jobs I use big FlyteFiles (>10GB) stored on S3. Flyte uses fsspec under the hood to download the file onto the EC2 instance, and I get about 15MB/s of download speed, so it takes more than 10 min to get the file. For the same file, instance, and bucket, boto3 achieves 200-300MB/s (roughly a 10x to 20x speed improvement). So I'd like to pass my own boto3-based downloader to the FlyteFile constructor. Has anyone done something like that? The documentation doesn't seem to describe this use case, and looking at this part of the code, I'm not sure it is possible.
I guess the simplest way is to recreate a FlyteFile with the right downloader and use that instead of the one provided by Flyte in the job (inspired by this). That's what I'll try in my next experiments... But I'm curious how other people in the community are handling this use case.
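For reference, a quick back-of-the-envelope check of those numbers. This is only a sketch: `transfer_seconds` is a made-up helper name, and it assumes 1 GB = 1000 MB and a constant transfer speed.

```python
def transfer_seconds(size_gb: float, speed_mb_per_s: float) -> float:
    """Seconds needed to move size_gb gigabytes at a constant speed (1 GB = 1000 MB)."""
    return size_gb * 1000 / speed_mb_per_s


# Compare the fsspec speed with the two boto3 speeds quoted above, for a 10 GB file
for speed in (15, 200, 300):
    minutes = transfer_seconds(10, speed) / 60
    print(f"{speed} MB/s -> {minutes:.1f} min")
```

So a 10 GB file takes about 11 minutes at 15 MB/s and under a minute at 200-300 MB/s.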
k
We actually have a new version of the fsspec S3 driver, and we have a Rust version too; both are way faster
Cc @Yi Chiu
q
Oh, interesting! Is it released? I'll have a look at the changelog.
k
q
Wow, so nice! I'll definitely test it.
For the moment I've found a manual workaround:
import os
from pathlib import Path

import boto3
from flytekit.types.file import FlyteFile


def faster_flyte_file(flyte_file: FlyteFile) -> FlyteFile:
    # Note: this relies on private flytekit attributes, which may change between versions
    uri: str = flyte_file._remote_source  # type: ignore
    local_path: str = flyte_file.path  # type: ignore

    def _downloader():
        # uri is expected to look like "s3://bucket/some/key"
        bucket = uri.split("/")[2]
        key = "/".join(uri.split("/")[3:])
        s3 = boto3.resource("s3")
        s3.Bucket(bucket).download_file(Key=key, Filename=local_path)

    # exist_ok=True avoids a FileExistsError if the directory is already there
    os.makedirs(Path(local_path).parent, exist_ok=True)
    flyte_file_with_s3_downloader = FlyteFile(path=local_path, downloader=_downloader)
    flyte_file_with_s3_downloader._remote_source = uri  # type: ignore
    flyte_file_with_s3_downloader.download()
    return flyte_file_with_s3_downloader

# then in your task do
flyte_file_fast = faster_flyte_file(flyte_file_slow)
On the small CPU worker I tested, the transfer speed increased from ~30MB/s to ~170MB/s.
@Ketan (kumare3) I'm curious about the design rationale for this plugin. What requirements made it necessary to reimplement something that seems to be already implemented in boto3 and/or fsspec? 🤔
Is the goal to improve performance while still sticking with the fsspec API?
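Side note on the workaround above: the bucket and key can also be extracted with the standard library instead of splitting on "/". A minimal sketch, where `split_s3_uri` is a hypothetical helper name and the key is assumed to contain no "?" or "#" (urlparse would treat those as query or fragment separators):

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/path/to/key' into (bucket, key)."""
    parsed = urlparse(uri)
    # Reject anything that is not an s3:// URI with a bucket name
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The path starts with "/", which is not part of the object key
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://my-bucket/data/big.bin")
print(bucket, key)
```

This fails loudly on a local path or an unexpected scheme instead of silently producing a wrong bucket name.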
The flytekit-async-fsspec plugin doesn't seem to be published on PyPI yet. Running
pip install flytekitplugins-async-fsspec
yields an error:
ERROR: Could not find a version that satisfies the requirement flytekitplugins-async-fsspec (from versions: none)
ERROR: No matching distribution found for flytekitplugins-async-fsspec
k
aah that means it is not yet fully released
we found a bug in s3fs
and fixing it in s3fs directly was non-trivial, so we decided to see how it performs when we re-implement it
we found that this was way faster, but there is a potential memory hit (higher memory requirement), so we did not make it the default yet
q
Yeah that's tricky
Thanks for the info 🙂
y
Hi @Quentin Chenevier There is an implementation flaw in s3fs that significantly slows down upload and download speeds. Consequently, I developed a plugin that uses the entire bandwidth to download and upload files while maintaining the same interface. This plugin inherits from s3fs and only overrides the get_file and put_file methods, so we didn't actually reimplement everything. You can see the speed improvement in the description of this PR.
I think the memory usage problem has been resolved. @Kevin Su Are we able to release it?
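To illustrate the "inherit and only override get_file" idea, here is a toy sketch. It is not the plugin's code and does not use s3fs: `ToyRemoteFS`, `ConcurrentRemoteFS`, and `read_range` are hypothetical names, the "remote" is an in-memory dict, and fetching byte ranges concurrently is just one assumed way to use more bandwidth.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


class ToyRemoteFS:
    """Hypothetical stand-in for a remote filesystem class (not s3fs)."""

    def __init__(self, blobs: dict):
        self.blobs = blobs  # maps remote path -> bytes

    def size(self, rpath: str) -> int:
        return len(self.blobs[rpath])

    def read_range(self, rpath: str, start: int, end: int) -> bytes:
        # Simulates one request for bytes [start, end)
        return self.blobs[rpath][start:end]

    def get_file(self, rpath: str, lpath: str, chunk_size: int = 4) -> None:
        # Baseline: fetch the chunks one after another
        with open(lpath, "wb") as f:
            for start in range(0, self.size(rpath), chunk_size):
                f.write(self.read_range(rpath, start, start + chunk_size))


class ConcurrentRemoteFS(ToyRemoteFS):
    """Same interface as the parent; only get_file is overridden."""

    def get_file(self, rpath: str, lpath: str, chunk_size: int = 4, max_workers: int = 8) -> None:
        size = self.size(rpath)
        starts = range(0, size, chunk_size)
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map yields results in input order, so chunks are written in order
            chunks = pool.map(
                lambda s: self.read_range(rpath, s, min(s + chunk_size, size)), starts
            )
            with open(lpath, "wb") as f:
                for chunk in chunks:
                    f.write(chunk)


fs = ConcurrentRemoteFS({"bucket/key": b"0123456789abcdef!"})
out_path = os.path.join(tempfile.mkdtemp(), "out.bin")
fs.get_file("bucket/key", out_path)
```

In this toy version, fetched chunks can sit in memory while waiting to be written, which shows how concurrency can trade memory for speed.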
k