# ask-the-community
k
Hi Community, we are experiencing problems when uploading or downloading FlyteDirectories to and from our self-managed S3 store (Ceph). We sporadically get errors like this:
Original exception: [Errno 5] An error occurred () when calling the PutObject operation: , cause=[Errno 5] An error occurred () when calling the PutObject operation:
The error seems to come from fsspec/s3fs; deep inside botocore, a 408 Request Time-out can be observed. However, I suspect neither botocore nor our infrastructure to be the cause, because the error could not be reproduced with the AWS CLI. Has anyone else encountered such issues with FlyteDirectories? I was able to reproduce the error with the following code; the S3 directory contained roughly 300 files of about 300 kB each.
import multiprocessing
import os
import tempfile

from s3fs import S3FileSystem

# Credentials and endpoint for the self-managed S3 store (Ceph)
s3_endpoint = os.environ.get("S3_ENDPOINT") or os.environ.get("FSSPEC_S3_ENDPOINT_URL")
s3_access_key = os.environ.get("AWS_ACCESS_KEY_ID") or os.environ.get("FSSPEC_S3_KEY")
s3_secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY") or os.environ.get("FSSPEC_S3_SECRET")

src = "<s3://path/to/folder>"


def download_folder(src, dst):
    # Each process builds its own S3FileSystem so connections are not shared
    fs = S3FileSystem(
        key=s3_access_key,
        secret=s3_secret_key,
        client_kwargs={"endpoint_url": s3_endpoint},
    )
    try:
        fs.get(src, dst, recursive=True)
    except Exception as exc:
        print(str(exc))


# Guard is required for multiprocessing on platforms that use the spawn start method
if __name__ == "__main__":
    temp_dir = tempfile.mkdtemp()
    processes = []
    # Launch 500 concurrent downloads of the same folder to stress the endpoint
    for i in range(500):
        dst = os.path.join(temp_dir, str(i))
        process = multiprocessing.Process(target=download_folder, args=(src, dst))
        processes.append(process)
        process.start()

    for process in processes:
        process.join()
k
Timeout is possible - you can add retries
I mean retries for boto - this can be done by env vars
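For reference, botocore reads retry settings from environment variables, so this can be configured on the task pods without code changes (variable names are botocore's documented ones; the values here are illustrative):

```shell
# Maximum total attempts per API call (initial request + retries)
export AWS_MAX_ATTEMPTS=10
# Retry mode: "legacy", "standard", or "adaptive"
export AWS_RETRY_MODE=standard
```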
k
Thanks Ketan, I assume you mean something like this?
config_kwargs={
    "connect_timeout": 86400,
    "read_timeout": 86400,
    "retries": {
        "total_max_attempts": 100,
        "max_attempts": 100,
        "mode": "standard",
    },
    "tcp_keepalive": True,
}
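For context, this is where those settings plug in: s3fs forwards its `config_kwargs` argument to `botocore.config.Config` when it builds the client. A minimal sketch, assuming the same endpoint and credential environment variables as in the repro script above:

```python
import os

# Retry/timeout settings that s3fs forwards to botocore.config.Config
config_kwargs = {
    "connect_timeout": 86400,
    "read_timeout": 86400,
    "retries": {
        "total_max_attempts": 100,
        "max_attempts": 100,
        "mode": "standard",
    },
    "tcp_keepalive": True,
}


def make_fs():
    # Imported lazily so the settings above can be inspected without s3fs installed
    from s3fs import S3FileSystem

    return S3FileSystem(
        key=os.environ.get("AWS_ACCESS_KEY_ID"),
        secret=os.environ.get("AWS_SECRET_ACCESS_KEY"),
        client_kwargs={"endpoint_url": os.environ.get("S3_ENDPOINT")},
        config_kwargs=config_kwargs,
    )
```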
Unfortunately, this doesn't fix the problem. This is the message I am extracting from somewhere inside botocore; it looks like botocore doesn't even retry (RetryAttempts: 0), even though I can confirm from another log message that the config parameters have been set properly.
Copy code
{'Error': {'Message': '', 'Code': ''}, 'body': {'h1': '408 Request Time-out'}, 'ResponseMetadata': {'HTTPStatusCode': 408, 'HTTPHeaders': {'content-length': '110', 'cache-control': 'no-cache', 'content-type': 'text/html', 'connection': 'close'}, 'RetryAttempts': 0}}
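To see whether the retry machinery even engages, the standard library logging module can surface botocore's internal request/response cycle (a sketch; `botocore` is the library's root logger name):

```python
import logging

# Turn on verbose botocore logging so request/response cycles and
# retry decisions show up in the process output.
logging.basicConfig(level=logging.INFO)
logging.getLogger("botocore").setLevel(logging.DEBUG)
```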
k
Sad fsspec - you are using s3fs, I assume?
k
yes, exactly