# ask-the-community
k
Hi Community, we are experiencing problems when uploading or downloading FlyteDirectories to and from our self-managed S3 store (Ceph). We sporadically get errors like this:
Original exception: [Errno 5] An error occurred () when calling the PutObject operation: , cause=[Errno 5] An error occurred () when calling the PutObject operation:
The error seems to come from fsspec/s3fs; deep inside botocore, a 408 Request Time-out can be observed. However, I suspect neither botocore nor our infrastructure to be the cause, because the error could not be reproduced with the AWS CLI. Has anyone else encountered such issues with FlyteDirectories? I was able to reproduce the error with the following code; the S3 directory contained roughly 300 files of about 300 kB each.
import multiprocessing
import os
import tempfile

from s3fs import S3FileSystem

# Credentials and endpoint for the self-managed S3 store (Ceph)
s3_endpoint = os.environ.get("S3_ENDPOINT") or os.environ.get("FSSPEC_S3_ENDPOINT_URL")
s3_access_key = os.environ.get("AWS_ACCESS_KEY_ID") or os.environ.get("FSSPEC_S3_KEY")
s3_secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY") or os.environ.get("FSSPEC_S3_SECRET")

src = "<s3://path/to/folder>"


def download_folder(src, dst):
    # Each process builds its own S3FileSystem so connections are not shared
    fs = S3FileSystem(
        key=s3_access_key,
        secret=s3_secret_key,
        client_kwargs={"endpoint_url": s3_endpoint},
    )
    try:
        fs.get(src, dst, recursive=True)
    except Exception as exc:
        print(str(exc))


# Guard is required for multiprocessing on platforms that use the spawn start method
if __name__ == "__main__":
    temp_dir = tempfile.mkdtemp()
    processes = []
    # Launch 500 concurrent downloads of the same folder to stress the endpoint
    for i in range(500):
        dst = os.path.join(temp_dir, str(i))
        process = multiprocessing.Process(target=download_folder, args=(src, dst))
        processes.append(process)
        process.start()

    for process in processes:
        process.join()
k
Timeout is possible - you can add retries
I mean retries for boto - this can be done by env vars
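For reference, botocore reads retry settings from environment variables, so this can be configured on the task pods without code changes (variable names are botocore's documented ones; the values here are illustrative):

```shell
# Maximum total attempts per API call (initial request + retries)
export AWS_MAX_ATTEMPTS=10
# Retry mode: "legacy", "standard", or "adaptive"
export AWS_RETRY_MODE=standard
```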
k
Thanks Ketan, I assume you mean something like this?
config_kwargs={
    "connect_timeout": 86400,
    "read_timeout": 86400,
    "retries": {
        "total_max_attempts": 100,
        "max_attempts": 100,
        "mode": "standard",
    },
    "tcp_keepalive": True,
}
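For context, this is where those settings plug in: s3fs forwards its `config_kwargs` argument to `botocore.config.Config` when it builds the client. A minimal sketch, assuming the same endpoint and credential environment variables as in the repro script above:

```python
import os

# Retry/timeout settings that s3fs forwards to botocore.config.Config
config_kwargs = {
    "connect_timeout": 86400,
    "read_timeout": 86400,
    "retries": {
        "total_max_attempts": 100,
        "max_attempts": 100,
        "mode": "standard",
    },
    "tcp_keepalive": True,
}


def make_fs():
    # Imported lazily so the settings above can be inspected without s3fs installed
    from s3fs import S3FileSystem

    return S3FileSystem(
        key=os.environ.get("AWS_ACCESS_KEY_ID"),
        secret=os.environ.get("AWS_SECRET_ACCESS_KEY"),
        client_kwargs={"endpoint_url": os.environ.get("S3_ENDPOINT")},
        config_kwargs=config_kwargs,
    )
```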
Unfortunately, this doesn't fix the problem. This is the message I am extracting from somewhere inside botocore; it looks like botocore doesn't even retry (RetryAttempts: 0), even though I can confirm from another log message that the config parameters have been set properly.
Copy code
{'Error': {'Message': '', 'Code': ''}, 'body': {'h1': '408 Request Time-out'}, 'ResponseMetadata': {'HTTPStatusCode': 408, 'HTTPHeaders': {'content-length': '110', 'cache-control': 'no-cache', 'content-type': 'text/html', 'connection': 'close'}, 'RetryAttempts': 0}}
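To see whether the retry machinery even engages, the standard library logging module can surface botocore's internal request/response cycle (a sketch; `botocore` is the library's root logger name):

```python
import logging

# Turn on verbose botocore logging so request/response cycles and
# retry decisions show up in the process output.
logging.basicConfig(level=logging.INFO)
logging.getLogger("botocore").setLevel(logging.DEBUG)
```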
k
Sad fsspec - you are using s3fs, I assume?
k
yes, exactly