# flyte-support
m
Hey Flyte Community! We are experiencing issues using `FlyteDirectory` to download an S3-based folder to the compute node during a task. The S3 folder contains ~45,000 files of 8 MB each, for a total of ~350 GB. Consistently, just after 15 minutes, we start getting lots of 4xx errors, and the Flyte task fails with this error:
```
Traceback (most recent call last):

      File "/app/venv/lib/python3.9/site-packages/flytekit/exceptions/scopes.py", line 242, in user_entry_point
        return wrapped(*args, **kwargs)
      File "/app/venv/lib/python3.9/site-packages/plaster/run/ims_import/ims_import_task.py", line 193, in ims_import_flyte_task
        local.path(src_dir.download()),
      File "/app/venv/lib/python3.9/site-packages/flytekit/types/directory/types.py", line 258, in download
        return self.__fspath__()
      File "/app/venv/lib/python3.9/site-packages/flytekit/types/directory/types.py", line 150, in __fspath__
        self._downloader()
      File "/app/venv/lib/python3.9/site-packages/flytekit/types/directory/types.py", line 482, in _downloader
        return ctx.file_access.get_data(uri, local_folder, is_multipart=True, batch_size=batch_size)
      File "/app/venv/lib/python3.9/site-packages/flytekit/core/data_persistence.py", line 521, in get_data
        raise FlyteAssertion(

Message:

    FlyteAssertion: USER:AssertionError: error=Failed to get data from s3://<path-redacted> (recursive=True).

Original exception: Access Denied

User error.
```
"Access Denied" doesn't really make sense in the basic sense of that term, in that the folder is a flat folder of 45k files that all have the same permissions. We suspect some kind of rate/request limit or throttling but haven't been able to confirm this with logs on AWS (we're adding more detailed logging to the bucket -- at present we only know many "4xx" errors, as reported by AWS RequestMetrics, occur around the time of the flyte task failure). We are using the flyte-binary-default authType="iam" for the pod, and the keys generated should be good for the default of 1 hour, though we're also experimenting with providing a specific key. We are running flyte-binary 1.13.3 and using flytekit 1.12.0. Thanks in advance for any insights!
t
I assume everything works if you try with a smaller folder?
m
Yeah - we use FlyteDirectory all the time for tens of GB of data like this...
t
for directories, the flytekit sdk completely offloads the work to an fsspec (i.e. s3fs) call.
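Roughly, the recursive directory download reduces to an s3fs call along these lines (a sketch assuming s3fs is the installed fsspec implementation for s3://; the bucket, prefix, and local path are placeholders):
```
import s3fs

# Credentials are resolved the same way boto3 would resolve them (e.g. IRSA).
fs = s3fs.S3FileSystem()
# Recursive copy of an entire prefix to local disk, analogous to the
# ctx.file_access.get_data(..., is_multipart=True) call in the traceback above.
fs.get("s3://my-bucket/some/prefix/", "/tmp/local_dir/", recursive=True)
```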
m
Ok, we'll keep digging on the logging etc. We thought perhaps there were some client-side session-timeouts configured somewhere that we'd missed or something.
t
sorry got pulled away. are there any other error messages you can offer?
I know it’s also possible to configure the number of coroutines fsspec is running, but more information from AWS would probably be even more helpful to guide what to tweak
m
Hey @thankful-minister-83577, I got pulled away as well. I'll check with our devops folks to see if the more detailed logging on the AWS bucket has revealed any more specific errors. And yeah, I wondered about the annotated BatchSize that can be used with FlyteDirectory -- we're not specifying anything in this case, so we're using whatever the default is.
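For reference, my understanding is that the batch size is attached to the type via typing.Annotated, roughly like this (1000 is purely an illustrative value, not a recommendation):
```
from typing import Annotated

from flytekit import BatchSize, task
from flytekit.types.directory import FlyteDirectory


@task
def ingest_images(src_dir: Annotated[FlyteDirectory, BatchSize(1000)]) -> None:
    # Transfers of this directory are chunked into batches of up to 1000 files
    # instead of one monolithic recursive download.
    src_dir.download()
```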
b
Hey @thankful-minister-83577! I work with @microscopic-furniture-57275 and can provide more details on the above issue (sorry for the late reply). We enabled the S3 access logs and can see the following 403 Access Denied on the S3 bucket:
```
<bucket_name> [31/Jan/2025:09:35:02 +0000] 10.0.46.199 - G5SXVSN2BXTED7PV REST.GET.BUCKET - "GET /?list-type=2&prefix=beta001%2F2025%2F2025_01%2Fb1_250109%2Fb1_esn_250109_1530c_01_bsp002%2Fb1_esn_250109_1530c_01_bsp002%2F&delimiter=&encoding-type=url HTTP/1.1" 403 AccessDenied 275 - 16 - "-" "Botocore/1.29.161 Python/3.9.21 Linux/5.10.230-223.885.amzn2.x86_64" - 4Bgf98LRg7mUjMa9m5GiNnj2BAXZksMeL7EAr7tGprM8et6p5i+gh81jGyhb9qjPqowOaELUpVeMbNlTDbWhDtF9mpxEIQO+kdBWeh/rPYQ= - TLS_AES_128_GCM_SHA256 - <http://erisyon-acquire.s3.amazonaws.com|erisyon-acquire.s3.amazonaws.com> TLSv1.3 - -
There is a valid token attached to the pod (we use IRSA). Any idea how we could proceed to debug this? Could this be anything related to `s3fs` and some parameters we can tweak there? Thanks for your support!
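One way to check whether the pod's IRSA credentials are still valid at a given point in the task would be to log the STS caller identity from inside it; a debugging sketch (the helper name is made up):
```
import boto3


def log_caller_identity() -> None:
    """Hypothetical debugging helper: print which AWS principal the pod's
    current credentials resolve to. If the session has expired or gone
    anonymous, this call raises instead of returning an assumed-role ARN."""
    sts = boto3.client("sts")
    print(sts.get_caller_identity()["Arn"])
```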
m
@thankful-minister-83577 - a bit more info on this issue, which we've solved for now. AWS was able to look at internal logs and determined that after 15 minutes the HTTP requests to the S3 bucket became anonymous, most likely because some session timed out -- the requests continued, but were anonymous (unlike during the first 15 minutes), which caused the Access Denied errors. Our keys generated from AWS roles are set to be good for 1 hour, so we think it's possible that something else behind the scenes (e.g. when flytekit calls s3fs/fsspec) is using some default that causes some aspect of the session to time out at 15 minutes. We fixed this by allowing anonymous requests to this bucket, which is still secure because requests must come through our own VPC endpoint. Hopefully we can address this more specifically at some point. We have been using flytekit 1.12.0, which pins an even older version of s3fs/fsspec as a dependency, so maybe this will also fix itself when we upgrade to flytekit 1.15.0, which we are in the process of doing, but we're running into some serialization issues discussed in a separate message. @blue-salesmen-88843 feel free to elaborate or correct me if I've misspoken!
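For anyone landing here, a bucket policy achieving that (anonymous reads allowed only via a specific VPC endpoint) would look roughly like the following sketch; the bucket name and endpoint ID are placeholders, not the actual values used:
```
import json

import boto3

BUCKET = "my-bucket"                         # placeholder
VPC_ENDPOINT_ID = "vpce-0123456789abcdef0"   # placeholder

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowReadsOnlyViaOurVpcEndpoint",
            "Effect": "Allow",
            "Principal": "*",  # anonymous is allowed...
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
            # ...but only for requests arriving through this VPC endpoint.
            "Condition": {"StringEquals": {"aws:SourceVpce": VPC_ENDPOINT_ID}},
        }
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```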
t
Could you open an issue with this actually? I’d like to keep track of this.
thanks for digging into this with aws.
but it still doesn’t really make sense to me and I haven’t seen this behavior before. with every call we end up calling fsspec, so this has to be something on that layer.
like it must be managing sessions incorrectly or something… even before upgrading flytekit, it may be worth seeing if the problem goes away with different boto/fsspec/s3fs versions
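Along those lines, a quick way to record which transfer-stack versions a task image actually ships, before experimenting with different pins, is something like:
```
from importlib.metadata import PackageNotFoundError, version

# Log the S3 transfer stack baked into the task image before experimenting
# with different pins of boto/fsspec/s3fs.
for pkg in ("flytekit", "fsspec", "s3fs", "boto3", "botocore", "aiobotocore"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} not installed")
```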