microscopic-furniture-57275
01/28/2025, 8:11 PM
We are using FlyteDirectory to download an s3-based folder to the compute node during a task. The s3 folder contains ~45,000 files of 8 MB each, for a total of ~350 GB. Consistently, just after 15 minutes, we start getting lots of 4xx errors and the Flyte task fails with this error:
Traceback (most recent call last):
  File "/app/venv/lib/python3.9/site-packages/flytekit/exceptions/scopes.py", line 242, in user_entry_point
    return wrapped(*args, **kwargs)
  File "/app/venv/lib/python3.9/site-packages/plaster/run/ims_import/ims_import_task.py", line 193, in ims_import_flyte_task
    local.path(src_dir.download()),
  File "/app/venv/lib/python3.9/site-packages/flytekit/types/directory/types.py", line 258, in download
    return self.__fspath__()
  File "/app/venv/lib/python3.9/site-packages/flytekit/types/directory/types.py", line 150, in __fspath__
    self._downloader()
  File "/app/venv/lib/python3.9/site-packages/flytekit/types/directory/types.py", line 482, in _downloader
    return ctx.file_access.get_data(uri, local_folder, is_multipart=True, batch_size=batch_size)
  File "/app/venv/lib/python3.9/site-packages/flytekit/core/data_persistence.py", line 521, in get_data
    raise FlyteAssertion(
Message:
FlyteAssertion: USER:AssertionError: error=Failed to get data from s3://<path-redacted> (recursive=True).
Original exception: Access Denied
User error.
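For context, the call at ims_import_task.py line 193 in the traceback boils down to roughly this (a simplified sketch of our task; the real code does more and the signature here is illustrative):

from flytekit import task
from flytekit.types.directory import FlyteDirectory

@task
def ims_import_flyte_task(src_dir: FlyteDirectory) -> None:
    # download() materializes the whole s3:// prefix (~45k objects of ~8 MB each)
    # onto the node; this is the call that raises the FlyteAssertion above
    local_dir = src_dir.download()
    # ... the actual import work then runs against local_dir ...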
"Access Denied" doesn't really make sense in the basic sense of that term, in that the folder is a flat folder of 45k files that all have the same permissions. We suspect some kind of rate/request limit or throttling but haven't been able to confirm this with logs on AWS (we're adding more detailed logging to the bucket -- at present we only know many "4xx" errors, as reported by AWS RequestMetrics, occur around the time of the flyte task failure).
We are using the flyte-binary-default authType="iam" for the pod, and the keys generated should be good for the default of 1 hour, though we're also experimenting with providing a specific key.
We are running flyte-binary 1.13.3 and using flytekit 1.12.0.
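For what it's worth, the batch_size in the get_data() frame above appears to come from flytekit's BatchSize annotation on the FlyteDirectory type; something like the following is what we'd try if smaller batches turn out to matter (sketch only, we haven't confirmed it changes the behavior in our case):

from typing import Annotated

from flytekit import BatchSize, task
from flytekit.types.directory import FlyteDirectory

@task
def ims_import_flyte_task(
    # BatchSize(100) asks flytekit to download the directory 100 files at a time
    src_dir: Annotated[FlyteDirectory, BatchSize(100)],
) -> None:
    local_dir = src_dir.download()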
Thanks in advance for any insights!

thankful-minister-83577
microscopic-furniture-57275
01/30/2025, 6:08 PM
thankful-minister-83577
microscopic-furniture-57275
01/30/2025, 6:10 PM
thankful-minister-83577
microscopic-furniture-57275
01/31/2025, 3:04 AM
blue-salesmen-88843
02/18/2025, 10:51 AM
<bucket_name> [31/Jan/2025:09:35:02 +0000] 10.0.46.199 - G5SXVSN2BXTED7PV REST.GET.BUCKET - "GET /?list-type=2&prefix=beta001%2F2025%2F2025_01%2Fb1_250109%2Fb1_esn_250109_1530c_01_bsp002%2Fb1_esn_250109_1530c_01_bsp002%2F&delimiter=&encoding-type=url HTTP/1.1" 403 AccessDenied 275 - 16 - "-" "Botocore/1.29.161 Python/3.9.21 Linux/5.10.230-223.885.amzn2.x86_64" - 4Bgf98LRg7mUjMa9m5GiNnj2BAXZksMeL7EAr7tGprM8et6p5i+gh81jGyhb9qjPqowOaELUpVeMbNlTDbWhDtF9mpxEIQO+kdBWeh/rPYQ= - TLS_AES_128_GCM_SHA256 - erisyon-acquire.s3.amazonaws.com TLSv1.3 - -
There is a valid token attached to the pod (we use IRSA). Any idea how we could proceed to debug this? Could this be anything related to s3fs and some parameters we can tweak there? Thanks for your support!
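In the meantime, one thing we could try from inside the task pod is replaying the same ListObjectsV2 request with plain boto3, to check whether it hits the same 403 and which credentials are actually in use (rough sketch; BUCKET and PREFIX stand in for the redacted values in the log line above):

import boto3
from botocore.config import Config

BUCKET = "<bucket_name>"      # placeholder for the redacted bucket name
PREFIX = "beta001/2025/..."   # placeholder for the prefix from the access log

session = boto3.Session()  # should pick up the IRSA web-identity credentials in the pod
creds = session.get_credentials().get_frozen_credentials()
print("access key in use:", creds.access_key)

s3 = session.client("s3", config=Config(retries={"max_attempts": 10, "mode": "adaptive"}))
paginator = s3.get_paginator("list_objects_v2")
total = 0
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    total += len(page.get("Contents", []))
print("listed", total, "objects")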
microscopic-furniture-57275
02/25/2025, 3:57 PM
thankful-minister-83577