# flyte-support
c
Hi there, I’m stuck running an execution on my Flyte cluster. I’m running flyte-binary (latest version) on a vanilla Kubernetes cluster with NetApp S3 storage. I was able to register tasks, workflows, and launch plans, and the S3 bucket access worked for that. Starting an execution was also possible, but it exited with this error message:
```
File "/usr/local/lib/python3.12/site-packages/flytekit/core/data_persistence.py", line 614, in async_get_data
    raise FlyteDownloadDataException(
flytekit.exceptions.system.FlyteDownloadDataException: SYSTEM:DownloadDataError: error=Failed to get data from s3://*****/test-project-01/development/YR7RMXMIOOCGYZIJKBIO2N4KUI======/fast6c01ca0737d31ff994073617f3ac5dec.tar.gz to /root/ (recursive=False).

Original exception: Unable to locate credentials.
```
If I understand it correctly, the difference from the registration process is that the user context changed, right? I’ve now been stuck for two days because I can’t figure out the right place and best practice for providing the credentials to the workflow. Do I have to create a Kubernetes service account and connect it to the launch plan? And what is the right way to attach the credentials to the service account? The documentation is very focused on hyperscaler usage and less on on-prem setups. I’m thankful for any kind of help.
c
Registration happens in the control plane; execution of the task happens in the data plane. Seems to me that the data plane might be missing the S3 credentials, so they can’t be plumbed into the generated k8s pod manifest. I’m not familiar with flyte-binary, so I’d have to see if those credentials are different. (We are on prem, btw.)
c
Thank you Jason for your input. Based on that I continued my research and found a working solution: adding the environment variables `FLYTE_AWS_ENDPOINT`, `FLYTE_AWS_ACCESS_KEY_ID`, and `FLYTE_AWS_SECRET_ACCESS_KEY` to the `env` section of the execution YAML file. The download part seems to work, but now there is an error when uploading the data:
```
flytekit.exceptions.system.FlyteUploadDataException: SYSTEM:UploadDataError: error=Failed to put data from /tmp/flyteyr9uht3k/local_flytekit/engine_dir to s3://****/metadata/propeller/test-project-01-development-as6jzb2mglblwnbgrhfj/n0/data/0 (recursive=True).

Original exception: [Errno 22] x-amz-content-sha256 must be UNSIGNED-PAYLOAD, STREAMING-AWS4-HMAC-SHA256-PAYLOAD or a valid sha256 value., cause=[Errno 22] x-amz-content-sha256 must be UNSIGNED-PAYLOAD, STREAMING-AWS4-HMAC-SHA256-PAYLOAD or a valid sha256 value.
```
Because I’m using NetApp S3 storage and not AWS, the header `x-amz-content-sha256` is not supported. Is it possible to disable this checksum in flyte/boto3?
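For what it’s worth, plain boto3 has a knob for this via botocore’s `Config`; a minimal sketch with placeholder endpoint, bucket, and credentials (I don’t know whether flytekit exposes this):
```python
# Sketch: plain boto3 against our NetApp endpoint with payload signing
# disabled, so the request should carry x-amz-content-sha256: UNSIGNED-PAYLOAD.
# Endpoint, bucket, and credentials below are placeholders.
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="https://netapp-s3.example.internal",
    aws_access_key_id="<access_key_id>",
    aws_secret_access_key="<secret>",
    config=Config(s3={"payload_signing_enabled": False}),
)
s3.upload_file("test01", "testdata", "test01")  # Filename, Bucket, Key
```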
t
which version of ONTAP are you on? looks like 9.11.1 or higher should support this.
c
We are using S3 "compliant" storage as well and have similar issues. We run an old version of botocore to work around the issue.
c
We are running ONTAP 9.15.1P10, and you are right: “ONTAP S3 now supports chunked uploads signing requests using `x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD`”. I will try to investigate whether flyte/boto doesn’t send the right header or whether NetApp rejects it.
The issue seems to originate from Flyte. I was able to upload data to the bucket with the header `x-amz-content-sha256: UNSIGNED-PAYLOAD` as well as `x-amz-content-sha256: <hash>` with cURL:
```bash
curl -sS "https://###.###.de/s3-###/testdata/test01" -H "x-amz-content-sha256: UNSIGNED-PAYLOAD" -T "test01" --user "<access_key_id>:<secret>" --aws-sigv4 "aws:amz:US:s3" -v

curl -sS "https://###.###.de/s3-###/testdata/test01" -H "x-amz-content-sha256: $(sha256sum test01 | awk '{print $1}')" -T "test01" --user "<access_key_id>:<secret>" --aws-sigv4 "aws:amz:US:s3" -v
```
Do you have any ideas, or should I rather open an issue on GitHub?
t
@cuddly-napkin-839 i think you’ll need to check whether the header is present in the request. you can add some debug print statements in the task to see what’s happening. here’s the file you’ll want to debug: https://github.com/flyteorg/flytekit/blob/master/flytekit/core/data_persistence.py. flytekit uses fsspec under the hood for data handling, and it’s unclear why the header is missing or why its value isn’t correct.
c
My tasks are simple and based on the examples. There is no additional S3 communication on my side, so what can I debug in this case? If I understand it correctly, it’s just flytekit’s internal S3 communication.
```python
import flytekit as fl

@fl.task
def task_1(a: int, b: int, c: int) -> int:
    return a + b + c

@fl.task
def task_2(m: int, n: int) -> int:
    return m * n

@fl.task
def task_3(x: int, y: int) -> int:
    return x - y

@fl.workflow
def my_workflow(a: int, b: int, c: int, m: int, n: int) -> int:
    x = task_1(a=a, b=b, c=c)
    y = task_2(m=m, n=n)
    return task_3(x=x, y=y)

# Combining default and fixed inputs
lp_combined = fl.LaunchPlan.get_or_create(
    workflow=my_workflow,
    name="combined_inputs",
    default_inputs={"b": 1, "c": 2, "m": 3, "n": 4},
    fixed_inputs={"a": 200}
)
```
t
yeah, it’s related to flytekit. per https://kb.netapp.com/on-prem/ontap/da/S3/S3-KBs/Copy_Upload_failure_using_AWS_CLI_and_Java__InvalidArgument, can you try setting the suggested env vars in the task (`AWS_REQUEST_CHECKSUM_CALCULATION` and `AWS_RESPONSE_CHECKSUM_VALIDATION` to `when_required`)? if that doesn’t work, we’ll need to add some debug statements to the flytekit code, use the dev version in the image, and run it on the cluster to inspect the s3 request.
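something like this on your task, using the `environment` parameter of the task decorator (just a sketch against your example task; i’m assuming these values get plumbed through to the pod env):
```python
# Sketch: pinning the AWS checksum behavior per task via flytekit's
# `environment` parameter (values become container env vars in the task pod).
import flytekit as fl

@fl.task(
    environment={
        "AWS_REQUEST_CHECKSUM_CALCULATION": "when_required",
        "AWS_RESPONSE_CHECKSUM_VALIDATION": "when_required",
    }
)
def task_1(a: int, b: int, c: int) -> int:
    return a + b + c
```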
c
Thank you. Now it’s working. I set the env variables `AWS_REQUEST_CHECKSUM_CALCULATION` and `AWS_RESPONSE_CHECKSUM_VALIDATION` to `when_required`. Now I’m not sure which header is actually used, but that’s fine for me at the moment.
t
good to know you're unblocked for now! the error has to do with botocore. i'm not sure if the image has the latest version, but you could try installing it.
if you really want to dig deeper into the header, you could enable logging in the flytekit data persistence file i shared (e.g. `logging.getLogger('botocore').setLevel(logging.DEBUG)`), then run `make build-dev` to build the flytekit image and use it for your task. i believe you should then see the headers in the task logs.
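roughly like this (a sketch; where exactly you drop it into data_persistence.py is up to you):
```python
# Sketch: temporary debug logging to surface botocore's request headers
# (including x-amz-content-sha256) in the task logs.
import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger("botocore").setLevel(logging.DEBUG)
```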