# contribute
e
Hi folks -- a question on blob storage ETags. As we work through adding SSE-KMS support in S3 for encryption-at-rest, we're realizing that some of the assumptions Flyte makes about ETags are no longer valid. The S3 docs have this to say:
An entity tag (ETag) that represents a specific version of an object. For objects that are not uploaded as a multipart upload and are either unencrypted or encrypted by server-side encryption with Amazon S3 managed keys (SSE-S3), the ETag is an MD5 digest of the data.
In other words, a client-generated MD5 != ETag when using SSE-KMS, so basic "is this the same file content" checks that compare the two will never match. I wouldn't be surprised if Azure Blob Storage has the same issue with encryption enabled. Anyone have thoughts about this? Does anyone know off the top of their head what sort of problems we're going to run into if ETags are not always MD5? It seems like a custom header controlled by clients should be used instead, like x-flyte-checksum-md5. Looking at API docs -- S3, Azure Blob Storage, GCS and MinIO all support custom metadata (though some may require specific header prefixes, so x-flyte might not work). So I'm wondering if there's a path to addressing this problem with client-controlled checksums? Maintaining backwards compat with this approach seems tricky...
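To make the client-controlled checksum idea concrete, here is a minimal sketch in Python. It is not how Flyte does this today. The metadata key `flyte-checksum-md5` and the helpers `upload_metadata` and `same_content` are hypothetical names invented for illustration. The sketch assumes the storage client adds the provider's required prefix to user metadata on upload (for example `x-amz-meta-` on S3) and strips it again on read. For backwards compat, objects written without the metadata fall back to the ETag, but only a match is treated as evidence, since a mismatch says nothing under SSE-KMS or multipart uploads.

```python
import hashlib
import re
from typing import Mapping, Optional

# Hypothetical metadata key; the provider-specific prefix (x-amz-meta-,
# x-goog-meta-, x-ms-meta-) is assumed to be added by the storage client.
CHECKSUM_KEY = "flyte-checksum-md5"

# A plain (non-multipart) ETag looks like 32 hex characters.
# Multipart ETags carry a "-<part count>" suffix and are never an MD5 of the data.
_PLAIN_MD5 = re.compile(r"^[0-9a-f]{32}$")


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of the object content, computed client-side."""
    return hashlib.md5(data).hexdigest()


def upload_metadata(data: bytes) -> dict:
    """User metadata to attach when writing the object."""
    return {CHECKSUM_KEY: md5_hex(data)}


def same_content(
    data: bytes, metadata: Mapping[str, str], etag: Optional[str]
) -> Optional[bool]:
    """Return True/False when the stored checksum decides it, else None.

    None means "unknown": the caller should treat the object as possibly
    different (for example, re-upload or re-download it).
    """
    local = md5_hex(data)

    stored = metadata.get(CHECKSUM_KEY)
    if stored is not None:
        # Client-controlled checksum: authoritative regardless of encryption.
        return stored.lower() == local

    if etag:
        candidate = etag.strip('"').lower()
        # Legacy object without the metadata: an ETag that equals the local
        # MD5 is good evidence of identical content, but a mismatch is
        # inconclusive because the ETag may not be an MD5 at all (SSE-KMS).
        if _PLAIN_MD5.match(candidate) and candidate == local:
            return True

    return None
```

With this shape, existing objects keep working wherever the ETag still happens to be an MD5, and everything else degrades to "unknown" rather than to a false "different".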
k
Cc @Eduardo Apolinario (eapolinario) / @Yee any thoughts
y
@Ethan Brown could you submit an issue for this please?
we'll do some investigation and circle back on there.
e
k
cc @Haytham Abuelfutuh