proud-answer-87162
10/25/2023, 4:08 PMRemoteURLInterface for Azure (example). It seems the interface is only called when remoteDataConfig.SignedURL.Enabled == true, is that right?
Also, I noticed that the aws implementation uses the s3 client directly to fetch a signed URL from a created Request object which does not appear to be used (beyond fetching the URL).
I found that surprising; does anyone know why RawStore.CreateSignedURL isn't used for that purpose? It feels like offloading that responsibility to the store implementation (likely Stow) would work, while avoiding having to re-implement the logic in each concrete RemoteURL type.thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
proud-answer-87162
10/30/2023, 2:40 PMas for the url that’s returned however, that is used. this is part of the fast-registration/upload of offloaded data types flow used in flytekit.ok, that makes sense. but the URL returned from s3 client and the URL returned from stow should be the same, right? i think stow has an s3 implementation, which would likely use the s3 client in the same manner as the flyte code
proud-answer-87162
10/30/2023, 6:04 PMas for the url that’s returned however, that is used. this is part of the fast-registration/upload of offloaded data types flow used in flytekit.right, the URL is used to fetch data from the store. but only when
remoteDataConfig.SignedURL.Enabled == true, correct?proud-answer-87162
11/29/2023, 5:36 PMthankful-minister-83577
thankful-minister-83577
proud-answer-87162
11/30/2023, 2:55 PMproud-answer-87162
12/14/2023, 10:11 PMCreateUploadLocation flow. flytekit has a method get_upload_signed_url which seems to be responsible for fetching the SAS and using it to upload the blob.
i do see that the WorkflowExecutionGetDataResponse gets returned UrlBlob.bytes is actively used, and presumably that could be handled differently. i think calling remoteDataStoreClient.CreateSignedURL is unnecessary.proud-answer-87162
12/14/2023, 10:19 PMproud-answer-87162
12/21/2023, 12:23 AMmaxSizeInBytes, and flyteremote tries to access an flyte resource (e.g., remote.fetch_execution()), _get_input_literal_map attempts to use the SAS token. However, in both the existing AWS case and in the new Azure implementation, flytekit throws an error when trying to use the SAS token.
In the existing AWS case flytekit reports the "file name is too long"; the service seems to use the SAS token and other query params when copying data to the local dir. For the new Azure implementation, the error is "IsADirectoryError: [Errno 21] Is a directory:
'/var/folders/2m/lj3gvn411sbff510w8lgnv480000gv/T/flytemcf_achx/control_plane_me
tadata/local_flytekit/outputs.pb'"proud-answer-87162
12/21/2023, 12:28 AMRemoteURLInterface , and i just stumbled upon a bug or an unexpected configuration. but it doesn't seem like the SAS produced by the remote-url feature is successfully used by either flyte or the clients i have tested with.
do you have any insight into my findings above? if this is a deprecated feature as some comments suggest, does it need to be supported in the az implementation?thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
outputs.pb not the whole path but that’s just a guess. do you have a draft pr that we could take a look at? copy here the request payload that is being sent?proud-answer-87162
12/21/2023, 3:10 AMproud-answer-87162
12/21/2023, 3:11 AMproud-answer-87162
12/21/2023, 3:15 AMIsADirectoryError comes from the python code rather than azure, but i'll track that down tomorrow morning.proud-answer-87162
12/21/2023, 3:16 AM_assign_inputs_and_outputs flow. is that expected to work with presigned urls and the flyteremote api?thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
proud-answer-87162
12/21/2023, 2:43 PMproud-answer-87162
12/21/2023, 2:43 PMproud-answer-87162
12/21/2023, 3:47 PMremoteData inline config with flyte-binary does not work for me locally with my configurationthankful-minister-83577
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
proud-answer-87162
12/22/2023, 3:07 PMpyflyte, and it works when loading data in the UI.
i only see a failure when setting signedUrls: true and maxByteSize to a low limit, then fetching a resource using flyteremote.
this is the config i added to flyte-binary values.yaml:
inline:
remoteData:
region: us-west-2
scheme: aws
maxSizeInBytes: 1
signedUrls:
enabled: true
durationMinutes: 5proud-answer-87162
12/22/2023, 11:21 PMdata_persistence to dst = file_system.get(from_path, to_path, recursive=recursive, **kwargs) will fail if from_path is over some char limit (at least on macOS). fsspec does not appear to have a configuration to use the full from_path to fetch the data but only a portion of it for the dst result. it's simple to repro this with a local test, assuming the url is long enough to trigger the failure.proud-answer-87162
12/22/2023, 11:24 PMproud-answer-87162
01/05/2024, 5:31 PM_get_output_literal_map. this is due to another fsspec quirk: calling get with a URL for rpath and a filename for lpath results in lpath being used as a path dir. e.g., if rpath is http://mydomain/outpub.pb and lpath is /var/output.pb a file gets written at /var/outpub.pb/output.pb.
this causes an issue in the flytekit code because remote._get_output_literal_map() creates tmp_name with the filename ("output.pb") and then uses tmp_name as the location for the resulting file. in reality, that path is a dir that houses the actual file.
to demonstrate what works with fsspec i put together a hacky update to remote._get_output_literal_map() for my particular use case:
elif execution_data.outputs.bytes > 0:
with self.remote_context() as ctx:
tmp_name = os.path.join(ctx.file_access.local_sandbox_dir)
file_name = execution_data.outputs.url.rsplit("/", 1)[1]
ctx.file_access.get_data(execution_data.outputs.url, tmp_name)
return literal_models.LiteralMap.from_flyte_idl(
utils.load_proto_from_file(literals_pb2.LiteralMap, tmp_name + "/" + file_name)
)proud-answer-87162
01/05/2024, 5:32 PMtmp_name represents the dir the local file is written to (as opposed to a filename)
2. I construct the resulting file location from tmp_name and the file from remote urlproud-answer-87162
01/05/2024, 5:33 PMproud-answer-87162
01/08/2024, 9:46 PMthankful-minister-83577
<http://mydomain/outpub.pb> and lpath is /var/output.pb ” do you mean that the other way around? that is, in this call here in the function of interest, execution_data.outputs.url == <http://mydomain/outpub.pb> and tmp_name == /var/outpubs.pb ?thankful-minister-83577
is_multipart arg).proud-answer-87162
01/09/2024, 2:51 PMexecution_data.outputs.url == <http://mydomain/output.pb> and tmp_name == /var/output.pb/output.pb , which results in rpath (remote) == <http://mydomain/output.pb> and lpath (local) == /var/output.pbproud-answer-87162
01/09/2024, 2:54 PMutils.py:other_paths()) that does the concatenationproud-answer-87162
01/09/2024, 3:03 PMdata_persistence get_data and get are just passthroughs if recursive == falsethankful-minister-83577
tmp_name getting set to /var/output.pb/output.pb?thankful-minister-83577
thankful-minister-83577
proud-answer-87162
01/09/2024, 4:46 PMtmp_name is set to /var/output.pbthankful-minister-83577
proud-answer-87162
01/09/2024, 4:47 PMother_pathsthankful-minister-83577
proud-answer-87162
01/09/2024, 4:51 PMother_paths behaves as you expect. but if exists == true, instead of using lpath with a filename as the destination, it uses it to form the final path. you can see this by looking at a test in fsspec test_utils.py : (["/path1"], "/path2", True, ["/path2/path1"])proud-answer-87162
01/09/2024, 4:52 PMexists == true in this use casethankful-minister-83577
thankful-minister-83577
/var/outputs.pb then the problem doesn’t happen?proud-answer-87162
01/09/2024, 5:02 PMthankful-minister-83577
rm /var/outputs.pbproud-answer-87162
01/09/2024, 5:02 PMproud-answer-87162
01/09/2024, 5:03 PM/var/outputs.pb is a dir. the file is at /var/outputs.pb/outputs.pbthankful-minister-83577
rm then exists will be false the first time right?proud-answer-87162
01/09/2024, 5:15 PMexists = source_is_str and (
(has_magic(rpath) and source_is_file)
or (not has_magic(rpath) and dest_is_dir and source_not_trailing_sep)proud-answer-87162
01/09/2024, 5:15 PMgetthankful-minister-83577
/var/outputs.pbthankful-minister-83577
proud-answer-87162
01/09/2024, 5:18 PMrpath is the remote file/source in the _get operationproud-answer-87162
01/09/2024, 5:18 PMthankful-minister-83577
thankful-minister-83577
rm the local file, make sure it doesn’t exist either as a directory or a file, does the problem persist?proud-answer-87162
01/09/2024, 5:56 PMexists is a test for the rpath (source)proud-answer-87162
01/09/2024, 5:57 PM/var/outputs.pb/output.pb but flytekit expects the file to be at /var/outputs.pbproud-answer-87162
01/09/2024, 5:58 PM/var/outputs.pb a IsADirectoryError gets thrownproud-answer-87162
01/09/2024, 5:59 PMdef load_proto_from_file(pb2_type, path):
with open(path, "rb") as reader:
out = pb2_type()
out.ParseFromString(reader.read())
return outthankful-minister-83577
proud-answer-87162
01/09/2024, 6:04 PMthankful-minister-83577
thankful-minister-83577
thankful-minister-83577
proud-answer-87162
01/09/2024, 6:05 PMthankful-minister-83577
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
proud-answer-87162
01/09/2024, 6:08 PMremoteData is configured to set signedUrls:true
b) maxSizeInBytes
d) FlyteRemote is used to interact with the execution - I am not 100% confident in this, but it's the only use case I could findproud-answer-87162
01/09/2024, 6:09 PMproud-answer-87162
01/09/2024, 6:14 PMdef test_get_output():
kwargs = {}
spec = fsspec.implementations.http.HTTPFileSystem(fsspec.filesystem("https", **kwargs))
frompath = linkToUrlFileile
topath = "/var/inputs10.txt"
response = spec.get(frompath, topath, **kwargs)
print(response)proud-answer-87162
01/09/2024, 6:15 PMthankful-minister-83577
thankful-minister-83577
thankful-minister-83577
proud-answer-87162
01/09/2024, 6:16 PMthankful-minister-83577
proud-answer-87162
01/09/2024, 6:20 PMproud-answer-87162
01/09/2024, 6:23 PMthankful-minister-83577
thankful-minister-83577
import fsspec
import shutil
import os
def test_file_handling():
s3kwargs = {'cache_regions': False, 'key': 'minio', 'secret': 'miniostorage',
'client_kwargs': {'endpoint_url': '<http://localhost:30002>'}}
s3_file = "<s3://my-s3-bucket/data/5w/acwl6plp7ps4fwpsgrzb-n1-0/264513408320143fe0a02bf95c023505/00000>"
sss = fsspec.filesystem("s3", **s3kwargs)
local_dir = "/Users/ytong/temp/fss_output"
try:
shutil.rmtree(local_dir)
except FileNotFoundError:
...
sss.get(s3_file, os.path.join(local_dir, "from_s3"))
def test_file_handling_http():
local_dir = "/Users/ytong/temp/fss_output"
try:
shutil.rmtree(local_dir)
except FileNotFoundError:
...
http_fs = fsspec.filesystem("http")
http_signed_url = "<http://localhost:9000/my-s3-bucket/data/5w/acwl6plp7ps4fwpsgrzb-n1-0/264513408320143fe0a02bf95c023505/00000?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AIY0FO8J1300DN5XZSXF%2F20240109%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240109T181127Z&X-Amz-Expires=604800&X-Amz-Security-Token=eyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCJ9.eyJhY2Nlc3NLZXkiOiJBSVkwRk84SjEzMDBETjVYWlNYRiIsImV4cCI6MTcwNDg2NzAxNSwicGFyZW50IjoibWluaW8ifQ.MmX5eqMsw2e77Pd90EiYaQtyFZRDg0jZJRWo4QEefxV8PeLvszALvCiXW8HRenpamY-Y-roSwGhoYDKRKGcCHQ&X-Amz-SignedHeaders=host&versionId=null&X-Amz-Signature=72d2bae5a221adb9d7044a38210e365dbf969939b7ba9b8fa9d44117cc041f8e>"
http_fs.get(http_signed_url, os.path.join(local_dir, "from_http"))thankful-minister-83577
thankful-minister-83577
kf port-forward svc/flyte-sandbox-minio 9000:9000 to get the port 9000 to show upthankful-minister-83577
thankful-minister-83577
proud-answer-87162
01/09/2024, 6:27 PMput case. i haven't debugged that because i know it works, but i suspect it's related to how fsspec uses lpath to construct rpath in a different mannerproud-answer-87162
01/09/2024, 6:28 PM_get use case)thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
/thankful-minister-83577
"<http://localhost:9000/my-s3-bucket/data/5w/acwl6plp7ps4fwpsgrzb-n1-0/264513408320143fe0a02bf95c023505/00000?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AIY0FO8J1300DN5XZSXF%2F20240109%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240109T181127Z&X-Amz-Expires=604800&X-Amz-Security-Token=eyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCJ9.eyJhY2Nlc3NLZXkiOiJBSVkwRk84SjEzMDBETjVYWlNYRiIsImV4cCI6MTcwNDg2NzAxNSwicGFyZW50IjoibWluaW8ifQ.MmX5eqMsw2e77Pd90EiYaQtyFZRDg0jZJRWo4QEefxV8PeLvszALvCiXW8HRenpamY-Y-roSwGhoYDKRKGcCHQ&X-Amz-SignedHeaders=host&versionId=null&X-Amz-Signature=72d2bae5a221adb9d7044a38210e365dbf969939b7ba9b8fa9d44117cc041f8e>"proud-answer-87162
01/09/2024, 7:00 PMproud-answer-87162
01/09/2024, 7:01 PM/ the behavior is the same if exists == trueproud-answer-87162
01/09/2024, 7:01 PMput use case, where exists == falseproud-answer-87162
01/09/2024, 7:02 PMother_paths does is strip trailing /, hehthankful-minister-83577
proud-answer-87162
01/09/2024, 7:08 PMproud-answer-87162
01/09/2024, 7:09 PMthankful-minister-83577
thankful-minister-83577
thankful-minister-83577
proud-answer-87162
01/09/2024, 7:09 PMTrue for URLsthankful-minister-83577
http_fs.get([http_signed_url], [os.path.join(local_dir, "from_http")]) works, not sure if you mentioned that already aboveproud-answer-87162
01/09/2024, 7:11 PMproud-answer-87162
01/09/2024, 7:11 PMthankful-minister-83577
proud-answer-87162
01/09/2024, 7:13 PMthankful-minister-83577
thankful-minister-83577
proud-answer-87162
01/09/2024, 7:19 PMthankful-minister-83577
thankful-minister-83577
thankful-minister-83577
proud-answer-87162
01/09/2024, 7:22 PMproud-answer-87162
01/09/2024, 7:22 PMproud-answer-87162
01/09/2024, 7:49 PMthankful-minister-83577
thankful-minister-83577
proud-answer-87162
01/09/2024, 7:54 PMif fsspec fixes the behavior, that will resolve both of these right?4701 will be fixed if fsspec fixes the bug. but 4700 will still exist because flytekit expects lpath to be
/var/output.pb but instead it has been updated to /var/output.pb/output.pb.proud-answer-87162
01/09/2024, 7:58 PM_get_output_literal_map constructs/uses tmp_name is probably the simplestthankful-minister-83577
cp /file/a /other/b linux doesn’t put it under /other/b/athankful-minister-83577
proud-answer-87162
01/09/2024, 8:09 PMproud-answer-87162
01/10/2024, 8:05 PM