proud-answer-87162
10/25/2023, 4:08 PM RemoteURLInterface for Azure (example). It seems the interface is only called when remoteDataConfig.SignedURL.Enabled == true, is that right?
Also, I noticed that the AWS implementation uses the S3 client directly to fetch a signed URL from a created Request object, which does not appear to be used beyond fetching the URL.
I found that surprising; does anyone know why RawStore.CreateSignedURL isn't used for that purpose? It feels like offloading that responsibility to the store implementation (likely Stow) would work, while avoiding having to re-implement the logic in each concrete RemoteURL type.
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
proud-answer-87162
10/30/2023, 2:40 PM "as for the url that’s returned however, that is used. this is part of the fast-registration/upload of offloaded data types flow used in flytekit."
ok, that makes sense. but the URL returned from the s3 client and the URL returned from stow should be the same, right? i think stow has an s3 implementation, which would likely use the s3 client in the same manner as the flyte code
proud-answer-87162
10/30/2023, 6:04 PM "as for the url that’s returned however, that is used. this is part of the fast-registration/upload of offloaded data types flow used in flytekit."
right, the URL is used to fetch data from the store. but only when remoteDataConfig.SignedURL.Enabled == true, correct?
proud-answer-87162
11/29/2023, 5:36 PMthankful-minister-83577
thankful-minister-83577
proud-answer-87162
11/30/2023, 2:55 PMproud-answer-87162
12/14/2023, 10:11 PM CreateUploadLocation flow. flytekit has a method get_upload_signed_url which seems to be responsible for fetching the SAS and using it to upload the blob.
i do see that the WorkflowExecutionGetDataResponse gets returned UrlBlob.bytes is actively used, and presumably that could be handled differently. i think calling remoteDataStoreClient.CreateSignedURL is unnecessary.
proud-answer-87162
12/14/2023, 10:19 PMproud-answer-87162
12/21/2023, 12:23 AM maxSizeInBytes, and flyteremote tries to access a flyte resource (e.g., remote.fetch_execution()), _get_input_literal_map attempts to use the SAS token. However, in both the existing AWS case and in the new Azure implementation, flytekit throws an error when trying to use the SAS token.
In the existing AWS case flytekit reports "file name is too long"; the service seems to use the SAS token and other query params when copying data to the local dir. For the new Azure implementation, the error is "IsADirectoryError: [Errno 21] Is a directory: '/var/folders/2m/lj3gvn411sbff510w8lgnv480000gv/T/flytemcf_achx/control_plane_metadata/local_flytekit/outputs.pb'"
proud-answer-87162
12/21/2023, 12:28 AM RemoteURLInterface, and i just stumbled upon a bug or an unexpected configuration. but it doesn't seem like the SAS produced by the remote-url feature is successfully used by either flyte or the clients i have tested with.
do you have any insight into my findings above? if this is a deprecated feature as some comments suggest, does it need to be supported in the az implementation?
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
outputs.pb not the whole path but that’s just a guess. do you have a draft pr that we could take a look at? copy here the request payload that is being sent?
proud-answer-87162
12/21/2023, 3:10 AMproud-answer-87162
12/21/2023, 3:11 AMproud-answer-87162
12/21/2023, 3:15 AM IsADirectoryError comes from the python code rather than azure, but i'll track that down tomorrow morning.
proud-answer-87162
12/21/2023, 3:16 AM _assign_inputs_and_outputs flow. is that expected to work with presigned urls and the flyteremote api?
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
proud-answer-87162
12/21/2023, 2:43 PMproud-answer-87162
12/21/2023, 2:43 PMproud-answer-87162
12/21/2023, 3:47 PM remoteData inline config with flyte-binary does not work for me locally with my configuration
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
proud-answer-87162
12/22/2023, 3:07 PM pyflyte, and it works when loading data in the UI.
i only see a failure when setting signedUrls: true and maxSizeInBytes to a low limit, then fetching a resource using flyteremote.
this is the config i added to flyte-binary values.yaml:
inline:
  remoteData:
    region: us-west-2
    scheme: aws
    maxSizeInBytes: 1
    signedUrls:
      enabled: true
      durationMinutes: 5
proud-answer-87162
12/22/2023, 11:21 PM data_persistence to dst = file_system.get(from_path, to_path, recursive=recursive, **kwargs) will fail if from_path is over some char limit (at least on macOS). fsspec does not appear to have a configuration to use the full from_path to fetch the data but only a portion of it for the dst result. it's simple to repro this with a local test, assuming the url is long enough to trigger the failure.
proud-answer-87162
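A minimal sketch of the length problem described in the message above, assuming the local file name is taken from the last segment of the URL with the query string still attached. The 255-byte limit is the usual per-component limit on macOS and most Linux filesystems, not something stated in the thread. `naive_local_name`, `local_name_from_url`, and the example URL are hypothetical and are not flytekit or fsspec APIs.

```python
import posixpath
from urllib.parse import urlsplit

NAME_MAX = 255  # typical per-component limit (assumption; filesystem dependent)


def naive_local_name(url: str) -> str:
    """Last URL segment as-is, with the query string still attached."""
    return url.rsplit("/", 1)[1]


def local_name_from_url(url: str) -> str:
    """Last segment of the URL path only; the query string is dropped."""
    return posixpath.basename(urlsplit(url).path)


# Hypothetical signed URL with a long query string standing in for the token.
signed_url = "http://example.com/bucket/outputs.pb?X-Amz-Signature=" + "a" * 300

print(len(naive_local_name(signed_url)) > NAME_MAX)
print(local_name_from_url(signed_url))
```

Deriving the destination name from the URL path alone keeps the full signed URL for the fetch while keeping the local name short.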
12/22/2023, 11:24 PMproud-answer-87162
01/05/2024, 5:31 PM _get_output_literal_map. this is due to another fsspec quirk: calling get with a URL for rpath and a filename for lpath results in lpath being used as a directory path. e.g., if rpath is http://mydomain/output.pb and lpath is /var/output.pb, a file gets written at /var/output.pb/output.pb.
this causes an issue in the flytekit code because remote._get_output_literal_map() creates tmp_name with the filename ("output.pb") and then uses tmp_name as the location for the resulting file. in reality, that path is a dir that houses the actual file.
to demonstrate what works with fsspec i put together a hacky update to remote._get_output_literal_map() for my particular use case:
elif execution_data.outputs.bytes > 0:
    with self.remote_context() as ctx:
        tmp_name = os.path.join(ctx.file_access.local_sandbox_dir)
        file_name = execution_data.outputs.url.rsplit("/", 1)[1]
        ctx.file_access.get_data(execution_data.outputs.url, tmp_name)
        return literal_models.LiteralMap.from_flyte_idl(
            utils.load_proto_from_file(literals_pb2.LiteralMap, tmp_name + "/" + file_name)
        )
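The same idea can be sketched without any flytekit dependencies. This is a hedged illustration, not the flytekit fix: `resolve_downloaded_file` is a hypothetical helper that returns the local path when it is a regular file, and otherwise looks inside it for a file named after the last segment of the remote URL path. The directory layout is simulated with `tempfile` rather than produced by fsspec.

```python
import os
import tempfile
from urllib.parse import urlsplit


def resolve_downloaded_file(local_path: str, remote_url: str) -> str:
    """Return the path that should be read after downloading to local_path.

    If local_path ended up as a directory, assume the file was written inside
    it under the last segment of the remote URL path (query string ignored).
    """
    if os.path.isdir(local_path):
        file_name = os.path.basename(urlsplit(remote_url).path)
        return os.path.join(local_path, file_name)
    return local_path


with tempfile.TemporaryDirectory() as sandbox:
    # Simulate the layout described above: a directory named like the target file.
    target = os.path.join(sandbox, "output.pb")
    os.makedirs(target)
    with open(os.path.join(target, "output.pb"), "wb") as f:
        f.write(b"data")
    resolved = resolve_downloaded_file(target, "http://mydomain/output.pb?sig=abc")
    print(os.path.relpath(resolved, sandbox))
```

Unlike splitting the raw URL on "/", parsing the URL first keeps the query string of a signed URL out of the local file name.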
proud-answer-87162
01/05/2024, 5:32 PM
1. tmp_name represents the dir the local file is written to (as opposed to a filename)
2. I construct the resulting file location from tmp_name and the file name from the remote url
proud-answer-87162
01/05/2024, 5:33 PMproud-answer-87162
01/08/2024, 9:46 PMthankful-minister-83577
http://mydomain/output.pb and lpath is /var/output.pb” do you mean that the other way around? that is, in this call here in the function of interest, execution_data.outputs.url == http://mydomain/output.pb and tmp_name == /var/output.pb?
thankful-minister-83577
is_multipart arg).
proud-answer-87162
01/09/2024, 2:51 PM execution_data.outputs.url == http://mydomain/output.pb and tmp_name == /var/output.pb/output.pb, which results in rpath (remote) == http://mydomain/output.pb and lpath (local) == /var/output.pb
proud-answer-87162
01/09/2024, 2:54 PM utils.py:other_paths()) that does the concatenation
proud-answer-87162
01/09/2024, 3:03 PM data_persistence get_data and get are just passthroughs if recursive == false
thankful-minister-83577
tmp_name getting set to /var/output.pb/output.pb?
thankful-minister-83577
thankful-minister-83577
proud-answer-87162
01/09/2024, 4:46 PM tmp_name is set to /var/output.pb
thankful-minister-83577
proud-answer-87162
01/09/2024, 4:47 PMother_paths
thankful-minister-83577
proud-answer-87162
01/09/2024, 4:51 PM other_paths behaves as you expect. but if exists == true, instead of using lpath with a filename as the destination, it uses it to form the final path. you can see this by looking at a test in fsspec test_utils.py: (["/path1"], "/path2", True, ["/path2/path1"])
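To make the effect of the flag concrete, here is a simplified single-file model of the destination path. It is an assumption-laden stand-in that only mirrors the test case quoted above and the behavior described in this thread. It is not fsspec's implementation, and `model_destination` is a hypothetical name.

```python
import posixpath


def model_destination(rpath: str, lpath: str, exists: bool) -> str:
    """Simplified model of where a single file lands (NOT fsspec's code)."""
    if exists:
        # lpath is treated as a directory and the source file name is appended.
        return posixpath.join(lpath.rstrip("/"), posixpath.basename(rpath))
    # lpath is used as the destination file itself.
    return lpath.rstrip("/")


print(model_destination("/path1", "/path2", True))
print(model_destination("/path1", "/path2", False))
print(model_destination("http://mydomain/output.pb", "/var/output.pb", True))
```

With the flag set, a destination that already names the file gains a second copy of the file name, which matches the /var/output.pb/output.pb layout reported in this thread.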
proud-answer-87162
01/09/2024, 4:52 PM exists == true in this use case
thankful-minister-83577
thankful-minister-83577
/var/outputs.pb then the problem doesn’t happen?
proud-answer-87162
01/09/2024, 5:02 PMthankful-minister-83577
rm /var/outputs.pb
proud-answer-87162
01/09/2024, 5:02 PMproud-answer-87162
01/09/2024, 5:03 PM /var/outputs.pb is a dir. the file is at /var/outputs.pb/outputs.pb
thankful-minister-83577
rm then exists will be false the first time right?
proud-answer-87162
01/09/2024, 5:15 PM
exists = source_is_str and (
    (has_magic(rpath) and source_is_file)
    or (not has_magic(rpath) and dest_is_dir and source_not_trailing_sep)
)
proud-answer-87162
01/09/2024, 5:15 PMget
thankful-minister-83577
/var/outputs.pb
thankful-minister-83577
proud-answer-87162
01/09/2024, 5:18 PM rpath is the remote file/source in the _get operation
proud-answer-87162
01/09/2024, 5:18 PMthankful-minister-83577
thankful-minister-83577
rm the local file, make sure it doesn’t exist either as a directory or a file, does the problem persist?
proud-answer-87162
01/09/2024, 5:56 PM exists is a test for the rpath (source)
proud-answer-87162
01/09/2024, 5:57 PM /var/outputs.pb/output.pb but flytekit expects the file to be at /var/outputs.pb
proud-answer-87162
01/09/2024, 5:58 PM /var/outputs.pb an IsADirectoryError gets thrown
proud-answer-87162
01/09/2024, 5:59 PM
def load_proto_from_file(pb2_type, path):
    with open(path, "rb") as reader:
        out = pb2_type()
        out.ParseFromString(reader.read())
        return out
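The helper quoted above opens the path directly, so the error can be reproduced with the standard library alone. This is a sketch under stated assumptions: it runs on a POSIX system (Windows raises a different error), and the directory layout is built by hand instead of by an fsspec download. `read_bytes` and `demo` are hypothetical names, and the protobuf parsing is left out.

```python
import os
import tempfile


def read_bytes(path: str) -> bytes:
    # Same open(..., "rb") pattern as load_proto_from_file, minus the parsing.
    with open(path, "rb") as reader:
        return reader.read()


def demo() -> str:
    with tempfile.TemporaryDirectory() as sandbox:
        # The path the caller expects to be a file...
        expected = os.path.join(sandbox, "outputs.pb")
        # ...is actually a directory that houses the real file.
        os.makedirs(expected)
        with open(os.path.join(expected, "outputs.pb"), "wb") as f:
            f.write(b"payload")
        try:
            read_bytes(expected)
        except IsADirectoryError:
            return "IsADirectoryError"
        return "no error"


print(demo())
```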
thankful-minister-83577
proud-answer-87162
01/09/2024, 6:04 PMthankful-minister-83577
thankful-minister-83577
thankful-minister-83577
proud-answer-87162
01/09/2024, 6:05 PMthankful-minister-83577
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
proud-answer-87162
01/09/2024, 6:08 PM remoteData is configured to set signedUrls:true
b) maxSizeInBytes
d) FlyteRemote is used to interact with the execution - I am not 100% confident in this, but it's the only use case I could find
proud-answer-87162
01/09/2024, 6:09 PMproud-answer-87162
01/09/2024, 6:14 PM
import fsspec

def test_get_output():
    kwargs = {}
    # fsspec.filesystem("https") already returns an HTTPFileSystem instance
    spec = fsspec.filesystem("https", **kwargs)
    frompath = linkToUrlFile  # placeholder: the signed URL of the remote file
    topath = "/var/inputs10.txt"
    response = spec.get(frompath, topath, **kwargs)
    print(response)
proud-answer-87162
01/09/2024, 6:15 PMthankful-minister-83577
thankful-minister-83577
thankful-minister-83577
proud-answer-87162
01/09/2024, 6:16 PMthankful-minister-83577
proud-answer-87162
01/09/2024, 6:20 PMproud-answer-87162
01/09/2024, 6:23 PMthankful-minister-83577
thankful-minister-83577
import fsspec
import shutil
import os

def test_file_handling():
    s3kwargs = {'cache_regions': False, 'key': 'minio', 'secret': 'miniostorage',
                'client_kwargs': {'endpoint_url': 'http://localhost:30002'}}
    s3_file = "s3://my-s3-bucket/data/5w/acwl6plp7ps4fwpsgrzb-n1-0/264513408320143fe0a02bf95c023505/00000"
    sss = fsspec.filesystem("s3", **s3kwargs)
    local_dir = "/Users/ytong/temp/fss_output"
    try:
        shutil.rmtree(local_dir)
    except FileNotFoundError:
        ...
    sss.get(s3_file, os.path.join(local_dir, "from_s3"))

def test_file_handling_http():
    local_dir = "/Users/ytong/temp/fss_output"
    try:
        shutil.rmtree(local_dir)
    except FileNotFoundError:
        ...
    http_fs = fsspec.filesystem("http")
    http_signed_url = "http://localhost:9000/my-s3-bucket/data/5w/acwl6plp7ps4fwpsgrzb-n1-0/264513408320143fe0a02bf95c023505/00000?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AIY0FO8J1300DN5XZSXF%2F20240109%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240109T181127Z&X-Amz-Expires=604800&X-Amz-Security-Token=eyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCJ9.eyJhY2Nlc3NLZXkiOiJBSVkwRk84SjEzMDBETjVYWlNYRiIsImV4cCI6MTcwNDg2NzAxNSwicGFyZW50IjoibWluaW8ifQ.MmX5eqMsw2e77Pd90EiYaQtyFZRDg0jZJRWo4QEefxV8PeLvszALvCiXW8HRenpamY-Y-roSwGhoYDKRKGcCHQ&X-Amz-SignedHeaders=host&versionId=null&X-Amz-Signature=72d2bae5a221adb9d7044a38210e365dbf969939b7ba9b8fa9d44117cc041f8e"
    http_fs.get(http_signed_url, os.path.join(local_dir, "from_http"))
thankful-minister-83577
thankful-minister-83577
kf port-forward svc/flyte-sandbox-minio 9000:9000 to get the port 9000 to show up
thankful-minister-83577
thankful-minister-83577
proud-answer-87162
01/09/2024, 6:27 PM put case. i haven't debugged that because i know it works, but i suspect it's related to how fsspec uses lpath to construct rpath in a different manner
proud-answer-87162
01/09/2024, 6:28 PM _get use case)
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
/
thankful-minister-83577
"<http://localhost:9000/my-s3-bucket/data/5w/acwl6plp7ps4fwpsgrzb-n1-0/264513408320143fe0a02bf95c023505/00000?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AIY0FO8J1300DN5XZSXF%2F20240109%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240109T181127Z&X-Amz-Expires=604800&X-Amz-Security-Token=eyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCJ9.eyJhY2Nlc3NLZXkiOiJBSVkwRk84SjEzMDBETjVYWlNYRiIsImV4cCI6MTcwNDg2NzAxNSwicGFyZW50IjoibWluaW8ifQ.MmX5eqMsw2e77Pd90EiYaQtyFZRDg0jZJRWo4QEefxV8PeLvszALvCiXW8HRenpamY-Y-roSwGhoYDKRKGcCHQ&X-Amz-SignedHeaders=host&versionId=null&X-Amz-Signature=72d2bae5a221adb9d7044a38210e365dbf969939b7ba9b8fa9d44117cc041f8e>"
proud-answer-87162
01/09/2024, 7:00 PMproud-answer-87162
01/09/2024, 7:01 PM / the behavior is the same if exists == true
proud-answer-87162
01/09/2024, 7:01 PM put use case, where exists == false
proud-answer-87162
01/09/2024, 7:02 PM other_paths does is strip trailing /, heh
thankful-minister-83577
proud-answer-87162
01/09/2024, 7:08 PMproud-answer-87162
01/09/2024, 7:09 PMthankful-minister-83577
thankful-minister-83577
thankful-minister-83577
proud-answer-87162
01/09/2024, 7:09 PM True for URLs
thankful-minister-83577
http_fs.get([http_signed_url], [os.path.join(local_dir, "from_http")]) works, not sure if you mentioned that already above
proud-answer-87162
01/09/2024, 7:11 PMproud-answer-87162
01/09/2024, 7:11 PMthankful-minister-83577
proud-answer-87162
01/09/2024, 7:13 PMthankful-minister-83577
thankful-minister-83577
proud-answer-87162
01/09/2024, 7:19 PMthankful-minister-83577
thankful-minister-83577
thankful-minister-83577
proud-answer-87162
01/09/2024, 7:22 PMproud-answer-87162
01/09/2024, 7:22 PMproud-answer-87162
01/09/2024, 7:49 PMthankful-minister-83577
thankful-minister-83577
proud-answer-87162
01/09/2024, 7:54 PM "if fsspec fixes the behavior, that will resolve both of these right?"
4701 will be fixed if fsspec fixes the bug. but 4700 will still exist because flytekit expects lpath to be /var/output.pb but instead it has been updated to /var/output.pb/output.pb.
proud-answer-87162
01/09/2024, 7:58 PM _get_output_literal_map constructs/uses tmp_name is probably the simplest
thankful-minister-83577
cp /file/a /other/b linux doesn’t put it under /other/b/a
thankful-minister-83577
proud-answer-87162
01/09/2024, 8:09 PMproud-answer-87162
01/10/2024, 8:05 PM