Chris Grass
10/25/2023, 4:08 PMRemoteURLInterface
for Azure (example). It seems the interface is only called when remoteDataConfig.SignedURL.Enabled == true
, is that right?
Also, I noticed that the aws implementation uses the s3 client directly to fetch a signed URL from a created Request object which does not appear to be used (beyond fetching the URL).
I found that surprising; does anyone know why RawStore.CreateSignedURL
isn't used for that purpose? It feels like offloading that responsibility to the store implementation (likely Stow) would work, while avoiding having to re-implement the logic in each concrete RemoteURL type.Yee
Yee
Yee
Yee
Chris Grass
10/30/2023, 2:40 PMas for the url that’s returned however, that is used. this is part of the fast-registration/upload of offloaded data types flow used in flytekit.ok, that makes sense. but the URL returned from s3 client and the URL returned from stow should be the same, right? i think stow has an s3 implementation, which would likely use the s3 client in the same manner as the flyte code
Chris Grass
10/30/2023, 6:04 PMas for the url that’s returned however, that is used. this is part of the fast-registration/upload of offloaded data types flow used in flytekit.right, the URL is used to fetch data from the store. but only when
remoteDataConfig.SignedURL.Enabled == true
, correct?Chris Grass
11/29/2023, 5:36 PMYee
Yee
Chris Grass
11/30/2023, 2:55 PMChris Grass
12/14/2023, 10:11 PMCreateUploadLocation
flow. flytekit has a method get_upload_signed_url
which seems to be responsible for fetching the SAS and using it to upload the blob.
i do see that the WorkflowExecutionGetDataResponse
gets returned UrlBlob.bytes
is actively used, and presumably that could be handled differently. i think calling remoteDataStoreClient.CreateSignedURL
is unnecessary.Chris Grass
12/14/2023, 10:19 PMChris Grass
12/21/2023, 12:23 AMmaxSizeInBytes
, and flyteremote tries to access an flyte resource (e.g., remote.fetch_execution()
), _get_input_literal_map
attempts to use the SAS token. However, in both the existing AWS case and in the new Azure implementation, flytekit throws an error when trying to use the SAS token.
In the existing AWS case flytekit reports the "file name is too long"; the service seems to use the SAS token and other query params when copying data to the local dir. For the new Azure implementation, the error is "IsADirectoryError: [Errno 21] Is a directory:
'/var/folders/2m/lj3gvn411sbff510w8lgnv480000gv/T/flytemcf_achx/control_plane_me
tadata/local_flytekit/outputs.pb'"Chris Grass
12/21/2023, 12:28 AMRemoteURLInterface
, and i just stumbled upon a bug or an unexpected configuration. but it doesn't seem like the SAS produced by the remote-url feature is successfully used by either flyte or the clients i have tested with.
do you have any insight into my findings above? if this is a deprecated feature as some comments suggest, does it need to be supported in the az implementation?Yee
Yee
Yee
outputs.pb
not the whole path but that’s just a guess. do you have a draft pr that we could take a look at? copy here the request payload that is being sent?Chris Grass
12/21/2023, 3:10 AMChris Grass
12/21/2023, 3:11 AMChris Grass
12/21/2023, 3:15 AMIsADirectoryError
comes from the python code rather than azure, but i'll track that down tomorrow morning.Chris Grass
12/21/2023, 3:16 AM_assign_inputs_and_outputs
flow. is that expected to work with presigned urls and the flyteremote api?Yee
Yee
Yee
Chris Grass
12/21/2023, 2:43 PMChris Grass
12/21/2023, 2:43 PMChris Grass
12/21/2023, 3:47 PMremoteData
inline config with flyte-binary does not work for me locally with my configurationYee
Yee
Yee
Yee
Yee
Chris Grass
12/22/2023, 3:07 PMpyflyte
, and it works when loading data in the UI.
i only see a failure when setting signedUrls: true
and maxByteSize
to a low limit, then fetching a resource using flyteremote.
this is the config i added to flyte-binary values.yaml:
inline:
remoteData:
region: us-west-2
scheme: aws
maxSizeInBytes: 1
signedUrls:
enabled: true
durationMinutes: 5
Chris Grass
12/22/2023, 11:21 PMdata_persistence
to dst = file_system.get(from_path, to_path, recursive=recursive, **kwargs)
will fail if from_path
is over some char limit (at least on macOS). fsspec does not appear to have a configuration to use the full from_path
to fetch the data but only a portion of it for the dst
result. it's simple to repro this with a local test, assuming the url is long enough to trigger the failure.Chris Grass
12/22/2023, 11:24 PMChris Grass
01/05/2024, 5:31 PM_get_output_literal_map
. this is due to another fsspec quirk: calling get
with a URL for rpath
and a filename for lpath
results in lpath
being used as a path dir. e.g., if rpath
is http://mydomain/outpub.pb
and lpath
is /var/output.pb
a file gets written at /var/outpub.pb/output.pb
.
this causes an issue in the flytekit code because remote._get_output_literal_map()
creates tmp_name
with the filename ("output.pb") and then uses tmp_name
as the location for the resulting file. in reality, that path is a dir that houses the actual file.
to demonstrate what works with fsspec i put together a hacky update to remote._get_output_literal_map()
for my particular use case:
elif execution_data.outputs.bytes > 0:
with self.remote_context() as ctx:
tmp_name = os.path.join(ctx.file_access.local_sandbox_dir)
file_name = execution_data.outputs.url.rsplit("/", 1)[1]
ctx.file_access.get_data(execution_data.outputs.url, tmp_name)
return literal_models.LiteralMap.from_flyte_idl(
utils.load_proto_from_file(literals_pb2.LiteralMap, tmp_name + "/" + file_name)
)
Chris Grass
01/05/2024, 5:32 PMtmp_name
represents the dir the local file is written to (as opposed to a filename)
2. I construct the resulting file location from tmp_name
and the file from remote urlChris Grass
01/05/2024, 5:33 PMChris Grass
01/08/2024, 9:46 PMYee
<http://mydomain/outpub.pb>
and lpath
is /var/output.pb
” do you mean that the other way around? that is, in this call here in the function of interest, execution_data.outputs.url
== <http://mydomain/outpub.pb>
and tmp_name
== /var/outpubs.pb
?Yee
is_multipart
arg).Chris Grass
01/09/2024, 2:51 PMexecution_data.outputs.url
== <http://mydomain/output.pb>
and tmp_name
== /var/output.pb/output.pb
, which results in rpath
(remote) == <http://mydomain/output.pb>
and lpath
(local) == /var/output.pb
Chris Grass
01/09/2024, 2:54 PMutils.py:other_paths()
) that does the concatenationChris Grass
01/09/2024, 3:03 PMdata_persistence
get_data
and get
are just passthroughs if recursive == false
Yee
tmp_name
getting set to /var/output.pb/output.pb
?Yee
Yee
Chris Grass
01/09/2024, 4:46 PMtmp_name
is set to /var/output.pb
Yee
Chris Grass
01/09/2024, 4:47 PMother_paths
Yee
Chris Grass
01/09/2024, 4:51 PMother_paths
behaves as you expect. but if exists == true
, instead of using lpath
with a filename as the destination, it uses it to form the final path. you can see this by looking at a test in fsspec test_utils.py
: (["/path1"], "/path2", True, ["/path2/path1"])
Chris Grass
01/09/2024, 4:52 PMexists == true
in this use caseYee
Yee
/var/outputs.pb
then the problem doesn’t happen?Chris Grass
01/09/2024, 5:02 PMYee
rm /var/outputs.pb
Chris Grass
01/09/2024, 5:02 PMChris Grass
01/09/2024, 5:03 PM/var/outputs.pb
is a dir. the file is at /var/outputs.pb/outputs.pb
Yee
rm
then exists will be false the first time right?Chris Grass
01/09/2024, 5:15 PMexists = source_is_str and (
(has_magic(rpath) and source_is_file)
or (not has_magic(rpath) and dest_is_dir and source_not_trailing_sep)
Chris Grass
01/09/2024, 5:15 PMget
Yee
/var/outputs.pb
Yee
Chris Grass
01/09/2024, 5:18 PMrpath
is the remote file/source in the _get
operationChris Grass
01/09/2024, 5:18 PMYee
Yee
rm
the local file, make sure it doesn’t exist either as a directory or a file, does the problem persist?Chris Grass
01/09/2024, 5:56 PMexists
is a test for the rpath
(source)Chris Grass
01/09/2024, 5:57 PM/var/outputs.pb/output.pb
but flytekit expects the file to be at /var/outputs.pb
Chris Grass
01/09/2024, 5:58 PM/var/outputs.pb
a IsADirectoryError
gets thrownChris Grass
01/09/2024, 5:59 PMdef load_proto_from_file(pb2_type, path):
with open(path, "rb") as reader:
out = pb2_type()
out.ParseFromString(reader.read())
return out
Yee
Chris Grass
01/09/2024, 6:04 PMYee
Yee
Yee
Chris Grass
01/09/2024, 6:05 PMYee
Yee
Yee
Yee
Chris Grass
01/09/2024, 6:08 PMremoteData
is configured to set signedUrls:true
b) maxSizeInBytes
d) FlyteRemote
is used to interact with the execution - I am not 100% confident in this, but it's the only use case I could findChris Grass
01/09/2024, 6:09 PMChris Grass
01/09/2024, 6:14 PMdef test_get_output():
kwargs = {}
spec = fsspec.implementations.http.HTTPFileSystem(fsspec.filesystem("https", **kwargs))
frompath = linkToUrlFileile
topath = "/var/inputs10.txt"
response = spec.get(frompath, topath, **kwargs)
print(response)
Chris Grass
01/09/2024, 6:15 PMYee
Yee
Yee
Chris Grass
01/09/2024, 6:16 PMYee
Chris Grass
01/09/2024, 6:20 PMChris Grass
01/09/2024, 6:23 PMYee
Yee
import fsspec
import shutil
import os
def test_file_handling():
s3kwargs = {'cache_regions': False, 'key': 'minio', 'secret': 'miniostorage',
'client_kwargs': {'endpoint_url': '<http://localhost:30002>'}}
s3_file = "<s3://my-s3-bucket/data/5w/acwl6plp7ps4fwpsgrzb-n1-0/264513408320143fe0a02bf95c023505/00000>"
sss = fsspec.filesystem("s3", **s3kwargs)
local_dir = "/Users/ytong/temp/fss_output"
try:
shutil.rmtree(local_dir)
except FileNotFoundError:
...
sss.get(s3_file, os.path.join(local_dir, "from_s3"))
def test_file_handling_http():
local_dir = "/Users/ytong/temp/fss_output"
try:
shutil.rmtree(local_dir)
except FileNotFoundError:
...
http_fs = fsspec.filesystem("http")
http_signed_url = "<http://localhost:9000/my-s3-bucket/data/5w/acwl6plp7ps4fwpsgrzb-n1-0/264513408320143fe0a02bf95c023505/00000?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AIY0FO8J1300DN5XZSXF%2F20240109%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240109T181127Z&X-Amz-Expires=604800&X-Amz-Security-Token=eyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCJ9.eyJhY2Nlc3NLZXkiOiJBSVkwRk84SjEzMDBETjVYWlNYRiIsImV4cCI6MTcwNDg2NzAxNSwicGFyZW50IjoibWluaW8ifQ.MmX5eqMsw2e77Pd90EiYaQtyFZRDg0jZJRWo4QEefxV8PeLvszALvCiXW8HRenpamY-Y-roSwGhoYDKRKGcCHQ&X-Amz-SignedHeaders=host&versionId=null&X-Amz-Signature=72d2bae5a221adb9d7044a38210e365dbf969939b7ba9b8fa9d44117cc041f8e>"
http_fs.get(http_signed_url, os.path.join(local_dir, "from_http"))
Yee
Yee
kf port-forward svc/flyte-sandbox-minio 9000:9000
to get the port 9000 to show upYee
Yee
Chris Grass
01/09/2024, 6:27 PMput
case. i haven't debugged that because i know it works, but i suspect it's related to how fsspec uses lpath
to construct rpath
in a different mannerChris Grass
01/09/2024, 6:28 PM_get
use case)Yee
Yee
Yee
Yee
/
Yee
"<http://localhost:9000/my-s3-bucket/data/5w/acwl6plp7ps4fwpsgrzb-n1-0/264513408320143fe0a02bf95c023505/00000?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AIY0FO8J1300DN5XZSXF%2F20240109%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240109T181127Z&X-Amz-Expires=604800&X-Amz-Security-Token=eyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCJ9.eyJhY2Nlc3NLZXkiOiJBSVkwRk84SjEzMDBETjVYWlNYRiIsImV4cCI6MTcwNDg2NzAxNSwicGFyZW50IjoibWluaW8ifQ.MmX5eqMsw2e77Pd90EiYaQtyFZRDg0jZJRWo4QEefxV8PeLvszALvCiXW8HRenpamY-Y-roSwGhoYDKRKGcCHQ&X-Amz-SignedHeaders=host&versionId=null&X-Amz-Signature=72d2bae5a221adb9d7044a38210e365dbf969939b7ba9b8fa9d44117cc041f8e>"
Chris Grass
01/09/2024, 7:00 PMChris Grass
01/09/2024, 7:01 PM/
the behavior is the same if exists == true
Chris Grass
01/09/2024, 7:01 PMput
use case, where exists == false
Chris Grass
01/09/2024, 7:02 PMother_paths
does is strip trailing /
, hehYee
Chris Grass
01/09/2024, 7:08 PMChris Grass
01/09/2024, 7:09 PMYee
Yee
Yee
Chris Grass
01/09/2024, 7:09 PMTrue
for URLsYee
http_fs.get([http_signed_url], [os.path.join(local_dir, "from_http")])
works, not sure if you mentioned that already aboveChris Grass
01/09/2024, 7:11 PMChris Grass
01/09/2024, 7:11 PMYee
Chris Grass
01/09/2024, 7:13 PMYee
Yee
Chris Grass
01/09/2024, 7:19 PMYee
Yee
Yee
Chris Grass
01/09/2024, 7:22 PMChris Grass
01/09/2024, 7:22 PMChris Grass
01/09/2024, 7:49 PMYee
Yee
Chris Grass
01/09/2024, 7:54 PMif fsspec fixes the behavior, that will resolve both of these right?4701 will be fixed if fsspec fixes the bug. but 4700 will still exist because flytekit expects lpath to be
/var/output.pb
but instead it has been updated to /var/output.pb/output.pb
.Chris Grass
01/09/2024, 7:58 PM_get_output_literal_map
constructs/uses tmp_name
is probably the simplestYee
cp /file/a /other/b
linux doesn’t put it under /other/b/aYee
Chris Grass
01/09/2024, 8:09 PMChris Grass
01/10/2024, 8:05 PM