Howdy I am working to bring Azure AD authenticatio...
# ask-the-community
c
Howdy I am working to bring Azure AD authentication in stow (here) into flyte. As part of that, I'm adding a
RemoteURLInterface
for Azure (example). It seems the interface is only called when
remoteDataConfig.SignedURL.Enabled == true
, is that right? Also, I noticed that the aws implementation uses the s3 client directly to fetch a signed URL from a created Request object which does not appear to be used (beyond fetching the URL). I found that surprising; does anyone know why
RawStore.CreateSignedURL
isn't used for that purpose? It feels like offloading that responsibility to the store implementation (likely Stow) would work, while avoiding having to re-implement the logic in each concrete RemoteURL type.
y
hey are you still working on this? I can reach out internally to find out more of the history of the code there.
you are correct that this appears to use the s3 client directly and not go through stow. can’t speak to why that is.
as for the url that’s returned however, that is used. this is part of the fast-registration/upload of offloaded data types flow used in flytekit.
this is how your updated code gets sent to blob store.
c
as for the url that’s returned however, that is used. this is part of the fast-registration/upload of offloaded data types flow used in flytekit.
ok, that makes sense. but the URL returned from s3 client and the URL returned from stow should be the same, right? i think stow has an s3 implementation, which would likely use the s3 client in the same manner as the flyte code
as for the url that’s returned however, that is used. this is part of the fast-registration/upload of offloaded data types flow used in flytekit.
right, the URL is used to fetch data from the store. but only when
remoteDataConfig.SignedURL.Enabled == true
, correct?
@Yee - what are your thoughts on me adding presigned url support for the azure use case? is that a feature that the community wants to continue to support, or is relying on a workload identity reasonable for the azure integration?
y
that would be awesome, but i don’t think we have the resources to do/test that.
is that something you could help with?
c
yeah, i can work on that and try to put together some tests
> as for the url that’s returned however, that is used. this is part of the fast-registration/upload of offloaded data types flow used in flytekit. i did a little digging and I think the SAS url used by flytekit is actually generated from the
CreateUploadLocation
flow. flytekit has a method
get_upload_signed_url
which seems to be responsible for fetching the SAS and using it to upload the blob. i do see that the
WorkflowExecutionGetDataResponse
gets returned but I don't think the urlblob (sas url) is actively used. i see it marked as deprecated and returning an empty string does not seem to affect the normal registration/run behavior. if above is true, it seems that the
UrlBlob.bytes
is actively used, and presumably that could be handled differently. i think calling
remoteDataStoreClient.CreateSignedURL
is unnecessary.
thoughts on the above? i see you're on vacation, so no rush on a response
(disclaimer with everything above - i'm an amateur dev when it comes to python and go, so i easily could have misread or missed something)
@Yee after a little more research I see some scenarios where the code attempts to use the SAS (from UrlBlob) but it doesn't appear to work. the URL is included in various responses to flytekit (and possibly other clients). If the blob size is larger than the configured
maxSizeInBytes
, and flyteremote tries to access an flyte resource (e.g.,
remote.fetch_execution()
),
_get_input_literal_map
attempts to use the SAS token. However, in both the existing AWS case and in the new Azure implementation, flytekit throws an error when trying to use the SAS token. In the existing AWS case flytekit reports the "file name is too long"; the service seems to use the SAS token and other query params when copying data to the local dir. For the new Azure implementation, the error is "IsADirectoryError: [Errno 21] Is a directory: '/var/folders/2m/lj3gvn411sbff510w8lgnv480000gv/T/flytemcf_achx/control_plane_me tadata/local_flytekit/outputs.pb'"
it's entirely possible i am misunderstanding the intended use case for
RemoteURLInterface
, and i just stumbled upon a bug or an unexpected configuration. but it doesn't seem like the SAS produced by the remote-url feature is successfully used by either flyte or the clients i have tested with. do you have any insight into my findings above? if this is a deprecated feature as some comments suggest, does it need to be supported in the az implementation?
y
sorry what’s SAS?
each signed url is used to upload one file, not a directory. if you want to upload a whole directory, you’d have to call it once per file unf.
if i were to guess, the isadirectoryerror is just azure saying that the request should only have
outputs.pb
not the whole path but that’s just a guess. do you have a draft pr that we could take a look at? copy here the request payload that is being sent?
c
sorry, SAS = Shared Access Signature (pre-signed URL)
here is the draft PR with azure impl
i didn't think
IsADirectoryError
comes from the python code rather than azure, but i'll track that down tomorrow morning.
but it seemed strange to me that the AWS flow had a different error in the same
_assign_inputs_and_outputs
flow. is that expected to work with presigned urls and the flyteremote api?
that’s the one for aws
i don’t see the azure client anywhere in the pr
c
right, i'm relying on the stow client to construct the signed url
i can confirm that the URL is generated and can be used to access the file
and maybe before diving into the azure case - is AWS supposed to work? because using
remoteData
inline config with flyte-binary does not work for me locally with my configuration
y
aws definitely works
that’s the default one.
i don’t understand the setup exactly though in your case.
the default production deployment with single binary definitely works.
it’s been a while since i’ve tried running the single binary locally, using real AWS instead of minio, but that also works (you just need to add in creds). and certainly using minio as s3 standin definitely works.
c
to be clear, AWS works for me in 90% of use cases. it works for running launchplans and passing data between tasks, it works when triggered from
pyflyte
, and it works when loading data in the UI. i only see a failure when setting
signedUrls: true
and
maxByteSize
to a low limit, then fetching a resource using flyteremote. this is the config i added to flyte-binary values.yaml:
Copy code
inline:
    remoteData:
      region: us-west-2
      scheme: aws
      maxSizeInBytes: 1
      signedUrls:
        enabled: true
        durationMinutes: 5
i figured out the s3 failure. s3 presigned urls can be quite long, and the call in
data_persistence
to
dst = file_system.get(from_path, to_path, recursive=recursive, **kwargs)
will fail if
from_path
is over some char limit (at least on macOS). fsspec does not appear to have a configuration to use the full
from_path
to fetch the data but only a portion of it for the
dst
result. it's simple to repro this with a local test, assuming the url is long enough to trigger the failure.
i'm still tracking down the underlying az issue
@Yee i opened a question on the fsspec board about the presigned URL not working. they confirmed that it's an issue but they aren't sure how they would resolve. i also found the issue with the az implementation. it gets past the long URL issue because az creates a smaller URL but then fails when calling
_get_output_literal_map
. this is due to another fsspec quirk: calling
get
with a URL for
rpath
and a filename for
lpath
results in
lpath
being used as a path dir. e.g., if
rpath
is
http://mydomain/outpub.pb
and
lpath
is
/var/output.pb
a file gets written at
/var/outpub.pb/output.pb
. this causes an issue in the flytekit code because
remote._get_output_literal_map()
creates
tmp_name
with the filename ("output.pb") and then uses
tmp_name
as the location for the resulting file. in reality, that path is a dir that houses the actual file. to demonstrate what works with fsspec i put together a hacky update to
remote._get_output_literal_map()
for my particular use case:
Copy code
elif execution_data.outputs.bytes > 0:
            with self.remote_context() as ctx:
                tmp_name = os.path.join(ctx.file_access.local_sandbox_dir)
                file_name = execution_data.outputs.url.rsplit("/", 1)[1]
                ctx.file_access.get_data(execution_data.outputs.url, tmp_name)
                return literal_models.LiteralMap.from_flyte_idl(
                    utils.load_proto_from_file(literals_pb2.LiteralMap, tmp_name + "/" + file_name)
                )
two changes were necessary: 1.
tmp_name
represents the dir the local file is written to (as opposed to a filename) 2. I construct the resulting file location from
tmp_name
and the file from remote url
this is obviously a bad prod solution, but i think it demonstrates the issue and gives some hints at a real fix
@Yee here is the PR with an explanation of my findings and links for additional context.
y
clarification, when you say “`rpath` is
<http://mydomain/outpub.pb>
and
lpath
is
/var/output.pb
” do you mean that the other way around? that is, in this call here in the function of interest,
execution_data.outputs.url
==
<http://mydomain/outpub.pb>
and
tmp_name
==
/var/outpubs.pb
?
if that’s the case that really shouldn’t be happening. treating the right argument (the destination argument) as a directory should only happen if recursive is on (the
is_multipart
arg).
c
execution_data.outputs.url
==
<http://mydomain/output.pb>
and
tmp_name
==
/var/output.pb/output.pb
, which results in
rpath
(remote) ==
<http://mydomain/output.pb>
and
lpath
(local) ==
/var/output.pb
> that really shouldn’t be happening. treating the right argument (the destination argument) as a directory should only happen if recursive is on can you point me to the flytekit code that enforces that behavior? I think it's the fsspec code (
utils.py:other_paths()
) that does the concatenation
data_persistence
get_data
and
get
are just passthroughs if
recursive == false
y
why is
tmp_name
getting set to
/var/output.pb/output.pb
?
so it should be the local sandbox dir.
c
tmp_name
is set to
/var/output.pb
c
that gets used as input into
other_paths
y
it sounds like the setting for the local sandbox dir is getting messed up?
c
i don't think so - the flytekit code works correctly if
other_paths
behaves as you expect. but if
exists == true
, instead of using
lpath
with a filename as the destination, it uses it to form the final path. you can see this by looking at a test in fsspec
test_utils.py
:
(["/path1"], "/path2", True, ["/path2/path1"])
and
exists == true
in this use case
y
oh
so if you remove
/var/outputs.pb
then the problem doesn’t happen?
c
what do you mean by remove?
y
rm /var/outputs.pb
c
it gets recreated on the next run
and
/var/outputs.pb
is a dir. the file is at
/var/outputs.pb/outputs.pb
y
but if you
rm
then exists will be false the first time right?
c
exists is based on `rpath`:
Copy code
exists = source_is_str and (
                (has_magic(rpath) and source_is_file)
                or (not has_magic(rpath) and dest_is_dir and source_not_trailing_sep)
for the
get
y
but rpath is the destination isn’t it? rpath is
/var/outputs.pb
if this is a get… the lpath, which is the origin, should always exist
c
rpath
is the remote file/source in the
_get
operation
lpath is the dest
y
oh got it, sorry, see their naming convention now
but still curious, if you
rm
the local file, make sure it doesn’t exist either as a directory or a file, does the problem persist?
c
it does because
exists
is a test for the
rpath
(source)
the file gets written to
/var/outputs.pb/output.pb
but flytekit expects the file to be at
/var/outputs.pb
so when it tries to open
/var/outputs.pb
a
IsADirectoryError
gets thrown
Copy code
def load_proto_from_file(pb2_type, path):
    with open(path, "rb") as reader:
        out = pb2_type()
        out.ParseFromString(reader.read())
        return out
y
not sure if it matters but what fsspec filesystem are you using? http?
c
yes, http
y
the example rpath you gave had http
k
let me try to repro this without any flytekit code
c
fsspec has fairly good unit tests to demonstrate the behavior
y
i guess…
but like, what you’re describing is a pretty big problem…
basically it renders flyte remote unable to download inputs/outputs
i feel like we would’ve seen it come up before
c
only under certain conditions: a)
remoteData
is configured to set
signedUrls:true
b) the presigned URL is larger than n (file system limit) characters (common for S3) this is only needed for the lpath size issue. not necessary for the incorrect path bug c) the payload is larger than
maxSizeInBytes
d)
FlyteRemote
is used to interact with the execution - I am not 100% confident in this, but it's the only use case I could find
but i could also have something configured incorrectly that is exposing improper behavior. i'd love that to be true!
and this is a simple test to demonstrate the fsspec behavior:
Copy code
def test_get_output():
    kwargs = {}
    spec = fsspec.implementations.http.HTTPFileSystem(fsspec.filesystem("https", **kwargs))
    frompath = linkToUrlFileile
    topath = "/var/inputs10.txt"
    response = spec.get(frompath, topath, **kwargs)
    print(response)
note that this behavior also depends on from/to paths being strings instead of lists. if they are both lists the lists are not mutated so the dest location remains valid
y
trying…
but i still don’t get it.
condition c above: payload larger than max size, how does that affect the fsspec function call inputs?
c
it doesn't, but it affects how flytekit interacts with fsspec. if the payload is inline then flytekit doesn't try to fetch the remote file
y
oh got it
c
and i haven't tested this theory, but it might also require tasks to pass raw data rather than a FlyteFile
and as i reminder, this issue only pops if you get past the URL issue discussed here. i suspect no one is using the above configuration with S3 or that bug would have been reported already.
y
well this is the test i’m using.
Copy code
import fsspec
import shutil
import os


def test_file_handling():
    s3kwargs = {'cache_regions': False, 'key': 'minio', 'secret': 'miniostorage',
                'client_kwargs': {'endpoint_url': '<http://localhost:30002>'}}

    s3_file = "<s3://my-s3-bucket/data/5w/acwl6plp7ps4fwpsgrzb-n1-0/264513408320143fe0a02bf95c023505/00000>"
    sss = fsspec.filesystem("s3", **s3kwargs)

    local_dir = "/Users/ytong/temp/fss_output"
    try:
        shutil.rmtree(local_dir)
    except FileNotFoundError:
        ...

    sss.get(s3_file, os.path.join(local_dir, "from_s3"))


def test_file_handling_http():
    local_dir = "/Users/ytong/temp/fss_output"
    try:
        shutil.rmtree(local_dir)
    except FileNotFoundError:
        ...

    http_fs = fsspec.filesystem("http")

    http_signed_url = "<http://localhost:9000/my-s3-bucket/data/5w/acwl6plp7ps4fwpsgrzb-n1-0/264513408320143fe0a02bf95c023505/00000?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AIY0FO8J1300DN5XZSXF%2F20240109%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240109T181127Z&X-Amz-Expires=604800&X-Amz-Security-Token=eyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCJ9.eyJhY2Nlc3NLZXkiOiJBSVkwRk84SjEzMDBETjVYWlNYRiIsImV4cCI6MTcwNDg2NzAxNSwicGFyZW50IjoibWluaW8ifQ.MmX5eqMsw2e77Pd90EiYaQtyFZRDg0jZJRWo4QEefxV8PeLvszALvCiXW8HRenpamY-Y-roSwGhoYDKRKGcCHQ&X-Amz-SignedHeaders=host&versionId=null&X-Amz-Signature=72d2bae5a221adb9d7044a38210e365dbf969939b7ba9b8fa9d44117cc041f8e>"
    http_fs.get(http_signed_url, os.path.join(local_dir, "from_http"))
this is using the local setup.
and i’m running
kf port-forward svc/flyte-sandbox-minio 9000:9000
to get the port 9000 to show up
and yeah, can confirm i’m running into the lpath too long issue, because it’s incorrectly computing lpath. this doesn’t happen in the s3 case.
we do use signed urls though.
c
yeah, for the
put
case. i haven't debugged that because i know it works, but i suspect it's related to how fsspec uses
lpath
to construct
rpath
in a different manner
fsspec can fetch the file, it just can't write it locally (
_get
use case)
y
i’ve had issues understanding other_paths before
image.png
did you understand this?
cuz in my case, the http doesn’t end in a
/
Copy code
"<http://localhost:9000/my-s3-bucket/data/5w/acwl6plp7ps4fwpsgrzb-n1-0/264513408320143fe0a02bf95c023505/00000?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AIY0FO8J1300DN5XZSXF%2F20240109%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240109T181127Z&X-Amz-Expires=604800&X-Amz-Security-Token=eyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCJ9.eyJhY2Nlc3NLZXkiOiJBSVkwRk84SjEzMDBETjVYWlNYRiIsImV4cCI6MTcwNDg2NzAxNSwicGFyZW50IjoibWluaW8ifQ.MmX5eqMsw2e77Pd90EiYaQtyFZRDg0jZJRWo4QEefxV8PeLvszALvCiXW8HRenpamY-Y-roSwGhoYDKRKGcCHQ&X-Amz-SignedHeaders=host&versionId=null&X-Amz-Signature=72d2bae5a221adb9d7044a38210e365dbf969939b7ba9b8fa9d44117cc041f8e>"
c
yeah, some of their documentation is wrong, or at least misleading
but he was also incorrect in describing my problem initially. even without a trailing
/
the behavior is the same if
exists == true
i think he was originally talking about the
put
use case, where
exists == false
the first thing
other_paths
does is strip trailing
/
, heh
y
i don’t think exists is a good variable name…
c
it is not
i think they need to break apart that method - it's doing too much and handles too many use cases
y
though i guess i’m not sure what has magic is doing
i guess in the meantime, we can them these single lists to work around the issue.
but i really don’t like that
c
> though i guess i’m not sure what has magic is doing that confused me as well - seems to be looking for specific characters and comes back
True
for URLs
y
like
http_fs.get([http_signed_url], [os.path.join(local_dir, "from_http")])
works, not sure if you mentioned that already above
c
yeah, i did test that. but agree that it's a kinda ugly sln
but might be the only path until fsspec fixes their behavior
y
yeah
c
alternatively... does flyte{kit} need to support signed URLs for these use cases? no one appears to be using it and the api response model suggests the property (url) is deprecated
y
it should still work for sure. we use signed urls a fair bit and it’s the correct thing to use if the literal map is too big
would you mind/did you already put in a flyte org issue?
c
i have not yet but will file something today
y
tyty
sorry for my confusion earlier
i thought rpath was right path, not remote path, and lpath left, rather than local
c
yeah, that naming convention took me a while to wrap my head around, hah
so i totally understand the confusion
i created a bug for each issue: file too long and misuse of lpath
y
if fsspec fixes the behavior, that will resolve both of these right?
thank you
c
if fsspec fixes the behavior, that will resolve both of these right?
4701 will be fixed if fsspec fixes the bug. but 4700 will still exist because flytekit expects lpath to be
/var/output.pb
but instead it has been updated to
/var/output.pb/output.pb
.
i think there are a few reasonable solutions for 4701. a small change to how
_get_output_literal_map
constructs/uses
tmp_name
is probably the simplest
y
but if i
cp /file/a /other/b
linux doesn’t put it under /other/b/a
let me think about it, i think the [] hack also works
c
yeah, i think fsspec can/should clean up their API. but it will be a breaking change so my guess is they'll add a different interface or throw on another bool flag, and that's if they agree with our position...
I added another comment to the fsspec question to make it clear that we think there are two bugs (or at least unexpected behaviors) in their lib. let me know if you think it represents our position accurately