# ask-the-community
t
Hello Flyters, I'm trying to think through the right way to handle what I think is some changed behavior in `FlyteDirectory` after version 1.4.2. Our tasks/workflows create `dataclass`-based result objects, and we wrap these with a "FlyteResult" class that contains the result object as well as a `FlyteDirectory`. Some tasks may write files during execution; the `FlyteDirectory` is pointed at the folder containing these files, and they automatically get downloaded later (from e.g. s3) by `FlyteDirectory::download()` when they are requested. However, some tasks don't write any files. The FlyteResult class that generically manages access to these things doesn't know which tasks do or don't write files, and always just calls `self.dir.download()` to get anything that might have been written (at the very least, a subfolder is created, but it may be empty). This worked fine in flytekit 1.4.2 and earlier, but starting with 1.5 it raises an `Access Denied`, because the remote path doesn't exist. It seems that you actually have to write files during the task to get this remote folder to exist; it is not enough to create a subfolder under `current_context().working_directory`.

One way to handle this is to check whether the remote s3 "path" exists, but I don't see how to do this with `FlyteDirectory`. At present I'm just creating a file in each task to ensure the `FlyteDirectory` has something in it, so that the remote s3 path gets created and the call to `FlyteDirectory::download()` doesn't result in an Access Denied exception. This feels clunky. Better ideas?
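For concreteness, here is roughly what the placeholder-file workaround looks like in one of our tasks today (a minimal sketch; the task and folder names are illustrative, not our actual code, and the real `FlyteDirectory` lives inside our result dataclass):

```python
import os

from flytekit import current_context, task
from flytekit.types.directory import FlyteDirectory


@task
def write_results() -> FlyteDirectory:
    # Create a per-task output folder under the working directory.
    out_dir = os.path.join(current_context().working_directory, "results")
    os.makedirs(out_dir, exist_ok=True)

    # The clunky part: write a placeholder so the folder is non-empty, which
    # makes the remote (s3) prefix actually get created on upload and keeps a
    # later download() from failing with Access Denied.
    with open(os.path.join(out_dir, ".placeholder"), "w") as f:
        f.write("")

    return FlyteDirectory(path=out_dir)
```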
y
the change from 1.4.2 to 1.5 was the move to fsspec. i assume this is a subsequent task you’re referring to? like a downstream task will always call download()?
using fsspec you should be able to do
```python
ctx.file_access.get_filesystem_for_path(self.dir.remote_source).exists(self.dir.remote_source)
```
?
could you try that?
t
Thanks @Yee, I'll give that a try and let you know.
@Yee to answer your original question, our current pattern for all workflows is that any files written to disk by tasks get pulled down to an AWS EFS volume after completion so that they are available to various other processes. So at the end of a workflow, there is typically a sidecar task whose main job is to just pull any results from the backend store (in our case s3) to this EFS filesystem. This is achieved by having all task-result-objects contain a FlyteDirectory instance which is pointed at any results that were written by the task that should be downloaded. Then, in the final sidecar task which copies from s3 to EFS, FlyteDirectory::download() is ultimately called (for each task-result-object) which causes anything in the backend store to get copied to EFS. In this way, anything (e.g. reporting, result-browsing, etc) that wants results to be on a "local" filesystem can have access to these results. We ultimately want to migrate away from this pattern, and access results in s3 directly, but legacy tools require us to get files to EFS at present.
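Sketched out, that sidecar task looks roughly like this (a simplified illustration; the EFS mount path and the plain list of `FlyteDirectory` inputs stand in for our real result objects):

```python
import os
import shutil
from typing import List

from flytekit import task
from flytekit.types.directory import FlyteDirectory

# Illustrative mount point; in reality this is wherever the EFS volume is mounted.
EFS_MOUNT = "/mnt/efs/workflow-results"


@task
def sync_results_to_efs(result_dirs: List[FlyteDirectory]) -> None:
    # For each task's FlyteDirectory, pull the contents down from s3 and copy
    # them onto the EFS volume so legacy tools can read them "locally".
    for i, d in enumerate(result_dirs):
        local_path = d.download()  # this is the call that fails if the remote prefix was never created
        dest = os.path.join(EFS_MOUNT, f"task_{i}")
        shutil.copytree(local_path, dest, dirs_exist_ok=True)
```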
@Yee your suggestion to use
```python
ctx.file_access.get_filesystem_for_path(flytedirectory.remote_source).exists(flytedirectory.remote_source)
```
to check for existence of the remote source works. Thanks! For any other readers, note that the `ctx` above is `FlyteContextManager::current_context()`, not `flytekit.current_context()` -- the latter returns a context with only a small subset of the params of the former (and in particular does not include `file_access`).
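In case it helps anyone else, here is roughly how we're wiring it up now (the helper name is made up; the existence check and the choice of context are the parts taken from this thread):

```python
from typing import Optional

from flytekit import FlyteContextManager
from flytekit.types.directory import FlyteDirectory


def download_if_exists(d: FlyteDirectory) -> Optional[str]:
    # Use FlyteContextManager.current_context(), not flytekit.current_context():
    # only the former exposes file_access.
    ctx = FlyteContextManager.current_context()
    fs = ctx.file_access.get_filesystem_for_path(d.remote_source)

    # Only download if the remote (s3) prefix actually exists; otherwise a
    # download() on a directory that was never written raises Access Denied.
    if fs.exists(d.remote_source):
        return d.download()
    return None
```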