# ask-the-community
t
Hello Flyters, I'm trying to think through the right way to handle what I think is some changed behavior in `FlyteDirectory` after version 1.4.2. Our tasks/workflows create `dataclass`-based result objects, and we wrap these with a "FlyteResult" class that contains the result object as well as a `FlyteDirectory`. Some tasks may write files during execution; the `FlyteDirectory` is pointed at the folder containing these files, and they automatically get downloaded later (from e.g. s3) by `FlyteDirectory::download()` when they are requested. However, some tasks don't write any files. The FlyteResult class that generically manages access to these things doesn't know which tasks do or don't write files, and always just calls `self.dir.download()` to get anything that might have been written (at the very least, a subfolder is created, but it may be empty). This worked fine in flytekit 1.4.2 and earlier, but starting with 1.5 it raises an `Access Denied`, because the remote path doesn't exist. It seems that you actually have to write files during the task to get this remote folder to exist; it is not enough to create a subfolder under `current_context().working_directory`.

One way to handle this is to check whether the remote s3 "path" exists, but I don't see how to do this with `FlyteDirectory`. At present I'm just creating a file in each task to ensure the `FlyteDirectory` has something in it, so that the remote s3 path gets created and the call to `FlyteDirectory::download()` doesn't result in an Access Denied exception. This feels clunky. Better ideas?
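For concreteness, here is roughly what the placeholder-file workaround looks like in one of our tasks today (a minimal sketch; the task and folder names are illustrative, not our actual code, and the real `FlyteDirectory` lives inside our result dataclass):

```python
import os

from flytekit import current_context, task
from flytekit.types.directory import FlyteDirectory


@task
def write_results() -> FlyteDirectory:
    # Create a per-task output folder under the working directory.
    out_dir = os.path.join(current_context().working_directory, "results")
    os.makedirs(out_dir, exist_ok=True)

    # The clunky part: write a placeholder so the folder is non-empty, which
    # makes the remote (s3) prefix actually get created on upload and keeps a
    # later download() from failing with Access Denied.
    with open(os.path.join(out_dir, ".placeholder"), "w") as f:
        f.write("")

    return FlyteDirectory(path=out_dir)
```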
y
the change from 1.4.2 to 1.5 was the move to fsspec. i assume this is a subsequent task you’re referring to? like a downstream task will always call download()?
using fsspec you should be able to do
```python
ctx.file_access.get_filesystem_for_path(self.dir.remote_source).exists(self.dir.remote_source)
```
?
could you try that?
t
Thanks @Yee, I'll give that a try and let you know.
@Yee to answer your original question, our current pattern for all workflows is that any files written to disk by tasks get pulled down to an AWS EFS volume after completion so that they are available to various other processes. So at the end of a workflow, there is typically a sidecar task whose main job is to just pull any results from the backend store (in our case s3) to this EFS filesystem. This is achieved by having all task-result-objects contain a FlyteDirectory instance which is pointed at any results that were written by the task that should be downloaded. Then, in the final sidecar task which copies from s3 to EFS, FlyteDirectory::download() is ultimately called (for each task-result-object) which causes anything in the backend store to get copied to EFS. In this way, anything (e.g. reporting, result-browsing, etc) that wants results to be on a "local" filesystem can have access to these results. We ultimately want to migrate away from this pattern, and access results in s3 directly, but legacy tools require us to get files to EFS at present.
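Sketched out, that sidecar task looks roughly like this (a simplified illustration; the EFS mount path and the plain list of `FlyteDirectory` inputs stand in for our real result objects):

```python
import os
import shutil
from typing import List

from flytekit import task
from flytekit.types.directory import FlyteDirectory

# Illustrative mount point; in reality this is wherever the EFS volume is mounted.
EFS_MOUNT = "/mnt/efs/workflow-results"


@task
def sync_results_to_efs(result_dirs: List[FlyteDirectory]) -> None:
    # For each task's FlyteDirectory, pull the contents down from s3 and copy
    # them onto the EFS volume so legacy tools can read them "locally".
    for i, d in enumerate(result_dirs):
        local_path = d.download()  # this is the call that fails if the remote prefix was never created
        dest = os.path.join(EFS_MOUNT, f"task_{i}")
        shutil.copytree(local_path, dest, dirs_exist_ok=True)
```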
@Yee your suggestion to use
```python
ctx.file_access.get_filesystem_for_path(flytedirectory.remote_source).exists(flytedirectory.remote_source)
```
to check for existence of the remote source works. Thanks! For any other readers, note that the `ctx` above is `FlyteContextManager::current_context()`, not `flytekit.current_context()` -- the latter returns a context with only a small subset of the params of the former (and in particular does not include `file_access`).
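In case it helps anyone else, here is roughly how we're wiring it up now (the helper name is made up; the existence check and the choice of context are the parts taken from this thread):

```python
from typing import Optional

from flytekit import FlyteContextManager
from flytekit.types.directory import FlyteDirectory


def download_if_exists(d: FlyteDirectory) -> Optional[str]:
    # Use FlyteContextManager.current_context(), not flytekit.current_context():
    # only the former exposes file_access.
    ctx = FlyteContextManager.current_context()
    fs = ctx.file_access.get_filesystem_for_path(d.remote_source)

    # Only download if the remote (s3) prefix actually exists; otherwise a
    # download() on a directory that was never written raises Access Denied.
    if fs.exists(d.remote_source):
        return d.download()
    return None
```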