# ask-the-community

Nizar Hattab

02/07/2024, 1:33 PM
Hi all! I have a workflow that runs multiple times a day. Each step saves its output to a certain path on GCS, and each step takes the output of the previous one as input; the path is the same for all runs. Problem: passing the output directly downloads all the files from the GCS path, not only the outputs of the current run. I think this is a bug.

Chris Grass

02/07/2024, 1:52 PM
Howdy
> Problem: passing the output directly downloads all the files from gcs path, not only the outputs of the current run.
how are you detecting that behavior? do you see it in gcs or flyte logs? can you paste the task signatures you're working with here?

Nizar Hattab

02/07/2024, 1:58 PM
Task1 output: FlyteDirectory(path, remote_path). Task2 input: FlyteDirectory = t1_op; output: FlyteDirectory(path, remote_path). Task1's actual output has only 1 file, but the remote path has 800, so Task2's input is 800 files. I see them through the logs.

Chris Grass

02/07/2024, 2:02 PM
if task2 only needs a file, should the output of task1 be a `FlyteFile` instead of a `FlyteDirectory`?

Nizar Hattab

02/07/2024, 2:03 PM
that's just an example. task 1 is a scraper and outputs multiple files in a directory; task 2 performs some operations on the scraped files.

Chris Grass

02/07/2024, 2:06 PM
the directory at the remote path stores results from multiple runs?

Nizar Hattab

02/07/2024, 2:06 PM
yes.
when passed to the next step, it downloads everything

Chris Grass

02/07/2024, 2:07 PM
what behavior do you expect? how should flyte detect the specific files your new run needs?

Nizar Hattab

02/07/2024, 2:11 PM
```python
op1 = task1()         # op1 is a FlyteDirectory with remote_path
op2 = task2(inp=op1)  # passing op1 in the same wf should use a temporary artifact
```
this way, you cannot return any remote directory if it's an input to the next task, because it's not reliable

Chris Grass

02/07/2024, 2:25 PM
how are you creating remote_path? the aws and azure implementations do create subdirectories per run, which can be accessed with `current_context().working_directory` in the python flytekit. (and i think this fs behavior is cloud agnostic, so it should be the same for gcs. but a flyte contributor might need to confirm that)

Nizar Hattab

02/07/2024, 2:26 PM
We save data directly to a datasets bucket; remote_path looks like gs://bucket_name/path/to/output
if no remote_path is used, another bucket is used by default for "temporary" artifacts, which is tied to the execution id and current context
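The difference between the two layouts can be illustrated without Flyte at all (the bucket names here are hypothetical, and the execution-scoped prefix is a simplification of what Flyte actually generates):

```python
# Fixed prefix: every run writes under the same key, so any consumer
# listing this prefix sees files from all runs mixed together.
FIXED_PREFIX = "gs://bucket_name/path/to/output"


def run_prefix(execution_id: str) -> str:
    """Execution-scoped prefix: the run's execution id keeps outputs apart."""
    # Hypothetical default raw-output bucket name.
    return f"gs://flyte-raw-output/{execution_id}/output"


# Two different runs never collide under the execution-scoped scheme.
assert run_prefix("abc123") != run_prefix("def456")
```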

Chris Grass

02/07/2024, 2:34 PM
you're unable to use the default temp execution-specific directory for passing data between tasks?

Nizar Hattab

02/07/2024, 2:35 PM
I can, but I want to save the output to a custom bucket.

Chris Grass

02/07/2024, 2:36 PM
it might be best to solve that independently of passing data between tasks. e.g., an archive task and a process-results task. the docs for `FlyteDirectory` have a warning about what you are seeing:
```
This class should not be used on very large datasets, as merely listing the dataset will cause
the entire dataset to be downloaded. Listing on S3 and other backend object stores is not consistent
and we should not need data to be downloaded to list.
```

Nizar Hattab

02/07/2024, 2:37 PM
I see, thanks