# ask-the-community
n
Hi all! I have a workflow that runs multiple times a day. Each step saves its output to a certain path on GCS and takes the output of the previous step as an input; the path is the same for all runs. Problem: passing the output directly downloads all the files from the GCS path, not only the outputs of the current run. I think this is a bug.
c
Howdy
Problem: passing the output directly downloads all the files from gcs path, not only the outputs of the current run.
how are you detecting that behavior? do you see it in gcs or flyte logs? can you paste the task signatures you're working with here?
n
t1_op = Task1(); Task1 output: FlyteDirectory(path.., remote_path)
Task2 input: FlyteDirectory = t1_op; Task2 output: FlyteDirectory(path, remote_path)
Task1's actual output has only 1 file, but the remote path has 800, so Task2's input is 800 files. I see them through the logs.
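Roughly, a sketch of the setup (names, local paths, and the remote_directory keyword are illustrative placeholders, not my exact code):
import os
from flytekit import task, workflow
from flytekit.types.directory import FlyteDirectory

REMOTE = "gs://bucket_name/path/to/output"  # the same remote path for every run

@task
def task1() -> FlyteDirectory:
    local_dir = "/tmp/task1_out"
    os.makedirs(local_dir, exist_ok=True)
    with open(os.path.join(local_dir, "result.json"), "w") as f:
        f.write("{}")  # this run produces a single file
    # uploaded under REMOTE, which already holds files from previous runs
    return FlyteDirectory(path=local_dir, remote_directory=REMOTE)

@task
def task2(inp: FlyteDirectory) -> FlyteDirectory:
    inp.download()  # pulls everything under REMOTE (~800 files), not just this run's file
    out_dir = "/tmp/task2_out"
    os.makedirs(out_dir, exist_ok=True)
    # ... process the downloaded files into out_dir ...
    return FlyteDirectory(path=out_dir, remote_directory=REMOTE)

@workflow
def wf() -> FlyteDirectory:
    t1_op = task1()
    return task2(inp=t1_op)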
c
if task2 only needs a file, should the output of task1 be a FlyteFile instead of a FlyteDirectory?
n
That's just an example: task 1 is a scraper and outputs multiple files in a directory, and task 2 performs some operations on the scraped files.
c
the directory at the remote path stores results from multiple runs?
n
yes.
when passed to the next step, it downloads everything
c
what behavior do you expect? how should flyte detect the specific files your new run needs?
n
op1 = task1()          # op1 is a FlyteDirectory with remote_path
op2 = task2(inp=op1)   # passing op1 within the same wf should use a temporary artifact
this way, you cannot return any remote directory if it's an input to the next task, because it's not reliable
c
how are you creating remote_path? the aws and azure implementations do create subdirectories per run, which can be accessed with current_context().working_directory in the python flytekit. (and i think this fs behavior is cloud agnostic, so should be the same for gcs. but a flyte contributor might need to confirm that)
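e.g., a rough sketch of writing this run's output into the execution-scoped working directory and letting flytekit upload it to the default per-execution location (treat the exact layout as an assumption):
import os
from flytekit import current_context, task
from flytekit.types.directory import FlyteDirectory

@task
def task1() -> FlyteDirectory:
    # execution-scoped scratch space, unique to this run
    out_dir = os.path.join(current_context().working_directory, "task1_out")
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "result.json"), "w") as f:
        f.write("{}")
    # no remote_directory: flytekit uploads to its per-execution raw output prefix,
    # so downstream tasks only see this run's files
    return FlyteDirectory(path=out_dir)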
n
We save data directly to a datasets bucket; remote_path looks like gs://bucket_name/path/to/output.
If no remote_path is used, another bucket is used by default for "temporary" artifacts, which is tied to the execution id and current context.
c
you're unable to use the default temp execution-specific directory for passing data between tasks?
n
I can, but I want to save the output to a custom bucket.
c
it might be best to solve that independently of passing data between tasks. e.g., an archive task and a process-results task. the docs for FlyteDirectory have a warning about what you are seeing:
This class should not be used on very large datasets, as merely listing the dataset will cause
the entire dataset to be downloaded. Listing on S3 and other backend object stores is not consistent
and we should not need data to be downloaded to list.
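e.g., something along these lines (a sketch only; the actual upload to your datasets bucket is left as a comment since I don't know how you want it laid out):
import os
from flytekit import current_context, task, workflow
from flytekit.types.directory import FlyteDirectory

@task
def scrape() -> FlyteDirectory:
    out_dir = os.path.join(current_context().working_directory, "scraped")
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "page.json"), "w") as f:
        f.write("{}")
    return FlyteDirectory(path=out_dir)  # default, execution-scoped output location

@task
def process(inp: FlyteDirectory) -> FlyteDirectory:
    inp.download()  # only this run's scrape output is here
    out_dir = os.path.join(current_context().working_directory, "processed")
    os.makedirs(out_dir, exist_ok=True)
    # ... do the processing into out_dir ...
    return FlyteDirectory(path=out_dir)

@task
def archive(results: FlyteDirectory):
    # separate concern: persist this run's results to the custom datasets bucket,
    # e.g. with the google-cloud-storage client (upload code omitted on purpose)
    local = results.download()
    # upload the contents of `local` under gs://bucket_name/path/to/output/...

@workflow
def wf():
    processed = process(inp=scrape())
    archive(results=processed)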
n
I see, thanks