# ask-the-community
n
Hi all! I have a workflow that runs multiple times a day. Each step saves its output to a certain path on GCS and takes the output of the previous step as an input; the path is the same for all runs. Problem: passing the output directly downloads all the files from the GCS path, not only the outputs of the current run. I think this is a bug.
c
Howdy
Problem: passing the output directly downloads all the files from gcs path, not only the outputs of the current run.
how are you detecting that behavior? do you see it in gcs or flyte logs? can you paste the task signatures you're working with here?
n
t1_op = Task1(); Task1 output: FlyteDirectory(path.., remote_path)
Task2 input: FlyteDirectory = t1_op; Task2 output: FlyteDirectory(path, remote_path)
Task1's actual output has only 1 file, but the remote path has 800, so Task2's input is 800 files. I see them through the logs.
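Roughly, a sketch of the setup (names, local paths, and the remote_directory keyword are illustrative placeholders, not my exact code):
import os
from flytekit import task, workflow
from flytekit.types.directory import FlyteDirectory

REMOTE = "gs://bucket_name/path/to/output"  # the same remote path for every run

@task
def task1() -> FlyteDirectory:
    local_dir = "/tmp/task1_out"
    os.makedirs(local_dir, exist_ok=True)
    with open(os.path.join(local_dir, "result.json"), "w") as f:
        f.write("{}")  # this run produces a single file
    # uploaded under REMOTE, which already holds files from previous runs
    return FlyteDirectory(path=local_dir, remote_directory=REMOTE)

@task
def task2(inp: FlyteDirectory) -> FlyteDirectory:
    inp.download()  # pulls everything under REMOTE (~800 files), not just this run's file
    out_dir = "/tmp/task2_out"
    os.makedirs(out_dir, exist_ok=True)
    # ... process the downloaded files into out_dir ...
    return FlyteDirectory(path=out_dir, remote_directory=REMOTE)

@workflow
def wf() -> FlyteDirectory:
    t1_op = task1()
    return task2(inp=t1_op)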
c
if task2 only needs a file, should the output of task1 be a FlyteFile instead of a FlyteDirectory?
n
That's just an example: task 1 is a scraper and outputs multiple files in a directory, and task 2 performs some operations on the scraped files.
c
the directory at the remote path stores results from multiple runs?
n
yes.
when passed to the next step, it downloads everything
c
what behavior do you expect? how should flyte detect the specific files your new run needs?
n
op1 = task1()          # op1 is a FlyteDirectory with remote_path
op2 = task2(inp=op1)   # passing op1 within the same wf should use a temporary artifact
this way, you cannot return any remote directory if it's an input to the next task, because it's not reliable
c
how are you creating remote_path? the aws and azure implementations do create subdirectories per run, which can be accessed with current_context().working_directory in the python flytekit. (and i think this fs behavior is cloud agnostic, so should be the same for gcs. but a flyte contributor might need to confirm that)
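e.g., a rough sketch of writing this run's output into the execution-scoped working directory and letting flytekit upload it to the default per-execution location (treat the exact layout as an assumption):
import os
from flytekit import current_context, task
from flytekit.types.directory import FlyteDirectory

@task
def task1() -> FlyteDirectory:
    # execution-scoped scratch space, unique to this run
    out_dir = os.path.join(current_context().working_directory, "task1_out")
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "result.json"), "w") as f:
        f.write("{}")
    # no remote_directory: flytekit uploads to its per-execution raw output prefix,
    # so downstream tasks only see this run's files
    return FlyteDirectory(path=out_dir)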
n
We save data directly to a datasets bucket; remote_path looks like gs://bucket_name/path/to/output.
If no remote_path is used, another bucket is used by default for "temporary" artifacts, which is tied to the execution id and current context.
c
you're unable to use the default temp execution-specific directory for passing data between tasks?
n
I can, but I want to save the output to a custom bucket.
c
it might be best to solve that independently of passing data between tasks. e.g., an archive task and a process-results task. the docs for FlyteDirectory have a warning about what you are seeing:
This class should not be used on very large datasets, as merely listing the dataset will cause
the entire dataset to be downloaded. Listing on S3 and other backend object stores is not consistent
and we should not need data to be downloaded to list.
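e.g., something along these lines (a sketch only; the actual upload to your datasets bucket is left as a comment since I don't know how you want it laid out):
import os
from flytekit import current_context, task, workflow
from flytekit.types.directory import FlyteDirectory

@task
def scrape() -> FlyteDirectory:
    out_dir = os.path.join(current_context().working_directory, "scraped")
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "page.json"), "w") as f:
        f.write("{}")
    return FlyteDirectory(path=out_dir)  # default, execution-scoped output location

@task
def process(inp: FlyteDirectory) -> FlyteDirectory:
    inp.download()  # only this run's scrape output is here
    out_dir = os.path.join(current_context().working_directory, "processed")
    os.makedirs(out_dir, exist_ok=True)
    # ... do the processing into out_dir ...
    return FlyteDirectory(path=out_dir)

@task
def archive(results: FlyteDirectory):
    # separate concern: persist this run's results to the custom datasets bucket,
    # e.g. with the google-cloud-storage client (upload code omitted on purpose)
    local = results.download()
    # upload the contents of `local` under gs://bucket_name/path/to/output/...

@workflow
def wf():
    processed = process(inp=scrape())
    archive(results=processed)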
n
I see, thanks