Hi,
In the script below I have two tasks: one downloads data from S3 to "flytekit.current_context().working_directory" and returns that working directory as a FlyteDirectory. The second task reads files from this directory. How can I ensure that both tasks always run on the same node in a Kubernetes cluster? Is that automatically handled by passing the FlyteDirectory?
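Here's a simplified sketch of what I mean (not the full script; the actual S3 fetch is elided and paths are placeholders):

```python
import os

import flytekit
from flytekit import task, workflow
from flytekit.types.directory import FlyteDirectory


@task
def download_data() -> FlyteDirectory:
    # Download from S3 into this task's local working directory.
    local_dir = os.path.join(flytekit.current_context().working_directory, "data")
    os.makedirs(local_dir, exist_ok=True)
    # ... fetch objects from S3 into local_dir (e.g. with boto3) ...
    return FlyteDirectory(path=local_dir)


@task
def train(data_dir: FlyteDirectory) -> None:
    # Read the files from the directory produced by download_data.
    local_path = data_dir.download()
    for name in os.listdir(local_path):
        print(name)


@workflow
def pipeline() -> None:
    train(data_dir=download_data())
```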
Thank you very much for your answer. What I am wondering about is whether the data downloaded in the "download_data" task is persisted locally, or whether it is always offloaded to S3 before being downloaded again in the "train" task. If it were the latter, that would make the download_data task useless, since the data would then be downloaded twice from S3.
limited-dog-47035
11/30/2022, 3:06 PM
Yes, if you're downloading from your own S3 source and not some externally managed S3 or other source, the download step is redundant. Otherwise, it will just download the data again in the second task.
One reason for downloading in a separate task would be if you wanted to make a copy of the artifacts in a specific output directory, which we've done in at least one of our workflows. This is especially useful if the source you're pulling the data from may change over time and you'd like to capture a copy of the data's state at a particular point in time.
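If the extra download is the main concern and you don't need that snapshot copy, a rough sketch of the single-task alternative (names and the fetch logic here are just placeholders) would be:

```python
import os

import flytekit
from flytekit import task


@task
def download_and_train() -> None:
    # Fetch from S3 and train in the same task, so the data is pulled once
    # and never offloaded to the blob store in between.
    local_dir = os.path.join(flytekit.current_context().working_directory, "data")
    os.makedirs(local_dir, exist_ok=True)
    # ... fetch objects from S3 into local_dir (e.g. with boto3) ...
    # ... train directly on the files in local_dir ...
    # A separate download task that returns a FlyteDirectory is still useful
    # when you want Flyte to persist a point-in-time copy of data whose
    # source may change.
```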