# ask-the-community
Hi, in the script below I have two tasks: one downloads data from S3 into "flytekit.current_context().working_directory" and returns that directory as a FlyteDirectory. The second task reads files from this directory. How can I ensure that both tasks always run on the same node in a Kubernetes cluster? Is that automatically handled by passing a FlyteDirectory?
@task
def download_data() -> FlyteDirectory:
    working_dir = flytekit.current_context().working_directory
    # download data from S3 into "working_dir" here
    return FlyteDirectory(path=working_dir)

@task
def train(data_dir: FlyteDirectory) -> None:
    train_model(data_dir=data_dir)  # reads files from disk and trains the model
Passing a FlyteDirectory or FlyteFile between tasks gives the downstream task access to the source path, which users are expected to call open() on.
You can also use FlyteFile: https://docs.flyte.org/projects/flytekit/en/latest/types.builtins.file.html
Thank you very much for your answer. What I am wondering is whether the data downloaded in the "download_data" task is persisted locally, or whether it is always offloaded to S3 before being downloaded again in the "train" task. If it were the latter, the download_data task would be useless, since the data would then be downloaded twice from S3.
Yes, if you're downloading from your own S3 source and not some externally managed S3 or other source, the download step is redundant. Otherwise, the data will just be downloaded again in the second task. One reason for downloading in a separate task would be if you wanted to make a copy of the artifacts in a specific output directory, which we've done in at least one of our workflows. This is especially useful if the source you're pulling from may change over time and you'd like to capture a copy of the data's state at a particular time.
Thank you very much for the detailed answer.