Hi,
In the script below I have two tasks: one downloads data from S3 to "flytekit.current_context().working_directory" and returns that working directory as a FlyteDirectory. The second task reads files from this directory. How can I ensure that both tasks always run on the same node in a Kubernetes cluster? Is that automatically handled by passing the FlyteDirectory?
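Here's a simplified sketch of what I mean (not the full script; the actual S3 fetch is elided and paths are placeholders):

```python
import os

import flytekit
from flytekit import task, workflow
from flytekit.types.directory import FlyteDirectory


@task
def download_data() -> FlyteDirectory:
    # Download from S3 into this task's local working directory.
    local_dir = os.path.join(flytekit.current_context().working_directory, "data")
    os.makedirs(local_dir, exist_ok=True)
    # ... fetch objects from S3 into local_dir (e.g. with boto3) ...
    return FlyteDirectory(path=local_dir)


@task
def train(data_dir: FlyteDirectory) -> None:
    # Read the files from the directory produced by download_data.
    local_path = data_dir.download()
    for name in os.listdir(local_path):
        print(name)


@workflow
def pipeline() -> None:
    train(data_dir=download_data())
```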
Thank you very much for your answer. What I am wondering about is whether the data downloaded in the "download_data" task is persisted locally, or whether it is always offloaded to S3 before being downloaded again in the "train" task. If it were the latter, that would make the download_data task useless, since the data would then be downloaded twice from S3.
limited-dog-47035
11/30/2022, 3:06 PM
Yes, if you're downloading from your own S3 source and not some externally managed S3 or other source, the download step is redundant. Otherwise, it will just download the data again in the second task.
One reason for downloading in a separate task would be if you wanted to make a copy of the artifacts in a specific output directory, which we've done in at least one of our workflows. This is especially useful if the source you're pulling the data from may change over time and you'd like to capture a copy of the data's state at a particular point in time.
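If the extra download is the main concern and you don't need that snapshot copy, a rough sketch of the single-task alternative (names and the fetch logic here are just placeholders) would be:

```python
import os

import flytekit
from flytekit import task


@task
def download_and_train() -> None:
    # Fetch from S3 and train in the same task, so the data is pulled once
    # and never offloaded to the blob store in between.
    local_dir = os.path.join(flytekit.current_context().working_directory, "data")
    os.makedirs(local_dir, exist_ok=True)
    # ... fetch objects from S3 into local_dir (e.g. with boto3) ...
    # ... train directly on the files in local_dir ...
    # A separate download task that returns a FlyteDirectory is still useful
    # when you want Flyte to persist a point-in-time copy of data whose
    # source may change.
```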