Hello, I'm at a point with Flyte now where I'm asking myself how bigger data (we're talking GBs) is passed efficiently between tasks.
I can't imagine that, if I have a setup on e.g. AWS, that larger output/input data is synced with S3 all the time. Or am I missing something?
What would be the way to exchange big data between tasks inside a workflow without uploading and downloading it again for every task?
I could imagine sharing it over a mounted volume from the k8s cluster, but that would probably interfere with the caching mechanism at some point, right?
03/02/2023, 3:05 PM
To be honest, it might look like premature optimization. How big are you talking?
By the way the entire data subsystem is getting a refresh.
Once this https://github.com/flyteorg/flytekit/pull/1512 lands, you will be able to stream data.
You should not need to download and upload things unless you transform them.
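(The PR above is about streaming support in flytekit; as a rough illustration of the underlying idea, here is a plain-Python sketch of chunked streaming, where a file is processed in fixed-size chunks instead of being materialized in memory. This is not the flytekit API itself, just the general pattern; the function name and chunk size are illustrative.)

```python
import hashlib
import io


def stream_checksum(fileobj, chunk_size=1024 * 1024):
    """Compute a SHA-256 over a file-like object in fixed-size chunks,
    so memory use stays bounded regardless of how large the file is."""
    digest = hashlib.sha256()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()


# The same code works whether the file-like object wraps local disk,
# an S3 object (e.g. opened via fsspec), or any other backend.
data = io.BytesIO(b"x" * (3 * 1024 * 1024))  # stand-in for a multi-GB blob
print(stream_checksum(data))
```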
And finally, you can use EFS, FSx for Lustre, or shared volumes using pod templates.
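(As a hedged sketch of the shared-volume approach: a Kubernetes pod template that mounts a PersistentVolumeClaim, e.g. backed by EFS, into task pods. All names here, `flyte-template`, `shared-data-pvc`, `/mnt/shared`, and the placeholder image, are illustrative assumptions, not Flyte defaults; consult the Flyte pod template docs for the exact conventions.)

```yaml
# Hypothetical default pod template mounting a shared volume into task pods.
apiVersion: v1
kind: PodTemplate
metadata:
  name: flyte-template
  namespace: flyte
template:
  spec:
    containers:
      - name: default            # placeholder container; Flyte merges task settings in
        image: docker.io/library/busybox
        volumeMounts:
          - name: shared-data
            mountPath: /mnt/shared
    volumes:
      - name: shared-data
        persistentVolumeClaim:
          claimName: shared-data-pvc   # e.g. a PVC backed by EFS or FSx for Lustre
```

Tasks could then read and write under `/mnt/shared` directly, though (as noted above) data passed this way bypasses Flyte's blob-store-backed caching.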
But @Broder Peters, would you be open to a chat? We would love to understand what you are seeing and how we can make this even better.
We are working on cool things; we want to make it simple yet efficient and correct.
03/03/2023, 8:32 AM
Thanks for the feedback!
I figured that volume binding would be the way to go in my scenario.
About a more concrete scenario, I will come back to you once I'm deeper into the topic.
03/03/2023, 2:55 PM
Yes please, this will help.
Let's make it possible for Flyte users to keep things simple.