@high-park-82026 - just to clarify: is it really the case that Flyte cannot be made to cache data copied from S3 to the EC2 compute node across tasks? From your response I'm not clear which of the benefits you list come from using mountpoint (and whether mountpoint is what actually provides the caching), and which come from the "other tricks" you mentioned at Union. Most importantly: is there some hint, flag, or clue I can give Flyte to cache results on the compute node across tasks, or do I instead need to re-architect my many-task workflow into one big monolithic task to avoid all this slow data copying?
I have a classification job I've just watched do this:
1. Train a classifier, which is 100s of GB in size. The trained classifier is copied from EC2 => S3 (1 hour copy)
2. In the "test" phase, the same classifier is copied back to the same node. ( 1.5 hour copy)
3. Now it's time to classify the "real" input data. We're on the same node, yay! But guess what -- we're copying the classifier down from S3, again. (1.5 hour copy)
I think something must be suboptimal in our network/file-io, because this is way too long even for 100s of GB. But we could save the 3 hours in steps 2 and 3 if only we didn't have to download something (twice!) that was just created on the same compute node.
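To make the shape concrete, here's roughly what that workflow looks like (a minimal flytekit sketch; the task names, the FlyteFile plumbing, and the cache flag are my approximations, not our actual code):

```python
# Minimal sketch of the workflow shape described above, assuming flytekit.
# All names here are hypothetical stand-ins for our real code.
from flytekit import task, workflow
from flytekit.types.file import FlyteFile


@task(cache=True, cache_version="1.0")  # memoizes the output record; the artifact itself still lives in S3
def train_classifier(training_data: FlyteFile) -> FlyteFile:
    model_path = "/tmp/model.bin"
    with open(model_path, "wb") as f:
        f.write(b"stub")  # stand-in for the real 100s-of-GB model
    # Step 1: Flyte uploads this output EC2 => S3 when the task returns (~1 hour).
    return FlyteFile(model_path)


@task
def test_classifier(model: FlyteFile) -> float:
    # Step 2: Flyte pulls the model S3 => node when we touch it (~1.5 hours),
    # even though it was just produced on this same node.
    local_path = model.download()
    return 0.0  # evaluation elided


@task
def classify(model: FlyteFile, real_data: FlyteFile) -> FlyteFile:
    # Step 3: the same S3 => node download happens again (~1.5 hours),
    # even when this pod lands on the same node as steps 1 and 2.
    local_path = model.download()
    out_path = "/tmp/predictions.bin"
    with open(out_path, "wb") as f:
        f.write(b"stub")  # classification elided
    return FlyteFile(out_path)


@workflow
def classification_wf(training_data: FlyteFile, real_data: FlyteFile) -> FlyteFile:
    model = train_classifier(training_data=training_data)
    test_classifier(model=model)
    return classify(model=model, real_data=real_data)
```

As far as I can tell, `cache=True` only memoizes the task's output record, so even a cache hit still hands downstream tasks an S3 reference that gets re-downloaded -- that's the part I'm hoping there's a flag for.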
Thanks for any clarification you can provide 🙏