Hey @tall-lock-23197, thanks for your response.
I looked at the PR you linked for file/directory streaming. It's not clear to me how this will help, since the very large file we download is a classifier -- we need the whole thing in memory before we can do any work. (Normally I think of streaming as useful when you can request parts of a large file that can be processed independently, and do work without downloading the whole file first.)
Are you suggesting that the underlying mechanism of FlyteFile.read() is significantly faster than FlyteDirectory.download()? The latter is what we currently use to pull data from s3.
I'll also admit that we're still learning how to manually intervene in the result load/save process. Flyte handles much of this transparently when a task returns a result and that result is then used by a downstream task: Flyte saves the result to s3 as it comes out of task1, and downloads it from s3 as it goes into the downstream task.
Our results also contain a FlyteDirectory, and in the downstream task we trigger an additional download via that FlyteDirectory, which holds our larger files.
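For reference, here's a rough sketch of that flow (the task names, the result dataclass, and the paths are just placeholders for our real types, and it assumes a flytekit version that accepts plain dataclasses as task outputs):

```python
from dataclasses import dataclass
from pathlib import Path

import flytekit
from flytekit import task, workflow
from flytekit.types.directory import FlyteDirectory


@dataclass
class TrainResult:
    # Small metadata lives in the result itself; the large artifacts
    # (the classifier, etc.) live under the FlyteDirectory.
    label: str
    artifacts: FlyteDirectory


@task
def train() -> TrainResult:
    # Write the large files locally; Flyte uploads the directory to the
    # backend s3 blob store when the task returns.
    out_dir = Path(flytekit.current_context().working_directory) / "artifacts"
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "classifier.bin").write_bytes(b"placeholder model bytes")
    return TrainResult(label="v1", artifacts=FlyteDirectory(str(out_dir)))


@task
def evaluate(result: TrainResult) -> None:
    # Flyte hands us the small result automatically; we then trigger the
    # additional (large) download ourselves via the FlyteDirectory.
    local_dir = result.artifacts.download()
    model_bytes = (Path(local_dir) / "classifier.bin").read_bytes()
    print(f"loaded {len(model_bytes)} bytes")


@workflow
def wf() -> None:
    evaluate(result=train())
```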
I'm not really clear on the distinction between FlyteFile.read() and the FlyteDirectory.download() call we currently use -- if the former is significantly faster, it may help.
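If the FlyteFile route is what you're suggesting, I'd guess it looks something like the snippet below -- I'm not sure of the exact method names the streaming PR adds, so treat the open()/read() part as my assumption:

```python
from flytekit import task
from flytekit.types.file import FlyteFile


@task
def load_classifier(model_file: FlyteFile) -> None:
    # What we effectively do today, but for a single file: pull it to
    # local disk, then read the whole thing into memory.
    local_path = model_file.download()
    with open(local_path, "rb") as f:
        model_bytes = f.read()
    print(f"loaded {len(model_bytes)} bytes")

    # My assumption about the streaming variant: read straight from the
    # remote store without an explicit download step first.
    # with model_file.open("rb") as f:
    #     model_bytes = f.read()
```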
We are working on an alternate solution in which the result that Flyte automatically downloads contains a path referencing the very large file, which lives elsewhere -- in this case in an s3 bucket mounted via Mountpoint for Amazon S3. Our hope is to gain from the additional throughput and caching that Mountpoint provides.
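Roughly, that alternate shape would look like this (the /mnt/s3/... path is a placeholder for wherever Mountpoint exposes the bucket on the node):

```python
from dataclasses import dataclass

from flytekit import task, workflow


@dataclass
class ModelRef:
    # Intentionally small result: metadata plus a path to the large
    # classifier, which lives in a bucket mounted on the node via
    # Mountpoint for Amazon S3 (path below is a placeholder).
    label: str
    classifier_path: str


@task
def publish_model() -> ModelRef:
    # The large file already exists (or is written) under the mount, so
    # only this small reference goes through Flyte's blob store.
    return ModelRef(label="v1", classifier_path="/mnt/s3/models/classifier.bin")


@task
def score(ref: ModelRef) -> None:
    # Reads go through Mountpoint, so we get its throughput and whatever
    # caching it provides for repeat reads on the same node.
    with open(ref.classifier_path, "rb") as f:
        model_bytes = f.read()
    print(f"loaded {len(model_bytes)} bytes from {ref.classifier_path}")


@workflow
def wf() -> None:
    score(ref=publish_model())
```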
This is the kind of thing I wondered whether other people do: use Flyte's built-in data management to set/get results in the backend s3 blob store, and then, inside those intentionally small-ish results, save references to large data that lives elsewhere -- elsewhere being anywhere that is faster, mountable, or cacheable, so copies are faster and tasks that land on the same node and need the same data can hit a cache. This is what @high-park-82026 seemed to hint at in the other thread I referred to.