Hey Flyte users, I'm curious what patterns / libraries people use for things like dataset / output caching. Flytekit does have a nice cache feature, and I appreciate that many users probably just read/write to S3 or similar blob storage. But suppose you have a flytekit task that:
• Reads from a large dataset, which could be cached locally on the machine somewhere (e.g. Hugging Face `datasets` can do this). That cache path might be a local volume mount.
• Wants to write intermediate data locally somewhere (e.g. another volume-mounted location), return some path to that data, and let the next task in the workflow worry about reading it, even if that task runs on some other machine. E.g. perhaps this data gets written to a local NFS export that other machines in the cluster can read from (rough sketch after this list).
◦ (This usage pattern is similar to a shuffle, where machines move data directly between each other instead of through e.g. a central S3 bucket.)
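To make the second bullet concrete, here's roughly the shape I have in mind (just a sketch; `/mnt/shared-nfs` is a made-up mount point that I'd assume is volume-mounted at the same path on every worker):

```python
from pathlib import Path

from flytekit import task, workflow

# Hypothetical shared mount, assumed to exist at the same path on every
# worker pod (e.g. an NFS export volume-mounted cluster-wide).
SHARED_ROOT = Path("/mnt/shared-nfs")


@task(cache=True, cache_version="1.0")
def produce_intermediate(run_id: str) -> str:
    """Write intermediate data to the shared mount, return only the path."""
    out_dir = SHARED_ROOT / "intermediate" / run_id
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "part-0.parquet").write_bytes(b"...")  # stand-in for real data
    # Flyte ships just this string between tasks, not the bytes themselves.
    return str(out_dir)


@task
def consume_intermediate(data_dir: str) -> int:
    """May run on a different machine; reads the same NFS-backed path."""
    return sum(1 for _ in Path(data_dir).iterdir())


@workflow
def wf(run_id: str = "demo") -> int:
    return consume_intermediate(data_dir=produce_intermediate(run_id=run_id))
```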
How do folks handle this case today? Or do folks tend to "design to avoid it"? E.g. just always write to S3, and then later delete old intermediate data from S3. Or, in the case of Ceph, there could be some architecting / configuration for local storage. Lastly, maybe it's common to just have a very powerful central SAN that can serve all Flyte workers well.
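For completeness, here's the "design to avoid it" version as I understand it, using flytekit's native offloading (again just a sketch; I believe FlyteDirectory uploads to whatever blob store the deployment is configured with):

```python
import tempfile
from pathlib import Path

from flytekit import task, workflow
from flytekit.types.directory import FlyteDirectory


@task(cache=True, cache_version="1.0")
def produce(run_id: str) -> FlyteDirectory:
    # Write locally; when the task returns, flytekit uploads the directory
    # to the configured blob store (S3, GCS, ...) and passes a reference
    # downstream instead of the raw bytes.
    local = Path(tempfile.mkdtemp())
    (local / "part-0.parquet").write_bytes(b"...")  # stand-in for real data
    return FlyteDirectory(path=str(local))


@task
def consume(d: FlyteDirectory) -> int:
    # Pull the directory from blob storage onto this machine.
    local = Path(d.download())
    return sum(1 for _ in local.iterdir())


@workflow
def wf(run_id: str = "demo") -> int:
    return consume(d=produce(run_id=run_id))
```

Here everything round-trips through the central bucket, which is exactly the cost I'm hoping to avoid in the shuffle-ish case above.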