Hey Flyte users, I'm curious what patterns / libraries people use for things like dataset / output caching. Flytekit does have a nice cache feature, and I appreciate that many users probably just read/write to S3 or similar blob storage. But suppose you have a flytekit task that:
• Reads from a large dataset, which could be cached locally on the machine somewhere (e.g. Hugging Face `datasets` can do this). That cache path might be a local volume mount.
• Wants to write intermediate data locally somewhere (e.g. another volume-mounted location), return some path to that data, and let the next task in the workflow worry about reading it, even if that task runs on some other machine. E.g. perhaps this data gets written to a local NFS export that other machines in the cluster can read from (rough sketch after this list).
◦ (This usage pattern is similar to a shuffle, where machines move data directly between each other instead of through e.g. a central S3 bucket.)
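To make the second bullet concrete, here's roughly the shape I have in mind (just a sketch; `/mnt/shared-nfs` is a made-up mount point that I'd assume is volume-mounted at the same path on every worker):

```python
from pathlib import Path

from flytekit import task, workflow

# Hypothetical shared mount, assumed to exist at the same path on every
# worker pod (e.g. an NFS export volume-mounted cluster-wide).
SHARED_ROOT = Path("/mnt/shared-nfs")


@task(cache=True, cache_version="1.0")
def produce_intermediate(run_id: str) -> str:
    """Write intermediate data to the shared mount, return only the path."""
    out_dir = SHARED_ROOT / "intermediate" / run_id
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "part-0.parquet").write_bytes(b"...")  # stand-in for real data
    # Flyte ships just this string between tasks, not the bytes themselves.
    return str(out_dir)


@task
def consume_intermediate(data_dir: str) -> int:
    """May run on a different machine; reads the same NFS-backed path."""
    return sum(1 for _ in Path(data_dir).iterdir())


@workflow
def wf(run_id: str = "demo") -> int:
    return consume_intermediate(data_dir=produce_intermediate(run_id=run_id))
```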
How do folks handle this case today? Or do folks tend to "design to avoid it"? E.g. just always write to S3, and then later delete old intermediate data from S3. Or, in the case of Ceph, there could be some architecting / configuration for local storage. Lastly, maybe it's common to just have a very powerful central SAN that can serve all Flyte workers well.
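For completeness, here's the "design to avoid it" version as I understand it, using flytekit's native offloading (again just a sketch; I believe FlyteDirectory uploads to whatever blob store the deployment is configured with):

```python
import tempfile
from pathlib import Path

from flytekit import task, workflow
from flytekit.types.directory import FlyteDirectory


@task(cache=True, cache_version="1.0")
def produce(run_id: str) -> FlyteDirectory:
    # Write locally; when the task returns, flytekit uploads the directory
    # to the configured blob store (S3, GCS, ...) and passes a reference
    # downstream instead of the raw bytes.
    local = Path(tempfile.mkdtemp())
    (local / "part-0.parquet").write_bytes(b"...")  # stand-in for real data
    return FlyteDirectory(path=str(local))


@task
def consume(d: FlyteDirectory) -> int:
    # Pull the directory from blob storage onto this machine.
    local = Path(d.download())
    return sum(1 for _ in local.iterdir())


@workflow
def wf(run_id: str = "demo") -> int:
    return consume(d=produce(run_id=run_id))
```

Here everything round-trips through the central bucket, which is exactly the cost I'm hoping to avoid in the shuffle-ish case above.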