# ask-the-community
v
Hi Team, we have been using Flyte for almost a year now. Over that time the Flyte storage in our S3 bucket has grown to about 5.5 TB, so I would like to know: is there any setting in Flyte that automatically cleans up and deletes the metadata of executions older than x number of days?
We tried setting up a cleanup policy at the S3 bucket level, but it caused our workflows to fail and we had to do the deployment again.
c
Also planning to set this up, as during eval my bucket has grown to almost 30 GB in just 2 months, outside of production. I'm pretty sure a lifecycle policy at the bucket level would be enough: any object in `data`, `metadata/propeller`, and `metadata/{project}/{domain}` older than e.g. 30 days (or whatever your policy should be) can be deleted, as these are the inputs/outputs generated during executions (and these grow the most). The rest I wouldn't touch and instead manage via Flyte, as the remaining objects in the bucket are about workflow/task definitions.
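A minimal sketch of that idea for S3, assuming boto3 and a placeholder bucket name; the prefixes and the 30-day retention are just the examples from this thread (the `{project}/{domain}` values shown are hypothetical), so adjust them to your own layout before applying:

```python
"""Hedged sketch: S3 lifecycle rules that expire execution inputs/outputs.

"my-flyte-bucket" and the project/domain prefix are placeholders; the
30-day retention is only the example discussed in this thread.
"""
import boto3

s3 = boto3.client("s3")

# One expiration rule per prefix that holds execution inputs/outputs.
rules = [
    {
        "ID": f"expire-{prefix.rstrip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": 30},
    }
    for prefix in (
        "data/",
        "metadata/propeller/",
        "metadata/flytesnacks/development/",  # metadata/{project}/{domain}
    )
]

s3.put_bucket_lifecycle_configuration(
    Bucket="my-flyte-bucket",
    LifecycleConfiguration={"Rules": rules},
)
```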
Also just ran `gsutil du` (I'm on GCP). Of the 26.7 GB in my bucket, almost all of it (99.9%) is in `data`. So just setting a policy on that folder in the bucket should be enough to keep costs down (and I imagine it wouldn't break other things, unless you happen to run your workflows from 30+ day old cached results).
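For the GCP side, a rough sketch of that single-folder rule with the google-cloud-storage client; the bucket name is a placeholder and the `matches_prefix` condition needs a reasonably recent client library, so double-check it against your version:

```python
"""Hedged sketch: GCS lifecycle rule deleting objects under data/ after
30 days. Bucket name and retention are placeholders."""
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-flyte-bucket")  # placeholder name

# Delete anything under the data/ prefix once it is older than 30 days.
bucket.add_lifecycle_delete_rule(age=30, matches_prefix=["data/"])
bucket.patch()  # push the updated lifecycle config to the bucket
```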
v
I added the policy, but it deleted some critical files that are required to run the Flyte tasks and workflows, and that broke my pipeline. Because of this I had to redeploy the full workflow code. So I was wondering whether there is any setting in Flyte that can do the cleanup while making sure it doesn't delete the required files.
h
Hey @Vipul Goswami, very important questions 🙂 thank you for kicking off this discussion... There are a few types of data that Flyte could do a better job of segregating so you can manage their lifecycle separately:

1. Fast-register archives (if you run `pyflyte run --remote`): these are tar files containing your code as you iterate with that command. If you always build new Docker images with your code changes, you won't see these tar files. --> `/metadata`
2. Intermediate inputs/outputs: primitive data produced by tasks/workflows, as well as Flyte decks, errors, etc. --> `/metadata`
3. Raw data: offloaded data types (e.g. FlyteFile) --> `/<two letters>/...`

If you delete 2 and/or 3, old execution pages won't render properly (you may be OK with that) and cache lookups of old executions will result in dangling pointers (pointers that point nowhere). There is a setting on Propeller, MaxCacheAge (or something along those lines), to prevent such cases... you should generally set it to about half the lifecycle policy. If you delete 1, you will fail to rerun old versions of the workflows that were run via fast register. The problem is that you can't easily distinguish between 1 and 2 (they both get written to `/metadata`)... but maybe you can set the policy by file extension? I think a combination of that + the MaxCacheAge should be sufficient to keep your data usage in check.
v
Thanks @Haytham Abuelfutuh for the valuable and detailed input. I will try these steps and update you with the outcome. Much appreciate your help and time, thanks!