# feature-discussions
t
Hello, curious if anyone else has had a need for a FlyteFile that has an associated lifetime - whereby it would be automatically cleaned up after a user-specified timeframe. This could help with things like compliance with data retention policies.
f
@thankful-dress-89577 can you not do this directly on the target S3 bucket?
t
Good point Ketan, you can definitely use S3 bucket lifecycle policies, but it is fairly coarse. Tag filtering could be used to be a bit more granular, up to a point (there is a limit on the number of lifecycle policies a bucket can have), so I think that would cover many cases, yes. Still, files managed by Flyte (i.e. FlyteFile) would need to be created with specific tags that match those policies - not sure if that is controllable and to what degree. Maybe I want some files to expire (retention policy) and others I don't mind keeping around longer. So I was imagining more fine-grained control where per-file expiry timestamps could be specified, but I think you could manage with just a tagging convention paired with the bucket lifecycle policy.
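For concreteness, a tag-filtered lifecycle rule could look roughly like this (boto3 sketch; the bucket name, tag key/value, and retention period are just placeholders):

```python
# Sketch: a tag-filtered S3 lifecycle rule, applied with boto3.
# The bucket name, tag, and retention period below are made-up examples.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-flyte-data-bucket",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-short-retention-objects",
                "Status": "Enabled",
                # Only objects carrying this tag get expired.
                "Filter": {"Tag": {"Key": "retention", "Value": "30d"}},
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```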
f
would love to hear a proposal
feel free to do it and we can share it in one of the community meetings
t
Sure, for a proposal do you mean a document, a GitHub issue, or a sketch of the programmatic API? Also, it occurred to me that even if we can solve the S3 case via bucket policies, supporting other cloud vendors makes it more complicated if that is the implementation. It feels like it might be a better fit for something Flyte keeps track of more directly.
f
hmm
ya an issue or there is an RFC process
check the flyte repo
t
I’ve put together a description as an issue here: https://github.com/flyteorg/flyte/issues/2832
I wonder if one way to represent this might be an Expireable[T] annotation. So it could be Expireable[pd.DataFrame] or Expireable[FlyteFile], and tasks could return their object wrapped in this annotation.
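Roughly the API shape I have in mind (purely hypothetical - none of this exists in flytekit today, and the Expire marker is made up):

```python
# Hypothetical sketch of the Expireable idea -- nothing here is a real flytekit API.
from dataclasses import dataclass
from typing import Annotated


@dataclass
class Expire:
    """Imaginary per-output retention metadata (delete the backing object after N days)."""
    days: int


# Expireable[pd.DataFrame] could simply be sugar for Annotated[pd.DataFrame, Expire(...)],
# so a task signature might read:
#
#   @task
#   def make_intermediate() -> Annotated[pd.DataFrame, Expire(days=30)]:
#       ...
#
# and Flyte would record the expiry alongside the offloaded literal, then clean up
# the object in blob storage once the period has elapsed.
```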
As a workaround, we noticed we could probably use StructuredDataset's uri field to control more explicitly where files end up in S3, and hence apply bucket lifecycle policies to those objects specifically (as a group, rather than the per-file granularity I was suggesting would be ideal). In case someone else has a similar need, that seems to be a viable approach if the data can be made into a StructuredDataset.
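i.e. something along these lines (sketch; the bucket and the "expiring/" prefix are placeholders, paired with a prefix-scoped lifecycle rule on the bucket):

```python
# Sketch of the workaround: pin the StructuredDataset to a known prefix so a
# lifecycle rule scoped to that prefix expires these objects as a group.
import pandas as pd
from flytekit import task
from flytekit.types.structured import StructuredDataset


@task
def make_intermediate() -> StructuredDataset:
    df = pd.DataFrame({"x": [1, 2, 3]})
    # Written under a dedicated "expiring/" prefix (placeholder) that the
    # bucket lifecycle rule targets.
    return StructuredDataset(dataframe=df, uri="s3://my-flyte-data-bucket/expiring/intermediate")
```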
f
You can control this in FlyteFile too
Also you can set the raw output prefix per execution and all raw data will be put in that prefix only
t
Oh? How do I set that in FlyteFile? I saw that you could set the prefix per execution, which can help for some scenarios too, but I am picturing, say, an intermediate dataset needing to be deleted but an output model artifact needing to be retained, as part of the same workflow / execution.
f
I would still then just copy the model to a different location as a step
t
Sure, that is always an option - or conversely, use manual file management for files that need to obey retention rules. I was just looking for an elegant way to solve it more generally 🙂
f
ya, i think instead of manual file management, just copying the last bit over is nicer - it scales better. basically the end prefix can be learnt
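Something like this (sketch; the prefixes and file names are placeholders, and a prefix-scoped lifecycle rule would cover everything outside "retained/"):

```python
# Sketch of the "copy the last bit over" pattern: a final task re-uploads the
# artifact it wants to keep to a retained prefix, so everything else can expire.
import shutil

from flytekit import task
from flytekit.types.file import FlyteFile


@task
def retain_model(model: FlyteFile) -> FlyteFile:
    local = model.download()  # pull the intermediate copy locally
    kept = "/tmp/model-retained.bin"
    shutil.copyfile(local, kept)
    # Re-upload to a prefix that the expiry lifecycle rule does not cover.
    return FlyteFile(path=kept, remote_path="s3://my-flyte-data-bucket/retained/model.bin")
```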
cc @thankful-minister-83577 - FlyteFile custom path?