Thread
#feature-discussions
    Andrew Achkar

    3 weeks ago
    Hello, curious if anyone else has had a need for a FlyteFile that has an associated lifetime - whereby it would be automatically cleaned up after a user-specified timeframe. This could be along the lines of compliance with data retention policies.
    Ketan (kumare3)

    3 weeks ago
    @Andrew Achkar can you not do this directly on the target S3 bucket?
    Andrew Achkar

    3 weeks ago
    Good point Ketan, you can definitely use S3 bucket lifecycle policies, but they are fairly coarse. Tag filtering could be used to be a bit more granular, up to a point (there is a limit on the number of lifecycle rules a bucket can have). So I think that would cover many cases, yes. Still, files managed by Flyte (i.e. FlyteFile) would need to be created with specific tags that match those policies - not sure if that is controllable, and to what degree. Maybe I want some files to expire (retention policy) and others I don’t mind keeping around longer. So, I was imagining more fine-grained control where per-file expiry timestamps could be specified, but I think you could manage with some tagging convention paired with a bucket lifecycle policy.
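A tag-filtered lifecycle rule of the kind described above might look like the following (bucket layout, tag key, and retention window are illustrative, not anything Flyte sets today):

```json
{
  "Rules": [
    {
      "ID": "expire-tagged-intermediates",
      "Status": "Enabled",
      "Filter": {
        "Tag": { "Key": "retention", "Value": "30d" }
      },
      "Expiration": { "Days": 30 }
    }
  ]
}
```

Objects written with the `retention=30d` tag would then be deleted by S3 roughly 30 days after creation; everything else in the bucket is untouched.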
    Ketan (kumare3)

    3 weeks ago
    would love to hear a proposal
    feel free to do it and we can share it in one of the community meetings
    Andrew Achkar

    3 weeks ago
    Sure, for a proposal do you mean a document, a GitHub issue, a sketch of the programmatic API? Also, it occurred to me that even if we can solve the S3 case via bucket policies, supporting other cloud vendors makes that implementation more complicated. It feels like it might be a better fit for something Flyte keeps track of more directly.
    Ketan (kumare3)

    3 weeks ago
    hmm
    ya an issue or there is an RFC process
    check the flyte repo
    Andrew Achkar

    3 weeks ago
    I’ve put together a description as an issue here: https://github.com/flyteorg/flyte/issues/2832
    I wonder if one way to represent this might be an `Expireable[T]` annotation. So it could be `Expireable[pd.DataFrame]` or `Expireable[FlyteFile]`, and tasks could return their object wrapped in this type.
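As a rough illustration of what such a wrapper could look like - a pure-Python sketch only; `Expireable` and its methods are hypothetical and not part of flytekit:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass
class Expireable(Generic[T]):
    """Hypothetical wrapper pairing a value with an expiry timestamp.

    In the proposal, a task could declare a return type like
    Expireable[FlyteFile] and the platform would delete the backing
    object once expires_at has passed.
    """

    value: T
    expires_at: datetime

    @classmethod
    def in_days(cls, value: T, days: int) -> "Expireable[T]":
        # Convenience constructor: expire `days` from now (UTC).
        return cls(value, datetime.now(timezone.utc) + timedelta(days=days))

    @property
    def expired(self) -> bool:
        return datetime.now(timezone.utc) >= self.expires_at


# e.g. wrap a (here just pretend) file reference with a 30-day lifetime
wrapped = Expireable.in_days("s3://bucket/intermediate.parquet", days=30)
print(wrapped.expired)  # False right after creation
```

A garbage-collection pass (in Flyte or a sidecar job) could then scan stored metadata for `expired` entries and delete the underlying blobs.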
    As a workaround, we noticed we could probably use StructuredDataset’s uri field to control more explicitly where files end up in S3, and hence apply bucket lifecycle policies to those objects specifically (as a group, rather than at the per-file granularity I was suggesting would be ideal). In case someone else has a similar need, that seems to be a viable approach if the data can be made into a StructuredDataset.
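One way to make that workaround systematic is a small helper that routes expiring outputs under a dedicated prefix which a single bucket lifecycle rule then covers. The helper and the path layout below are illustrative, not a flytekit API:

```python
# Illustrative helper, not a flytekit API: build a URI under a prefix that a
# bucket lifecycle rule is configured to expire.
def expiring_uri(base: str, name: str, prefix: str = "expiring") -> str:
    """Place `name` under `base/prefix/` so one lifecycle rule covers it."""
    return f"{base.rstrip('/')}/{prefix}/{name}"


uri = expiring_uri("s3://my-bucket/data", "intermediate.parquet")
# A task could then return StructuredDataset(dataframe=df, uri=uri) so the
# object lands under the expiring prefix.
print(uri)  # s3://my-bucket/data/expiring/intermediate.parquet
```

The lifecycle rule then filters on the `expiring/` key prefix instead of tags, which sidesteps the per-object tagging question entirely.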
    Ketan (kumare3)

    3 weeks ago
    You can control this in flytefile too
    Also you can set the raw output prefix per execution and all raw data will be put in that prefix only
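For the per-execution prefix, something along these lines should work from the CLI; treat the exact flag name as something to verify against your flytekit version:

```shell
# Hedged sketch: route all raw output data of one execution under a prefix
# that a bucket lifecycle rule expires. Bucket and workflow names are
# illustrative; verify the flag against your flytekit version.
pyflyte run --remote \
  --raw-output-data-prefix s3://my-bucket/expiring/ \
  workflows.py my_wf
```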
    Andrew Achkar

    3 weeks ago
    Oh? How do I set that in FlyteFile? I saw that you could set the prefix per execution; that can help for some scenarios too, but I am picturing, say, an intermediate dataset needing to be deleted while an output model artifact needs to be retained, as part of the same workflow / execution.
    Ketan (kumare3)

    3 weeks ago
    I would still then just copy the model to a different location as a step
    Andrew Achkar

    3 weeks ago
    Sure, that is always an option, or conversely use manual file management for files needing to obey retention rules. Just was looking for some elegant way to solve it more generally 🙂
    Ketan (kumare3)

    3 weeks ago
    ya, i think instead of manual file management, just copying the last bit over is nicer. it scales better - basically the end prefix can be learned
    cc @Yee - FlyteFile custom path?