# ask-the-community
m
I have a long-running task that I want to cache. However, the underlying data may have changed, in which case I don’t want to hit the cache. I can pass an md5 hash of the changed data to the task, but that hash would only be used to change the task signature (input parameters) and would not actually be used by the task, which feels a bit ugly. Is there a cleaner way to achieve this?
I am thinking of something like defining a type that’s essentially a path-like string and has its own `HashMethod`. Would this work?
k
This should work; in fact, we actually want hashing to be applicable to all types. I think this work is planned - cc @Eduardo Apolinario (eapolinario) - a contribution would be ❤️
n
You can use `Annotated` types to compute a hash of incoming data so you don’t have to manually pass an md5 hash to the task, see here: https://docs.flyte.org/projects/cookbook/en/latest/auto/core/flyte_basics/task_cache.html#caching-of-non-flyte-offloaded-objects
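For reference, the pattern in that cookbook page looks roughly like this (a minimal sketch; the hash is computed on the DataFrame returned by an upstream task, so the downstream cached task keys its cache entry on that hash rather than on the offloaded data):

```python
import pandas
from flytekit import HashMethod, task, workflow
from typing_extensions import Annotated


def hash_pandas_dataframe(df: pandas.DataFrame) -> str:
    # Any deterministic hash of the data works; this is one simple option.
    return str(pandas.util.hash_pandas_object(df))


@task
def uncached_data_reading_task() -> Annotated[pandas.DataFrame, HashMethod(hash_pandas_dataframe)]:
    return pandas.DataFrame({"column_1": [1, 2, 3]})


@task(cache=True, cache_version="1.0")
def cached_data_processing_task(df: pandas.DataFrame) -> pandas.DataFrame:
    # Cache lookups use the hash produced above, not the offloaded literal.
    return df * 2


@workflow
def wf() -> pandas.DataFrame:
    df = uncached_data_reading_task()
    return cached_data_processing_task(df=df)
```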
m
@Niels Bantilan I was looking at precisely that, but with a `str` type. It doesn’t work though, perhaps because `str` is a primitive?
E.g. I have something like this, but the cached task doesn’t care whether the md5 changes; it always hits the cache:
```python
import hashlib

from flytekit import HashMethod, task, workflow
from typing_extensions import Annotated


@task
def hash_dataset_function(dataset_name: str) -> str:
    # Hash the DVC pointer file so the value changes whenever the dataset does
    with open(f"data/dataset/{dataset_name}.dvc", "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


@task
def get_dataset_name(process: str) -> Annotated[str, HashMethod(hash_dataset_function)]:
    return process


@task(cache=True, cache_version="1.0")
def cached_task(dataset_name: str) -> float:
    ...  # expensive computation elided


@workflow
def wf(process: str):
    dataset_name = get_dataset_name(process=process)
    always_cached = cached_task(dataset_name=dataset_name)
```
@Ketan (kumare3) Would love to make contributions but wouldn’t know where to look in this case.
I switched to another approach where I define a custom Flyte class decorated with `@dataclass` and `@dataclass_json` which carries the md5 checksum. If the other approach is supposed to work, let me know.
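For what it’s worth, a minimal sketch of that workaround (the `Dataset` dataclass and `load_dataset` task names here are made up for illustration): because the checksum is a regular field of the task input, it participates in the cache-key computation, so the cached task re-runs whenever the underlying data changes.

```python
import hashlib
from dataclasses import dataclass

from dataclasses_json import dataclass_json
from flytekit import task, workflow


@dataclass_json
@dataclass
class Dataset:
    name: str
    md5: str  # part of the task input, so it participates in the cache key


@task
def load_dataset(dataset_name: str) -> Dataset:
    # Hash the DVC pointer file and bundle it with the dataset name
    with open(f"data/dataset/{dataset_name}.dvc", "rb") as f:
        return Dataset(name=dataset_name, md5=hashlib.md5(f.read()).hexdigest())


@task(cache=True, cache_version="1.0")
def cached_task(dataset: Dataset) -> float:
    # Re-runs whenever dataset.md5 changes, otherwise hits the cache
    return 0.0  # placeholder computation


@workflow
def wf(dataset_name: str) -> float:
    dataset = load_dataset(dataset_name=dataset_name)
    return cached_task(dataset=dataset)
```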
k
Cc @Eduardo Apolinario (eapolinario) / @Yee can you please help here
@Martin Hwasser so sadly, today the custom hash is only implemented for StructuredDataset - that was just to support the most important type first, and the limitation is only in flytekit - docs here - https://docs.flyte.org/projects/cookbook/en/stable/auto/core/flyte_basics/task_cache.html
But if you look at this - https://github.com/flyteorg/flyteidl/blob/c4ea1f9824ce60b18d56d6bd109e89089f23c1ec/protos/flyteidl/core/literals.proto#L110 - the core literal supports hashes for all types, so we need to update flytekit for this.
y
I don’t really follow this. Will have to take a look tonight.
I’m not convinced that extending the hash annotation feature to non-pointer-types (which we should do regardless) will solve this.
I’ve started calling things like Blobs and StructuredDatasets pointer types.
and s3 is our heap 🙂
@Martin Hwasser, I just merged https://github.com/flyteorg/flytekit/pull/1171. This means that the proposal you have in https://flyte-org.slack.com/archives/CP2HDHKE1/p1663747357181369?thread_ts=1663677600.831079&cid=CP2HDHKE1 would work (with the caveat that you don’t need to annotate `hash_dataset_function` with `@task`). You can test it out by installing flytekit from master or by waiting for the next release (which should happen about 1 week from now).
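For completeness, here is roughly what the earlier snippet looks like with that caveat applied, i.e. with `hash_dataset_function` as a plain Python function rather than a task (an untested sketch against flytekit master after that PR, reusing the names from the original example):

```python
import hashlib

from flytekit import HashMethod, task, workflow
from typing_extensions import Annotated


def hash_dataset_function(dataset_name: str) -> str:
    # Plain function (no @task): used only to compute the cache hash
    with open(f"data/dataset/{dataset_name}.dvc", "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


@task
def get_dataset_name(process: str) -> Annotated[str, HashMethod(hash_dataset_function)]:
    return process


@task(cache=True, cache_version="1.0")
def cached_task(dataset_name: str) -> float:
    return 0.0  # placeholder computation


@workflow
def wf(process: str) -> float:
    dataset_name = get_dataset_name(process=process)
    return cached_task(dataset_name=dataset_name)
```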