# ask-the-community
m
I have a long-running task that I want to cache. However, the underlying data may have changed, in which case I don’t want to hit the cache. I can pass an md5 hash of the changed data to the task, but that hash would only be used to change the task signature (input parameters) and would not actually be used by the task, which feels a bit ugly. Is there a cleaner way to achieve this?
I am thinking of something like defining a type that’s essentially a path-like string and has its own `HashMethod`. Would this work?
k
This should work; in fact, we actually want hashing to be applicable to all types. I think this work is planned - cc @Eduardo Apolinario (eapolinario) - a contribution would be ❤️
n
You can use `Annotated` types to compute a hash of incoming data so you don’t have to manually pass an md5 hash to the task, see here: https://docs.flyte.org/projects/cookbook/en/latest/auto/core/flyte_basics/task_cache.html#caching-of-non-flyte-offloaded-objects
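For reference, the pattern in that cookbook page looks roughly like this (a minimal sketch; the hash is computed on the DataFrame returned by an upstream task, so the downstream cached task keys its cache entry on that hash rather than on the offloaded data):

```python
import pandas
from flytekit import HashMethod, task, workflow
from typing_extensions import Annotated


def hash_pandas_dataframe(df: pandas.DataFrame) -> str:
    # Any deterministic hash of the data works; this is one simple option.
    return str(pandas.util.hash_pandas_object(df))


@task
def uncached_data_reading_task() -> Annotated[pandas.DataFrame, HashMethod(hash_pandas_dataframe)]:
    return pandas.DataFrame({"column_1": [1, 2, 3]})


@task(cache=True, cache_version="1.0")
def cached_data_processing_task(df: pandas.DataFrame) -> pandas.DataFrame:
    # Cache lookups use the hash produced above, not the offloaded literal.
    return df * 2


@workflow
def wf() -> pandas.DataFrame:
    df = uncached_data_reading_task()
    return cached_data_processing_task(df=df)
```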
m
@Niels Bantilan I was looking at precisely that, but with a `str` type. It doesn’t work though, perhaps because `str` is a primitive?
E.g. I have something like this, but the cached task doesn’t care whether the md5 changes; it always hits the cache:
```python
import hashlib

from flytekit import HashMethod, task, workflow
from typing_extensions import Annotated


@task
def hash_dataset_function(dataset_name: str) -> str:
    # Hash the DVC pointer file so the value changes whenever the dataset does
    with open(f"data/dataset/{dataset_name}.dvc", "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


@task
def get_dataset_name(process: str) -> Annotated[str, HashMethod(hash_dataset_function)]:
    return process


@task(cache=True, cache_version="1.0")
def cached_task(dataset_name: str) -> float:
    ...  # expensive computation elided


@workflow
def wf(process: str):
    dataset_name = get_dataset_name(process=process)
    always_cached = cached_task(dataset_name=dataset_name)
```
@Ketan (kumare3) Would love to make contributions but wouldn’t know where to look in this case.
I switched to another approach where I define a custom Flyte class decorated with `@dataclass` and `@dataclass_json` which carries the md5 checksum. If the other approach is supposed to work, let me know.
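For what it’s worth, a minimal sketch of that workaround (the `Dataset` dataclass and `load_dataset` task names here are made up for illustration): because the checksum is a regular field of the task input, it participates in the cache-key computation, so the cached task re-runs whenever the underlying data changes.

```python
import hashlib
from dataclasses import dataclass

from dataclasses_json import dataclass_json
from flytekit import task, workflow


@dataclass_json
@dataclass
class Dataset:
    name: str
    md5: str  # part of the task input, so it participates in the cache key


@task
def load_dataset(dataset_name: str) -> Dataset:
    # Hash the DVC pointer file and bundle it with the dataset name
    with open(f"data/dataset/{dataset_name}.dvc", "rb") as f:
        return Dataset(name=dataset_name, md5=hashlib.md5(f.read()).hexdigest())


@task(cache=True, cache_version="1.0")
def cached_task(dataset: Dataset) -> float:
    # Re-runs whenever dataset.md5 changes, otherwise hits the cache
    return 0.0  # placeholder computation


@workflow
def wf(dataset_name: str) -> float:
    dataset = load_dataset(dataset_name=dataset_name)
    return cached_task(dataset=dataset)
```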
k
Cc @Eduardo Apolinario (eapolinario) / @Yee can you please help here
@Martin Hwasser so sadly, today the custom hash is only implemented for StructuredDataset - that was just to support the most important type first, and the limitation is only in flytekit - docs here - https://docs.flyte.org/projects/cookbook/en/stable/auto/core/flyte_basics/task_cache.html
But if you look at this - https://github.com/flyteorg/flyteidl/blob/c4ea1f9824ce60b18d56d6bd109e89089f23c1ec/protos/flyteidl/core/literals.proto#L110 - the core literal supports hashes for all types, so we need to update flytekit for this.
y
I don’t really follow this. Will have to take a look tonight.
I’m not convinced that extending the hash annotation feature to non-pointer-types (which we should do regardless) will solve this.
I’ve started calling things like Blobs and StructuredDatasets pointer types.
and s3 is our heap 🙂
@Martin Hwasser, I just merged https://github.com/flyteorg/flytekit/pull/1171. This means that the proposal you have in https://flyte-org.slack.com/archives/CP2HDHKE1/p1663747357181369?thread_ts=1663677600.831079&cid=CP2HDHKE1 would work (with the caveat that you don’t need to annotate `hash_dataset_function` with `@task`). You can test it out by installing flytekit from master or by waiting for the next release (which should happen about 1 week from now).
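For completeness, here is roughly what the earlier snippet looks like with that caveat applied, i.e. with `hash_dataset_function` as a plain Python function rather than a task (an untested sketch against flytekit master after that PR, reusing the names from the original example):

```python
import hashlib

from flytekit import HashMethod, task, workflow
from typing_extensions import Annotated


def hash_dataset_function(dataset_name: str) -> str:
    # Plain function (no @task): used only to compute the cache hash
    with open(f"data/dataset/{dataset_name}.dvc", "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


@task
def get_dataset_name(process: str) -> Annotated[str, HashMethod(hash_dataset_function)]:
    return process


@task(cache=True, cache_version="1.0")
def cached_task(dataset_name: str) -> float:
    return 0.0  # placeholder computation


@workflow
def wf(process: str) -> float:
    dataset_name = get_dataset_name(process=process)
    return cached_task(dataset_name=dataset_name)
```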