Hi everyone, we're using Flyte 1.15 and I'm strugg...
# flyte-support
f
Hi everyone, we're using Flyte 1.15 and I'm struggling to grasp the caching system when dataclasses containing StructuredDatasets are involved. I tried a custom HashMethod, but to no avail. I've built the following minimal example to explain the issue I'm facing:
from dataclasses import dataclass
from typing import Annotated
from flytekit import Cache, HashMethod, StructuredDataset, task, workflow
import pandas as pd
import logging


@dataclass
class Data:
    metadata: str
    df: StructuredDataset


def hash_pandas_dataframe(df: pd.DataFrame) -> str:
    return str(pd.util.hash_pandas_object(df))


def hash_data(data: Data) -> str:
    # I cannot access the pd.DataFrame in the hash function?
    return str(pd.util.hash_pandas_object(data.df.open(pd.DataFrame).all()))


@task
def generate_data_a() -> Annotated[Data, HashMethod(hash_data)]:
    data = Data(
        metadata="hello",
        df=StructuredDataset(pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})),
    )
    return data


@task(cache=Cache(version="1.3"))
def process_data_a(data: Data) -> bool:
    logging.error(f"process_data_a: {data.df.open(pd.DataFrame).all()}")
    return True


@task
def generate_data_b() -> Annotated[pd.DataFrame, HashMethod(hash_pandas_dataframe)]:
    return pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})


@task(cache=Cache(version="1.3"))
def process_data_b(data: pd.DataFrame) -> bool:
    logging.error(f"process_data_b: {data}")
    return True


@workflow
def cache_workflow() -> None:
    # With the custom HashMethod for the `Data` object this crashes, but perhaps that is not the correct way to do it anyway
    data_a = generate_data_a()
    process_data_a(data_a)

    # This caches correctly
    data_b = generate_data_b()
    process_data_b(data_b)

    return
f
That is indeed right. I think we eagerly uploaded it. Let me check. It might be a good catch, and a hard one to fix. One solution would be to calculate the hash in your own code and set it for your function.
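As a rough illustration of "calculate the hash in code": the sketch below computes a content digest for a DataFrame while it is still in memory, before it is wrapped in a StructuredDataset. The helper name `dataframe_digest` is hypothetical and not part of flytekit; it only uses `pd.util.hash_pandas_object` and `hashlib`. It feeds the raw row hashes into SHA-256 rather than calling `str()` on the resulting Series, since pandas abbreviates the string form of long Series by default.

```python
import hashlib

import pandas as pd


def dataframe_digest(df: pd.DataFrame) -> str:
    """Return a hex digest of the frame's column names, index and values.

    Hypothetical helper for illustration; not a flytekit API.
    """
    hasher = hashlib.sha256()
    # hash_pandas_object yields one uint64 per row and does not cover
    # column names, so fold those in separately.
    hasher.update(",".join(map(str, df.columns)).encode("utf-8"))
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    hasher.update(row_hashes.to_numpy().tobytes())
    return hasher.hexdigest()


df_1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df_2 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df_3 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 7]})

# Equal content gives equal digests; a changed cell gives a different one.
print(dataframe_digest(df_1) == dataframe_digest(df_2))
print(dataframe_digest(df_1) == dataframe_digest(df_3))
```

The resulting string could then be passed along explicitly, so that the cache key no longer depends on reading the dataset back inside the hash function.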
f
So I'd pass the hash as one of the method arguments? And then, I assume, have the cache ignore the real StructuredDataset. That could work, but it seems a bit counterintuitive, as it moves a lot of responsibility to the user. The caching is such a nice feature when it's automatic 😉
I was also thinking of passing a hash into the StructuredDataset's metadata when creating it, and then using that in the HashMethod, where the metadata does seem to be available. That's still a bit of workaround code, but easier to abstract away.
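A minimal sketch of that second idea, assuming the digest is computed when the object is created and carried as a plain field. `DataWithDigest`, `df_uri`, `df_digest` and `hash_data_with_digest` are hypothetical names; `df_uri` is only a stand-in for the StructuredDataset, and nothing here is a confirmed flytekit API. The hash function reads plain fields only, so it never needs to open the dataset.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class DataWithDigest:
    metadata: str
    df_uri: str     # stand-in for the StructuredDataset reference (assumption)
    df_digest: str  # content digest computed while the DataFrame was in memory


def hash_data_with_digest(data: DataWithDigest) -> str:
    """Build a cache key from plain fields only; the dataset is never opened."""
    # JSON-encode the parts so field boundaries stay unambiguous.
    payload = json.dumps([data.metadata, data.df_digest]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


same_a = DataWithDigest("hello", "s3://bucket/run-1/df", "abc123")
same_b = DataWithDigest("hello", "s3://bucket/run-2/df", "abc123")
changed = DataWithDigest("hello", "s3://bucket/run-3/df", "def456")

# The storage location does not affect the key; the content digest does.
print(hash_data_with_digest(same_a) == hash_data_with_digest(same_b))
print(hash_data_with_digest(same_a) == hash_data_with_digest(changed))
```

A function shaped like `hash_data_with_digest` is what one would hand to `HashMethod` on the dataclass return type; whether that resolves the crash in the example above is not confirmed in this thread.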
f
Indeed, caching is nice. With dataclasses, the hash is computed at the dataclass level; it does not cascade down to the fields inside it.