# ask-the-community
Hey everyone! I had a quick question around caching. I have a workflow in which one task takes over 13 hours, so leveraging caching would be a tremendous help. I understand that Flyte caches based on the project, domain, cache version, and task signature. All of these are constant between runs for me, so no issues there. Caching also depends on the input, and this is where I have a question. The input to the task is a pandas dataframe (with potentially many, many rows), so I'm wondering how Flyte checks the input. For a dataframe input, does Flyte check every row and column, and is it a cache miss if they don't exactly match the last input? In that case, would we expect Flyte to take a significant amount of time to check for a cache hit/miss if the dataframe has many rows or columns?
@Seth Baer great question. A pandas dataframe is passed by reference in Flyte, i.e. as a file in S3. So by default you only get a cache hit when the reference matches. But for objects that are large (typically passed by reference), you can override the caching to use your own custom hash or provide a client-side hash. Refer to https://docs.flyte.org/projects/cookbook/en/stable/auto/core/flyte_basics/task_cache.html
Thanks for the insight @Ketan (kumare3), I'll run some tests using this info!