Hey everyone! I had a quick question around caching. I have a workflow in which one task takes over 13 hours, so leveraging caching would be a tremendous help.
My understanding is that Flyte keys the cache on the project, domain, cache version, and task signature. All of these are constant between runs for me, so no issues there. The cache also depends on the task's inputs, and this is where I have a question. The input to the task is a pandas DataFrame (potentially with a very large number of rows), so I'm wondering how Flyte checks the input.
For a DataFrame input, does Flyte compare every row and column, and treat it as a cache miss if they don't exactly match the previous input? If so, should we expect the cache hit/miss check itself to take a significant amount of time when the DataFrame has many rows or columns?
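
To make the question concrete, here is a rough sketch of the kind of content-based check I'm imagining. This is purely hypothetical and is not a claim about how Flyte actually works: `table_fingerprint` and the sample rows are made up for illustration, and plain tuples stand in for the dataframe to keep the snippet self-contained.

```python
import hashlib


def table_fingerprint(columns, rows):
    """Hypothetical content-based fingerprint of a table.

    Every cell is fed into the hash, so the cost grows with
    (number of rows) x (number of columns).
    """
    digest = hashlib.sha256()
    digest.update(repr(tuple(columns)).encode("utf-8"))
    for row in rows:
        digest.update(repr(tuple(row)).encode("utf-8"))
    return digest.hexdigest()


columns = ["id", "value"]
previous_input = [(1, 0.5), (2, 1.5), (3, 2.5)]
same_input = [(1, 0.5), (2, 1.5), (3, 2.5)]
changed_input = [(1, 0.5), (2, 1.5), (3, 2.6)]  # a single cell differs

previous_key = table_fingerprint(columns, previous_input)

# Identical content gives the same key (what I'd call a cache hit).
cache_hit = table_fingerprint(columns, same_input) == previous_key
# Any differing cell gives a different key (a cache miss).
cache_miss = table_fingerprint(columns, changed_input) != previous_key

print(cache_hit, cache_miss)
```

If the check works anything like this, a single changed cell would invalidate the cache, and computing the key would require a full pass over the data on every run. That per-run cost is what I'm trying to understand.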