Hey everyone I had a quick question around caching I have a Flyte #flyte-support


brief-leather-11096

01/26/2023, 3:45 PM

Hey everyone! I had a quick question around caching. I have a workflow in which one task takes over 13 hours, so leveraging caching would be a tremendous help. I understand that Flyte caches based on the project, domain, cache version, and task signature. All of these are constant between runs for me, so no issues there. Caching also depends on the input, and this is where I have a question. The input to the task is a pandas DataFrame (with potentially many, many rows), so I'm wondering how Flyte checks the input. For a DataFrame input, does Flyte compare every row and column, and treat it as a cache miss if they don't exactly match the last input? If so, should we expect Flyte to take a significant amount of time to check for a cache hit/miss when the DataFrame has many rows or columns?

freezing-airport-6809

01/26/2023, 4:07 PM

@brief-leather-11096 great question. A pandas DataFrame is passed by reference in Flyte, i.e. as a file in S3. So by default, only matching references will produce a cache hit. But for large objects, which are typically passed by reference, you can override the caching to use your own custom hash, or provide a client-side hash. Refer to https://docs.flyte.org/projects/cookbook/en/stable/auto/core/flyte_basics/task_cache.html

brief-leather-11096

01/26/2023, 4:35 PM

Thanks for the insight @freezing-airport-6809, I'll run some tests using this info!

👍 1
