# ask-the-community
q
*Caching & workflow version*
Hey guys, I have a question about how caching works with regard to workflow versions. Looking at the doc here, I originally thought that modifying the tasks in a workflow wouldn't invalidate the cache of upstream tasks. E.g.:
• I run the workflow: `n1 -> n2`
• A cache entry is created for `n1` and `n2`. Great 👍
• I add a task to the workflow: `n1 -> n2 -> n3`
• I expected `n1` and `n2` to still be cached, but they are actually run again. 🤔
Is this the expected behaviour (i.e. the cache is managed "per task per workflow version", and not only "per task")? Or am I doing something wrong in my code?
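(For reference, a minimal sketch of how per-task caching is declared in flytekit; the task names mirror the example above, and `cache`/`cache_version` are flytekit's documented task parameters:)
```python
from flytekit import task, workflow

# Caching is opted into per task: results are reused as long as the
# task's inputs, interface, and cache_version stay the same.
@task(cache=True, cache_version="1.0")
def n1(x: int) -> int:
    return x + 1

@task(cache=True, cache_version="1.0")
def n2(x: int) -> int:
    return x * 2

@workflow
def wf(x: int = 3) -> int:
    # Registering a new workflow version that adds a downstream task
    # should not, by itself, invalidate the cached results of n1 and n2.
    return n2(x=n1(x=x))
```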
k
It should be per task
It seems either your task's input or its cache version changed
q
Hmmm ok, that's strange.
k
Share the snippet, I do this all the time
q
Hmmmm, to be honest it's a bit too big to share. I made a complex cache-key computation function based on the call graph (using a lib called pycg).
I have ~150 lines of code to build the cache key 😕
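(A minimal sketch of the underlying idea, assuming a hypothetical helper that turns dependency names into source hashes; the real code reportedly discovers the dependencies with pycg, which is not shown here:)
```python
import hashlib
import importlib
import inspect

def hash_function_source(module_name: str, func_name: str) -> str:
    """Import a function and hash its source text.

    Hypothetical helper: the (module_name, func_name) pairs would come
    from a call-graph analysis such as pycg.
    """
    module = importlib.import_module(module_name)
    func = getattr(module, func_name)
    source = inspect.getsource(func)
    return hashlib.sha256(source.encode("utf-8")).hexdigest()

def cache_version_for(deps: list[tuple[str, str]]) -> str:
    # Combine per-function hashes into one stable key, sorted so the
    # result doesn't depend on discovery order.
    combined = hashlib.sha256()
    for module_name, func_name in sorted(deps):
        combined.update(hash_function_source(module_name, func_name).encode())
    return combined.hexdigest()[:16]
```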
k
Wow, is it even identical?
q
Haaaaaaaaa, I think I found why. I get all the function dependencies with `pycg`, then try to import them and get their source with `inspect.getsource` to compute a hash. But when my code is instantiated by flytekit, it is not able to import the task (e.g. if my task `my_task` is located in the `my_workflow.py` file, the code will try `import my_workflow.my_task`, get the source and hash it, but somehow, inside the flytekit import system, it doesn't work anymore). As a fallback, my code was taking the source of the whole file (i.e. `import my_workflow`), which is why the workflow version was indeed included in the cache key of every task. The solution I found is to call this "import and hash" function in a subprocess, outside of any flytekit interference.
If you'd like to play with the code, here it is.
(it's quite hacky and lacks proper documentation, but it fulfills my need)
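(A sketch of that workaround, under the same assumptions as the hashing helper above: spawn a fresh interpreter so the child process imports the module through the normal import machinery, untouched by whatever flytekit has set up in the parent:)
```python
import subprocess
import sys

def hash_in_subprocess(module_name: str, func_name: str) -> str:
    # Run the import-and-hash step in a clean Python process; the child
    # has no flytekit loader installed, so the import behaves normally.
    code = (
        "import importlib, inspect, hashlib, sys; "
        f"m = importlib.import_module({module_name!r}); "
        f"src = inspect.getsource(getattr(m, {func_name!r})); "
        "sys.stdout.write(hashlib.sha256(src.encode()).hexdigest())"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```
(The subprocess pays an interpreter-startup cost per hash, but guarantees a clean import state regardless of how the parent process was launched.)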