Hello! I have a strange issue I think related to ...
# flyte-support
b
Hello! I have a strange issue I think related to caching. I have a successful execution and when I try to relaunch the same execution with the same parameters, it gets aborted instead of loading in every cached output. Any thoughts on this?
t
Seems like this was fixed some time ago: https://github.com/flyteorg/flyte/issues/3901. Mind taking a look at the issue?
b
Yes, this is it, thanks!
e
Could it be that there is a regression here (we are running on 1.12)? We are consistently getting the same issue: workflow breaks, we increase the cache version (or disable caching), it works once, then it fails again:
[Screenshot: Screen Shot 2024-06-27 at 9.40.40 AM.png]
t
@glamorous-carpet-83516 any idea why this might be happening? If you look at the error log, `PythonPickle` is being compared against `PyTorchModule`.
g
are you able to share the code?
does it work prior to 1.12?
b
It was the same problem as the issue @tall-lock-23197 sent. I had a nested list (`List[List[object]]`) and it included some empty lists.
g
Got it, I’m investigating
e
This worked for us prior to 1.12 (we were on 1.10 before), which indicates that the problem was likely introduced in 1.11 or 1.12. There is a small chance that the problem has been there all along, but somehow it didn't surface until now and the fact that it surfaced when upgrading to 1.12 is just a coincidence.
BTW, does anyone know if it's safe to downgrade from 1.12 to 1.10?
We tracked this down to a Flyte serialization issue. PyTorch models are serialized differently depending on whether `flytekit.extras.pytorch` is available or not. So if one task serializes the model while this import is available, and a subsequent task doesn't have access to this module, the workflow breaks with this error.
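To illustrate the mechanism, here is a minimal sketch in plain Python. It is a simplified model, not flytekit's actual code: `REGISTRY`, `FakeModel`, `import_pytorch_extras`, `write_output`, and `read_input` are all hypothetical names, and the only things taken from the real error are the two format names, `PythonPickle` and `PyTorchModule`. The assumption is that importing the extras module registers a dedicated format for the model type, and that environments without the import fall back to pickle.

```python
# Hypothetical stand-in for a type-to-format registry (not flytekit's real API).
REGISTRY = {}


class FakeModel:
    """Stands in for a PyTorch model so the sketch runs without torch installed."""

    def __init__(self, weights):
        self.weights = list(weights)


def import_pytorch_extras():
    """Simulates the import side effect: a dedicated format gets registered."""
    REGISTRY[FakeModel] = "PyTorchModule"


def format_for(py_type):
    """Types without a registered format fall back to the pickle format."""
    return REGISTRY.get(py_type, "PythonPickle")


def write_output(value):
    """Tag the stored output with whatever format this environment resolves."""
    return {"format": format_for(type(value)), "data": value.weights}


def read_input(blob, expected_type):
    """Reject an input whose stored format differs from the locally resolved one."""
    expected = format_for(expected_type)
    if blob["format"] != expected:
        raise TypeError(
            f"format mismatch: stored {blob['format']}, expected {expected}"
        )
    return expected_type(blob["data"])


# Task A runs where the extras module has been imported.
import_pytorch_extras()
blob = write_output(FakeModel([0.1, 0.2]))

# Task B runs in an environment where the import never happened.
REGISTRY.clear()
try:
    read_input(blob, FakeModel)
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

In this model the failure goes away once both tasks resolve the same format, for example by making sure the same import happens in every task environment.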
t
Thanks for diving deep into this to find the root cause. Why doesn't the subsequent task have access?