Flyte #flyte-support: Strange issue possibly related to caching

bored-needle-72209

05/02/2024, 11:23 AM

Hello! I have a strange issue I think related to caching. I have a successful execution and when I try to relaunch the same execution with the same parameters, it gets aborted instead of loading in every cached output. Any thoughts on this?

tall-lock-23197

05/03/2024, 6:29 AM

seems like this was fixed some time ago: https://github.com/flyteorg/flyte/issues/3901. mind taking a look at the issue?

bored-needle-72209

05/03/2024, 4:01 PM

Yes, this is it, thanks!

enough-car-91616

06/27/2024, 7:40 AM

Could it be that there is a regression here (we are running on 1.12)? We are consistently getting the same issue: workflow breaks, we increase the cache version (or disable caching), it works once, then it fails again:

enough-car-91616

06/27/2024, 7:41 AM

[Screenshot attached: Screen Shot 2024-06-27 at 9.40.40 AM.png]

tall-lock-23197

06/27/2024, 10:53 AM

@glamorous-carpet-83516 any idea why this might be happening? if you look at the error log, there's `PythonPickle` being compared against `PyTorchModule`

glamorous-carpet-83516

06/27/2024, 5:25 PM

are you able to share the code?

glamorous-carpet-83516

06/27/2024, 5:25 PM

does it work prior to 1.12?

bored-needle-72209

06/27/2024, 5:30 PM

In my case it was the same problem as the issue @tall-lock-23197 linked: I had a nested list (List[List[object]]) that included some empty lists.
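
For anyone who suspects the same trigger, here is a minimal sketch of a check for empty inner lists in a nested input before relaunching. The helper name `find_empty_inner_lists` is hypothetical and is not part of flytekit; it only inspects plain Python lists.

```python
from typing import Any, List, Tuple


def find_empty_inner_lists(value: Any, path: Tuple[int, ...] = ()) -> List[Tuple[int, ...]]:
    """Return the index path of every empty list nested inside `value`."""
    if not isinstance(value, list):
        return []
    hits: List[Tuple[int, ...]] = []
    for i, item in enumerate(value):
        if isinstance(item, list):
            if not item:
                # Empty inner list: record where it sits.
                hits.append(path + (i,))
            else:
                # Non-empty inner list: keep looking one level deeper.
                hits.extend(find_empty_inner_lists(item, path + (i,)))
    return hits


nested = [[1, 2], [], [3], []]
print(find_empty_inner_lists(nested))
```

An empty result means the nested input has no empty inner lists, so this particular trigger is probably not what you are hitting.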

glamorous-carpet-83516

06/27/2024, 5:31 PM

Got it, I’m investigating

enough-car-91616

06/28/2024, 6:20 AM

This worked for us prior to 1.12 (we were on 1.10 before), which indicates that the problem was likely introduced in 1.11 or 1.12. There is a small chance that the problem has been there all along, but somehow it didn't surface until now and the fact that it surfaced when upgrading to 1.12 is just a coincidence.

enough-car-91616

06/28/2024, 6:20 AM

BTW, does anyone know if it's safe to downgrade from 1.12 to 1.10?

enough-car-91616

06/28/2024, 4:27 PM

We tracked this down to a Flyte serialization issue. PyTorch models are serialized differently depending on whether `flytekit.extras.pytorch` is available or not. So if one task serializes the model while this import is available and a subsequent task doesn't have access to this module, the workflow will break with this error.
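
The mismatch described here can be illustrated with a toy model. This is a sketch under assumptions, not flytekit's actual code: `FakeModule`, `choose_format`, `formats_match` and the registry dicts are all hypothetical, and only the two format names `PythonPickle` and `PyTorchModule` come from the error log mentioned above.

```python
from typing import Dict


class FakeModule:
    """Hypothetical stand-in for a PyTorch model so the sketch runs without torch."""


# Assumption: types with no registered handler fall back to a pickle format.
FALLBACK_FORMAT = "PythonPickle"


def choose_format(python_type: type, registry: Dict[type, str]) -> str:
    """Pick the format a process would use for `python_type` given its registry."""
    for registered_type, fmt in registry.items():
        if issubclass(python_type, registered_type):
            return fmt
    return FALLBACK_FORMAT


def formats_match(producer: Dict[type, str], consumer: Dict[type, str], python_type: type) -> bool:
    """True if the writing task and the reading task agree on the format."""
    return choose_format(python_type, producer) == choose_format(python_type, consumer)


# Task environment where the PyTorch handler is registered.
with_extras = {FakeModule: "PyTorchModule"}
# Task environment where it is not.
without_extras: Dict[type, str] = {}

print(choose_format(FakeModule, with_extras))
print(choose_format(FakeModule, without_extras))
print(formats_match(with_extras, without_extras, FakeModule))
```

In this toy model the two environments pick different formats for the same Python type, which is the kind of disagreement the error log shows. Whether flytekit selects its transformers this way is an assumption here.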

tall-lock-23197

06/28/2024, 4:37 PM

thanks for diving deep into this to find the root cause. why doesn't the subsequent task have access?
