Hey flyte friends,
Sorry for the spam, but I'm running into yet another weird error. It might be a misconfiguration on my side, since this used to work, but I haven't been able to figure out yet which change caused the issue.
So what's happening is that suddenly most tasks fail almost immediately after they start running. The high-level error shown in flyteconsole is this:
Some node execution failed, auto-abort.
The task pods start and go into the Running state. Sometimes I can see a few lines of task logs, but then it looks like the pod is deleted long before it has finished. It terminates and is cleaned up.
Log says:
Stopping container m4dydd0r8y-n0-0
So I checked the flytepropeller logs, and there's one error logged a few seconds before the pod is deleted. Not sure if it's related:
Failed to update workflow. Error [Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com "m4dydd0r8y": the object has been modified; please apply your changes to the latest version and try again]
I'm a bit at a loss as to how to debug this further. Any hints are much appreciated. I can also provide the full propeller log if that would help track this down.