# flyte-support
c
Has anyone seen Flyte Propeller fail workflows with this failure?
Last known status message: AlreadyExists: Event Already Exists, caused by [event has already been sent]
I can't tell if this indicates a larger issue or if Flyte Propeller should just be updated to handle `AlreadyExists` more gracefully. I'm going to dig deeper to understand whether the DB state was updated in Flyte Admin but the gRPC call failed the first time the event was sent.
Everything is looking normal. This seems to happen on array nodes.
c
hey @clean-glass-36808 is this still an issue in your environment? I've seen the `Already Exists` error, but it could be associated with transient communication issues between propeller and admin, or between propeller and the k8s informer. MapTasks are a common theme.
c
Looking into it more, it seems the execution node was bouncing between states: it would move to in progress and then be rerun from an old queued state. I think this is why it tries to send duplicate events. I might have more logs later, but it generally seemed like some sort of issue with the informer.
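To make the suspected mechanism concrete, here is a toy Python model (not Flyte code). It assumes an event store that rejects any event whose (node, phase, version) key was already recorded; `ToyAdmin`, `AlreadyExistsError`, `replay`, and the node ID `n0` are hypothetical names chosen for illustration. Replaying a node from a stale queued state re-sends events that were already accepted:

```python
class AlreadyExistsError(Exception):
    """Raised when an identical event key has already been recorded."""


class ToyAdmin:
    """Hypothetical stand-in for an event store that deduplicates events."""

    def __init__(self):
        self.seen = set()

    def record(self, node_id, phase, version=0):
        key = (node_id, phase, version)
        if key in self.seen:
            raise AlreadyExistsError(f"event has already been sent: {key}")
        self.seen.add(key)


def replay(admin, node_id, phases):
    """Send one event per phase and return the phases that were rejected."""
    rejected = []
    for phase in phases:
        try:
            admin.record(node_id, phase)
        except AlreadyExistsError:
            rejected.append(phase)
    return rejected


admin = ToyAdmin()
# First pass: the node moves from queued to running and both events are accepted.
first = replay(admin, "n0", ["QUEUED", "RUNNING"])
# A stale view hands the controller the old queued state, so it re-sends both events.
second = replay(admin, "n0", ["QUEUED", "RUNNING"])
```

In this sketch the second pass is rejected in full, which is the same shape as a controller acting on an outdated view of the node.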
c
👋 @average-finland-92144 Trying to understand this one a bit more, as we are still running into this issue. It looks like propeller supports a configuration to choose whether or not to ignore this particular error when recording task events: https://github.com/flyteorg/flyte/blob/master/flytepropeller/pkg/controller/nodes/node_exec_context.go#L38-L43
But the array node task (which is where we are seeing this) explicitly enables `ErrorOnAlreadyExists` for events: https://github.com/flyteorg/flyte/blob/master/flytepropeller/pkg/controller/nodes/array/handler.go#L746
It looks like this was intentional so that ArrayNode sub-node state is not lost: https://github.com/flyteorg/flyte/pull/5680. During the failed workflows we see the event sink retries fail continuously:
"msg": "Event version already exists, bumping version and retrying (1/3): [AlreadyExists: Event Already Exists, caused by [event has already been sent]]",
"msg": "Event version already exists, bumping version and retrying (2/3): [AlreadyExists: Event Already Exists, caused by [event has already been sent]]",
"msg": "Event version already exists, bumping version and retrying (3/3): [AlreadyExists: Event Already Exists, caused by [event has already been sent]]",
"msg": "Event version already exists, bumping version and retrying (4/3): [AlreadyExists: Event Already Exists, caused by [event has already been sent]]",
...
This presumably repeats until the workflow fails out completely:
Workflow[...] failed. RuntimeExecutionError: max number of system retry attempts [31/30] exhausted. Last known status message: AlreadyExists: Event Already Exists, caused by [event has already been sent]
One question I have is how `3 retries` was chosen. Is it possible the `TaskPhaseVersion` is more than 3 versions out of sync? I'm going to continue looking into this, but any additional context or guidance would be appreciated 🙇
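A toy Python model of that question, purely as a sketch: it assumes each retry bumps the version by exactly one and that the store rejects any version it has already recorded. Whether propeller's event sink actually behaves this way is an assumption, and the `(4/3)` line in the logs above suggests the real retry loop differs. `ToyEventStore` and `send_with_version_bump` are hypothetical names.

```python
class AlreadyExistsError(Exception):
    """Raised when the store has already recorded this version."""


class ToyEventStore:
    """Hypothetical store that rejects any phase version it has already recorded."""

    def __init__(self, recorded_versions):
        self.recorded = set(recorded_versions)

    def record(self, version):
        if version in self.recorded:
            raise AlreadyExistsError("event has already been sent")
        self.recorded.add(version)


def send_with_version_bump(store, start_version, max_bumps=3):
    """Try start_version, then bump by one up to max_bumps times.

    Returns the accepted version, or None if every attempt collided.
    """
    version = start_version
    for _ in range(max_bumps + 1):
        try:
            store.record(version)
            return version
        except AlreadyExistsError:
            version += 1
    return None


# Local view is 2 versions behind: versions 0 and 1 collide, version 2 is accepted.
close = send_with_version_bump(ToyEventStore({0, 1}), start_version=0)
# Local view is 5 versions behind: versions 0 through 3 all collide and the budget runs out.
far = send_with_version_bump(ToyEventStore({0, 1, 2, 3, 4}), start_version=0)
```

Under these assumptions, a fixed budget of three bumps only recovers when the local version lags by three or fewer, so a larger drift would fail on every round.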
We see this sync error a handful of times before the "event already exists" error, which points to a symptom similar to the one described in that pull request:
E0430 15:32:53.086687       1 workers.go:103] error syncing '...': Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com "...": the object has been modified; please apply your changes to the latest version and try again
a
hey @creamy-piano-60645 That last message typically appears under high load, when there's a delay in syncing the Informer cache: propeller tries to update the CRD status in etcd and it's rejected because there is a more recent version (ref). What Flyte version are you using? 1.15.1 introduced some fixes that could fix or improve things: https://github.com/flyteorg/flyte/releases/tag/v1.15.1
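The rejection described here is optimistic concurrency: a write is only accepted if it is based on the latest version of the object. A minimal Python sketch, loosely modeled on Kubernetes `resourceVersion` semantics; `ToyStore` and `ConflictError` are hypothetical names, and this is not the Kubernetes client API:

```python
class ConflictError(Exception):
    """Raised when an update is based on a stale resource version."""


class ToyStore:
    """Hypothetical single-object store with optimistic concurrency control."""

    def __init__(self, status):
        self.status = status
        self.resource_version = 1

    def get(self):
        return {"status": self.status, "resource_version": self.resource_version}

    def update(self, status, resource_version):
        # Reject writes that were computed from an older version of the object.
        if resource_version != self.resource_version:
            raise ConflictError("the object has been modified")
        self.status = status
        self.resource_version += 1


store = ToyStore("QUEUED")
cached = store.get()  # snapshot, standing in for a lagging informer cache
store.update("RUNNING", cached["resource_version"])  # a newer write lands first

try:
    store.update("QUEUED", cached["resource_version"])  # write from the stale snapshot
    outcome = "accepted"
except ConflictError:
    outcome = "conflict"

# Re-reading the object and retrying against the fresh version succeeds.
fresh = store.get()
store.update("SUCCEEDED", fresh["resource_version"])
```

The stale write is rejected rather than silently overwriting the newer status, which is why the error message asks the caller to apply changes to the latest version and try again.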
c
We are using v1.14.x, we'll upgrade to v1.15 and report back