acoustic-carpenter-78188
04/11/2023, 11:54 AM
CausedByError: Failed to propagate Abort for workflow. Error: 0: [SystemError] system error, caused by: rpc error: code = PermissionDenied desc = Cannot abort an already terminate workflow execution
One of the subworkflows is intended to fail under certain conditions. When it fails, Propeller tries to abort the remaining running subworkflows. Sometimes they are aborted properly, but other times Propeller receives the PermissionDenied error above from Flyte Admin.
This looks like a race condition in Propeller: when Propeller checks the status of the remaining subworkflows they are reported as "running", but by the time the abort is called they have already moved to a terminated status. When this happened, the finish times of the failing subworkflow (the one that triggers the abort) and the successful one differed by only 3 ms, so I think the status is read as running even though the subworkflow has actually Succeeded by the time the abort call is executed. These lines may be relevant to the issue: https://github.com/flyteorg/flytepropeller/blob/master/pkg/controller/nodes/task/handler.go#L795-L825
(currentPhase might change when p.Abort is called)
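To make the suspected race concrete, here is a minimal, deterministic sketch of the check-then-abort pattern. Everything in it (`FakeAdmin`, `check_then_abort`, the `between` hook) is hypothetical and only stands in for Propeller and Flyte Admin; it is not the actual handler code.

```python
from enum import Enum


class Phase(Enum):
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    ABORTED = "ABORTED"


class PermissionDenied(Exception):
    """Stand-in for the gRPC PermissionDenied error returned by Flyte Admin."""


class FakeAdmin:
    """Hypothetical in-memory stand-in for Flyte Admin; not the real API."""

    def __init__(self):
        self.phases = {}

    def get_phase(self, execution_id):
        return self.phases[execution_id]

    def abort(self, execution_id):
        # Admin rejects aborts for executions that already terminated.
        if self.phases[execution_id] != Phase.RUNNING:
            raise PermissionDenied(
                "Cannot abort an already terminate workflow execution"
            )
        self.phases[execution_id] = Phase.ABORTED


def check_then_abort(admin, execution_id, between=None):
    """Read the phase, then abort based on that (possibly stale) reading.

    `between` simulates whatever happens in the gap between the check
    and the abort call, e.g. the subworkflow finishing on its own.
    """
    observed = admin.get_phase(execution_id)
    if between is not None:
        between()
    if observed == Phase.RUNNING:
        admin.abort(execution_id)
    return observed


if __name__ == "__main__":
    admin = FakeAdmin()
    admin.phases["sub-b"] = Phase.RUNNING

    def finish():
        # The subworkflow succeeds a few milliseconds after the check.
        admin.phases["sub-b"] = Phase.SUCCEEDED

    try:
        check_then_abort(admin, "sub-b", between=finish)
    except PermissionDenied as err:
        print("abort rejected:", err)
```

Without the `between` hook the abort goes through, which would match the runs where the subworkflows are aborted properly.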
Please check the attached screenshots to see how different executions of the same code produce different results.
Eventually, the parent workflow (the one containing the subworkflows) fails with this error:
RuntimeExecutionError: max number of system retry attempts [51/50] exhausted.
These retries increase the number of calls made to FlyteAdmin and also inflate the metric associated with the PermissionDenied error.
Please do not hesitate to ask for further information if needed.
Expected behavior
FlytePropeller should not retry aborting a node that is already in a terminated status, and that node's status should be updated in the parent workflow to the terminated status (sometimes the node is shown as running in the parent even though it is succeeded when you open the subworkflow).
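A rough sketch of the behavior I would expect, assuming the abort path can tell "already terminated" apart from other failures. The names (`abort_if_needed`, `AlreadyTerminatedError`, `TERMINAL_PHASES`) and the phase list are my own assumptions for illustration, not Propeller or Flyte Admin identifiers.

```python
# Assumed set of terminal phases; the authoritative list is defined by Flyte.
TERMINAL_PHASES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED_OUT"}


class AlreadyTerminatedError(Exception):
    """Stand-in for Admin's PermissionDenied 'already terminated' response."""


def abort_if_needed(execution_id, get_phase, abort):
    """Abort a child execution, treating 'already terminated' as done.

    `get_phase` and `abort` are hypothetical callables supplied by the caller.
    Returns the phase that should be recorded in the parent workflow.
    """
    try:
        abort(execution_id)
    except AlreadyTerminatedError:
        # The child finished between the status check and the abort call.
        # Re-read its phase instead of surfacing a retryable system error.
        phase = get_phase(execution_id)
        if phase in TERMINAL_PHASES:
            return phase
        raise  # Anything else is unexpected and should still fail loudly.
    return "ABORTED"
```

With something along these lines the abort would not count against the system retry budget, and the parent would record the child's real final phase instead of leaving it as running.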
Additional context to reproduce
No response
Screenshots