Flyte propeller got stuck and stopped any execution of workflows
# flyte-support
n
Flyte propeller got stuck and stopped any execution of workflows with the following error:
{"json":{"exec_id":"***masked-exec-id***","ns":"***masked-namespace***","res_ver":"***masked-ver***","routine":"worker-2","wf":"***masked-workflow-id***:***masked-workflow-id***:map_task.my_map_workflow"},"level":"error","msg":"Error when trying to reconcile workflow. Error [[]]. Error Type[*errors.WorkflowErrorWithCause]","ts":"2024-11-13T08:12:16Z"}
E1113 08:12:16.842540       1 workers.go:103] error syncing '***masked-namespace***/***masked-exec-id***': Workflow[] failed. ErrorRecordingError: failed to publish event, caused by: EventSinkError: Error sending event, caused by [rpc error: code = DeadlineExceeded desc = context deadline exceeded]
{"json":{"exec_id":"***masked-exec-id-2***","ns":"***masked-namespace***","res_ver":"***masked-ver-2***","routine":"worker-3","wf":"***masked-workflow-id***:***masked-workflow-id***:map_task.my_map_workflow"},"level":"warning","msg":"Event recording failed. Error [EventSinkError: Error sending event, caused by [rpc error: code = DeadlineExceeded desc = context deadline exceeded]]","ts":"2024-11-13T08:12:42Z"}
{"json":{"exec_id":"***masked-exec-id-2***","ns":"***masked-namespace***","res_ver":"***masked-ver-2***","routine":"worker-3","wf":"***masked-workflow-id***:***masked-workflow-id***:map_task.my_map_workflow"},"level":"error","msg":"Error when trying to reconcile workflow. Error [[]]. Error Type[*errors.WorkflowErrorWithCause]","ts":"2024-11-13T08:12:42Z"}
E1113 08:12:42.070995       1 workers.go:103] error syncing '***masked-namespace***/***masked-exec-id-2***': Workflow[] failed. ErrorRecordingError: failed to publish event, caused by: EventSinkError: Error sending event, caused by [rpc error: code = DeadlineExceeded desc = context deadline exceeded]
Basically it seemed like the connection between flyte-propeller and flyteadmin was broken, which caused these timeouts. A simple pod restart fixed it. This has happened 2-3 times, and a pod restart has fixed it every time. Any suggestions on how to fix this? I couldn’t find a way to add a “keepalive timeout” in the Helm chart/docs.
f
Are you using a load balancer/ingress system between admin and propeller?
You should
n
yes, it goes via an LB to the Istio ingress and then to flyteadmin
There are still issues though. Any suggestions on how to mitigate this? Would it be possible to expose keepAliveTimeout as an input to the propeller Helm chart?
a
@nice-market-38632 I think the timeouts that Flyte exposes are at the task/workflow level, but I can't find anything equivalent to keepAliveTimeout at the control-plane level. It'd be good to know what's causing so much pressure on admin that propeller times out sending events. Any chance you could use the Grafana dashboards to capture flyteadmin metrics?
n
Got it, but this error is random. There wasn’t any pressure on admin that I could tell of when it got stuck. Basically the connection seems to get stuck: I can still access flyteconsole and it works fine, but all executions get stuck because every request from flytepropeller to flyteadmin times out. It's most likely some networking configuration. On reading more, I suspect the cause is gRPC keepalive (https://grpc.io/docs/guides/keepalive/) not being set. This should ideally be configured on the client side (i.e. in flytepropeller), something like:
https://pkg.go.dev/google.golang.org/grpc#WithKeepaliveParams
but I didn't find it here: https://github.com/flyteorg/flyte/blob/ba331fd493173682500bb1735bfa760715c64b23/flytepropeller/pkg/controller/controller.go#L315. The dial options set up there should include a WithKeepaliveParams option for the keepalive config.
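For illustration, here is a minimal sketch (not the actual flytepropeller wiring) of what client-side keepalive looks like with grpc-go; the endpoint and the timing values below are placeholder assumptions, not values taken from Flyte:

```go
// Minimal sketch of client-side gRPC keepalive, assuming a placeholder
// flyteadmin endpoint and illustrative timing values.
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	kacp := keepalive.ClientParameters{
		Time:                30 * time.Second, // ping the server after 30s of inactivity
		Timeout:             10 * time.Second, // close the connection if the ping isn't acked within 10s
		PermitWithoutStream: true,             // keep pinging even when there are no active RPCs
	}

	// "flyteadmin.flyte.svc.cluster.local:81" is a placeholder for whatever
	// endpoint propeller is configured to dial.
	conn, err := grpc.Dial(
		"flyteadmin.flyte.svc.cluster.local:81",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(kacp), // the dial option discussed above
	)
	if err != nil {
		log.Fatalf("failed to dial flyteadmin: %v", err)
	}
	defer conn.Close()
	// Build the admin service client on top of conn as usual.
}
```

The idea being that a connection silently dropped by the LB/Istio ingress would then be detected and re-established, instead of every event publish hitting DeadlineExceeded.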