# runllm-flyte-trial
New RunLLM response generated:

*# Question*

Flyte propeller got stuck and stopped any execution of workflows with the following error:
```
{"json":{"exec_id":"***masked-exec-id***","ns":"***masked-namespace***","res_ver":"***masked-ver***","routine":"worker-2","wf":"***masked-workflow-id***:***masked-workflow-id***:map_task.my_map_workflow"},"level":"error","msg":"Error when trying to reconcile workflow. Error [[]]. Error Type[*errors.WorkflowErrorWithCause]","ts":"2024-11-13T08:12:16Z"}
E1113 08:12:16.842540       1 workers.go:103] error syncing '***masked-namespace***/***masked-exec-id***': Workflow[] failed. ErrorRecordingError: failed to publish event, caused by: EventSinkError: Error sending event, caused by [rpc error: code = DeadlineExceeded desc = context deadline exceeded]
{"json":{"exec_id":"***masked-exec-id-2***","ns":"***masked-namespace***","res_ver":"***masked-ver-2***","routine":"worker-3","wf":"***masked-workflow-id***:***masked-workflow-id***:map_task.my_map_workflow"},"level":"warning","msg":"Event recording failed. Error [EventSinkError: Error sending event, caused by [rpc error: code = DeadlineExceeded desc = context deadline exceeded]]","ts":"2024-11-13T08:12:42Z"}
{"json":{"exec_id":"***masked-exec-id-2***","ns":"***masked-namespace***","res_ver":"***masked-ver-2***","routine":"worker-3","wf":"***masked-workflow-id***:***masked-workflow-id***:map_task.my_map_workflow"},"level":"error","msg":"Error when trying to reconcile workflow. Error [[]]. Error Type[*errors.WorkflowErrorWithCause]","ts":"2024-11-13T08:12:42Z"}
E1113 08:12:42.070995       1 workers.go:103] error syncing '***masked-namespace***/***masked-exec-id-2***': Workflow[] failed. ErrorRecordingError: failed to publish event, caused by: EventSinkError: Error sending event, caused by [rpc error: code = DeadlineExceeded desc = context deadline exceeded]
```
Basically it seemed like the connection between flyte-propeller and flyteadmin had broken, which caused these timeouts. A simple pod restart fixed it. This has happened 2-3 times now, and a restart of the propeller pod has fixed it every time. Any suggestions on how to fix this properly? I couldn't find a way to set a "keepalive timeout" in the helm chart/docs.

# Answer

It seems likely that the issue you're experiencing, with Flyte propeller getting stuck and the connection between Flyte propeller and Flyte admin timing out, is related to network or resource constraints. According to this GitHub issue, similar problems have been observed when there are issues with the Flyte deployment, such as tasks getting stuck in a queued state due to resource constraints or configuration problems. My understanding is that a keepalive timeout might help maintain the connection between Flyte propeller and Flyte admin, but this specific configuration is not explicitly mentioned in the provided data sources. You might consider checking the network policies, resource allocations, and any potential bottlenecks in your Kubernetes cluster that could be affecting communication between these components (see the configuration and debugging sketches after this message).

Would you like more detailed guidance on troubleshooting network issues or configuring resource allocations in your Flyte deployment?

Sources:
- this GitHub issue

Slack Link: https://flyte.slack.com/archives/CP2HDHKE1/p1732874245.212799

💡 Hint: Mention @runllm in the thread for followups.
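On the "keepalive timeout" point: there isn't a knob with that exact name in the charts, but the propeller-to-flyteadmin gRPC client does expose retry and per-call timeout settings that can be raised so that transient admin slowness doesn't surface as `DeadlineExceeded`. Below is a minimal sketch of a flyte-core Helm values override, assuming the chart's `configmap.admin` block and the flyteidl admin-client keys `maxRetries`, `perRetryTimeout` and `maxBackoffDelay`; the exact key names and defaults should be verified against the Flyte and chart versions you run.

```yaml
# values-override.yaml (sketch, not a drop-in; verify keys for your chart version)
configmap:
  admin:
    admin:
      # endpoint/insecure shown only for context -- keep whatever your install already uses
      endpoint: flyteadmin:81
      insecure: true
      # retry/timeout behaviour of the propeller -> flyteadmin gRPC client
      maxRetries: 4          # retries per admin call
      perRetryTimeout: 30s   # deadline for each attempt (the DeadlineExceeded in the logs)
      maxBackoffDelay: 8s    # cap on exponential backoff between retries
```

Applied with something like `helm upgrade flyte flyteorg/flyte-core -n flyte -f values-override.yaml`, followed by a rollout of the propeller deployment so the new configmap is picked up.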
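When this happens again, it may also be worth confirming from inside the cluster whether flyteadmin's gRPC port is reachable at all before restarting propeller, to narrow down whether it is a network problem or a stuck client-side connection. A sketch using a throwaway debug pod; the service name `flyteadmin`, gRPC port 81, namespace `flyte`, and the `nicolaka/netshoot` image are assumptions to adjust for your install:

```bash
# launch a temporary debug pod in the namespace where flytepropeller runs
kubectl -n flyte run net-debug --rm -it --restart=Never \
  --image=nicolaka/netshoot -- bash

# inside the pod: is flyteadmin's gRPC port reachable at the TCP level?
nc -zv flyteadmin.flyte.svc.cluster.local 81

# if flyteadmin has gRPC reflection enabled, grpcurl confirms the server actually answers
grpcurl -plaintext flyteadmin.flyte.svc.cluster.local:81 list
```

If the port is reachable but propeller keeps timing out until it is restarted, that points at a stale client-side connection rather than the network, which is where the retry/timeout settings sketched above come into play.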