# ask-the-community
t
Hello. Could anyone explain the importance of leader election for flytepropeller? We sometimes see flytepropeller restarting due to failures to renew the lease.
E1106 17:54:54.406295       1 leaderelection.go:369] Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/infrastructure--helm--flyte/leases/propeller-leader?timeout=2m0s": context deadline exceeded
I1106 17:54:54.406322       1 leaderelection.go:285] failed to renew lease infrastructure--helm--flyte/propeller-leader: timed out waiting for the condition
{"json":{},"level":"fatal","msg":"Lost leader state. Shutting down.","ts":"2023-11-06T17:54:54Z"}
Given that we are not currently using a sharded flyte-propeller, I'm thinking we may be able to disable leader election entirely?
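(For anyone finding this later: leader election is toggled in the FlytePropeller config. A minimal sketch of disabling it, assuming the key names from the FlytePropeller configuration reference; verify against your deployed version.)

```yaml
# FlytePropeller config sketch: opt a single-replica deployment
# out of leader election entirely
propeller:
  leader-election:
    enabled: false
```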
d
Leader election is used to keep a hot standby propeller instance. If one fails, Kubernetes leader election allows the second to take over very quickly. Only one propeller should be active at a time; otherwise they will compete with each other and cause issues, e.g. creating duplicate Pods for task executions, updating the FlyteWorkflow CRD simultaneously, etc.
t
Thanks. It makes sense that multiple concurrent flytepropellers would cause problems. If I have a non-sharded flyte propeller with only one replica and rolling updates disabled though, then I should anyway never have more than one flyte propeller running simultaneously? How would the duplicate propeller env you describe be configured?
d
then I should anyway never have more than one flyte propeller running simultaneously?
Correct!
How would the duplicate propeller env you describe be configured?
For a long time the default deployment charts just set
replicas: 2
and enabled leader election. Kubernetes then starts 2 propeller instances automatically, and the leader-election mechanism ensures only one is active at a time.
This is only necessary for specific use cases though. If there is only a single replica and the Pod fails, Kubernetes will recreate it as long as it's managed by a Deployment / ReplicaSet / etc. There will be a little downtime during that transition, and the only impact is that workflows won't be able to schedule new tasks in the meantime. Failing over via leader election is much quicker.
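(A hypothetical flyte-core values sketch for the hot-standby setup described above; the exact key paths may differ between chart versions, so treat this as illustrative, not authoritative.)

```yaml
# flyte-core values sketch: two propeller replicas with
# leader election so only one is active at a time
flytepropeller:
  replicas: 2
configmap:
  core:
    propeller:
      leader-election:
        enabled: true
        lock-config-map:
          name: propeller-leader   # name of the lock object
          namespace: flyte         # namespace it lives in
```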
t
Thanks, that's really useful information. I think the flyte-core helm chart is currently set to one replica but with rolling updates. I've also just discovered the update strategy is not configurable through the
flyte-core
helm chart. It would probably be a trivial helm PR, but I think I will just make the lease and renew periods a bit longer to mitigate our problem.
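(Sketch of what lengthening those timings might look like. The key names follow the FlytePropeller configuration reference; the durations are illustrative assumptions, not recommendations — the underlying client-go defaults are on the order of a 15s lease with a 10s renew deadline.)

```yaml
# FlytePropeller config sketch: longer leader-election timings to
# tolerate slow API-server responses when renewing the lease
propeller:
  leader-election:
    enabled: true
    lease-duration: 60s   # how long a lease is held before it can be stolen
    renew-deadline: 45s   # must renew within this window or give up leadership
    retry-period: 10s     # how often the client retries lease operations
```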
k
Cc @Nikki Everett can we add this to docs somehow
n
yeah, i can make an issue to add to the deployment config docs