# ask-the-community
Hi, I've been running into "Death by etcd timeouts" situations on AKS. The main symptom is propeller being unable to ListAndWatch FlyteWorkflow CRD objects, which leads to halt-and-catch-fire. I tried to remedy the situation by manually cleaning up what I think flyte's garbage collection was supposed to clean up but didn't (FlyteWorkflow CRD objects and task pods). This worked once, but eventually the etcd behind AKS stops responding altogether; the number of FlyteWorkflow CRD objects in etcd ends up in the 10k-100k range after a couple of days. The flyteadmin version we're using is based on v1.1.39 and includes the changes from this PR: https://github.com/flyteorg/flyteadmin/pull/504. Garbage collection works fine on our other AKS-based environments most of the time, but it definitely looks unreliable sometimes, i.e. it cleans up very irregularly. Is this something we can configure/tune, maybe to do garbage collection really aggressively?
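For reference, the GC knobs I've found so far sit under the `propeller` section of the FlytePropeller config. The key names below are from the FlytePropeller config docs and the values are purely illustrative, so please verify them against the version you're running:

```yaml
# flytepropeller configmap (sketch; values are illustrative, not recommendations)
propeller:
  # how often a GC pass runs; a smaller interval means more aggressive cleanup
  gc-interval: 30m
  # terminated FlyteWorkflow CRs older than this become eligible for deletion
  max-ttl-hours: 23
```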
logs in propeller look like this
E0207 13:40:41.336097       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167: Failed to watch *v1alpha1.FlyteWorkflow: failed to list *v1alpha1.FlyteWorkflow: Get "<>": context deadline exceeded
W0207 13:42:02.338334       1 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167: failed to list *v1alpha1.FlyteWorkflow: Get "<>yte.lyft.com/v1alpha1/flyteworkflows?labelSelector=termination-status+notin+%28terminated%29&limit=500&resourceVersion=0": context deadline exceeded
I0207 13:42:02.338403       1 trace.go:205] Trace[1771930903]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167 (07-Feb-2023 13:41:32.337) (total time: 30000ms):
Trace[1771930903]: ---"Objects listed" error:Get "<>ourceVersion=0": context deadline exceeded 30000ms (13:42:02.338)
Trace[1771930903]: [30.000971812s] [30.000971812s] END
flyteadmin logs slow queries as well, which seems rather weird to me too
flyteadmin [260.964ms] [rows:25] SELECT * FROM "executions" WHERE executions.execution_project = '<redacted>' AND executions.execution_domain = 'development' AND executions.state = '
the query actually takes ~200ms; I verified that by running it directly against the database. The executions table holds 370k records atm:
flyteadmin=# select count(*) from executions
flyteadmin-# ;
(1 row)
@Klaus Azesberger AKS is not a supported target for open source Flyte deployments. Is this a custom deployment?
Also IMO you folks should turn on workflow offloading. We have seen fantastic results
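In case it saves someone a config hunt: workflow offloading is, as far as I can tell, enabled via the flyteadmin application config. The flag name below is my reading of the flyteadmin config for this feature; treat it as an assumption and double-check it against the v1.1.x config reference:

```yaml
# flyteadmin config (sketch; flag name assumed from flyteadmin's application config)
flyteadmin:
  # store the compiled workflow closure in blob storage instead of inlining it
  # into the FlyteWorkflow CRD, which keeps the etcd objects small
  useOffloadedWorkflowClosure: true
```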
we'll give that a try for sure
is there any tuning option for flyteadmin's garbage collection?
or a way to inspect its health/state?
Yes, you can tune the GC as well
But if etcd is timing out, I think GC will fail too
GC also has metrics
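For anyone else following along, here is a rough way to eyeball those GC metrics. This assumes propeller exposes Prometheus metrics on its default profiler port 10254 and that it runs as a `flytepropeller` deployment in the `flyte` namespace; the exact metric names vary by version, so the grep is deliberately loose:

```shell
# Port-forward to propeller (namespace/deployment names assumed to be the defaults)
kubectl -n flyte port-forward deploy/flytepropeller 10254:10254 &
sleep 2
# Dump all GC-related series; metric names differ between versions, hence the loose match
curl -s http://localhost:10254/metrics | grep -i ':gc:'
kill %1
```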
yes, indeed. once we're in that state it's hard/impossible to come back from
ah good to know, will check that
I think workflow offloading will solve all of this
We do not have AKS and do not really have a clue
it certainly will improve things, but tbh in my dreams there is no such thing as a FlyteWorkflow CRD and propeller would just rely on flyteadmin's DB 😅
Performance would be worse that way