Hi, I've been running into "Death by etcd timeouts" situations - Flyte #flyte-support

orange-arm-76433

02/07/2023, 1:39 PM

Hi, I've been running into "Death by etcd timeouts" situations on AKS. The main symptom is propeller being unable to ListAndWatch flyte CRD objects, at which point it halts and catches fire. I tried to remedy the situation by manually cleaning up what I think Flyte's garbage collection is supposed to clean up but didn't (flyte CRD objects and task pods). This worked once, but eventually the etcd behind AKS stops responding altogether. After a couple of days the number of flyteworkflow CRD objects in etcd ends up in the 10k to 100k range. The flyteadmin version we're using is based on v1.1.39 and includes the changes from this PR: https://github.com/flyteorg/flyteadmin/pull/504. Garbage collection works fine on our other AKS-based environments most of the time, but it definitely looks unreliable sometimes, as in it cleans up very irregularly. Is this something we can configure or tune, maybe to do garbage collection really aggressively?
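To keep an eye on how fast the object count grows, here is a minimal sketch that tallies FlyteWorkflow objects per namespace. It assumes the objects have already been exported as a list of dicts in the usual Kubernetes JSON shape (for example the `items` array of `kubectl get flyteworkflows -A -o json`); the function name and the sample namespaces are made up for illustration. On a cluster that is already timing out, the export itself may fail.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_namespace(items: Iterable[Mapping]) -> Counter:
    """Count Kubernetes objects per namespace.

    `items` is assumed to be a list of object dicts, each with a
    `metadata.namespace` field (hypothetical input, not a Flyte API).
    """
    counts: Counter = Counter()
    for item in items:
        namespace = item.get("metadata", {}).get("namespace", "<none>")
        counts[namespace] += 1
    return counts


if __name__ == "__main__":
    # Illustrative sample data; the namespace names are invented.
    sample = [
        {"metadata": {"name": "wf-1", "namespace": "projecta-development"}},
        {"metadata": {"name": "wf-2", "namespace": "projecta-development"}},
        {"metadata": {"name": "wf-3", "namespace": "projectb-development"}},
    ]
    for namespace, n in count_by_namespace(sample).most_common():
        print(namespace, n)
```

Running this periodically and comparing the totals would show whether cleanup keeps pace with new executions.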

orange-arm-76433

02/07/2023, 1:43 PM

logs in propeller look like this

E0207 13:40:41.336097       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167: Failed to watch *v1alpha1.FlyteWorkflow: failed to list *v1alpha1.FlyteWorkflow: Get "https://10.54.32.1:443/apis/flyte.lyft.com/v1alpha1/flyteworkflows?labelSelector=termination-status+notin+%28terminated%29&limit=500&resourceVersion=0": context deadline exceeded
W0207 13:42:02.338334       1 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167: failed to list *v1alpha1.FlyteWorkflow: Get "https://10.54.32.1:443/apis/flyte.lyft.com/v1alpha1/flyteworkflows?labelSelector=termination-status+notin+%28terminated%29&limit=500&resourceVersion=0": context deadline exceeded
I0207 13:42:02.338403       1 trace.go:205] Trace[1771930903]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167 (07-Feb-2023 13:41:32.337) (total time: 30000ms):
Trace[1771930903]: ---"Objects listed" error:Get "https://10.54.32.1:443/apis/flyte.lyft.com/v1alpha1/flyteworkflows?labelSelector=termination-status+notin+%28terminated%29&limit=500&resourceVersion=0": context deadline exceeded 30000ms (13:42:02.338)
Trace[1771930903]: [30.000971812s] [30.000971812s] END
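The `labelSelector` in the failing request decodes to `termination-status notin (terminated)`, so propeller only asks for workflows that are not yet marked terminated. A rough sketch of picking manual cleanup candidates from the opposite set could look like the following. It assumes finished workflows carry the label `termination-status=terminated` (inferred from the selector above, not confirmed in this thread), that objects are available as dicts in the usual Kubernetes JSON shape, and that age is measured from `creationTimestamp`; the function names and the age threshold are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Mapping, Tuple

# Label key and value inferred from the list request in the log (assumption).
TERMINATION_LABEL = "termination-status"
TERMINATED_VALUE = "terminated"


def parse_k8s_time(ts: str) -> datetime:
    """Parse a Kubernetes timestamp such as 2023-02-07T13:40:41Z as UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def cleanup_candidates(
    items: Iterable[Mapping], now: datetime, min_age: timedelta
) -> List[Tuple[str, str]]:
    """Return (namespace, name) pairs for terminated objects older than min_age."""
    candidates = []
    for item in items:
        meta = item.get("metadata", {})
        labels = meta.get("labels") or {}
        # Skip anything propeller would still list, i.e. not marked terminated.
        if labels.get(TERMINATION_LABEL) != TERMINATED_VALUE:
            continue
        created = parse_k8s_time(meta["creationTimestamp"])
        if now - created >= min_age:
            candidates.append((meta["namespace"], meta["name"]))
    return candidates
```

The function only selects names and deletes nothing, so the resulting list can be reviewed before anything is removed from the cluster.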

orange-arm-76433

02/07/2023, 1:52 PM

and flyteadmin logs slow queries as well, which seems rather weird to me

flyteadmin [260.964ms] [rows:25] SELECT * FROM "executions" WHERE executions.execution_project = '<redacted>' AND executions.execution_domain = 'development' AND executions.state = 'EXECUTION_ACTIVE' ORDER BY created_at desc LIMIT 25

the query actually takes 200ms-ish, I verified that by going directly to the database. the executions table holds 370k records atm:

flyteadmin=# select count(*) from executions
flyteadmin-# ;
 count  
--------
 373647
(1 row)

freezing-airport-6809

02/07/2023, 3:42 PM

@orange-arm-76433 we do not have AKS supported in open source Flyte deployments. Is this a custom deployment?

freezing-airport-6809

02/07/2023, 3:43 PM

Also IMO you folks should turn on workflow offloading. We have seen fantastic results

orange-arm-76433

02/07/2023, 3:44 PM

we'll give that a try for sure

orange-arm-76433

02/07/2023, 3:45 PM

is there any tuning option for flyteadmin's garbage collection?

orange-arm-76433

02/07/2023, 3:45 PM

or a way to inspect its health/state?

freezing-airport-6809

02/07/2023, 3:45 PM

Yes you can tune the gc as well

freezing-airport-6809

02/07/2023, 3:45 PM

But if etcd is timing out I think gc will fail too

freezing-airport-6809

02/07/2023, 3:46 PM

Gc also has metrics

orange-arm-76433

02/07/2023, 3:46 PM

yes, indeed. once we're in that state it's hard/impossible to come back from

orange-arm-76433

02/07/2023, 3:46 PM

ah good to know, will check that

👍 1

freezing-airport-6809

02/07/2023, 3:46 PM

I think workflow offloading will solve all of this

freezing-airport-6809

02/07/2023, 3:47 PM

We do not have AKS and do not really have a clue

orange-arm-76433

02/07/2023, 3:47 PM

it certainly will improve things, but tbh in my dreams there is no such thing as a flyteworkflow CRD and propeller would just rely on the flyteadmin db 😅

freezing-airport-6809

02/07/2023, 4:10 PM

Performance will be worse

freezing-airport-6809

02/07/2023, 4:10 PM

Sadly
