Hey! I have a Flyte deployment in GKE using this great repo. I've been scaling it up quite a bit over the past few days via `map_tasks`, but I've been having trouble with the interaction between GKE's autoscaling and Flyte. Unfortunately, the pod rejection below is killing my workflow executions in random patterns (sometimes workflow executions that run for 12 hours finish fine, others fail after 2 hours). Has anyone encountered this, and does anyone know of a solution?
```
[37][58][66][72][87][93][97][105][142][158][166][181][196][211][257][292-293][383][414][429][438][470][487][514][517][523][529][533][586][601][660][788][806][968][1002][1029][1048][1296][1313][1341][1360][1397][1421][1450][1587][1610][1618][1638][1692][1754][1774][1779][1863][1871][1886][1890][1938][1940][1943][1951][2020-2021][2031-2032][2083][2087][2168][2196][2299][2330]: Pod was terminated in response to imminent node shutdown.
[primary] terminated with exit code (241). Reason [Error]. Message:
.
[1191][1561]: [1/1] currentAttempt done. Last Error: USER::Pod was rejected: Pod was rejected as the node is shutting down.
```
average-finland-92144
10/30/2024, 9:16 PM
@brainy-nail-23390 could you check if those pods had the `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation?
seems like the CA is scaling down (to zero?) and evicting the Flyte pods
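For reference, here's a minimal sketch of how that annotation could be pinned on the task pods so the cluster autoscaler leaves their nodes alone mid-run. This assumes flytekit's `PodTemplate` support; the annotation value and the task/workflow names are just examples, not something Flyte sets by default:

```python
from flytekit import PodTemplate, map_task, task, workflow

# Example: mark every map-task pod as not safe to evict, so the GKE cluster
# autoscaler does not drain the node while these pods are still running.
no_evict = PodTemplate(
    annotations={"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"},
)


@task(pod_template=no_evict)
def process_chunk(x: int) -> int:
    # placeholder body; the real work goes here
    return x * 2


@workflow
def wf(xs: list[int]) -> list[int]:
    # each mapped pod inherits the annotation from the pod template above
    return map_task(process_chunk)(x=xs)
```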
brainy-nail-23390
10/31/2024, 12:35 AM
Hey @average-finland-92144! I've only had this happen in isolated cases with many concurrent workflow executions, so the node pool scaling down to zero does sound odd. I can't grab the annotations for pods that didn't complete successfully, AFAICT.
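One option might be to dump the annotations while the pods are still running, before the autoscaler gets to them. A rough sketch with the kubernetes Python client (the namespace below is just an example project-domain namespace):

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() from inside the cluster
v1 = client.CoreV1Api()

# List the pods of a running execution and print their annotations.
# "flytesnacks-development" is an example namespace; substitute your own.
for pod in v1.list_namespaced_pod(namespace="flytesnacks-development").items:
    print(pod.metadata.name, pod.metadata.annotations)
```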
average-finland-92144
10/31/2024, 5:14 PM
does it happen only with `map_tasks`?
can we test some task and see if that annotation is added?
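Something like this could work as a probe: a trivial standalone task to run next to a small `map_task`, so the pod annotations of both can be compared while they're up (names are placeholders):

```python
from flytekit import task, workflow


# Minimal probe task: run it, then inspect its pod's metadata (e.g. with the
# kubernetes-client snippet above) to see which annotations Flyte adds.
@task
def annotation_probe() -> str:
    return "ok"


@workflow
def probe_wf() -> str:
    return annotation_probe()
```

Then, while the probe pod is up, check its `metadata.annotations` the same way as above and compare against one of the map-task pods.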