# ask-ai
Olivier Sevin: Hey @Glime Is the backoff mechanism when reaching a ResourceQuota applied per pod instead of per namespace?
Glime: Yes, the backoff mechanism when reaching a resource quota in Kubernetes, which Flyte utilizes, is typically enforced at the namespace level, not per individual pod. This means that if the resource quota is exceeded in a namespace, any new pods that attempt to start within that namespace will be affected by the backoff mechanism until resources are freed up or the quota is adjusted. However, the specific behavior can depend on how the Kubernetes cluster and Flyte are configured.

This is an AI-generated response; take its recommendations with caution.

Sources:
- https://discuss.flyte.org/t/6170/hi-all-i-tried-to-upgrade-my-google-gke-flyte-deployment-fro/
- https://discuss.flyte.org/t/2759883/hi-everyone-i-was-wondering-where-i-could-look-inside-k8s-to/
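For reference, a minimal sketch of what a namespace-scoped ResourceQuota looks like; the namespace name and limit values below are made up for illustration, not taken from this thread:

```yaml
# Hypothetical example: a ResourceQuota is always defined per namespace,
# so every pod created in this namespace counts against the same budget.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota             # illustrative name
  namespace: flyte-project-dev    # illustrative Flyte project/domain namespace
spec:
  hard:
    requests.cpu: "64"            # total CPU requested across all pods in the namespace
    requests.memory: 256Gi        # total memory requested across all pods
    pods: "200"                   # maximum number of pods in the namespace
```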
Olivier Sevin: @Glime We see long periods with hundreds of rejected attempts per minute from pods of the same map task. Is there something we need to enable for the backoff?
Glime: To address the issue of long periods with hundreds of rejected attempts per minute from pods of the same map task, it's important to consider implementing a backoff strategy. This can help manage the rate at which pods are retried, potentially reducing the load on the Kubernetes API and improving the overall stability of your system. In Kubernetes, the backoff mechanism for pod retries can be influenced by settings such as `backoffLimit` and `restartPolicy` in the Job configuration. However, specific configurations related to backoff mechanisms for Flyte tasks are not directly mentioned in the provided context.

If you are experiencing high rates of pod failure and retries, you might want to look into:
1. Adjusting the `backoffLimit` and `restartPolicy` settings in your Kubernetes Job configurations (see the sketch below).
2. Ensuring that your resource quotas and limits are appropriately configured to handle the load.
3. Investigating any underlying issues that might be causing the high failure rate of pods, such as network instability, insufficient resources, or configuration errors.

For more detailed guidance and to implement a solution tailored to Flyte, I recommend checking the Flyte documentation or reaching out on the Flyte community channels for specific advice related to backoff configurations in Flyte tasks.

Sources:
- https://discuss.flyte.org/t/12043119/we-are-regularly-hitting-the-issue-mentioned-in-this-resolve/
- https://discuss.flyte.org/t/10974209/hey-all-i-m-back-with-a-quick-question-i-have-a-situation-wh/
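As a rough illustration of where those two fields live, here is a plain Kubernetes Job; this is not Flyte-specific (Flyte-launched pods are managed by Propeller rather than by a Job spec), and the names and values are made up:

```yaml
# Hypothetical example of a generic Kubernetes Job showing the two fields
# mentioned above.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  backoffLimit: 4              # how many times the Job controller retries failed pods
  template:
    spec:
      restartPolicy: Never     # don't restart containers in place; let the Job retry
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "echo hello"]
```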
David Espejo: @Olivier Sevin the backoff handler is enforced per Pod. I'm looking for the config that would enable the mechanism in Propeller.

Seems like the base config for the backoff handler is built using some configurable parameters that you should be able to add under the `tasks` section in your Helm values file. What I don't see is a way to enable/disable it; it seems like the backoff handler is enabled by default? cc @Dan Rammer (hamersaw)
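A minimal sketch of what that block might look like, assuming Propeller's `tasks.backoff` settings; the key names and defaults here are assumptions, and where the block nests in your Helm values depends on the chart you use, so treat it as illustrative rather than a verified configuration:

```yaml
# Hypothetical sketch of the FlytePropeller backoff settings discussed above.
# The `tasks.backoff` keys are assumptions based on Propeller's config; the
# exact Helm path varies by chart (e.g. flyte-core vs flyte-binary).
tasks:
  backoff:
    base-second: 2       # base for the exponential backoff between pod creation attempts
    max-duration: 10m    # cap on how long the backoff interval can grow
```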
Olivier Sevin: OK, per pod explains the huge number of retries. I tried to set the max backoff to 2m and noticed the same subtask pod creation being rejected at a faster rate than that (>1 minute but less than 2), but I just put it up to 10m, so we'll see if this still happens. Thanks!

It didn't make any difference: the same pods were being rejected because of ResourceQuotas <1m apart for a long period even after setting it to 10m. Maybe this is one of those things that doesn't work for pods that are part of legacy map tasks?
m: @David Espejo (he/him) Wondering if you could help us understand whether this backoff ought to be working for legacy map tasks? As @Olivier Sevin noted, it doesn't seem to have an effect, and we're really hammering the k8s API server when using resource quotas with workflows that have large fanouts, which kind of defeats the purpose.
David Espejo:
> this backoff ought to be working for legacy map tasks?
Sorry for the delays. Legacy map tasks don't support the backoff controller. ArrayNodes will include that functionality.