# ask-ai
Olivier Sevin: Hey @Glime Is the backoff mechanism when reaching a ResourceQuota applied per pod instead of per namespace?
Glime: Yes, the backoff mechanism when reaching a resource quota in Kubernetes, which Flyte utilizes, is typically enforced at the namespace level, not per individual pod. This means that if the resource quota is exceeded in a namespace, any new pods that attempt to start within that namespace will be affected by the backoff mechanism until resources are freed up or the quota is adjusted. However, the specific behavior can depend on how the Kubernetes cluster and Flyte are configured.

This is an AI-generated response; take its recommendations with caution.

Sources:
- https://discuss.flyte.org/t/6170/hi-all-i-tried-to-upgrade-my-google-gke-flyte-deployment-fro/
- https://discuss.flyte.org/t/2759883/hi-everyone-i-was-wondering-where-i-could-look-inside-k8s-to/
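For reference, a minimal sketch of what a namespace-scoped ResourceQuota looks like; the namespace name and limit values below are made up for illustration, not taken from this thread:

```yaml
# Hypothetical example: a ResourceQuota is always defined per namespace,
# so every pod created in this namespace counts against the same budget.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota             # illustrative name
  namespace: flyte-project-dev    # illustrative Flyte project/domain namespace
spec:
  hard:
    requests.cpu: "64"            # total CPU requested across all pods in the namespace
    requests.memory: 256Gi        # total memory requested across all pods
    pods: "200"                   # maximum number of pods in the namespace
```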
Olivier Sevin: @Glime We see long periods with hundreds of rejected attempts per minute from pods of the same map task. Is there something we need to enable for the backoff?
Glime: To address the issue of long periods with hundreds of rejected attempts per minute from pods of the same map task, it's important to consider implementing a backoff strategy. This can help manage the rate at which pods are retried, potentially reducing the load on the Kubernetes API and improving the overall stability of your system. In Kubernetes, the backoff mechanism for pod retries can be influenced by settings such as `backoffLimit` and `restartPolicy` in the Job configuration. However, specific configurations related to backoff mechanisms for Flyte tasks are not directly mentioned in the provided context.

If you are experiencing high rates of pod failure and retries, you might want to look into:
1. Adjusting the `backoffLimit` and `restartPolicy` settings in your Kubernetes Job configurations (see the sketch below).
2. Ensuring that your resource quotas and limits are appropriately configured to handle the load.
3. Investigating any underlying issues that might be causing the high failure rate of pods, such as network instability, insufficient resources, or configuration errors.

For more detailed guidance and to implement a solution tailored to Flyte, I recommend checking the Flyte documentation or reaching out on the Flyte community channels for specific advice related to backoff configurations in Flyte tasks.

Sources:
- https://discuss.flyte.org/t/12043119/we-are-regularly-hitting-the-issue-mentioned-in-this-resolve/
- https://discuss.flyte.org/t/10974209/hey-all-i-m-back-with-a-quick-question-i-have-a-situation-wh/
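As a rough illustration of where those two fields live, here is a plain Kubernetes Job; this is not Flyte-specific (Flyte-launched pods are managed by Propeller rather than by a Job spec), and the names and values are made up:

```yaml
# Hypothetical example of a generic Kubernetes Job showing the two fields
# mentioned above.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  backoffLimit: 4              # how many times the Job controller retries failed pods
  template:
    spec:
      restartPolicy: Never     # don't restart containers in place; let the Job retry
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "echo hello"]
```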
David Espejo: @Olivier Sevin the backoff handler is enforced per Pod. I'm looking for the config that would enable the mechanism in Propeller.

Seems like the base config for the backoff handler is built using some configurable parameters that you should be able to add under the `tasks` section in your Helm values file. What I don't see is a way to enable/disable it; it seems like the backoff handler is enabled by default? cc @Dan Rammer (hamersaw)
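A minimal sketch of what that block might look like, assuming Propeller's `tasks.backoff` settings; the key names and defaults here are assumptions, and where the block nests in your Helm values depends on the chart you use, so treat it as illustrative rather than a verified configuration:

```yaml
# Hypothetical sketch of the FlytePropeller backoff settings discussed above.
# The `tasks.backoff` keys are assumptions based on Propeller's config; the
# exact Helm path varies by chart (e.g. flyte-core vs flyte-binary).
tasks:
  backoff:
    base-second: 2       # base for the exponential backoff between pod creation attempts
    max-duration: 10m    # cap on how long the backoff interval can grow
```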
Olivier Sevin: OK, per pod explains the huge number of retries. I tried to set the max backoff to 2m and noticed the same subtask pod creation being rejected at a faster rate than that (>1 minute but less than 2), but I just put it up to 10m, so we'll see if this still happens. Thanks!

It didn't make any difference: the same pods were being rejected because of ResourceQuotas <1m apart for a long period even after setting it to 10m. Maybe this is one of those things that doesn't work for pods that are part of legacy map tasks?
m: @David Espejo (he/him) Wondering if you could help us understand whether this backoff ought to be working for legacy map tasks? As @Olivier Sevin noted, it doesn't seem to have an effect, and we're really hammering the k8s API server when using resource quotas with workflows that have large fanouts, which kind of defeats the purpose.
David Espejo:
> this backoff ought to be working for legacy map tasks?
Sorry for the delays. Legacy map tasks don't support the backoff controller. ArrayNodes will include that functionality.