Hi Team, we have implemented resource quotas on our Flyte namespaces so that even when we run many parallel workflows/tasks, the cluster doesn’t fill up entirely with Flyte pods and prevent other pods from being scheduled.
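For context, the quotas are of this general shape (a minimal sketch; the namespace name and limits here are illustrative, not our actual values):

```yaml
# Illustrative ResourceQuota capping total CPU/memory requests and limits
# for pods in a Flyte project namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: flyte-quota
  namespace: flyte-project-production
spec:
  hard:
    requests.cpu: "64"
    requests.memory: 256Gi
    limits.cpu: "128"
    limits.memory: 512Gi
```

With a quota like this in place, pod creation requests beyond the cap are rejected outright by the API server rather than left Pending.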
It seems that for each task, Flyte asks Kubernetes to create a pod, Kubernetes refuses (quota exceeded), and Flyte simply asks again without any kind of backoff, and it does this for all pending tasks at once.
That puts heavy load on the Kubernetes control plane (which matters less for us, since AWS provides the control plane at a fixed cost), but it also hammers anything with a webhook on the Pods API, so we were seeing crashes and OOMs in things like the Datadog agent and Gatekeeper.
Is there any way to enable some kind of backoff so Flyte doesn’t accidentally bombard our other services?
12/19/2023, 9:37 AM
@Dan Rammer (hamersaw), mind taking a look at this?