# flyte-support
r
We submitted around 10k jobs to Flyte. Not sure if it is overloaded because of this, but most of the jobs were stuck in the Running stage or in an Unknown state. We had to delete them using flytectl, and most of them are still in the Aborting state. I don't think CPU or memory is the reason, since consumption on the Flyte pods is well under the limits. Is there any way to see why Flyte is not able to process those messages?
c
https://www.union.ai/docs/flyte/deployment/flyte-configuration/performance/ FlytePropeller's Kubernetes API client rate limits would be my guess. Getting Prometheus metrics up and running with a Grafana dashboard helps a lot to see where things are breaking down.
👍 1
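For reference, a minimal sketch of the kind of knobs the performance doc above covers: FlytePropeller's worker count and its Kubernetes client QPS/burst limits. The key names follow the Flyte propeller configuration reference; the numbers are purely illustrative, so tune them against your own metrics.

```yaml
# FlytePropeller config sketch (key names per the Flyte propeller config
# reference; values are illustrative, not recommendations)
propeller:
  workers: 100              # concurrent workflow-processing workers
  kube-client-config:
    qps: 100                # sustained requests/sec the k8s client may issue
    burst: 50               # short-term burst allowance above qps
    timeout: 30s            # per-request timeout against the API server
```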
l
Flyte relies heavily on etcd for scheduling state, which can get bottlenecked if you throw everything at it all at once. If the jobs are long-running, look into setting `max_parallelism` or `concurrency` to something sensible (see the sketch after this message). If they are short, consider aggregating them into larger jobs, i.e. batch them and parallelize within the task via Python code.
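For concreteness, a minimal flytekit sketch of both suggestions: a map task that fans out over many small inputs with a bounded `concurrency`, and a launch plan that caps workflow-wide parallelism via `max_parallelism`. Names like `wf_capped` are made up for illustration.

```python
from typing import List

from flytekit import LaunchPlan, map_task, task, workflow


@task
def process(x: int) -> int:
    # Stand-in for the real per-item work.
    return x * 2


@workflow
def wf(xs: List[int]) -> List[int]:
    # map_task fans out over xs; `concurrency` bounds how many mapped
    # subtasks run at once instead of launching all of them together.
    return map_task(process, concurrency=50)(x=xs)


# max_parallelism caps how many task nodes of a single execution of this
# launch plan may run concurrently.
lp = LaunchPlan.get_or_create(
    workflow=wf,
    name="wf_capped",
    max_parallelism=50,
)
```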
r
What do these values mean, and where do I update them? In etcd or in Flyte?
f
These values are set per workflow (on its launch plan), or they can be given a platform-wide default.
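That is, you can set `max_parallelism` on a specific launch plan as in the Python sketch above, or default it for everything in the FlyteAdmin config. Assuming the standard admin config key, something like:

```yaml
# FlyteAdmin config sketch (assumption: standard admin config key;
# applies whenever a launch plan doesn't set max_parallelism itself)
flyteadmin:
  maxParallelism: 25
```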