# flyte-support
a
We've recently made an update to our flyte setup to use self-hosted agents for running more of our tasks, particularly because we often need to run thousands of the same type of task (e.g. querying a database) and it seems more efficient to run these from a centralised agent service. Since making this change, we've been trying to process a large number of workflows at scale (e.g. 15K concurrent workflows) but are seeing very degraded performance. Some tasks spend multiple hours in a 'queued' state even when there are agents available for the tasks to be scheduled onto. We're having trouble isolating whether this is related to our agent configuration, or to flytepropeller itself struggling under the load. We're running a self-hosted flyte cluster using the flyte-core helm chart. Are there any common gotchas or config settings we should look at for scheduling many concurrent agent tasks?
f
We consistently run 15k workflow completions per second on a single propeller
It does need specific tweaking
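For reference, a sketch of the propeller-side knobs that usually need raising for this kind of load, wired in through flyte-core's configmap values. The key names follow flytepropeller's config sections, but defaults shift between versions and the numbers here are illustrative, not recommendations:
```yaml
propeller:
  workers: 100              # concurrent workflow-evaluation workers; the default is conservative
  kube-client-config:
    qps: 100                # client-side throttling against the kube API server;
    burst: 200              # stock settings throttle hard at thousands of concurrent workflows
    timeout: 30s
  queue:
    type: batch
    batching-interval: 1s   # how often queued workflow re-evaluations are drained
    batch-size: -1
    queue:
      type: maxof
      rate: 500             # workqueue rate limit; too low leaves items sitting queued
      capacity: 1000
admin-launcher:
  tps: 100                  # rate of propeller -> flyteadmin calls
  burst: 200
  cacheSize: 10000
  workers: 50
```
If tasks sit in queued while agents idle, the workqueue rate limit and the kube client QPS/burst are common first suspects.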
a
Is the tweaking on the propeller side, or on the agent (/connector) side? I.e. is there some limit on the number of requests an agent can receive? If we run `pyflyte serve agent` in a container on a well-provisioned k8s pod, is it expected to handle any traffic propeller throws at it, or are there specific settings we might need to change?
For clarity, we're not using the `flyteagent` deployment that comes with flyte-core, but a standalone k8s deployment that runs the agent server and handles `MyAgentTask` types.
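For context on that setup: with a standalone deployment, propeller reaches the agent through the agent-service plugin config rather than the bundled `flyteagent` service. A sketch of that wiring, where the endpoint/service name are placeholders and the `webApi` keys come from the web-API plugin framework the agent plugin is built on (check them against your Flyte version):
```yaml
tasks:
  task-plugins:
    enabled-plugins:
      - agent-service
    default-for-task-types:
      MyAgentTask: agent-service
plugins:
  agent-service:
    supportedTaskTypes:
      - MyAgentTask
    defaultAgent:
      endpoint: "dns:///my-agent.my-namespace.svc.cluster.local:8000"  # placeholder service
      insecure: true
      defaultTimeout: 10s
    webApi:
      readRateLimiter:
        qps: 100          # propeller-side limit on get/poll calls to the agent
        burst: 200
      writeRateLimiter:
        qps: 100          # ...and on create/delete calls
        burst: 200
      caching:
        workers: 50       # pollers refreshing task state from the agent
        resyncInterval: 30s
        size: 500000
      resourceQuotas:
        default: 20000    # cap on in-flight agent tasks; a full quota leaves new tasks in 'queued'
```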
f
You should use asyncio
And you should tweak propeller too
So both
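To make the asyncio point concrete: the agent server runs handlers on an event loop, so a synchronous database client inside `create`/`get` blocks the loop and serializes every in-flight request, which looks exactly like tasks stuck in queued. A minimal sketch against `flytekit.extend.backend.base_agent` (the interface has shifted across flytekit versions, and the query helpers below are stand-ins for a real non-blocking driver such as asyncpg or aiohttp):
```python
from dataclasses import dataclass
from typing import Optional

from flyteidl.core.execution_pb2 import TaskExecution
from flytekit.extend.backend.base_agent import (
    AgentRegistry,
    AsyncAgentBase,
    Resource,
    ResourceMeta,
)
from flytekit.models.literals import LiteralMap
from flytekit.models.task import TaskTemplate


# Stand-ins for a real non-blocking driver: the point is that they're
# awaitable, so the server can interleave thousands of requests.
async def submit_query_async(task_template: TaskTemplate) -> str:
    return "job-123"


async def poll_query_async(job_id: str) -> bool:
    return True


async def cancel_query_async(job_id: str) -> None:
    pass


@dataclass
class QueryMetadata(ResourceMeta):
    job_id: str


class MyAgent(AsyncAgentBase):
    name = "My Agent"

    def __init__(self):
        super().__init__(task_type_name="MyAgentTask", metadata_type=QueryMetadata)

    # async def, not def: one blocking driver call here would stall the
    # event loop and queue every other request behind it.
    async def create(
        self, task_template: TaskTemplate, inputs: Optional[LiteralMap] = None, **kwargs
    ) -> QueryMetadata:
        job_id = await submit_query_async(task_template)
        return QueryMetadata(job_id=job_id)

    async def get(self, resource_meta: QueryMetadata, **kwargs) -> Resource:
        done = await poll_query_async(resource_meta.job_id)
        phase = TaskExecution.SUCCEEDED if done else TaskExecution.RUNNING
        return Resource(phase=phase)

    async def delete(self, resource_meta: QueryMetadata, **kwargs) -> None:
        await cancel_query_async(resource_meta.job_id)


AgentRegistry.register(MyAgent())
```
The agent is served the same way as before with `pyflyte serve agent`; the only change is that the handlers are coroutines end to end.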