Just wondering about `flytepropeller` scaling / co...
# flyte-support
b
Just wondering about `flytepropeller` scaling / configuration:
• Is it generally better to work on saturating a single instance before sharding? I presume you just bump up the worker count?
• If you do set `flytepropeller.manager` to `true` and set the manager config to `type: Hash` and `shard-count: 4`, are you supposed to set `flytePropeller.replicaCount` to match the shard count? Or does the manager keep the right number of propellers running through some other means?
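For reference, a minimal sketch of the setup being asked about, written as a flyte-core values override. The nesting of the shard settings under `configmap.core.manager.shard` is an assumption about the chart layout; the values themselves are the ones mentioned above.

```yaml
# Sketch only: assumes the flyte-core chart layout; verify key paths against your chart version.
flytepropeller:
  manager: true          # run the propeller manager instead of a single propeller
  # Whether replicaCount also has to match shard-count is the open question here.

configmap:
  core:
    manager:
      shard:
        type: Hash       # hash-based shard strategy
        shard-count: 4   # total number of shards
```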
actually now just saw the RFC with...
[FlytePropeller] is highly optimized and a single instance can run thousands of concurrent workflows.
https://github.com/flyteorg/flyte/blob/master/rfc/system/1483-flytepropeller-horizontal-scaling.md
Just wondering about all this because we have a close-to-default propeller config, and it doesn't use a lot of CPU/RAM or seem to run out of workers, but I feel like our larger workflows could be kicking things off faster.
c
As another Flyte user who is currently scale testing it, I would use this guide: https://www.union.ai/docs/flyte/deployment/flyte-configuration/performance/ We have the Grafana dashboard set up, which is very helpful for seeing scale issues. You will need to tweak workers, queue rate limits, and k8s client rate limits. Key items to look for are unprocessed queue depth and node queueing latency. I also have a PR up for review that heavily reduces k8s API calls and etcd load, which helps a bit. With that PR we are easily running thousands of concurrent tasks under a single propeller.
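A rough sketch of where those three knobs sit, assuming the flyte-core Helm chart layout (`configmap.core.propeller`). The key paths are my reading of the chart and the numbers are placeholders for illustration, not recommendations; check both against your chart version and the guide.

```yaml
# Hypothetical values override; every number below is illustrative only.
configmap:
  core:
    propeller:
      workers: 40          # worker count
      queue:
        type: batch
        queue:
          type: maxof
          rate: 1000       # rate limit for the main work queue
          capacity: 10000  # capacity for the main work queue
      kube-client-config:
        qps: 100           # k8s client request rate limit
        burst: 120         # k8s client burst allowance
```

Change one group at a time and watch unprocessed queue depth and node queueing latency on the dashboard, so it is clear which setting actually moved the numbers.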
b
Just noticed that the guide says `propeller.queue.capacity`. Default value: `10000`, but that's clearly not the value in the flyte-core chart? https://github.com/flyteorg/flyte/blob/v1.15.3/charts/flyte-core/values.yaml#L862
c
They're not wrong that at the code level that is the default: https://github.com/flyteorg/flyte/blob/master/flytepropeller/pkg/controller/config/config.go#L82 flyte-core is just one of many ways to deploy Flyte, but it is weird that flyte-core has a lower value.
b
I went through and found a lot of values that are more conservative for this deployment option. I've moved a bunch of them to the default or slightly beyond to see if it helps.