abundant-judge-84756  05/02/2025, 10:26 AM
Is there any more detailed documentation for the webApi settings listed in the connector.Config on this docs page? There's a small amount of info on the page, but not a lot.
We're still trying to understand why we're unable to use connectors/agents at scale. As soon as we send 1000+ tasks to our connectors, flytepropeller slows down significantly: the unprocessed queue depth grows, flytepropeller CPU usage spikes, and task throughput drops to a crawl.
It's not clear whether this is an issue with the connector setup (e.g. the number of gRPC worker threads?), something to do with the propeller web API, or something else. We're trying to identify which specific settings to modify to improve propeller 🤝 connector throughput - any advice would be greatly appreciated 🙏
damp-lion-88352  05/02/2025, 12:17 PM
damp-lion-88352  05/02/2025, 12:18 PM
abundant-judge-84756  05/02/2025, 12:40 PM
The webApi settings look relevant - I spotted there's also some deeper documentation on what these settings do in the code comments 👍
We'll try tweaking webApi.readRateLimiter.qps, and I was also thinking of looking at caching.workers and resourceConstraints.
Am I reading correctly that the default resourceConstraints mean that only 50 tasks will be scheduled onto a connector per namespace?
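For reference, the connector plugin's default resource constraints appear to be roughly as sketched below. The 50-per-namespace figure matches the question above; the per-project value is an assumption from memory and should be checked against the code comments mentioned earlier.
resourceConstraints:
  NamespaceScopeResourceConstraint:
    Value: 50   # assumed default: at most ~50 in-flight connector tasks per namespace
  ProjectScopeResourceConstraint:
    Value: 100  # assumed default: at most ~100 in-flight connector tasks per project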
damp-lion-88352  05/02/2025, 3:52 PM
> but the new instances didn't receive tasks - only the original instance
You have to set up a round-robin mechanism.
damp-lion-88352  05/02/2025, 3:53 PM
damp-lion-88352  05/02/2025, 3:54 PM
abundant-judge-84756  05/02/2025, 4:08 PM
my-custom-agent:
endpoint: "dns:///my-gke-cluster:8000"
insecure: true
defaultServiceConfig: '{"loadBalancingConfig": [{"round_robin":{}}]}'
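One thing worth checking with this kind of setup: client-side round_robin only spreads load if the dns:/// target resolves to one address per agent pod. A regular ClusterIP Service resolves to a single virtual IP, so the gRPC client keeps talking to whichever pod its connection lands on. A headless Service returns one record per ready pod. A minimal sketch, assuming the agent pods carry the label app: my-custom-agent and serve gRPC on port 8000 (all names here are illustrative, not taken from the thread):
apiVersion: v1
kind: Service
metadata:
  name: my-gke-cluster       # must match the host in the dns:/// endpoint above
spec:
  clusterIP: None            # headless: DNS returns one A record per ready pod
  selector:
    app: my-custom-agent     # illustrative label on the agent/connector pods
  ports:
    - name: grpc
      port: 8000
      targetPort: 8000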
damp-lion-88352  05/02/2025, 4:29 PM
damp-lion-88352  05/02/2025, 4:29 PM
abundant-judge-84756  05/02/2025, 4:34 PM
damp-lion-88352  05/02/2025, 4:37 PM
damp-lion-88352  05/02/2025, 4:37 PM
damp-lion-88352  05/02/2025, 4:37 PM
damp-lion-88352  05/02/2025, 4:37 PM
damp-lion-88352  05/02/2025, 4:37 PM
abundant-judge-84756  05/07/2025, 2:10 PM
abundant-judge-84756  05/13/2025, 3:23 PM
We're still seeing flytepropeller completely slow down on scheduling new tasks to our connectors over time. Scheduling is fast after restarting flytepropeller, but then slows down again.
I've tried modifying all of the following settings, but I'm not sure they're making a difference. We don't need a particularly fast response from our connectors; we just need to be able to schedule a lot of tasks at once. For example, a connector might be running 5K tasks that take ~3 hours to complete, so we want to poll at a low rate over a long time.
pollInterval: 120s
resourceConstraints:
  NamespaceScopeResourceConstraint:
    Value: 500
  ProjectScopeResourceConstraint:
    Value: 1000
webApi:
  caching:
    maxSystemFailures: 10
    workers: 40
    resyncInterval: 120s
  readRateLimiter:
    burst: 1000
    qps: 100
  writeRateLimiter:
    burst: 1000
    qps: 100
  resourceQuotas:
    default: 10000
Do you think it would help if we increased the number of flytepropeller instances? We're currently using the per-Project sharding strategy. Can you have more than one shard that points to the same project?
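Rough arithmetic on the polling load with the values above, assuming the cache issues one Get call per in-flight task every resyncInterval: 5,000 tasks / 120 s ≈ 42 Get calls per second, i.e. roughly one call per second for each of the 40 caching workers, and well under readRateLimiter.qps: 100. So with these settings the read rate limiter itself is unlikely to be the bottleneck.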
damp-lion-88352  05/13/2025, 3:24 PM
damp-lion-88352  05/13/2025, 3:25 PM
damp-lion-88352  05/13/2025, 3:25 PM
damp-lion-88352  05/13/2025, 3:26 PM
caching:
  maxSystemFailures: 10
  workers: 40
  resyncInterval: 120s
If I were you, I would add more workers and lower the resyncInterval time.
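Concretely, that suggestion would look something like the sketch below; the exact numbers are only illustrative, not values recommended in the thread.
caching:
  maxSystemFailures: 10
  workers: 100         # illustrative: more cache workers so resyncs don't queue up
  resyncInterval: 60s  # illustrative: resync (poll) task status more often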
damp-lion-88352  05/13/2025, 3:27 PM
resyncInterval controls the GET operation (status polling) against the agent.
abundant-judge-84756  05/13/2025, 3:30 PM
How does pollInterval differ from resyncInterval, and do these need to be aligned?
We've also scaled the main propeller workers and a few other settings. These changes were made in the past and worked quite well for task pod scheduling, but not so well now that we're trying to run more tasks via connectors:
propeller:
  workers: 800
  gc-interval: 1h
  max-workflow-retries: 50
  workflow-reeval-duration: 30s
  downstream-eval-duration: 30s
  max-streak-length: 8
  kube-client-config:
    qps: 4000 # Refers to max rate of requests (queries per second) to kube-apiserver
    burst: 8000 # Refers to max burst rate
    timeout: 120s # Refers to timeout when talking with the kube-apiserver
  event:
    rate: 10000
    capacity: 200000
    max-retries: 10
damp-lion-88352  05/13/2025, 3:51 PM
damp-lion-88352  05/13/2025, 3:52 PM
abundant-judge-84756  05/13/2025, 3:57 PM
damp-lion-88352  05/13/2025, 3:59 PM
damp-lion-88352  05/13/2025, 3:59 PM
abundant-judge-84756  05/13/2025, 3:59 PM
damp-lion-88352  05/13/2025, 4:00 PM
abundant-judge-84756  05/13/2025, 4:05 PM
damp-lion-88352  05/13/2025, 4:06 PM
damp-lion-88352  05/13/2025, 4:07 PM
abundant-judge-84756  05/13/2025, 4:07 PM
abundant-judge-84756  05/14/2025, 12:45 PM
damp-lion-88352  05/14/2025, 12:50 PM
damp-lion-88352  05/14/2025, 12:50 PM
damp-lion-88352  05/14/2025, 12:50 PM
damp-lion-88352  05/14/2025, 12:50 PM
abundant-judge-84756  05/14/2025, 12:58 PM