# flyte-support
## Is there anywhere we can find out more information...
a
Is there anywhere we can find out more information on the `webApi` settings listed in the `connector.Config` on this docs page? There's a small amount of info on the page, but not a lot. We're still trying to understand why we're unable to use connectors/agents at scale - as soon as we try to send 1000+ tasks to our connectors, flytepropeller starts to significantly slow down: we see the unprocessed queue depth grow, flytepropeller CPU usage spikes, and task throughput is very low. It's not clear whether this is an issue with the connector setup (e.g. the number of gRPC worker threads?), something to do with the propeller web API, or something else. We're trying to identify which specific settings we need to modify to improve propeller 🤝 connector throughput - any advice would be greatly appreciated 🙏
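For context, the `webApi` block in `connector.Config` sits under the agent/connector plugin section of the propeller configmap. A minimal sketch of the layout, assuming the plugin key is `agent-service` (the values shown are placeholders, not defaults):
```yaml
plugins:
  agent-service:
    # webApi carries the base WebAPI plugin settings:
    # caching, readRateLimiter, writeRateLimiter, resourceQuotas.
    webApi:
      caching:
        workers: 10
        resyncInterval: 30s
```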
d
Did your connector autoscale?
Also, you can modify the QPS of the webApi.
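A sketch of that QPS change under the same (assumed) `agent-service` key; the numbers are placeholders to tune, not recommendations:
```yaml
plugins:
  agent-service:
    webApi:
      readRateLimiter:   # throttles read (status) calls made by propeller
        qps: 100
        burst: 1000
      writeRateLimiter:  # throttles create/delete calls made by propeller
        qps: 100
        burst: 1000
```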
a
We've tried scaling the connector to run additional instances, but the new instances didn't receive tasks - only the original instance did. We found that if we restarted flytepropeller it would schedule tasks against multiple connector instances, but it didn't do it straight away. We'll run another test of this behaviour though, because I imagine we'll want to add connector autoscaling regardless. Restarting flytepropeller in any scenario provides a temporary boost to task scheduling speed, but then it slows down again. It looks like a few of the `webApi` settings might be relevant - I spotted there's also some deeper documentation on what these settings do in the code comments 👍 We'll try tweaking `webApi.readRateLimiter.qps`, and I was also thinking of looking at `caching.workers` and `resourceConstraints`? Am I reading correctly that the default `resourceConstraints` mean that only 50 tasks will be scheduled onto a connector per namespace?
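If that reading is correct, raising those constraints would look roughly like the snippet below; the values are illustrative, and the two scopes cap how many connector task executions can be in flight per namespace and per project:
```yaml
plugins:
  agent-service:
    resourceConstraints:
      NamespaceScopeResourceConstraint:
        Value: 500    # concurrent connector tasks allowed per namespace
      ProjectScopeResourceConstraint:
        Value: 1000   # concurrent connector tasks allowed per project
```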
d
> but the new instances didn't receive tasks - only the original instance
You have to set up the round-robin mechanism.
After you edit the configmap, you need to restart the propeller deployment.
a
This is very helpful, thanks! I hadn't seen this setting before - I'll see if adding it to our connector config fixes the issue with not sending requests to all instances. To confirm, am I right in thinking this would look like the below, or does the setup require deeper configuration?
```yaml
my-custom-agent:
  endpoint: "dns:///my-gke-cluster:8000"
  insecure: true
  defaultServiceConfig: '{"loadBalancingConfig": [{"round_robin":{}}]}'
```
d
The DNS endpoint is wrong.
You should use our k8s resolver.
a
Do you have an example? The actual endpoint we use is specific to our cluster, i.e. it points to the k8s service endpoints that we've set up for our custom connectors.
d
`k8s://flyteagent.flyte:8000` (the pattern is `<resolver>://<service name>.<service namespace>:<port>`)
Here is the pattern
Going to sleep
Let me know if this works for you
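Combining that pattern with the earlier snippet, the corrected entry would look roughly like this; `my-custom-agent`, the `flyte` namespace, and port `8000` are placeholders for your own service:
```yaml
my-custom-agent:
  # k8s resolver pattern: <resolver>://<service name>.<service namespace>:<port>
  endpoint: "k8s://my-custom-agent.flyte:8000"
  insecure: true
  defaultServiceConfig: '{"loadBalancingConfig": [{"round_robin":{}}]}'
```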
a
Thanks very much @damp-lion-88352, that example is very helpful! We still haven't confirmed whether this fixes our slow throughput issue, but it definitely seems to have improved the load-balancing behaviour when we scale the agent instances up 👍 I have a list of other settings that we're working through to test whether they further help with the slowdowns we see under load 👀
@damp-lion-88352 Still working on this problem - in particular, we're still seeing `flytepropeller` completely slow down on scheduling new tasks to our connectors over time. Scheduling works fast after restarting `flytepropeller`, but then slows down again. I've tried modifying all of the following settings, but I'm not sure if they're making a difference. We don't need a particularly fast response from our connectors, we just need to be able to schedule a lot of tasks at once - for example, we might have 5K tasks that a connector is running which might take ~3 hours to complete, so we want to poll at a low rate over a long time.
```yaml
pollInterval: 120s
resourceConstraints:
  NamespaceScopeResourceConstraint:
    Value: 500
  ProjectScopeResourceConstraint:
    Value: 1000
webApi:
  caching:
    maxSystemFailures: 10
    workers: 40
    resyncInterval: 120s
  readRateLimiter:
    burst: 1000
    qps: 100
  writeRateLimiter:
    burst: 1000
    qps: 100
  resourceQuotas:
    default: 10000
```
Do you think it would help if we increased the number of `flytepropeller` instances? We're currently using the per-Project sharding strategy. Can you have more than one shard that points to the same project?
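For reference, the per-Project sharding strategy mentioned here is configured in the propeller manager config roughly as below (project names are hypothetical); this only shows the shape of the mapping, not whether a project may appear in more than one shard:
```yaml
manager:
  shard:
    type: Project
    per-shard-mappings:
      - ids:            # shard 0
          - project-a
      - ids:            # shard 1
          - project-b
          - project-c
```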
d
I think you can add workers in flytepropeller
and BTW do you know the CPU usage and memory usage of flytepropeller?
@abundant-judge-84756
```yaml
caching:
  maxSystemFailures: 10
  workers: 40
  resyncInterval: 120s
```
If I were you, I would add more workers and lower the `resyncInterval`. `resyncInterval` is the interval for the `GET` (status) operation in the agent.
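A sketch of that adjustment against the values quoted above; the numbers are illustrative and should be tuned against observed propeller CPU usage and queue depth:
```yaml
webApi:
  caching:
    maxSystemFailures: 10
    workers: 100        # more workers to drain the status-sync queue faster
    resyncInterval: 30s # poll each task's status (GET) more frequently
```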
a
Thanks, that sounds helpful! I'll try adjusting those further 👍 How does `pollInterval` differ from `resyncInterval`, and do these need to be aligned? We've also scaled the main propeller workers and a few other settings - these changes were made in the past and were working quite well for task pod scheduling, but not so well now that we're trying to run more tasks via connector:
```yaml
propeller:
  workers: 800
  gc-interval: 1h
  max-workflow-retries: 50
  workflow-reeval-duration: 30s
  downstream-eval-duration: 30s
  max-streak-length: 8
  kube-client-config:
    qps: 4000     # max rate of requests (queries per second) to kube-apiserver
    burst: 8000   # max burst rate
    timeout: 120s # timeout when talking to kube-apiserver
  event:
    rate: 10000
    capacity: 200000
    max-retries: 10
```
d
I forgot what `pollInterval` does - investigating
and please figure out the resource usage
a
Here are some stats on flytepropeller CPU/memory usage from the last 12 hours. We had a large spike in number of workflows running around 12pm, and have been consistently running around 10K workflows since then. Each workflow has at least 2-3 tasks that get sent to connectors to do the work.
d
You can forget about `pollInterval` - it's for watching supported agent task types.
a
Ah, great to know 👍
d
For resource usage, I want to know the percentage.
a
The blue line is the usage, the green line is the requested value, and the red line is the limit; so I think in terms of percentage it's the ratio between the green and blue lines. For CPU this has been ranging anywhere between 50-200%, and memory is around 10% (unless I've misunderstood the info you need... 😅)
d
Then I think you need a stronger instance for the CPU usage.
going to bed, will help you tmr
a
Thanks very much for the help 🙏 I'm also logging off for the day, but will leave things running with the higher number of `webApi` workers + lower `resyncInterval` and see if they help with the throughput overnight 🤞
Things are currently looking quite promising - since increasing the workers and lowering the `resyncInterval` at around 5:00pm yesterday, we haven't seen propeller completely grind to a halt or start hitting its CPU limits the way it did before making these changes 🤞 I want to do some more tests to see if we can improve connector task scheduling speed, but it's been a lot more stable today so far!
d
NICE!!!
Let me know if you need help
HAPPY FOR YOU
a
Thank you!! I'm going to run some tests to see if there's any further tweaking we might need to do - I'll let you know if we're still noticing any issues! 🙏