# flyte-support
## Is there anywhere we can find out more information...
a
Is there anywhere we can find out more information on the `webApi` settings listed in the `connector.Config` on this docs page? There's a small amount of info on the page, but not a lot. We're still trying to understand why we're unable to use connectors/agents at scale - as soon as we try to send 1000+ tasks to our connectors, flytepropeller starts to significantly slow down: we see the unprocessed queue depth grow, flytepropeller CPU usage spikes, and task throughput is very low. It's not clear whether this is an issue with the connector setup (e.g. the number of gRPC worker threads?), something to do with the propeller web API, or something else. We're trying to identify which specific settings we need to modify to improve propeller 🤝 connector throughput - any advice would be greatly appreciated 🙏
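For context, the `webApi` block in `connector.Config` sits under the agent/connector plugin section of the propeller configmap. A minimal sketch of the layout, assuming the plugin key is `agent-service` (the values shown are placeholders, not defaults):
```yaml
plugins:
  agent-service:
    # webApi carries the base WebAPI plugin settings:
    # caching, readRateLimiter, writeRateLimiter, resourceQuotas.
    webApi:
      caching:
        workers: 10
        resyncInterval: 30s
```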
d
Did your connector autoscale?
Also, you can modify the QPS of the webApi.
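A sketch of that QPS change under the same (assumed) `agent-service` key; the numbers are placeholders to tune, not recommendations:
```yaml
plugins:
  agent-service:
    webApi:
      readRateLimiter:   # throttles read (status) calls made by propeller
        qps: 100
        burst: 1000
      writeRateLimiter:  # throttles create/delete calls made by propeller
        qps: 100
        burst: 1000
```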
a
We've tried scaling the connector to run additional instances, but the new instances didn't receive tasks - only the original instance did. We found that if we restarted flytepropeller it would schedule tasks against multiple connector instances, but it didn't do it straight away. We'll run another test of this behaviour though, because I imagine we'll want to add connector autoscaling regardless. Restarting flytepropeller in any scenario provides a temporary boost to task scheduling speed, but then it slows down again. It looks like a few of the `webApi` settings might be relevant - I spotted there's also some deeper documentation on what these settings do in the code comments 👍 We'll try tweaking `webApi.readRateLimiter.qps`, and I was also thinking of looking at `caching.workers` and `resourceConstraints`? Am I reading correctly that the default `resourceConstraints` mean that only 50 tasks will be scheduled onto a connector per namespace?
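If that reading is correct, raising those constraints would look roughly like the snippet below; the values are illustrative, and the two scopes cap how many connector task executions can be in flight per namespace and per project:
```yaml
plugins:
  agent-service:
    resourceConstraints:
      NamespaceScopeResourceConstraint:
        Value: 500    # concurrent connector tasks allowed per namespace
      ProjectScopeResourceConstraint:
        Value: 1000   # concurrent connector tasks allowed per project
```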
d
> but the new instances didn't receive tasks - only the original instance
You have to set up the round-robin mechanism.
After you edit the configmap, you need to restart the propeller deployment.
a
This is very helpful, thanks! I hadn't seen this setting before - I'll see if adding it to our connector config fixes the issue with not sending requests to all instances. To confirm, am I right in thinking this would look like the below, or does the setup require deeper configuration?
```yaml
my-custom-agent:
  endpoint: "dns:///my-gke-cluster:8000"
  insecure: true
  defaultServiceConfig: '{"loadBalancingConfig": [{"round_robin":{}}]}'
```
d
The DNS endpoint is wrong.
You should use our k8s resolver.
a
Do you have an example? The actual endpoint we use is specific to our cluster, i.e. it points to the k8s service endpoints that we've set up for our custom connectors.
d
`k8s://flyteagent.flyte:8000` (the pattern is `<resolver>://<service name>.<service namespace>:<port>`)
Here is the pattern
Going to sleep
Let me know if this works for you
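Combining that pattern with the earlier snippet, the corrected entry would look roughly like this; `my-custom-agent`, the `flyte` namespace, and port `8000` are placeholders for your own service:
```yaml
my-custom-agent:
  # k8s resolver pattern: <resolver>://<service name>.<service namespace>:<port>
  endpoint: "k8s://my-custom-agent.flyte:8000"
  insecure: true
  defaultServiceConfig: '{"loadBalancingConfig": [{"round_robin":{}}]}'
```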
a
Thanks very much @damp-lion-88352, that example is very helpful! We still haven't confirmed whether this fixes our slow throughput issue, but it definitely seems to have improved the load-balancing behaviour when we scale the agent instances up 👍 I have a list of other settings that we're working through to test whether they further help with the slowdowns we see under load 👀
@damp-lion-88352 Still working on this problem - in particular, we're still seeing `flytepropeller` completely slow down on scheduling new tasks to our connectors over time. Scheduling works fast after restarting `flytepropeller`, but then slows down again. I've tried modifying all of the following settings, but I'm not sure if they're making a difference. We don't need a particularly fast response from our connectors, we just need to be able to schedule a lot of tasks at once - for example, we might have 5K tasks that a connector is running which might take ~3 hours to complete, so we want to poll at a low rate over a long time.
```yaml
pollInterval: 120s
resourceConstraints:
  NamespaceScopeResourceConstraint:
    Value: 500
  ProjectScopeResourceConstraint:
    Value: 1000
webApi:
  caching:
    maxSystemFailures: 10
    workers: 40
    resyncInterval: 120s
  readRateLimiter:
    burst: 1000
    qps: 100
  writeRateLimiter:
    burst: 1000
    qps: 100
  resourceQuotas:
    default: 10000
```
Do you think it would help if we increased the number of `flytepropeller` instances? We're currently using the per-Project sharding strategy. Can you have more than one shard that points to the same project?
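For reference, the per-Project sharding strategy mentioned here is configured in the propeller manager config roughly as below (project names are hypothetical); this only shows the shape of the mapping, not whether a project may appear in more than one shard:
```yaml
manager:
  shard:
    type: Project
    per-shard-mappings:
      - ids:            # shard 0
          - project-a
      - ids:            # shard 1
          - project-b
          - project-c
```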
d
I think you can add workers in flytepropeller
and BTW do you know the CPU usage and memory usage of flytepropeller?
@abundant-judge-84756
```yaml
caching:
  maxSystemFailures: 10
  workers: 40
  resyncInterval: 120s
```
If I were you, I would add more workers and lower the `resyncInterval`. `resyncInterval` is the interval for the `GET` (status) operation in the agent.
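A sketch of that adjustment against the values quoted above; the numbers are illustrative and should be tuned against observed propeller CPU usage and queue depth:
```yaml
webApi:
  caching:
    maxSystemFailures: 10
    workers: 100        # more workers to drain the status-sync queue faster
    resyncInterval: 30s # poll each task's status (GET) more frequently
```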
a
Thanks, that sounds helpful! I'll try adjusting those further 👍 How does `pollInterval` differ from `resyncInterval`, and do these need to be aligned? We've also scaled the main propeller workers and a few other settings - these changes were made in the past and were working quite well for task pod scheduling, but not so well now that we're trying to run more tasks via connector:
```yaml
propeller:
  workers: 800
  gc-interval: 1h
  max-workflow-retries: 50
  workflow-reeval-duration: 30s
  downstream-eval-duration: 30s
  max-streak-length: 8
  kube-client-config:
    qps: 4000     # max rate of requests (queries per second) to kube-apiserver
    burst: 8000   # max burst rate
    timeout: 120s # timeout when talking to kube-apiserver
  event:
    rate: 10000
    capacity: 200000
    max-retries: 10
```
d
I forgot what `pollInterval` does - investigating
and please figure out the resource usage
a
Here are some stats on flytepropeller CPU/memory usage from the last 12 hours. We had a large spike in number of workflows running around 12pm, and have been consistently running around 10K workflows since then. Each workflow has at least 2-3 tasks that get sent to connectors to do the work.
d
You can forget about `pollInterval` - it's for watching supported agent task types.
a
Ah, great to know 👍
d
For resource usage, I want to know the percentage.
a
The blue line is the usage, the green line is the requested value, and the red line is the limit; so I think in terms of percentage it's the ratio between the green and blue lines. For CPU this has been ranging anywhere between 50-200%, and memory is around 10% (unless I've misunderstood the info you need... 😅)
d
Then I think you need a stronger instance for the CPU usage.
going to bed, will help you tmr
a
Thanks very much for the help 🙏 I'm also logging off for the day, but will leave things running with the higher number of `webApi` workers + lower `resyncInterval` and see if they help with the throughput overnight 🤞
Things are currently looking quite promising - since increasing the workers and lowering the `resyncInterval` at around 5:00pm yesterday, we haven't seen propeller completely grind to a halt or start hitting its CPU limits the way it did before making these changes 🤞 I want to do some more tests to see if we can improve connector task scheduling speed, but it's been a lot more stable today so far!
d
NICE!!!
Let me know if you need help
HAPPY FOR YOU
a
Thank you!! I'm going to run some tests to see if there's any further tweaking we might need to do - I'll let you know if we're still noticing any issues! 🙏