# flyte-support
@square-carpet-13590
hi team, any pointers or guidelines that will help us scale Flyte Propeller? Attaching the metrics below. The issue is that not all workflows are getting picked up by workers: more than 60-65% of workers are available, while workflow acceptance and node transition latencies are on the high side. FlytePropeller has been scaled to 3 shards and resource utilisation is low. Below is the propeller config as well. Thank you!
```yaml
core:
    propeller:
      rawoutput-prefix: will-be-replaced
      workers: 60
      gc-interval: 2h
      max-workflow-retries: 50
      workflow-reeval-duration: 7s
      downstream-eval-duration: 3s
      max-streak-length: 10
      kube-client-config:
        qps: 200
        burst: 50
        timeout: 30s
      queue:
        type: batch
        batching-interval: 2s
        batch-size: -1
        queue:
          type: maxof
          rate: 200
          capacity: 2000
          base-delay: 5s
          max-delay: 120s
        sub-queue:
          type: bucket
          rate: 100
          capacity: 1000
      workflowStore:
        policy: ResourceVersionCache
      storage:
        cache:
          max_size_mbs: 1024
          target_gc_percent: 60
```
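For reference, the "3 shards" mentioned above are typically configured through the FlytePropeller manager. A minimal sketch, assuming the hash shard strategy; key names follow the FlytePropeller manager config, but verify against your chart version:

```yaml
# Hypothetical manager sharding config; the strategy type and
# shard-count are illustrative, check the FlytePropeller manager
# docs for your version.
manager:
  shard:
    type: Hash       # consistent-hash workflows across shards
    shard-count: 3   # one FlytePropeller instance per shard
```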
@freezing-airport-6809
Not sure how we can help; something seems to be wrong. May I recommend Flyte support by Union, as we would need to take a deeper look.
@square-carpet-13590
cc @glamorous-rainbow-77959
@glamorous-rainbow-77959
@freezing-airport-6809 not asking for a complete solution, maybe just a direction we can dig into.
And we will definitely evaluate support services if you have them for custom Flyte deployments; maybe you or someone from the Union team can DM me the details.
@clean-glass-36808
The key metric for us is the unprocessed queue depth versus the worker count. From my experience (i.e., last week), we've seen the unprocessed queue depth increase while workers are available when FlytePropeller was getting CPU throttled. It only seems to happen under high load.
[Attachments: Screenshot 2025-04-04 at 10.00.45 AM.png, Screenshot 2025-04-04 at 10.00.08 AM.png]
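A hedged sketch of how one could watch for that correlation in Prometheus. The cAdvisor throttling metrics are standard; the Propeller free-workers metric name is an assumption based on its default metric scope, so verify it against your /metrics endpoint:

```yaml
# Illustrative Prometheus recording rules, not a tested config.
groups:
  - name: flytepropeller-saturation
    rules:
      # Fraction of CFS periods in which the propeller container was throttled.
      - record: flytepropeller:cpu_throttled_ratio
        expr: |
          rate(container_cpu_cfs_throttled_periods_total{pod=~"flytepropeller-.*"}[5m])
            /
          rate(container_cpu_cfs_periods_total{pod=~"flytepropeller-.*"}[5m])
      # Assumed Propeller gauge for idle workers; name may differ per deployment.
      - record: flytepropeller:free_workers
        expr: flyte:propeller:all:free_workers_count
```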
@square-carpet-13590
Thank you @clean-glass-36808 for the input, let me check in this direction
@freezing-airport-6809
Ohh, do you have very few CPUs allocated?
@glamorous-rainbow-77959
@freezing-airport-6809 you mean allocated to FlytePropeller? No, I think we have a decent amount. @square-carpet-13590, could you clarify?
@square-carpet-13590
@freezing-airport-6809 we have set request=1 CPU and limit=2 CPU per pod; there is some amount of throttling, but not a lot.
@freezing-airport-6809
1 CPU is small, but I don't like a 1-request/2-limit split, as it will cause throttling, as Jason said.
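A minimal sketch of that change, assuming a flyte-core style Helm values file; the exact key path and the memory figures are assumptions for your chart, and the CPU figure follows the "keep it at 2" advice below:

```yaml
# Hypothetical Helm values snippet: equal request and limit gives the
# pod Guaranteed QoS and avoids CFS throttling under bursts. Key path
# and memory values are assumptions, adjust to your deployment.
flytepropeller:
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "2"
      memory: 4Gi
```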
@square-carpet-13590
got it, will try with more CPUs. Thank you
@freezing-airport-6809
Keep it at 2.
But that's not the problem.
If things are not getting picked up, it has to be something else.
@square-carpet-13590
yes, could be. Maybe the kube-client-config; not sure the values we set above are sufficient.
Since throttling was low, I ruled out CPU as the issue.
@freezing-airport-6809
200 QPS should be ok depending on the load.
But you also have 3 propellers.
So that's 600 QPS.
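One detail worth double-checking in the config above: burst (50) is below qps (200), and client-go uses qps as the sustained rate with burst as the token-bucket size, so short spikes get capped below the nominal QPS. A sketch with burst above qps; the numbers are illustrative, not tuned recommendations:

```yaml
# Illustrative kube-client-config; values are assumptions to tune
# against your API server capacity. Typical guidance is burst >= qps,
# since burst bounds how many requests can fire at once.
kube-client-config:
  qps: 200
  burst: 300
  timeout: 30s
```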
@square-carpet-13590
okay