# flyte-support
r
Another question about scale. For a test we have ~70K workflows running. We've scaled up the propeller workers and other settings - see the `yml` at the end of this post. What I'm observing seems a bit strange. When Flyte first starts up it schedules a lot of pods - I see up to 10K as `Pending` - but once the unprocessed queue depth goes down the system schedules hardly any work. I would expect it to try to schedule tens of thousands of pods with 70K workflows running. The round-trip latency looks OK to my untrained eyes. We checked our k8s API and can't find any throttling going on there - but maybe we are looking at the wrong things. On the other hand, I also see some tasks that are marked as `Running` for over 1 hour with the status message `Sent to K8s...` but no pod log link is available yet (see image below as well).
inline:
    propeller:
      workers: 1800
      max-workflow-retries: 50
      kube-client-config:
        qps: 4000 # Refers to max rate of requests (queries per second) to kube-apiserver
        burst: 1200 # refers to max burst rate.
        timeout: 30s # Refers to timeout when talking with the kube-apiserver
      max-streak-length: 2
      event:
        rate: 10000
        capacity: 20000
        max-retries: 10
The behaviour is repeatable. If I restart the flyte-binary deployment, a few thousand tasks get scheduled, and when that is done Flyte seems to "only" keep ~250 pods running 🤔
a
@ripe-smartphone-56353 still trying to find something helpful for you. In the meantime, something that we missed mentioning in that docs section is that `burst >= qps`, so try changing those values in the config and see if it has any effect.
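For reference, a minimal sketch of that change in the flyte-binary inline config; the numbers are taken from the earlier snippet and are placeholders, not a tested recommendation:
propeller:
  kube-client-config:
    qps: 4000    # max sustained request rate to the kube-apiserver
    burst: 4000  # raised so that burst >= qps (the earlier snippet had burst: 1200 < qps: 4000)
    timeout: 30s # client-side timeout when talking to the kube-apiserver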
r
Thanks @average-finland-92144. I can give that a try. Does that also mean that the default values are not set correctly?
f
cc @high-park-82026 we know this problem, i think one of your configs needs to be updated
@ripe-smartphone-56353 can we jump on a call?
r
Hi, sure, happy to jump on a call. I'm in UTC+2 Amsterdam time.
> we know this problem, i think one of your configs needs to be updated
Extremely curious. I tried a bunch already, but they didn't seem to do much...
inline:
  plugins:
    catalogcache:
      reader:
        maxItems: 200000
        maxRetries: 3
        workers: 100
      writer:
        maxItems: 200000
        maxRetries: 3
        workers: 100
    workqueue:
      workers: 1200
      maxItems: 200000
      config:
        workers: 1200
        maxItems: 200000

propeller:
  queue:
    batch-size: -1
    batching-interval: 1s
    queue:
      base-delay: 0s
      capacity: 100000
      max-delay: 1m0s
      rate: 1000
      type: maxof
    sub-queue:
      base-delay: 0s
      capacity: 100000
      max-delay: 0s
      rate: 1000
      type: bucket
    type: batch
f
can you check the logs
are you seeing errors from kube api?
r
The logs don't show any errors from kube api. FWIW I've got loglevel set to 1. Could that hide those errors? Our kube API is deployed with an autoscaler as well so it should scale up with demand.
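If it helps while debugging: the flytestdlib logger can be made more verbose, which may surface client-side throttling and retry detail. A sketch, assuming your chart exposes the standard `logger` block and that the level maps onto logrus severities (0 = panic up to 6 = trace), in which case a level of 1 would indeed hide error messages:
logger:
  level: 5          # debug-level verbosity for the duration of the test; revert afterwards
  show-source: true # include the file/line that emitted each log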
h
It should show errors. I'm concerned you may be hitting backoff and that's causing propeller to just wait for a very long time... Would you want to pair program on this? I know it's getting late for you but happy to jump on a call
a
please add me to that call when it happens 🙂
r
FWIW - I've switched our deployment to `flyte-core` as it seems more stable. So we'll test some more tomorrow and I'll be back if we encounter other weird behaviour. Thanks.
f
@ripe-smartphone-56353 please let us know how it goes so we can understand what's happening.
r
We are just running our loadtest again and we are still seeing the same behaviour. If I restart flytepropeller it starts scheduling jobs but then it tapers off rapidly. I do have one basic question just to make sure I'm not missing something obvious:
1. In the performance-optimization docs the `plugins.workqueue.config.workers` and `plugins.workqueue.config.maxItems` settings are mentioned. Where are those actually configured in `flyte-core`? I'm using `core.propeller.workers` but can't really find where I would configure `plugins.workqueue.config.maxItems` other than just putting it under `core`. Even if I put them there they don't seem to do anything.
Some errors we are seeing:
1. flytepropeller - very often / constantly: `containerStatus IndexOutOfBound, requested [0], but total containerStatuses [0] in pod phase [Pending]`
2. flytepropeller - seldom: `Trace[177840877]: "DeltaFIFO Pop Process" ID:fc tile orchestrator next/lst amsr2 v1 0 1000 n040e000 6x6 2017 07 30 desc n6ieprsx,Depth:11,Reason:slow event handlers blocking the queue (11-Sep-2024 09:30:37.760) (total time: 151ms)`
3. Database: `db=flyteadmin,user=flyteadmin ERROR:  duplicate key value violates unique constraint "tags_pkey"` and `db=flyteadmin,user=flyteadmin ERROR:  duplicate key value violates unique constraint "datasets_pkey"`
Ok, one more update. I might have found a sweet spot in the config since the system was able to complete about 55K workflows in the last 90 minutes. But unfortunately it broke our Prometheus metrics system, so I can't really see what's going on right now...
f
You will need a few things: inject-finalizer
And properly sized propeller resources and the propeller cacheSizeMbs
I would use at least 4 cores and 16GB of memory, and a cache of around 8GB
r
I've got
k8s:
  inject-finalizer: true
  delete-resource-on-finalize: true
set. Propeller is scaled up to 12 cores and 28 GB of memory and has been running stably; the cache is at 5000 MB.
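For comparison, the sizing suggested above (at least 4 cores, 16GB, ~8GB cache) could look roughly like this in flyte-core Helm values, assuming the chart exposes `flytepropeller.resources` and a `cacheSizeMbs` value; verify the exact keys against your chart version:
flytepropeller:
  cacheSizeMbs: 8192   # ~8GB in-memory cache
  resources:
    requests:
      cpu: "4"         # at least 4 cores
      memory: 16Gi
    limits:
      cpu: "4"
      memory: 16Gi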
f
Wow, 12 cores - I don't think it will use them
At some point you may want to shard it
a
@ripe-smartphone-56353 this may be a mistake in the docs. The actual parameters for propeller's workQueue are under `propeller` (including `workers`). Specifically for the queue, these are the base settings. For flyte-core, they should be set on `configmap.core.propeller`.
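To make that concrete, a sketch of where those worker and queue settings sit in flyte-core Helm values; the numbers are copied from the earlier snippet purely for illustration:
configmap:
  core:
    propeller:
      workers: 1800          # propeller worker goroutines
      queue:                 # base workqueue settings
        type: batch
        batching-interval: 1s
        batch-size: -1
        queue:
          type: maxof
          rate: 1000
          capacity: 100000
          base-delay: 0s
          max-delay: 1m0s
        sub-queue:
          type: bucket
          rate: 1000
          capacity: 100000
          base-delay: 0s
          max-delay: 0s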
r
Thanks @average-finland-92144 that is very good to hear. It was driving me nuts trying to find this configuration and not seeing any effect in the system. Should have asked earlier.
a
update us if it makes any difference. For now I'll create the docs issue
h
@ripe-smartphone-56353 are you running on GKE?
r
@high-park-82026 Yes GKE
f
@ripe-smartphone-56353 was your test successful? Would love to know how things progressed.
r
> I might have found a sweet spot in the config since the system was able to complete about 55K workflows in the last 90 minutes.
Yeah it was. The observation above was after I figured out the bug in the flyte-core helm chart and enabled the caching. So we are running a longer loadtest today and should have results after the weekend. I'll let you know how that went.
First observations from the loadtest that is currently running. We are at about 85K workflows now and flyte is consistently running ~4K pods and seems quite stable. It seems to me that flyte prioritizes scheduling tasks for new workflows instead of scheduling tasks for already running workflows. We tried to tune that using these settings
workflow-reeval-duration: 60s
downstream-eval-duration: 10s
max-streak-length: 100
Is that the right approach to take? From the description here we were honestly not sure if we should increase or decrease `max-streak-length`. I also noticed that the metrics `flyte:propeller:all:workflow:completion_latency_ms` and `flyte:propeller:all:workflow:completion_latency_unlabeled_ms` don't show any values, so it's hard to keep track of workflow completions. Is there another way of seeing how many workflows complete per unit of time?
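One hedged option for tracking completions per unit of time: if `completion_latency_ms` is exported as a Prometheus summary or histogram, its `_count` series counts completed workflows and can be rated. A sketch of a recording rule; the rule name is made up, and the `_count` suffix is an assumption about how the metric is exposed:
groups:
  - name: flyte-loadtest
    rules:
      # Approximate workflow completions per second, averaged over 5 minutes.
      - record: flyte:workflow_completions:rate5m
        expr: sum(rate(flyte:propeller:all:workflow:completion_latency_ms_count[5m]))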
Another observation. At about 100K workflows we got `---"Objects listed" error:unexpected error when reading response body. Please retry. Original error: net/http: request canceled (Client.Timeout or context cancellation while reading body) 30013ms (11:05:48.756)`. After increasing `kube-client-config: timeout` that was fixed. Would that be improved by sharding propeller, or does it always need to get all the workflows in one API request?
> It seems to me that flyte prioritizes scheduling tasks for new workflows instead of scheduling tasks for already running workflows.
Or maybe, alternatively: is there a way to set a limit on how many workflows are in progress at any one time? Or is this not possible at all until this is implemented at the workflow level?
f
Yes, today you cannot prioritize and/or limit on one propeller. This is actually doable, just not a requirement; I like both ideas though. Sharding will help, as this is the exact problem it solves: it reduces each propeller's working set, so you won't get timeouts trying to watch everything.
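For reference, a sketch of FlytePropeller Manager sharding along the lines of the scale-out docs; treat the keys and shard count as a starting point to verify against your chart version, not a drop-in config:
configmap:
  core:
    manager:
      shard:
        type: Hash      # consistently hash workflows across shards
        shard-count: 4  # each propeller instance then watches only its share of the workflows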
Cc @high-park-82026 @ripe-smartphone-56353 this is hard to follow in a thread; let me pull it into a channel?