# flyte-support
r
Another question about scale. For a test we have ~70K workflows running. We've scaled up the propeller workers and other settings - see the `yml` at the end of this post. What I'm observing seems a bit strange. When Flyte first starts up it schedules a lot of pods - I see up to 10K as `Pending` - but once the unprocessed queue depth goes down the system schedules hardly any work. I would expect it to try to schedule tens of thousands of pods with 70K workflows running. The round-trip latency looks OK to my untrained eyes. We checked our k8s API and can't find any throttling going on there - but maybe we are looking at the wrong things. On the other hand, I also see some tasks that are marked as `Running` for over 1 hour with the status message `Sent to K8s...` but no pod log link is available yet (see image below as well).
inline:
    propeller:
      workers: 1800
      max-workflow-retries: 50
      kube-client-config:
        qps: 4000 # Refers to max rate of requests (queries per second) to kube-apiserver
        burst: 1200 # refers to max burst rate.
        timeout: 30s # Refers to timeout when talking with the kube-apiserver
      max-streak-length: 2
      event:
        rate: 10000
        capacity: 20000
        max-retries: 10
The behaviour is repeatable. If I restart the flyte-binary deployment, a few thousand tasks get scheduled, and when that is done Flyte seems to "only" keep ~250 pods running 🤔
a
@ripe-smartphone-56353 still trying to find something helpful for you. In the meantime, something that we missed mentioning in that docs section is that `burst >= qps`, so try changing those values in the config and see if it has any effect.
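For reference, a minimal sketch of that change in the flyte-binary inline config; the numbers are taken from the earlier snippet and are placeholders, not a tested recommendation:
propeller:
  kube-client-config:
    qps: 4000    # max sustained request rate to the kube-apiserver
    burst: 4000  # raised so that burst >= qps (the earlier snippet had burst: 1200 < qps: 4000)
    timeout: 30s # client-side timeout when talking to the kube-apiserver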
r
Thanks @average-finland-92144. I can give that a try. Does that also mean that the default values are not set correctly?
f
cc @high-park-82026 we know this problem, i think one of your configs needs to be updated
@ripe-smartphone-56353 can we jump on a call?
r
Hi, sure, happy to jump on a call. I'm in UTC+2 Amsterdam time.
> we know this problem, i think one of your configs needs to be updated
Extremely curious. I tried a bunch already, but they didn't seem to do much...
inline:
  plugins:
    catalogcache:
      reader:
        maxItems: 200000
        maxRetries: 3
        workers: 100
      writer:
        maxItems: 200000
        maxRetries: 3
        workers: 100
    workqueue:
      workers: 1200
      maxItems: 200000
      config:
        workers: 1200
        maxItems: 200000

propeller:
  queue:
    batch-size: -1
    batching-interval: 1s
    queue:
      base-delay: 0s
      capacity: 100000
      max-delay: 1m0s
      rate: 1000
      type: maxof
    sub-queue:
      base-delay: 0s
      capacity: 100000
      max-delay: 0s
      rate: 1000
      type: bucket
    type: batch
f
can you check the logs
are you seeing errors from kube api?
r
The logs don't show any errors from kube api. FWIW I've got loglevel set to 1. Could that hide those errors? Our kube API is deployed with an autoscaler as well so it should scale up with demand.
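If it helps while debugging: the flytestdlib logger can be made more verbose, which may surface client-side throttling and retry detail. A sketch, assuming your chart exposes the standard `logger` block and that the level maps onto logrus severities (0 = panic up to 6 = trace), in which case a level of 1 would indeed hide error messages:
logger:
  level: 5          # debug-level verbosity for the duration of the test; revert afterwards
  show-source: true # include the file/line that emitted each log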
h
It should show errors. I'm concerned you may be hitting backoff and that's causing propeller to just wait for a very long time... Would you want to pair program on this? I know it's getting late for you but happy to jump on a call
a
please add me to that call when it happens 🙂
r
FWIW - I've switched our deployment to `flyte-core` as it seems more stable. So we'll test some more tomorrow and I'll be back if we encounter other weird behaviour. Thanks.
f
@ripe-smartphone-56353 please let us know how it goes so we can understand what's happening.
r
We are just running our loadtest again and we are still seeing the same behaviour. If I restart flytepropeller it starts scheduling jobs but then it tapers off rapidly. I do have one basic question just to make sure I'm not missing something obvious:
1. In the performance-optimization docs the `plugins.workqueue.config.workers` and `plugins.workqueue.config.maxItems` settings are mentioned. Where are those actually configured in `flyte-core`? I'm using `core.propeller.workers` but can't really find where I would configure `plugins.workqueue.config.maxItems` other than just putting it under `core`. Even if I put them there they don't seem to do anything.
Some errors we are seeing:
1. flytepropeller - very often / constantly: `containerStatus IndexOutOfBound, requested [0], but total containerStatuses [0] in pod phase [Pending]`
2. flytepropeller - seldom: `Trace[177840877]: "DeltaFIFO Pop Process" ID:fc tile orchestrator next/lst amsr2 v1 0 1000 n040e000 6x6 2017 07 30 desc n6ieprsx,Depth:11,Reason:slow event handlers blocking the queue (11-Sep-2024 09:30:37.760) (total time: 151ms)`
3. Database: `db=flyteadmin,user=flyteadmin ERROR:  duplicate key value violates unique constraint "tags_pkey"` and `db=flyteadmin,user=flyteadmin ERROR:  duplicate key value violates unique constraint "datasets_pkey"`
Ok, one more update. I might have found a sweet spot in the config since the system was able to complete about 55K workflows in the last 90 minutes. But unfortunately it broke our Prometheus metrics system, so I can't really see what's going on right now...
f
You will need a few things: inject-finalizer
And properly sized propeller resources and the propeller cacheSizeMbs
I would use at least 4 cores and 16GB of memory, and a cache of around 8GB
r
I've got
k8s:
  inject-finalizer: true
  delete-resource-on-finalize: true
set. Propeller is scaled up to 12 cores and 28 GB of memory and has been running stably; the cache is at 5000 MB.
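For comparison, the sizing suggested above (at least 4 cores, 16GB, ~8GB cache) could look roughly like this in flyte-core Helm values, assuming the chart exposes `flytepropeller.resources` and a `cacheSizeMbs` value; verify the exact keys against your chart version:
flytepropeller:
  cacheSizeMbs: 8192   # ~8GB in-memory cache
  resources:
    requests:
      cpu: "4"         # at least 4 cores
      memory: 16Gi
    limits:
      cpu: "4"
      memory: 16Gi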
f
Wow, 12 cores - I don't think it will use them
At some point you may want to shard it
a
@ripe-smartphone-56353 this may be a mistake in the docs. The actual parameters for propeller's workQueue are under `propeller` (including `workers`). Specifically for the queue, these are the base settings. For flyte-core, they should be set on `configmap.core.propeller`.
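To make that concrete, a sketch of where those worker and queue settings sit in flyte-core Helm values; the numbers are copied from the earlier snippet purely for illustration:
configmap:
  core:
    propeller:
      workers: 1800          # propeller worker goroutines
      queue:                 # base workqueue settings
        type: batch
        batching-interval: 1s
        batch-size: -1
        queue:
          type: maxof
          rate: 1000
          capacity: 100000
          base-delay: 0s
          max-delay: 1m0s
        sub-queue:
          type: bucket
          rate: 1000
          capacity: 100000
          base-delay: 0s
          max-delay: 0s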
r
Thanks @average-finland-92144 that is very good to hear. It was driving me nuts trying to find this configuration and not seeing any effect in the system. Should have asked earlier.
a
update us if it makes any difference. For now I'll create the docs issue
h
@ripe-smartphone-56353 are you running on GKE?
r
@high-park-82026 Yes GKE
f
@ripe-smartphone-56353 was your test successful? Would love to know how things progressed.
r
> I might have found a sweet spot in the config since the system was able to complete about 55K workflows in the last 90 minutes.
Yeah it was. The observation above was after I figured out the bug in the flyte-core helm chart and enabled the caching. So we are running a longer loadtest today and should have results after the weekend. I'll let you know how that went.
First observations from the loadtest that is currently running. We are at about 85K workflows now and flyte is consistently running ~4K pods and seems quite stable. It seems to me that flyte prioritizes scheduling tasks for new workflows instead of scheduling tasks for already running workflows. We tried to tune that using these settings
workflow-reeval-duration: 60s
downstream-eval-duration: 10s
max-streak-length: 100
Is that the right approach to take? From the description here we were honestly not sure if we should increase or decrease `max-streak-length`. I also noticed that the metrics `flyte:propeller:all:workflow:completion_latency_ms` and `flyte:propeller:all:workflow:completion_latency_unlabeled_ms` don't show any values, so it's hard to keep track of workflow completions. Is there another way of seeing how many workflows complete per unit of time?
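One hedged option for tracking completions per unit of time: if `completion_latency_ms` is exported as a Prometheus summary or histogram, its `_count` series counts completed workflows and can be rated. A sketch of a recording rule; the rule name is made up, and the `_count` suffix is an assumption about how the metric is exposed:
groups:
  - name: flyte-loadtest
    rules:
      # Approximate workflow completions per second, averaged over 5 minutes.
      - record: flyte:workflow_completions:rate5m
        expr: sum(rate(flyte:propeller:all:workflow:completion_latency_ms_count[5m]))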
Another observation. At about 100K workflows we got `---"Objects listed" error:unexpected error when reading response body. Please retry. Original error: net/http: request canceled (Client.Timeout or context cancellation while reading body) 30013ms (11:05:48.756)`. After increasing `kube-client-config: timeout` that was fixed. Would that be improved by sharding propeller, or does it always need to get all the workflows in one API request?
> It seems to me that flyte prioritizes scheduling tasks for new workflows instead of scheduling tasks for already running workflows.
Or maybe, alternatively: is there a way to set a limit on how many workflows are in progress at any one time? Or is this not possible at all until this is implemented at the workflow level?
f
Yes, today you cannot prioritize and/or limit on one propeller. This is actually doable, just not a requirement; I like both ideas though. Sharding will help, as this is the exact problem it solves: it reduces each propeller's working set, so you won't get timeouts trying to watch everything.
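For reference, a sketch of FlytePropeller Manager sharding along the lines of the scale-out docs; treat the keys and shard count as a starting point to verify against your chart version, not a drop-in config:
configmap:
  core:
    manager:
      shard:
        type: Hash      # consistently hash workflows across shards
        shard-count: 4  # each propeller instance then watches only its share of the workflows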
Cc @high-park-82026 @ripe-smartphone-56353 this is hard to follow in a thread; let me pull it into a channel?