ripe-smartphone-56353
09/04/2024, 1:26 PM
yml config at the end of this post.
What I'm observing seems a bit strange. When Flyte first starts up it schedules a lot of pods (I see up to 10K pending), but once the unprocessed queue depth goes down the system schedules hardly any work. I would expect it to try to schedule tens of thousands of pods with 70K workflows running. The round-trip latency looks OK to my untrained eyes. We checked our k8s API and can't find any throttling going on there, but maybe we are looking at the wrong things.
On the other hand, I also see some tasks that are marked as Running for over 1 hour with the status message "Sent to K8s...", but no pod log link is available yet (see image below as well).
inline:
  propeller:
    workers: 1800
    max-workflow-retries: 50
    kube-client-config:
      qps: 4000 # Refers to max rate of requests (queries per second) to kube-apiserver
      burst: 1200 # Refers to max burst rate.
      timeout: 30s # Refers to timeout when talking with the kube-apiserver
    max-streak-length: 2
    event:
      rate: 10000
      capacity: 20000
      max-retries: 10
ripe-smartphone-56353
09/04/2024, 2:21 PM
average-finland-92144
09/04/2024, 5:13 PM
burst >= qps, so try changing those values in the config and see if it has any effect.
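A minimal sketch of the suggested change, applied to the kube-client-config block from the first message; raising burst to 4000 is an illustrative assumption (any value >= qps would satisfy the suggestion), not a number given in the thread:

propeller:
  kube-client-config:
    qps: 4000
    burst: 4000   # assumption: raised from 1200 so that burst >= qps
    timeout: 30s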
ripe-smartphone-56353
09/04/2024, 6:42 PM
freezing-airport-6809
freezing-airport-6809
ripe-smartphone-56353
09/05/2024, 6:54 AM
ripe-smartphone-56353
09/05/2024, 7:42 AM
inline:
  plugins:
    catalogcache:
      reader:
        maxItems: 200000
        maxRetries: 3
        workers: 100
      writer:
        maxItems: 200000
        maxRetries: 3
        workers: 100
    workqueue:
      workers: 1200
      maxItems: 200000
      config:
        workers: 1200
        maxItems: 200000
  propeller:
    queue:
      batch-size: -1
      batching-interval: 1s
      queue:
        base-delay: 0s
        capacity: 100000
        max-delay: 1m0s
        rate: 1000
        type: maxof
      sub-queue:
        base-delay: 0s
        capacity: 100000
        max-delay: 0s
        rate: 1000
        type: bucket
      type: batch
freezing-airport-6809
freezing-airport-6809
ripe-smartphone-56353
09/06/2024, 7:06 AM
high-park-82026
average-finland-92144
09/06/2024, 5:07 PM
ripe-smartphone-56353
09/10/2024, 1:45 PM
flyte-core; at the moment it seems more stable. So we'll test some more tomorrow and I'll be back if we encounter other weird behaviour. Thanks.
freezing-airport-6809
ripe-smartphone-56353
09/11/2024, 12:47 PM
`plugins.workqueue.config.workers` and `plugins.workqueue.config.maxItems` are mentioned. Where are those actually configured in flyte-core? I'm using `core.propeller.workers` but can't really find where I would configure `plugins.workqueue.config.maxItems` other than just putting it under `core`. Even if I put them there, they don't seem to do anything.
Some errors we are seeing:
1. flytepropeller - very often / constantly: `containerStatus IndexOutOfBound, requested [0], but total containerStatuses [0] in pod phase [Pending]`
2. flytepropeller - seldom: `Trace[177840877]: "DeltaFIFO Pop Process" ID:fc tile orchestrator next/lst amsr2 v1 0 1000 n040e000 6x6 2017 07 30 desc n6ieprsx,Depth:11,Reason:slow event handlers blocking the queue (11-Sep-2024 093037.760) (total time: 151ms)`
3. Database: `db=flyteadmin,user=flyteadmin ERROR: duplicate key value violates unique constraint "tags_pkey"` and `db=flyteadmin,user=flyteadmin ERROR: duplicate key value violates unique constraint "datasets_pkey"`
ripe-smartphone-56353
09/11/2024, 1:49 PM
freezing-airport-6809
freezing-airport-6809
freezing-airport-6809
ripe-smartphone-56353
09/11/2024, 1:51 PM
k8s:
  inject-finalizer: true
  delete-resource-on-finalize: true
is set.
Propeller is scaled up to 12 cores and 28 GB of memory and has been running stably; the cache is at 5000 MB.
freezing-airport-6809
freezing-airport-6809
average-finland-92144
09/11/2024, 3:10 PM
propeller (including workers)
Specifically for the queue, these are the base settings.
For flyte-core, they should be set on configmap.core.propeller.
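A minimal sketch of that placement in the flyte-core Helm values, reusing numbers already shown in this thread; the values are illustrative, not recommendations:

configmap:
  core:
    propeller:
      workers: 1800            # propeller settings, including workers, go here
      queue:                   # the queue block also sits under propeller
        type: batch
        batch-size: -1
        batching-interval: 1s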
ripe-smartphone-56353
09/11/2024, 3:15 PM
average-finland-92144
09/11/2024, 3:47 PM
high-park-82026
ripe-smartphone-56353
09/11/2024, 5:42 PM
freezing-airport-6809
ripe-smartphone-56353
09/13/2024, 7:27 AM
> I might have found a sweet spot in the config since the system was able to complete about 55K workflows in the last 90 minutes.
Yeah, it was. The observation above was after I figured out the bug in the flyte-core helm chart and enabled the caching. So we are running a longer load test today and should have results after the weekend. I'll let you know how that went.
ripe-smartphone-56353
09/13/2024, 9:46 AM
workflow-reeval-duration: 60s
downstream-eval-duration: 10s
max-streak-length: 100
Is that the right approach to take? From the description here we were honestly not sure if we should increase or decrease max-streak-length.
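A minimal sketch of where these keys are assumed to sit, following the nesting of the propeller block shown earlier in the thread (values copied from the message above):

propeller:
  workflow-reeval-duration: 60s
  downstream-eval-duration: 10s
  max-streak-length: 100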
I also noticed that the metrics flyte:propeller:all:workflow:completion_latency_ms and flyte:propeller:all:workflow:completion_latency_unlabeled_ms don't show any values, so it's hard to keep track of workflow completions. Is there another way of seeing how many workflows complete per time unit?
ripe-smartphone-56353
09/13/2024, 11:28 AM
---"Objects listed" error:unexpected error when reading response body. Please retry. Original error: net/http: request canceled (Client.Timeout or context cancellation while reading body) 30013ms (11:05:48.756)
After increasing kube-client-config: timeout, that was fixed. Would that be improved by sharding propeller, or does it always need to get all the workflows in one API request?
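A minimal sketch of the timeout change described above, based on the kube-client-config block shared earlier; 60s is an assumed illustrative value, since the thread does not state the new timeout:

propeller:
  kube-client-config:
    timeout: 60s   # assumption: raised from the original 30s to avoid the list timeout above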
ripe-smartphone-56353
09/13/2024, 1:42 PM
freezing-airport-6809
freezing-airport-6809