# flyte-support
b
Hello, recently our flyte cluster (`v1.12.0`) has been throwing some intermittent errors while starting tasks as our workload increases, where many pods start at the same time. This happens once every 1-2 days, which isn't that bad, but it still breaks our workflow and requires manual intervention on alerts. Retries don't seem to work, because the tasks are not even started.
```
Grace period [3m0s] exceeded|containers with unready status: [f6140adc65c3a4d47000-n1-0]|failed to reserve container name "f6140adc65c3a4d47000-n1-0...": name "f6140adc65c3a4d47000-n1-0..." is reserved for "189f5dbee3fbc4a26ff7a619fcea34ad..."
```
```
Grace period [0s] exceeded|containers with unready status: [f9fd2612618924083000-n2-0]|failed to sync secret cache: timed out waiting for the condition
```
So far, we have increased the grace period for `config.K8sPluginConfig`:
```
create-container-config-error-grace-period: 0s
create-container-error-grace-period: 3m0s
```
Besides upgrading the cluster resources for the load, are there other configs we can tweak to improve this or to have retries kick in? Thanks!
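For reference, a minimal sketch of where these settings should sit in the flyte-propeller configuration, assuming the standard layout where `K8sPluginConfig` is nested under `plugins.k8s` (the comments describe the behavior as I understand it; the two values are the ones from above):
```
plugins:
  k8s:
    # how long a pod may sit in CreateContainerConfigError before the task is failed
    create-container-config-error-grace-period: 0s
    # how long a pod may sit in CreateContainerError before the task is failed
    create-container-error-grace-period: 3m0s
```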
f
Cc @white-painting-22485 fyi (on GCP). @broad-train-34581 this is on GKE, right? The k8s cluster is having some problems starting pods. @freezing-boots-56761 do you remember this?
b
Yes gke
f
Cc @high-park-82026 also
@broad-train-34581 maybe we can search on Discuss; some setting in GKE solved this. We should remember what the cluster size and the pod creation rate were.
Also @brief-window-55364, do you remember?
f
yes i remember this. this is the relevant PR: https://github.com/flyteorg/flyte/commit/355d383d27dab5b8c11067cae9848d172ffad12f we had it set to 10m (bursting to ~10k pods, ~1.5k nodes). almost 3 years ago, but IIRC the issue is basically that the creation of the container by the runtime takes too long. flyte used to see this error and either mark the task as failed or recreate the pod, adding further to the load on the cluster/nodes. just waiting a bit to let it auto-resolve worked best for us. i’m a bit surprised this error STILL exists though lol
@broad-train-34581 try cranking it up to 10m. it resolves in around 5m most of the time IIRC, but I didn’t measure the p99
❤️ 1
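In config terms, the suggestion above would look roughly like this (a sketch against the same assumed `plugins.k8s` layout; `10m` is the value suggested above, the other key is unchanged from the original post):
```
plugins:
  k8s:
    create-container-config-error-grace-period: 0s
    # raised from 3m0s so pods stuck in CreateContainerError get up to
    # 10 minutes to recover before propeller fails the task
    create-container-error-grace-period: 10m
```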
f
lol gke, but I think image streaming may solve it
f
this is after the image is fetched and the container is started iirc. the name conflicts happen because the context deadline is reached while waiting for the container to come up and the runtime recreates the container, but the previous one is still coming up and didn’t actually fail 😬
it does eventually resolve on its own though. i think i had more explanation on the PR, but can’t find the actual PR because of the monorepo merge :(
@broad-train-34581 the `create-container-config-error-grace-period` also looks to be too short. that’s a different issue from the one I had
🙏 1
b
yes, after setting `create-container-error-grace-period` (previously 3m0s) to 10 mins, we have almost never seen the `failed to reserve container name` error anymore, for 3 days so far. today there was just one `failed to sync secret cache`, and I ended up increasing `create-container-config-error-grace-period` (previously 0s) as well
👍 1
f
Is this a gVisor error?
Seems like it's entirely a GCP problem.
@broad-train-34581 I would recommend you escalate it to your GCP rep
b
just checking here to see if i missed anything or if this is known
❤️ 1
f
We have run at 10x higher scale on AWS, and even on Azure, and never seen it
b
will do, thanks folks!
f
@broad-train-34581 please keep us in the loop
b
Yeah, I've never seen this either, and we run some heavy loads.
👍 1
f
we didn’t have gVisor on, just the custom GCP container runtime running on the COS images for us.
f
@broad-train-34581 shameless plug, but in Union we have a new system that can reuse containers, which dramatically speeds workflows up to run in milliseconds and doesn't create so many pods 😊. We would love to work with you
😂 1