# flyte-support
b
Hello, recently our flyte cluster (`v1.12.0`) has been throwing some intermittent errors while starting tasks as our workload increases, where many pods start at the same time. This happens once every 1-2 days, which isn't that bad, but it still breaks our workflow and requires manual intervention on alerts. Retries don't seem to work, because the tasks are not even started.
```
Grace period [3m0s] exceeded|containers with unready status: [f6140adc65c3a4d47000-n1-0]|failed to reserve container name "f6140adc65c3a4d47000-n1-0...": name "f6140adc65c3a4d47000-n1-0..." is reserved for "189f5dbee3fbc4a26ff7a619fcea34ad..."
```
```
Grace period [0s] exceeded|containers with unready status: [f9fd2612618924083000-n2-0]|failed to sync secret cache: timed out waiting for the condition
```
So far, we have increased the grace period for `config.K8sPluginConfig`:
```
create-container-config-error-grace-period: 0s
create-container-error-grace-period: 3m0s
```
Besides upgrading the cluster resources for the load, are there other configs we can tweak to improve this or to have retries kick in? Thanks!
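For reference, a minimal sketch of where these settings should sit in the flyte-propeller configuration, assuming the standard layout where `K8sPluginConfig` is nested under `plugins.k8s` (the comments describe the behavior as I understand it; the two values are the ones from above):
```
plugins:
  k8s:
    # how long a pod may sit in CreateContainerConfigError before the task is failed
    create-container-config-error-grace-period: 0s
    # how long a pod may sit in CreateContainerError before the task is failed
    create-container-error-grace-period: 3m0s
```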
f
Cc @white-painting-22485 fyi (on GCP). @broad-train-34581 this is on GKE, right? The k8s cluster is having some problems starting pods. @freezing-boots-56761 do you remember this?
b
Yes gke
f
Cc @high-park-82026 also
@broad-train-34581 maybe we can search on Discuss; some setting in GKE solved this. We should remember what the cluster size and the pod creation rate were.
Also @brief-window-55364, do you remember?
f
yes i remember this. this is the relevant PR: https://github.com/flyteorg/flyte/commit/355d383d27dab5b8c11067cae9848d172ffad12f we had it set to 10m (bursting to ~10k pods, ~1.5k nodes). almost 3 years ago, but IIRC the issue is basically that the creation of the container by the runtime takes too long. flyte used to see this error and either mark the task as failed or recreate the pod, adding further to the load on the cluster/nodes. just waiting a bit to let it auto-resolve worked best for us. i’m a bit surprised this error STILL exists though lol
@broad-train-34581 try cranking it up to 10m. it resolves in around 5m most of the time IIRC, but I didn’t measure the p99
❤️ 1
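In config terms, the suggestion above would look roughly like this (a sketch against the same assumed `plugins.k8s` layout; `10m` is the value suggested above, the other key is unchanged from the original post):
```
plugins:
  k8s:
    create-container-config-error-grace-period: 0s
    # raised from 3m0s so pods stuck in CreateContainerError get up to
    # 10 minutes to recover before propeller fails the task
    create-container-error-grace-period: 10m
```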
f
lol gke, but I think image streaming may solve it
f
this is after the image is fetched and the container is started iirc. the name conflicts happen because the context deadline is reached while waiting for the container to come up and the runtime recreates the container, but the previous one is still coming up and didn’t actually fail 😬
it does eventually resolve on its own though. i think i had more explanation on the PR, but can’t find the actual PR because of the monorepo merge :(
@broad-train-34581 the `create-container-config-error-grace-period` also looks to be too short. that’s a different issue from the one I had
🙏 1
b
yes, after setting `create-container-error-grace-period` (previously 3m0s) to 10 mins, we have almost never seen the `failed to reserve container name` error anymore, for 3 days so far. today there was just one `failed to sync secret cache`, and I ended up increasing `create-container-config-error-grace-period` (previously 0s) as well
👍 1
f
Is this a gVisor error?
Seems like it's entirely a GCP problem.
@broad-train-34581 I would recommend you escalate it to your GCP rep
b
just checking here to see if i missed anything or if this is known
❤️ 1
f
We have run at 10x higher scale on AWS, and even on Azure, and never seen it
b
will do, thanks folks!
f
@broad-train-34581 please keep us in the loop
b
Yeah, I've never seen this either, and we run some heavy loads.
👍 1
f
we didn’t have gVisor on, just the custom GCP container runtime running on the COS images for us.
f
@broad-train-34581 shameless plug, but in Union we have a new system that can reuse containers, which dramatically speeds workflows up to run in milliseconds and doesn't create so many pods 😊. We would love to work with you
😂 1