broad-train-34581
07/12/2024, 9:13 AM
Our deployment (v1.12.0) has been throwing some intermittent errors while starting tasks as our workload increases and many pods start at the same time. This happens once every 1-2 days, which isn't that bad, but it still breaks our workflow and requires manual intervention on alerts. Retries don't seem to work, because the tasks are not even started.
Grace period [3m0s] exceeded|containers with unready status: [f6140adc65c3a4d47000-n1-0]|failed to reserve container name "f6140adc65c3a4d47000-n1-0...": name "f6140adc65c3a4d47000-n1-0..." is reserved for "189f5dbee3fbc4a26ff7a619fcea34ad..."
Grace period [0s] exceeded|containers with unready status: [f9fd2612618924083000-n2-0]|failed to sync secret cache: timed out waiting for the condition
So far, we have increased the grace period in config.K8sPluginConfig:
create-container-config-error-grace-period: 0s
create-container-error-grace-period: 3m0s
Besides upgrading the cluster resources for the load, are there other configs we can tweak to improve this or to have retries kick in? Thanks!
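For reference, the two knobs above are part of FlytePropeller's K8s plugin configuration (config.K8sPluginConfig). A minimal sketch of where they would sit in the propeller configmap, assuming the standard plugins/k8s layout and reusing the values from this thread:

    plugins:
      k8s:
        # how long a pod stuck in CreateContainerError is tolerated before the task is failed
        create-container-error-grace-period: 3m0s
        # how long a pod stuck in CreateContainerConfigError is tolerated before the task is failed
        create-container-config-error-grace-period: 0s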
broad-train-34581
07/12/2024, 2:20 PM
With create-container-error-grace-period: 3m0s, we have almost never seen the failed to reserve container name error anymore, for 3 days so far.
Today there was just one failed to sync secret cache, and I ended up increasing create-container-config-error-grace-period as well (it was 0s).
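A side note on the secret errors: failed to sync secret cache: timed out waiting for the condition is typically surfaced by kubelet as a CreateContainerConfigError, which is why the Grace period [0s] exceeded in the second log lines up with create-container-config-error-grace-period. A minimal sketch of raising it, assuming the same propeller layout as above; the 1m0s value is purely illustrative:

    plugins:
      k8s:
        create-container-error-grace-period: 3m0s
        # illustrative value; give kubelet time to populate the secret cache under heavy pod churn
        create-container-config-error-grace-period: 1m0s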