hi team has anyone faced the project quota object modificati Flyte #flyte-support


square-carpet-13590

06/12/2025, 2:25 PM

hi team, has anyone faced the project-quota object modification issue? It looks like a race condition: the resource quota is being updated in k8s, and propeller rejects the node config because the version is not the same as the one the node was created with. Thank you


{"json":{"exec_id":"a96pdwqvmh4gpb4cw8cz","node":"n14","ns":"flyte-pai-staging","res_ver":"620035405","routine":"worker-15","wf":"flyte-pai:staging:optimization_engine_workflows.wf.main_wf"},"level":"error","msg":"failed Execute for node. Error: failed at Node[n14]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [Conflict] failed to create resource, caused by: Operation cannot be fulfilled on resourcequotas \"project-quota\": the object has been modified; please apply your changes to the latest version and try again","ts":"2025-06-11T07:57:26Z"}
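
For readers unfamiliar with the error text in the log above: "the object has been modified; please apply your changes to the latest version and try again" is the message Kubernetes returns when a write carries an outdated resourceVersion (optimistic concurrency, HTTP 409 Conflict). The sketch below imitates that check and the usual read-modify-write retry loop with an in-memory stand-in. `FakeStore`, `update_with_retry`, and the `used_pods` field are hypothetical names made up for illustration; nothing here reflects how propeller or the quota admission path is actually implemented.

```python
class ConflictError(Exception):
    """Raised when a write carries an outdated resource version."""


class FakeStore:
    """Hypothetical in-memory stand-in for an API server holding one object."""

    def __init__(self, spec):
        self._version = 1
        self._spec = dict(spec)

    def get(self):
        # Return a copy of the object together with its current version.
        return {"resource_version": self._version, "spec": dict(self._spec)}

    def update(self, obj):
        # Reject the write if the caller read an older version.
        if obj["resource_version"] != self._version:
            raise ConflictError("the object has been modified")
        self._spec = dict(obj["spec"])
        self._version += 1
        return self.get()


def update_with_retry(store, mutate, max_attempts=5):
    """Re-read the object, re-apply the change, and retry on conflict."""
    for attempt in range(1, max_attempts + 1):
        obj = store.get()
        mutate(obj["spec"])
        try:
            return store.update(obj), attempt
        except ConflictError:
            continue
    raise ConflictError(f"still conflicting after {max_attempts} attempts")


store = FakeStore({"used_pods": 0})
calls = {"n": 0}


def add_pod(spec):
    calls["n"] += 1
    if calls["n"] == 1:
        # Simulate another client updating the object between our read and write.
        other = store.get()
        other["spec"]["used_pods"] += 1
        store.update(other)
    spec["used_pods"] += 1


result, attempts = update_with_retry(store, add_pod)
print(result["spec"]["used_pods"], attempts)
```

The first attempt fails because the simulated second writer bumps the version in between; the second attempt re-reads and succeeds. A retry loop like this only helps when the conflict is transient. If the reader keeps being served stale state, each retry may start from the same outdated version.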

clean-glass-36808

06/12/2025, 5:35 PM

This looks similar to errors we have seen before when etcd is under load and propeller is reading stale workflow state and trying to re-process old workflow data. I'm not familiar with how resource quotas work exactly though.

square-carpet-13590

06/13/2025, 8:13 AM

oh that makes sense @clean-glass-36808, Thank you !

square-carpet-13590

06/13/2025, 8:31 AM

@clean-glass-36808 did you manage to solve the etcd load, and if so how? There are some guidelines here on offloading static info: https://www.union.ai/docs/flyte/deployment/flyte-configuration/performance/#improving-etcd-performance

square-carpet-13590

06/13/2025, 1:21 PM

or this thread you started ? https://flyte-org.slack.com/archives/CP2HDHKE1/p1749005130684229

clean-glass-36808

06/13/2025, 1:38 PM

Yeah, I have a custom resource cache we will be testing soon, but ultimately it’s still an issue that there is so much lag with the informer. We haven’t offloaded the static info but we likely will. We expect that to just buy us some more time before we scale out the etcd usage with multiple data plane clusters.

clean-glass-36808

06/13/2025, 1:39 PM

I can probably share the resource version cache code I wrote next week if you’re interested in getting rid of the errors at least.
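
The code mentioned above was not shared in this thread, so the following is only a guess at the general idea behind a resource version cache, not the author's implementation. It remembers the version that the last successful write replaced, and refuses to process an object if a lagging cache hands that same version back. `ResourceVersionCache`, `StaleObjectError`, and the version strings `"100"` and `"101"` are hypothetical; the key reuses the namespace and execution ID from the log earlier in the thread. Kubernetes documents resource versions as opaque strings, so the sketch only tests equality and never orders them.

```python
class StaleObjectError(Exception):
    """Raised when a read returns a version that was already overwritten."""


class ResourceVersionCache:
    """Remembers, per object key, the version replaced by the last write."""

    def __init__(self):
        self._superseded = {}

    def record_update(self, key, old_version):
        # Call after a successful write that replaced old_version.
        self._superseded[key] = old_version

    def check(self, key, observed_version):
        # Versions are opaque strings, so only equality is tested.
        if self._superseded.get(key) == observed_version:
            raise StaleObjectError(
                f"{key} is still at superseded version {observed_version}"
            )
        return observed_version

    def forget(self, key):
        # Drop the entry once the object is finished or deleted.
        self._superseded.pop(key, None)


cache = ResourceVersionCache()
key = "flyte-pai-staging/a96pdwqvmh4gpb4cw8cz"

cache.check(key, "100")          # first sight of the object: process it
cache.record_update(key, "100")  # our write replaced version "100"

try:
    cache.check(key, "100")      # a lagging cache still returns "100"
    skipped = False
except StaleObjectError:
    skipped = True

fresh = cache.check(key, "101")  # the new version has arrived: process it
print(skipped, fresh)
```

This only catches the immediately preceding version. A cache lagging by more than one write would need a longer history per key, at the cost of more memory.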

clean-glass-36808

06/13/2025, 1:45 PM

I have an implementation that uses CPU time and the pod id to avoid reprocessing.

square-carpet-13590

06/13/2025, 1:47 PM

oh that's nice!

square-carpet-13590

06/13/2025, 1:50 PM

also, how do other metrics look for your deployment when running ~450 concurrent tasks, i.e. round latency and workflow latencies? For us, under load the etcd latency usually increases along with some write failures, which results in a high rate of node event recording errors.

clean-glass-36808

06/13/2025, 1:51 PM

I won’t have access to my work laptop until next week as I’m on vacation but I can let you know when I get back.

square-carpet-13590

06/13/2025, 1:52 PM

alright, Thank you @clean-glass-36808
