hi team, has anyone faced the project-quota object...
# flyte-support
s
Hi team, has anyone faced the project-quota object modification issue? It looks like a race condition: the resource quota is being updated in k8s, and propeller rejects the node config because its resource version no longer matches the one the node was created with. Thank you!
{"json":{"exec_id":"a96pdwqvmh4gpb4cw8cz","node":"n14","ns":"flyte-pai-staging","res_ver":"620035405","routine":"worker-15","wf":"flyte-pai:staging:optimization_engine_workflows.wf.main_wf"},"level":"error","msg":"failed Execute for node. Error: failed at Node[n14]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [Conflict] failed to create resource, caused by: Operation cannot be fulfilled on resourcequotas \"project-quota\": the object has been modified; please apply your changes to the latest version and try again","ts":"2025-06-11T07:57:26Z"}
c
This looks similar to errors we have seen before when etcd is under load: propeller reads stale workflow state and tries to re-process old workflow data. I'm not familiar with exactly how resource quotas work, though.
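For illustration, the "object has been modified" message in the log is the usual symptom of optimistic concurrency: a write that carries an older resource version than the one currently stored is rejected. The sketch below is a toy in-memory model of that behaviour, not Flyte or Kubernetes code; `Store`, `Conflict`, and `update_with_retry` are hypothetical names made up for this example.

```python
class Conflict(Exception):
    """Raised when an update carries a stale resource version."""


class Store:
    """Toy in-memory stand-in for an API server with optimistic concurrency."""

    def __init__(self):
        self._objects = {}  # name -> (resource_version, spec)

    def get(self, name):
        return self._objects[name]

    def create(self, name, spec):
        self._objects[name] = (1, spec)

    def update(self, name, seen_version, spec):
        current_version, _ = self._objects[name]
        if seen_version != current_version:
            # Someone else wrote in between: reject instead of overwriting.
            raise Conflict(f"{name}: the object has been modified")
        self._objects[name] = (current_version + 1, spec)


def update_with_retry(store, name, mutate, attempts=3):
    """Re-read the latest version and reapply the change after each conflict."""
    for _ in range(attempts):
        version, spec = store.get(name)
        try:
            store.update(name, version, mutate(spec))
            return True
        except Conflict:
            continue
    return False
```

A writer that holds on to an old version gets `Conflict`, while one that re-reads before writing goes through. That is the same shape as propeller acting on stale state.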
s
Oh, that makes sense @clean-glass-36808, thank you!
@clean-glass-36808 did you manage to solve the etcd load, and if so, how? There are some guidelines here for offloading static info: https://www.union.ai/docs/flyte/deployment/flyte-configuration/performance/#improving-etcd-performance
c
Yeah, I have a custom resource cache we will be testing soon, but ultimately it’s still an issue that there is so much lag with the informer. We haven’t offloaded the static info yet but we likely will. We expect that to just buy us some more time before we scale out the etcd usage with multiple data plane clusters.
I can probably share the resource version cache code I wrote next week if you’re interested in getting rid of the errors at least.
I have an implementation that uses CPU time and the pod id to avoid reprocessing.
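As a rough idea of what a resource version cache can look like (this is not the implementation mentioned above, which was not shared in the thread), here is a minimal sketch. It remembers the resource version already processed for each execution and skips the work if a lagging informer hands back that same version. `ResourceVersionCache` and `handle` are hypothetical names, and versions are compared only for equality, since Kubernetes documents resource versions as opaque to clients.

```python
class ResourceVersionCache:
    """Remembers, per execution, the resource version that was already processed."""

    def __init__(self):
        self._processed = {}  # exec_id -> resource version already acted on

    def mark_processed(self, exec_id, res_ver):
        self._processed[exec_id] = res_ver

    def is_stale(self, exec_id, res_ver):
        # Equality only: resource versions are treated as opaque strings.
        return self._processed.get(exec_id) == res_ver


def handle(cache, exec_id, res_ver, process):
    """Run process() unless this exact version was already handled."""
    if cache.is_stale(exec_id, res_ver):
        return "skipped"
    process()
    cache.mark_processed(exec_id, res_ver)
    return "processed"
```

Calling `handle` twice with the same `exec_id` and `res_ver` runs `process` only once. A newer version from the informer is processed normally.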
s
Oh, that's nice!
Also, how do the other metrics look for your deployment when running ~450 concurrent tasks, i.e. round latency and workflow latencies? For us, under load the etcd latency usually increases along with some write failures, which results in a high node event recording error rate.
c
I won’t have access to my work laptop until next week as I’m on vacation but I can let you know when I get back.
s
alright, Thank you @clean-glass-36808