We are seeing k8s informer lag with 450 concurrent tasks/nodes. As a result, we are (re)processing many stale workflow CRs, to the point where we're seeing >10k etcd conflicts an hour (10% of etcd writes). The resource version cache isn't doing a great job of detecting stale workflows, especially, it seems, when propeller goes on a streak and writes a bunch of updates in a row.
This might be a silly question, but wouldn't it make sense to write a monotonically increasing number (e.g. CPU time) plus a node ID into the workflow, so the resource version cache can make better comparisons as to whether a workflow CR is stale or not?
Note: this assumes that the same propeller node keeps processing the same Flyte workflow, but I think those implementation details can be worked out.
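To make the idea concrete, here is a rough sketch of the comparison I have in mind. None of this is existing propeller code: `WriteStamp`, `StampCache`, `RecordWrite`, and `IsStale` are hypothetical names, and I'm using a simple per-node sequence counter in place of CPU time. It also assumes the counter never goes backwards for a given node ID (e.g. across restarts), which would need to be handled in a real implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// WriteStamp is a hypothetical marker a propeller node would embed in the
// workflow CR on every write: who wrote it, and that node's write sequence.
type WriteStamp struct {
	NodeID string // assumed: stable ID of the propeller node doing the write
	Seq    uint64 // assumed: monotonically increasing per NodeID
}

// StampCache remembers the last stamp this node wrote for each workflow key.
type StampCache struct {
	mu   sync.Mutex
	last map[string]WriteStamp
}

func NewStampCache() *StampCache {
	return &StampCache{last: make(map[string]WriteStamp)}
}

// RecordWrite stores the stamp of a successful write for the given workflow.
func (c *StampCache) RecordWrite(key string, s WriteStamp) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.last[key] = s
}

// IsStale reports whether the copy handed to us by the informer is older
// than the last write recorded for this workflow.
func (c *StampCache) IsStale(key string, observed WriteStamp) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	last, ok := c.last[key]
	if !ok {
		return false // nothing recorded yet, so nothing to compare against
	}
	if last.NodeID != observed.NodeID {
		return false // written by another node: sequences are not comparable
	}
	return observed.Seq < last.Seq
}

func main() {
	cache := NewStampCache()
	cache.RecordWrite("ns/my-workflow", WriteStamp{NodeID: "propeller-0", Seq: 7})

	// The informer hands back an older copy (Seq 5), then the current one (Seq 7).
	fmt.Println(cache.IsStale("ns/my-workflow", WriteStamp{NodeID: "propeller-0", Seq: 5}))
	fmt.Println(cache.IsStale("ns/my-workflow", WriteStamp{NodeID: "propeller-0", Seq: 7}))
}
```

The point of the ordered comparison is that a streak of writes only has to remember the latest stamp: any informer copy with a lower sequence from the same node can be skipped without trying an etcd write.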