# flyte-support
c
We are seeing k8s informer lag with 450 concurrent tasks/nodes. As a result we are (re)processing many stale workflow CRDs, to the point where we're seeing >10k etcd conflicts an hour (10% of etcd writes). The resource version cache isn't doing that great a job at detecting stale workflows, especially when propeller goes on a streak and writes a bunch of updates, it seems. This might be a silly question, but wouldn't it make sense to write a monotonically increasing number (i.e. CPU time) + node ID into the workflow to make better comparisons in the resource version cache as to whether a workflow CR is stale or not? Note: this assumes that the same propeller node should be processing the same Flyte workflow, but I think those implementation details can be worked out.
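Roughly what I have in mind, as a sketch (hypothetical names and annotation keys, not the actual propeller resource version cache API; it uses a per-process counter as the monotonic number):
```go
// Hypothetical sketch: stamp each workflow CR with the writing propeller
// node's ID plus a per-node monotonic counter, then use that pair (rather
// than only resourceVersion) to decide whether an informer event is stale.
package staleness

import (
	"strconv"
	"sync"
	"sync/atomic"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	annoWriterID = "example.com/propeller-writer-id" // hypothetical annotation keys
	annoWriteSeq = "example.com/propeller-write-seq"
)

var writeSeq atomic.Uint64 // monotonic within a single propeller process

// Stamp records this node's identity and its next write sequence number on
// the CR just before it is written back to etcd, and returns that sequence.
func Stamp(obj metav1.Object, nodeID string) uint64 {
	seq := writeSeq.Add(1)
	a := obj.GetAnnotations()
	if a == nil {
		a = map[string]string{}
	}
	a[annoWriterID] = nodeID
	a[annoWriteSeq] = strconv.FormatUint(seq, 10)
	obj.SetAnnotations(a)
	return seq
}

// Cache remembers the highest sequence this node has written per workflow key.
type Cache struct {
	mu   sync.Mutex
	last map[string]uint64
}

// Record is called after a successful write so later informer events can be
// compared against it.
func (c *Cache) Record(key string, seq uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.last == nil {
		c.last = map[string]uint64{}
	}
	if seq > c.last[key] {
		c.last[key] = seq
	}
}

// IsStale returns true when the observed CR was written by this node with a
// sequence older than the latest one we wrote, i.e. the informer is lagging.
func (c *Cache) IsStale(key string, obj metav1.Object, nodeID string) bool {
	a := obj.GetAnnotations()
	if a[annoWriterID] != nodeID {
		return false // written by another node; sequences aren't comparable
	}
	seq, err := strconv.ParseUint(a[annoWriteSeq], 10, 64)
	if err != nil {
		return false
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	return seq < c.last[key]
}
```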
f
You have to be very careful with the revision numbers.
The revision numbers reflect what etcd thinks the version is.
You cannot monotonically increase them.
c
I am referring to writing a separate value from a monotonic clock (tied to a Flyte propeller node) into the workflow CR itself as a different field; sorry if that wasn't clear.
f
But if you write that, every time you write, it will change the CRD, and time is not a good proxy.
c
You'd inject the monotonic clock time into the CR when you write to etcd for existing updates. Not sure what you mean by time not being a good proxy, since this is not wall-clock time; it's nanosecond precision and monotonic. Might not work if workers are operating on a workflow concurrently though, or it might need more consideration/locking.
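Quick illustration of what I mean by monotonic, in Go (not propeller code); the caveat is that the monotonic reading only orders events within one process and is dropped when the time is serialized into a CR field:
```go
// Go's time.Now() carries a monotonic clock reading, so durations computed
// within one process are immune to wall-clock jumps (NTP, etc.). It does not
// order writes across processes, which is the concurrency caveat above.
package main

import (
	"fmt"
	"time"
)

func main() {
	start := time.Now()          // includes a monotonic clock reading
	time.Sleep(10 * time.Millisecond)
	elapsed := time.Since(start) // measured against the monotonic reading
	fmt.Printf("elapsed (monotonic): %v\n", elapsed)

	// The monotonic reading is stripped when the value is marshalled (e.g.
	// written into a CR field); only the wall-clock portion survives.
	fmt.Println(start.Round(0)) // Round(0) drops the monotonic reading
}
```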
We might just update our fork of the resource version cache to parse the resource versions and do numeric comparisons, with the understood risk that the implementation of resource version may change in future versions of k8s.
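For reference, the numeric comparison would be something like this (hypothetical helper; it leans on resourceVersion currently being a decimal etcd revision, which k8s officially treats as opaque):
```go
// Minimal sketch of the fallback: parse resourceVersion as an integer and
// compare numerically to spot stale informer events. Could break if the
// resourceVersion encoding changes in future Kubernetes versions.
package rvcache

import "strconv"

// IsNewer reports whether observed is strictly newer than lastWritten,
// assuming both resourceVersions are decimal etcd revision numbers.
func IsNewer(observed, lastWritten string) (bool, error) {
	o, err := strconv.ParseUint(observed, 10, 64)
	if err != nil {
		return false, err
	}
	l, err := strconv.ParseUint(lastWritten, 10, 64)
	if err != nil {
		return false, err
	}
	return o > l, nil
}
```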