# flyte-deployment
w
Hello! I've been getting Flyte working in production over the last few days, all going quite well. I keep running into one issue that I haven't been able to figure out. Occasionally, tasks will fail (after 10 retries) with the error `the object has been modified; please apply your changes to the latest version and try again`. This error is also reported in the logs of the running flyte-binary pod as something like:
```
E0801 09:00:58.664274       7 workers.go:103] error syncing 'flytesnacks-development/fd1af753191f54b7598f': Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com "fd1af753191f54b7598f": the object has been modified; please apply your changes to the latest version and try again
```
Versions of this error that I have found online seem to normally come up when trying to modify some k8s resource that has a version mismatch. I'm not doing that here - the error just sometimes arises part way through a workflow, and may cause the whole thing to abort. It is typically recoverable afterwards. Some details on the deployment:
• Running on EKS with Fargate profiles
• Using the single flyte-binary (scaled to 4 replicas)
• Using the built-in auth
My guess is that it has something to do with either how Flyte itself is creating/updating nodes/pods, or how Fargate handles the nodes underneath, but I'm not really sure (I'm a beginner with Kubernetes, so very much figuring it out as I go). Any help hugely appreciated, and let me know if I can give any more detail to help diagnose.
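For reference, here's roughly how I've been pulling those errors out of the flyte-binary pod logs - the `flyte` namespace and the label selector are just what my install happens to use, so treat them as assumptions:

```python
# Rough sketch: scan the flyte-binary pod logs for the k8s conflict error.
# The namespace and label selector below are assumptions about this install.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
core = client.CoreV1Api()

pods = core.list_namespaced_pod(
    "flyte", label_selector="app.kubernetes.io/name=flyte-binary"
)
for pod in pods.items:
    logs = core.read_namespaced_pod_log(pod.metadata.name, "flyte", tail_lines=2000)
    for line in logs.splitlines():
        if "Operation cannot be fulfilled" in line:
            print(pod.metadata.name, line)
```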
a
@wide-soccer-37846 are these MapTasks?
w
Not in general, I believe. There are MapTasks in the workflow, on which this failure may occur, but it may occur on other tasks too.
⌛ 1
a
ok, and are there multiple concurrent executions? we're seeing this same error in other environments where there are MapTasks with wide fanout
w
Yeah it's definitely quite a concurrent workflow - about a dozen (fairly high-memory) tasks running at the same time, then all the results passed to a map task with 8 or so concurrent sub-tasks. I'm also running that workflow as part of a dynamic workflow with (so far) three copies of it
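To give a rough picture, the shape of the workflow is something like the sketch below - task names, signatures, and the exact numbers are placeholders rather than my real code:

```python
# Simplified sketch of the workflow shape described above; everything here is
# a placeholder to show the fan-out, the map task, and the dynamic wrapper.
from typing import List

from flytekit import dynamic, map_task, task, workflow


@task
def heavy_task(chunk: int, seed: int) -> float:
    # stands in for one of the ~dozen fairly high-memory tasks
    return float(chunk + seed)


@task
def process(value: float) -> float:
    # body of each mapped sub-task
    return value * 2.0


@workflow
def inner_wf(seed: int) -> List[float]:
    # the loop runs over a Python constant, so it unrolls at registration time
    # into ~a dozen task nodes that execute in parallel
    results = [heavy_task(chunk=i, seed=seed) for i in range(12)]
    # all results then feed a map task limited to ~8 concurrent sub-tasks
    return map_task(process, concurrency=8)(value=results)


@dynamic
def outer(seeds: List[int]) -> List[List[float]]:
    # the dynamic workflow that currently launches three copies of inner_wf
    return [inner_wf(seed=s) for s in seeds]
```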
a
ok yeah, like concurrency hell 😅 So, I'm looking at `max-streak-length` and how it could help you. The `the object has been modified; please apply your changes to the latest version and try again` error is reported by K8s, and it seems to come up when a controller (like flytepropeller) tries to write to a resource that has changed, but the controller was not yet aware of that change (this is what is called a "stale" write, which is rejected by etcd). Happy to have a call and go over a bit of the background here (assuming that helps). Could you try enabling `max-streak-length` and setting it to `12` to begin with? For flyte-binary that'd be:
```yaml
configuration:
  propeller:
    max-streak-length: '12'
```
Also, please take a look at this section of the docs and feel free to ask any questions.
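To make the "stale write" idea a bit more concrete, here's a rough illustration of the same optimistic-concurrency rule using the Kubernetes Python client - a ConfigMap stands in for the FlyteWorkflow custom resource that propeller updates, so this is only a sketch of the pattern, not what propeller does internally:

```python
# Illustration of k8s optimistic concurrency: a write is rejected with a
# 409 Conflict if the object's resourceVersion changed since it was last read.
# A ConfigMap stands in for the FlyteWorkflow custom resource here.
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
v1 = client.CoreV1Api()


def update_with_retry(name: str, namespace: str, retries: int = 5) -> None:
    for _ in range(retries):
        cm = v1.read_namespaced_config_map(name, namespace)  # fresh read, fresh resourceVersion
        cm.data = {**(cm.data or {}), "touched": "true"}
        try:
            v1.replace_namespaced_config_map(name, namespace, cm)
            return
        except ApiException as exc:
            if exc.status != 409:  # 409 is "the object has been modified..."
                raise
            # another writer got in between; loop and re-read the latest version
    raise RuntimeError("gave up after repeated conflicts")
```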
w
Thanks, that makes sense, and I'll have a go with that parameter 👍 Also sounds like forcing the workflow to be sequential may help? I'm off for the weekend for now, but I'll get back to this on Monday, so I'll be back in touch then with how it goes - and a call might be good if you're still happy to do that. Thanks again for your help, appreciate it :)
a
> Also sounds like forcing the workflow to be sequential may help?
Not sure if that's needed, we can figure out first if the current structure can work
Cool, let us know of any updates and we can continue from there
w
Hi David, just getting back to you on this. Made a couple of updates to the deployment, and I think things are mainly running smoothly now.
• Added that max-streak-length parameter
• Scaled the flyte-binary replicas back down to 1
• Allowed the vertical pod autoscaler to do its thing
I now think the problem was likely having multiple deployments of flyte-binary running simultaneously - I think they were interacting with one another in an invalid, race-condition-y way. I'd originally tried scaling them to deal with what looked like memory bottlenecks, but I now understand that that should probably be handled vertically rather than horizontally, hence the vertical autoscaler. Does that make sense or am I barking up the wrong tree?
a
Hey Ben, I think the multiple binary Pods had a lot to do with that behavior! Only one propeller instance should be running: with flyte-core there's a leader-election mechanism, but not with flyte-binary. If any other problem arises, let me know!
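If you want a quick sanity check that only one propeller is in play, something like this works - the deployment name `flyte-binary` and namespace `flyte` are assumptions about your install, so adjust as needed:

```python
# Sanity check: the flyte-binary Deployment should run a single replica,
# since only one propeller instance should be writing to the FlyteWorkflow CRDs.
# The deployment name and namespace are assumptions about the install.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

dep = apps.read_namespaced_deployment("flyte-binary", "flyte")
print(f"desired replicas: {dep.spec.replicas}, ready: {dep.status.ready_replicas}")
assert dep.spec.replicas == 1, "more than one propeller instance configured"
```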
👍 1