# flyte-deployment
w
Hello! I've been getting Flyte working in production over the last few days, all going quite well. I keep running into one issue that I haven't been able to figure out. Occasionally, tasks will fail (after 10 retries) with the error `the object has been modified; please apply your changes to the latest version and try again`. This error is also reported in the logs of the running flyte-binary pod as something like:
```
E0801 09:00:58.664274       7 workers.go:103] error syncing 'flytesnacks-development/fd1af753191f54b7598f': Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com "fd1af753191f54b7598f": the object has been modified; please apply your changes to the latest version and try again
```
Versions of this error that I have found online seem to normally come up when trying to modify some k8s resource that has a version mismatch. I'm not doing that here - the error just sometimes arises part way through a workflow, and may cause the whole thing to abort. It is typically recoverable afterwards. Some details on the deployment:
• Running on EKS with Fargate profiles
• Using the single flyte-binary (scaled to 4 replicas)
• Using the built-in auth
My guess is that it has something to do with either how Flyte itself is creating/updating nodes/pods, or how Fargate handles the nodes underneath, but I'm not really sure (I'm a beginner with Kubernetes, so very much figuring it out as I go). Any help hugely appreciated, and let me know if I can give any more detail to help diagnose.
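For reference, here's roughly how I've been pulling those errors out of the flyte-binary pod logs - the `flyte` namespace and the label selector are just what my install happens to use, so treat them as assumptions:

```python
# Rough sketch: scan the flyte-binary pod logs for the k8s conflict error.
# The namespace and label selector below are assumptions about this install.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
core = client.CoreV1Api()

pods = core.list_namespaced_pod(
    "flyte", label_selector="app.kubernetes.io/name=flyte-binary"
)
for pod in pods.items:
    logs = core.read_namespaced_pod_log(pod.metadata.name, "flyte", tail_lines=2000)
    for line in logs.splitlines():
        if "Operation cannot be fulfilled" in line:
            print(pod.metadata.name, line)
```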
a
@wide-soccer-37846 are these MapTasks?
w
Not in general, I believe. There are MapTasks in the workflow, on which this failure may occur, but it may occur on other tasks too.
⌛ 1
a
ok, and are there multiple concurrent executions? we're seeing this same error in other environments where there are MapTasks with wide fanout
w
Yeah it's definitely quite a concurrent workflow - about a dozen (fairly high-memory) tasks running at the same time, then all the results passed to a map task with 8 or so concurrent sub-tasks. I'm also running that workflow as part of a dynamic workflow with (so far) three copies of it
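To give a rough picture, the shape of the workflow is something like the sketch below - task names, signatures, and the exact numbers are placeholders rather than my real code:

```python
# Simplified sketch of the workflow shape described above; everything here is
# a placeholder to show the fan-out, the map task, and the dynamic wrapper.
from typing import List

from flytekit import dynamic, map_task, task, workflow


@task
def heavy_task(chunk: int, seed: int) -> float:
    # stands in for one of the ~dozen fairly high-memory tasks
    return float(chunk + seed)


@task
def process(value: float) -> float:
    # body of each mapped sub-task
    return value * 2.0


@workflow
def inner_wf(seed: int) -> List[float]:
    # the loop runs over a Python constant, so it unrolls at registration time
    # into ~a dozen task nodes that execute in parallel
    results = [heavy_task(chunk=i, seed=seed) for i in range(12)]
    # all results then feed a map task limited to ~8 concurrent sub-tasks
    return map_task(process, concurrency=8)(value=results)


@dynamic
def outer(seeds: List[int]) -> List[List[float]]:
    # the dynamic workflow that currently launches three copies of inner_wf
    return [inner_wf(seed=s) for s in seeds]
```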
a
ok yeah, like concurrency hell 😅 So, I'm looking at `max-streak-length` and how it could help you. The `the object has been modified; please apply your changes to the latest version and try again` error is reported by K8s, and it seems to come up when a controller (like flytepropeller) tries to write to a resource that has changed, but the controller was not yet aware of that change (this is what is called a "stale" write, which is rejected by etcd). Happy to have a call and go over a bit of the background here (assuming that helps). Could you try enabling `max-streak-length` and setting it to `12` to begin with? For flyte-binary that'd be:
```yaml
configuration:
  propeller:
    max-streak-length: '12'
```
Also, please take a look at this section of the docs and feel free to ask any questions.
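To make the "stale write" idea a bit more concrete, here's a rough illustration of the same optimistic-concurrency rule using the Kubernetes Python client - a ConfigMap stands in for the FlyteWorkflow custom resource that propeller updates, so this is only a sketch of the pattern, not what propeller does internally:

```python
# Illustration of k8s optimistic concurrency: a write is rejected with a
# 409 Conflict if the object's resourceVersion changed since it was last read.
# A ConfigMap stands in for the FlyteWorkflow custom resource here.
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
v1 = client.CoreV1Api()


def update_with_retry(name: str, namespace: str, retries: int = 5) -> None:
    for _ in range(retries):
        cm = v1.read_namespaced_config_map(name, namespace)  # fresh read, fresh resourceVersion
        cm.data = {**(cm.data or {}), "touched": "true"}
        try:
            v1.replace_namespaced_config_map(name, namespace, cm)
            return
        except ApiException as exc:
            if exc.status != 409:  # 409 is "the object has been modified..."
                raise
            # another writer got in between; loop and re-read the latest version
    raise RuntimeError("gave up after repeated conflicts")
```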
w
Thanks, that makes sense, and I'll have a go with that parameter 👍 Also sounds like forcing the workflow to be sequential may help? I'm off for the weekend for now, but I'll get back to this on Monday, so I'll be back in touch then with how it goes - and a call might be good if you're still happy to do that. Thanks again for your help, appreciate it :)
a
> Also sounds like forcing the workflow to be sequential may help?
Not sure if that's needed, we can figure out first if the current structure can work
Cool, let us know of any updates and we can continue from there
w
Hi David, just getting back to you on this. Made a couple of updates to the deployment, and I think things are mainly running smoothly now.
• Added that max-streak-length parameter
• Scaled the flyte-binary replicas back down to 1
• Allowed the vertical pod autoscaler to do its thing
I now think the problem was likely having multiple deployments of flyte-binary running simultaneously - I think they were interacting with one another in an invalid, race-condition-y way. I'd originally tried scaling them to deal with what looked like memory bottlenecks, but I now understand that that should probably be handled vertically rather than horizontally, hence the vertical autoscaler. Does that make sense or am I barking up the wrong tree?
a
Hey Ben, I think the multiple binary Pods had a lot to do with that behavior! Only one propeller instance should be running: with flyte-core there's a leader-election mechanism, but not with flyte-binary. If any other problem arises, let me know!
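If you want a quick sanity check that only one propeller is in play, something like this works - the deployment name `flyte-binary` and namespace `flyte` are assumptions about your install, so adjust as needed:

```python
# Sanity check: the flyte-binary Deployment should run a single replica,
# since only one propeller instance should be writing to the FlyteWorkflow CRDs.
# The deployment name and namespace are assumptions about the install.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

dep = apps.read_namespaced_deployment("flyte-binary", "flyte")
print(f"desired replicas: {dep.spec.replicas}, ready: {dep.status.ready_replicas}")
assert dep.spec.replicas == 1, "more than one propeller instance configured"
```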
👍 1