# announcements
s
Hey Flyte friends, sorry for the spam, but I'm running into yet another weird error. It might be a misconfiguration on my side, since it used to work, but I haven't yet been able to figure out which change causes the issue. What's happening is that suddenly most tasks fail almost immediately after they start running. The high-level error shown in flyteconsole is this:
Some node execution failed, auto-abort.
The task pods start and go into the running state. I can sometimes see a few lines of task logs, but then it looks like the pod is deleted long before it has finished: it terminates and is cleaned up. The log says:
Stopping container m4dydd0r8y-n0-0
So I checked the flytepropeller logs, and there's one error log a few seconds before the pod is deleted. Not sure if it's related:
Failed to update workflow. Error [Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com "m4dydd0r8y": the object has been modified; please apply your changes to the latest version and try again]
I'm a bit at a loss as to how to debug this further. Any hints are much appreciated. I can also provide the full propeller log if that would help track this down.
h
Hey @Sören Brunk, do you have inject-finalizer: true for the k8s config? If not, can you set it so that the pods stick around even if the k8s node (machine) gets deleted or something similar? This will let you inspect the Pod even after it fails. The failure to update the workflow tells me one of two things:
1. Something or someone issued an out-of-band Delete operation on the CRD. This could be someone directly deleting the CRD, or someone issuing an abort in the UI/Admin API/flytectl.
2. Another propeller is running and competing to process that CRD (unlikely unless there is an issue with the deployment).
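To illustrate why a finalizer keeps a pod inspectable, here is a minimal Python sketch of the general Kubernetes finalizer rule: a delete request only marks the object, and it is actually removed once its list of finalizers is empty. The ToyApiServer class and the finalizer name example.test/hold are hypothetical stand-ins for illustration, not Flyte's or Kubernetes' real code.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Toy stand-in for a Kubernetes object (hypothetical, not a client type)."""
    name: str
    finalizers: list = field(default_factory=list)
    deletion_requested: bool = False


class ToyApiServer:
    """Mimics only the finalizer rule: delete marks, removal waits."""

    def __init__(self):
        self.pods = {}

    def create(self, pod):
        self.pods[pod.name] = pod

    def delete(self, name):
        # Comparable to setting deletionTimestamp on the object.
        self.pods[name].deletion_requested = True
        self._collect(name)

    def remove_finalizer(self, name, finalizer):
        self.pods[name].finalizers.remove(finalizer)
        self._collect(name)

    def _collect(self, name):
        pod = self.pods[name]
        # The object disappears only once deletion was requested
        # and no finalizers remain.
        if pod.deletion_requested and not pod.finalizers:
            del self.pods[name]


api = ToyApiServer()
api.create(Pod("m4dydd0r8y-n0-0", finalizers=["example.test/hold"]))

api.delete("m4dydd0r8y-n0-0")
still_there = "m4dydd0r8y-n0-0" in api.pods  # finalizer blocks removal

api.remove_finalizer("m4dydd0r8y-n0-0", "example.test/hold")
gone = "m4dydd0r8y-n0-0" not in api.pods  # removed once finalizers are empty

print(still_there, gone)
```

While the finalizer is present, the pod object stays around after the delete request, which is what makes post-mortem inspection possible.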
s
Thanks @Haytham Abuelfutuh. Seems like a competing propeller caused the issue indeed. I had another flyte instance running in the same cluster in another namespace and after disabling it, it works again.
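For context, the "object has been modified" message is what the Kubernetes API returns when an update carries a stale resourceVersion. The Python sketch below simulates that optimistic concurrency check with two writers that read the same version. ToyStore and the phase values are hypothetical, chosen only for illustration; it does not reproduce propeller's actual update logic.

```python
class Conflict(Exception):
    """Raised when an update is based on a stale version."""


class ToyStore:
    """Toy object store with a per-object version counter (hypothetical)."""

    def __init__(self):
        self._objects = {}  # name -> (resource_version, spec)

    def create(self, name, spec):
        self._objects[name] = (1, dict(spec))

    def get(self, name):
        version, spec = self._objects[name]
        return version, dict(spec)

    def update(self, name, seen_version, spec):
        current_version, _ = self._objects[name]
        if seen_version != current_version:
            # Someone else wrote in between: reject instead of overwriting.
            raise Conflict(
                "the object has been modified; please apply your changes "
                "to the latest version and try again"
            )
        self._objects[name] = (current_version + 1, dict(spec))
        return current_version + 1


store = ToyStore()
store.create("m4dydd0r8y", {"phase": "Queued"})

# Two controllers read the same version of the same object.
version_a, spec_a = store.get("m4dydd0r8y")
version_b, spec_b = store.get("m4dydd0r8y")

# The first write succeeds and bumps the version.
spec_a["phase"] = "Running"
store.update("m4dydd0r8y", version_a, spec_a)

# The second write is based on the old version and is rejected.
spec_b["phase"] = "Aborted"
try:
    store.update("m4dydd0r8y", version_b, spec_b)
    conflicted = False
except Conflict:
    conflicted = True

print(conflicted, store.get("m4dydd0r8y"))
```

A single controller normally re-reads and retries after such a conflict; two controllers that both believe they own the object will keep stepping on each other.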
So I assume propeller listens to flyteworkflow CRD instances in all namespaces, right? Is there a way to restrict it to configured namespaces? If not, do you think it's feasible to add such a config option?
Or perhaps there's another option, because right now that restricts us to one Flyte installation per cluster, right?
I couldn't find it in the docs, but I found a limit-namespace option in the propeller config which says "Namespaces to watch for this propeller". I suspect that could be what I'm looking for, @Haytham Abuelfutuh: https://github.com/flyteorg/flytepropeller/blob/c016dabbfef6037bead59590b42326dabe89f957/pkg/controller/config/config.go#L124
I wonder if we could eventually use the same logic as in flyteadmin to derive the namespace names for projects/domains via namespace_mapping. WDYT?
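As a rough sketch of that idea, the Python snippet below derives a list of namespaces from projects and domains using a template. It assumes a template of the form "{{ project }}-{{ domain }}"; the helper names and the project and domain values are hypothetical, and nothing here reflects an existing propeller option.

```python
def render_namespace(template, project, domain):
    """Fill a '{{ project }}-{{ domain }}' style template (assumed format)."""
    return (
        template.replace("{{ project }}", project)
                .replace("{{ domain }}", domain)
    )


def namespaces_for(template, projects, domains):
    """Return the sorted set of namespaces for every project/domain pair."""
    return sorted(
        {render_namespace(template, p, d) for p in projects for d in domains}
    )


# Example values only; a real deployment would read these from its config.
watched = namespaces_for(
    "{{ project }}-{{ domain }}",
    projects=["flytesnacks"],
    domains=["development", "staging", "production"],
)
print(watched)
```

A propeller that accepted such a derived list could then watch only the namespaces belonging to its own installation.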
I'm looking into this again, and it seems like limitNamespace can only be a single namespace (or all), not multiple namespaces. Is this assumption correct?
h
Sorry for the late response. Yes, that's correct, unfortunately. I think allowing a label selector for selecting namespaces is a clean and k8s-y way of doing it.
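A minimal sketch of what label-based namespace selection could look like, using equality-based matching as Kubernetes label selectors do. The label key example.test/flyte-instance, the namespace names, and the helper functions are all hypothetical; this is not an existing propeller feature.

```python
def matches(selector, labels):
    """Equality-based match: every selector pair must be present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select_namespaces(namespaces, selector):
    """Return the sorted names of namespaces whose labels match the selector."""
    return sorted(
        name for name, labels in namespaces.items() if matches(selector, labels)
    )


# Hypothetical cluster state: namespace name -> labels.
namespaces = {
    "team-a-development": {"example.test/flyte-instance": "a"},
    "team-a-production": {"example.test/flyte-instance": "a"},
    "team-b-development": {"example.test/flyte-instance": "b"},
    "kube-system": {},
}

selected = select_namespaces(namespaces, {"example.test/flyte-instance": "a"})
print(selected)
```

With one label value per Flyte installation, each propeller would only pick up workflows in its own namespaces, and an empty selector would still match everything.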