Hey flyte friends, sorry for the spam but I'm running into yet... | Flyte #announcements

boundless-pizza-95864

03/09/2022, 10:15 PM

Hey flyte friends, sorry for the spam but I'm running into yet another weird error. It might be a misconfiguration on my side, since it used to work, but I haven't been able to figure out which change caused the issue. What's happening is that suddenly most tasks fail almost immediately after they start running. The high-level error shown in flyteconsole is this:

Some node execution failed, auto-abort.

The task pods start, go into running state. I can see a few lines of task logs sometimes but then it looks like the pod is just deleted long before it's finished. It terminates and is cleaned up. Log says:

Stopping container m4dydd0r8y-n0-0

So I checked the flytepropeller logs and there's one error log a few seconds before the pod is deleted. Not sure if it's related:

Failed to update workflow. Error [Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com "m4dydd0r8y": the object has been modified; please apply your changes to the latest version and try again]

I'm a bit at a loss here how to debug this further. Any hints much appreciated. I can also provide the full propeller log if that could help tracing this down.

high-park-82026

03/09/2022, 10:31 PM

Hey @boundless-pizza-95864, do you have `inject-finalizer: true` set in the k8s config? If not, can you set it so that the pods stick around even if the k8s node (machine) gets deleted or something? This will let you inspect the Pod even after it fails. The failure to update the workflow tells me one of two things:

1. Maybe something/someone issued an out-of-band delete operation on the CRD. This could be someone directly deleting the CRD, or someone issuing an abort in the UI/Admin API/flytectl.
2. Another propeller is running and competing to process that CRD (unlikely unless there is an issue with the deployment).
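
For readers unfamiliar with the "the object has been modified" error quoted above: it is what an API server returns when an update is based on a stale copy of an object. The following is a toy sketch of that optimistic-concurrency check, assuming two controllers that read the same object and then both try to write it. `ToyStore`, `ConflictError`, and `simulate_competing_controllers` are hypothetical names made up for illustration; this is not Flyte or Kubernetes code.

```python
class ConflictError(Exception):
    """Raised when an update is based on a stale resource version."""


class ToyStore:
    """In-memory stand-in for an API server (illustrative only)."""

    def __init__(self):
        self._objects = {}

    def create(self, name, spec):
        self._objects[name] = {"name": name, "resource_version": 1, "spec": dict(spec)}

    def get(self, name):
        # Return a copy so each caller works on its own snapshot.
        obj = self._objects[name]
        return {
            "name": obj["name"],
            "resource_version": obj["resource_version"],
            "spec": dict(obj["spec"]),
        }

    def update(self, obj):
        current = self._objects[obj["name"]]
        # Reject the write if someone else updated the object in the meantime.
        if obj["resource_version"] != current["resource_version"]:
            raise ConflictError(
                f'"{obj["name"]}": the object has been modified; '
                "please apply your changes to the latest version and try again"
            )
        self._objects[obj["name"]] = {
            "name": obj["name"],
            "resource_version": current["resource_version"] + 1,
            "spec": dict(obj["spec"]),
        }


def simulate_competing_controllers():
    """Two controllers read the same workflow; the second write is rejected."""
    store = ToyStore()
    store.create("m4dydd0r8y", {"phase": "Queued"})

    copy_a = store.get("m4dydd0r8y")  # controller A reads version 1
    copy_b = store.get("m4dydd0r8y")  # controller B also reads version 1

    copy_a["spec"]["phase"] = "Running"
    store.update(copy_a)  # succeeds, object is now at version 2

    copy_b["spec"]["phase"] = "Aborted"
    try:
        store.update(copy_b)  # still based on version 1
    except ConflictError as err:
        return str(err)
    return None
```

In this toy model, the second writer gets a conflict because the first writer already bumped the version, which matches both scenarios listed above: an out-of-band change or a second controller working on the same object.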

boundless-pizza-95864

03/10/2022, 7:42 AM

Thanks @high-park-82026. Seems like a competing propeller caused the issue indeed. I had another flyte instance running in the same cluster in another namespace and after disabling it, it works again.

boundless-pizza-95864

03/10/2022, 7:58 AM

So I assume propeller listens to `flyteworkflow` CRD instances in all namespaces, right? Is there a way to restrict it to configured namespaces? If not, do you think it's feasible to add such a config option?

boundless-pizza-95864

03/10/2022, 10:52 AM

Or perhaps there's another option? Because right now that restricts us to one flyte installation per cluster, right?

boundless-pizza-95864

03/16/2022, 11:33 AM

I couldn't find it in the docs, but I found a `limit-namespace` option in the propeller config which says "Namespaces to watch for this propeller". I suspect that could be what I'm looking for, @high-park-82026: https://github.com/flyteorg/flytepropeller/blob/c016dabbfef6037bead59590b42326dabe89f957/pkg/controller/config/config.go#L124

boundless-pizza-95864

03/16/2022, 12:50 PM

I wonder if we could eventually use the same logic as in flyteadmin to derive the namespace names for projects/domains via namespace_mapping. WDYT?

boundless-pizza-95864

04/05/2022, 12:02 PM

I'm looking into this again and it seems like `limitNamespace` can only be a single namespace (or `all`), but not multiple namespaces. Is this assumption correct?
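
To make the question concrete, here is a minimal sketch of the behavior as described in this thread: the setting is either `all` or exactly one namespace name. `is_watched` is a hypothetical helper written for illustration, not propeller's actual code, and treating a comma-separated value as one literal name is an assumption of this toy model.

```python
ALL_NAMESPACES = "all"


def is_watched(workflow_namespace: str, limit_namespace: str = ALL_NAMESPACES) -> bool:
    """Return True if a controller with this setting would handle the namespace.

    Toy model: the setting is either "all" or one exact namespace name.
    """
    if limit_namespace == ALL_NAMESPACES:
        return True
    # A single exact match only; there is no list or pattern support here.
    return workflow_namespace == limit_namespace


# With "all", every namespace is watched, so two installations would overlap.
overlap = is_watched("proj-development", "all") and is_watched("proj-development")

# With one namespace, only that namespace is watched.
single = is_watched("proj-development", "proj-development")

# Assumption of this sketch: a comma-separated value is one literal name,
# so it matches neither of the namespaces it was meant to list.
multi = is_watched("proj-development", "proj-development,proj-production")
```

Under these assumptions, a setup with one namespace per project/domain pair cannot be covered by a single value, which is the limitation being asked about.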

high-park-82026

04/09/2022, 4:21 PM

sorry for the late response. yes that’s correct… unfortunately… I think allowing a label selector for selecting namespaces is a clean and k8s-y way of doing it
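
As a rough illustration of the label-selector idea, the sketch below picks namespaces by equality-based label matching. It is only a sketch under stated assumptions: the label key `flyte-instance`, the namespace names, and both helper functions are hypothetical, and no such option is confirmed to exist in propeller.

```python
def matches_selector(labels: dict, selector: dict) -> bool:
    """True if every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select_namespaces(namespaces: dict, selector: dict) -> list:
    """Return the sorted names of namespaces whose labels match the selector."""
    return sorted(
        name for name, labels in namespaces.items() if matches_selector(labels, selector)
    )


# Hypothetical cluster state: namespace name -> labels.
EXAMPLE_NAMESPACES = {
    "proj-development": {"flyte-instance": "a"},
    "proj-production": {"flyte-instance": "a"},
    "other-development": {"flyte-instance": "b"},
    "kube-system": {},
}

# Each installation would watch only the namespaces carrying its own label.
instance_a = select_namespaces(EXAMPLE_NAMESPACES, {"flyte-instance": "a"})
instance_b = select_namespaces(EXAMPLE_NAMESPACES, {"flyte-instance": "b"})
```

With a scheme like this, several namespaces can belong to one installation without listing them individually, and two installations in the same cluster stay disjoint as long as their label values differ.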
