  • Sören Brunk

    5 months ago
    Hey flyte friends, sorry for the spam but I'm running into yet another weird error. It might be some misconfiguration on my side, as it used to work, but I haven't been able to figure out yet which change caused the issue. What's happening is that suddenly most tasks fail almost immediately after they start running. The high-level error shown in flyteconsole is this:
    Some node execution failed, auto-abort.
    The task pods start and go into the Running state. Sometimes I can see a few lines of task logs, but then it looks like the pod is deleted long before it has finished. It terminates and is cleaned up. The log says:
    Stopping container m4dydd0r8y-n0-0
    So I checked the flytepropeller logs and there's one error log a few seconds before the pod is deleted. Not sure if it's related:
    Failed to update workflow. Error [Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com "m4dydd0r8y": the object has been modified; please apply your changes to the latest version and try again]
    I'm a bit at a loss as to how to debug this further. Any hints are much appreciated. I can also provide the full propeller log if that could help trace this down.
  • Haytham Abuelfutuh

    5 months ago
    Hey @Sören Brunk, do you have
    inject-finalizer: true
    in your k8s config? If not, can you set it so that the pods stick around even if the k8s node (machine) gets deleted or something? This will let you inspect the Pod even after it fails. The failure to update the workflow tells me one of two things:
    1. Maybe something/someone issued an out-of-band Delete operation on the CRD. This could be someone deleting the CRD directly, or someone issuing an abort in the UI/Admin API/flytectl…
    2. Another propeller is running and competing to process that CRD (unlikely unless there is an issue with the deployment).
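To illustrate the second case, here is a minimal Python sketch of how two controllers competing for the same object can produce an "object has been modified" error under optimistic concurrency. This is a toy model and not flytepropeller or Kubernetes client code: `FakeStore`, `ConflictError`, and `simulate_two_controllers` are hypothetical names, and the integer version counter only stands in for the API server's resource version check.

```python
class ConflictError(Exception):
    """Raised when an update carries a stale version of the object."""


class FakeStore:
    """Hypothetical in-memory stand-in for an API server with version checks."""

    def __init__(self):
        self._objects = {}  # name -> (version, spec)

    def create(self, name, spec):
        self._objects[name] = (1, dict(spec))

    def get(self, name):
        version, spec = self._objects[name]
        return version, dict(spec)

    def update(self, name, seen_version, spec):
        current_version, _ = self._objects[name]
        # Reject the write if the caller read an older version of the object.
        if seen_version != current_version:
            raise ConflictError(
                f'Operation cannot be fulfilled on "{name}": '
                "the object has been modified"
            )
        self._objects[name] = (current_version + 1, dict(spec))
        return current_version + 1


def simulate_two_controllers():
    """Return the conflict message seen by the slower of two controllers."""
    store = FakeStore()
    store.create("m4dydd0r8y", {"phase": "Queued"})

    # Both controllers read the same version of the object.
    version_a, spec_a = store.get("m4dydd0r8y")
    version_b, spec_b = store.get("m4dydd0r8y")

    # Controller A writes first and succeeds.
    spec_a["phase"] = "Running"
    store.update("m4dydd0r8y", version_a, spec_a)

    # Controller B still holds the old version, so its write is rejected.
    spec_b["phase"] = "Aborted"
    try:
        store.update("m4dydd0r8y", version_b, spec_b)
    except ConflictError as err:
        return str(err)
    return None
```

Calling `simulate_two_controllers()` returns the conflict message raised for the second writer, which mirrors the shape of the error in the propeller log above.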
  • Sören Brunk

    5 months ago
    Thanks @Haytham Abuelfutuh. Seems like a competing propeller caused the issue indeed. I had another flyte instance running in the same cluster in another namespace and after disabling it, it works again.
  • So I assume propeller listens to
    flyteworkflow
    CRD instances in all namespaces, right? Is there a way to restrict it to configured namespaces? If not, do you think it's feasible to add such a config option?
  • Or perhaps there's another option, because right now that restricts us to one Flyte installation per cluster, right?
  • I couldn't find it in the docs, but I found a
    limit-namespace
    option in the propeller config, which is described as "Namespaces to watch for this propeller". I suspect that could be what I'm looking for @Haytham Abuelfutuh https://github.com/flyteorg/flytepropeller/blob/c016dabbfef6037bead59590b42326dabe89f957/pkg/controller/config/config.go#L124
  • I wonder if we could eventually use the same logic as in flyteadmin to derive the namespace names for projects/domains via namespace_mapping. WDYT?
  • I'm looking into this again and it seems like
    limitNamespace
    can only be a single namespace (or
    all
    ), but not multiple namespaces. Is this assumption correct?
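As a rough sketch of the behavior described here, assuming the setting accepts either one namespace name or `all`, the filter could be modeled like this in Python. The functions `should_process` and `claiming_propellers`, the instance names, and the namespace names are all hypothetical; this is not propeller's actual implementation.

```python
def should_process(workflow_namespace: str, limit_namespace: str = "all") -> bool:
    """Decide whether a propeller with this limit would handle the workflow.

    Assumption from the thread: the limit is a single namespace or "all".
    """
    if limit_namespace == "all":
        return True
    return workflow_namespace == limit_namespace


def claiming_propellers(workflow_namespace: str, propeller_limits: dict) -> list:
    """List the (hypothetical) propeller instances that would pick up a workflow."""
    return [
        name
        for name, limit in propeller_limits.items()
        if should_process(workflow_namespace, limit)
    ]


# Two installations that both watch every namespace compete for the same workflow.
both_all = claiming_propellers(
    "flytesnacks-development", {"flyte-a": "all", "flyte-b": "all"}
)

# With distinct single-namespace limits, only one instance claims it.
separated = claiming_propellers(
    "flytesnacks-development",
    {"flyte-a": "flytesnacks-development", "flyte-b": "other-development"},
)
```

In this model `both_all` contains both instances and `separated` contains only `flyte-a`. The limitation is also visible: since each instance can name just one namespace, an installation whose projects and domains span several namespaces cannot be expressed this way.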
  • Haytham Abuelfutuh

    4 months ago
    Sorry for the late response. Yes, that's correct, unfortunately… I think allowing a label selector for selecting namespaces would be a clean and k8s-y way of doing it.
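The label selector idea could look roughly like the following Python sketch, which applies equality-based matching (every key/value pair in the selector must be present on the namespace). Nothing here exists in Flyte as far as this thread shows: `matches_selector`, `select_namespaces`, the `flyte-instance` label key, and the namespace names are assumptions made up for illustration.

```python
def matches_selector(labels: dict, selector: dict) -> bool:
    """Equality-based match: every selector pair must appear in the labels.

    An empty selector matches everything, as with Kubernetes label selectors.
    """
    return all(labels.get(key) == value for key, value in selector.items())


def select_namespaces(namespaces: dict, selector: dict) -> list:
    """Return the sorted names of namespaces whose labels satisfy the selector."""
    return sorted(
        name for name, labels in namespaces.items()
        if matches_selector(labels, selector)
    )


# Hypothetical cluster: two Flyte installations distinguished by a label.
namespaces = {
    "proj-development": {"flyte-instance": "a"},
    "proj-production": {"flyte-instance": "a"},
    "other-development": {"flyte-instance": "b"},
    "kube-system": {},
}

watched_by_a = select_namespaces(namespaces, {"flyte-instance": "a"})
watched_by_b = select_namespaces(namespaces, {"flyte-instance": "b"})
```

Here `watched_by_a` covers both `proj-*` namespaces and `watched_by_b` covers only `other-development`, so each installation could watch several namespaces without overlapping with the other.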