# flyte-support
s
Hi guys, we’re seeing an issue with Flyte (v1.13.3, flyte-binary on EKS). A workflow with ~200 nodes gets stuck in RUNNING indefinitely.
• Symptom: No pods are running (`kubectl get pods`). In the FlyteWorkflow CRD, many nodes remain in QUEUED and never transition.
• Checks so far:
  ◦ Cluster has plenty of CPU/memory.
  ◦ No signs of pod eviction or autoscaler interference.
  ◦ Individual tasks succeed when run alone.
  ◦ The problem only shows up when running the full DAG (~200 tasks).
Questions:
• Should flyte-binary/propeller handle a 200-node DAG without extra tuning, or do configs like `max-parallelism`, `queue-batch-size`, or `workers` need adjusting? (See the config sketch after this message.)
• What’s the best way to debug why nodes stay in QUEUED (beyond the FlyteWorkflow CRD, flyte-binary logs in debug, and propeller logs)? Any metrics or DB queries worth checking?
Happy to share logs or CRD dumps if helpful. Thanks!
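For reference, a minimal sketch of the knobs named above, expressed as FlytePropeller configuration that flyte-binary can pick up (for example via the chart's inline config). The key paths and values here are illustrative assumptions for v1.13.x; verify them against the FlytePropeller configuration reference before applying anything:

```yaml
# Illustrative propeller throughput settings for a large DAG (assumed key paths).
propeller:
  workers: 40                    # concurrent workflow-processing workers
  workflow-reeval-duration: 30s  # how often a workflow is re-evaluated
  queue:
    type: batch
    batching-interval: 2s
    batch-size: -1               # -1 = no cap on items drained per batch
# Note: max-parallelism is an execution/launch-plan level setting (with a default
# in flyteadmin), so raising propeller workers alone does not lift it.
```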
c
`max-parallelism` can definitely prevent other nodes in the DAG from running if some nodes are stuck in a running state. I'd look at the Flyte propeller logs and grep for the execution ID to try and see what might have happened. Flyte propeller is the state machine that progresses the workflow and sends update events to Flyte Admin, which are reflected in the UI. Are the underlying pods still running or not?
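A rough sketch of that log search, assuming the flyte-binary pod runs in the `flyte` namespace under the standard Helm labels (the namespace and selector are assumptions; adjust to your install):

```sh
# Pull recent flyte-binary logs and filter by the execution ID (placeholder below).
kubectl logs -n flyte -l app.kubernetes.io/name=flyte-binary --since=24h \
  | grep '<execution-id>'
```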
s
Hmm, we are using the flyte-binary deployment and I couldn't find any propeller logs. There are no underlying pods running. Checking the Flyte database, I can find two nodes stuck in the QUEUED state.
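For that database check, a hedged sketch of the kind of query involved, assuming the stock flyteadmin schema (the connection string, table, and column names are assumptions to verify against your database):

```sh
# Find nodes of one execution that are not yet terminal (assumed flyteadmin schema).
psql "$FLYTEADMIN_DB_URL" -c "
  SELECT node_id, phase, updated_at
  FROM node_executions
  WHERE execution_project = '<project>'
    AND execution_domain  = '<domain>'
    AND execution_name    = '<execution-id>'
    AND phase IN ('QUEUED', 'RUNNING')
  ORDER BY updated_at;"
```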
c
Oh, in flyte-binary I think the default log level for propeller is WARN or something..
So it's actually only logging fatal stuff, so I think you'd need to repro with logging configured to something like 4
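A minimal sketch of one way to raise that level on flyte-binary, assuming the Helm chart merges a `configuration.inline` block into the pod's config files (the exact key path is an assumption; confirm against your chart version):

```yaml
# values.yaml fragment for the flyte-binary Helm chart (illustrative).
configuration:
  inline:
    logger:
      level: 5          # higher = more verbose; ~5 is debug-level output
      show-source: true # include source file/line in each log entry
```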
s
Oh thanks, I'm going to try to set it here
Hey, I have set the logging level to 5, but flyte propeller doesn't log anything
c
If you exec into the pod, what does this show?
Copy code
~ $ cat /etc/flyte/config/logger.yaml 
logger:
  level: 10
s
There are only these two folders inside /etc/flyte:
Copy code
cluster-resource-templates  config.d
And inside `config.d` there are:
Copy code
000-core.yaml  001-plugins.yaml  002-database.yaml  003-storage.yaml  012-database-secrets.yaml  100-inline-config.yaml
The `000-core.yaml` content is:
Copy code
admin:
  endpoint: localhost:8089
  insecure: true
catalog-cache:
  endpoint: localhost:8081
  insecure: true
  type: datacatalog
cluster_resources:
  standaloneDeployment: false
  templatePath: /etc/flyte/cluster-resource-templates
logger:
  show-source: true
  level: 5
propeller:
  create-flyteworkflow-crd: true
webhook:
  certDir: /var/run/flyte/certs
  localCert: true
  secretName: flyte-flyte-binary-webhook-secret
  serviceName: flyte-flyte-binary-webhook
  servicePort: 443
flyte:
  admin:
    disableClusterResourceManager: false
    disableScheduler: false
    disabled: false
    seedProjects:
    - flytesnacks
  dataCatalog:
    disabled: false
  propeller:
    disableWebhook: false
    disabled: false
    logger:
      level: 5
c
Hm, I'm not sure then. I am not familiar with `flyte-binary` and use `flyte-core`.
s
Hmm I see, thanks anyway. What I'm experiencing right now is that there are pods that have completed, but in the Flyte workflow execution the node a pod references takes hours to change its state to SUCCEEDED; it stays in the RUNNING state for hours even though the pod has completed. I've also noticed a node that only appears in the FlyteWorkflow CRD as RUNNING but has no pod in the RUNNING state, and this node is not shown in the output of the `flytectl get execution` command.
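One way to compare what propeller has recorded against what Admin shows is to read the node status map directly from the FlyteWorkflow CRD. A sketch, assuming the custom resource is named after the execution ID, lives in the `<project>-<domain>` namespace, and that `jq` is available:

```sh
# Dump per-node phases, messages, and timestamps from the FlyteWorkflow CR.
kubectl get flyteworkflow '<execution-id>' -n '<project>-<domain>' -o json \
  | jq '.status'
```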
e
Hi @steep-summer-33106, when using flyte-binary, all components' logs are mixed together. Could you please share the full flyte-binary logs here?
There might be an issue in Flyte Admin causing Propeller events to keep failing to update.
s
Hi @echoing-account-76888, these are the full logs that flyte-binary is printing; I have set it to log level 5.
c
This looks problematic
Copy code
Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from SUCCEEDED to ABORTED for task execution task_id:{resource_type:TASK  project:\"e7c3d16f-a418-44b1-b9a9-3b6d438e5698\"  domain:\"production\"  name:\"dex-task\"  version:\"v1.0.4\"}  node_execution_id:{node_id:\"fdigi2yq\"  execution_id:{project:\"e7c3d16f-a418-44b1-b9a9-3b6d438e5698\"  domain:\"production\"  name:\"a2rntgxw5lkkkhpwfd4v\"}}]]. Trying to record state: ABORTED. Ignoring this error!
It looks like these aren't the full logs. Are you able to relaunch the execution and then grab the full logs for the execution ID?
s
This execution has been running for about 4 days; the full logs for it are really huge (GBs in size).
Something really strange I noticed is that after 2-3 days a new pod was created.
Copy code
azlr25dbszjpk4cdhvcx-fekmryya-0       0/1     Completed   0          3d4h
azlr25dbszjpk4cdhvcx-fhupkgwa-0       0/1     Completed   0          3d2h
azlr25dbszjpk4cdhvcx-fnqegaqi-0       0/1     Completed   0          3d4h
azlr25dbszjpk4cdhvcx-fr2optgi-0       0/1     Completed   0          6h1m
azlr25dbszjpk4cdhvcx-frnsjhpi-1       0/1     Completed   0          2d11h
azlr25dbszjpk4cdhvcx-fv1fncfy-0       0/1     Completed   0          3d4h
azlr25dbszjpk4cdhvcx-fyb61mtq-0       0/1     Completed   0          3d4h
azlr25dbszjpk4cdhvcx-fyegn4ai-0       0/1     Completed   0          3d4h
This pod references a node that doesn't appear in the `flytectl get execution` output, and in the FlyteWorkflow CRD it appears in the QUEUED state:
Copy code
"169f3fea-1c82-424d-8963-85ff04094a63__model.ake.gold_product_daily_kpis": {
                "lastUpdatedAt": "2025-09-28T21:10:51Z",
                "message": "node queued",
                "phase": 1,
                "queuedAt": "2025-09-28T21:10:51Z"
            },