steep-summer-33106
09/23/2025, 6:03 PM
…`kubectl get pods`). In the FlyteWorkflow CRD, many nodes remain in QUEUED and never transition.
• Checks so far:
◦ Cluster has plenty of CPU/memory.
◦ No signs of pod eviction or autoscaler interference.
◦ Individual tasks succeed when run alone.
◦ Problem only shows up when running the full DAG (~200 tasks).
Questions:
• Should flyte-binary/propeller handle a 200-node DAG without extra tuning, or do configs like `max-parallelism`, `queue-batch-size`, or `workers` need adjusting?
• What’s the best way to debug why nodes stay in QUEUED (beyond the FlyteWorkflow CRD, flyte-binary logs in debug, and propeller logs)? Any metrics or DB queries worth checking?
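To make the DB part concrete, this is the kind of query I mean (a sketch only — table/column names assumed from flyteadmin's default Postgres schema, worth verifying against your deployment; `<project>` and `<execution-name>` are placeholders):
```
-- Sketch: list nodes of one execution that flyteadmin still has in QUEUED.
-- Table/column names assumed from flyteadmin's default Postgres schema.
SELECT node_id, phase, updated_at
FROM node_executions
WHERE execution_project = '<project>'
  AND execution_domain  = '<domain>'
  AND execution_name    = '<execution-name>'
  AND phase = 'QUEUED'
ORDER BY updated_at;
```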
Happy to share logs or CRD dumps if helpful. Thanks!

clean-glass-36808
09/23/2025, 6:27 PM
`max-parallelism` can definitely prevent other nodes in the DAG from running if some nodes are stuck in a running state.
I'd look at the Flyte propeller logs and grep for the execution ID to try and see what might have happened. Flyte propeller is the state machine that progresses the workflow and sends update events to Flyte Admin, which are reflected in the UI.
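Something along these lines (namespace/deployment names are assumptions for a default flyte-binary install; adjust for yours):
```
# Sketch, assuming a flyte-binary install in namespace "flyte";
# <execution-id> is the execution name from the URL / flytectl output.
kubectl -n flyte logs deploy/flyte-binary --since=24h | grep -i '<execution-id>'
```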
Are the underlying pods still running or no?

clean-glass-36808
09/25/2025, 8:18 PM
```
~ $ cat /etc/flyte/config/logger.yaml
logger:
  level: 10
```
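If you want the same on your side, I believe the flyte-binary Helm chart exposes this roughly as follows (key path assumed from the chart's values layout, so double-check for your chart version):
```
# Sketch of flyte-binary Helm values; higher number = more verbose.
configuration:
  logging:
    level: 6
```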
steep-summer-33106
09/26/2025, 1:04 PM
…`cluster-resource-templates` and `config.d`. And inside `config.d` there are
```
000-core.yaml  001-plugins.yaml  002-database.yaml  003-storage.yaml  012-database-secrets.yaml  100-inline-config.yaml
```
The `000-core.yaml` content is
```
admin:
  endpoint: localhost:8089
  insecure: true
catalog-cache:
  endpoint: localhost:8081
  insecure: true
  type: datacatalog
cluster_resources:
  standaloneDeployment: false
  templatePath: /etc/flyte/cluster-resource-templates
logger:
  show-source: true
  level: 5
propeller:
  create-flyteworkflow-crd: true
webhook:
  certDir: /var/run/flyte/certs
  localCert: true
  secretName: flyte-flyte-binary-webhook-secret
  serviceName: flyte-flyte-binary-webhook
  servicePort: 443
flyte:
  admin:
    disableClusterResourceManager: false
    disableScheduler: false
    disabled: false
    seedProjects:
      - flytesnacks
  dataCatalog:
    disabled: false
  propeller:
    disableWebhook: false
    disabled: false
logger:
  level: 5
```
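None of the knobs from my original question show up here. If I were to experiment, my understanding is something like the following could go into the inline config that renders as `100-inline-config.yaml` (values purely illustrative, not recommendations — defaults vary by Flyte version):
```
# Sketch only — illustrative values, verify against your Flyte version's docs.
flyteadmin:
  maxParallelism: 50   # per-execution cap on concurrently running nodes
propeller:
  workers: 40          # worker goroutines evaluating workflows
  queue:
    batch-size: -1     # queue-batch-size; -1 = unbatched
```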
clean-glass-36808
09/26/2025, 4:30 PM
…`flyte-binary` and use `flyte-core`
steep-summer-33106
09/26/2025, 9:23 PM
…`flytectl get execution` command
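Concretely, I pulled the node-level view with something along these lines (placeholders for project/ID):
```
# Sketch: node executions for a single run; --details includes node-level state.
flytectl get execution -p <project> -d production <execution-id> --details -o yaml
```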
clean-glass-36808
09/29/2025, 5:47 PM
```
Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from SUCCEEDED to ABORTED for task execution task_id:{resource_type:TASK project:"e7c3d16f-a418-44b1-b9a9-3b6d438e5698" domain:"production" name:"dex-task" version:"v1.0.4"} node_execution_id:{node_id:"fdigi2yq" execution_id:{project:"e7c3d16f-a418-44b1-b9a9-3b6d438e5698" domain:"production" name:"a2rntgxw5lkkkhpwfd4v"}}]]. Trying to record state: ABORTED. Ignoring this error!
```
steep-summer-33106
09/29/2025, 6:14 PM
```
azlr25dbszjpk4cdhvcx-fekmryya-0   0/1   Completed   0   3d4h
azlr25dbszjpk4cdhvcx-fhupkgwa-0   0/1   Completed   0   3d2h
azlr25dbszjpk4cdhvcx-fnqegaqi-0   0/1   Completed   0   3d4h
azlr25dbszjpk4cdhvcx-fr2optgi-0   0/1   Completed   0   6h1m
azlr25dbszjpk4cdhvcx-frnsjhpi-1   0/1   Completed   0   2d11h
azlr25dbszjpk4cdhvcx-fv1fncfy-0   0/1   Completed   0   3d4h
azlr25dbszjpk4cdhvcx-fyb61mtq-0   0/1   Completed   0   3d4h
azlr25dbszjpk4cdhvcx-fyegn4ai-0   0/1   Completed   0   3d4h
```
This pod references a node that doesn't appear in the `flytectl get execution` output, and in the FlyteWorkflow CRD it appears in the QUEUED state:
"169f3fea-1c82-424d-8963-85ff04094a63__model.ake.gold_product_daily_kpis": {
"lastUpdatedAt": "2025-09-28T21:10:51Z",
"message": "node queued",
"phase": 1,
"queuedAt": "2025-09-28T21:10:51Z"
},
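For reference, that snippet comes from the CRD; assuming the node map sits at `.status.nodeStatus` (which is what my dump looks like), something like this lists every node's phase — the CRD name matches the execution ID, and the namespace is typically `<project>-<domain>`:
```
# Sketch: per-node phase/message from the FlyteWorkflow CRD; requires jq.
kubectl -n <project>-production get flyteworkflow azlr25dbszjpk4cdhvcx -o json \
  | jq '.status.nodeStatus | map_values({phase, message})'
```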