# flyte-support
s
Hi guys, we’re seeing an issue with Flyte (v1.13.3, flyte-binary on EKS). A workflow with ~200 nodes gets stuck in RUNNING indefinitely.
• Symptom: No pods are running (`kubectl get pods`). In the FlyteWorkflow CRD, many nodes remain in QUEUED and never transition.
• Checks so far:
  ◦ Cluster has plenty of CPU/memory.
  ◦ No signs of pod eviction or autoscaler interference.
  ◦ Individual tasks succeed when run alone.
  ◦ The problem only shows up when running the full DAG (~200 tasks).
Questions:
• Should flyte-binary/propeller handle a 200-node DAG without extra tuning, or do configs like `max-parallelism`, `queue-batch-size`, or `workers` need adjusting? (See the config sketch after this message.)
• What’s the best way to debug why nodes stay in QUEUED (beyond the FlyteWorkflow CRD, flyte-binary logs in debug, and propeller logs)? Any metrics or DB queries worth checking?
Happy to share logs or CRD dumps if helpful. Thanks!
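For reference, a minimal sketch of the knobs named above, expressed as FlytePropeller configuration that flyte-binary can pick up (for example via the chart's inline config). The key paths and values here are illustrative assumptions for v1.13.x; verify them against the FlytePropeller configuration reference before applying anything:

```yaml
# Illustrative propeller throughput settings for a large DAG (assumed key paths).
propeller:
  workers: 40                    # concurrent workflow-processing workers
  workflow-reeval-duration: 30s  # how often a workflow is re-evaluated
  queue:
    type: batch
    batching-interval: 2s
    batch-size: -1               # -1 = no cap on items drained per batch
# Note: max-parallelism is an execution/launch-plan level setting (with a default
# in flyteadmin), so raising propeller workers alone does not lift it.
```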
c
`max-parallelism` can definitely prevent other nodes in the DAG from running if some nodes are stuck in a running state. I'd look at the Flyte propeller logs and grep for the execution ID to try and see what might have happened. Flyte propeller is the state machine that progresses the workflow and sends update events to Flyte Admin, which are reflected in the UI. Are the underlying pods still running or not?
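A rough sketch of that log search, assuming the flyte-binary pod runs in the `flyte` namespace under the standard Helm labels (the namespace and selector are assumptions; adjust to your install):

```sh
# Pull recent flyte-binary logs and filter by the execution ID (placeholder below).
kubectl logs -n flyte -l app.kubernetes.io/name=flyte-binary --since=24h \
  | grep '<execution-id>'
```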
s
Hmm, we are using the flyte-binary deployment and I couldn't find any propeller logs. There are no underlying pods running. Checking the Flyte database, I can find two nodes stuck in the QUEUED state.
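For that database check, a hedged sketch of the kind of query involved, assuming the stock flyteadmin schema (the connection string, table, and column names are assumptions to verify against your database):

```sh
# Find nodes of one execution that are not yet terminal (assumed flyteadmin schema).
psql "$FLYTEADMIN_DB_URL" -c "
  SELECT node_id, phase, updated_at
  FROM node_executions
  WHERE execution_project = '<project>'
    AND execution_domain  = '<domain>'
    AND execution_name    = '<execution-id>'
    AND phase IN ('QUEUED', 'RUNNING')
  ORDER BY updated_at;"
```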
c
Oh, in flyte-binary I think the default log level for propeller is WARN or something..
So it's actually only logging fatal stuff, so I think you'd need to repro with logging configured to something like 4
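A minimal sketch of one way to raise that level on flyte-binary, assuming the Helm chart merges a `configuration.inline` block into the pod's config files (the exact key path is an assumption; confirm against your chart version):

```yaml
# values.yaml fragment for the flyte-binary Helm chart (illustrative).
configuration:
  inline:
    logger:
      level: 5          # higher = more verbose; ~5 is debug-level output
      show-source: true # include source file/line in each log entry
```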
s
Oh thanks, I'm going to try to set it here
Hey, I have set the logging level to 5, but flyte propeller doesn't log anything
c
If you exec into the pod, what does this show?
Copy code
~ $ cat /etc/flyte/config/logger.yaml 
logger:
  level: 10
s
There are only these two folders inside /etc/flyte:
Copy code
cluster-resource-templates  config.d
And inside `config.d` there are:
Copy code
000-core.yaml  001-plugins.yaml  002-database.yaml  003-storage.yaml  012-database-secrets.yaml  100-inline-config.yaml
The `000-core.yaml` content is:
Copy code
admin:
  endpoint: localhost:8089
  insecure: true
catalog-cache:
  endpoint: localhost:8081
  insecure: true
  type: datacatalog
cluster_resources:
  standaloneDeployment: false
  templatePath: /etc/flyte/cluster-resource-templates
logger:
  show-source: true
  level: 5
propeller:
  create-flyteworkflow-crd: true
webhook:
  certDir: /var/run/flyte/certs
  localCert: true
  secretName: flyte-flyte-binary-webhook-secret
  serviceName: flyte-flyte-binary-webhook
  servicePort: 443
flyte:
  admin:
    disableClusterResourceManager: false
    disableScheduler: false
    disabled: false
    seedProjects:
    - flytesnacks
  dataCatalog:
    disabled: false
  propeller:
    disableWebhook: false
    disabled: false
    logger:
      level: 5
c
Hm, I'm not sure then. I am not familiar with `flyte-binary` and use `flyte-core`.
s
Hmm I see, thanks anyway. What I'm experiencing right now is that there are pods that have completed, but in the Flyte workflow execution the node a pod references takes hours to change its state to SUCCEEDED; it stays in the RUNNING state for hours even though the pod has completed. I've also noticed a node that only appears in the FlyteWorkflow CRD as RUNNING but has no pod in the RUNNING state, and this node is not shown in the output of the `flytectl get execution` command.
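One way to compare what propeller has recorded against what Admin shows is to read the node status map directly from the FlyteWorkflow CRD. A sketch, assuming the custom resource is named after the execution ID, lives in the `<project>-<domain>` namespace, and that `jq` is available:

```sh
# Dump per-node phases, messages, and timestamps from the FlyteWorkflow CR.
kubectl get flyteworkflow '<execution-id>' -n '<project>-<domain>' -o json \
  | jq '.status'
```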
e
Hi @steep-summer-33106, when using flyte-binary, all components' logs are mixed together. Could you please share the full flyte-binary logs here?
There might be an issue in Flyte Admin causing Propeller events to keep failing to update.
s
Hi @echoing-account-76888, these are the full logs that flyte-binary is printing; I have set it to log level 5.
c
This looks problematic
Copy code
Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from SUCCEEDED to ABORTED for task execution task_id:{resource_type:TASK  project:\"e7c3d16f-a418-44b1-b9a9-3b6d438e5698\"  domain:\"production\"  name:\"dex-task\"  version:\"v1.0.4\"}  node_execution_id:{node_id:\"fdigi2yq\"  execution_id:{project:\"e7c3d16f-a418-44b1-b9a9-3b6d438e5698\"  domain:\"production\"  name:\"a2rntgxw5lkkkhpwfd4v\"}}]]. Trying to record state: ABORTED. Ignoring this error!
It looks like these aren't the full logs. Are you able to relaunch the execution and then grab the full logs for the execution ID?
s
This execution has been running for about 4 days; the full logs for it are really huge (GBs in size).
Something really strange I noticed is that after 2-3 days a new pod was created.
Copy code
azlr25dbszjpk4cdhvcx-fekmryya-0       0/1     Completed   0          3d4h
azlr25dbszjpk4cdhvcx-fhupkgwa-0       0/1     Completed   0          3d2h
azlr25dbszjpk4cdhvcx-fnqegaqi-0       0/1     Completed   0          3d4h
azlr25dbszjpk4cdhvcx-fr2optgi-0       0/1     Completed   0          6h1m
azlr25dbszjpk4cdhvcx-frnsjhpi-1       0/1     Completed   0          2d11h
azlr25dbszjpk4cdhvcx-fv1fncfy-0       0/1     Completed   0          3d4h
azlr25dbszjpk4cdhvcx-fyb61mtq-0       0/1     Completed   0          3d4h
azlr25dbszjpk4cdhvcx-fyegn4ai-0       0/1     Completed   0          3d4h
This pod references a node that doesn't appear in the `flytectl get execution` output, and in the FlyteWorkflow CRD it appears in the QUEUED state:
Copy code
"169f3fea-1c82-424d-8963-85ff04094a63__model.ake.gold_product_daily_kpis": {
                "lastUpdatedAt": "2025-09-28T21:10:51Z",
                "message": "node queued",
                "phase": 1,
                "queuedAt": "2025-09-28T21:10:51Z"
            },