<@U06RKAMK0UC>, Seeing the below error logs in flyte ...
# ask-ai
@ancient-wolf-19325, Seeing the below error log in the flyte propeller pod when I trigger an execution:
W0729 09:39:48.966687 1 warnings.go:70] metadata.finalizers: "flyte-finalizer": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers
This warning means Flyte is using a non-domain-qualified finalizer ("flyte-finalizer") on Kubernetes resources, which Kubernetes now discourages in favor of domain-qualified names (e.g., "flyte.org/finalizer") to avoid conflicts with other finalizer writers. Recent Flyte versions have switched to domain-qualified finalizers to address this warning, but older deployments or resources may still use the old name. The warning is harmless but worth resolving for future compatibility and cleaner logs. See issue #6019 and PR #6023 for details. Would you like more detail on how to update your deployment or migrate existing resources to the new finalizer? Sources: Flyte Issue #6019, Flyte PR #6023
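If you want to see which finalizer name your existing workflow objects actually carry, a quick check (a hedged sketch; the namespace is illustrative):
# Old deployments show "flyte-finalizer"; patched ones a domain-qualified name.
kubectl get flyteworkflows.flyte.lyft.com -n <namespace> \
  -o custom-columns='NAME:.metadata.name,FINALIZERS:.metadata.finalizers'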
@ancient-wolf-19325, Also seeing this log:
E0729 09:39:28.888441 1 workers.go:103] error syncing 'duploservices-x-forge/as2rzpf4pmdllvqhbmkc': Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com "as2rzpf4pmdllvqhbmkc": the object has been modified; please apply your changes to the latest version and try again
This error means FlytePropeller tried to update a FlyteWorkflow CRD, but the resource was modified by another process before the update, causing a resource version conflict. This is a known behavior of Flyte's reconciliation loop and can occur during normal concurrent operations; FlytePropeller will retry automatically. The error is not fatal but may cause log noise or, in rare cases, workflow reprocessing delays. See Flyte Issue #2546 and Flyte Issue #4020 for more context. Would you like more detail on mitigation strategies or recent fixes related to this error? Sources: Flyte Issue #2546, Flyte Issue #4020
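To see the optimistic-concurrency token the conflict refers to, you can read it off the object directly (a hedged sketch; the names are taken from the log line above):
# An update based on a stale resourceVersion produces exactly this
# "object has been modified" error; propeller re-reads and retries.
kubectl get flyteworkflows.flyte.lyft.com as2rzpf4pmdllvqhbmkc \
  -n duploservices-x-forge -o jsonpath='{.metadata.resourceVersion}'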
@ancient-wolf-19325 FlyteWorkflow CRD? Where exactly do I need to check this?
The FlyteWorkflow CRD (Custom Resource Definition) is a Kubernetes resource that represents each workflow execution in Flyte. You can check FlyteWorkflow CRDs using `kubectl` in the namespace where your Flyte project runs. For example, run:
kubectl get flyteworkflows.flyte.lyft.com -A
This command lists all FlyteWorkflow CRD instances across all namespaces. To inspect a specific workflow, use:
kubectl describe flyteworkflows.flyte.lyft.com <workflow-name> -n <namespace>
Would you like more detail on interpreting the CRD fields or troubleshooting specific issues? Sources: FlytePropeller Architecture, FlyteWorkflow CRD Example
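To pull out just the fields that usually matter when debugging, a hedged sketch (the field paths follow the FlyteWorkflow objects shown later in this thread):
# Print only the execution phase and the last recorded error message.
kubectl get flyteworkflows.flyte.lyft.com <workflow-name> -n <namespace> \
  -o jsonpath='{.status.phase}{"\n"}{.status.message}{"\n"}'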
@ancient-wolf-19325 kubectl get flyteworkflows.flyte.lyft.com -A
NAMESPACE               NAME                   AGE
duploservices-x-forge   a4bjsf922k5dk9dfmkjx   44m
duploservices-x-forge   agq78hrwkhxkj9tzlzqz   6m36s
duploservices-x-forge   ajbwmbq94dwz74gwqrg5   12m
duploservices-x-forge   as2rzpf4pmdllvqhbmkc   30m
duploservices-x-forge   azts9fwq56xbppbn2fvx   37m
You are correctly listing all FlyteWorkflow CRD instances in the duploservices-x-forge namespace. Each NAME corresponds to a unique workflow execution managed by FlytePropeller. These resources track the state and progress of your Flyte workflows in Kubernetes. Would you like more detail on how to inspect the status or debug a specific workflow execution? Sources: FlytePropeller Architecture
@ancient-wolf-19325, yes, I would like more detail on how to inspect the status and debug a specific workflow execution.
To inspect and debug a specific Flyte workflow execution, use the following steps:
1. Describe the FlyteWorkflow CRD:
kubectl describe flyteworkflows.flyte.lyft.com <workflow-name> -n <namespace>
This shows the workflow's spec (DAG, nodes) and status (phases, errors, timestamps). Look in the `status` section for node/task phases and error messages.
2. Check pod status. List pods in the namespace to see if any are stuck or failed:
kubectl get pods -n <namespace>
Describe a pod for more details:
kubectl describe pod <pod-name> -n <namespace>
3. Use the Flyte UI or CLI. In the Flyte UI, navigate to the workflow execution to view the execution graph, node statuses, logs, and errors. With Flytectl, run:
flytectl get execution -p <project> -d <domain> <execid> --details
This provides node/task execution details and log links.
4. Check logs. Access logs for FlytePropeller and the task pods for error messages.
Would you like a step-by-step example with sample output or more detail on interpreting specific fields? Sources: FlytePropeller Architecture, Debugging Executions, Inspecting Executions
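Combining steps 2 and 4, one way to dump logs for every failed pod of a single execution (a hedged sketch; the `execution-id` pod label is an assumption based on the labels Flyte attaches to its objects):
kubectl get pods -n <namespace> -l execution-id=<execid>
for p in $(kubectl get pods -n <namespace> -l execution-id=<execid> \
    --field-selector=status.phase=Failed -o name); do
  kubectl logs -n <namespace> "$p" --all-containers
done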
Hey @victorious-jelly-28729, what's the current status? Could you share the propeller logs in case you still need help?
@average-finland-92144, Now workflow execution is not happening until I restart the flytepropeller pods.
kubectl describe flyteworkflows.flyte.lyft.com auto_fine_tuning.workflows.auto_fine_tuning.auto_fine_tuning_wf -n duploservices-kpmg
Error from server (Forbidden): flyteworkflows.flyte.lyft.com "auto_fine_tuning.workflows.auto_fine_tuning.auto_fine_tuning_wf" is forbidden: User "system:serviceaccount:duploservices-kpmg:duploservices-kpmg-edit-user" cannot get resource "flyteworkflows" in API group "flyte.lyft.com" in the namespace "duploservices-kpmg"
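As an aside, that Forbidden error is an RBAC gap for the kubectl user rather than a propeller problem; one way to confirm and close it (a hedged sketch; the Role and RoleBinding names are illustrative):
# Check whether the service account may read FlyteWorkflow objects.
kubectl auth can-i get flyteworkflows.flyte.lyft.com -n duploservices-kpmg \
  --as=system:serviceaccount:duploservices-kpmg:duploservices-kpmg-edit-user
# If "no", grant read access with a namespaced Role and RoleBinding.
kubectl create role flyteworkflow-reader -n duploservices-kpmg \
  --verb=get,list,watch --resource=flyteworkflows.flyte.lyft.com
kubectl create rolebinding flyteworkflow-reader -n duploservices-kpmg \
  --role=flyteworkflow-reader \
  --serviceaccount=duploservices-kpmg:duploservices-kpmg-edit-user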
@average-finland-92144, let me share the logs
Propeller logs: @average-finland-92144
time="2025-08-14T11:58:41Z" level=info msg=------------------------------------------------------------------------
time="2025-08-14T11:58:41Z" level=info msg="App [flytepropeller], Version [unknown], BuildSHA [unknown], BuildTS [2025-08-14 11:58:41.676459895 +0000 UTC m=+0.001183164]"
time="2025-08-14T11:58:41Z" level=info msg=------------------------------------------------------------------------
time="2025-08-14T11:58:41Z" level=info msg="Detected: 4 CPU's\n"
{"json":{},"level":"warning","msg":"stow configuration section missing, defaulting to legacy s3/minio connection config","ts":"2025-08-14T11:58:41Z"}
I0814 11:58:42.024579       1 leaderelection.go:250] attempting to acquire leader lease duploservices-kpmg/propeller-leader...
time="2025-08-14T11:58:43Z" level=info msg=------------------------------------------------------------------------
time="2025-08-14T11:58:43Z" level=info msg="App [flytepropeller], Version [unknown], BuildSHA [unknown], BuildTS [2025-08-14 11:58:43.351075208 +0000 UTC m=+0.001096395]"
time="2025-08-14T11:58:43Z" level=info msg=------------------------------------------------------------------------
time="2025-08-14T11:58:43Z" level=info msg="Detected: 4 CPU's\n"
{"json":{},"level":"warning","msg":"stow configuration section missing, defaulting to legacy s3/minio connection config","ts":"2025-08-14T11:58:43Z"}
I0814 11:58:43.725951       1 leaderelection.go:250] attempting to acquire leader lease duploservices-kpmg/propeller-leader...
I0814 11:58:58.713172       1 leaderelection.go:260] successfully acquired lease duploservices-kpmg/propeller-leader
{"json":{"exec_id":"n6-ymujj8y7e","node":"n0","ns":"duploservices-kpmg","res_ver":"120055912","routine":"worker-1","tasktype":"python-task","wf":"duploservices-kpmg:development:auto_fine_tuning.workflows.auto_fine_tuning.auto_fine_tuning_wf"},"level":"warning","msg":"Failed to record taskEvent, error [AlreadyExists: Event already exists, caused by [rpc error: code = AlreadyExists desc = have already recorded task execution phase SUCCEEDED (version: 0) for {{{} [] [] <nil>} 0 [] resource_type:TASK project:\"duploservices-kpmg\" domain:\"development\" name:\"auto_fine_tuning.data.finetuning_data.generate_finetuning_data\" version:\"ft1.3\" node_id:\"n0\" execution_id:{project:\"duploservices-kpmg\" domain:\"development\" name:\"n6-ymujj8y7e\"} 0}]]. Trying to record state: SUCCEEDED. Ignoring this error!","ts":"2025-08-14T11:58:59Z"}
{"json":{"exec_id":"n6-ymujj8y7e","node":"n1","ns":"duploservices-kpmg","res_ver":"120058691","routine":"worker-20","tasktype":"python-task","wf":"duploservices-kpmg:development:auto_fine_tuning.workflows.auto_fine_tuning.auto_fine_tuning_wf"},"level":"warning","msg":"Failed to cast contentMD5 [] to string","ts":"2025-08-14T12:01:37Z"}
{"json":{"exec_id":"v1n-f2mqw9zi","ns":"duploservices-kpmg","routine":"worker-17"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"v1n-f2mqw9zi\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-08-14T12:09:37Z"}
E0814 12:09:37.983709       1 workers.go:103] error syncing 'duploservices-kpmg/v1n-f2mqw9zi': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "v1n-f2mqw9zi": the object has been modified; please apply your changes to the latest version and try again
{"json":{"exec_id":"v1n-f2mqw9zi","ns":"duploservices-kpmg","routine":"worker-19"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"v1n-f2mqw9zi\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-08-14T12:09:48Z"}
E0814 12:09:48.071878       1 workers.go:103] error syncing 'duploservices-kpmg/v1n-f2mqw9zi': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "v1n-f2mqw9zi": the object has been modified; please apply your changes to the latest version and try again
W0814 12:10:28.124746       1 warnings.go:70] metadata.finalizers: "flyte-finalizer": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers
{"json":{"exec_id":"v1n-f2mqw9zi","node":"n0","ns":"duploservices-kpmg","res_ver":"120067077","routine":"worker-13","tasktype":"python-task","wf":"duploservices-kpmg:development:auto_fine_tuning.workflows.auto_fine_tuning.auto_fine_tuning_wf"},"level":"warning","msg":"Failed to cast contentMD5 [] to string","ts":"2025-08-14T12:10:47Z"}
{"json":{"exec_id":"v1n-f2mqw9zi","node":"n0","ns":"duploservices-kpmg","res_ver":"120067292","routine":"worker-13","wf":"duploservices-kpmg:development:auto_fine_tuning.workflows.auto_fine_tuning.auto_fine_tuning_wf"},"level":"warning","msg":"Trying to abort a node in state [Failed]","ts":"2025-08-14T12:10:47Z"}
{"json":{"exec_id":"v1n-f2mqw9zi","ns":"duploservices-kpmg","routine":"worker-25"},"level":"warning","msg":"Workflow namespace[duploservices-kpmg]/name[v1n-f2mqw9zi] has already been terminated.","ts":"2025-08-14T12:11:07Z"}
{"json":{"exec_id":"kui6eqjcj2r6","ns":"duploservices-kpmg","routine":"worker-3"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"kui6eqjcj2r6\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-08-14T12:15:31Z"}
E0814 12:15:31.210232       1 workers.go:103] error syncing 'duploservices-kpmg/kui6eqjcj2r6': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "kui6eqjcj2r6": the object has been modified; please apply your changes to the latest version and try again
{"json":{"exec_id":"kui6eqjcj2r6","ns":"duploservices-kpmg","routine":"worker-5"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"kui6eqjcj2r6\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-08-14T12:15:41Z"}
E0814 12:15:41.300136       1 workers.go:103] error syncing 'duploservices-kpmg/kui6eqjcj2r6': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "kui6eqjcj2r6": the object has been modified; please apply your changes to the latest version and try again
{"json":{"exec_id":"kui6eqjcj2r6","ns":"duploservices-kpmg","routine":"worker-8"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"kui6eqjcj2r6\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-08-14T12:16:01Z"}
E0814 12:16:01.305906       1 workers.go:103] error syncing 'duploservices-kpmg/kui6eqjcj2r6': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "kui6eqjcj2r6": the object has been modified; please apply your changes to the latest version and try again
W0814 12:17:21.377989       1 warnings.go:70] metadata.finalizers: "flyte-finalizer": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers
{"json":{"exec_id":"kui6eqjcj2r6","ns":"duploservices-kpmg","routine":"worker-21"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"kui6eqjcj2r6\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-08-14T12:17:21Z"}
E0814 12:17:21.378197       1 workers.go:103] error syncing 'duploservices-kpmg/kui6eqjcj2r6': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "kui6eqjcj2r6": the object has been modified; please apply your changes to the latest version and try again
W0814 12:19:21.442075       1 warnings.go:70] metadata.finalizers: "flyte-finalizer": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers
{"json":{"exec_id":"kui6eqjcj2r6","ns":"duploservices-kpmg","routine":"worker-30"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"kui6eqjcj2r6\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-08-14T12:19:21Z"}
E0814 12:19:21.442244       1 workers.go:103] error syncing 'duploservices-kpmg/kui6eqjcj2r6': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "kui6eqjcj2r6": the object has been modified; please apply your changes to the latest version and try again
{"json":{"exec_id":"kui6eqjcj2r6","ns":"duploservices-kpmg","routine":"worker-33"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"kui6eqjcj2r6\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-08-14T12:21:21Z"}
E0814 12:21:21.518997       1 workers.go:103] error syncing 'duploservices-kpmg/kui6eqjcj2r6': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "kui6eqjcj2r6": the object has been modified; please apply your changes to the latest version and try again
{"json":{"exec_id":"kui6eqjcj2r6","ns":"duploservices-kpmg","routine":"worker-7"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"kui6eqjcj2r6\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-08-14T12:23:21Z"}
E0814 12:23:21.610229       1 workers.go:103] error syncing 'duploservices-kpmg/kui6eqjcj2r6': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "kui6eqjcj2r6": the object has been modified; please apply your changes to the latest version and try again
Observation:
apiVersion: flyte.lyft.com/v1alpha1
kind: FlyteWorkflow
metadata:
  creationTimestamp: '2025-08-14T12:15:26Z'
  generation: 2
  labels:
    domain: development
    execution-id: kui6eqjcj2r6
    project: duploservices-kpmg
    shard-key: '19'
    workflow-name: auto-fine-tuning-workflows-auto-fine-tuning-auto-fine-tuning-wf
  name: kui6eqjcj2r6
  namespace: duploservices-kpmg
  resourceVersion: '120071686'
  uid: 378fd071-2294-4b4a-a414-d81193ae0a0b
  selfLink: >-
    /apis/flyte.lyft.com/v1alpha1/namespaces/duploservices-kpmg/flyteworkflows/kui6eqjcj2r6
status:
  failedAttempts: 1
  message: "failed at Node[start-node]. CausedByError: Failed to store workflow inputs (as start node), caused by: Failed to write data [0b] to path [metadata/propeller/duploservices-kpmg-development-kui6eqjcj2r6/start-node/data/0/outputs.pb].: PutObject, putting object: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: 41483abe-6ab8-4312-83c4-965f42a96ae5"
  phase: 0
spec:
  connections:
    n0:
      - n1
    n1:
      - n2
    n2:
      - end-node
    start-node:
      - n0
  edges:
    downstream:
      n0:
        - n1
      n1:
        - n2
      n2:
        - end-node
      start-node:
        - n0
    upstream:
      end-node:
        - n2
      n0:
        - start-node
      n1:
        - n0
      n2:
        - n1
  id: >-
    duploservices-kpmg:development:auto_fine_tuning.workflows.auto_fine_tuning.auto_fine_tuning_wf
  nodes:
    end-node:
      id: end-node
      kind: end
      resources: {}
    n0:
      id: n0
      kind: task
      name: generate_finetuning_data
      resources: {}
      task: >-
        resource_type:TASK project:"duploservices-kpmg" domain:"development"
        name:"auto_fine_tuning.data.finetuning_data.generate_finetuning_data"
        version:"ft1.3"
    n1:
      id: n1
      kind: task
      name: combined_dataset_preparation
      resources: {}
      task: >-
        resource_type:TASK project:"duploservices-kpmg" domain:"development"
        name:"auto_fine_tuning.dataset_preparation.combine_dataset.combined_dataset_preparation"
        version:"ft1.3"
    n2:
      id: n2
      inputBindings:
        - binding:
            promise:
              attrPath:
                - stringValue: o0
              nodeId: n1
              var: o0
          var: job_id
      kind: task
      name: fine_tune
      resources: {}
      task: >-
        resource_type:TASK project:"duploservices-kpmg" domain:"development"
        name:"auto_fine_tuning.fine_tuning.fine_tune.fine_tune" version:"ft1.3"
    start-node:
      id: start-node
      kind: start
      resources: {}
  outputs: {}
acceptedAt: '2025-08-14T12:15:25Z'
executionConfig:
  EnvironmentVariables:
    AUTH_TOKEN: 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
    CHUNK_COUNT_TO_BE_PROCESSED: '500'
    FTM_POSTGRES_DATABASE: forge
    FTM_POSTGRES_HOST: duplopgdb-kpmg.clastecrvnqt.us-west-2.rds.amazonaws.com
    FTM_POSTGRES_PASSWORD: W3g2P5FT&6p[Z5pU
    FTM_POSTGRES_USER: xforgeuser
    KB_ID: 9d491af8-0835-4c5f-b9f5-76d09f925a3b
    LLAMA_MODEL_URL: ''
    MODEL_NAME: meta-llama/Llama-3.1-8B-Instruct
    MODEL_TYPE: generator
    OPEN_API_KEY: >-
      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
    S3_SOURCE_BUCKET: sagemaker-us-west-2-089910700032
    TENANT_ID: 8b89b65f-d0fd-4e2a-8e9a-acc507e0c118
    USER_ID: e58503eb-e551-48fe-aa2f-ac652984281d
  Interruptible: null
  MaxParallelism: 25
  OverwriteCache: false
  RecoveryExecution: {}
  TaskPluginImpls: {}
  TaskResources:
    Limits:
      CPU: '24'
      EphemeralStorage: '0'
      GPU: '1'
      Memory: 50Gi
      Storage: '0'
    Requests:
      CPU: '1'
      EphemeralStorage: '0'
      GPU: '0'
      Memory: 1000Mi
      Storage: '0'
executionId:
  domain: development
  name: kui6eqjcj2r6
  project: duploservices-kpmg
inputs: {}
node-defaults: {}
rawOutputDataConfig: {}
securityContext:
  run_as: {}
tasks:
  resource_type:TASK project:"duploservices-kpmg" domain:"development" name:"auto_fine_tuning.data.finetuning_data.generate_finetuning_data" version:"ft1.3":
    container:
      args:
        - pyflyte-fast-execute
        - '--additional-distribution'
        - >-
          s3://duploservices-kpmg-flyte-089910700032/duploservices-kpmg/development/64ALQVW3AMMCNCDTHFM2U5NDDY======/fast4734933ef38579c5bbf523ce20455b56.tar.gz
        - '--dest-dir'
        - /root
        - '--'
        - pyflyte-execute
        - '--inputs'
        - '{{.input}}'
        - '--output-prefix'
        - '{{.outputPrefix}}'
        - '--raw-output-data-prefix'
        - '{{.rawOutputDataPrefix}}'
        - '--checkpoint-path'
        - '{{.checkpointOutputPrefix}}'
        - '--prev-checkpoint'
        - '{{.prevCheckpointPrefix}}'
        - '--resolver'
        - flytekit.core.python_auto_container.default_task_resolver
        - '--'
        - task-module
        - auto_fine_tuning.data.finetuning_data
        - task-name
        - generate_finetuning_data
      image: registry.uniphore.com/uniphore-manual/instruction-dataset:v-5
      resources:
        limits:
          - name: CPU
            value: 1500m
          - name: MEMORY
            value: 4000Mi
        requests:
          - name: CPU
            value: 1500m
          - name: MEMORY
            value: 4000Mi
    id:
      domain: development
      name: auto_fine_tuning.data.finetuning_data.generate_finetuning_data
      project: duploservices-kpmg
      resourceType: TASK
      version: ft1.3
    interface:
      inputs: {}
      outputs: {}
    metadata:
      retries: {}
      runtime:
        flavor: python
        type: FLYTE_SDK
        version: 1.13.5
    securityContext:
      secrets:
        - group: hugging-face-token
          key: auth_token
    type: python-task
  resource_type:TASK project:"duploservices-kpmg" domain:"development" name:"auto_fine_tuning.dataset_preparation.combine_dataset.combined_dataset_preparation" version:"ft1.3":
    container:
      args:
        - pyflyte-fast-execute
        - '--additional-distribution'
        - >-
          s3://duploservices-kpmg-flyte-089910700032/duploservices-kpmg/development/64ALQVW3AMMCNCDTHFM2U5NDDY======/fast4734933ef38579c5bbf523ce20455b56.tar.gz
        - '--dest-dir'
        - /root
        - '--'
        - pyflyte-execute
        - '--inputs'
        - '{{.input}}'
        - '--output-prefix'
        - '{{.outputPrefix}}'
        - '--raw-output-data-prefix'
        - '{{.rawOutputDataPrefix}}'
        - '--checkpoint-path'
        - '{{.checkpointOutputPrefix}}'
        - '--prev-checkpoint'
        - '{{.prevCheckpointPrefix}}'
        - '--resolver'
        - flytekit.core.python_auto_container.default_task_resolver
        - '--'
        - task-module
        - auto_fine_tuning.dataset_preparation.combine_dataset
        - task-name
        - combined_dataset_preparation
      image: registry.uniphore.com/uniphore-manual/instruction-dataset:v-5
      resources:
        limits:
          - name: CPU
            value: 1500m
          - name: MEMORY
            value: 4000Mi
        requests:
          - name: CPU
            value: 1500m
          - name: MEMORY
            value: 4000Mi
    id:
      domain: development
      name: >-
        auto_fine_tuning.dataset_preparation.combine_dataset.combined_dataset_preparation
      project: duploservices-kpmg
      resourceType: TASK
      version: ft1.3
    interface:
      inputs: {}
      outputs:
        variables:
          o0:
            type:
              simple: STRING
    metadata:
      retries: {}
      runtime:
        flavor: python
        type: FLYTE_SDK
        version: 1.13.5
    securityContext:
      secrets:
        - group: hugging-face-token
          key: auth_token
    type: python-task
  resource_type:TASK project:"duploservices-kpmg" domain:"development" name:"auto_fine_tuning.fine_tuning.fine_tune.fine_tune" version:"ft1.3":
    container:
      args:
        - pyflyte-fast-execute
        - '--additional-distribution'
        - >-
          s3://duploservices-kpmg-flyte-089910700032/duploservices-kpmg/development/64ALQVW3AMMCNCDTHFM2U5NDDY======/fast4734933ef38579c5bbf523ce20455b56.tar.gz
        - '--dest-dir'
        - /root
        - '--'
        - pyflyte-execute
        - '--inputs'
        - '{{.input}}'
        - '--output-prefix'
        - '{{.outputPrefix}}'
        - '--raw-output-data-prefix'
        - '{{.rawOutputDataPrefix}}'
        - '--checkpoint-path'
        - '{{.checkpointOutputPrefix}}'
        - '--prev-checkpoint'
        - '{{.prevCheckpointPrefix}}'
        - '--resolver'
        - flytekit.core.python_auto_container.default_task_resolver
        - '--'
        - task-module
        - auto_fine_tuning.fine_tuning.fine_tune
        - task-name
        - fine_tune
      image: registry.uniphore.com/uniphore-manual/auto-fine-tuning:v-53
      resources:
        limits:
          - name: CPU
            value: '3'
          - name: MEMORY
            value: 14000Mi
        requests:
          - name: CPU
            value: '3'
          - name: MEMORY
            value: 14000Mi
    id:
      domain: development
      name: auto_fine_tuning.fine_tuning.fine_tune.fine_tune
      project: duploservices-kpmg
      resourceType: TASK
      version: ft1.3
    interface:
      inputs:
        variables:
          job_id:
            type:
              simple: STRING
      outputs: {}
    metadata:
      retries: {}
      runtime:
        flavor: python
        type: FLYTE_SDK
        version: 1.13.5
    securityContext:
      secrets:
        - group: hugging-face-token
          key: auth_token
    type: python-task
workflowMeta:
  eventVersion: 2
failed at Node[start-node]. CausedByError: Failed to store workflow inputs (as start node), caused by: Failed to write data [0b] to path [metadata/propeller/duploservices-kpmg-development-kui6eqjcj2r6/start-node/data/0/outputs.pb].: PutObject, putting object: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: 41483abe-6ab8-4312-83c4-965f42a96ae5
If I restart flytepropeller, it works.
@ancient-wolf-19325, failed at Node[start-node]. CausedByError: Failed to store workflow inputs (as start node), caused by: Failed to write data [0b] to path [metadata/propeller/duploservices-kpmg-development-kui6eqjcj2r6/start-node/data/0/outputs.pb].: PutObject, putting object: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: 41483abe-6ab8-4312-83c4-965f42a96ae5 After restarting the flyte propeller it works again, but after some time the issue returns. Is there any config to keep the AWS STS token auto-updated, or something like that?
It could be that FlytePropeller is not refreshing its AWS STS credentials (for AssumeRoleWithWebIdentity) automatically, leading to expired or invalid tokens and the AccessDenied errors; restarting FlytePropeller forces it to fetch new credentials, which temporarily resolves the issue. This behavior is not explicitly documented as configurable, and there is no mention of a built-in FlytePropeller config to auto-refresh AWS STS tokens or handle token rotation seamlessly. Though I'm not entirely certain, this may be a known limitation when using IRSA (IAM Roles for Service Accounts) or similar mechanisms; the recommended approach is to ensure the pod's service account and IAM role are set up correctly so that the AWS SDK in FlytePropeller can refresh tokens as needed. If the problem persists, check your Kubernetes service account, IAM role trust policy, and pod annotations to ensure proper integration, as described in the Flyte EKS deployment examples. Would you like more detail on how to verify or configure these settings? Sources: Flyte EKS deployment examples
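One way to check that the projected IRSA token is present and actually being rotated inside the propeller pod (a hedged sketch; the mount path is the standard IRSA convention):
# Decode the JWT payload of the projected token and inspect its "exp"
# claim; the kubelet rewrites this file well before it expires.
kubectl exec -n duploservices-kpmg deploy/flytepropeller -- \
  cat /var/run/secrets/eks.amazonaws.com/serviceaccount/token \
  | cut -d. -f2 | tr '_-' '/+' | base64 -d 2>/dev/null; echo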
@ancient-wolf-19325, yes please. Also, how do I know whether I am using flyte-binary or a different installation?
It seems likely that to verify and configure AWS STS token refresh for FlytePropeller, you should ensure your Kubernetes service account is annotated with the correct IAM role (for IRSA), and that the IAM role trust policy allows the EKS OIDC provider. In your Flyte deployment, check for annotations like `eks.amazonaws.com/role-arn` on the service account used by FlytePropeller, as shown in eks-production.yaml. The AWS SDK in FlytePropeller should then automatically refresh tokens, but if the pod is not using the correct service account or the IAM role is misconfigured, token refresh may fail.
To determine whether you are using Flyte Binary or a different installation, check your deployment method:
- If you installed using the `flyte-binary` Helm chart, or see a single pod running all Flyte components (admin, propeller, datacatalog, console), you are using Flyte Binary.
- If you see separate pods for `flyteadmin`, `flytepropeller`, `datacatalog`, etc., you are using the multi-component (core) installation, often deployed with the `flyte-core` Helm chart.
You can also check your Helm releases with `helm list -n <namespace>` and look for `flyte-binary` or `flyte-core`. Would you like step-by-step instructions to check your service account and IAM role configuration? Sources: Flyte binary vs core discussion, eks-production.yaml
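For example, a quick check (a hedged sketch; release and namespace names are illustrative):
# One pod running everything => flyte-binary; separate flyteadmin,
# flytepropeller, and datacatalog pods => flyte-core.
helm list -A | grep -i flyte
kubectl get pods -n duploservices-kpmg -l app.kubernetes.io/name=flytepropeller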
@ancient-wolf-19325, we are using flyte-core, not flyte-binary.
It seems likely that since you are using Flyte Core (not Flyte Binary), your deployment runs separate pods for each component (flyteadmin, flytepropeller, datacatalog, etc.), and the configuration for IRSA (IAM Roles for Service Accounts) is managed per component. To ensure FlytePropeller can refresh AWS STS tokens, verify that the Kubernetes service account used by the flytepropeller pod is annotated with the correct IAM role, and that the IAM role trust policy allows the EKS OIDC provider. You can check this with:
kubectl get serviceaccount <propeller-sa> -n <namespace> -o yaml
Look for the `eks.amazonaws.com/role-arn` annotation. Also, ensure your IAM role trust policy includes the EKS OIDC provider as described in the AWS docs. Would you like more detail on updating the service account or IAM role trust policy? Sources: Flyte core Helm values
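To inspect the role's trust policy from the command line instead of the AWS console, a hedged sketch (the role name is illustrative):
# The trust policy must name the cluster's OIDC provider and allow
# sts:AssumeRoleWithWebIdentity for the flytepropeller service account.
aws iam get-role --role-name <role-name> \
  --query 'Role.AssumeRolePolicyDocument' --output json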
@ancient-wolf-19325,
apiVersion: flyte.lyft.com/v1alpha1
kind: FlyteWorkflow
metadata:
  creationTimestamp: '2025-08-14T13:16:05Z'
  generation: 2
  labels:
    domain: development
    execution-id: uu730hr059jx
    project: duploservices-kpmg
    shard-key: '19'
    workflow-name: auto-fine-tuning-workflows-auto-fine-tuning-auto-fine-tuning-wf
  name: uu730hr059jx
  namespace: duploservices-kpmg
  resourceVersion: '120146002'
  uid: 7b5cfaf6-a5d0-4cfe-b7fd-77a7237384ff
  selfLink: >-
    /apis/flyte.lyft.com/v1alpha1/namespaces/duploservices-kpmg/flyteworkflows/uu730hr059jx
status:
  failedAttempts: 1
  message: "failed at Node[start-node]. CausedByError: Failed to store workflow inputs (as start node), caused by: Failed to write data [0b] to path [metadata/propeller/duploservices-kpmg-development-uu730hr059jx/start-node/data/0/outputs.pb].: PutObject, putting object: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: 5dd28e4e-8040-4d0b-b86d-1c9e0fd0d2e4"
  phase: 0
spec:
  connections:
    n0:
      - n1
    n1:
      - n2
    n2:
      - end-node
    start-node:
      - n0
  edges:
    downstream:
      n0:
        - n1
      n1:
        - n2
      n2:
        - end-node
      start-node:
        - n0
    upstream:
      end-node:
        - n2
      n0:
        - start-node
      n1:
        - n0
      n2:
        - n1
  id: >-
    duploservices-kpmg:development:auto_fine_tuning.workflows.auto_fine_tuning.auto_fine_tuning_wf
  nodes:
    end-node:
      id: end-node
      kind: end
      resources: {}
    n0:
      id: n0
      kind: task
      name: generate_finetuning_data
      resources: {}
      task: >-
        resource_type:TASK project:"duploservices-kpmg" domain:"development"
        name:"auto_fine_tuning.data.finetuning_data.generate_finetuning_data"
        version:"ft1.4"
    n1:
      id: n1
      kind: task
      name: combined_dataset_preparation
      resources: {}
      task: >-
        resource_type:TASK project:"duploservices-kpmg" domain:"development"
        name:"auto_fine_tuning.dataset_preparation.combine_dataset.combined_dataset_preparation"
        version:"ft1.4"
    n2:
      id: n2
      inputBindings:
        - binding:
            promise:
              attrPath:
                - stringValue: o0
              nodeId: n1
              var: o0
          var: job_id
      kind: task
      name: fine_tune
      resources: {}
      task: >-
        resource_type:TASK project:"duploservices-kpmg" domain:"development"
        name:"auto_fine_tuning.fine_tuning.fine_tune.fine_tune" version:"ft1.4"
    start-node:
      id: start-node
      kind: start
      resources: {}
  outputs: {}
acceptedAt: '2025-08-14T13:16:05Z'
executionConfig:
  EnvironmentVariables:
    AUTH_TOKEN: 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
    CHUNK_COUNT_TO_BE_PROCESSED: '500'
    FTM_POSTGRES_DATABASE: forge
    FTM_POSTGRES_HOST: duplopgdb-kpmg.clastecrvnqt.us-west-2.rds.amazonaws.com
    FTM_POSTGRES_PASSWORD: W3g2P5FT&6p[Z5pU
    FTM_POSTGRES_USER: xforgeuser
    KB_ID: 9d491af8-0835-4c5f-b9f5-76d09f925a3b
    LLAMA_MODEL_URL: ''
    MODEL_NAME: meta-llama/Llama-3.1-8B-Instruct
    MODEL_TYPE: generator
    OPEN_API_KEY: >-
      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
    S3_SOURCE_BUCKET: sagemaker-us-west-2-089910700032
    TENANT_ID: 8b89b65f-d0fd-4e2a-8e9a-acc507e0c118
    USER_ID: e58503eb-e551-48fe-aa2f-ac652984281d
  Interruptible: null
  MaxParallelism: 25
  OverwriteCache: false
  RecoveryExecution: {}
  TaskPluginImpls: {}
  TaskResources:
    Limits:
      CPU: '24'
      EphemeralStorage: '0'
      GPU: '1'
      Memory: 50Gi
      Storage: '0'
    Requests:
      CPU: '1'
      EphemeralStorage: '0'
      GPU: '0'
      Memory: 1000Mi
      Storage: '0'
executionId:
  domain: development
  name: uu730hr059jx
  project: duploservices-kpmg
inputs: {}
node-defaults: {}
rawOutputDataConfig: {}
securityContext:
  run_as: {}
tasks:
  resource_type:TASK project:"duploservices-kpmg" domain:"development" name:"auto_fine_tuning.data.finetuning_data.generate_finetuning_data" version:"ft1.4":
    container:
      args:
        - pyflyte-fast-execute
        - '--additional-distribution'
        - >-
          s3://duploservices-kpmg-flyte-089910700032/duploservices-kpmg/development/DYHKDKBSRRSXUGKKLDNJ7ZWNKQ======/fast5d983a4338db9fbc2756c3d55bf29e4c.tar.gz
        - '--dest-dir'
        - /root
        - '--'
        - pyflyte-execute
        - '--inputs'
        - '{{.input}}'
        - '--output-prefix'
        - '{{.outputPrefix}}'
        - '--raw-output-data-prefix'
        - '{{.rawOutputDataPrefix}}'
        - '--checkpoint-path'
        - '{{.checkpointOutputPrefix}}'
        - '--prev-checkpoint'
        - '{{.prevCheckpointPrefix}}'
        - '--resolver'
        - flytekit.core.python_auto_container.default_task_resolver
        - '--'
        - task-module
        - auto_fine_tuning.data.finetuning_data
        - task-name
        - generate_finetuning_data
      image: registry.uniphore.com/uniphore-manual/instruction-dataset:v-5
      resources:
        limits:
          - name: CPU
            value: 1500m
          - name: MEMORY
            value: 4000Mi
        requests:
          - name: CPU
            value: 1500m
          - name: MEMORY
            value: 4000Mi
    id:
      domain: development
      name: auto_fine_tuning.data.finetuning_data.generate_finetuning_data
      project: duploservices-kpmg
      resourceType: TASK
      version: ft1.4
    interface:
      inputs: {}
      outputs: {}
    metadata:
      retries: {}
      runtime:
        flavor: python
        type: FLYTE_SDK
        version: 1.13.5
    securityContext:
      secrets:
        - group: hugging-face-token
          key: auth_token
    type: python-task
  resource_type:TASK project:"duploservices-kpmg" domain:"development" name:"auto_fine_tuning.dataset_preparation.combine_dataset.combined_dataset_preparation" version:"ft1.4":
    container:
      args:
        - pyflyte-fast-execute
        - '--additional-distribution'
        - >-
          s3://duploservices-kpmg-flyte-089910700032/duploservices-kpmg/development/DYHKDKBSRRSXUGKKLDNJ7ZWNKQ======/fast5d983a4338db9fbc2756c3d55bf29e4c.tar.gz
        - '--dest-dir'
        - /root
        - '--'
        - pyflyte-execute
        - '--inputs'
        - '{{.input}}'
        - '--output-prefix'
        - '{{.outputPrefix}}'
        - '--raw-output-data-prefix'
        - '{{.rawOutputDataPrefix}}'
        - '--checkpoint-path'
        - '{{.checkpointOutputPrefix}}'
        - '--prev-checkpoint'
        - '{{.prevCheckpointPrefix}}'
        - '--resolver'
        - flytekit.core.python_auto_container.default_task_resolver
        - '--'
        - task-module
        - auto_fine_tuning.dataset_preparation.combine_dataset
        - task-name
        - combined_dataset_preparation
      image: registry.uniphore.com/uniphore-manual/instruction-dataset:v-5
      resources:
        limits:
          - name: CPU
            value: 1500m
          - name: MEMORY
            value: 4000Mi
        requests:
          - name: CPU
            value: 1500m
          - name: MEMORY
            value: 4000Mi
    id:
      domain: development
      name: >-
        auto_fine_tuning.dataset_preparation.combine_dataset.combined_dataset_preparation
      project: duploservices-kpmg
      resourceType: TASK
      version: ft1.4
    interface:
      inputs: {}
      outputs:
        variables:
          o0:
            type:
              simple: STRING
    metadata:
      retries: {}
      runtime:
        flavor: python
        type: FLYTE_SDK
        version: 1.13.5
    securityContext:
      secrets:
        - group: hugging-face-token
          key: auth_token
    type: python-task
  resource_type:TASK project:"duploservices-kpmg" domain:"development" name:"auto_fine_tuning.fine_tuning.fine_tune.fine_tune" version:"ft1.4":
    container:
      args:
        - pyflyte-fast-execute
        - '--additional-distribution'
        - >-
          s3://duploservices-kpmg-flyte-089910700032/duploservices-kpmg/development/DYHKDKBSRRSXUGKKLDNJ7ZWNKQ======/fast5d983a4338db9fbc2756c3d55bf29e4c.tar.gz
        - '--dest-dir'
        - /root
        - '--'
        - pyflyte-execute
        - '--inputs'
        - '{{.input}}'
        - '--output-prefix'
        - '{{.outputPrefix}}'
        - '--raw-output-data-prefix'
        - '{{.rawOutputDataPrefix}}'
        - '--checkpoint-path'
        - '{{.checkpointOutputPrefix}}'
        - '--prev-checkpoint'
        - '{{.prevCheckpointPrefix}}'
        - '--resolver'
        - flytekit.core.python_auto_container.default_task_resolver
        - '--'
        - task-module
        - auto_fine_tuning.fine_tuning.fine_tune
        - task-name
        - fine_tune
      image: registry.uniphore.com/uniphore-manual/auto-fine-tuning:v-53
      resources:
        limits:
          - name: CPU
            value: '3'
          - name: MEMORY
            value: 14000Mi
        requests:
          - name: CPU
            value: '3'
          - name: MEMORY
            value: 14000Mi
    id:
      domain: development
      name: auto_fine_tuning.fine_tuning.fine_tune.fine_tune
      project: duploservices-kpmg
      resourceType: TASK
      version: ft1.4
    interface:
      inputs:
        variables:
          job_id:
            type:
              simple: STRING
      outputs: {}
    metadata:
      retries: {}
      runtime:
        flavor: python
        type: FLYTE_SDK
        version: 1.13.5
    securityContext:
      secrets:
        - group: hugging-face-token
          key: auth_token
    type: python-task
workflowMeta:
  eventVersion: 2
It seems likely that your workflow is failing at the start node due to an AWS IAM/STS permissions issue: `AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity`. This means FlytePropeller (or the pod running the workflow) is unable to assume the IAM role via IRSA (IAM Roles for Service Accounts), so it cannot write to S3. To resolve this, check the following:
1. The Kubernetes service account used by FlytePropeller (and/or task pods) must be annotated with the correct IAM role ARN (e.g., `eks.amazonaws.com/role-arn`).
2. The IAM role trust policy must allow the EKS OIDC provider and the service account as a principal.
3. The IAM role must have permissions for `sts:AssumeRoleWithWebIdentity` and S3 access.
You can check the service account annotation with:
kubectl get serviceaccount <propeller-sa> -n <namespace> -o yaml
and verify the IAM role trust policy in the AWS console. Would you like step-by-step instructions for checking and fixing these settings? Sources: https://github.com/flyteorg/flyte/issues/2417
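To reproduce the failing call outside of propeller, you can attempt the same STS operation with the pod's projected token (a hedged sketch; the token path follows the IRSA convention and the role ARN is a placeholder):
# Copy the projected token out of the pod, then try the same STS call
# from any machine with the AWS CLI; a 403 here reproduces the problem
# independent of Flyte.
TOKEN=$(kubectl exec -n duploservices-kpmg deploy/flytepropeller -- \
  cat /var/run/secrets/eks.amazonaws.com/serviceaccount/token)
aws sts assume-role-with-web-identity \
  --role-arn <role-arn> \
  --role-session-name debug-irsa \
  --web-identity-token "$TOKEN"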
@ancient-wolf-19325,
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::089910700032:role/duploservices-kpmg
    meta.helm.sh/release-name: flyte
    meta.helm.sh/release-namespace: duploservices-kpmg
  creationTimestamp: "2025-07-10T19:54:47Z"
  labels:
    app.kubernetes.io/instance: flyte
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flytepropeller
    helm.sh/chart: flyte-core-v1.14.1
  name: flytepropeller
  namespace: duploservices-kpmg
  resourceVersion: "75821561"
  uid: 1aa7433b-2f9b-4838-844d-47ea3517d575
Your FlytePropeller service account is correctly annotated with the IAM role ARN (`eks.amazonaws.com/role-arn: arn:aws:iam::089910700032:role/duploservices-kpmg`). This is required for IRSA to work, allowing the pod to assume the specified IAM role and access AWS resources. If you are still seeing `AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity`, the issue is likely with the IAM role's trust policy or permissions. Ensure the IAM role trust policy allows the EKS OIDC provider and the service account as a principal, and that the role has the necessary S3 and STS permissions. Would you like step-by-step instructions to verify and update the IAM role trust policy? Sources: Flyte core Helm values
@ancient-wolf-19325, can you point to where exactly in the FlytePropeller source code the WebIdentity token is obtained and refreshed?
It could be that FlytePropeller does not directly handle the retrieval or refresh of AWS WebIdentity tokens in its own source code. Instead, FlytePropeller relies on the AWS SDK's default credential provider chain, which automatically uses the projected service account token (via IRSA) when the pod is annotated with `eks.amazonaws.com/role-arn`. The actual token retrieval and refresh logic is managed by the AWS SDK, not by explicit FlytePropeller code. If you want to see where FlytePropeller interacts with AWS, you might look at the S3 storage client initialization, but the token handling is abstracted by the SDK. Would you like more detail on where FlytePropeller initializes the S3 client or how the AWS SDK handles token refresh? Sources: https://github.com/flyteorg/flyte/issues/5606, https://github.com/flyteorg/flyte/issues/5012 (Note: no highly relevant data sources were found; the above are potentially relevant links.)
@ancient-wolf-19325, does propeller auto-kill pods? If so, how do I enable logs in propeller to see the exact reason why it kills a pod?
It seems likely that FlytePropeller can terminate (kill) pods as part of workflow/task lifecycle management, for example when aborting workflows, handling failures, or enforcing retries. Pod termination is managed by FlytePropeller's controller logic, which interacts with the Kubernetes API to delete pods when necessary. To see detailed reasons for pod termination, increase the log verbosity of FlytePropeller and check its logs for messages related to node or task aborts, failures, or deletions. To enable more detailed logs, set the log level to debug in the FlytePropeller configuration (e.g., in the Helm values or configmap, set `logger.level: 4` or higher), then review the FlytePropeller logs for messages about pod deletions or node aborts. The logs will typically include the workflow execution ID, node ID, and the reason for termination. For more on debugging executions and viewing logs, see Debugging Executions. Would you like step-by-step instructions for changing the log level or examples of what to look for in the logs? Sources: Debugging Executions
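As an illustration, raising verbosity via the flytestdlib logger block (a hedged sketch; the exact Helm values nesting varies by chart version, so treat the file layout as an assumption):
# my-values.yaml (hypothetical excerpt of the propeller config):
#   logger:
#     level: 4          # debug verbosity; default is lower
#     show-source: true # include file:line of each log call
helm upgrade flyte flyteorg/flyte-core -n duploservices-kpmg -f my-values.yaml
kubectl rollout restart deployment/flytepropeller -n duploservices-kpmg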