victorious-jelly-28729
07/29/2025, 9:46 AMjolly-florist-66728
07/29/2025, 9:46 AMvictorious-jelly-28729
07/29/2025, 9:47 AMjolly-florist-66728
07/29/2025, 9:47 AMvictorious-jelly-28729
07/29/2025, 10:05 AMjolly-florist-66728
07/29/2025, 10:05 AM
To inspect FlyteWorkflow CRDs, use kubectl in the namespace where your Flyte project runs. For example, run:
kubectl get flyteworkflows.flyte.lyft.com -A
This command lists all FlyteWorkflow CRD instances across all namespaces. To inspect a specific workflow, use:
kubectl describe flyteworkflows.flyte.lyft.com <workflow-name> -n <namespace>
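If you only need the overall phase and any recorded error, you can pull just the status fields with jsonpath (a sketch; substitute your workflow name and namespace):
kubectl get flyteworkflows.flyte.lyft.com <workflow-name> -n <namespace> -o jsonpath='{.status.phase}{"\n"}{.status.message}{"\n"}'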
Would you like more detail on interpreting the CRD fields or troubleshooting specific issues?
Sources:
- FlytePropeller Architecture
- FlyteWorkflow CRD Example
victorious-jelly-28729
07/29/2025, 10:13 AMjolly-florist-66728
07/29/2025, 10:13 AMvictorious-jelly-28729
07/29/2025, 11:11 AMjolly-florist-66728
07/29/2025, 11:11 AM
1. Inspect the FlyteWorkflow CRD:
kubectl describe flyteworkflows.flyte.lyft.com <workflow-name> -n <namespace>
This shows the workflow's spec (DAG, nodes) and status (phases, errors, timestamps). Look at the status section for node/task phases and error messages.
2. Check Pod Status:
List pods in the namespace to see if any are stuck or failed:
kubectl get pods -n <namespace>
Describe a pod for more details:
kubectl describe pod <pod-name> -n <namespace>
3. Use Flyte UI or CLI:
- In the Flyte UI, navigate to the workflow execution to view the execution graph, node statuses, logs, and errors.
- With Flytectl, run:
flytectl get execution -p <project> -d <domain> <execid> --details
This provides node/task execution details and log links.
4. Check Logs:
Access logs for FlytePropeller and task pods for error messages.
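For example, a sketch assuming a flyte-core install where FlytePropeller runs as a deployment named flytepropeller (adjust names to your setup):
# Filter FlytePropeller logs by execution ID, then check the task pod itself.
kubectl logs deployment/flytepropeller -n <flyte-namespace> --since=1h | grep <exec-id>
kubectl logs <task-pod-name> -n <project-namespace>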
Would you like a step-by-step example with sample output or more detail on interpreting specific fields?
Sources:
- FlytePropeller Architecture
- Debugging Executions
- Inspecting Executions
average-finland-92144
07/31/2025, 4:31 PMvictorious-jelly-28729
08/14/2025, 12:07 PMvictorious-jelly-28729
08/14/2025, 12:07 PMvictorious-jelly-28729
08/14/2025, 12:50 PM
time="2025-08-14T11:58:41Z" level=info msg=------------------------------------------------------------------------
time="2025-08-14T11:58:41Z" level=info msg="App [flytepropeller], Version [unknown], BuildSHA [unknown], BuildTS [2025-08-14 11:58:41.676459895 +0000 UTC m=+0.001183164]"
time="2025-08-14T11:58:41Z" level=info msg=------------------------------------------------------------------------
time="2025-08-14T11:58:41Z" level=info msg="Detected: 4 CPU's\n"
{"json":{},"level":"warning","msg":"stow configuration section missing, defaulting to legacy s3/minio connection config","ts":"2025-08-14T11:58:41Z"}
I0814 11:58:42.024579 1 leaderelection.go:250] attempting to acquire leader lease duploservices-kpmg/propeller-leader...
time="2025-08-14T11:58:43Z" level=info msg=------------------------------------------------------------------------
time="2025-08-14T11:58:43Z" level=info msg="App [flytepropeller], Version [unknown], BuildSHA [unknown], BuildTS [2025-08-14 11:58:43.351075208 +0000 UTC m=+0.001096395]"
time="2025-08-14T11:58:43Z" level=info msg=------------------------------------------------------------------------
time="2025-08-14T11:58:43Z" level=info msg="Detected: 4 CPU's\n"
{"json":{},"level":"warning","msg":"stow configuration section missing, defaulting to legacy s3/minio connection config","ts":"2025-08-14T11:58:43Z"}
I0814 11:58:43.725951 1 leaderelection.go:250] attempting to acquire leader lease duploservices-kpmg/propeller-leader...
I0814 11:58:58.713172 1 leaderelection.go:260] successfully acquired lease duploservices-kpmg/propeller-leader
{"json":{"exec_id":"n6-ymujj8y7e","node":"n0","ns":"duploservices-kpmg","res_ver":"120055912","routine":"worker-1","tasktype":"python-task","wf":"duploservices-kpmg:development:auto_fine_tuning.workflows.auto_fine_tuning.auto_fine_tuning_wf"},"level":"warning","msg":"Failed to record taskEvent, error [AlreadyExists: Event already exists, caused by [rpc error: code = AlreadyExists desc = have already recorded task execution phase SUCCEEDED (version: 0) for {{{} [] [] <nil>} 0 [] resource_type:TASK project:\"duploservices-kpmg\" domain:\"development\" name:\"auto_fine_tuning.data.finetuning_data.generate_finetuning_data\" version:\"ft1.3\" node_id:\"n0\" execution_id:{project:\"duploservices-kpmg\" domain:\"development\" name:\"n6-ymujj8y7e\"} 0}]]. Trying to record state: SUCCEEDED. Ignoring this error!","ts":"2025-08-14T11:58:59Z"}
{"json":{"exec_id":"n6-ymujj8y7e","node":"n1","ns":"duploservices-kpmg","res_ver":"120058691","routine":"worker-20","tasktype":"python-task","wf":"duploservices-kpmg:development:auto_fine_tuning.workflows.auto_fine_tuning.auto_fine_tuning_wf"},"level":"warning","msg":"Failed to cast contentMD5 [] to string","ts":"2025-08-14T12:01:37Z"}
{"json":{"exec_id":"v1n-f2mqw9zi","ns":"duploservices-kpmg","routine":"worker-17"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"v1n-f2mqw9zi\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-08-14T12:09:37Z"}
E0814 12:09:37.983709 1 workers.go:103] error syncing 'duploservices-kpmg/v1n-f2mqw9zi': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "v1n-f2mqw9zi": the object has been modified; please apply your changes to the latest version and try again
{"json":{"exec_id":"v1n-f2mqw9zi","ns":"duploservices-kpmg","routine":"worker-19"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"v1n-f2mqw9zi\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-08-14T12:09:48Z"}
E0814 12:09:48.071878 1 workers.go:103] error syncing 'duploservices-kpmg/v1n-f2mqw9zi': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "v1n-f2mqw9zi": the object has been modified; please apply your changes to the latest version and try again
W0814 12:10:28.124746 1 warnings.go:70] metadata.finalizers: "flyte-finalizer": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers
{"json":{"exec_id":"v1n-f2mqw9zi","node":"n0","ns":"duploservices-kpmg","res_ver":"120067077","routine":"worker-13","tasktype":"python-task","wf":"duploservices-kpmg:development:auto_fine_tuning.workflows.auto_fine_tuning.auto_fine_tuning_wf"},"level":"warning","msg":"Failed to cast contentMD5 [] to string","ts":"2025-08-14T12:10:47Z"}
{"json":{"exec_id":"v1n-f2mqw9zi","node":"n0","ns":"duploservices-kpmg","res_ver":"120067292","routine":"worker-13","wf":"duploservices-kpmg:development:auto_fine_tuning.workflows.auto_fine_tuning.auto_fine_tuning_wf"},"level":"warning","msg":"Trying to abort a node in state [Failed]","ts":"2025-08-14T12:10:47Z"}
{"json":{"exec_id":"v1n-f2mqw9zi","ns":"duploservices-kpmg","routine":"worker-25"},"level":"warning","msg":"Workflow namespace[duploservices-kpmg]/name[v1n-f2mqw9zi] has already been terminated.","ts":"2025-08-14T12:11:07Z"}
{"json":{"exec_id":"kui6eqjcj2r6","ns":"duploservices-kpmg","routine":"worker-3"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"kui6eqjcj2r6\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-08-14T12:15:31Z"}
E0814 12:15:31.210232 1 workers.go:103] error syncing 'duploservices-kpmg/kui6eqjcj2r6': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "kui6eqjcj2r6": the object has been modified; please apply your changes to the latest version and try again
{"json":{"exec_id":"kui6eqjcj2r6","ns":"duploservices-kpmg","routine":"worker-5"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"kui6eqjcj2r6\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-08-14T12:15:41Z"}
E0814 12:15:41.300136 1 workers.go:103] error syncing 'duploservices-kpmg/kui6eqjcj2r6': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "kui6eqjcj2r6": the object has been modified; please apply your changes to the latest version and try again
{"json":{"exec_id":"kui6eqjcj2r6","ns":"duploservices-kpmg","routine":"worker-8"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"kui6eqjcj2r6\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-08-14T12:16:01Z"}
E0814 12:16:01.305906 1 workers.go:103] error syncing 'duploservices-kpmg/kui6eqjcj2r6': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "kui6eqjcj2r6": the object has been modified; please apply your changes to the latest version and try again
W0814 12:17:21.377989 1 warnings.go:70] metadata.finalizers: "flyte-finalizer": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers
{"json":{"exec_id":"kui6eqjcj2r6","ns":"duploservices-kpmg","routine":"worker-21"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"kui6eqjcj2r6\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-08-14T12:17:21Z"}
E0814 12:17:21.378197 1 workers.go:103] error syncing 'duploservices-kpmg/kui6eqjcj2r6': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "kui6eqjcj2r6": the object has been modified; please apply your changes to the latest version and try again
W0814 12:19:21.442075 1 warnings.go:70] metadata.finalizers: "flyte-finalizer": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers
{"json":{"exec_id":"kui6eqjcj2r6","ns":"duploservices-kpmg","routine":"worker-30"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"kui6eqjcj2r6\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-08-14T12:19:21Z"}
E0814 12:19:21.442244 1 workers.go:103] error syncing 'duploservices-kpmg/kui6eqjcj2r6': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "kui6eqjcj2r6": the object has been modified; please apply your changes to the latest version and try again
{"json":{"exec_id":"kui6eqjcj2r6","ns":"duploservices-kpmg","routine":"worker-33"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"kui6eqjcj2r6\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-08-14T12:21:21Z"}
E0814 12:21:21.518997 1 workers.go:103] error syncing 'duploservices-kpmg/kui6eqjcj2r6': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "kui6eqjcj2r6": the object has been modified; please apply your changes to the latest version and try again
{"json":{"exec_id":"kui6eqjcj2r6","ns":"duploservices-kpmg","routine":"worker-7"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"kui6eqjcj2r6\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-08-14T12:23:21Z"}
E0814 12:23:21.610229 1 workers.go:103] error syncing 'duploservices-kpmg/kui6eqjcj2r6': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "kui6eqjcj2r6": the object has been modified; please apply your changes to the latest version and try again
victorious-jelly-28729
08/14/2025, 12:51 PM
apiVersion: flyte.lyft.com/v1alpha1
kind: FlyteWorkflow
metadata:
creationTimestamp: '2025-08-14T12:15:26Z'
generation: 2
labels:
domain: development
execution-id: kui6eqjcj2r6
project: duploservices-kpmg
shard-key: '19'
workflow-name: auto-fine-tuning-workflows-auto-fine-tuning-auto-fine-tuning-wf
name: kui6eqjcj2r6
namespace: duploservices-kpmg
resourceVersion: '120071686'
uid: 378fd071-2294-4b4a-a414-d81193ae0a0b
selfLink: >-
/apis/flyte.lyft.com/v1alpha1/namespaces/duploservices-kpmg/flyteworkflows/kui6eqjcj2r6
status:
failedAttempts: 1
message: "failed at Node[start-node]. CausedByError: Failed to store workflow inputs (as start node), caused by: Failed to write data [0b] to path [metadata/propeller/duploservices-kpmg-development-kui6eqjcj2r6/start-node/data/0/outputs.pb].: PutObject, putting object: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: 41483abe-6ab8-4312-83c4-965f42a96ae5"
phase: 0
spec:
connections:
n0:
- n1
n1:
- n2
n2:
- end-node
start-node:
- n0
edges:
downstream:
n0:
- n1
n1:
- n2
n2:
- end-node
start-node:
- n0
upstream:
end-node:
- n2
n0:
- start-node
n1:
- n0
n2:
- n1
id: >-
duploservices-kpmg:development:auto_fine_tuning.workflows.auto_fine_tuning.auto_fine_tuning_wf
nodes:
end-node:
id: end-node
kind: end
resources: {}
n0:
id: n0
kind: task
name: generate_finetuning_data
resources: {}
task: >-
resource_type:TASK project:"duploservices-kpmg" domain:"development"
name:"auto_fine_tuning.data.finetuning_data.generate_finetuning_data"
version:"ft1.3"
n1:
id: n1
kind: task
name: combined_dataset_preparation
resources: {}
task: >-
resource_type:TASK project:"duploservices-kpmg" domain:"development"
name:"auto_fine_tuning.dataset_preparation.combine_dataset.combined_dataset_preparation"
version:"ft1.3"
n2:
id: n2
inputBindings:
- binding:
promise:
attrPath:
- stringValue: o0
nodeId: n1
var: o0
var: job_id
kind: task
name: fine_tune
resources: {}
task: >-
resource_type:TASK project:"duploservices-kpmg" domain:"development"
name:"auto_fine_tuning.fine_tuning.fine_tune.fine_tune" version:"ft1.3"
start-node:
id: start-node
kind: start
resources: {}
outputs: {}
acceptedAt: '2025-08-14T12:15:25Z'
executionConfig:
EnvironmentVariables:
AUTH_TOKEN: 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
CHUNK_COUNT_TO_BE_PROCESSED: '500'
FTM_POSTGRES_DATABASE: forge
FTM_POSTGRES_HOST: duplopgdb-kpmg.clastecrvnqt.us-west-2.rds.amazonaws.com
FTM_POSTGRES_PASSWORD: W3g2P5FT&6p[Z5pU
FTM_POSTGRES_USER: xforgeuser
KB_ID: 9d491af8-0835-4c5f-b9f5-76d09f925a3b
LLAMA_MODEL_URL: ''
MODEL_NAME: meta-llama/Llama-3.1-8B-Instruct
MODEL_TYPE: generator
OPEN_API_KEY: >-
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
S3_SOURCE_BUCKET: sagemaker-us-west-2-089910700032
TENANT_ID: 8b89b65f-d0fd-4e2a-8e9a-acc507e0c118
USER_ID: e58503eb-e551-48fe-aa2f-ac652984281d
Interruptible: null
MaxParallelism: 25
OverwriteCache: false
RecoveryExecution: {}
TaskPluginImpls: {}
TaskResources:
Limits:
CPU: '24'
EphemeralStorage: '0'
GPU: '1'
Memory: 50Gi
Storage: '0'
Requests:
CPU: '1'
EphemeralStorage: '0'
GPU: '0'
Memory: 1000Mi
Storage: '0'
executionId:
domain: development
name: kui6eqjcj2r6
project: duploservices-kpmg
inputs: {}
node-defaults: {}
rawOutputDataConfig: {}
securityContext:
run_as: {}
tasks:
resource_type:TASK project:"duploservices-kpmg" domain:"development" name:"auto_fine_tuning.data.finetuning_data.generate_finetuning_data" version:"ft1.3":
container:
args:
- pyflyte-fast-execute
- '--additional-distribution'
- >-
s3://duploservices-kpmg-flyte-089910700032/duploservices-kpmg/development/64ALQVW3AMMCNCDTHFM2U5NDDY======/fast4734933ef38579c5bbf523ce20455b56.tar.gz
- '--dest-dir'
- /root
- '--'
- pyflyte-execute
- '--inputs'
- '{{.input}}'
- '--output-prefix'
- '{{.outputPrefix}}'
- '--raw-output-data-prefix'
- '{{.rawOutputDataPrefix}}'
- '--checkpoint-path'
- '{{.checkpointOutputPrefix}}'
- '--prev-checkpoint'
- '{{.prevCheckpointPrefix}}'
- '--resolver'
- flytekit.core.python_auto_container.default_task_resolver
- '--'
- task-module
- auto_fine_tuning.data.finetuning_data
- task-name
- generate_finetuning_data
image: registry.uniphore.com/uniphore-manual/instruction-dataset:v-5
resources:
limits:
- name: CPU
value: 1500m
- name: MEMORY
value: 4000Mi
requests:
- name: CPU
value: 1500m
- name: MEMORY
value: 4000Mi
id:
domain: development
name: auto_fine_tuning.data.finetuning_data.generate_finetuning_data
project: duploservices-kpmg
resourceType: TASK
version: ft1.3
interface:
inputs: {}
outputs: {}
metadata:
retries: {}
runtime:
flavor: python
type: FLYTE_SDK
version: 1.13.5
securityContext:
secrets:
- group: hugging-face-token
key: auth_token
type: python-task
resource_type:TASK project:"duploservices-kpmg" domain:"development" name:"auto_fine_tuning.dataset_preparation.combine_dataset.combined_dataset_preparation" version:"ft1.3":
container:
args:
- pyflyte-fast-execute
- '--additional-distribution'
- >-
s3://duploservices-kpmg-flyte-089910700032/duploservices-kpmg/development/64ALQVW3AMMCNCDTHFM2U5NDDY======/fast4734933ef38579c5bbf523ce20455b56.tar.gz
- '--dest-dir'
- /root
- '--'
- pyflyte-execute
- '--inputs'
- '{{.input}}'
- '--output-prefix'
- '{{.outputPrefix}}'
- '--raw-output-data-prefix'
- '{{.rawOutputDataPrefix}}'
- '--checkpoint-path'
- '{{.checkpointOutputPrefix}}'
- '--prev-checkpoint'
- '{{.prevCheckpointPrefix}}'
- '--resolver'
- flytekit.core.python_auto_container.default_task_resolver
- '--'
- task-module
- auto_fine_tuning.dataset_preparation.combine_dataset
- task-name
- combined_dataset_preparation
image: registry.uniphore.com/uniphore-manual/instruction-dataset:v-5
resources:
limits:
- name: CPU
value: 1500m
- name: MEMORY
value: 4000Mi
requests:
- name: CPU
value: 1500m
- name: MEMORY
value: 4000Mi
id:
domain: development
name: >-
auto_fine_tuning.dataset_preparation.combine_dataset.combined_dataset_preparation
project: duploservices-kpmg
resourceType: TASK
version: ft1.3
interface:
inputs: {}
outputs:
variables:
o0:
type:
simple: STRING
metadata:
retries: {}
runtime:
flavor: python
type: FLYTE_SDK
version: 1.13.5
securityContext:
secrets:
- group: hugging-face-token
key: auth_token
type: python-task
resource_type:TASK project:"duploservices-kpmg" domain:"development" name:"auto_fine_tuning.fine_tuning.fine_tune.fine_tune" version:"ft1.3":
container:
args:
- pyflyte-fast-execute
- '--additional-distribution'
- >-
s3://duploservices-kpmg-flyte-089910700032/duploservices-kpmg/development/64ALQVW3AMMCNCDTHFM2U5NDDY======/fast4734933ef38579c5bbf523ce20455b56.tar.gz
- '--dest-dir'
- /root
- '--'
- pyflyte-execute
- '--inputs'
- '{{.input}}'
- '--output-prefix'
- '{{.outputPrefix}}'
- '--raw-output-data-prefix'
- '{{.rawOutputDataPrefix}}'
- '--checkpoint-path'
- '{{.checkpointOutputPrefix}}'
- '--prev-checkpoint'
- '{{.prevCheckpointPrefix}}'
- '--resolver'
- flytekit.core.python_auto_container.default_task_resolver
- '--'
- task-module
- auto_fine_tuning.fine_tuning.fine_tune
- task-name
- fine_tune
image: registry.uniphore.com/uniphore-manual/auto-fine-tuning:v-53
resources:
limits:
- name: CPU
value: '3'
- name: MEMORY
value: 14000Mi
requests:
- name: CPU
value: '3'
- name: MEMORY
value: 14000Mi
id:
domain: development
name: auto_fine_tuning.fine_tuning.fine_tune.fine_tune
project: duploservices-kpmg
resourceType: TASK
version: ft1.3
interface:
inputs:
variables:
job_id:
type:
simple: STRING
outputs: {}
metadata:
retries: {}
runtime:
flavor: python
type: FLYTE_SDK
version: 1.13.5
securityContext:
secrets:
- group: hugging-face-token
key: auth_token
type: python-task
workflowMeta:
eventVersion: 2
victorious-jelly-28729
08/14/2025, 12:51 PM
failed at Node[start-node]. CausedByError: Failed to store workflow inputs (as start node), caused by: Failed to write data [0b] to path [metadata/propeller/duploservices-kpmg-development-kui6eqjcj2r6/start-node/data/0/outputs.pb].: PutObject, putting object: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: 41483abe-6ab8-4312-83c4-965f42a96ae5
victorious-jelly-28729
08/14/2025, 12:51 PMvictorious-jelly-28729
08/14/2025, 12:54 PMjolly-florist-66728
08/14/2025, 12:54 PMjolly-florist-66728
08/14/2025, 12:55 PMvictorious-jelly-28729
08/14/2025, 12:58 PMjolly-florist-66728
08/14/2025, 12:58 PMjolly-florist-66728
08/14/2025, 12:59 PM
For IRSA, the IAM role ARN is set via the eks.amazonaws.com/role-arn annotation on the service account used by FlytePropeller, as shown in eks-production.yaml. The AWS SDK in FlytePropeller should then automatically refresh tokens, but if the pod is not using the correct service account or the IAM role is misconfigured, token refresh may fail.
To determine if you are using Flyte Binary or a different installation, check your deployment method:
- If you installed using the flyte-binary Helm chart or see a single pod running all Flyte components (admin, propeller, datacatalog, console), you are using Flyte Binary.
- If you see separate pods for flyteadmin, flytepropeller, datacatalog, etc., you are using the multi-component (core) installation, often deployed with the flyte-core Helm chart.
You can also check your Helm releases with helm list -n <namespace> and look for flyte-binary or flyte-core.
Would you like step-by-step instructions to check your service account and IAM role configuration?
Sources:
- Flyte binary vs core discussion
- eks-production.yaml
victorious-jelly-28729
08/14/2025, 1:02 PMjolly-florist-66728
08/14/2025, 1:02 PMjolly-florist-66728
08/14/2025, 1:03 PM
To check which service account FlytePropeller uses and whether it carries the IAM role annotation, run:
kubectl get serviceaccount <propeller-sa> -n <namespace> -o yaml
Look for the eks.amazonaws.com/role-arn annotation. Also, ensure your IAM role trust policy includes the EKS OIDC provider as described in AWS docs.
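For example, a minimal sketch assuming you have AWS CLI access (substitute your actual role name):
# Dump the role's trust (assume-role) policy and confirm it allows
# sts:AssumeRoleWithWebIdentity for your cluster's OIDC provider and this service account.
aws iam get-role --role-name <role-name> --query 'Role.AssumeRolePolicyDocument' --output json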
Would you like more detail on updating the service account or IAM role trust policy?
Sources:
- Flyte core Helm values
victorious-jelly-28729
08/14/2025, 1:18 PM
apiVersion: flyte.lyft.com/v1alpha1
kind: FlyteWorkflow
metadata:
creationTimestamp: '2025-08-14T13:16:05Z'
generation: 2
labels:
domain: development
execution-id: uu730hr059jx
project: duploservices-kpmg
shard-key: '19'
workflow-name: auto-fine-tuning-workflows-auto-fine-tuning-auto-fine-tuning-wf
name: uu730hr059jx
namespace: duploservices-kpmg
resourceVersion: '120146002'
uid: 7b5cfaf6-a5d0-4cfe-b7fd-77a7237384ff
selfLink: >-
/apis/flyte.lyft.com/v1alpha1/namespaces/duploservices-kpmg/flyteworkflows/uu730hr059jx
status:
failedAttempts: 1
message: "failed at Node[start-node]. CausedByError: Failed to store workflow inputs (as start node), caused by: Failed to write data [0b] to path [metadata/propeller/duploservices-kpmg-development-uu730hr059jx/start-node/data/0/outputs.pb].: PutObject, putting object: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: 5dd28e4e-8040-4d0b-b86d-1c9e0fd0d2e4"
phase: 0
spec:
connections:
n0:
- n1
n1:
- n2
n2:
- end-node
start-node:
- n0
edges:
downstream:
n0:
- n1
n1:
- n2
n2:
- end-node
start-node:
- n0
upstream:
end-node:
- n2
n0:
- start-node
n1:
- n0
n2:
- n1
id: >-
duploservices-kpmg:development:auto_fine_tuning.workflows.auto_fine_tuning.auto_fine_tuning_wf
nodes:
end-node:
id: end-node
kind: end
resources: {}
n0:
id: n0
kind: task
name: generate_finetuning_data
resources: {}
task: >-
resource_type:TASK project:"duploservices-kpmg" domain:"development"
name:"auto_fine_tuning.data.finetuning_data.generate_finetuning_data"
version:"ft1.4"
n1:
id: n1
kind: task
name: combined_dataset_preparation
resources: {}
task: >-
resource_type:TASK project:"duploservices-kpmg" domain:"development"
name:"auto_fine_tuning.dataset_preparation.combine_dataset.combined_dataset_preparation"
version:"ft1.4"
n2:
id: n2
inputBindings:
- binding:
promise:
attrPath:
- stringValue: o0
nodeId: n1
var: o0
var: job_id
kind: task
name: fine_tune
resources: {}
task: >-
resource_type:TASK project:"duploservices-kpmg" domain:"development"
name:"auto_fine_tuning.fine_tuning.fine_tune.fine_tune" version:"ft1.4"
start-node:
id: start-node
kind: start
resources: {}
outputs: {}
acceptedAt: '2025-08-14T13:16:05Z'
executionConfig:
EnvironmentVariables:
AUTH_TOKEN: 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
CHUNK_COUNT_TO_BE_PROCESSED: '500'
FTM_POSTGRES_DATABASE: forge
FTM_POSTGRES_HOST: duplopgdb-kpmg.clastecrvnqt.us-west-2.rds.amazonaws.com
FTM_POSTGRES_PASSWORD: W3g2P5FT&6p[Z5pU
FTM_POSTGRES_USER: xforgeuser
KB_ID: 9d491af8-0835-4c5f-b9f5-76d09f925a3b
LLAMA_MODEL_URL: ''
MODEL_NAME: meta-llama/Llama-3.1-8B-Instruct
MODEL_TYPE: generator
OPEN_API_KEY: >-
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
S3_SOURCE_BUCKET: sagemaker-us-west-2-089910700032
TENANT_ID: 8b89b65f-d0fd-4e2a-8e9a-acc507e0c118
USER_ID: e58503eb-e551-48fe-aa2f-ac652984281d
Interruptible: null
MaxParallelism: 25
OverwriteCache: false
RecoveryExecution: {}
TaskPluginImpls: {}
TaskResources:
Limits:
CPU: '24'
EphemeralStorage: '0'
GPU: '1'
Memory: 50Gi
Storage: '0'
Requests:
CPU: '1'
EphemeralStorage: '0'
GPU: '0'
Memory: 1000Mi
Storage: '0'
executionId:
domain: development
name: uu730hr059jx
project: duploservices-kpmg
inputs: {}
node-defaults: {}
rawOutputDataConfig: {}
securityContext:
run_as: {}
tasks:
resource_type:TASK project:"duploservices-kpmg" domain:"development" name:"auto_fine_tuning.data.finetuning_data.generate_finetuning_data" version:"ft1.4":
container:
args:
- pyflyte-fast-execute
- '--additional-distribution'
- >-
s3://duploservices-kpmg-flyte-089910700032/duploservices-kpmg/development/DYHKDKBSRRSXUGKKLDNJ7ZWNKQ======/fast5d983a4338db9fbc2756c3d55bf29e4c.tar.gz
- '--dest-dir'
- /root
- '--'
- pyflyte-execute
- '--inputs'
- '{{.input}}'
- '--output-prefix'
- '{{.outputPrefix}}'
- '--raw-output-data-prefix'
- '{{.rawOutputDataPrefix}}'
- '--checkpoint-path'
- '{{.checkpointOutputPrefix}}'
- '--prev-checkpoint'
- '{{.prevCheckpointPrefix}}'
- '--resolver'
- flytekit.core.python_auto_container.default_task_resolver
- '--'
- task-module
- auto_fine_tuning.data.finetuning_data
- task-name
- generate_finetuning_data
image: registry.uniphore.com/uniphore-manual/instruction-dataset:v-5
resources:
limits:
- name: CPU
value: 1500m
- name: MEMORY
value: 4000Mi
requests:
- name: CPU
value: 1500m
- name: MEMORY
value: 4000Mi
id:
domain: development
name: auto_fine_tuning.data.finetuning_data.generate_finetuning_data
project: duploservices-kpmg
resourceType: TASK
version: ft1.4
interface:
inputs: {}
outputs: {}
metadata:
retries: {}
runtime:
flavor: python
type: FLYTE_SDK
version: 1.13.5
securityContext:
secrets:
- group: hugging-face-token
key: auth_token
type: python-task
resource_type:TASK project:"duploservices-kpmg" domain:"development" name:"auto_fine_tuning.dataset_preparation.combine_dataset.combined_dataset_preparation" version:"ft1.4":
container:
args:
- pyflyte-fast-execute
- '--additional-distribution'
- >-
s3://duploservices-kpmg-flyte-089910700032/duploservices-kpmg/development/DYHKDKBSRRSXUGKKLDNJ7ZWNKQ======/fast5d983a4338db9fbc2756c3d55bf29e4c.tar.gz
- '--dest-dir'
- /root
- '--'
- pyflyte-execute
- '--inputs'
- '{{.input}}'
- '--output-prefix'
- '{{.outputPrefix}}'
- '--raw-output-data-prefix'
- '{{.rawOutputDataPrefix}}'
- '--checkpoint-path'
- '{{.checkpointOutputPrefix}}'
- '--prev-checkpoint'
- '{{.prevCheckpointPrefix}}'
- '--resolver'
- flytekit.core.python_auto_container.default_task_resolver
- '--'
- task-module
- auto_fine_tuning.dataset_preparation.combine_dataset
- task-name
- combined_dataset_preparation
image: registry.uniphore.com/uniphore-manual/instruction-dataset:v-5
resources:
limits:
- name: CPU
value: 1500m
- name: MEMORY
value: 4000Mi
requests:
- name: CPU
value: 1500m
- name: MEMORY
value: 4000Mi
id:
domain: development
name: >-
auto_fine_tuning.dataset_preparation.combine_dataset.combined_dataset_preparation
project: duploservices-kpmg
resourceType: TASK
version: ft1.4
interface:
inputs: {}
outputs:
variables:
o0:
type:
simple: STRING
metadata:
retries: {}
runtime:
flavor: python
type: FLYTE_SDK
version: 1.13.5
securityContext:
secrets:
- group: hugging-face-token
key: auth_token
type: python-task
resource_type:TASK project:"duploservices-kpmg" domain:"development" name:"auto_fine_tuning.fine_tuning.fine_tune.fine_tune" version:"ft1.4":
container:
args:
- pyflyte-fast-execute
- '--additional-distribution'
- >-
s3://duploservices-kpmg-flyte-089910700032/duploservices-kpmg/development/DYHKDKBSRRSXUGKKLDNJ7ZWNKQ======/fast5d983a4338db9fbc2756c3d55bf29e4c.tar.gz
- '--dest-dir'
- /root
- '--'
- pyflyte-execute
- '--inputs'
- '{{.input}}'
- '--output-prefix'
- '{{.outputPrefix}}'
- '--raw-output-data-prefix'
- '{{.rawOutputDataPrefix}}'
- '--checkpoint-path'
- '{{.checkpointOutputPrefix}}'
- '--prev-checkpoint'
- '{{.prevCheckpointPrefix}}'
- '--resolver'
- flytekit.core.python_auto_container.default_task_resolver
- '--'
- task-module
- auto_fine_tuning.fine_tuning.fine_tune
- task-name
- fine_tune
image: registry.uniphore.com/uniphore-manual/auto-fine-tuning:v-53
resources:
limits:
- name: CPU
value: '3'
- name: MEMORY
value: 14000Mi
requests:
- name: CPU
value: '3'
- name: MEMORY
value: 14000Mi
id:
domain: development
name: auto_fine_tuning.fine_tuning.fine_tune.fine_tune
project: duploservices-kpmg
resourceType: TASK
version: ft1.4
interface:
inputs:
variables:
job_id:
type:
simple: STRING
outputs: {}
metadata:
retries: {}
runtime:
flavor: python
type: FLYTE_SDK
version: 1.13.5
securityContext:
secrets:
- group: hugging-face-token
key: auth_token
type: python-task
workflowMeta:
eventVersion: 2
jolly-florist-66728
08/14/2025, 1:18 PMjolly-florist-66728
08/14/2025, 1:19 PM
The key error in the workflow status is AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity. This means FlytePropeller (or the pod running the workflow) is unable to assume the IAM role via IRSA (IAM Roles for Service Accounts), so it cannot write to S3.
To resolve this, check the following:
1. The Kubernetes service account used by FlytePropeller (and/or task pods) must be annotated with the correct IAM role ARN (e.g., `eks.amazonaws.com/role-arn`).
2. The IAM role trust policy must allow the EKS OIDC provider and the service account as a principal.
3. The IAM role must have permissions for sts:AssumeRoleWithWebIdentity and S3 access.
You can check the service account annotation with:
kubectl get serviceaccount <propeller-sa> -n <namespace> -o yaml
and verify the IAM role trust policy in the AWS console.
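You can also confirm that IRSA credentials are actually injected into the FlytePropeller pod (the pod name below is a placeholder; these env vars are added by the EKS webhook when the annotation is picked up):
# If these vars are missing, the pod likely predates the annotation and needs a restart.
kubectl exec -n <namespace> <flytepropeller-pod> -- env | grep -E 'AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE'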
Would you like step-by-step instructions for checking and fixing these settings?
Sources:
- https://github.com/flyteorg/flyte/issues/2417
victorious-jelly-28729
08/14/2025, 1:21 PM
apiVersion: v1
kind: ServiceAccount
metadata:
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::089910700032:role/duploservices-kpmg
meta.helm.sh/release-name: flyte
meta.helm.sh/release-namespace: duploservices-kpmg
creationTimestamp: "2025-07-10T19:54:47Z"
labels:
app.kubernetes.io/instance: flyte
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: flytepropeller
helm.sh/chart: flyte-core-v1.14.1
name: flytepropeller
namespace: duploservices-kpmg
resourceVersion: "75821561"
uid: 1aa7433b-2f9b-4838-844d-47ea3517d575
jolly-florist-66728
08/14/2025, 1:21 PMjolly-florist-66728
08/14/2025, 1:25 PM
Your flytepropeller service account is annotated with the IAM role ARN (eks.amazonaws.com/role-arn: arn:aws:iam::089910700032:role/duploservices-kpmg). This is required for IRSA to work, allowing the pod to assume the specified IAM role and access AWS resources. If you are still seeing AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity, the issue is likely with the IAM role's trust policy or permissions.
, the issue is likely with the IAM role's trust policy or permissions.
Ensure the IAM role trust policy allows the EKS OIDC provider and the service account as a principal, and that the role has the necessary S3 and STS permissions. Would you like step-by-step instructions to verify and update the IAM role trust policy?
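As a rough sketch (the cluster name is a placeholder), you can compare the cluster's OIDC issuer against the Federated principal and sub condition in the role's trust policy:
aws eks describe-cluster --name <cluster-name> --query 'cluster.identity.oidc.issuer' --output text
# The trust policy's StringEquals condition should map <issuer-host-and-path>:sub
# to system:serviceaccount:duploservices-kpmg:flytepropeller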
Sources:
- Flyte core Helm values
victorious-jelly-28729
08/14/2025, 1:38 PMjolly-florist-66728
08/14/2025, 1:38 PMjolly-florist-66728
08/14/2025, 1:39 PM
FlytePropeller relies on IRSA via the service account annotation eks.amazonaws.com/role-arn. The actual token retrieval and refresh logic is managed by the AWS SDK; there is no explicit FlytePropeller code for fetching or refreshing WebIdentity tokens.
If you want to see where FlytePropeller interacts with AWS, you might look for S3 storage client initialization, but the token handling is abstracted by the SDK. Would you like more detail on where FlytePropeller initializes the S3 client or how the AWS SDK handles token refresh?
Sources:
- https://github.com/flyteorg/flyte/issues/5606
- https://github.com/flyteorg/flyte/issues/5012
Note: I wasn't able to find highly relevant data sources, but above are a few potentially relevant links.
victorious-jelly-28729
09/03/2025, 10:12 AMancient-wolf-19325
09/03/2025, 10:12 AMancient-wolf-19325
09/03/2025, 10:13 AM
To investigate, increase FlytePropeller's log verbosity (set logger.level: 4 or higher). Then, review the FlytePropeller logs for messages about pod deletions or node aborts. The logs will typically include the workflow execution ID, node ID, and the reason for termination. For more on debugging executions and viewing logs, see Debugging Executions.
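A minimal sketch, assuming a flyte-core install where the propeller configuration lives in a ConfigMap named flyte-propeller-config (names may differ in your deployment):
kubectl -n <namespace> edit configmap flyte-propeller-config    # set level: 4 under the logger section
kubectl -n <namespace> rollout restart deployment flytepropeller
kubectl -n <namespace> logs deployment/flytepropeller -f | grep -Ei 'abort|terminat|finaliz|delet'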
Would you like step-by-step instructions for changing the log level or examples of what to look for in the logs?
Sources:
- Debugging Executions