cuddly-engine-34540
02/28/2025, 11:57 AM
I set default-for-task-types > sensor: agent-service in values-eks-core.yaml, as I want to use the Sensor agent, as specified here in the Flyte documentation.
I can successfully execute pyflyte run --remote hello_world.py hello_world_wf
to submit a dummy workflow to my remote Flyte deployment on AWS, and it runs without issues.
However, I encounter a problem when running a workflow that includes a FileSensor task. The workflow itself remains in the RUNNING state, but the FileSensor task gets stuck in the QUEUED state indefinitely.
Has anyone encountered this issue before, or does anyone have suggestions on how to debug this?
Thanks!

cuddly-engine-34540
02/28/2025, 4:08 PM
from flytekit import task, workflow
from flytekit.sensor.file_sensor import FileSensor

sensor = FileSensor(name="test_file_sensor")

@task()
def t1():
    print("SUCCEEDED")

@workflow()
def wf():
    sensor(path="s3://<account-number-here>-flyte-sandbox-data/file4.txt") >> t1()

if __name__ == "__main__":
    wf()
average-finland-92144
02/28/2025, 4:58 PM
Did you enable agent-service?

cuddly-engine-34540
02/28/2025, 5:09 PM
02/28/2025, 5:09 PMenabled_plugins:
# -- Tasks specific configuration [structure](<https://pkg.go.dev/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/config#GetConfig>)
tasks:
# -- Plugins configuration, [structure](<https://pkg.go.dev/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/config#TaskPluginConfig>)
task-plugins:
# -- [Enabled Plugins](<https://pkg.go.dev/github.com/lyft/flyteplugins/go/tasks/config#Config>). Enable sagemaker*, athena if you install the backend
# plugins
enabled-plugins:
- container
- sidecar
- k8s-array
- agent-service
# - sagemaker_hyperparameter_tuning
# - sagemaker_custom_training
# - sagemaker_training
default-for-task-types:
container: container
sidecar: sidecar
container_array: k8s-array
sensor: agent-service
# sagemaker_custom_training_task: sagemaker_custom_training
# sagemaker_custom_training_job_task: sagemaker_custom_training
this is the only section relevant to check whether agent-service is enabled, right?
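For completeness, I understand propeller also needs to know where the agent service lives. A sketch of what I'd expect that section to look like (the endpoint is my assumption, pointing at the in-cluster flyteagent service; key names as I recall them from the Flyte agent setup docs):

plugins:
  agent-service:
    defaultAgent:
      endpoint: "dns:///flyteagent.flyte.svc.cluster.local:8000"
      insecure: true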
cuddly-engine-34540
03/03/2025, 3:37 PM

average-finland-92144
03/03/2025, 3:39 PM

cuddly-engine-34540
03/03/2025, 5:04 PM
# hello_world.py
from flytekit import task, workflow

@task
def say_hello() -> str:
    return "Hello, World!"

@workflow
def hello_world_wf() -> str:
    res = say_hello()
    return res
pyflyte run --remote hello_world.py hello_world_wf
The propeller logs give:
>> kubectl logs flytepropeller-96fd46f56-k598f -n flyte --all-containers
{"json":{"exec_id":"ap6nqwf4frx6pw6m9bfs","ns":"flytesnacks-development","routine":"worker-11"},"level":"warning","msg":"Workflow namespace[flytesnacks-development]/name[ap6nqwf4frx6pw6m9bfs] has already been terminated.","ts":"2025-03-03T16:11:40Z"}
{"json":{"exec_id":"ap6nqwf4frx6pw6m9bfs","ns":"flytesnacks-development","routine":"worker-12"},"level":"warning","msg":"Workflow namespace[flytesnacks-development]/name[ap6nqwf4frx6pw6m9bfs] has already been terminated.","ts":"2025-03-03T16:21:49Z"}
But the task with FileSensor gets stuck in the QUEUED state indefinitely:
# file_sensor_example.py
from flytekit import task, workflow
from flytekit.sensor.file_sensor import FileSensor

sensor = FileSensor(name="test_file_sensor")

@task()
def t1():
    print("SUCCEEDED")

@workflow()
def wf():
    sensor(path="s3://<account-number-here>-flyte-sandbox-data/file4.txt") >> t1()
pyflyte run --remote file_sensor_example.py wf
kubectl logs flytepropeller-96fd46f56-k598f -n flyte --all-containers
{"json":{"exec_id":"ap6nqwf4frx6pw6m9bfs","ns":"flytesnacks-development","routine":"worker-11"},"level":"warning","msg":"Workflow namespace[flytesnacks-development]/name[ap6nqwf4frx6pw6m9bfs] has already been terminated.","ts":"2025-03-03T16:11:40Z"}
{"json":{"exec_id":"ap6nqwf4frx6pw6m9bfs","ns":"flytesnacks-development","routine":"worker-12"},"level":"warning","msg":"Workflow namespace[flytesnacks-development]/name[ap6nqwf4frx6pw6m9bfs] has already been terminated.","ts":"2025-03-03T16:21:49Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18675","routine":"worker-13","tasktype":"sensor","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"warning","msg":"No plugin found for Handler-type [sensor], defaulting to [container]","ts":"2025-03-03T16:59:44Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18676","routine":"worker-13","tasktype":"sensor","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"warning","msg":"No plugin found for Handler-type [sensor], defaulting to [container]","ts":"2025-03-03T16:59:44Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18676","routine":"worker-13","tasktype":"sensor","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"warning","msg":"No plugin found for Handler-type [sensor], defaulting to [container]","ts":"2025-03-03T16:59:44Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18676","routine":"worker-13","tasktype":"sensor","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"warning","msg":"Runtime error from plugin [container]. Error: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration","ts":"2025-03-03T16:59:44Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18676","routine":"worker-13","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"error","msg":"handling parent node failed with error: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration","ts":"2025-03-03T16:59:44Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18676","routine":"worker-13","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"error","msg":"failed Execute for node. Error: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration","ts":"2025-03-03T16:59:44Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","ns":"flytesnacks-development","res_ver":"18676","routine":"worker-13","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"warning","msg":"Error in handling running workflow [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration]","ts":"2025-03-03T16:59:44Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","ns":"flytesnacks-development","res_ver":"18676","routine":"worker-13","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"error","msg":"Error when trying to reconcile workflow. Error [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration]. Error Type[*errors.NodeErrorWithCause]","ts":"2025-03-03T16:59:44Z"}
E0303 16:59:44.940228 1 workers.go:103] error syncing 'flytesnacks-development/awc4xggqmwm46vfx767f': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18677","routine":"worker-18","tasktype":"sensor","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"warning","msg":"No plugin found for Handler-type [sensor], defaulting to [container]","ts":"2025-03-03T16:59:54Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18677","routine":"worker-18","tasktype":"sensor","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"warning","msg":"No plugin found for Handler-type [sensor], defaulting to [container]","ts":"2025-03-03T16:59:54Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18677","routine":"worker-18","tasktype":"sensor","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"warning","msg":"Runtime error from plugin [container]. Error: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration","ts":"2025-03-03T16:59:54Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18677","routine":"worker-18","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"error","msg":"handling parent node failed with error: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration","ts":"2025-03-03T16:59:54Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18677","routine":"worker-18","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"error","msg":"failed Execute for node. Error: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration","ts":"2025-03-03T16:59:54Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","ns":"flytesnacks-development","res_ver":"18677","routine":"worker-18","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"warning","msg":"Error in handling running workflow [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration]","ts":"2025-03-03T16:59:54Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","ns":"flytesnacks-development","res_ver":"18677","routine":"worker-18","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"error","msg":"Error when trying to reconcile workflow. Error [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration]. Error Type[*errors.NodeErrorWithCause]","ts":"2025-03-03T16:59:54Z"}
E0303 16:59:54.875567 1 workers.go:103] error syncing 'flytesnacks-development/awc4xggqmwm46vfx767f': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration
cuddly-engine-34540
03/03/2025, 5:11 PM
No plugin found for Handler-type [sensor], defaulting to [container]
Should I have some sensor agent pod here?
>> kubectl get pods -n flyte
NAME READY STATUS RESTARTS AGE
datacatalog-75d79dd9c8-bqmzm 1/1 Running 0 66m
datacatalog-75d79dd9c8-xg5r9 1/1 Running 0 66m
flyte-pod-webhook-7bdb957bcb-wg6tm 1/1 Running 0 62m
flyteadmin-95d4d9cd4-hh4kq 1/1 Running 0 66m
flyteadmin-95d4d9cd4-nqz29 1/1 Running 0 66m
flyteconsole-5d89dd4d65-j9nd6 1/1 Running 0 66m
flyteconsole-5d89dd4d65-vhsc9 1/1 Running 0 66m
flytepropeller-96fd46f56-fqslq 1/1 Running 0 47m
flytepropeller-96fd46f56-k598f 1/1 Running 0 61m
flytescheduler-864c49d598-2gxxn 1/1 Running 0 66m
syncresources-dcfd89b-jx8dn 1/1 Running 0 66m
cuddly-engine-34540
03/03/2025, 5:12 PM

average-finland-92144
03/03/2025, 5:36 PM
Is it set to true?
https://github.com/unionai-oss/deploy-flyte/blob/018adaa25921d20783be4e90d6c5bb821873ad3c/environments/azure/flyte-core/values-aks.yaml#L120-L121

cuddly-engine-34540
03/03/2025, 5:46 PM
It was false! Let me set it to true and try again.
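If I read the chart from that link correctly, that's this toggle in my values file (a minimal sketch, assuming the flyte-core key layout):

flyteagent:
  enabled: true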
cuddly-engine-34540
03/03/2025, 5:57 PM
failed to get task from agent with rpc error: code = Internal desc = failed to get sensor task with error:
Trace:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/s3fs/core.py", line 114, in _error_wrapper
return await func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/aiobotocore/client.py", line 412, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/flytekit/extend/backend/agent_service.py", line 102, in wrapper
res = await func(self, request, context, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/flytekit/extend/backend/agent_service.py", line 136, in GetTask
res = await mirror_async_methods(agent.get, resource_meta=agent.metadata_type.decode(request.resource_meta))
File "/usr/local/lib/python3.10/site-packages/flytekit/sensor/sensor_engine.py", line 40, in get
if await sensor_def("sensor", config=resource_meta.sensor_config).poke(**inputs)
File "/usr/local/lib/python3.10/site-packages/flytekit/sensor/file_sensor.py", line 13, in poke
return await fs._exists(path)
File "/usr/local/lib/python3.10/site-packages/s3fs/core.py", line 1072, in _exists
await self._info(path, bucket, key, version_id=version_id)
File "/usr/local/lib/python3.10/site-packages/s3fs/core.py", line 1426, in _info
out = await self._call_s3(
File "/usr/local/lib/python3.10/site-packages/s3fs/core.py", line 371, in _call_s3
return await _error_wrapper(
File "/usr/local/lib/python3.10/site-packages/s3fs/core.py", line 146, in _error_wrapper
raise err
PermissionError: Forbidden
Message:
PermissionError: Forbidden.
- Is it some IAM permission issue?
As shown in the snippet of file_sensor_example.py above, the sensor is reading s3://<account-number-here>-flyte-sandbox-data/file4.txt
@workflow()
def wf():
    sensor(path="s3://<account-number-here>-flyte-sandbox-data/file4.txt") >> t1()
Since this was the s3 bucket created by the terraform deploy-flyte in this file, I was expecting that the task in Flyte would already have permission to read the files in this bucket.
What am I missing here?

cuddly-engine-34540
03/03/2025, 6:19 PM
To use HEAD, you must have the s3:GetObject permission.
Here in Terraform we already grant "s3:GetObject*":
https://github.com/unionai-oss/deploy-flyte/blob/main/environments/aws/flyte-core/iam.tf#L19
data "aws_iam_policy_document" "flyte_data_bucket_policy" {
statement {
sid = ""
effect = "Allow"
actions = [
"s3:DeleteObject*",
"s3:GetObject*",
"s3:ListBucket",
"s3:PutObject*"
]
resources = [
"arn:aws:s3:::${module.flyte_data.s3_bucket_id}",
"arn:aws:s3:::${module.flyte_data.s3_bucket_id}/*"
]
}
}
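As a sanity check, the sensor's poke can be reproduced locally with s3fs, the same library shown in the traceback above. A minimal sketch, assuming credentials for the role under test are available in the environment:

# check_sensor_path.py (hypothetical local reproduction of FileSensor's check)
import s3fs

fs = s3fs.S3FileSystem()
# FileSensor.poke awaits fs._exists(path); exists() is the synchronous equivalent
print(fs.exists("s3://<account-number-here>-flyte-sandbox-data/file4.txt"))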
average-finland-92144
03/03/2025, 8:02 PM

average-finland-92144
03/03/2025, 8:04 PM

cuddly-engine-34540
03/04/2025, 9:23 AM
I ran kubectl describe pod flyteagent-755fc4fc8c-cgrxd -n flyte, which shows no IAM role annotation for the service account:
>> kubectl describe pod flyteagent-755fc4fc8c-cgrxd -n flyte
Name: flyteagent-755fc4fc8c-cgrxd
Namespace: flyte
Priority: 0
Service Account: flyteagent
Node: ip-10-3-143-247.eu-west-1.compute.internal/10.3.143.247
Start Time: Tue, 04 Mar 2025 06:15:04 -0300
Labels: app.kubernetes.io/instance=flyte-coretf
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=flyteagent
helm.sh/chart=flyteagent-v1.15.0
pod-template-hash=755fc4fc8c
Annotations: <none>
Status: Running
IP: 10.3.83.189
IPs:
IP: 10.3.83.189
Controlled By: ReplicaSet/flyteagent-755fc4fc8c
Containers:
flyteagent:
Container ID: containerd://8b9fbeb04270259b1256a6ec575eb9ee11a660db8fe1133ca0db44638702a210
Image: cr.flyte.org/flyteorg/flyteagent-release:v1.15.0
Image ID: cr.flyte.org/flyteorg/flyteagent-release@sha256:8e8dc10b7f02015fe0391053f6032bf3a4bffc5d56b6144428de52f144bebe9f
Port: 8000/TCP
Host Port: 0/TCP
Command:
pyflyte
serve
agent
State: Running
Started: Tue, 04 Mar 2025 06:15:27 -0300
Ready: True
Restart Count: 0
Limits:
cpu: 500m
ephemeral-storage: 200Mi
memory: 300Mi
Requests:
cpu: 500m
ephemeral-storage: 200Mi
memory: 200Mi
Readiness: grpc <pod>:8000 delay=1s timeout=1s period=3s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/secrets from flyteagent (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-npb85 (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
flyteagent:
Type: Secret (a volume populated by a Secret)
SecretName: flyteagent
Optional: false
kube-api-access-npb85:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m54s default-scheduler Successfully assigned flyte/flyteagent-755fc4fc8c-cgrxd to ip-10-3-143-247.eu-west-1.compute.internal
Normal Pulling 5m52s kubelet Pulling image "cr.flyte.org/flyteorg/flyteagent-release:v1.15.0"
Normal Pulled 5m31s kubelet Successfully pulled image "cr.flyte.org/flyteorg/flyteagent-release:v1.15.0" in 21.484s (21.484s including waiting). Image size: 759260708 bytes.
Normal Created 5m31s kubelet Created container flyteagent
Normal Started 5m31s kubelet Started container flyteagent
Warning Unhealthy 5m19s (x6 over 5m29s) kubelet Readiness probe failed: timeout: failed to connect service "10.3.83.189:8000" within 1s: context deadline exceeded
I also ran kubectl describe deployment flyteagent -n flyte, which gives `Service Account: flyteagent`:
>> kubectl describe deployment flyteagent -n flyte
Name: flyteagent
Namespace: flyte
CreationTimestamp: Tue, 04 Mar 2025 06:15:04 -0300
Labels: app.kubernetes.io/instance=flyte-coretf
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=flyteagent
helm.sh/chart=flyteagent-v1.15.0
Annotations: deployment.kubernetes.io/revision: 1
meta.helm.sh/release-name: flyte-coretf
meta.helm.sh/release-namespace: flyte
Selector: app.kubernetes.io/instance=flyte-coretf,app.kubernetes.io/name=flyteagent
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app.kubernetes.io/instance=flyte-coretf
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=flyteagent
helm.sh/chart=flyteagent-v1.15.0
Service Account: flyteagent
Containers:
flyteagent:
Image: cr.flyte.org/flyteorg/flyteagent-release:v1.15.0
Port: 8000/TCP
Host Port: 0/TCP
Command:
pyflyte
serve
agent
Limits:
cpu: 500m
ephemeral-storage: 200Mi
memory: 300Mi
Requests:
cpu: 500m
ephemeral-storage: 200Mi
memory: 200Mi
Readiness: grpc <pod>:8000 delay=1s timeout=1s period=3s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/secrets from flyteagent (rw)
Volumes:
flyteagent:
Type: Secret (a volume populated by a Secret)
SecretName: flyteagent
Optional: false
Node-Selectors: <none>
Tolerations: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: flyteagent-755fc4fc8c (1/1 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 10m deployment-controller Scaled up replica set flyteagent-755fc4fc8c to 1
cuddly-engine-34540
03/04/2025, 9:42 AM
>> kubectl get serviceaccounts -n flyte
NAME SECRETS AGE
datacatalog 0 21m
default 0 21m
flyte-pod-webhook 0 21m
flyteadmin 0 21m
flyteagent 0 21m
flytepropeller 0 21m
flytescheduler 0 21m
kubectl get serviceaccount flyteagent -n flyte -o yaml reveals that this SA does not receive any IAM role annotation:
>> kubectl get serviceaccount flyteagent -n flyte -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    meta.helm.sh/release-name: flyte-coretf
    meta.helm.sh/release-namespace: flyte
  creationTimestamp: "2025-03-04T09:15:02Z"
  labels:
    app.kubernetes.io/instance: flyte-coretf
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flyteagent
    helm.sh/chart: flyteagent-v1.15.0
  name: flyteagent
  namespace: flyte
  resourceVersion: "2117"
  uid: da4d2648-c655-4e97-8471-a948d1132578
In comparison, we see that the serviceaccount flyteadmin has the Role role/flyte-sandbox-backend-role that was created by Terraform:
kubectl get serviceaccount flyteadmin -n flyte -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::484907521551:role/flyte-sandbox-backend-role
    meta.helm.sh/release-name: flyte-coretf
    meta.helm.sh/release-namespace: flyte
  creationTimestamp: "2025-03-04T09:15:02Z"
  labels:
    app.kubernetes.io/instance: flyte-coretf
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flyteadmin
    helm.sh/chart: flyte-core-v1.15.0
  name: flyteadmin
  namespace: flyte
  resourceVersion: "2120"
  uid: 6dd6fde3-02c8-4ffb-9f44-acc12144450a
@average-finland-92144 any suggestion on how to associate this role, which we know already has access to the bucket, with the serviceaccount flyteagent?

average-finland-92144
03/04/2025, 3:48 PM
1. The flyteagent SA gets its annotations from a base field that is empty by default:
https://github.com/flyteorg/flyte/blob/6e5aca7016a067ed7c4458c2f35951013d2e390e/charts/flyteagent/values.yaml#L54
2. I think we can use the same role flytepropeller is using (flyte-sandbox-backend-role in your case), but the Trust Relationship still needs to be established.
So, to fix this you should be able to:
a. Look at the installed Helm chart for the flyteagent (helm ls -n flyte), create a simple values-override.yaml file with something like this:
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::484907521551:role/flyte-sandbox-backend-role
Then run a helm upgrade flyteorg/flyteagent flyteagent -n flyte --values values-override.yaml to upgrade the deployment to set that annotation. (You can edit the SA too, but this is more durable.)
b. Add flyte:flyteagent to this list, save, run terraform plan and apply, and it should add it to the trust relationship.

cuddly-engine-34540
03/05/2025, 9:14 AM
I added flyte:flyteagent to this list and ran terraform apply:
namespace_service_accounts = ["flyte:flytepropeller", "flyte:flyteadmin", "flyte:datacatalog", "flyte:flyteagent"]
But it seems that it didn't fix the issue, since the service account still doesn't have the annotation:
>> kubectl get serviceaccount flyteagent -n flyte -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    meta.helm.sh/release-name: flyte-coretf
    meta.helm.sh/release-namespace: flyte
  creationTimestamp: "2025-03-05T08:34:51Z"
  labels:
    app.kubernetes.io/instance: flyte-coretf
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flyteagent
    helm.sh/chart: flyteagent-v1.15.0
  name: flyteagent
  namespace: flyte
  resourceVersion: "2120"
  uid: 2a71093a-4fe8-466c-b43b-4d852186c6d9
I also tried to add flyteagent to this list (although I couldn't find it being used by terraform anywhere else)
flyte_backend_ksas = ["flytepropeller", "flyteadmin", "datacatalog", "flyteagent"]
But it also didn't add the IAM Role annotation to the flyteagent service account.

average-finland-92144
03/05/2025, 3:58 PM
flyteagent is a standalone Helm chart that gets deployed when you set flyteagent.enabled: true in the other charts (flyte-core for example), so it's not managed by Terraform.
If you added the flyteagent SA to the list, please validate the Trust Relationship in IAM for the backend role, that KSA should be there.
The annotation is controlled by Helm unless we create a new module to handle this in the reference implementation.

cuddly-engine-34540
03/05/2025, 4:50 PM
> If you added the flyteagent SA to the list, please validate the Trust Relationship in IAM for the backend role, that KSA should be there.
I confirm this is true. After adding the flyteagent SA to the list, it appears in the Trust Relationship:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::484907521551:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/E57C7E3E82823040800D9C5778AA7E04"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.eu-west-1.amazonaws.com/id/E57C7E3E82823040800D9C5778AA7E04:sub": [
            "system:serviceaccount:flyte:flytepropeller",
            "system:serviceaccount:flyte:flyteadmin",
            "system:serviceaccount:flyte:datacatalog",
            "system:serviceaccount:flyte:flyteagent"
          ],
          "oidc.eks.eu-west-1.amazonaws.com/id/E57C7E3E82823040800D9C5778AA7E04:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
Thanks for the suggestions, I'm going to do a. and get back here.

cuddly-engine-34540
03/05/2025, 5:25 PM
>> helm ls -n flyte
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
flyte-coretf flyte 1 2025-03-05 13:42:04.463322455 -0300 -03 deployed flyte-core-v1.15.0
From what you said above, I was expecting to see some kind of flyteagent chart in this list, but it does not show up!
Nonetheless I continued with your suggestion and created values-override.yaml:
# values-override.yaml
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::484907521551:role/flyte-sandbox-backend-role
Ran the command you suggested but got error:
helm upgrade flyteorg/flyteagent flyteagent -n flyte --values values-override.yaml
Error: non-absolute URLs should be in form of repo_name/path_to_chart, got: flyteagent
@average-finland-92144 sorry, first-time helm user here. Any suggestion on how to proceed?

cuddly-engine-34540
03/05/2025, 6:12 PM
kubectl annotate serviceaccount flyteagent -n flyte eks.amazonaws.com/role-arn=arn:aws:iam::484907521551:role/flyte-sandbox-backend-role
Which seems to have added the correct annotation to the flyteagent service account:
>> kubectl get serviceaccount flyteagent -n flyte -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::484907521551:role/flyte-sandbox-backend-role
    meta.helm.sh/release-name: flyte-coretf
    meta.helm.sh/release-namespace: flyte
  creationTimestamp: "2025-03-05T16:42:39Z"
  labels:
    app.kubernetes.io/instance: flyte-coretf
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flyteagent
    helm.sh/chart: flyteagent-v1.15.0
  name: flyteagent
  namespace: flyte
  resourceVersion: "24839"
  uid: 20d58dad-8e01-48c3-ae84-b04e8b0636e9
Unfortunately, even so, I still receive the same PermissionError: Forbidden
(full error log already sent above) when trying to run the FileSensor task
So, just to organize things
• the issue in the last message is about how to add the annotation to the SA in a more durable way
• the issue in this message is that, even with this annotation added via kubectl annotate serviceaccount, it seems that this does not fix the original issue with the permission for the FileSensor task to read the bucket

average-finland-92144
03/05/2025, 7:34 PM
> Ran the command you suggested but got error:
Sorry, I think ordering matters here; the release name should come first, then the chart, so:
helm upgrade flyteagent flyteorg/flyteagent -n flyte --values values-override.yaml
cuddly-engine-34540
03/06/2025, 8:21 AM
helm ls -n flyte doesn't show any flyteagent chart.
I believe this might be why your last suggestion didn’t work:
>> helm upgrade flyteagent flyteorg/flyteagent -n flyte --values values-override.yaml
Error: repo flyteorg not found
>> helm repo add flyteorg https://helm.flyte.org
"flyteorg" has been added to your repositories
>> helm repo update
...
>> helm upgrade flyteagent flyteorg/flyteagent -n flyte --values values-override.yaml
Error: UPGRADE FAILED: "flyteagent" has no deployed releases
Based on what you mentioned (quoted below), I was expecting to see the flyteagent chart in helm ls -n flyte, but it's not there!
> flyteagent is a standalone Helm chart that gets deployed when you set flyteagent.enabled: true in the other charts (flyte-core for example), so it's not managed by Terraform.
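My guess is that flyteagent came in as a subchart of the flyte-coretf release (the Deployment labels show helm.sh/chart=flyteagent-v1.15.0 under release flyte-coretf), so maybe the annotation has to go through the flyte-core values instead. A sketch, assuming the subchart values nest under the flyteagent key:

# in values-eks-core.yaml (assumption: flyteagent is a flyte-core subchart)
flyteagent:
  enabled: true
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::484907521551:role/flyte-sandbox-backend-role

followed by a helm upgrade of the flyte-coretf release itself.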
Whenever you have a moment, I’d really appreciate it if you could take a look at my previous two messages—I believe they contain key details to help move our troubleshooting forward.
Thanks again for your time and support! 😃

average-finland-92144
03/06/2025, 4:37 PM
could you get logs from the flyteagent Pod?
Also it's strange that Helm won't show you the flyteagent chart. What's the output of helm ls -n flyte? According to the deployment description you sent, it should be there:
>> kubectl describe deployment flyteagent -n flyte
Name: flyteagent
Namespace: flyte
CreationTimestamp: Tue, 04 Mar 2025 06:15:04 -0300
Labels: app.kubernetes.io/instance=flyte-coretf
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=flyteagent
helm.sh/chart=flyteagent-v1.15.0
Anyways, that's only to set the SA annotations in a more durable way, but you already annotated it at least to unblock you
Let's see if the Pod logs show us something better.

cuddly-engine-34540
03/06/2025, 6:07 PM
> What's the output of helm ls -n flyte?
>> helm ls -n flyte
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
flyte-coretf flyte 1 2025-03-06 13:53:06.875303572 -0300 -03 deployed flyte-core-v1.15.0
It is really strange, I was expecting a flyteagent chart. I got a similar output to yours for kubectl describe deployment flyteagent -n flyte:
>> kubectl describe deployment flyteagent -n flyte
Name: flyteagent
Namespace: flyte
CreationTimestamp: Thu, 06 Mar 2025 13:53:43 -0300
Labels: app.kubernetes.io/instance=flyte-coretf
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=flyteagent
helm.sh/chart=flyteagent-v1.15.0
Annotations: deployment.kubernetes.io/revision: 1
meta.helm.sh/release-name: flyte-coretf
meta.helm.sh/release-namespace: flyte
...
> could you get logs from the flyteagent Pod?
These are the logs for the flyteagent Pod. When I run new FileSensor tasks, no lines are added to these logs!
>> kubectl logs flyteagent-755fc4fc8c-b6l2v -n flyte
🚀 Starting the agent service...
Starting up the server to expose the prometheus metrics...
Agent Metadata
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Agent Name ┃ Support Task Types ┃ Is Sync ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ Sensor │ sensor (v0) │ False │
│ MMCloud Agent │ mmcloud_task (v0) │ False │
│ Databricks Agent │ spark (v0) databricks (v0) │ False │
│ Airflow Agent │ airflow (v0) │ False │
│ SageMaker Endpoint Agent │ sagemaker-endpoint (v0) │ False │
│ Boto Agent │ boto (v0) │ True │
│ Bigquery Agent │ bigquery_query_job_task (v0) │ False │
│ K8s DataService Async Agent │ dataservicetask (v0) │ False │
│ OpenAI Batch Endpoint Agent │ openai-batch (v0) │ False │
│ ChatGPT Agent │ chatgpt (v0) │ True │
│ Snowflake Agent │ snowflake (v0) │ False │
└─────────────────────────────┴───────────────────────────────┴─────────┘
[2025-03-06T17:29:28.590+0000] {credentials.py:550} INFO - Found credentials from IAM Role: worker-on-demand-eks-node-group-2025030616415554400000000b
@average-finland-92144 From the last line above, is it correct to say that the relevant IAM Role for the flyteagent is worker-on-demand-eks-node-group-2025030616415554400000000b, and not the flyte-sandbox-backend-role that was altered when we added flyte:flyteagent to this list?
If so, how to proceed?

average-finland-92144
03/06/2025, 6:47 PM
You shared the Trust Relationship for flyte-sandbox-backend-role; what about the one for the `worker-on-demand...` role?

cuddly-engine-34540
03/06/2025, 6:48 PM
03/06/2025, 6:48 PM{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "EKSNodeAssumeRole",
"Effect": "Allow",
"Principal": {
"Service": "ec2.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
average-finland-92144
03/06/2025, 6:49 PM

cuddly-engine-34540
03/10/2025, 6:06 PM

cuddly-engine-34540
03/11/2025, 3:57 PM

average-finland-92144
03/11/2025, 11:14 PM
The flyteagent Pod is using the right ServiceAccount, which is now annotated with an IAM role with permissions. That, with the Trust Relationship we modified using Terraform, should be enough.
The suspicious part is that log that comes from botocore (ref), which is the library Flyte uses to interface with AWS S3.
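One quick check (a sketch; with IRSA, annotating the SA should make the EKS webhook inject these env vars into newly created Pods):

kubectl exec -n flyte deploy/flyteagent -- env | grep -E 'AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE'

If they're missing, the Pod probably predates the annotation and botocore falls back to the node's instance role.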
cuddly-engine-34540
03/12/2025, 12:36 PM
We have the flyteagent pod with the right service account:
>> kubectl get serviceaccount flyteagent -n flyte -o yaml
...
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::484907521551:role/flyte-sandbox-backend-role
...
The API call made by the flyteagent to S3 is NOT authenticated with the above role role/flyte-sandbox-backend-role.
It is authenticated with the role role/worker-on-demand-eks-node-group that appears in the flyteagent logs from boto, which is not right for that pod, as you said.
From the Cloud Trail log:
"eventSource": "s3.amazonaws.com",
"eventName": "HeadObject",
"errorCode": "AccessDenied",
"errorMessage": "User: arn:aws:sts::484907521551:assumed-role/worker-on-demand-eks-node-group-2025031211521468510000000c/i-03c0ed266bcdae53e is not authorized to perform: s3:GetObject on resource: \"arn:aws:s3:::484907521551-flyte-sandbox-data/file4.txt\" because no identity-based policy allows the s3:GetObject action",
"userAgent": "[aiobotocore/2.19.0 md/Botocore#1.36.3 ua/2.0 os/linux#6.1.128-136.201.amzn2023.x86_64 md/arch#x86_64 lang/python#3.10.16 md/pyimpl#CPython cfg/retry-mode#legacy botocore/1.36.3]",
"userIdentity": "{type=AssumedRole, principalid=<REDACT>:i-03c0ed266bcdae53e, arn=arn:aws:sts::484907521551:assumed-role/worker-on-demand-eks-node-group-2025031211521468510000000c/i-03c0ed266bcdae53e, accountid=484907521551, accesskeyid=<REDACT>, username=null, sessioncontext={attributes={creationdate=2025-03-12 12:02:11.000, mfaauthenticated=false}, sessionissuer={type=Role, principalid=<REDACT>, arn=arn:aws:iam::484907521551:role/worker-on-demand-eks-node-group-2025031211521468510000000c, accountid=484907521551, username=worker-on-demand-eks-node-group-2025031211521468510000000c}, webidfederationdata=null, sourceidentity=null, ec2roledelivery=2.0, ec2issuedinvpc=null, assumedroot=null}, invokedby=null, identityprovider=null, credentialid=null, onbehalfof=null, inscopeof=null}",
...
"additionalEventData": "{SignatureVersion=SigV4, CipherSuite=TLS_AES_128_GCM_SHA256, bytesTransferredIn=0, AuthenticationMethod=AuthHeader, x-amz-id-2=Wqo72BM3+aHV9QW5xg+W7h1w3kxNe9MO8cYAL5uUgBHGyGaQbDFrhtHF9JI2+0cAQFmjJ9sRRfc=, bytesTransferredOut=530}",
@average-finland-92144 So I guess the next step is figuring out why the flyteagent pod is authenticating the requests to S3 with the wrong role role/worker-on-demand-eks-node-group instead of the annotated role/flyte-sandbox-backend-role.
Do you think it's a problem with the chart (which chart?), or a problem with the deploy-flyte terraform?
What could be making the flyteagent pod able to authenticate with role/worker-on-demand-eks-node-group, which is not right for that pod, as you said?
What can we do to make sure that the service account annotation with role/flyte-sandbox-backend-role on the flyteagent pod takes precedence over this wrong role/worker-on-demand-eks-node-group auth?

cuddly-engine-34540
03/13/2025, 4:54 PM
After running
kubectl annotate serviceaccount flyteagent -n flyte eks.amazonaws.com/role-arn=arn:aws:iam::484907521551:role/flyte-sandbox-backend-role
and restarting the flyteagent pod, the FileSensor in flyteagent is able to check if the file is present in the S3 bucket. Problem solved!
@average-finland-92144 thank you very much for all the help

average-finland-92144
03/13/2025, 5:50 PM