# flyte-support
c
Subject: Issue with FileSensor Task Stuck in QUEUED State on AWS EKS Flyte Deployment
Hello! I successfully deployed Flyte on AWS EKS using the Terraform scripts from this repository. The only modification I made was adding
default-for-task-types>sensor: agent-service
in values-eks-core.yaml, as I want to use the Sensor agent, as specified here in the Flyte documentation. I can successfully execute
pyflyte run --remote hello_world.py hello_world_wf
to submit a dummy workflow to my remote Flyte deployment on AWS, and it runs without issues. However, I encounter a problem when running a workflow that includes a FileSensor task. The workflow itself remains in the RUNNING state, but the FileSensor task gets stuck in the QUEUED state indefinitely. Has anyone encountered this issue before, or does anyone have suggestions on how to debug this? Thanks!
This is the workflow that encounters the issue
Copy code
from flytekit import task, workflow
from flytekit.sensor.file_sensor import FileSensor


sensor = FileSensor(name="test_file_sensor")

@task()
def t1():
    print("SUCCEEDED")


@workflow()
def wf():
    sensor(path="s3://<account-number-here>-flyte-sandbox-data/file4.txt") >> t1()


if __name__ == "__main__":
    wf()
a
hey Marcus! Sorry, just to confirm, did you also enable the
agent-service
?
c
@average-finland-92144 I believe it was already enabled in the deploy-flyte repo, and I kept it. https://github.com/unionai-oss/deploy-flyte/blob/main/environments/aws/flyte-core/values-eks-core.yaml The enabled_plugins section of my values-eks-core.yaml is the following:
Copy code
enabled_plugins:
    # -- Tasks specific configuration [structure](https://pkg.go.dev/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/config#GetConfig)
    tasks:
      # -- Plugins configuration, [structure](https://pkg.go.dev/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/config#TaskPluginConfig)
      task-plugins:
        # -- [Enabled Plugins](https://pkg.go.dev/github.com/lyft/flyteplugins/go/tasks/config#Config). Enable sagemaker*, athena if you install the backend
        # plugins
        enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - agent-service
        #          - sagemaker_hyperparameter_tuning
        #          - sagemaker_custom_training
        #          - sagemaker_training
        default-for-task-types:
          container: container
          sidecar: sidecar
          container_array: k8s-array
          sensor: agent-service
  #          sagemaker_custom_training_task: sagemaker_custom_training
  #          sagemaker_custom_training_job_task: sagemaker_custom_training
this is the only section relevant to understand if
agent-service
is enabled, right?
@average-finland-92144 any suggestion on how to troubleshoot this?
a
Could you get logs from the flytepropeller Pod? That's the Flyte component that interacts with the Agent service and should capture if there's something preventing execution.
c
Thanks for the suggestion, @average-finland-92144! The simple hello world (without the sensor) runs successfully:
Copy code
# hello_world.py
from flytekit import task, workflow

@task
def say_hello() -> str:
    return "Hello, World!"

@workflow
def hello_world_wf() -> str:
    res = say_hello()
    return res
pyflyte run --remote hello_world.py hello_world_wf
The propeller logs give:
Copy code
>> kubectl logs flytepropeller-96fd46f56-k598f -n flyte --all-containers

{"json":{"exec_id":"ap6nqwf4frx6pw6m9bfs","ns":"flytesnacks-development","routine":"worker-11"},"level":"warning","msg":"Workflow namespace[flytesnacks-development]/name[ap6nqwf4frx6pw6m9bfs] has already been terminated.","ts":"2025-03-03T16:11:40Z"}

{"json":{"exec_id":"ap6nqwf4frx6pw6m9bfs","ns":"flytesnacks-development","routine":"worker-12"},"level":"warning","msg":"Workflow namespace[flytesnacks-development]/name[ap6nqwf4frx6pw6m9bfs] has already been terminated.","ts":"2025-03-03T16:21:49Z"}
But the task with FileSensor gets stuck in the QUEUED state indefinitely:
Copy code
# file_sensor_example.py
from flytekit import task, workflow
from flytekit.sensor.file_sensor import FileSensor


sensor = FileSensor(name="test_file_sensor")

@task()
def t1():
    print("SUCCEEDED")


@workflow()
def wf():
    sensor(path="s3://<account-number-here>-flyte-sandbox-data/file4.txt") >> t1()
pyflyte run --remote file_sensor_example.py wf
kubectl logs flytepropeller-96fd46f56-k598f -n flyte --all-containers
Copy code
{"json":{"exec_id":"ap6nqwf4frx6pw6m9bfs","ns":"flytesnacks-development","routine":"worker-11"},"level":"warning","msg":"Workflow namespace[flytesnacks-development]/name[ap6nqwf4frx6pw6m9bfs] has already been terminated.","ts":"2025-03-03T16:11:40Z"}
{"json":{"exec_id":"ap6nqwf4frx6pw6m9bfs","ns":"flytesnacks-development","routine":"worker-12"},"level":"warning","msg":"Workflow namespace[flytesnacks-development]/name[ap6nqwf4frx6pw6m9bfs] has already been terminated.","ts":"2025-03-03T16:21:49Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18675","routine":"worker-13","tasktype":"sensor","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"warning","msg":"No plugin found for Handler-type [sensor], defaulting to [container]","ts":"2025-03-03T16:59:44Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18676","routine":"worker-13","tasktype":"sensor","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"warning","msg":"No plugin found for Handler-type [sensor], defaulting to [container]","ts":"2025-03-03T16:59:44Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18676","routine":"worker-13","tasktype":"sensor","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"warning","msg":"No plugin found for Handler-type [sensor], defaulting to [container]","ts":"2025-03-03T16:59:44Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18676","routine":"worker-13","tasktype":"sensor","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"warning","msg":"Runtime error from plugin [container]. Error: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration","ts":"2025-03-03T16:59:44Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18676","routine":"worker-13","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"error","msg":"handling parent node failed with error: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration","ts":"2025-03-03T16:59:44Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18676","routine":"worker-13","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"error","msg":"failed Execute for node. Error: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration","ts":"2025-03-03T16:59:44Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","ns":"flytesnacks-development","res_ver":"18676","routine":"worker-13","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"warning","msg":"Error in handling running workflow [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration]","ts":"2025-03-03T16:59:44Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","ns":"flytesnacks-development","res_ver":"18676","routine":"worker-13","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"error","msg":"Error when trying to reconcile workflow. Error [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration]. Error Type[*errors.NodeErrorWithCause]","ts":"2025-03-03T16:59:44Z"}
E0303 16:59:44.940228       1 workers.go:103] error syncing 'flytesnacks-development/awc4xggqmwm46vfx767f': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18677","routine":"worker-18","tasktype":"sensor","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"warning","msg":"No plugin found for Handler-type [sensor], defaulting to [container]","ts":"2025-03-03T16:59:54Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18677","routine":"worker-18","tasktype":"sensor","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"warning","msg":"No plugin found for Handler-type [sensor], defaulting to [container]","ts":"2025-03-03T16:59:54Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18677","routine":"worker-18","tasktype":"sensor","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"warning","msg":"Runtime error from plugin [container]. Error: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration","ts":"2025-03-03T16:59:54Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18677","routine":"worker-18","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"error","msg":"handling parent node failed with error: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration","ts":"2025-03-03T16:59:54Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","node":"n0","ns":"flytesnacks-development","res_ver":"18677","routine":"worker-18","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"error","msg":"failed Execute for node. Error: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration","ts":"2025-03-03T16:59:54Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","ns":"flytesnacks-development","res_ver":"18677","routine":"worker-18","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"warning","msg":"Error in handling running workflow [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration]","ts":"2025-03-03T16:59:54Z"}
{"json":{"exec_id":"awc4xggqmwm46vfx767f","ns":"flytesnacks-development","res_ver":"18677","routine":"worker-18","wf":"flytesnacks:development:workflows.file_sensor_example.wf"},"level":"error","msg":"Error when trying to reconcile workflow. Error [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration]. Error Type[*errors.NodeErrorWithCause]","ts":"2025-03-03T16:59:54Z"}
E0303 16:59:54.875567       1 workers.go:103] error syncing 'flytesnacks-development/awc4xggqmwm46vfx767f': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] invalid TaskSpecification, unable to determine Pod configuration
Maybe this part of the logs suggests that FlytePropeller is unaware of the Sensor agent, or that the sensor agent is not running
No plugin found for Handler-type [sensor], defaulting to [container]
Should I see a sensor agent Pod here?
Copy code
>> kubectl get pods -n flyte
NAME                                 READY   STATUS    RESTARTS   AGE
datacatalog-75d79dd9c8-bqmzm         1/1     Running   0          66m
datacatalog-75d79dd9c8-xg5r9         1/1     Running   0          66m
flyte-pod-webhook-7bdb957bcb-wg6tm   1/1     Running   0          62m
flyteadmin-95d4d9cd4-hh4kq           1/1     Running   0          66m
flyteadmin-95d4d9cd4-nqz29           1/1     Running   0          66m
flyteconsole-5d89dd4d65-j9nd6        1/1     Running   0          66m
flyteconsole-5d89dd4d65-vhsc9        1/1     Running   0          66m
flytepropeller-96fd46f56-fqslq       1/1     Running   0          47m
flytepropeller-96fd46f56-k598f       1/1     Running   0          61m
flytescheduler-864c49d598-2gxxn      1/1     Running   0          66m
syncresources-dcfd89b-jx8dn          1/1     Running   0          66m
How can I make sure that sensor agent is running and FlytePropeller is aware of the Sensor agent?
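(For reference, two quick checks sketched in bash; the resource names below assume the flyte-core chart defaults and may differ in other setups:)
Copy code
# 1. Is there an agent Deployment at all in the flyte namespace?
kubectl get deployment flyteagent -n flyte

# 2. Does the rendered propeller config route "sensor" tasks to agent-service?
#    (ConfigMap name assumed from flyte-core defaults)
kubectl get configmap flyte-propeller-config -n flyte -o yaml | grep -B 2 -A 6 agent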
a
yeah, I feel we're missing something that should be better documented • Have you set this to
true
? https://github.com/unionai-oss/deploy-flyte/blob/018adaa25921d20783be4e90d6c5bb821873ad3c/environments/azure/flyte-core/values-aks.yaml#L120-L121
c
no, it's set to
false
! Let me set it to
true
and try again
@average-finland-92144 it worked! The sensor task is not stuck in QUEUED anymore; it now enters the RUNNING state. Unfortunately, it eventually FAILED after a couple of minutes:
Copy code
failed to get task from agent with rpc error: code = Internal desc = failed to get sensor task with error:
 Trace:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/site-packages/s3fs/core.py", line 114, in _error_wrapper
        return await func(*args, **kwargs)
      File "/usr/local/lib/python3.10/site-packages/aiobotocore/client.py", line 412, in _make_api_call
        raise error_class(parsed_response, operation_name)
    botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/site-packages/flytekit/extend/backend/agent_service.py", line 102, in wrapper
        res = await func(self, request, context, *args, **kwargs)
      File "/usr/local/lib/python3.10/site-packages/flytekit/extend/backend/agent_service.py", line 136, in GetTask
        res = await mirror_async_methods(agent.get, resource_meta=agent.metadata_type.decode(request.resource_meta))
      File "/usr/local/lib/python3.10/site-packages/flytekit/sensor/sensor_engine.py", line 40, in get
        if await sensor_def("sensor", config=resource_meta.sensor_config).poke(**inputs)
      File "/usr/local/lib/python3.10/site-packages/flytekit/sensor/file_sensor.py", line 13, in poke
        return await fs._exists(path)
      File "/usr/local/lib/python3.10/site-packages/s3fs/core.py", line 1072, in _exists
        await self._info(path, bucket, key, version_id=version_id)
      File "/usr/local/lib/python3.10/site-packages/s3fs/core.py", line 1426, in _info
        out = await self._call_s3(
      File "/usr/local/lib/python3.10/site-packages/s3fs/core.py", line 371, in _call_s3
        return await _error_wrapper(
      File "/usr/local/lib/python3.10/site-packages/s3fs/core.py", line 146, in _error_wrapper
        raise err
    PermissionError: Forbidden

Message:

    PermissionError: Forbidden.
PermissionError: Forbidden.
- Is it some IAM permission issue? As shown in the snippet of
file_sensor_example.py
above, the sensor is reading
s3://<account-number-here>-flyte-sandbox-data/file4.txt
Copy code
@workflow()
def wf():
    sensor(path="s3://<account-number-here>-flyte-sandbox-data/file4.txt") >> t1()
Since this is the S3 bucket created by the deploy-flyte Terraform in this file, I was expecting that the Flyte task would already have permission to read files in this bucket. What am I missing here?
From AWS docs: General purpose bucket permissions - To use
HEAD
, you must have the
s3:GetObject
permission. Here in Terraform we already grant "s3:GetObject*": https://github.com/unionai-oss/deploy-flyte/blob/main/environments/aws/flyte-core/iam.tf#L19
Copy code
data "aws_iam_policy_document" "flyte_data_bucket_policy" {
  statement {
    sid    = ""
    effect = "Allow"
    actions = [
      "s3:DeleteObject*",
      "s3:GetObject*",
      "s3:ListBucket",
      "s3:PutObject*"
    ]
    resources = [
      "arn:aws:s3:::${module.flyte_data.s3_bucket_id}",
      "arn:aws:s3:::${module.flyte_data.s3_bucket_id}/*"
    ]
  }
}
a
oh, but I don't think the Terraform in the repo accounts for the Agent deployment. Let me see
can you check the agent's service account? It should have the annotations in place but the trust relationship is not configured in IAM
c
@average-finland-92144 thanks for the response! I ran
kubectl describe pod flyteagent-755fc4fc8c-cgrxd -n flyte
, which shows the Service Account but no annotations:
Copy code
>> kubectl describe pod flyteagent-755fc4fc8c-cgrxd -n flyte
Name:             flyteagent-755fc4fc8c-cgrxd
Namespace:        flyte
Priority:         0
Service Account:  flyteagent
Node:             ip-10-3-143-247.eu-west-1.compute.internal/10.3.143.247
Start Time:       Tue, 04 Mar 2025 06:15:04 -0300
Labels:           app.kubernetes.io/instance=flyte-coretf
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=flyteagent
                  helm.sh/chart=flyteagent-v1.15.0
                  pod-template-hash=755fc4fc8c
Annotations:      <none>
Status:           Running
IP:               10.3.83.189
IPs:
  IP:           10.3.83.189
Controlled By:  ReplicaSet/flyteagent-755fc4fc8c
Containers:
  flyteagent:
    Container ID:  containerd://8b9fbeb04270259b1256a6ec575eb9ee11a660db8fe1133ca0db44638702a210
    Image:         cr.flyte.org/flyteorg/flyteagent-release:v1.15.0
    Image ID:      cr.flyte.org/flyteorg/flyteagent-release@sha256:8e8dc10b7f02015fe0391053f6032bf3a4bffc5d56b6144428de52f144bebe9f
    Port:          8000/TCP
    Host Port:     0/TCP
    Command:
      pyflyte
      serve
      agent
    State:          Running
      Started:      Tue, 04 Mar 2025 06:15:27 -0300
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                500m
      ephemeral-storage:  200Mi
      memory:             300Mi
    Requests:
      cpu:                500m
      ephemeral-storage:  200Mi
      memory:             200Mi
    Readiness:            grpc <pod>:8000  delay=1s timeout=1s period=3s #success=1 #failure=3
    Environment:          <none>
    Mounts:
      /etc/secrets from flyteagent (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-npb85 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  flyteagent:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  flyteagent
    Optional:    false
  kube-api-access-npb85:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  5m54s                  default-scheduler  Successfully assigned flyte/flyteagent-755fc4fc8c-cgrxd to ip-10-3-143-247.eu-west-1.compute.internal
  Normal   Pulling    5m52s                  kubelet            Pulling image "cr.flyte.org/flyteorg/flyteagent-release:v1.15.0"
  Normal   Pulled     5m31s                  kubelet            Successfully pulled image "cr.flyte.org/flyteorg/flyteagent-release:v1.15.0" in 21.484s (21.484s including waiting). Image size: 759260708 bytes.
  Normal   Created    5m31s                  kubelet            Created container flyteagent
  Normal   Started    5m31s                  kubelet            Started container flyteagent
  Warning  Unhealthy  5m19s (x6 over 5m29s)  kubelet            Readiness probe failed: timeout: failed to connect service "10.3.83.189:8000" within 1s: context deadline exceeded
I also ran
kubectl describe deployment flyteagent -n flyte
, which gives `Service Account: flyteagent`:
Copy code
>> kubectl describe deployment flyteagent -n flyte
Name:                   flyteagent
Namespace:              flyte
CreationTimestamp:      Tue, 04 Mar 2025 06:15:04 -0300
Labels:                 app.kubernetes.io/instance=flyte-coretf
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=flyteagent
                        helm.sh/chart=flyteagent-v1.15.0
Annotations:            deployment.kubernetes.io/revision: 1
                        meta.helm.sh/release-name: flyte-coretf
                        meta.helm.sh/release-namespace: flyte
Selector:               app.kubernetes.io/instance=flyte-coretf,app.kubernetes.io/name=flyteagent
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app.kubernetes.io/instance=flyte-coretf
                    app.kubernetes.io/managed-by=Helm
                    app.kubernetes.io/name=flyteagent
                    helm.sh/chart=flyteagent-v1.15.0
  Service Account:  flyteagent
  Containers:
   flyteagent:
    Image:      cr.flyte.org/flyteorg/flyteagent-release:v1.15.0
    Port:       8000/TCP
    Host Port:  0/TCP
    Command:
      pyflyte
      serve
      agent
    Limits:
      cpu:                500m
      ephemeral-storage:  200Mi
      memory:             300Mi
    Requests:
      cpu:                500m
      ephemeral-storage:  200Mi
      memory:             200Mi
    Readiness:            grpc <pod>:8000  delay=1s timeout=1s period=3s #success=1 #failure=3
    Environment:          <none>
    Mounts:
      /etc/secrets from flyteagent (rw)
  Volumes:
   flyteagent:
    Type:          Secret (a volume populated by a Secret)
    SecretName:    flyteagent
    Optional:      false
  Node-Selectors:  <none>
  Tolerations:     <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   flyteagent-755fc4fc8c (1/1 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  10m   deployment-controller  Scaled up replica set flyteagent-755fc4fc8c to 1
I found all the service accounts:
Copy code
>> kubectl get serviceaccounts -n flyte
NAME                SECRETS   AGE
datacatalog         0         21m
default             0         21m
flyte-pod-webhook   0         21m
flyteadmin          0         21m
flyteagent          0         21m
flytepropeller      0         21m
flytescheduler      0         21m
kubectl get serviceaccount flyteagent -n flyte -o yaml
reveals that this SA does not have any IAM role annotation:
Copy code
>> kubectl get serviceaccount flyteagent -n flyte -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    meta.helm.sh/release-name: flyte-coretf
    meta.helm.sh/release-namespace: flyte
  creationTimestamp: "2025-03-04T09:15:02Z"
  labels:
    app.kubernetes.io/instance: flyte-coretf
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flyteagent
    helm.sh/chart: flyteagent-v1.15.0
  name: flyteagent
  namespace: flyte
  resourceVersion: "2117"
  uid: da4d2648-c655-4e97-8471-a948d1132578
In comparison we see that
serviceaccount flyteadmin
has the Role
role/flyte-sandbox-backend-role
that was created by Terraform:
Copy code
kubectl get serviceaccount flyteadmin -n flyte -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::484907521551:role/flyte-sandbox-backend-role
    meta.helm.sh/release-name: flyte-coretf
    meta.helm.sh/release-namespace: flyte
  creationTimestamp: "2025-03-04T09:15:02Z"
  labels:
    app.kubernetes.io/instance: flyte-coretf
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flyteadmin
    helm.sh/chart: flyte-core-v1.15.0
  name: flyteadmin
  namespace: flyte
  resourceVersion: "2120"
  uid: 6dd6fde3-02c8-4ffb-9f44-acc12144450a
@average-finland-92144 any suggestion on how to associate this role, which we know already has access to the bucket, with
serviceaccount flyteagent
?
a
@cuddly-engine-34540 ok so this is an interesting find: 1. The
flyteagent
SA gets its annotations from a base field that is empty by default https://github.com/flyteorg/flyte/blob/6e5aca7016a067ed7c4458c2f35951013d2e390e/charts/flyteagent/values.yaml#L54 2. I think we can use the same role flytepropeller is using (
flyte-sandbox-backend-role
in your case), but the Trust Relationship still needs to be established. So, to fix this you should be able to: a. Look at the installed Helm chart for the flyteagent (
helm ls -n flyte
), create a simple
values-override.yaml
file with something like this
Copy code
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::484907521551:role/flyte-sandbox-backend-role
Then run a
helm upgrade flyteorg/flyteagent flyteagent -n flyte --values values-override.yaml
to upgrade the deployment to set that annotation. (You can edit the SA too but this is more durable) b. Add
flyte:flyteagent
to this list, save, terraform plan and apply and it should add it to the trust relationship
c
@average-finland-92144 thanks for the response! If I understand correctly, I need to do EITHER "a. helm command" or "b. terraform". But not "a" AND "b", right? Let's go with "b. terraform" (unless we have a very strong reason, I prefer to stick with terraform and not helm commands). As suggested, I added
flyte:flyteagent
to this list and ran terraform apply.
namespace_service_accounts = ["flyte:flytepropeller", "flyte:flyteadmin", "flyte:datacatalog", "flyte:flyteagent"]
But it seems that it didn't fix the issue, since the service account still doesn't have the annotation:
Copy code
>> kubectl get serviceaccount flyteagent -n flyte -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    meta.helm.sh/release-name: flyte-coretf
    meta.helm.sh/release-namespace: flyte
  creationTimestamp: "2025-03-05T08:34:51Z"
  labels:
    app.kubernetes.io/instance: flyte-coretf
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flyteagent
    helm.sh/chart: flyteagent-v1.15.0
  name: flyteagent
  namespace: flyte
  resourceVersion: "2120"
  uid: 2a71093a-4fe8-466c-b43b-4d852186c6d9
I also tried to add flyteagent to this list (although I couldn't find it being used by terraform anywhere else)
flyte_backend_ksas = ["flytepropeller", "flyteadmin", "datacatalog", "flyteagent"]
But it also didn't add the IAM Role annotation to the flyteagent service account.
a
ok, so it has to be both a. and b. in this case.
flyteagent
is a standalone Helm chart that gets deployed when you set
flyteagent.enable: True
in the other charts (flyte-core for example) so it's not managed by Terraform. If you added the
flyteagent
SA to the list, please validate the Trust Relationship in IAM for the backend role, that KSA should be there. The annotation is controlled by Helm unless we create a new module to handle this in the reference implementation
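(A quick way to inspect that trust policy from the CLI; a sketch, assuming the AWS CLI is configured for the right account and the role name used in this thread:)
Copy code
# Print the trust policy; the flyte:flyteagent KSA should show up under the :sub condition
aws iam get-role --role-name flyte-sandbox-backend-role \
  --query 'Role.AssumeRolePolicyDocument' --output json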
c
If you added the flyteagent SA to the list, please validate the Trust Relationship in IAM for the backend role, that KSA should be there.
I confirm this is true. After adding
flyteagent
SA to the list, it appears in the Trust Relationship:
Copy code
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::484907521551:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/E57C7E3E82823040800D9C5778AA7E04"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "<http://oidc.eks.eu-west-1.amazonaws.com/id/E57C7E3E82823040800D9C5778AA7E04:sub|oidc.eks.eu-west-1.amazonaws.com/id/E57C7E3E82823040800D9C5778AA7E04:sub>": [
            "system:serviceaccount:flyte:flytepropeller",
            "system:serviceaccount:flyte:flyteadmin",
            "system:serviceaccount:flyte:datacatalog",
            "system:serviceaccount:flyte:flyteagent"
          ],
          "<http://oidc.eks.eu-west-1.amazonaws.com/id/E57C7E3E82823040800D9C5778AA7E04:aud|oidc.eks.eu-west-1.amazonaws.com/id/E57C7E3E82823040800D9C5778AA7E04:aud>": "<http://sts.amazonaws.com|sts.amazonaws.com>"
        }
      }
    }
  ]
}
Thanks for the suggestions, I'm going to do a. and get back here
trying to do a. as suggested:
Copy code
>> helm ls -n flyte
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
flyte-coretf    flyte           1               2025-03-05 13:42:04.463322455 -0300 -03 deployed        flyte-core-v1.15.0
From what you said above, I was expecting to see some kind of flyteagent chart, but it does not show in the list! Nonetheless, I continued with your suggestion and created values-override.yaml:
Copy code
# values-override.yaml
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::484907521551:role/flyte-sandbox-backend-role
Ran the command you suggested but got an error:
helm upgrade flyteorg/flyteagent flyteagent -n flyte --values values-override.yaml
Error: non-absolute URLs should be in form of repo_name/path_to_chart, got: flyteagent
@average-finland-92144 sorry, first time helm user here. Any suggestion on how to proceed?
@average-finland-92144 just to test things out, I also ran
kubectl annotate serviceaccount flyteagent -n flyte eks.amazonaws.com/role-arn=arn:aws:iam::484907521551:role/flyte-sandbox-backend-role
Which seems to have added the correct annotation to the flyteagent service account:
Copy code
>> kubectl get serviceaccount flyteagent -n flyte -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::484907521551:role/flyte-sandbox-backend-role
    meta.helm.sh/release-name: flyte-coretf
    meta.helm.sh/release-namespace: flyte
  creationTimestamp: "2025-03-05T16:42:39Z"
  labels:
    app.kubernetes.io/instance: flyte-coretf
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flyteagent
    helm.sh/chart: flyteagent-v1.15.0
  name: flyteagent
  namespace: flyte
  resourceVersion: "24839"
  uid: 20d58dad-8e01-48c3-ae84-b04e8b0636e9
Unfortunately, even so, I still receive the same
PermissionError: Forbidden
(full error log already sent above) when trying to run the FileSensor task. So, just to organize things:
• the issue in the last message is about how to add the annotation to the SA in a more durable way
• the issue in this message is that, even with this annotation added via
kubectl annotate serviceaccount
, it seems that this does not fix the original issue with the permission for the FileSensor task to read the bucket
a
Ran the command you suggested but got an error:
helm upgrade flyteorg/flyteagent flyteagent -n flyte --values values-override.yaml
sorry, I think ordering matters here: it should be the release name first, then the chart, so
Copy code
helm upgrade flyteagent flyteorg/flyteagent -n flyte --values values-override.yaml
c
@average-finland-92144 Thank you so much for your response—I really appreciate your help! I’m not sure if you’ve had a chance to go through my last two messages yet (both contain important details about the issue), but I wanted to highlight that running
helm ls -n flyte
doesn’t show any
flyteagent
chart. I believe this might be why your last suggestion didn’t work:
Copy code
>> helm upgrade flyteagent flyteorg/flyteagent -n flyte --values values-override.yaml
Error: repo flyteorg not found

>> helm repo add flyteorg https://helm.flyte.org
"flyteorg" has been added to your repositories
>> helm repo update
...
>> helm upgrade flyteagent flyteorg/flyteagent -n flyte --values values-override.yaml
Error: UPGRADE FAILED: "flyteagent" has no deployed releases
Based on what you mentioned (quoted below), I was expecting to see a flyteagent chart in
helm ls -n flyte
, but it's not there! >
flyteagent
is a standalone Helm chart that gets deployed when you set
flyteagent.enable: True
in the other charts (flyte-core for example) so it's not managed by Terraform. Whenever you have a moment, I’d really appreciate it if you could take a look at my previous two messages—I believe they contain key details to help move our troubleshooting forward. Thanks again for your time and support! 😃
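(Side note: the Helm release that owns the flyteagent Deployment can be read from its meta.helm.sh annotations; a small sketch, assuming kubectl access:)
Copy code
# Prints "flyte-coretf" here, i.e. the agent is rendered as part of the flyte-core release
kubectl get deployment flyteagent -n flyte \
  -o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-name}{"\n"}'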
a
@cuddly-engine-34540 could you get logs from the
flyteagent
Pod? Also, it's strange that Helm won't show you the flyteagent chart. What's the output of
helm ls -n flyte
? According to the deployment description you sent, it should be there:
Copy code
>> kubectl describe deployment flyteagent -n flyte
Name:                   flyteagent
Namespace:              flyte
CreationTimestamp:      Tue, 04 Mar 2025 06:15:04 -0300
Labels:                 app.kubernetes.io/instance=flyte-coretf
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=flyteagent
                        helm.sh/chart=flyteagent-v1.15.0
Anyway, that's only to set the SA annotations in a more durable way, and you already annotated it to unblock yourself. Let's see if the Pod logs show us something better.
c
What's the output of helm ls -n flyte ?
Copy code
>> helm ls -n flyte

NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
flyte-coretf    flyte           1               2025-03-06 13:53:06.875303572 -0300 -03 deployed        flyte-core-v1.15.0
It is really strange; I was expecting a flyteagent chart. I got a similar output for
kubectl describe deployment flyteagent -n flyte
to yours:
Copy code
>> kubectl describe deployment flyteagent -n flyte
Name:                   flyteagent
Namespace:              flyte
CreationTimestamp:      Thu, 06 Mar 2025 13:53:43 -0300
Labels:                 app.kubernetes.io/instance=flyte-coretf
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=flyteagent
                        helm.sh/chart=flyteagent-v1.15.0
Annotations:            deployment.kubernetes.io/revision: 1
                        meta.helm.sh/release-name: flyte-coretf
                        meta.helm.sh/release-namespace: flyte
...
could you get logs from the flyteagent Pod?
These are the logs for the flyteagent Pod. When I run new tasks of
FileSensor
, no lines are added to these logs!
Copy code
>> kubectl logs flyteagent-755fc4fc8c-b6l2v -n flyte
🚀 Starting the agent service...
Starting up the server to expose the prometheus metrics...
                             Agent Metadata                              
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Agent Name                  ┃ Support Task Types            ┃ Is Sync ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ Sensor                      │ sensor (v0)                   │ False   │
│ MMCloud Agent               │ mmcloud_task (v0)             │ False   │
│ Databricks Agent            │ spark (v0) databricks (v0)    │ False   │
│ Airflow Agent               │ airflow (v0)                  │ False   │
│ SageMaker Endpoint Agent    │ sagemaker-endpoint (v0)       │ False   │
│ Boto Agent                  │ boto (v0)                     │ True    │
│ Bigquery Agent              │ bigquery_query_job_task (v0)  │ False   │
│ K8s DataService Async Agent │ dataservicetask (v0)          │ False   │
│ OpenAI Batch Endpoint Agent │ openai-batch (v0)             │ False   │
│ ChatGPT Agent               │ chatgpt (v0)                  │ True    │
│ Snowflake Agent             │ snowflake (v0)                │ False   │
└─────────────────────────────┴───────────────────────────────┴─────────┘
[2025-03-06T17:29:28.590+0000] {credentials.py:550} INFO - Found credentials from IAM Role: worker-on-demand-eks-node-group-2025030616415554400000000b
@average-finland-92144 From the last line above, is it correct to say that the relevant IAM Role for the flyteagent is
worker-on-demand-eks-node-group-2025030616415554400000000b
, and not the
flyte-sandbox-backend-role
that was altered when we added
flyte:flyteagent
to this list? If so, how to proceed?
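(One way to check whether IRSA is actually wired into the Pod: EKS injects the variables below through its pod identity webhook, and when they are missing botocore falls back to the node's instance profile. A sketch, assuming kubectl access:)
Copy code
# If these variables are absent, the Pod never picked up the ServiceAccount's role annotation
kubectl exec -n flyte deploy/flyteagent -- env | grep -E 'AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE'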
a
oh this is a good finding. I don't think adding the KSA to that list could cause this, but we can confirm: I think you already checked the Trust Relationship for the
flyte-sandbox-backend-role
, what about the one for the `worker-on-demand...` role?
c
The Trust Relationship for role `worker-on-demand-eks-node-group-2025030616415554400000000b`:
Copy code
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EKSNodeAssumeRole",
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
a
That is right for that role; what is not right is that a Pod assumes that role.
c
hello, @average-finland-92144! How are you doing? Any idea why that Pod assumes that role incorrectly, or how to restore the proper behavior so the FileSensor can read from the bucket?
@average-finland-92144 sorry to ping here again, I'm stuck 😕
a
Hey @cuddly-engine-34540 So the
flyteagent
Pod is using the right ServiceAccount, which is now annotated with an IAM role that has permissions. That, together with the Trust Relationship we modified using Terraform, should be enough. The suspicious part is that log line coming from `botocore` (ref), which is the library Flyte uses to interface with AWS S3.
c
Some interesting findings when analyzing AWS CloudTrail logs. Even after annotating the
flyteagent
service account with the right role:
Copy code
>> kubectl get serviceaccount flyteagent -n flyte -o yaml
...
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::484907521551:role/flyte-sandbox-backend-role
...
The API call made by the flyteagent to S3 is NOT authenticated with the role above
role/flyte-sandbox-backend-role
. It is authenticated with the role
role/worker-on-demand-eks-node-group...
that appears in the flyteagent's boto logs, which is not right for that pod, as you said. From the CloudTrail log:
Copy code
"eventSource": "s3.amazonaws.com",
"eventName": "HeadObject",
"errorCode": "AccessDenied",

"errorMessage": "User: arn:aws:sts::484907521551:assumed-role/worker-on-demand-eks-node-group-2025031211521468510000000c/i-03c0ed266bcdae53e is not authorized to perform: s3:GetObject on resource: \"arn:aws:s3:::484907521551-flyte-sandbox-data/file4.txt\" because no identity-based policy allows the s3:GetObject action",

"userAgent": "[aiobotocore/2.19.0 md/Botocore#1.36.3 ua/2.0 os/linux#6.1.128-136.201.amzn2023.x86_64 md/arch#x86_64 lang/python#3.10.16 md/pyimpl#CPython cfg/retry-mode#legacy botocore/1.36.3]",

"userIdentity": "{type=AssumedRole, principalid=<REDACT>:i-03c0ed266bcdae53e, arn=arn:aws:sts::484907521551:assumed-role/worker-on-demand-eks-node-group-2025031211521468510000000c/i-03c0ed266bcdae53e, accountid=484907521551, accesskeyid=<REDACT>, username=null, sessioncontext={attributes={creationdate=2025-03-12 12:02:11.000, mfaauthenticated=false}, sessionissuer={type=Role, principalid=<REDACT>, arn=arn:aws:iam::484907521551:role/worker-on-demand-eks-node-group-2025031211521468510000000c, accountid=484907521551, username=worker-on-demand-eks-node-group-2025031211521468510000000c}, webidfederationdata=null, sourceidentity=null, ec2roledelivery=2.0, ec2issuedinvpc=null, assumedroot=null}, invokedby=null, identityprovider=null, credentialid=null, onbehalfof=null, inscopeof=null}",
...
"additionalEventData": "{SignatureVersion=SigV4, CipherSuite=TLS_AES_128_GCM_SHA256, bytesTransferredIn=0, AuthenticationMethod=AuthHeader, x-amz-id-2=Wqo72BM3+aHV9QW5xg+W7h1w3kxNe9MO8cYAL5uUgBHGyGaQbDFrhtHF9JI2+0cAQFmjJ9sRRfc=, bytesTransferredOut=530}",
@average-finland-92144 So I guess the next step is figuring out why the flyteagent pod is authenticating its requests to S3 with the wrong role
role/worker-on-demand-eks-node-group
instead of the annotated
role/flyte-sandbox-backend-role
. Do you think it's a problem with the chart (which chart?), or a problem with the deploy-flyte Terraform? What could be making the flyteagent pod able to authenticate with
role/worker-on-demand-eks-node-group
, which is not right for that pod, as you said? What can we do to make sure that the service account annotation with
role/flyte-sandbox-backend-role
on the flyteagent pod takes precedence over this wrong
role/worker-on-demand-eks-node-group
auth?
good news! After annotating with
kubectl annotate serviceaccount flyteagent -n flyte eks.amazonaws.com/role-arn=arn:aws:iam::484907521551:role/flyte-sandbox-backend-role
and restarting the flyteagent pod, the FileSensor in the flyteagent is able to check whether the file is present in the S3 bucket. Problem solved! @average-finland-92144 thank you very much for all the help!
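(For anyone hitting the same issue: IRSA credentials are only injected when a Pod is created, so after annotating the ServiceAccount the existing agent Pod has to be recreated; for example:)
Copy code
# Recreate the agent Pod so the annotated ServiceAccount's role is picked up
kubectl rollout restart deployment/flyteagent -n flyte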
a
ohh, the good ol' restart to the rescue! Glad that it's working now. Thanks for sharing @cuddly-engine-34540!