# flyte-deployment
g
Hey everyone, I'm trying the single-cluster deployment of Flyte on EKS using the AWS CDK (constrained by work). I'm following "Flyte: The Hard Way". Has anyone done this before? I'll put details of where I'm stuck in the thread
01-eks-permissions.md, 02-deploying-eks-cluster.md, and 04-create-database.md were all fine. I've verified that I can connect to the RDS endpoint from the cluster with
kubectl run pgsql-postgresql-client --rm --tty -i --restart='Never' --namespace testdb --image docker.io/bitnami/postgresql:11.7.0-debian-10-r9 --env='PGPASSWORD=<Password>' --command -- psql testdb --host <RDS-ENDPOINT-NAME> -U flyteadmin -d flyteadmin -p 5432
03-roles-service-accounts.md
was a little tricky. This was the closest I could get to:
this.cluster.addManifest('flyte-namespace', {
  apiVersion: 'v1',
  kind: 'Namespace',
  metadata: {
    name: 'flyte',
  },
});

const flytePolicy = new Policy(this, 'FlyteCustomPolicy', {
  statements: [
    new PolicyStatement({
      effect: Effect.ALLOW,
      actions: [
        's3:DeleteObject*',
        's3:GetObject*',
        's3:ListBucket',
        's3:PutObject*'
      ],
      resources: [
        `arn:aws:s3:::XXXXXXXX`,
        `arn:aws:s3:::XXXXXXXX/*`
      ]
    })
  ]
});

const flyteBackendServiceAccountManifest = cluster.addServiceAccount('flyte-system-role', {
  name: 'flyte-backend-flyte-binary',
  namespace: 'flyte',
});
flyteBackendServiceAccountManifest.role.attachInlinePolicy(flytePolicy);

const flyteWorkersServiceAccountManifest = cluster.addServiceAccount('flyte-workers-role', {
  name: 'flyte-admin', // wanted 'default' here, but CDK errors because that name already exists
  namespace: 'flyte',
});
flyteWorkersServiceAccountManifest.role.attachInlinePolicy(flytePolicy);

// MANUAL EDIT OF TRUST POLICY
Two things for the flyte workers role:
• I couldn't set the name: 'default'. It said the name already exists.
• I am not able to alter the trust policy through the CDK. I think this is due to limitations of CDK and the .addServiceAccount method. I manually changed this in the console (from system:serviceaccount:flyte:flyte-admin) to the following (a CDK sketch that might avoid the manual edit is further down):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::YYYYYYYY:oidc-provider/oidc.eks.<region>.<http://amazonaws.com/id/XXXXX|amazonaws.com/id/XXXXX>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringLike": {
          "oidc.eks.<region>.<http://amazonaws.com/id/XXXXX:sub|amazonaws.com/id/XXXXX:sub>": "system:serviceaccount:*:default",
          "oidc.eks.<region>.<http://amazonaws.com/id/XXXXX:aud|amazonaws.com/id/XXXXX:aud>": "<http://sts.amazonaws.com|sts.amazonaws.com>"
        }
      }
    }
  ]
}
And then deployed. The 'flyte' namespace exists, as do the service accounts (checked with kubectl get serviceaccounts --namespace flyte).
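As an aside, here's a rough CDK v2 sketch of how the workers trust policy might be built without the console edit. Assumptions: this.cluster is the eks.Cluster and flytePolicy is the policy above, the construct IDs are placeholders, and CfnJson is needed because the OIDC issuer is a CDK token so it can't be used directly as a condition key. Untested, treat it as a sketch only:
import { CfnJson } from 'aws-cdk-lib';
import { OpenIdConnectPrincipal, Role } from 'aws-cdk-lib/aws-iam';

// Issuer without the https:// prefix, e.g. oidc.eks.<region>.amazonaws.com/id/XXXXX
const issuer = this.cluster.clusterOpenIdConnectIssuer;

// Condition keys contain CDK tokens, so they have to be built through CfnJson.
const workersCondition = new CfnJson(this, 'FlyteWorkersCondition', {
  value: {
    [`${issuer}:aud`]: 'sts.amazonaws.com',
    [`${issuer}:sub`]: 'system:serviceaccount:*:default',
  },
});

// IAM role trusted by the cluster's OIDC provider, mirroring the StringLike policy above.
const flyteWorkersRole = new Role(this, 'FlyteWorkersRole', {
  assumedBy: new OpenIdConnectPrincipal(this.cluster.openIdConnectProvider, {
    StringLike: workersCondition,
  }),
});
flyteWorkersRole.attachInlinePolicy(flytePolicy);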
Now for stage 05-deploy-with-helm.md. Ideally I would use the CDK method cluster.addHelmChart (a rough sketch of what that might look like is at the end of this message), but there's a lot of YAML so I'll do a manual deployment for now. I edit the Helm values file and I get the error
Error: INSTALLATION FAILED: Unable to continue with install: ServiceAccount "flyte-backend-flyte-binary" in namespace "flyte" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm"; annotation validation error: missing key "meta.helm.sh/release-name": must be set to "flyte-backend"; annotation validation error: missing key "meta.helm.sh/release-namespace": must be set to "flyte"
Let me know if you have any tips. I'd be more than happy to contribute CDK code to the docs if we are able to get this working cleanly! Thanks in advance
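For reference, this is roughly what the cluster.addHelmChart route might look like once it works. Untested: the chart name, repository URL and release name are my assumptions, and the serviceAccount block just mirrors the one in eks-starter.yaml:
this.cluster.addHelmChart('FlyteBinary', {
  chart: 'flyte-binary',
  repository: 'https://flyteorg.github.io/flyte', // assumed Flyte Helm repo URL
  release: 'flyte-backend',
  namespace: 'flyte',
  createNamespace: false, // the namespace is already created by the manifest above
  values: {
    // ...the rest of the edited eks-starter.yaml values would go here...
    serviceAccount: {
      create: true,
      annotations: {
        'eks.amazonaws.com/role-arn': '<flyte-system-role>', // role ARN placeholder, as in eks-starter.yaml
      },
    },
  },
});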
Update: got a bit further by setting (rather obviously)
serviceAccount:
  create: false
The pods fail to create. The log output is
1.373ms] [rows:0] SELECT description FROM pg_catalog.pg_description WHERE objsubid = (SELECT ordinal_position FROM information_schema.columns WHERE table_schema = CURRENT_SCHEMA() AND table_name = 'reservations' AND column_name = 'serialized_metadata') AND objoid = (SELECT oid FROM pg_catalog.pg_class WHERE relname = 'reservations' AND relnamespace = (SELECT oid FROM pg_catalog.pg_namespace WHERE nspname = CURRENT_SCHEMA()))
{"json":{"src":"initialize.go:74"},"level":"info","msg":"Ran DB migration successfully.","ts":"2025-01-28T20:42:39Z"}
{"json":{"app_name":"datacatalog","src":"service.go:98"},"level":"info","msg":"Created data storage.","ts":"2025-01-28T20:42:39Z"}
{"json":{"app_name":"datacatalog","src":"service.go:109"},"level":"info","msg":"Created DB connection.","ts":"2025-01-28T20:42:39Z"}
{"json":{"src":"service.go:129"},"level":"info","msg":"Serving DataCatalog Insecure on port :8081","ts":"2025-01-28T20:42:39Z"}
{"json":{"src":"init_cert.go:63"},"level":"info","msg":"Creating secret [flyte-backend-flyte-binary-webhook-secret] in Namespace [flyte]","ts":"2025-01-28T20:42:46Z"}
{"json":{"src":"start.go:152"},"level":"error","msg":"Failed to initialize certificates for Secrets Webhook. client rate limiter Wait returned an error: context canceled","ts":"2025-01-28T20:42:46Z"}
{"json":{"src":"start.go:228"},"level":"panic","msg":"Failed to start Propeller, err: failed to create FlyteWorkflow CRD: <http://customresourcedefinitions.apiextensions.k8s.io|customresourcedefinitions.apiextensions.k8s.io> is forbidden: User \"system:serviceaccount:flyte:default\" cannot create resource \"customresourcedefinitions\" in API group \"<http://apiextensions.k8s.io|apiextensions.k8s.io>\" at the cluster scope","ts":"2025-01-28T20:42:46Z"}
panic: (*logrus.Entry) 0xc000856620
Seems to be an error linking to the service account, but unsure where to go from here
a
Seems the cluster role is linked to the wrong SA: https://github.com/flyteorg/flyte/blob/45ce4c044491f123682ff08fac6d7761471e696a/charts/flyte-binary/templates/clusterrolebinding.yaml#L25-L28 It should be bound to your flyte-backend-flyte-binary SA
g
Is there anything else I can print out to help debug? I thought the problems might be linked to the step above where I was unable to set the name: 'default' on the flyte-workers-role. Does it matter if there are other service accounts linked to the cluster? I also have these service accounts from when I set up the cluster, before trying to configure Flyte:
const serviceAccountManifest = cluster.addServiceAccount('eks-admin-service-account', {
  name: 'eks-admin',
  namespace: 'kube-system',
});

const clusterRoleBindingManifest = cluster.addManifest('eks-admin-cluster-role-binding', {
  apiVersion: 'rbac.authorization.k8s.io/v1', // native Kubernetes Role Based Access Control (RBAC)
  kind: 'ClusterRoleBinding',
  metadata: {
    name: 'eks-admin',
  },
  roleRef: {
    apiGroup: 'rbac.authorization.k8s.io',
    kind: 'ClusterRole',
    name: 'cluster-admin',
  },
  subjects: [
    {
      kind: 'ServiceAccount',
      name: 'eks-admin',
      namespace: 'kube-system',
    },
  ],
});
Just for some more debugging, this is the output of kubectl describe sa flyte-backend-flyte-binary --namespace flyte
Name:                flyte-backend-flyte-binary
Namespace:           flyte
Labels:              app.kubernetes.io/name=flyte-backend-flyte-binary
                     aws.cdk.eks/prune-XXXXXXXXX=
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::XXXXXXXX:role/YYYYYYYY-EKSClusterflytesystemrole-ZZZZZZZ
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>
Seems like it isn't exactly what the instructions show. It's missing these labels:
Labels:              app.kubernetes.io/instance=flyte-backend
                     app.kubernetes.io/managed-by=Helm
                     app.kubernetes.io/name=flyte-binary
                     app.kubernetes.io/version=1.16.0
                     helm.sh/chart=flyte-binary-v1.3.0
and these annotations:
meta.helm.sh/release-name: flyte-backend
meta.helm.sh/release-namespace: flyte
a
Yeah, let's print kubectl get clusterrolebinding -n flyte and then describe the rolebinding there.
What is odd here is that Propeller is using the default service account to instantiate the CRD. Please also do a describe on the flyte-binary Pod just to confirm which SA it is using
g
big print with no formatting coming, sorry in advance: 1.
kubectl get clusterrolebinding -n flyte
NAME                                                            ROLE                                                                        AGE
aws-node                                                        ClusterRole/aws-node                                                        5d23h
cluster-admin                                                   ClusterRole/cluster-admin                                                   5d23h
eks-admin                                                       ClusterRole/cluster-admin                                                   5d23h
eks:addon-cluster-admin                                         ClusterRole/cluster-admin                                                   5d23h
eks:addon-manager                                               ClusterRole/eks:addon-manager                                               5d23h
eks:az-poller                                                   ClusterRole/eks:az-poller                                                   5d23h
eks:certificate-controller                                      ClusterRole/system:controller:certificate-controller                        5d23h
eks:certificate-controller-approver                             ClusterRole/eks:certificate-controller-approver                             5d23h
eks:certificate-controller-manager                              ClusterRole/eks:certificate-controller-manager                              5d23h
eks:certificate-controller-signer                               ClusterRole/eks:certificate-controller-signer                               5d23h
eks:cloud-controller-manager                                    ClusterRole/eks:cloud-controller-manager                                    5d23h
eks:cloud-provider-extraction-migration                         ClusterRole/eks:cloud-provider-extraction-migration                         5d23h
eks:cluster-event-watcher                                       ClusterRole/eks:cluster-event-watcher                                       5d23h
eks:coredns-autoscaler                                          ClusterRole/eks:coredns-autoscaler                                          5d23h
eks:extension-metrics-apiserver                                 ClusterRole/eks:extension-metrics-apiserver                                 5d23h
eks:extension-metrics-apiserver-auth-delegator                  ClusterRole/system:auth-delegator                                           5d23h
eks:fargate-manager                                             ClusterRole/eks:fargate-manager                                             5d23h
eks:fargate-scheduler                                           ClusterRole/eks:fargate-scheduler                                           5d23h
eks:k8s-metrics                                                 ClusterRole/eks:k8s-metrics                                                 5d23h
eks:kms-storage-migrator                                        ClusterRole/eks:kms-storage-migrator                                        5d23h
eks:kube-proxy                                                  ClusterRole/system:node-proxier                                             5d23h
eks:kube-proxy-fargate                                          ClusterRole/system:node-proxier                                             5d23h
eks:kube-proxy-windows                                          ClusterRole/system:node-proxier                                             5d23h
eks:network-policy-controller                                   ClusterRole/eks:network-policy-controller                                   5d23h
eks:network-webhooks                                            ClusterRole/eks:network-webhooks                                            5d23h
eks:node-bootstrapper                                           ClusterRole/eks:node-bootstrapper                                           5d23h
eks:node-manager                                                ClusterRole/eks:node-manager                                                5d23h
eks:nodewatcher                                                 ClusterRole/eks:nodewatcher                                                 5d23h
eks:pod-identity-mutating-webhook                               ClusterRole/eks:pod-identity-mutating-webhook                               5d23h
eks:service-operations                                          ClusterRole/eks:service-operations                                          5d23h
eks:tagging-controller                                          ClusterRole/eks:tagging-controller                                          5d23h
flyte-backend-flyte-binary-cluster-role-binding                 ClusterRole/flyte-backend-flyte-binary-cluster-role                         47h
kuberay-apiserver                                               ClusterRole/kuberay-apiserver                                               5d23h
kuberay-operator                                                ClusterRole/kuberay-operator                                                5d23h
system:basic-user                                               ClusterRole/system:basic-user                                               5d23h
system:controller:attachdetach-controller                       ClusterRole/system:controller:attachdetach-controller                       5d23h
system:controller:certificate-controller                        ClusterRole/system:controller:certificate-controller                        5d23h
system:controller:clusterrole-aggregation-controller            ClusterRole/system:controller:clusterrole-aggregation-controller            5d23h
system:controller:cronjob-controller                            ClusterRole/system:controller:cronjob-controller                            5d23h
system:controller:daemon-set-controller                         ClusterRole/system:controller:daemon-set-controller                         5d23h
system:controller:deployment-controller                         ClusterRole/system:controller:deployment-controller                         5d23h
system:controller:disruption-controller                         ClusterRole/system:controller:disruption-controller                         5d23h
system:controller:endpoint-controller                           ClusterRole/system:controller:endpoint-controller                           5d23h
system:controller:endpointslice-controller                      ClusterRole/system:controller:endpointslice-controller                      5d23h
system:controller:endpointslicemirroring-controller             ClusterRole/system:controller:endpointslicemirroring-controller             5d23h
system:controller:ephemeral-volume-controller                   ClusterRole/system:controller:ephemeral-volume-controller                   5d23h
system:controller:expand-controller                             ClusterRole/system:controller:expand-controller                             5d23h
system:controller:generic-garbage-collector                     ClusterRole/system:controller:generic-garbage-collector                     5d23h
system:controller:horizontal-pod-autoscaler                     ClusterRole/system:controller:horizontal-pod-autoscaler                     5d23h
system:controller:job-controller                                ClusterRole/system:controller:job-controller                                5d23h
system:controller:legacy-service-account-token-cleaner          ClusterRole/system:controller:legacy-service-account-token-cleaner          5d23h
system:controller:namespace-controller                          ClusterRole/system:controller:namespace-controller                          5d23h
system:controller:node-controller                               ClusterRole/system:controller:node-controller                               5d23h
system:controller:persistent-volume-binder                      ClusterRole/system:controller:persistent-volume-binder                      5d23h
system:controller:pod-garbage-collector                         ClusterRole/system:controller:pod-garbage-collector                         5d23h
system:controller:pv-protection-controller                      ClusterRole/system:controller:pv-protection-controller                      5d23h
system:controller:pvc-protection-controller                     ClusterRole/system:controller:pvc-protection-controller                     5d23h
system:controller:replicaset-controller                         ClusterRole/system:controller:replicaset-controller                         5d23h
system:controller:replication-controller                        ClusterRole/system:controller:replication-controller                        5d23h
system:controller:resourcequota-controller                      ClusterRole/system:controller:resourcequota-controller                      5d23h
system:controller:root-ca-cert-publisher                        ClusterRole/system:controller:root-ca-cert-publisher                        5d23h
system:controller:route-controller                              ClusterRole/system:controller:route-controller                              5d23h
system:controller:service-account-controller                    ClusterRole/system:controller:service-account-controller                    5d23h
system:controller:service-controller                            ClusterRole/system:controller:service-controller                            5d23h
system:controller:statefulset-controller                        ClusterRole/system:controller:statefulset-controller                        5d23h
system:controller:ttl-after-finished-controller                 ClusterRole/system:controller:ttl-after-finished-controller                 5d23h
system:controller:ttl-controller                                ClusterRole/system:controller:ttl-controller                                5d23h
system:controller:validatingadmissionpolicy-status-controller   ClusterRole/system:controller:validatingadmissionpolicy-status-controller   5d23h
system:coredns                                                  ClusterRole/system:coredns                                                  5d23h
system:discovery                                                ClusterRole/system:discovery                                                5d23h
system:kube-controller-manager                                  ClusterRole/system:kube-controller-manager                                  5d23h
system:kube-dns                                                 ClusterRole/system:kube-dns                                                 5d23h
system:kube-scheduler                                           ClusterRole/system:kube-scheduler                                           5d23h
system:monitoring                                               ClusterRole/system:monitoring                                               5d23h
system:node                                                     ClusterRole/system:node                                                     5d23h
system:node-proxier                                             ClusterRole/system:node-proxier                                             5d23h
system:public-info-viewer                                       ClusterRole/system:public-info-viewer                                       5d23h
system:service-account-issuer-discovery                         ClusterRole/system:service-account-issuer-discovery                         5d23h
system:volume-scheduler                                         ClusterRole/system:volume-scheduler                                         5d23h
vpc-resource-controller-rolebinding                             ClusterRole/vpc-resource-controller-role                                    5d23h
2.
kubectl describe clusterrolebinding flyte-backend-flyte-binary-cluster-role
kubectl describe clusterrolebinding flyte-backend-flyte-binary-cluster-role 
Name:         flyte-backend-flyte-binary-cluster-role-binding
Labels:       app.kubernetes.io/instance=flyte-backend
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=flyte-binary
              app.kubernetes.io/version=1.16.0
              helm.sh/chart=flyte-binary-v1.14.1
Annotations:  meta.helm.sh/release-name: flyte-backend
              meta.helm.sh/release-namespace: flyte
Role:
  Kind:  ClusterRole
  Name:  flyte-backend-flyte-binary-cluster-role
Subjects:
  Kind            Name                        Namespace
  ----            ----                        ---------
  ServiceAccount  flyte-backend-flyte-binary  flyte
3.
kubectl describe pod flyte-backend-flyte-binary-6f99bcdbb8-v29ml -n flyte
Name:             flyte-backend-flyte-binary-6f99bcdbb8-v29ml
Namespace:        flyte
Priority:         0
Service Account:  default
Node:             XXXXXX.ec2.internal/<IP>
Start Time:       Tue, 28 Jan 2025 21:33:07 +0000
Labels:           app.kubernetes.io/component=flyte-binary
                  app.kubernetes.io/instance=flyte-backend
                  app.kubernetes.io/name=flyte-binary
                  pod-template-hash=XXXXX
Annotations:      checksum/cluster-resource-templates: XXXXX
                  checksum/configuration: XXXX
                  checksum/configuration-secret: XXXXX
Status:           Running
IP:               <IP>
IPs:
  IP:           <IP>
Controlled By:  ReplicaSet/flyte-backend-flyte-binary-6f99bcdbb8
Init Containers:
  wait-for-db:
    Container ID:  <containerd://XXXXXXXX>
    Image:         postgres:15-alpine
    Image ID:      docker.io/library/postgres@sha256:XXXXXXX
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -ec
    Args:
      until pg_isready \
        -h flyteadmin.cluster-XXXXXX.<region>.rds.amazonaws.com \
        -p 5432 \
        -U flyteadmin
      do
        echo waiting for database
        sleep 0.1
      done
      
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 28 Jan 2025 21:33:08 +0000
      Finished:     Tue, 28 Jan 2025 21:33:08 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fxz8z (ro)
Containers:
  flyte:
    Container ID:  <containerd://XXXXXXX>
    Image:         cr.flyte.org/flyteorg/flyte-binary-release:v1.14.1
    Image ID:      cr.flyte.org/flyteorg/flyte-binary-release@sha256:XXXXXXX
    Ports:         8088/TCP, 8089/TCP, 9443/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      start
      --config
      /etc/flyte/config.d/*.yaml
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Wed, 29 Jan 2025 21:11:14 +0000
      Finished:     Wed, 29 Jan 2025 21:11:19 +0000
    Ready:          False
    Restart Count:  278
    Liveness:       http-get http://:http/healthcheck delay=30s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:http/healthcheck delay=30s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       flyte-backend-flyte-binary-6f99bcdbb8-v29ml (v1:metadata.name)
      POD_NAMESPACE:  flyte (v1:metadata.namespace)
    Mounts:
      /etc/flyte/cluster-resource-templates from cluster-resource-templates (rw)
      /etc/flyte/config.d from config (rw)
      /var/run/flyte from state (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fxz8z (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  cluster-resource-templates:
    Type:               Projected (a volume that contains injected data from multiple sources)
    ConfigMapName:      flyte-backend-flyte-binary-cluster-resource-templates
    ConfigMapOptional:  <nil>
  config:
    Type:                Projected (a volume that contains injected data from multiple sources)
    ConfigMapName:       flyte-backend-flyte-binary-config
    ConfigMapOptional:   <nil>
    SecretName:          flyte-backend-flyte-binary-config-secret
    SecretOptionalName:  <nil>
  state:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-fxz8z:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Warning  BackOff  90s (x6618 over 23h)  kubelet  Back-off restarting failed container flyte in pod flyte-backend-flyte-binary-6f99bcdbb8-v29ml_flyte(1269290f-51b1-420a-8d5b-6326c72d25d5)
a
I think this is part of the problem: the flyte-binary Pod should not be using the default SA but the flyte-backend-flyte-binary SA
g
Hmmm, that makes sense. Would changing this eks-starter.yaml section help that? Or is that referring to a different thing?
002_serviceaccount.yaml: |
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: default # <---- to flyte-backend-flyte-binary
        namespace: '{{ namespace }}'
        annotations:
          eks.amazonaws.com/role-arn: '{{ defaultIamRole }}'
a
no, that's the template for the workers (the Pods created for each execution) and that is fine
g
that's interesting. the workers were the only part of following your docs that I wasn't able to follow exactly with CDK (I had to manually adjust the trust relationship). so I'm surprised the issue is with the flyte-binary pod
a
Can you make CDK not create the flyte-backend-flyte-binary SA but let Helm do it? I think that's the way the rest of the Helm templates plumb into, for example, the binary Pod (see https://github.com/flyteorg/flyte/blob/448aba97201ba42297282d859e6064b7f89537ae/charts/flyte-binary/templates/deployment.yaml#L62)
g
I don't think it's possible for CDK to not create the service account when running cluster.addServiceAccount(). I.e. in your docs you say "You won't create a Kubernetes service account at this point; it will be created by running the Helm chart at the end of the process", but CDK will have already created the service account. Looking at the eks-starter.yaml, we have
serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: <flyte-system-role>
Maybe I can just manually create a role and use that ARN and see if Helm does the rest?
So I removed the flyte-worker-role I created through CDK and reran the Helm deployment, changing the lines to remove the annotations (as I have no role ARN to point to anymore). So it was just
serviceAccount:
  create: true
Helm created the service account:
kubectl get sa --namespace flyte
NAME                         SECRETS   AGE
default                      0         45h
flyte-admin                  0         45h
flyte-backend-flyte-binary   0         26m
kubectl describe sa --namespace flyte
Name:                flyte-backend-flyte-binary
Namespace:           flyte
Labels:              app.kubernetes.io/instance=flyte-backend
                     app.kubernetes.io/managed-by=Helm
                     app.kubernetes.io/name=flyte-binary
                     app.kubernetes.io/version=1.16.0
                     helm.sh/chart=flyte-binary-v1.14.1
Annotations:         meta.helm.sh/release-name: flyte-backend
                     meta.helm.sh/release-namespace: flyte
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>
The pod is running and didn't crash. It is now using the right service account too
kubectl describe pod --namespace flyte flyte-backend-flyte-binary-8589d74cf6-5cf2s
Name:             flyte-backend-flyte-binary-8589d74cf6-5cf2s
Namespace:        flyte
Priority:         0
Service Account:  flyte-backend-flyte-binary
Node:             <>
Start Time:       Thu, 30 Jan 2025 00:32:21 +0000
Labels:           app.kubernetes.io/component=flyte-binary
                  app.kubernetes.io/instance=flyte-backend
                  app.kubernetes.io/name=flyte-binary
                  pod-template-hash=8589d74cf6
Annotations:      checksum/cluster-resource-templates: XXXXX
                  checksum/configuration: XXXX
                  checksum/configuration-secret: XXX
Status:           Running
IP:               <>
IPs:
  IP:           <>
Controlled By:  ReplicaSet/flyte-backend-flyte-binary-8589d74cf6
Init Containers:
  wait-for-db:
    Container ID:  <containerd://XXXXX>
    Image:         postgres:15-alpine
    Image ID:      docker.io/library/postgres@sha256:XXX
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -ec
    Args:
      until pg_isready \
        -h flyteadmin.cluster-XXX.XXXX.rds.amazonaws.com \
        -p 5432 \
        -U flyteadmin
      do
        echo waiting for database
        sleep 0.1
      done
      
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 30 Jan 2025 00:32:22 +0000
      Finished:     Thu, 30 Jan 2025 00:32:22 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q5gjc (ro)
Containers:
  flyte:
    Container ID:  <containerd://XXXX>
    Image:         cr.flyte.org/flyteorg/flyte-binary-release:v1.14.1
    Image ID:      cr.flyte.org/flyteorg/flyte-binary-release@sha256:XXXXX
    Ports:         8088/TCP, 8089/TCP, 9443/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      start
      --config
      /etc/flyte/config.d/*.yaml
    State:          Running
      Started:      Thu, 30 Jan 2025 00:32:23 +0000
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:http/healthcheck delay=30s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:http/healthcheck delay=30s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       flyte-backend-flyte-binary-8589d74cf6-5cf2s (v1:metadata.name)
      POD_NAMESPACE:  flyte (v1:metadata.namespace)
    Mounts:
      /etc/flyte/cluster-resource-templates from cluster-resource-templates (rw)
      /etc/flyte/config.d from config (rw)
      /var/run/flyte from state (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q5gjc (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  cluster-resource-templates:
    Type:               Projected (a volume that contains injected data from multiple sources)
    ConfigMapName:      flyte-backend-flyte-binary-cluster-resource-templates
    ConfigMapOptional:  <nil>
  config:
    Type:                Projected (a volume that contains injected data from multiple sources)
    ConfigMapName:       flyte-backend-flyte-binary-config
    ConfigMapOptional:   <nil>
    SecretName:          flyte-backend-flyte-binary-config-secret
    SecretOptionalName:  <nil>
  state:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-q5gjc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  33m   default-scheduler  Successfully assigned flyte/flyte-backend-flyte-binary-8589d74cf6-5cf2s to ip-XXXX.ec2.internal
  Normal  Pulled     33m   kubelet            Container image "postgres:15-alpine" already present on machine
  Normal  Created    33m   kubelet            Created container wait-for-db
  Normal  Started    33m   kubelet            Started container wait-for-db
  Normal  Pulled     33m   kubelet            Container image "cr.flyte.org/flyteorg/flyte-binary-release:v1.14.1" already present on machine
  Normal  Created    33m   kubelet            Created container flyte
  Normal  Started    33m   kubelet            Started container flyte
However, now when I continue the instructions and run
kubectl -n flyte port-forward service/flyte-backend-flyte-binary 8088:8088 8089:8089
I get
Error from server (NotFound): services "flyte-backend-flyte-binary" not found
The only services are flyte-backend-flyte-binary-grpc, flyte-backend-flyte-binary-http, and flyte-backend-flyte-binary-webhook, and trying to change the command to port-forward one of those services does not work
Some logs from the pod
2025/01/30 01:13:54 /flyteorg/build/flyteadmin/scheduler/repositories/gormimpl/schedulable_entity_repo.go:70
[2.313ms] [rows:0] SELECT * FROM "schedulable_entities"
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2025-01-30T01:13:55Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2025-01-30T01:13:55Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2025-01-30T01:13:55Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2025-01-30T01:13:56Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2025-01-30T01:13:56Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2025-01-30T01:13:56Z"}
a
well, making progress 🙂 So, it was necessary to remove the role? I mean if you're now letting Helm create the SA, you can also add back the annotations to map it to the role, otherwise you'll have an error when trying to run a workflow
The ingress issue is probably also a docs issue. Recent versions of flyte-binary split the service up like that, so you should be able to port-forward any of those separately, like
kubectl -n flyte port-forward service/flyte-backend-flyte-binary-grpc 8089:8089
g
seems to work, but what's the address I'm meant to go to?
a
also port forward the http service to test
kubectl -n flyte port-forward service/flyte-backend-flyte-binary-http 8088:8088
and go to
localhost:8088/console
g
okay nice, I'm in. Which annotations do I need to map?
a
I mean, adding this back to your Helm values
serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::<aws-account-id>:role/flyte-system-role"
and of course having the role configured
g
hmm, is this circular because helm needs to create the role when we helm install? so I won't be able to annotate with the role name a priori
a
no, Helm won't create the IAM role. It just annotates the SA so IRSA works. In Flyte the Hard Way that's why we create the role first and not the SA, and then let Helm create the SA and tie it to the IAM role using the annotation
g
Okay, apologies in advance as there might be some basic Flyte questions thrown in here, which might not be related to the deployment. I've got -http and -grpc port forwarded. I've installed flyctl and I've cloned the example flytesnacks repo.
pyflyte run basics/hello_world.py hello_world_wf
runs fine locally, but with --remote I get
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:30080: Failed to connect to remote host: connect: Connection 
refused (111)"
        debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2025-01-30T01:56:05.966009007+00:00", grpc_status:14, grpc_message:"failed
to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:30080: Failed to connect to remote host: connect: Connection refused (111)"}"
>
a
No worries. So, have you used the flytectl demo start option before? Looks like your pyflyte client is pointing to that instance
you can do
export FLYTECTL_CONFIG=$HOME/.flyte/config.yaml
and make sure that file includes
admin:
  endpoint: localhost:8089
g
(I'd installed flyctl rather than flytectl... confusing, both use the same colour of pink/purple)
a
oh, that's an unintended color scheme collision!
so, is it working now?
g
Vanilla flytectl demo start ran, but I'm going to configure it to try the remote cluster
Hmmm, running flytectl demo start seems to have deleted the original Helm deployment pods
kubectl get services -n flyte                                                                                                    
NAME                                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                         AGE
flyte-sandbox-docker-registry        NodePort    10.43.149.169   <none>        5000:30000/TCP                  10m
flyte-sandbox-grpc                   ClusterIP   10.43.42.250    <none>        8089/TCP                        10m
flyte-sandbox-http                   ClusterIP   10.43.213.177   <none>        8088/TCP                        10m
flyte-sandbox-kubernetes-dashboard   ClusterIP   10.43.52.239    <none>        80/TCP                          10m
flyte-sandbox-minio                  NodePort    10.43.206.188   <none>        9000:30002/TCP,9001:31243/TCP   10m
flyte-sandbox-postgresql             NodePort    10.43.78.35     <none>        5432:30001/TCP                  10m
flyte-sandbox-postgresql-hl          ClusterIP   None            <none>        5432/TCP                        10m
flyte-sandbox-proxy                  NodePort    10.43.78.59     <none>        8000:30080/TCP                  10m
flyte-sandbox-webhook                ClusterIP   10.43.229.159   <none>        443/TCP                         10m
flyteagent                           ClusterIP   10.43.43.218    <none>        8000/TCP                        10m
a
but, if you're deploying with Helm, you don't need
flytectl demo start
at all
g
ah I think I misread you. I never used the
flytectl demo start
command before. I don't know why it was pointing to that instance. I'll delete those sandbox pods and redeploy with helm now
a
So the flytectl demo start will create a k3s cluster on your machine (essentially a K8s cluster inside a Docker container) and will deploy Flyte there. What you're doing with CDK, I guess, goes to an EKS cluster on your AWS account, right?
g
EKS cluster on your AWS account right
yeah exactly. kubectl is currently pointing to the wrong cluster, so I think I just have to change that back to the AWS EKS cluster
looks to be working...
kubectl config current-context
is pointing to the EKS cluster
Is there a way of checking it did indeed run on the EKS cluster?
I'm also concerned my deployment is brittle, given that e.g. I didn't point Helm to a role ARN for the service account. Is there any way of checking I'm in the clear?
you can see it says IAM role default and service account default
would be good to know which ARN they are pointing towards
but in general this looks great. thank you so much for the help
a
it says IAM role default and service account default
hm, we can confirm if it's a UI behavior or not. Do a
kubectl describe sa default -n flytesnacks-development
g
kubectl describe sa default -n flytesnacks-development

Name:                default
Namespace:           flytesnacks-development
Labels:              <none>
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::<>:role/<>
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>
so the annotations point to the flyte-workers-role
a
cool, so it's a UI thing. Would you mind filing an Issue? I think this is something we can discuss with maintainers, I mean, if the UI should show that misleading "default" IAM role
g
is it meant to be pointing to the flyte-workers-role? I haven't actually got a flyte-system-role anymore, so I'm intrigued to know what is being used instead
sure, I'll file an issue on github today
a
in practice you could use the same IAM role for both workers and backend
g
I guess so because they have the same policy. I'll continue with the next steps of the deployment and let you know if we run into any issues
out of interest, the only manual step I have so far is changing the StringEquals to StringLike part. Is that necessary?
I am quite surprised this configuration is working at all, especially given the dashboard says Service Account: default
kubectl describe sa --namespace flyte
Name:                default
Namespace:           flyte
Labels:              <none>
Annotations:         <none>
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>


Name:                flyte-admin
Namespace:           flyte
Labels:              app.kubernetes.io/name=flyte-admin
                     aws.cdk.eks/prune-XXXXX=
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::<>:role/<flyteworkersrol>
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>


Name:                flyte-backend-flyte-binary
Namespace:           flyte
Labels:              app.kubernetes.io/instance=flyte-backend
                     app.kubernetes.io/managed-by=Helm
                     app.kubernetes.io/name=flyte-binary
                     app.kubernetes.io/version=1.16.0
                     helm.sh/chart=flyte-binary-v1.14.1
Annotations:         meta.helm.sh/release-name: flyte-backend
                     meta.helm.sh/release-namespace: flyte
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>
a
the workers use the default SA by default
g
For my understanding, if it's using the default SA, what are the other two SAs here doing?
and is there any way from pyflyte to configure workers to use a different SA?
just FYI, stuff like setting up the ALB controller is very easy in CDK. I assume it's the same with Terraform.
this.cluster = new Cluster(this, 'EKSCluster', {
  ...
  albController: {
    version: AlbControllerVersion.V2_8_2,
  },
});
and then
kubectl get deployment -n kube-system aws-load-balancer-controller

NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
aws-load-balancer-controller   2/2     2            2           2m27s
pretty sweet
once I've tidied up the CDK and if flyte still works, I'll post the code on your repo and we can go through it and discuss whether it's worth merging
a
what are the other two SAs here doing?
sorry I don't follow, what other SAs?
once I've tidied up the CDK and if flyte still works, I'll post the code on your repo and we can go through it and discuss whether it's worth merging
that'd be great, I can imagine EKS users would benefit from this a lot
g
so there are 3 SAs in the flyte namespace and only default is being used?
Name:                default
Namespace:           flyte
Labels:              <none>
Annotations:         <none>
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>


Name:                flyte-admin
Namespace:           flyte
Labels:              app.kubernetes.io/name=flyte-admin
                     aws.cdk.eks/prune-XXXXX=
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::<>:role/<flyteworkersrol>
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>


Name:                flyte-backend-flyte-binary
Namespace:           flyte
Labels:              app.kubernetes.io/instance=flyte-backend
                     app.kubernetes.io/managed-by=Helm
                     app.kubernetes.io/name=flyte-binary
                     app.kubernetes.io/version=1.16.0
                     helm.sh/chart=flyte-binary-v1.14.1
Annotations:         meta.helm.sh/release-name: flyte-backend
                     meta.helm.sh/release-namespace: flyte
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>
a
ah, ok. well, default comes with K8s. flyte-backend-flyte-binary is what is used. flyte-admin: pretty sure your single binary is not even using this
g
So it's not being used, but the EKS IAM role attached to it is being used to run Flyte pods? I can delete it and see if it's still required at all, I guess
a
the eks iam role attached to it is being used to run flyte pods?
not sure how your Trust Relationship is configured but you can remove that one both from K8s and IAM and everything should still work
g
I'll try
Hi David, I still haven't got to the bottom of this, and I'd like to get some working CDK instructions going to bring more people to Flyte. I've outlined my current CDK deployment here: https://github.com/davidmirror-ops/flyte-the-hard-way/pull/28 So the error I'm getting
An error occurred (ValidationError) when calling the AssumeRoleWithWebIdentity operation: Request ARN is invalid
points towards the OIDC trust relationship issue. I don't yet understand well enough how the Flyte internals, roles and service accounts work. I have CDK to create roles. I have CDK to create service accounts. Currently, I create the flyte-backend-flyte-binary service account by setting
create: true
in the YAML, because earlier in this thread I couldn't find a way to do that outside of the Helm chart without the deployment breaking, although there might still be a way. Let me know what to try. Hopefully it's an obvious error
(and thanks for your help already)
Just a few more ideas of how to do this. I'm following your docs for creating the IAM roles, and you say "You won't create a Kubernetes service account at this point; it will be created by running the Helm chart at the end of the process." If I understand this correctly, it means we should create the role following these lines of CDK (editing to a wildcard with StringLike), and then we don't actually run the part where the service account is created using KubernetesManifest, i.e. these lines of CDK. Have I understood this correctly?