# ask-the-community
a
Hi! I've been attempting to set up Flyte on AWS by following the Flyte the Hard Way (FTHW) guide. So far I've only gotten as far as the Single Cluster Simple Cloud Deployment, which corresponds to page 05-deploy-with-helm of FTHW. I'm able to deploy a test workflow (such as hello_world.py) to the remote cluster. When it runs, I receive an error in the Flyte console:
11/22/2023 2:27:40 PM UTC task submitted to K8s

11/22/2023 2:27:40 PM UTC [ContainersNotReady|ContainerCreating]: containers with unready status: [f5e7547d4c8994d4d992-n0-0]|
kages/flytekit/bin/entrypoint.py:519 in    │
│ fast_execute_task_cmd                                                        │
│                                                                              │
│ ❱ 519 │   │   _download_distribution(additional_distribution, dest_dir)      │
│                                                                              │
│ /usr/local/lib/python3.11/site-packages/flytekit/core/utils.py:295 in        │
│ wrapper                                                                      │
│                                                                              │
│ ❱ 295 │   │   │   │   return func(*args, **kwargs)                           │
│                                                                              │
│ /usr/local/lib/python3.11/site-packages/flytekit/tools/fast_registration.py: │
│ 113 in download_distribution                                                 │
│                                                                              │
│ ❱ 113 │   FlyteContextManager.current_context().file_access.get_data(additio │
│                                                                              │
│ /usr/local/lib/python3.11/site-packages/flytekit/core/data_persistence.py:47 │
│ 5 in get_data                                                                │
│                                                                              │
│ ❱ 475 │   │   │   raise FlyteAssertion(                                      │
╰──────────────────────────────────────────────────────────────────────────────╯
FlyteAssertion: Failed to get data from s3://<my-bucket-here>/flytesnacks/development/RUD7F4QDHIZRCQGDFXZKERK4GM======/script_mode.tar.gz to /root/ (recursive=False).

Original exception: Access Denied
I feel pretty lost and I'm unsure where to go from here. I'd really appreciate any help or advice! Thank you :)
Update: I manually gave the file in the error above full world read permissions for testing, and it did make it past that issue. However, I now get a new failure:
tar: Removing leading `/' from member names
d
Hi @Alexandra D!
so did you complete the other sections in the FTHW guide? especially #03?
a
Yes I've gone through #01 through #05 (twice now, in two completely different AWS environments).
d
uh, ok. Can we check the output of
aws iam get-role --role-name flyte-system-role --query Role.AssumeRolePolicyDocument
a
of course!
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<my-aws-acct-id>:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/DC178F5D689F6DDF61B7E0F99688DED4"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/DC178F5D689F6DDF61B7E0F99688DED4:aud": "sts.amazonaws.com",
          "oidc.eks.us-east-1.amazonaws.com/id/DC178F5D689F6DDF61B7E0F99688DED4:sub": "system:serviceaccount:flyte:flyte-backend-flyte-binary"
        }
      }
    }
  ]
}
I'm unsure whether the OIDC UUID is worth redacting so I left it in.
d
got it. can you also please share
kubectl describe sa -n flyte flyte-backend-flyte-binary
a
Name:                flyte-backend-flyte-binary
Namespace:           flyte
Labels:              app.kubernetes.io/instance=flyte-backend
                     app.kubernetes.io/managed-by=Helm
                     app.kubernetes.io/name=flyte-binary
                     app.kubernetes.io/version=1.16.0
                     helm.sh/chart=flyte-binary-v1.10.0
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::<my-aws-acct-id>:role/flyte-system-role
                     meta.helm.sh/release-name: flyte-backend
                     meta.helm.sh/release-namespace: flyte
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>
d
alright. and what's the IAM Policy attached to the role?
a
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:*",
        "s3-object-lambda:*"
      ],
      "Resource": "*"
    }
  ]
}
The above is a result of this portion of the FTHW guide:
eksctl create iamserviceaccount --cluster=<my-flyte-cluster> --name=flyte-backend-flyte-binary --role-only --role-name=flyte-system-role --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess --approve --region <region-code> --namespace flyte
d
yes, I think it's time to update that guide and make it less permissive 😅
Ok, considering this:
tar: Removing leading `/' from member names
I guess you're running
pyflyte run --remote ...
right?
a
correct
I've attempted a few different example workflows but they all fail the same way:
pyflyte run --remote hello_world.py hello_world_wf
Granted I still haven't even really figured out the first issue beyond manually increasing permissions on the file it kept running into. But yea once I give that file full world permissions I get the tar error.
d
well, this message:
tar: Removing leading `/' from member names
is not an error. It's more like a cryptic but normal log line from the untar operation that happens when you do "fast registration", but that's a different story.
can you share
kubectl describe sa default -n flytesnacks-development
(Assuming you're not providing a different project-domain)
a
Name:                default
Namespace:           flytesnacks-development
Labels:              <none>
Annotations:         <none>
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>
That doesn't look very good
d
there it is. This is missing from the guide and will need to be resolved soon
a
!!
I'm guessing the annotations section being empty is problematic?
d
yes, this is the SA that the Task Pods will use. Empty annotations means no IRSA, no way to access AWS resources
let me find a quick way to fix this
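A minimal sketch of the check being done by eye here, assuming the ServiceAccount has been dumped with something like `kubectl get sa default -n flytesnacks-development -o json`. The helper name `irsa_role_arn` and the sample dictionaries are hypothetical; only the annotation key `eks.amazonaws.com/role-arn` and the role name come from this thread.

```python
import json
from typing import Optional

# Annotation key that IRSA reads from a Kubernetes ServiceAccount.
IRSA_ANNOTATION = "eks.amazonaws.com/role-arn"


def irsa_role_arn(service_account: dict) -> Optional[str]:
    """Return the IAM role ARN annotated on a ServiceAccount, or None if absent."""
    # The `annotations` key is missing entirely when the SA has no annotations.
    annotations = service_account.get("metadata", {}).get("annotations") or {}
    return annotations.get(IRSA_ANNOTATION)


# Hypothetical inputs shaped like the JSON form of the two `describe` outputs.
unannotated = json.loads(
    '{"kind": "ServiceAccount", "metadata": {"name": "default", '
    '"namespace": "flytesnacks-development"}}'
)
annotated = {
    "kind": "ServiceAccount",
    "metadata": {
        "name": "default",
        "namespace": "flytesnacks-development",
        "annotations": {IRSA_ANNOTATION: "arn:aws:iam::<acct-id>:role/flyte-workers"},
    },
}

for sa in (unannotated, annotated):
    arn = irsa_role_arn(sa)
    print(sa["metadata"]["namespace"], "->", arn or "no IRSA annotation")
```

Running the same check against every project-domain namespace would show which task Pods have no role to assume.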
let's try this:
1. Create a new role for the workers:
eksctl create iamserviceaccount --cluster=<your-EKS-cluster-name> --name=default --role-only --role-name=flyte-workers --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess --approve --region <region-code> --namespace flyte
2. If you run
aws iam get-role --role-name flyte-workers --query Role.AssumeRolePolicyDocument
it should look similar to:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<acct-id>:oidc-provider/oidc.eks.<region-code>.amazonaws.com/id/<UUID-OIDC>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.<region-code>.amazonaws.com/id/<UUID-OIDC>:sub": "system:serviceaccount:flyte:default",
          "oidc.eks.<region-code>.amazonaws.com/id/<UUID-OIDC>:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
3. If that's the case, edit the IAM Role's trust policy to change the namespace from `flyte` to `*`:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<acct-id>:oidc-provider/oidc.eks.<region-code>.amazonaws.com/id/<UUID-OIDC>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.<region-code>.amazonaws.com/id/<UUID-OIDC>:sub": "system:serviceaccount:*:default",
          "oidc.eks.<region-code>.amazonaws.com/id/<UUID-OIDC>:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
This is because every project-domain combination gets its own namespace with a `default` KSA in it, so making it a wildcard is a convenience here. Not ideal, but let me know if it works.
a
Should I re-annotate with the new `flyte-workers` role?
d
right. This can be achieved from the Helm values. Let me find it.
a
I'm not sure I'm editing the trust policy correctly. I get this error when I add the wildcard:
d
oh yes, sorry, also change from `StringEquals` to `StringLike`
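The trust-policy edit from steps 2 and 3, together with the `StringLike` correction, can be sketched as a small transformation. This is only an illustration under assumptions: the function name `allow_default_sa_in_any_namespace` is hypothetical, the OIDC provider string uses the same placeholders as above, and `fnmatch` is a rough local stand-in for how IAM evaluates `StringLike` (IAM treats `*` as a wildcard only under `StringLike`; under `StringEquals` it is compared literally).

```python
import copy
import fnmatch

# Subject pattern from step 3: the `default` SA in any namespace.
WORKER_SUBJECT = "system:serviceaccount:*:default"


def allow_default_sa_in_any_namespace(trust_policy: dict) -> dict:
    """Return a copy whose `:sub` conditions use the wildcard under StringLike."""
    policy = copy.deepcopy(trust_policy)
    for statement in policy["Statement"]:
        condition = statement.get("Condition")
        if not condition:
            continue
        # Move everything from StringEquals into StringLike so `*` is honored.
        merged = {**condition.pop("StringEquals", {}), **condition.get("StringLike", {})}
        for key in merged:
            if key.endswith(":sub"):
                merged[key] = WORKER_SUBJECT
        condition["StringLike"] = merged
    return policy


# Placeholder values, exactly as elided in the thread.
OIDC = "oidc.eks.<region-code>.amazonaws.com/id/<UUID-OIDC>"
original = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Federated": f"arn:aws:iam::<acct-id>:oidc-provider/{OIDC}"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{OIDC}:sub": "system:serviceaccount:flyte:default",
                    f"{OIDC}:aud": "sts.amazonaws.com",
                }
            },
        }
    ],
}

updated = allow_default_sa_in_any_namespace(original)
condition = updated["Statement"][0]["Condition"]
print(condition)

# Approximate which service accounts the pattern would admit.
for subject in (
    "system:serviceaccount:flytesnacks-development:default",
    "system:serviceaccount:flyte:flyte-backend-flyte-binary",
):
    print(subject, fnmatch.fnmatchcase(subject, WORKER_SUBJECT))
```

The resulting document would still have to be applied to the role by whatever means you normally use to edit trust relationships; that step is not shown here.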
Regarding "Should I re-annotate with the new `flyte-workers` role?": make sure your Helm values include the following:
1.
configuration:
  inline:
    cluster_resources:
      customData:
      - production:
        - defaultIamRole:
            value: arn:aws:iam::<acct-id>:role/flyte-workers
      - staging:
        - defaultIamRole:
            value: arn:aws:iam::<acct-id>:role/flyte-workers
      - development:
        - defaultIamRole:
            value: arn:aws:iam::<acct-id>:role/flyte-workers
2.
clusterResourceTemplates:
  inline:

    002_serviceaccount.yaml: |
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: default
        namespace: '{{ namespace }}'
        annotations:
          eks.amazonaws.com/role-arn: '{{ defaultIamRole }}'
You just need to update the `acct-id` and then run a Helm upgrade.
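To make the relationship between the two Helm snippets concrete, here is a rough illustration of how the per-domain `customData` values could fill the `{{ ... }}` placeholders in the ServiceAccount template. It is not Flyte's actual templating code; `render` and `render_service_account` are hypothetical helpers, and the `<project>-<domain>` namespace naming is inferred from `flytesnacks-development` in this thread.

```python
import re

# Template body copied from the clusterResourceTemplates snippet.
TEMPLATE = """\
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: '{{ namespace }}'
  annotations:
    eks.amazonaws.com/role-arn: '{{ defaultIamRole }}'
"""

# Per-domain values mirroring cluster_resources.customData.
CUSTOM_DATA = {
    "production": {"defaultIamRole": "arn:aws:iam::<acct-id>:role/flyte-workers"},
    "staging": {"defaultIamRole": "arn:aws:iam::<acct-id>:role/flyte-workers"},
    "development": {"defaultIamRole": "arn:aws:iam::<acct-id>:role/flyte-workers"},
}


def render(template: str, values: dict) -> str:
    """Replace each `{{ name }}` placeholder with values[name]."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: values[m.group(1)], template)


def render_service_account(project: str, domain: str) -> str:
    """Render the default ServiceAccount manifest for one project-domain pair."""
    values = {"namespace": f"{project}-{domain}", **CUSTOM_DATA[domain]}
    return render(TEMPLATE, values)


manifest = render_service_account("flytesnacks", "development")
print(manifest)
```

If the rendered manifest is what ends up in the cluster, `kubectl describe sa default -n flytesnacks-development` should then show the `eks.amazonaws.com/role-arn` annotation.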
a
Running
pyflyte run --remote hello_world.py hello_world_wf
Failed with Exception Code: SYSTEM:Unknown
RPC Failed, with Status: StatusCode.INTERNAL
	details: failed to create a signed url. Error: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
	status code: 403, request id: 7e897ee1-045f-4a91-af19-a5594380fa95
	Debug string UNKNOWN:Error received from peer  {grpc_message:"failed to create a signed url. Error: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: 7e897ee1-045f-4a91-af19-a5594380fa95", grpc_status:13, created_time:"2023-11-22T09:54:13.754081855-07:00"}
Gives me role permissions errors now
d
is the `default` SA annotated?
a
One sec that was my fault actually
Right so first I get this:
FlyteAssertion: Failed to get data from s3://flyte-metadata/flytesnacks/development/CXNXVNZLWOB3ULGK3EUPEK666M======/script_mode.tar.gz to /root/ (recursive=False).

Original exception: Access Denied
And then to temporarily get around this I manually enable public access but it still fails with just:
tar: Removing leading `/' from member names
I don't know how to reply to a comment directly, but no, the `default` SA is not annotated.
d
ok, it should be annotated after the Helm upgrade operation
a
Maybe I should share my values
d
or you can annotate it manually for testing
kubectl edit sa default -n flytesnacks-development
and add the annotation:
eks.amazonaws.com/role-arn: arn:aws:iam::<acct-id>:role/flyte-workers
a
configuration:
  database:
    username: flyteadmin
    password: "<db-pass>"
    host: <db-url>
    dbname: flyteadmin
  storage:
    metadataContainer: flyte-metadata
    userDataContainer: flyte-userdata
    provider: s3
    providerConfig:
      s3:
        region: "us-east-1"
        authType: "iam"
  inline:
    plugins:
      k8s:
        inject-finalizer: true
        default-env-vars:
          - AWS_METADATA_SERVICE_TIMEOUT: 5
          - AWS_METADATA_SERVICE_NUM_ATTEMPTS: 20
    storage:
      cache:
        max_size_mbs: 100
        target_gc_percent: 100
    cluster_resources:
      customData:
      - production:
        - defaultIamRole:
            value: arn:aws:iam::<acct id>:role/flyte-workers
      - staging:
        - defaultIamRole:
            value: arn:aws:iam::<acct id>:role/flyte-workers
      - development:
        - defaultIamRole:
            value: arn:aws:iam::<acct id>:role/flyte-workers
clusterResourceTemplates:
  inline:

    002_serviceaccount.yaml: |
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: default
        namespace: '{{ namespace }}'
        annotations:
          eks.amazonaws.com/role-arn: '{{ defaultIamRole }}'
serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::<acct id>:role/flyte-system-role"
Followed by
helm upgrade flyte-backend flyteorg/flyte-binary -n flyte --values eks-starter.yaml
d
it looks good. But if the default SA is not annotated you won't get past the error. Let's try annotating it manually to test.
a
Name:                default
Namespace:           flytesnacks-development
Labels:              <none>
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::<acct id>:role/flyte-workers
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>
It's annotated manually now, but I'm receiving the same failure and tar message.
d
is there any Pod in the `flytesnacks-development` namespace?
kubectl get pods -n flytesnacks-development
a
NAME                        READY   STATUS      RESTARTS   AGE
a588f4jx5jcxdwqmr5ms-n0-0   0/1     OOMKilled   0          3m18s
a6zn9kr6frfl82zmqg99-n0-0   0/1     Error       0          14m
a8slvpckm9z42dwp47xt-n0-0   0/1     OOMKilled   0          12m
acxxm4j4g6xx7qbkklql-n0-0   0/1     Error       0          3h33m
advlcfj6r4ccb2qczrj6-n0-0   0/1     Error       0          3h17m
am2htqq2kff5c8cv6zkm-n0-0   0/1     Error       0          3h25m
amjsd9ncgkc24fcbstjh-n0-0   0/1     Error       0          16m
amxjs9gbv78bw6b2s7d5-n0-0   0/1     Error       0          96m
aphp5t4ms5b59vm6cgff-n0-0   0/1     OOMKilled   0          2m39s
ascxj9pb4gtcq7g8hdt4-n0-0   0/1     OOMKilled   0          95m
asq4jdzlp6qxh8bbkxmg-n0-0   0/1     OOMKilled   0          103m
f12e6b2489129437caf9-n0-0   0/1     Error       0          112m
f325c37c092a84f2c831-n0-0   0/1     Error       0          15m
f5e7547d4c8994d4d992-n0-0   0/1     Error       0          169m
f6d21a6916aa94b85919-n0-0   0/1     Error       0          19m
f89dee937e9da4242ba8-n0-0   0/1     OOMKilled   0          4m23s
faa87eed69709459781e-n0-0   0/1     Error       0          104m
fae1758d5567949e6bdc-n0-0   0/1     Error       0          96m
fb6d1f3d7f1844657877-n0-0   0/1     Error       0          3h34m
ffb213020e44741e7859-n0-0   0/1     Error       0          3h15m
The logs for the first show the tar message. The logs for the second show the S3 access issue that I bypass by manually enabling public access for the script_mode.tar.gz.
I'm sure the rest of that list just alternates between the two across my various attempts.
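With this many failed Pods, it can help to group them by the STATUS column instead of reading logs one by one. The snippet below is a small sketch that parses the plain-text output of `kubectl get pods`; the function name `pod_statuses` is hypothetical, and the sample rows are the first three from the listing above.

```python
from collections import Counter


def pod_statuses(kubectl_output: str) -> dict:
    """Map pod name -> STATUS from the default `kubectl get pods` table."""
    lines = [line for line in kubectl_output.strip().splitlines() if line.strip()]
    header = lines[0].split()
    name_col = header.index("NAME")
    status_col = header.index("STATUS")
    statuses = {}
    for line in lines[1:]:
        fields = line.split()
        statuses[fields[name_col]] = fields[status_col]
    return statuses


SAMPLE = """\
NAME                        READY   STATUS      RESTARTS   AGE
a588f4jx5jcxdwqmr5ms-n0-0   0/1     OOMKilled   0          3m18s
a6zn9kr6frfl82zmqg99-n0-0   0/1     Error       0          14m
a8slvpckm9z42dwp47xt-n0-0   0/1     OOMKilled   0          12m
"""

statuses = pod_statuses(SAMPLE)
counts = Counter(statuses.values())
print(dict(counts))

# Pods killed for exceeding their memory limit, separate from generic errors.
oom_pods = sorted(name for name, status in statuses.items() if status == "OOMKilled")
print(oom_pods)
```

Separating `OOMKilled` from `Error` makes it clearer that two different failures are interleaved in the list.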
d
oh, `OOMKilled`. Can you add the following to your values file first and upgrade?
configuration:
  inline:
    task_resources:
      defaults:
        cpu: 100m
        memory: 100Mi
        storage: 100Mi
      limits:
        memory: 2Gi
are you requesting specific memory/cpu resources in your task? I don't think so right?
a
no, right now I'm just running the simple hello_world.py, I can print that in a sec
In the upgrade you added, you double indent task_resources, is that intentional?
should it be placed under the `cluster_resources` segment?
d
it's not, it was a formatting error
a
Okay gotcha
same failure/tar msg
f294923173c7444c39a1-n0-0   0/1     OOMKilled   0          32s
tar: Removing leading `/' from member names
And the hello_world.py from flytesnacks (though I've tried others just in case, and they all fail in the same way)
# %% [markdown]
#
# # Hello, World!
#
# ```{eval-rst}
# .. tags:: Basic
# ```
#
# Let's write a Flyte {py:func}`~flytekit.workflow` that invokes a
# {py:func}`~flytekit.task` to generate the output "Hello, World!".
#
# Flyte tasks are the core building blocks of larger, more complex workflows.
# Workflows compose multiple tasks – or other workflows –
# into meaningful steps of computation to produce some useful set of outputs or outcomes.
#
# To begin, import `task` and `workflow` from the `flytekit` library.

# %%
from flytekit import task, workflow


# %% [markdown]
# Define a task that produces the string "Hello, World!".
# Simply using the `@task` decorator to annotate the Python function.

# %%
@task
def say_hello() -> str:
    return "Hello, World!"


# %% [markdown]
# You can handle the output of a task in the same way you would with a regular Python function.
# Store the output in a variable and use it as a return value for a Flyte workflow.

# %%
@workflow
def hello_world_wf() -> str:
    res = say_hello()
    return res


# %% [markdown]
# Run the workflow by simply calling it like a Python function.

# %%
if __name__ == "__main__":
    print(f"Running hello_world_wf() {hello_world_wf()}")


# %% [markdown]
# Next, let's delve into the specifics of {ref}`tasks <task>`,
# {ref}`workflows <workflow>` and {ref}`launch plans <launch_plan>`.
d
still an `OOMKilled` (a K8s status indicating the container ran out of memory). Let's bump the base requests to test:
configuration:
  inline:
    task_resources:
      defaults:
        cpu: 1000m
        memory: 1000Mi
        storage: 1000Mi
      limits:
        storage: 2000Mi
a
i love seeing green
d
it's my fav color too 😎
thanks for your patience, I'll make sure to update the FTHW guide
a
Thank you for -your- patience! You spent so much time with me and I am extremely grateful for that. You're incredible! ❤️
d
now, if you want to make the IAM Policy less permissive, these are the minimum permissions:
"Action": [
    "s3:DeleteObject*",
    "s3:GetObject*",
    "s3:ListBucket",
    "s3:PutObject*"
],
"Resource": [
    "arn:aws:s3:::<your-S3-bucket>*",
    "arn:aws:s3:::<your-S3-bucket>*/*"
],
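The fragment above is not a complete policy document on its own. As a hedged sketch, it could be wrapped into one like this, reusing the `Version` and `Effect` fields from the full-access policy shown earlier in the thread. The function name `minimal_flyte_s3_policy` is hypothetical, and `flyte-metadata` is just the bucket name from the values file shared above.

```python
import json


def minimal_flyte_s3_policy(bucket: str) -> dict:
    """Build a policy document from the minimum S3 actions listed in the thread."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:DeleteObject*",
                    "s3:GetObject*",
                    "s3:ListBucket",
                    "s3:PutObject*",
                ],
                # Trailing `*` after the bucket name is kept as written in the thread.
                "Resource": [
                    f"arn:aws:s3:::{bucket}*",
                    f"arn:aws:s3:::{bucket}*/*",
                ],
            }
        ],
    }


policy = minimal_flyte_s3_policy("flyte-metadata")
print(json.dumps(policy, indent=2))
```

The printed JSON can be compared field by field against the `s3:*` policy currently attached to the role before swapping it in.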
a
I'll give that a go, and I might be back to ask further questions regarding the S3 permissions issues I was having (bypassing with public access). But I'll give us both a break for now 🙂
Oh and just a minor follow-up: this seems to tell me that we didn't need the separate `flyte-workers` role, and I could get rid of it, I imagine? I will give removing it a shot when I get back to my computer in a bit, but my guess is that the resource configuration fixed the issue.
d
I used to think so months ago when I wrote v1 of the guide. We can for sure collapse everything into a single IAM Role, but we'll still need to edit the Trust Relationship to map both the backend SA and the `default` SA used by the workers. And that seems a bit off in terms of self-contained security policies: sharing one IAM role across multiple SAs? The end result could be the same, and the idea with FTHW is to provide a quickstart, but now I guess we need to rethink it so it helps set up a production-grade environment. I've been iterating recently on a reference implementation built with Terraform that should incorporate all these recommendations: https://github.com/unionai-oss/deploy-flyte/tree/main/environments/aws