#ask-the-community

Radhakrishna Sanka

04/13/2023, 4:40 PM
I was looking at the instructions for getting a cluster running on AWS and I noticed that the 1.0 documentation was a lot better than the latest (it explicitly describes all the dependencies). Any idea what the best guide for this is?

David Espejo (he/him)

04/13/2023, 5:56 PM
Hi @Radhakrishna Sanka, here's a community-maintained resource to get you started on AWS: https://github.com/davidmirror-ops/flyte-the-hard-way

Radhakrishna Sanka

04/13/2023, 10:31 PM
Thanks a lot for this. I was looking for a resource like this. krishna
@David Espejo (he/him) thanks a lot for the guide. It seems like I'm having some issues creating nodegroups. It creates the instances but gives me the following error:
```
Your current user or role does not have access to Kubernetes objects on this EKS nodegroup
```
Screen Shot 2023-04-15 at 7.05.33 PM.png

David Espejo (he/him)

04/16/2023, 6:48 PM
Seems like there are two errors here:
1. The role issue has more to do with the IAM role you're using in the Management Console to configure the cluster. It probably doesn't have permissions to query the API server. Take a look at: https://repost.aws/knowledge-center/eks-kubernetes-object-access-error
2. `The instances failed to join the kubernetes cluster` typically happens when the worker nodes cannot communicate with the EKS control plane. It largely depends on the API server access configuration (e.g. Public, Public+Private, Private). If you're using private subnets you probably need endpoint services. Check out this runbook, which can help you isolate the root cause of the problem: https://docs.aws.amazon.com/systems-manager-automation-runbooks/latest/userguide/automation-awssupport-troubleshooteksworkernode.html

Radhakrishna Sanka

04/16/2023, 10:50 PM
Yeah, it seems like something was off with the VPC (either the lack of endpoints or something else). I did finally manage to create the EKS cluster. One thing I'm trying to confirm is the full outline for the IAM role `flyte-system`. Would it be possible to share the JSON for the role? I want to compare against the role I'm making here.

David Espejo (he/him)

04/17/2023, 5:33 PM
I'm rebuilding my environment today; once I get to the role I'll post the JSON here.
```
{
    "Role": {
        "Path": "/",
        "RoleName": "flyte-system-role",
        "RoleId": "AROAYS5I3UDGD6RDWHN5M",
        "Arn": "arn:aws:iam::590375264460:role/flyte-system-role",
        "CreateDate": "2023-04-17T21:14:53+00:00",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Federated": "arn:aws:iam::590375264460:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/1EE94FBE2DE77558404404CF5947470C"
                    },
                    "Action": "sts:AssumeRoleWithWebIdentity",
                    "Condition": {
                        "StringEquals": {
                            "oidc.eks.us-east-1.amazonaws.com/id/1EE94FBE2DE77558404404CF5947470C:aud": "sts.amazonaws.com",
                            "oidc.eks.us-east-1.amazonaws.com/id/1EE94FBE2DE77558404404CF5947470C:sub": "system:serviceaccount:flyte:flyte-backend-binary"
                        }
                    }
                }
            ]
        },
        "Description": "",
        "MaxSessionDuration": 3600,
        "Tags": [
            {
                "Key": "alpha.eksctl.io/cluster-name",
                "Value": "fthw-eks-cluster"
            },
            {
                "Key": "eksctl.cluster.k8s.io/v1alpha1/cluster-name",
                "Value": "fthw-eks-cluster"
            },
            {
                "Key": "alpha.eksctl.io/iamserviceaccount-name",
                "Value": "flyte/flyte-backend-binary"
            },
            {
                "Key": "alpha.eksctl.io/eksctl-version",
                "Value": "0.132.0-dev+15bffbb0d.2023-03-01T18:34:36Z"
            }
        ],
        "RoleLastUsed": {}
    }
}
```
Bear in mind that it was created with `eksctl`, following the instructions in the guide (which I just updated to improve the experience).
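For anyone reproducing this role from Infrastructure-as-Code rather than `eksctl`, the trust policy in the JSON above follows a regular pattern: the cluster's OIDC provider is the federated principal, and the `:aud` and `:sub` conditions pin the role to one Kubernetes service account. The following is a minimal Python sketch of that pattern, not the guide's own tooling. The function name `build_irsa_trust_policy` and its parameters are hypothetical; the values passed in are the ones from the role shared above.

```python
import json


def build_irsa_trust_policy(account_id, oidc_provider, namespace, service_account):
    """Build a trust policy letting one Kubernetes service account assume a role.

    oidc_provider is the issuer host and path without the https:// prefix.
    This helper is illustrative only and is not part of eksctl or Flyte.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # The token audience must be STS.
                        f"{oidc_provider}:aud": "sts.amazonaws.com",
                        # Only this namespace/service account pair may assume the role.
                        f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                    }
                },
            }
        ],
    }


policy = build_irsa_trust_policy(
    account_id="590375264460",
    oidc_provider="oidc.eks.us-east-1.amazonaws.com/id/1EE94FBE2DE77558404404CF5947470C",
    namespace="flyte",
    service_account="flyte-backend-binary",
)
print(json.dumps(policy, indent=4))
```

The printed document should match the `AssumeRolePolicyDocument` section above. In your own account, the account ID and OIDC provider ID will differ, and the S3 permissions policy attached to the role is a separate piece that this sketch does not cover.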

Radhakrishna Sanka

04/17/2023, 9:23 PM
Thanks! I'm doing an Infrastructure-as-Code implementation of this, so I'm trying to get as many details down as possible.

David Espejo (he/him)

04/17/2023, 9:27 PM
Oh please! Are you using Terraform or similar?

Radhakrishna Sanka

04/17/2023, 9:27 PM
Pulumi, since a lot of code is in Golang.
I need to clean up some things but I'll put it up on GitHub for you to access. I wouldn't mind having another set of eyes on this.

David Espejo (he/him)

04/17/2023, 9:29 PM
awesome! Thanks so much

Radhakrishna Sanka

04/17/2023, 9:29 PM
No, thank you for helping so much. I'm still stuck trying to figure out the flyteadmin role stuff, but I think I might get it to work soon.

David Espejo (he/him)

04/17/2023, 9:39 PM
If it's of any help, take a look at this architecture diagram, which I plan to contribute to the docs. It's been checked with maintainers and code 🙂 https://drive.google.com/file/d/1U4UnjzEV4LzqI_e2SOLJb8wzuzt4DxUE/view?usp=sharing

Radhakrishna Sanka

04/17/2023, 10:32 PM
@David Espejo (he/him) - Here's the repo (https://github.com/rkrishnasanka/flyte-pulumi.git). The thing is still under development; I'm still figuring out:
1. How to pull the security group associated with the EKS cluster for the RDS
2. The role for the EKS IAM, OIDC provider and a bunch of other stuff that you used the Kubernetes command-line tools for. Basically 3, 4, 5
Any inputs you might have would be useful.
@David Espejo (he/him) one question I have is which security group you’re pulling for the EKS cluster?

David Espejo (he/him)

04/18/2023, 3:21 PM
For the tutorial I'm just using the default one. Of course EKS will create and manage two other SGs for intra-node and node-to-control-plane communication

Radhakrishna Sanka

04/18/2023, 3:22 PM
So which one would work for the RDS?
I would imagine the node-to-control-plane SG?

David Espejo (he/him)

04/18/2023, 3:27 PM
In this case, it would also be the `default` one in the same VPC as the EKS cluster. The node-to-control-plane SG is only for traffic from the worker nodes to the API server.

Radhakrishna Sanka

04/18/2023, 3:29 PM
Will the control plane be the only one talking to the RDS ?

David Espejo (he/him)

04/18/2023, 3:31 PM
By `control plane` I mean the EKS control plane 🙂 So in this case `flyteadmin`, which will be a workload running on the worker nodes, is the one that needs communication with RDS.

Radhakrishna Sanka

04/18/2023, 3:36 PM
Gotcha, I think I have some of the Kubernetes fundamentals wrong. Okay, I'm testing out the RDS connection with the default security group now. One more clarification: would this RDS connection test you wrote work if I don't create the `flyte-system-role`? It seems like it should, since the flyte-system role, etc. were used for getting S3 access.

David Espejo (he/him)

04/18/2023, 3:38 PM
Right, the RDS connection test doesn't require the IAM role. It's needed when you have to write data to S3 (e.g. running a workflow).

Radhakrishna Sanka

04/25/2023, 4:52 PM
@David Espejo (he/him) How's it going? I was travelling most of last week and got to work on this again today. I created an RDS instance (not Aurora at the moment, to conserve resources), but it seems like the pods are unable to resolve the endpoint:
```
sql: could not translate host name "flyte-db5b39eef.cfk9fd2rmdtl.us-east-1.rds.amazonaws.com" to address: Temporary failure in name resolution
pod "pgsql-postgresql-client" deleted
pod testdb/pgsql-postgresql-client terminated (Error)
```
Any idea how to test whether name resolution is working on the EKS nodes?

David Espejo (he/him)

04/25/2023, 5:10 PM
Hey @Radhakrishna Sanka, thanks for sharing and I'm glad you're keeping up the great work! It could be something on the CoreDNS instance (assuming you're using CoreDNS). You can follow the instructions here to troubleshoot DNS resolution: https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
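As a complement to the linked guide, a quick way to reproduce the "could not translate host name" failure outside of `psql` is to ask the resolver directly. This is a small sketch using only Python's standard `socket` module. The function name `can_resolve` is hypothetical, and the check is only meaningful for cluster DNS if it runs inside a pod, since on a laptop it exercises the laptop's resolver instead.

```python
import socket


def can_resolve(hostname):
    """Return True if the local resolver maps hostname to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except (socket.gaierror, UnicodeError):
        # gaierror: the resolver could not translate the name.
        # UnicodeError: the name is not a valid hostname (for example, a label is too long).
        return False
```

Calling `can_resolve` with the RDS endpoint and then with a well-known public name would help separate "DNS is broken for everything in the cluster" from "only this record fails".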

Radhakrishna Sanka

04/25/2023, 5:12 PM
Oh, thanks for this. Yeah, Kubernetes + networking has been the most tedious part of trying to set this up, tbh. Between VPC configuration issues and subnet unavailability, I think most of the issues have been around this.

David Espejo (he/him)

04/25/2023, 5:14 PM
I hear you, it can be troublesome

Radhakrishna Sanka

04/25/2023, 6:37 PM
@David Espejo (he/him) so I ran `kubectl get pods -n kube-system`:
```
NAME                       READY   STATUS    RESTARTS   AGE
aws-node-26snb             1/1     Running   0          134m
aws-node-b5vsr             1/1     Running   0          133m
aws-node-dx522             1/1     Running   0          133m
aws-node-ssgss             1/1     Running   0          133m
aws-node-xwp5n             1/1     Running   0          134m
coredns-7975d6fb9b-cnwqf   0/1     Pending   0          138m
coredns-7975d6fb9b-xw8qw   0/1     Pending   0          138m
kube-proxy-5v9cw           1/1     Running   0          134m
kube-proxy-9r8fv           1/1     Running   0          134m
kube-proxy-9z4c5           1/1     Running   0          133m
kube-proxy-cbfff           1/1     Running   0          133m
kube-proxy-gxhcj           1/1     Running   0          133m
```
It shows that my CoreDNS pods are pending. Does that mean they didn't start? Is there any way to force them to start?
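When the pod list gets long, it can help to filter it down to the pods that are not running. Here is a rough Python sketch that parses the plain-text table printed by `kubectl get pods`. The function name `pods_not_running` is hypothetical, the sample is shortened from the output above, and the parsing assumes the default column layout with whitespace-separated `NAME` and `STATUS` columns.

```python
def pods_not_running(kubectl_output):
    """Return (name, status) pairs for pods whose STATUS is not 'Running'."""
    lines = [line for line in kubectl_output.strip().splitlines() if line.strip()]
    header = lines[0].split()
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    stuck = []
    for line in lines[1:]:
        cols = line.split()
        if cols[status_idx] != "Running":
            stuck.append((cols[name_idx], cols[status_idx]))
    return stuck


sample_output = """
NAME                       READY   STATUS    RESTARTS   AGE
aws-node-26snb             1/1     Running   0          134m
coredns-7975d6fb9b-cnwqf   0/1     Pending   0          138m
coredns-7975d6fb9b-xw8qw   0/1     Pending   0          138m
kube-proxy-5v9cw           1/1     Running   0          134m
"""

print(pods_not_running(sample_output))
```

For this sample, only the two `coredns` pods are reported. A `Pending` status means the scheduler has not yet placed the pod on a node, so the next step is to look at the pod's events rather than to force a start.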

David Espejo (he/him)

04/25/2023, 6:45 PM
Can you run `kubectl describe core-dns-... -n kube-system` and see the `Events` section?

Radhakrishna Sanka

04/25/2023, 7:34 PM
Odd, the command isn't working. It's giving the following error:
```
error: the server doesn't have a resource type "coredns-7975d6fb9b-cnwqf"
```
Apologies, I'm new to Kubernetes.

David Espejo (he/him)

04/25/2023, 8:09 PM
Sorry: `kubectl describe pod core-dns... -n kube-system`

Radhakrishna Sanka

04/25/2023, 9:01 PM
@David Espejo (he/him) figured it out: my instances were too tiny, so it didn't get enough threads to launch CoreDNS.
Now I just have the RDS acting up and not responding.
Not sure if it's a VPC thing or something else.

David Espejo (he/him)

04/25/2023, 10:07 PM
Can you share the result of running the DB test container?
The DB should be in the same VPC and security group that your EKS cluster is in at the moment.
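Before involving credentials, a plain TCP check from inside a pod can indicate whether the security group and VPC routing let traffic reach the database at all. This is a minimal sketch using only Python's standard library, not the test container from the guide. The function name `can_connect` is hypothetical, and 5432 is assumed because it is the default PostgreSQL port.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts alike.
        return False
```

For example, `can_connect("<your RDS endpoint>", 5432)` returning `False` while DNS resolution works would point at the security group or subnet routing rather than at the database itself. A `True` result only shows that the port is reachable; it says nothing about authentication.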

Radhakrishna Sanka

04/25/2023, 10:09 PM
Ah shit, I just tore it down. I'm guessing it's using the default security group for the VPC. It's in the same VPC though. I'll launch it and send over the result.

David Espejo (he/him)

05/04/2023, 5:14 PM
hi @Radhakrishna Sanka have you been able to make progress on this?