I was looking at the instructions for getting a cl...
# flyte-support
m
I was looking at the instructions for getting a cluster running on AWS and I noticed that the 1.0 documentation was a lot better than the latest (it explicitly describes all the dependencies). Any idea what the best guide for this is?
a
hi @miniature-plumber-7394 Here's a community-maintained resource to get you started on AWS https://github.com/davidmirror-ops/flyte-the-hard-way
m
Thanks a lot for this. I was looking for a resource like this. krishna
@average-finland-92144 thanks a lot for the guide. It seems like I’m having some issues creating nodegroups, it creates the instances but gives me the following error:
Your current user or role does not have access to Kubernetes objects on this EKS nodegroup
a
seems like there are two errors in here:
1. The role issue has more to do with the IAM role you're using in the Management Console to configure the cluster. It probably doesn't have permissions to query the API server. Take a look at: https://repost.aws/knowledge-center/eks-kubernetes-object-access-error
2. `The instances failed to join the kubernetes cluster` typically happens when the worker nodes cannot communicate with the EKS control plane. It largely depends on the API Server access configuration (eg Public, Public+Private, Private). If you're using private subnets you probably need endpoint services. Check out this runbook that can help you isolate the root cause of the problem: https://docs.aws.amazon.com/systems-manager-automation-runbooks/latest/userguide/automation-awssupport-troubleshooteksworkernode.html
m
Yeah it seems like something was off with the VPC (either the lack of endpoints or something else). I did finally manage to create the EKS cluster. One thing I’m trying to confirm is the full outline for the IAM role `flyte-system`. Would it be possible to share the JSON for the role? I want to compare it against the role I’m making here.
a
I'm rebuilding my environment today, once I get to the role will post the JSON here.
{
    "Role": {
        "Path": "/",
        "RoleName": "flyte-system-role",
        "RoleId": "AROAYS5I3UDGD6RDWHN5M",
        "Arn": "arn:aws:iam::590375264460:role/flyte-system-role",
        "CreateDate": "2023-04-17T21:14:53+00:00",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Federated": "arn:aws:iam::590375264460:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/1EE94FBE2DE77558404404CF5947470C"
                    },
                    "Action": "sts:AssumeRoleWithWebIdentity",
                    "Condition": {
                        "StringEquals": {
                            "oidc.eks.us-east-1.amazonaws.com/id/1EE94FBE2DE77558404404CF5947470C:aud": "sts.amazonaws.com",
                            "oidc.eks.us-east-1.amazonaws.com/id/1EE94FBE2DE77558404404CF5947470C:sub": "system:serviceaccount:flyte:flyte-backend-binary"
                        }
                    }
                }
            ]
        },
        "Description": "",
        "MaxSessionDuration": 3600,
        "Tags": [
            {
                "Key": "alpha.eksctl.io/cluster-name",
                "Value": "fthw-eks-cluster"
            },
            {
                "Key": "eksctl.cluster.k8s.io/v1alpha1/cluster-name",
                "Value": "fthw-eks-cluster"
            },
            {
                "Key": "alpha.eksctl.io/iamserviceaccount-name",
                "Value": "flyte/flyte-backend-binary"
            },
            {
                "Key": "alpha.eksctl.io/eksctl-version",
                "Value": "0.132.0-dev+15bffbb0d.2023-03-01T18:34:36Z"
            }
        ],
        "RoleLastUsed": {}
    }
}
bear in mind that it was created with `eksctl` following the instructions in the guide (which I just updated to improve the experience)
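If it helps with the IaC side, here is a minimal sketch (not taken from the guide) of how the trust policy in that role could be generated from its inputs. It only rebuilds the `AssumeRolePolicyDocument` part of the JSON above; the function name `irsa_trust_policy` and its parameters are hypothetical, and the example values are copied from the role output.

```python
import json


def irsa_trust_policy(account_id, region, oidc_id, namespace, service_account):
    """Build a trust policy that lets one Kubernetes service account assume the role.

    Hypothetical helper: it mirrors the AssumeRolePolicyDocument shown above,
    nothing more (no permissions policy, no tags).
    """
    provider = f"oidc.eks.{region}.amazonaws.com/id/{oidc_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Condition keys are the bare provider host/path, no scheme.
                        f"{provider}:aud": "sts.amazonaws.com",
                        f"{provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                    }
                },
            }
        ],
    }


# Example values taken from the role JSON above.
policy = irsa_trust_policy(
    account_id="590375264460",
    region="us-east-1",
    oidc_id="1EE94FBE2DE77558404404CF5947470C",
    namespace="flyte",
    service_account="flyte-backend-binary",
)
print(json.dumps(policy, indent=4))
```

The printed document can be diffed against the role you create, or passed as the assume-role policy string in whatever IaC tool you use.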
m
Thanks! I’m doing an Infrastructure-as-Code implementation of this, so I’m trying to get as many details down as possible.
a
oh please! you using Terraform or similar?
m
pulumi, since a lot of code is in golang
I need to clean up some things but I’ll put it up on github for you to access. I wouldn’t mind having another set of eyes on this
a
awesome! Thanks so much
m
No, thank you for helping so much. I’m still stuck trying to figure out the flyte-admin role stuff but I think I might get it to work soon
a
if it's of any help, take a look at this architecture diagram which I plan to contribute to the docs. It's been checked with maintainers and code 🙂 https://drive.google.com/file/d/1U4UnjzEV4LzqI_e2SOLJb8wzuzt4DxUE/view?usp=sharing
m
@average-finland-92144 - Here's the repo (https://github.com/rkrishnasanka/flyte-pulumi.git). The thing is still under development, I’m still figuring out:
1. how to pull the security group associated with the EKS cluster for the RDS
2. Role for the EKS IAM, OIDC provider and a bunch of other stuff that you used the Kubernetes command-line tools for. Basically 3, 4, 5
Any inputs you might have would be useful
@average-finland-92144 one question I have is which security group you’re pulling for the EKS cluster?
a
For the tutorial I'm just using the default one. Of course EKS will create and manage two other SGs for intra-node and node-to-control-plane communication
m
So which one would work for the RDS?
I would imagine the node-to-control-plane SG?
a
in this case, it would also be the `default` one, in the same VPC as the EKS cluster. The node-to-control-plane SG is only for traffic from the worker nodes to the API server
m
Will the control plane be the only one talking to the RDS ?
a
with `control plane` I mean EKS control plane 🙂 so in this case, `flyteadmin`, which will be a workload running on the worker nodes, is the one that needs communication with RDS
m
Gotcha, I think I have some of the Kubernetes fundamentals wrong. Okay, I’m testing out the RDS connection with the default security group now. One more clarification: would this RDS connection test you wrote work if I don’t create the `flyte-system-role`? It seemed like it should, since the flyte-system role, etc. were used for getting the S3 access
a
right, the RDS connection test doesn't require the IAM Role. It's needed when you have to write data to S3 (eg running a workflow)
m
@average-finland-92144 How's it going? So I was travelling most of last week and got to work on this again today: I created an instance of the RDS (not Aurora at the moment, to conserve resources). But it seems like the pods are unable to resolve the endpoint:
sql: could not translate host name "flyte-db5b39eef.cfk9fd2rmdtl.us-east-1.rds.amazonaws.com" to address: Temporary failure in name resolution
pod "pgsql-postgresql-client" deleted
pod testdb/pgsql-postgresql-client terminated (Error)
Any idea how to test whether name resolution is working on the EKS nodes?
a
Hey @miniature-plumber-7394 thanks for sharing and I'm glad you're keeping up the great work! It could be something on the coredns instance (assuming you're using coredns). You can follow the instructions here to troubleshoot DNS resolution: https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
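As a rough complement to that guide, here is a minimal sketch of a check that separates "the name does not resolve" from "the name resolves but the port is unreachable". It uses only the Python standard library; the function name `check_endpoint` is hypothetical, and it assumes you can run Python from inside a pod in the cluster (running it from a laptop only tests the laptop's DNS, not CoreDNS).

```python
import socket


def check_endpoint(host, port=5432, timeout=3.0):
    """Return (resolved_addresses, tcp_ok) for host:port.

    An empty address list means name resolution failed; addresses with
    tcp_ok == False points at routing or security group rules instead.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return [], False
    addresses = sorted({info[4][0] for info in infos})
    try:
        # Open and immediately close a TCP connection to the endpoint.
        with socket.create_connection((host, port), timeout=timeout):
            return addresses, True
    except OSError:
        return addresses, False


if __name__ == "__main__":
    # Endpoint copied from the error message above.
    host = "flyte-db5b39eef.cfk9fd2rmdtl.us-east-1.rds.amazonaws.com"
    addresses, tcp_ok = check_endpoint(host)
    print("resolved:", addresses or "nothing (DNS problem)")
    print("tcp 5432 reachable:", tcp_ok)
```

Port 5432 is the PostgreSQL default and is an assumption here; adjust it if the RDS instance listens elsewhere.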
m
Oh thanks for this. Yeah, Kubernetes + networking has been the most tedious part of trying to set this up tbh. Between VPC configuration issues and subnet unavailability, I think most of the issues have been around this.
a
I hear you, it can be troublesome
m
@average-finland-92144 so I ran:
kubectl get pods -n kube-system
NAME                       READY   STATUS    RESTARTS   AGE
aws-node-26snb             1/1     Running   0          134m
aws-node-b5vsr             1/1     Running   0          133m
aws-node-dx522             1/1     Running   0          133m
aws-node-ssgss             1/1     Running   0          133m
aws-node-xwp5n             1/1     Running   0          134m
coredns-7975d6fb9b-cnwqf   0/1     Pending   0          138m
coredns-7975d6fb9b-xw8qw   0/1     Pending   0          138m
kube-proxy-5v9cw           1/1     Running   0          134m
kube-proxy-9r8fv           1/1     Running   0          134m
kube-proxy-9z4c5           1/1     Running   0          133m
kube-proxy-cbfff           1/1     Running   0          133m
kube-proxy-gxhcj           1/1     Running   0          133m
It shows that my coredns pods are pending. Does that mean they didn’t start? Is there any way to force them to start?
a
can you run `kubectl describe core-dns-... -n kube-system` and see the `Events` section?
m
Odd, the command isn’t working. It’s giving the following error:
error: the server doesn't have a resource type "coredns-7975d6fb9b-cnwqf"
Apologies, I’m new to Kubernetes
a
sorry:
kubectl describe pod core-dns... -n kube-system
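If you'd rather look at every Pending pod at once, here is a minimal sketch that reads `kubectl get pods -o json` and prints the scheduler's message for each pod that has not been scheduled. The helper names are hypothetical; it relies on the standard pod fields `status.phase` and the `PodScheduled` condition, and assumes `kubectl` is on the PATH and pointed at the cluster.

```python
import json
import subprocess


def unscheduled_pods(pod_list):
    """Return [(pod_name, scheduler_message)] for pods whose phase is Pending."""
    results = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Pending":
            continue
        message = ""
        for cond in status.get("conditions", []):
            # PodScheduled=False carries the reason the scheduler could not place the pod.
            if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
                message = cond.get("message", "")
        results.append((pod["metadata"]["name"], message))
    return results


def fetch_pods(namespace="kube-system"):
    """Run kubectl and parse its JSON output (requires cluster access)."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)


if __name__ == "__main__":
    for name, message in unscheduled_pods(fetch_pods()):
        print(f"{name}: {message or 'no scheduling message recorded'}")
```

The message is the same text that shows up under `Events` in `kubectl describe pod`, so this is only a convenience for scanning several pods.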
m
@average-finland-92144 figured it out, my instances were too tiny so it didn’t get enough threads to launch coreDNS
Now I just have the RDS acting up and not responding
Not sure if it's a VPC thing or something else
a
can you share the result of running the db test container?
the DB should be in the same VPC and Security Group as your EKS cluster
m
Ah shit, I just tore it down. I’m guessing it's using the default security group for the VPC. It's in the same VPC though. I’ll launch it and send over the result.
a
hi @miniature-plumber-7394 have you been able to make progress on this?