Flyteconsole is now giving "502 Bad Gateway" error...
# ask-the-community
s
Flyteconsole is now giving "502 Bad Gateway" errors. How do I interrogate Flyte to find out what the problem is? And what are solutions to likely causes? I figure something that runs the process for that functionality is dead, and I don't know how to query whether it's dead, nor how to restart it. (FWIW, it's running on top of eks at AWS.) (Sorry for the naive question. My team has been using Flyte for the past year or so. Unfortunately the leading engineers who introduced Flyte have left our company, and I know only a little about it. This isn't a prod cluster, so it's not 100% important that I get this right.)
f
Can you pls connect to the k8s cluster and post the result of
kubectl -n flyte get pods
?
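(If nothing shows up there, Flyte may just be installed under a different namespace; something like
kubectl get pods --all-namespaces | grep -i flyte
should show where the Flyte pods actually live.)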
s
I get
No resources found in flyte namespace.
Though (again, I don't know much about these things) there's stuff that looks flyte-ish:
NAMESPACE     NAME                                 READY   STATUS    RESTARTS         AGE
<snipped irrelevant content>
great-falls   flyte-pod-webhook-67b567f698-pbkg7   1/1     Running   0                170d
great-falls   flyteadmin-6bfb478fdf-6hq8m          1/1     Running   0                176d
great-falls   flyteadmin-6bfb478fdf-fw4z8          1/1     Running   0                176d
great-falls   flyteconsole-66d7468bf9-99tkk        1/1     Running   27 (9d ago)      176d
great-falls   flyteconsole-66d7468bf9-lj82g        1/1     Running   24 (3d17h ago)   176d
great-falls   flytepropeller-c888bf854-zpcgx       1/1     Running   0                176d
great-falls   flytescheduler-58dcdf9b96-8t4wd      1/1     Running   0                170d
f
Ah ok, it's not running in the default (flyte) namespace then.
When you do
kubectl get pods --all-namespaces
is there anything that is not running? Like Completed, or Failed
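(If the full list is too long to scan, something like
kubectl get pods --all-namespaces --field-selector=status.phase!=Running
should show only the pods that aren't in the Running phase.)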
s
$ kubectl -n great-falls get pods
NAME                                 READY   STATUS    RESTARTS         AGE
datacatalog-c485d95d4-75vgn          1/1     Running   0                176d
datacatalog-c485d95d4-pbkfk          1/1     Running   0                176d
flyte-pod-webhook-67b567f698-pbkg7   1/1     Running   0                170d
flyteadmin-6bfb478fdf-6hq8m          1/1     Running   0                176d
flyteadmin-6bfb478fdf-fw4z8          1/1     Running   0                176d
flyteconsole-66d7468bf9-99tkk        1/1     Running   27 (9d ago)      176d
flyteconsole-66d7468bf9-lj82g        1/1     Running   24 (3d17h ago)   176d
flytepropeller-c888bf854-zpcgx       1/1     Running   0                176d
flytescheduler-58dcdf9b96-8t4wd      1/1     Running   0                170d
syncresources-5d9df9b699-gjdqx       1/1     Running   0                170d
f
or CrashLoopBackOff?
Some restarts on flyteconsole, but they're a few days old and it seems to be running now
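If you want to see why those flyteconsole pods restarted, the previous container's logs and the pod events should say, e.g. (using one of the pod names from your list):
kubectl -n great-falls logs flyteconsole-66d7468bf9-99tkk --previous
kubectl -n great-falls describe pod flyteconsole-66d7468bf9-99tkk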
s
Everything is Running, except there's some prometheus stuff that is Pending (which doesn't seem relevant).
No, I'm still getting 502 Bad Gateway. Was working yesterday.
f
Are there any nginx pods in your namespaces?
Or do you know how flyte console is exposed?
Does the error when accessing flyte console through the browser say anything about nginx?
s
I don't see anything (in the output of get pods, or in the error message, or in the local documentation at my company) about nginx.
f
any other ingress? How do you access flyte console?
Is it exposed to the public internet? Or only on a local network?
s
We use VPN.
Under kubectl describe ep, I see some things which are a little suspicious:
Annotations: endpoints.kubernetes.io/last-change-trigger-time: 2023-07-27T17:52:46Z
and
Subsets:
 Addresses:     <none>
 NotReadyAddresses: 5.0.12.118
 Ports:
  Name Port Protocol
  ---- ---- --------
  http 7979 TCP
Those are both under
Name:     external-dns
f
What’s the name of the corresponding service?
s
Not sure. Only thing I see associated is external-dns. Is there a command I should type to get that?
f
Mh I still don’t fully understand what your deployment looks like.
What is the output of
kubectl get service -n great-falls
?
Or also in all namespaces?
There must be some kind of ingress or reverse proxy I’d say.
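(On the external-dns thing you spotted: an endpoint with only NotReadyAddresses usually means the pod behind that service is failing its readiness probe. Something like
kubectl describe service external-dns
kubectl get pods -o wide | grep -i external-dns
should show which pod that is. A broken external-dns would stop DNS records from being updated rather than return a 502 itself, so it may be a side issue.)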
s
$ kubectl get service -n great-falls
NAME                TYPE           CLUSTER-IP       EXTERNAL-IP                                                                PORT(S)                                                   AGE
datacatalog         LoadBalancer   10.100.244.46    a2491ed93a79c4e4c9c3659352ba7ad5-1661450188.us-east-1.elb.amazonaws.com   8089:32312/TCP,88:31369/TCP,89:32734/TCP                  176d
flyte-pod-webhook   ClusterIP      10.100.53.162    <none>                                                                     443/TCP                                                   176d
flyteadmin          LoadBalancer   10.100.205.238   ab6eeae927cb148ea831f9a0482954f0-911285910.us-east-1.elb.amazonaws.com    80:31117/TCP,81:30601/TCP,87:30822/TCP,10254:31187/TCP    176d
flyteconsole        LoadBalancer   10.100.150.24    a1f320958d4844bd081a46e0c3fd4485-1851160411.us-east-1.elb.amazonaws.com   80:30722/TCP                                              176d
$ kubectl get service
NAME      TYPE    CLUSTER-IP    EXTERNAL-IP  PORT(S)  AGE
external-dns  ClusterIP  10.100.185.214  <none>    7979/TCP  177d
kubernetes   ClusterIP  10.100.0.1    <none>    443/TCP  177d
f
(Are these IPs all reachable only from within the VPN? Otherwise maybe redact them)
s
$ kubectl get service -A | grep -v prometheus
NAMESPACE        NAME                         TYPE      CLUSTER-IP    EXTERNAL-IP                                PORT(S)                         AGE
actions-runner-system  actions-runner-controller-metrics-service      ClusterIP   10.100.186.11  <none>                                  8443/TCP                         136d
actions-runner-system  actions-runner-controller-webhook          ClusterIP   10.100.193.93  <none>                                  443/TCP                         136d
cert-manager      cert-manager                     ClusterIP   10.100.159.173  <none>                                  9402/TCP                         136d
cert-manager      cert-manager-webhook                 ClusterIP   10.100.183.161  <none>                                  443/TCP                         136d
default         external-dns                     ClusterIP   10.100.185.214  <none>                                  7979/TCP                         177d
default         kubernetes                      ClusterIP   10.100.0.1    <none>                                  443/TCP                         177d
great-falls       datacatalog                     LoadBalancer  10.100.244.46  a2491ed93a79c4e4c9c3659352ba7ad5-1661450188.us-east-1.elb.amazonaws.com   8089:32312/TCP,88:31369/TCP,89:32734/TCP         176d
great-falls       flyte-pod-webhook                  ClusterIP   10.100.53.162  <none>                                  443/TCP                         176d
great-falls       flyteadmin                      LoadBalancer  10.100.205.238  ab6eeae927cb148ea831f9a0482954f0-911285910.us-east-1.elb.amazonaws.com    80:31117/TCP,81:30601/TCP,87:30822/TCP,10254:31187/TCP  176d
great-falls       flyteconsole                     LoadBalancer  10.100.150.24  a1f320958d4844bd081a46e0c3fd4485-1851160411.us-east-1.elb.amazonaws.com   80:30722/TCP                       176d
kube-system       aws-load-balancer-webhook-service          ClusterIP   10.100.154.57  <none>                                  443/TCP                         177d
kube-system       kube-dns                       ClusterIP   10.100.0.10   <none>                                  53/UDP,53/TCP                      177d
monitoring       alertmanager-operated                ClusterIP   None       <none>                                  9093/TCP,9094/TCP,9094/UDP                177d
I assume nothing is reachable unless I'm in the VPN, but I don't really know.
f
a1f320958d4844bd081a46xxxxxx485-1851160411.us-east-1.elb.amazonaws.com/console
is how you would try to reach the console?
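If you can, a quick check from inside the VPN like
curl -sv http://<the-console-address-you-use>/console -o /dev/null
would at least show which layer is returning the 502 (the hostname is a placeholder for whatever address you normally put in the browser).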
s
f
and there is a DNS record somewhere configured in AWS that maps this domain to
a1f320958d4844bd081a46e0c3fd4485-1851160411.us-east-1.elb.amazonaws.com?
kubectl get ingress --all-namespaces
Does this give anything?
s
$ kubectl get ingress --all-namespaces
NAMESPACE     NAME                       CLASS    HOSTS                                   ADDRESS                                                                             PORTS   AGE
great-falls   flyte-core                 <none>   *                                       internal-k8s-flyte-4805ca007d-2146548039.us-east-1.elb.amazonaws.com               80      176d
great-falls   flyte-core-grpc            <none>   *                                       internal-k8s-flyte-4805ca007d-2146548039.us-east-1.elb.amazonaws.com               80      176d
monitoring    prometheus-stack-grafana   <none>   grafana-great-falls.dev.embarkvet.com   internal-k8s-monitori-promethe-900828f478-1937690874.us-east-1.elb.amazonaws.com   80      177d
f
Ok, and
kubectl -n great-falls describe ingress flyte-core-(grpc)
?
f
Makes sense
that is the external address of the ingress
s
$ kubectl -n great-falls describe ingress flyte-core
Name:       flyte-core
Labels:      app.kubernetes.io/managed-by=Helm
Namespace:    great-falls
Address:     internal-k8s-flyte-4805ca007d-2146548039.us-east-1.elb.amazonaws.com
Ingress Class:  <none>
Default backend: <default>
Rules:
 Host    Path Backends
 ----    ---- --------
 *      
       /*        ssl-redirect:use-annotation (<error: endpoints "ssl-redirect" not found>)
       /console     flyteconsole:80 ()
       /console/*    flyteconsole:80 ()
       /api       flyteadmin:80 ()
       /api/*      flyteadmin:80 ()
       /healthcheck   flyteadmin:80 ()
       /v1/*      flyteadmin:80 ()
       /.well-known   flyteadmin:80 ()
       /.well-known/*  flyteadmin:80 ()
       /login      flyteadmin:80 ()
       /login/*     flyteadmin:80 ()
       /logout     flyteadmin:80 ()
       /logout/*    flyteadmin:80 ()
       /callback    flyteadmin:80 ()
       /callback/*   flyteadmin:80 ()
       /me       flyteadmin:80 ()
       /config     flyteadmin:80 ()
       /config/*    flyteadmin:80 ()
       /oauth2     flyteadmin:80 ()
       /oauth2/*    flyteadmin:80 ()
Annotations:  alb.ingress.kubernetes.io/actions.ssl-redirect:
                {"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}
              alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:11111111111111:certificate/[UUID-ish ID]
              alb.ingress.kubernetes.io/group.name: flyte
              alb.ingress.kubernetes.io/listen-ports: [{"HTTP": 80}, {"HTTPS":443}]
              alb.ingress.kubernetes.io/scheme: internal
              alb.ingress.kubernetes.io/tags: service_instance=production
              external-dns.alpha.kubernetes.io/hostname: somethingsomething-great-falls.dev.companyname.com
              kubernetes.io/ingress.class: alb
              meta.helm.sh/release-name: great-falls
              meta.helm.sh/release-namespace: great-falls
              nginx.ingress.kubernetes.io/app-root: /console
Events:    <none>
$ kubectl -n great-falls describe ingress flyte-core-grpc
Name:       flyte-core-grpc
Labels:      app.kubernetes.io/managed-by=Helm
Namespace:    great-falls
Address:     internal-k8s-flyte-4805ca007d-2146548039.us-east-1.elb.amazonaws.com
Ingress Class:  <none>
Default backend: <default>
Rules:
 Host    Path Backends
 ----    ---- --------
 *      
       /flyteidl.service.AdminService      flyteadmin:81 ()
       /flyteidl.service.AdminService/*     flyteadmin:81 ()
       /flyteidl.service.DataProxyService    flyteadmin:81 ()
       /flyteidl.service.DataProxyService/*   flyteadmin:81 ()
       /flyteidl.service.AuthMetadataService   flyteadmin:81 ()
       /flyteidl.service.AuthMetadataService/*  flyteadmin:81 ()
       /flyteidl.service.IdentityService     flyteadmin:81 ()
       /flyteidl.service.IdentityService/*    flyteadmin:81 ()
       /grpc.health.v1.Health          flyteadmin:81 ()
       /grpc.health.v1.Health/*         flyteadmin:81 ()
Annotations:  alb.ingress.kubernetes.io/actions.ssl-redirect:
                {"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}
              alb.ingress.kubernetes.io/backend-protocol-version: HTTP2
              alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:11111111111:certificate/[UUID-ish ID]
              alb.ingress.kubernetes.io/group.name: flyte
              alb.ingress.kubernetes.io/listen-ports: [{"HTTP": 80}, {"HTTPS":443}]
              alb.ingress.kubernetes.io/scheme: internal
              alb.ingress.kubernetes.io/tags: service_instance=production
              external-dns.alpha.kubernetes.io/hostname: somethingsomething-great-falls.dev.companyname.com
              kubernetes.io/ingress.class: alb
              meta.helm.sh/release-name: great-falls
              meta.helm.sh/release-namespace: great-falls
              nginx.ingress.kubernetes.io/app-root: /console
              nginx.ingress.kubernetes.io/backend-protocol: GRPC
Events:    <none>
f
I’ve never done this with AWS but in GCP there is a page in the UI where one can see the health of the ingress
Can you pls check whether this is the case in AWS?
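I've never done it myself, but I believe the AWS CLI can show the target health for the ALB behind that ingress, roughly along these lines (the ARNs in angle brackets are placeholders you'd fill in from the previous command's output):
aws elbv2 describe-load-balancers --query "LoadBalancers[?contains(DNSName, 'internal-k8s-flyte')].LoadBalancerArn"
aws elbv2 describe-target-groups --load-balancer-arn <load-balancer-arn>
aws elbv2 describe-target-health --target-group-arn <target-group-arn>
If the targets show up as unhealthy there, that would explain the ALB answering with 502.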
s
I'll have to look at the docs. Might be away from keyboard for a while for other reasons. It'd be great if you checked this thread later today, but if you're too busy I understand (since you're doing this for free!).
f
I’ll try to check tonight 🙂 5:30 pm here
s
Unfortunately I've done some looking and cannot find anything more about ingress health.
f
I unfortunately also don’t know where this is found in AWS. The fact that it shows
Events:    <none>
would, at least on GCP, be a sign that the ingress is not doing so well. But I can’t tell on AWS, sorry.
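One other thing that might be worth a look: your kube-system namespace has the AWS load balancer webhook service, so the controller that manages the ALB is probably running there too, and its logs might say if it's failing to register targets. Something like
kubectl -n kube-system get deploy | grep -i load-balancer
kubectl -n kube-system logs deploy/aws-load-balancer-controller --tail=100
(the deployment name is a guess based on the default install).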
s
My colleagues and I looked at it more. It appears the pods were in the Ready state, but the nodes were in Not Ready. I would have thought that can't happen, though I did find a github issue (for k8s) where it happens, at least for older versions of k8s (and we're using an outdated version). So now we're cordoning and draining the nodes.
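(For anyone reading this later, that's roughly:
kubectl get nodes
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
for each NotReady node; on older kubectl versions the last flag may be --delete-local-data.)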
We figured out what happened. In an attempt to give me access to read the cluster state (using k9s, kubectl, etc), we added an identity mapping with me as the user, which shadowed the system user, which basically halted all communication within the network. Whoops.
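(For posterity: the identity mappings live in the aws-auth ConfigMap, so
kubectl -n kube-system get configmap aws-auth -o yaml
or, if eksctl is set up,
eksctl get iamidentitymapping --cluster <cluster-name>
shows what's configured; the bad entry was the one added for my user, which shadowed the system user.)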
f
Good to hear you figured it out 🙂
s
Sorry to have wasted so much of your time... 😢
f
No worries 🙂