Flyteconsole is now giving "502 Bad Gateway" error...
# ask-the-community
s
Flyteconsole is now giving "502 Bad Gateway" errors. How do I interrogate Flyte to find out what the problem is? And what are solutions to likely causes? I figure something that runs the process for that functionality is dead, and I don't know how to query whether it's dead, nor how to restart it. (FWIW, it's running on top of eks at AWS.) (Sorry for the naive question. My team has been using Flyte for the past year or so. Unfortunately the leading engineers who introduced Flyte have left our company, and I know only a little about it. This isn't a prod cluster, so it's not 100% important that I get this right.)
f
Can you pls connect to the k8s cluster and post the result of
kubectl -n flyte get pods
?
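(If nothing shows up there, Flyte may just be installed under a different namespace; something like
kubectl get pods --all-namespaces | grep -i flyte
should show where the Flyte pods actually live.)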
s
I get
No resources found in flyte namespace.
Though (again, I don't know much about these things) there's stuff that looks flyte-ish:
NAMESPACE     NAME                                 READY   STATUS    RESTARTS         AGE
<snipped irrelevant content>
great-falls   flyte-pod-webhook-67b567f698-pbkg7   1/1     Running   0                170d
great-falls   flyteadmin-6bfb478fdf-6hq8m          1/1     Running   0                176d
great-falls   flyteadmin-6bfb478fdf-fw4z8          1/1     Running   0                176d
great-falls   flyteconsole-66d7468bf9-99tkk        1/1     Running   27 (9d ago)      176d
great-falls   flyteconsole-66d7468bf9-lj82g        1/1     Running   24 (3d17h ago)   176d
great-falls   flytepropeller-c888bf854-zpcgx       1/1     Running   0                176d
great-falls   flytescheduler-58dcdf9b96-8t4wd      1/1     Running   0                170d
f
Ah ok, it's not running in the default (flyte) namespace then.
When you do
kubectl get pods --all-namespaces
is there anything that is not running? Like Completed, or Failed
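(If the full list is too long to scan, something like
kubectl get pods --all-namespaces --field-selector=status.phase!=Running
should show only the pods that aren't in the Running phase.)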
s
$ kubectl -n great-falls get pods
NAME                                 READY   STATUS    RESTARTS         AGE
datacatalog-c485d95d4-75vgn          1/1     Running   0                176d
datacatalog-c485d95d4-pbkfk          1/1     Running   0                176d
flyte-pod-webhook-67b567f698-pbkg7   1/1     Running   0                170d
flyteadmin-6bfb478fdf-6hq8m          1/1     Running   0                176d
flyteadmin-6bfb478fdf-fw4z8          1/1     Running   0                176d
flyteconsole-66d7468bf9-99tkk        1/1     Running   27 (9d ago)      176d
flyteconsole-66d7468bf9-lj82g        1/1     Running   24 (3d17h ago)   176d
flytepropeller-c888bf854-zpcgx       1/1     Running   0                176d
flytescheduler-58dcdf9b96-8t4wd      1/1     Running   0                170d
syncresources-5d9df9b699-gjdqx       1/1     Running   0                170d
f
or CrashLoopBackOff?
Some restarts on flyteconsole, but they're a few days old and it seems to be running now
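If you want to see why those flyteconsole pods restarted, the previous container's logs and the pod events should say, e.g. (using one of the pod names from your list):
kubectl -n great-falls logs flyteconsole-66d7468bf9-99tkk --previous
kubectl -n great-falls describe pod flyteconsole-66d7468bf9-99tkk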
s
Everything is Running, except there's some prometheus stuff that is Pending (which doesn't seem relevant).
No, I'm still getting 502 Bad Gateway. Was working yesterday.
f
Are there any nginx pods in your namespaces?
Or do you know how flyte console is exposed?
Does the error when accessing flyte console through the browser say anything about nginx?
s
I don't see anything (in the output of get pods, or in the error message, or in the local documentation at my company) about nginx.
f
any other ingress? How do you access flyte console?
Is it exposed to the public internet? Or only on a local network?
s
We use VPN.
Under kubectl describe ep, I see some things which are a little suspicious:
Annotations: endpoints.kubernetes.io/last-change-trigger-time: 2023-07-27T17:52:46Z
and
Subsets:
 Addresses:     <none>
 NotReadyAddresses: 5.0.12.118
 Ports:
  Name Port Protocol
  ---- ---- --------
  http 7979 TCP
Those are both under
Name:     external-dns
f
What’s the name of the corresponding service?
s
Not sure. Only thing I see associated is external-dns. Is there a command I should type to get that?
f
Mh I still don’t fully understand what your deployment looks like.
What is the output of
kubectl get service -n great-falls
?
Or also in all namespaces?
There must be some kind of ingress or reverse proxy I’d say.
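(On the external-dns thing you spotted: an endpoint with only NotReadyAddresses usually means the pod behind that service is failing its readiness probe. Something like
kubectl describe service external-dns
kubectl get pods -o wide | grep -i external-dns
should show which pod that is. A broken external-dns would stop DNS records from being updated rather than return a 502 itself, so it may be a side issue.)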
s
$ kubectl get service -n great-falls
NAME                TYPE           CLUSTER-IP       EXTERNAL-IP                                                                PORT(S)                                                   AGE
datacatalog         LoadBalancer   10.100.244.46    a2491ed93a79c4e4c9c3659352ba7ad5-1661450188.us-east-1.elb.amazonaws.com   8089:32312/TCP,88:31369/TCP,89:32734/TCP                  176d
flyte-pod-webhook   ClusterIP      10.100.53.162    <none>                                                                     443/TCP                                                   176d
flyteadmin          LoadBalancer   10.100.205.238   ab6eeae927cb148ea831f9a0482954f0-911285910.us-east-1.elb.amazonaws.com    80:31117/TCP,81:30601/TCP,87:30822/TCP,10254:31187/TCP    176d
flyteconsole        LoadBalancer   10.100.150.24    a1f320958d4844bd081a46e0c3fd4485-1851160411.us-east-1.elb.amazonaws.com   80:30722/TCP                                              176d
$ kubectl get service
NAME      TYPE    CLUSTER-IP    EXTERNAL-IP  PORT(S)  AGE
external-dns  ClusterIP  10.100.185.214  <none>    7979/TCP  177d
kubernetes   ClusterIP  10.100.0.1    <none>    443/TCP  177d
f
(Are these IPs all reachable only from within the VPN? Otherwise maybe redact them)
s
$ kubectl get service -A | grep -v prometheus
NAMESPACE        NAME                         TYPE      CLUSTER-IP    EXTERNAL-IP                                PORT(S)                         AGE
actions-runner-system  actions-runner-controller-metrics-service      ClusterIP   10.100.186.11  <none>                                  8443/TCP                         136d
actions-runner-system  actions-runner-controller-webhook          ClusterIP   10.100.193.93  <none>                                  443/TCP                         136d
cert-manager      cert-manager                     ClusterIP   10.100.159.173  <none>                                  9402/TCP                         136d
cert-manager      cert-manager-webhook                 ClusterIP   10.100.183.161  <none>                                  443/TCP                         136d
default         external-dns                     ClusterIP   10.100.185.214  <none>                                  7979/TCP                         177d
default         kubernetes                      ClusterIP   10.100.0.1    <none>                                  443/TCP                         177d
great-falls       datacatalog                     LoadBalancer  10.100.244.46  a2491ed93a79c4e4c9c3659352ba7ad5-1661450188.us-east-1.elb.amazonaws.com   8089:32312/TCP,88:31369/TCP,89:32734/TCP         176d
great-falls       flyte-pod-webhook                  ClusterIP   10.100.53.162  <none>                                  443/TCP                         176d
great-falls       flyteadmin                      LoadBalancer  10.100.205.238  ab6eeae927cb148ea831f9a0482954f0-911285910.us-east-1.elb.amazonaws.com    80:31117/TCP,81:30601/TCP,87:30822/TCP,10254:31187/TCP  176d
great-falls       flyteconsole                     LoadBalancer  10.100.150.24  a1f320958d4844bd081a46e0c3fd4485-1851160411.us-east-1.elb.amazonaws.com   80:30722/TCP                       176d
kube-system       aws-load-balancer-webhook-service          ClusterIP   10.100.154.57  <none>                                  443/TCP                         177d
kube-system       kube-dns                       ClusterIP   10.100.0.10   <none>                                  53/UDP,53/TCP                      177d
monitoring       alertmanager-operated                ClusterIP   None       <none>                                  9093/TCP,9094/TCP,9094/UDP                177d
I assume nothing is reachable unless I'm in the VPN, but I don't really know.
f
a1f320958d4844bd081a46xxxxxx485-1851160411.us-east-1.elb.amazonaws.com/console
is how you would try to reach the console?
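If you can, a quick check from inside the VPN like
curl -sv http://<the-console-address-you-use>/console -o /dev/null
would at least show which layer is returning the 502 (the hostname is a placeholder for whatever address you normally put in the browser).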
s
f
and there is a DNS record somewhere configured in AWS that maps this domain to
a1f320958d4844bd081a46e0c3fd4485-1851160411.us-east-1.elb.amazonaws.com?
kubectl get ingress --all-namespaces
Does this give anything?
s
$ kubectl get ingress --all-namespaces
NAMESPACE     NAME                       CLASS    HOSTS                                   ADDRESS                                                                             PORTS   AGE
great-falls   flyte-core                 <none>   *                                       internal-k8s-flyte-4805ca007d-2146548039.us-east-1.elb.amazonaws.com               80      176d
great-falls   flyte-core-grpc            <none>   *                                       internal-k8s-flyte-4805ca007d-2146548039.us-east-1.elb.amazonaws.com               80      176d
monitoring    prometheus-stack-grafana   <none>   grafana-great-falls.dev.embarkvet.com   internal-k8s-monitori-promethe-900828f478-1937690874.us-east-1.elb.amazonaws.com   80      177d
f
Ok, and
kubectl -n great-falls describe ingress flyte-core-(grpc)
?
f
Makes sense
that is the external address of the ingress
s
$ kubectl -n great-falls describe ingress flyte-core
Name:       flyte-core
Labels:      app.kubernetes.io/managed-by=Helm
Namespace:    great-falls
Address:     internal-k8s-flyte-4805ca007d-2146548039.us-east-1.elb.amazonaws.com
Ingress Class:  <none>
Default backend: <default>
Rules:
 Host    Path Backends
 ----    ---- --------
 *      
       /*        ssl-redirect:use-annotation (<error: endpoints "ssl-redirect" not found>)
       /console     flyteconsole:80 ()
       /console/*    flyteconsole:80 ()
       /api       flyteadmin:80 ()
       /api/*      flyteadmin:80 ()
       /healthcheck   flyteadmin:80 ()
       /v1/*      flyteadmin:80 ()
       /.well-known   flyteadmin:80 ()
       /.well-known/*  flyteadmin:80 ()
       /login      flyteadmin:80 ()
       /login/*     flyteadmin:80 ()
       /logout     flyteadmin:80 ()
       /logout/*    flyteadmin:80 ()
       /callback    flyteadmin:80 ()
       /callback/*   flyteadmin:80 ()
       /me       flyteadmin:80 ()
       /config     flyteadmin:80 ()
       /config/*    flyteadmin:80 ()
       /oauth2     flyteadmin:80 ()
       /oauth2/*    flyteadmin:80 ()
Annotations:  alb.ingress.kubernetes.io/actions.ssl-redirect:
                {"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}
              alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:11111111111111:certificate/[UUID-ish ID]
              alb.ingress.kubernetes.io/group.name: flyte
              alb.ingress.kubernetes.io/listen-ports: [{"HTTP": 80}, {"HTTPS":443}]
              alb.ingress.kubernetes.io/scheme: internal
              alb.ingress.kubernetes.io/tags: service_instance=production
              external-dns.alpha.kubernetes.io/hostname: somethingsomething-great-falls.dev.companyname.com
              kubernetes.io/ingress.class: alb
              meta.helm.sh/release-name: great-falls
              meta.helm.sh/release-namespace: great-falls
              nginx.ingress.kubernetes.io/app-root: /console
Events:    <none>
$ kubectl -n great-falls describe ingress flyte-core-grpc
Name:       flyte-core-grpc
Labels:      app.kubernetes.io/managed-by=Helm
Namespace:    great-falls
Address:     internal-k8s-flyte-4805ca007d-2146548039.us-east-1.elb.amazonaws.com
Ingress Class:  <none>
Default backend: <default>
Rules:
 Host    Path Backends
 ----    ---- --------
 *      
       /flyteidl.service.AdminService      flyteadmin:81 ()
       /flyteidl.service.AdminService/*     flyteadmin:81 ()
       /flyteidl.service.DataProxyService    flyteadmin:81 ()
       /flyteidl.service.DataProxyService/*   flyteadmin:81 ()
       /flyteidl.service.AuthMetadataService   flyteadmin:81 ()
       /flyteidl.service.AuthMetadataService/*  flyteadmin:81 ()
       /flyteidl.service.IdentityService     flyteadmin:81 ()
       /flyteidl.service.IdentityService/*    flyteadmin:81 ()
       /grpc.health.v1.Health          flyteadmin:81 ()
       /grpc.health.v1.Health/*         flyteadmin:81 ()
Annotations:  alb.ingress.kubernetes.io/actions.ssl-redirect:
                {"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}
              alb.ingress.kubernetes.io/backend-protocol-version: HTTP2
              alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:11111111111:certificate/[UUID-ish ID]
              alb.ingress.kubernetes.io/group.name: flyte
              alb.ingress.kubernetes.io/listen-ports: [{"HTTP": 80}, {"HTTPS":443}]
              alb.ingress.kubernetes.io/scheme: internal
              alb.ingress.kubernetes.io/tags: service_instance=production
              external-dns.alpha.kubernetes.io/hostname: somethingsomething-great-falls.dev.companyname.com
              kubernetes.io/ingress.class: alb
              meta.helm.sh/release-name: great-falls
              meta.helm.sh/release-namespace: great-falls
              nginx.ingress.kubernetes.io/app-root: /console
              nginx.ingress.kubernetes.io/backend-protocol: GRPC
Events:    <none>
f
I’ve never done this with AWS but in GCP there is a page in the UI where one can see the health of the ingress
Can you pls check whether this is the case in AWS?
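I've never done it myself, but I believe the AWS CLI can show the target health for the ALB behind that ingress, roughly along these lines (the ARNs in angle brackets are placeholders you'd fill in from the previous command's output):
aws elbv2 describe-load-balancers --query "LoadBalancers[?contains(DNSName, 'internal-k8s-flyte')].LoadBalancerArn"
aws elbv2 describe-target-groups --load-balancer-arn <load-balancer-arn>
aws elbv2 describe-target-health --target-group-arn <target-group-arn>
If the targets show up as unhealthy there, that would explain the ALB answering with 502.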
s
I'll have to look at the docs. Might be away from keyboard for a while for other reasons. It'd be great if you checked this thread later today, but if you're too busy I understand (since you're doing this for free!).
f
I’ll try to check tonight 🙂 5:30 pm here
s
Unfortunately I've done some looking and cannot find anything more about ingress health.
f
I unfortunately also don’t know where this is found in AWS. The fact that it shows
Events:    <none>
would, at least on GCP, be a sign that the ingress is not doing so well. But I can’t tell on AWS, sorry.
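One other thing that might be worth a look: your kube-system namespace has the AWS load balancer webhook service, so the controller that manages the ALB is probably running there too, and its logs might say if it's failing to register targets. Something like
kubectl -n kube-system get deploy | grep -i load-balancer
kubectl -n kube-system logs deploy/aws-load-balancer-controller --tail=100
(the deployment name is a guess based on the default install).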
s
My colleagues and I looked at it more. It appears the pods were in the Ready state, but the nodes were in Not Ready. I would have thought that can't happen, though I did find a github issue (for k8s) where it happens, at least for older versions of k8s (and we're using an outdated version). So now we're cordoning and draining the nodes.
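(For anyone reading this later, that's roughly:
kubectl get nodes
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
for each NotReady node; on older kubectl versions the last flag may be --delete-local-data.)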
We figured out what happened. In an attempt to give me access to read the cluster state (using k9s, kubectl, etc), we added an identity mapping with me as the user, which shadowed the system user, which basically halted all communication within the network. Whoops.
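(For posterity: the identity mappings live in the aws-auth ConfigMap, so
kubectl -n kube-system get configmap aws-auth -o yaml
or, if eksctl is set up,
eksctl get iamidentitymapping --cluster <cluster-name>
shows what's configured; the bad entry was the one added for my user, which shadowed the system user.)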
f
Good to hear you figured it out 🙂
s
Sorry to have wasted so much of your time... 😢
f
No worries 🙂