# flyte-deployment
e
Hi folks - is there a quick sanity check that can be used to verify that a control plane and data plane are configured properly and talking to one another? (Without actually scheduling a workflow to run that is)
d
@Ethan Brown I usually check the status of the `sync-resources` Pod. If, after configuring everything, that Pod is `Running`, that's a good sign
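A rough sketch of what I run (the exact pod name and namespace depend on your Helm release and chart version):
```bash
# On the control plane cluster: look for the resource-sync pod and check it's Running
kubectl -n flyte get pods | grep -i sync

# Tail its logs to confirm it can reach the data plane clusters
kubectl -n flyte logs <sync-resources-pod-name> --tail=50
```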
e
I'm digging in a little bit further now and I actually see some rpc errors getting logged in `flytepropeller` inside the data plane. Jobs look to be sent over to the data plane properly, but it looks to me like traffic isn't properly flowing out of that cluster over GRPC back to the control plane. Double-checking my ingress definition now (and my ingress logs)
```json
{"json":{"exec_id":"f194818b7dbea4957858","ns":"flytesnacks-development","res_ver":"10430815","routine":"worker-1","wf":"flytesnacks:development:.flytegen.basic-task.slope"},"level":"warning","msg":"Event recording failed. Error [EventSinkError: Error sending event, caused by [rpc error: code = Unavailable desc = connection error: desc = \"error reading server preface: http2: frame too large\"]]","ts":"2023-11-29T15:41:06Z"}

{"json":{"exec_id":"f194818b7dbea4957858","ns":"flytesnacks-development","res_ver":"10430815","routine":"worker-1","wf":"flytesnacks:development:.flytegen.basic-task.slope"},"level":"error","msg":"Error when trying to reconcile workflow. Error [[]]. Error Type[*errors.WorkflowErrorWithCause]","ts":"2023-11-29T15:41:06Z"}
```
d
what Ingress controller are you using?
e
nginx-ingress (unfortunately ;))
d
on EKS?
e
Yup!
d
from a reference implementation, these are the annotations used:
```yaml
common:
  ingress:
    host: "{{ .Values.userSettings.hostName }}"
    tls:
      enabled: true
      secretName: flyte-secret-tls
    annotations:
      kubernetes.io/ingress.class: nginx
      ingress.kubernetes.io/rewrite-target: /
      nginx.ingress.kubernetes.io/ssl-redirect: "true"
      cert-manager.io/issuer: "letsencrypt-production"
      acme.cert-manager.io/http01-edit-in-place: "true"
    # --- separateGrpcIngress puts GRPC routes into a separate ingress if true. Required for certain ingress controllers like nginx.
    separateGrpcIngress: true
    # --- Extra Ingress annotations applied only to the GRPC ingress. Only makes sense if `separateGrpcIngress` is enabled.
    separateGrpcIngressAnnotations:
      nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
```
You can remove the cert-manager related content
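A quick way to sanity-check that the chart actually rendered two Ingress objects (one HTTP, one gRPC) and that the gRPC one carries the backend-protocol annotation; this assumes everything is in the `flyte` namespace:
```bash
# With separateGrpcIngress: true there should be two Ingress objects
kubectl -n flyte get ingress

# The gRPC one should carry the GRPC backend-protocol annotation
kubectl -n flyte describe ingress | grep -i backend-protocol
```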
e
Yeah, I should have mentioned I'm already using the separate ingress definition with the GRPC annotation
Just for sanity, the grpc endpoint config for the remote data plane should be like this, yeah?
```yaml
configmap:
  admin:
    admin:
      endpoint: publichost.domain.com:443
      insecure: false
  catalog:
    catalog-cache:
      endpoint: publichost.domain.com:443
      insecure: false
```
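One way I'm double-checking that propeller actually picked those values up (the configmap name here is an assumption based on a default flyte-core install, so it may differ):
```bash
# On the data plane cluster: confirm the rendered propeller config points at the public endpoint
kubectl -n flyte get configmap flyte-propeller-config -o yaml | grep -B 2 -A 2 "publichost.domain.com"
```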
Are there particular services that I could call using `grpcurl` from outside the ingress to validate things?
d
there should be a flyteadmin Service exposing the 8089 port if I remember correctly
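Something along these lines should confirm it (assuming the default `flyte` namespace; on flyte-core the Service usually exposes the gRPC port as 81, targeting 8089 on the pod):
```bash
# Inspect the flyteadmin Service and its port mappings
kubectl -n flyte get svc flyteadmin -o yaml | grep -A 10 "ports:"
```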
e
From inside the cluster, an easy way to test connectivity to the service (over plaintext) is:
`grpcurl -plaintext flyteadmin.flyte:81 list`
Then it's easy to move outward from there to find the point at which the service becomes unreachable.
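And from outside the ingress over TLS, something along these lines (assuming server reflection is reachable; if auth is enabled you may get an authentication error instead of a transport error, which still proves the gRPC path works):
```bash
# From outside the cluster, through the ingress (TLS on 443)
grpcurl publichost.domain.com:443 list

# Or call a specific admin RPC; a transport-level failure here points back at the ingress
grpcurl publichost.domain.com:443 flyteidl.service.AdminService/ListProjects
```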