Does anyone have experience with TLS/SSL between t...
# ask-the-community
f
Does anyone have experience with TLS/SSL between the Ingress and Flyte? 🔒 🔒 I’m currently trying to get a reference implementation to work for deploying Flyte on GCP with the GKE Ingress controller (instead of nginx), which would allow us to use Google Identity Aware Proxy (IAP) for authentication. IAP requires the user to log in “at the load balancer”, before a request reaches the backend, instead of being redirected to the login screen by the backend. Because this is more secure, our company requires IAP (as did my previous one). To make gRPC backends work with the GKE Ingress, HTTP2 needs to be enabled for the backend. Flyte already does so in the gcp helm values. Using HTTP2 between the load balancer and the backend requires TLS between lb and backend. So I’m trying to deploy Flyte with this config:
configmap:
  adminServer:
    server:
      httpPort: 8088
      grpcPort: 8089
      security:
        secure: false #true
        ssl:
          certificateFile: "/etc/tls/cert.pem"
          keyFile: "/etc/tls/key.pem"
Certificate is self-signed and mounted into the flyteadmin pod via a secret. I can’t talk to flyteadmin though and would appreciate help getting this to work 🙂 Details in 🧵
• I got a minimal working example “hello world” gRPC app to work with GKE Ingress and IAP. For it to work, the server needs to use a TLS certificate which can be self signed.
• I create the certificate with
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes -subj "/CN=0.0.0.0"
(which works for my hello world example)
• But from another pod in the flyte namespace, I cannot talk to admin (apart from health check):
curl -v --insecure https://flyteadmin:80/healthcheck
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server key exchange (12):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
> GET /healthcheck HTTP/1.1
> User-Agent: curl/7.35.0
> Host: flyteadmin
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Fri, 30 Jun 2023 17:53:27 GMT
< Content-Length: 0
But
curl -v --insecure https://flyteadmin:80/api/v1/projects
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server key exchange (12):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
> GET /api/v1/projects HTTP/1.1
> User-Agent: curl/7.35.0
> Host: flyteadmin
> Accept: */*
> 
< HTTP/1.1 503 Service Unavailable
< Content-Type: application/json
< Date: Fri, 30 Jun 2023 17:53:45 GMT
< Content-Length: 337
< 
{"error":"connection error: desc = \"transport: authentication handshake failed: x509: certificate is not valid for any names, but wanted to match 0.0.0.0:8088\"","code":14,"message":"connection error: desc = \"transport: authentication handshake failed: x509: certificate is not valid for any names, but wanted to match 0.0.0.0:8088\""}
Does anyone know how the certificate needs to be created?
Screenshot_2023-06-30_at_19_54_57.png
Update: With
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes -subj "/CN=0.0.0.0" -addext "subjectAltName = DNS:0.0.0.0:8088"
I can now talk to the HTTP server of admin from within the cluster, and flyte console works too.
curl --insecure https://flyteadmin:80/api/v1/projects
{"projects":[{"id":"flytesnacks","name":"flytesnacks","domains":[{"id":"development","name":"development"},{"id":"staging","name":"staging"},{"id":"production","name":"production"}],"description":"flytesnacks description"}]}
But I still cannot make requests to the gRPC server 😕 (flytectl get projects works when disabling ssl between lb and admin)
Flyte scheduler init container also fails with this error:
panic: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.88.13.168:81: connect: connection refused"
(This is the flyteadmin IP)
Screenshot_2023-06-30_at_20_39_47.png
So something is still off with the certificate I assume. Would really appreciate if somebody knows how the certificate needs to be generated 🙈
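Which names the certificate needs is the open question here, so it may help to make the SAN list easy to vary between attempts. Below is a minimal sketch that only assembles the openssl argument list. It uses the same flags as the commands above; the helper name and the extra "flyteadmin" DNS entry are illustrative assumptions, not a confirmed fix.

```python
import shlex


def openssl_selfsigned_args(common_name, dns_names=(), ip_addrs=(), days=365):
    """Build an `openssl req` argument list for a self-signed cert with SANs.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    sans = [f"DNS:{name}" for name in dns_names] + [f"IP:{addr}" for addr in ip_addrs]
    args = [
        "openssl", "req", "-x509", "-newkey", "rsa:4096",
        "-keyout", "key.pem", "-out", "cert.pem",
        "-days", str(days), "-nodes",
        "-subj", f"/CN={common_name}",
    ]
    if sans:
        # -addext needs OpenSSL 1.1.1 or newer.
        args += ["-addext", "subjectAltName = " + ",".join(sans)]
    return args


args = openssl_selfsigned_args(
    "0.0.0.0",
    dns_names=["0.0.0.0:8088", "flyteadmin"],  # "flyteadmin" is an illustrative guess
    ip_addrs=["0.0.0.0"],
)
# Print a copy-pasteable shell command.
print(shlex.join(args))
```

The printed command can be pasted into a shell, or the list passed to subprocess.run, to regenerate cert.pem and key.pem before recreating the secret.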
j
@Fabio Grätz: it doesnt work within the cluster either?
flytectl get projects --admin.endpoint=flyteadmin:81 --admin.insecureSkipVerify=true
f
I actually didn’t try that but will now! ⏳
./bin/flytectl get projects --admin.endpoint=flyteadmin:81 --admin.insecureSkipVerify=true
INFO[0000] [0] Couldn't find a config file []. Relying on env vars and pflags. 
{"json":{},"level":"warning","msg":"using insecureSkipVerify. Server's certificate chain and host name wont be verified. Caution : shouldn't be used for production usecases","ts":"2023-06-30T18:54:04Z"}
Error: Connection Info: [Endpoint: flyteadmin:81, InsecureConnection?: false, AuthMode: ClientSecret]: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.88.13.201:81: connect: connection refused"
{"json":{},"level":"error","msg":"Connection Info: [Endpoint: flyteadmin:81, InsecureConnection?: false, AuthMode: ClientSecret]: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.88.13.201:81: connect: connection refused\"","ts":"2023-06-30T18:54:04Z"}
No, it doesn’t work from within the cluster either. (Actually this makes me very happy 🙈 if this had worked but the load balancer hadn’t, I would have become really pessimistic as the lb is very painful to debug)
As a sanity check: From the same pod this works
curl --insecure https://flyteadmin:80/api/v1/projects
{"projects":[{"id":"flytesnacks","name":"flytesnacks","domains":[{"id":"development","name":"development"},{"id":"staging","name":"staging"},{"id":"production","name":"production"}],"description":"flytesnacks description"}]}
j
so the grpc server is basically not starting it looks like
anything in admin logs?
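"connection refused" in the flytectl output means nothing accepted the TCP connection on port 81 at all, which is a different failure from a TLS handshake or certificate error. A minimal sketch for telling the two apart from a pod that has Python available; the function name is made up, and the service name and ports in the usage comment are the ones from this thread.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a plain TCP connect attempt to host:port.

    Returns "open" if something accepted the connection, "refused" if the
    connection was actively refused (nothing listening), "timeout" if no
    answer arrived in time, or "error: ..." for anything else (e.g. DNS).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


# Usage from a pod in the flyte namespace:
#   probe_tcp("flyteadmin", 80)  # HTTP port of the service
#   probe_tcp("flyteadmin", 81)  # gRPC port of the service
```

If port 80 reports "open" and port 81 reports "refused", the problem is on the listening side (the gRPC server is not up), not in certificate verification.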
f
Only this:
k logs -f flyteadmin-5764f7c76c-wthhs
Defaulted container "flyteadmin" out of: flyteadmin, run-migrations (init), seed-projects (init), generate-secrets (init)
time="2023-06-30T18:49:07Z" level=info msg="Using config file: [/etc/flyte/config/clusters.yaml /etc/flyte/config/db.yaml /etc/flyte/config/domain.yaml /etc/flyte/config/namespace_config.yaml /etc/flyte/config/remoteData.yaml /etc/flyte/config/server.yaml /etc/flyte/config/storage.yaml /etc/flyte/config/task_resource_defaults.yaml]"
j
maybe crank up the log level to 5?
f
Yes will do. Need to figure out where in the config i need to set this ⏳
j
are you using flyte-core?
configmap:
  logger:
    logger:
      show-source: true
      level: 5
in the values
f
Restarting now
Nothing in the logs 🤔
k logs -f flyteadmin-df5754849-jwjqn
Defaulted container "flyteadmin" out of: flyteadmin, run-migrations (init), seed-projects (init), generate-secrets (init)
time="2023-06-30T19:05:36Z" level=info msg="Using config file: [/etc/flyte/config/clusters.yaml /etc/flyte/config/db.yaml /etc/flyte/config/domain.yaml /etc/flyte/config/logger.yaml /etc/flyte/config/namespace_config.yaml /etc/flyte/config/remoteData.yaml /etc/flyte/config/server.yaml /etc/flyte/config/storage.yaml /etc/flyte/config/task_resource_defaults.yaml]"
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [auth] updated. No update handler registered.","ts":"2023-06-30T19:05:36Z"}
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [notifications] updated. No update handler registered.","ts":"2023-06-30T19:05:36Z"}
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [cloudevents] updated. No update handler registered.","ts":"2023-06-30T19:05:36Z"}
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [task_resources] updated. No update handler registered.","ts":"2023-06-30T19:05:36Z"}
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [task_type_whitelist] updated. No update handler registered.","ts":"2023-06-30T19:05:36Z"}
{"json":{"src":"viper.go:400"},"level":"debug","msg":"Config section [admin] updated. Firing updated event.","ts":"2023-06-30T19:05:36Z"}
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [propeller] updated. No update handler registered.","ts":"2023-06-30T19:05:36Z"}
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [server] updated. No update handler registered.","ts":"2023-06-30T19:05:36Z"}
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [scheduler] updated. No update handler registered.","ts":"2023-06-30T19:05:36Z"}
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [externalevents] updated. No update handler registered.","ts":"2023-06-30T19:05:36Z"}
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [cluster_resources] updated. No update handler registered.","ts":"2023-06-30T19:05:36Z"}
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [queues] updated. No update handler registered.","ts":"2023-06-30T19:05:36Z"}
{"json":{"src":"viper.go:398"},"level":"debug","msg":"Config section [namespace_mapping] updated. No update handler registered.","ts":"2023-06-30T19:05:36Z"}
{"json":{"src":"service.go:68"},"level":"info","msg":"setting metrics keys to [project domain wf task phase tasktype runtime_type runtime_version app_name]","ts":"2023-06-30T19:05:36Z"}
{"json":{"src":"cert_utils.go:23"},"level":"info","msg":"Constructing SSL credentials","ts":"2023-06-30T19:05:36Z"}
{"json":{"src":"service.go:79"},"level":"info","msg":"Registering default middleware with blanket auth validation","ts":"2023-06-30T19:05:36Z"}
{"json":{"src":"service.go:94"},"level":"info","msg":"Creating gRPC server without authentication","ts":"2023-06-30T19:05:36Z"}
{"json":{"src":"server.go:97"},"level":"info","msg":"Starting profiling server on port [10254]","ts":"2023-06-30T19:05:36Z"}
{"json":{"src":"database.go:171"},"level":"info","msg":"Set connection pool values to [{MaxOpenConnections:100 OpenConnections:1 InUse:0 Idle:1 WaitCount:0 WaitDuration:0s MaxIdleClosed:0 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}]","ts":"2023-06-30T19:05:37Z"}
{"json":{"src":"client.go:73"},"level":"debug","msg":"successfully loaded kube configuration from in cluster config","ts":"2023-06-30T19:05:37Z"}
{"json":{"src":"base.go:98"},"level":"info","msg":"Successfully created a workflow executor engine","ts":"2023-06-30T19:05:37Z"}
{"json":{"src":"factory.go:177"},"level":"info","msg":"Using default noop notifications publisher implementation for config type [local]","ts":"2023-06-30T19:05:37Z"}
{"json":{"src":"factory.go:126"},"level":"info","msg":"Using default noop notifications processor implementation for config type [local]","ts":"2023-06-30T19:05:37Z"}
{"json":{"src":"base.go:106"},"level":"info","msg":"Started processing notifications.","ts":"2023-06-30T19:05:37Z"}
{"json":{"src":"noop_notifications.go:43"},"level":"debug","msg":"call to noop start processing.","ts":"2023-06-30T19:05:37Z"}
{"json":{"src":"factory.go:86"},"level":"info","msg":"Using default flyte scheduler implementation","ts":"2023-06-30T19:05:37Z"}
{"json":{"src":"factory.go:104"},"level":"info","msg":"Using default noop workflow executor implementation for cloud provider type [local]","ts":"2023-06-30T19:05:37Z"}
{"json":{"src":"factory.go:88"},"level":"info","msg":"Using default noop remote url implementation for cloud provider type [gcs]","ts":"2023-06-30T19:05:37Z"}
{"json":{"src":"base.go:150"},"level":"info","msg":"Successfully initialized a new scheduled workflow executor","ts":"2023-06-30T19:05:37Z"}
{"json":{"src":"base.go:152"},"level":"info","msg":"Starting the scheduled workflow executor","ts":"2023-06-30T19:05:37Z"}
{"json":{"src":"base.go:166"},"level":"info","msg":"Initializing a new AdminService","ts":"2023-06-30T19:05:37Z"}
{"json":{"src":"database.go:171"},"level":"info","msg":"Set connection pool values to [{MaxOpenConnections:100 OpenConnections:1 InUse:0 Idle:1 WaitCount:0 WaitDuration:0s MaxIdleClosed:0 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}]","ts":"2023-06-30T19:05:37Z"}
{"json":{"src":"signal_service.go:77"},"level":"info","msg":"Initializing a new SignalService","ts":"2023-06-30T19:05:37Z"}

2023/06/30 19:06:31 /go/pkg/mod/gorm.io/gorm@v1.24.1-0.20221019064659-5dd2bb482755/callbacks.go:134
[5.209ms] [rows:1] SELECT * FROM "projects" WHERE state != 1 ORDER BY identifier asc

2023/06/30 19:06:47 /go/pkg/mod/gorm.io/gorm@v1.24.1-0.20221019064659-5dd2bb482755/callbacks.go:134
[3.220ms] [rows:1] SELECT * FROM "projects" WHERE state != 1 ORDER BY identifier asc
The 2 queries in the end both come from curl -> http server.
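At level 5 most of the output is config debug noise, so it can be easier to filter for the lines that mention SSL or gRPC. A rough sketch, assuming the JSON line format shown above; the function names and the keyword list are my own choice.

```python
import json


def parse_log_lines(lines):
    """Yield (level, src, msg) for each JSON log line; skip everything else."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # plain-text "time=..." lines, gorm SQL traces, blanks
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        src = entry.get("json", {}).get("src", "")
        yield entry.get("level", ""), src, entry.get("msg", "")


def grep_logs(lines, keywords):
    """Return the entries whose message contains any keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    return [
        (level, src, msg)
        for level, src, msg in parse_log_lines(lines)
        if any(k in msg.lower() for k in wanted)
    ]


# Lines copied from the log output above.
sample = [
    '{"json":{"src":"cert_utils.go:23"},"level":"info","msg":"Constructing SSL credentials","ts":"2023-06-30T19:05:36Z"}',
    '{"json":{"src":"service.go:94"},"level":"info","msg":"Creating gRPC server without authentication","ts":"2023-06-30T19:05:36Z"}',
    '{"json":{"src":"server.go:97"},"level":"info","msg":"Starting profiling server on port [10254]","ts":"2023-06-30T19:05:36Z"}',
    '[5.209ms] [rows:1] SELECT * FROM "projects" WHERE state != 1 ORDER BY identifier asc',
]
hits = grep_logs(sample, ["ssl", "tls", "grpc"])
for level, src, msg in hits:
    print(level, src, msg)
```

On the log above this leaves only the "Constructing SSL credentials" and "Creating gRPC server without authentication" lines. Neither is an error, and nothing in the output says the gRPC server actually started listening.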
j
the last time i tried doing this, i dont think i used a cert at all for internal
let me confirm
f
Screenshot 2023-06-22 at 18.15.17.png
I’m pretty sure that on GCP there is no way without TLS between LB and backend.
Screenshot 2023-06-22 at 18.44.48.png
j
f
Yes with this ingress controller this might be different.
j
same issue
didnt work without insecure=false
did you try without any changes to flyteadmin config?
just to rule it out?
f
You mean turn off ssl again and see if it works then?
j
im not doing anything different for ALB either
turn off SSL in the flyteadmin config, but inform the LB to hit the backend using http2
i think you do that via the port spec scheme
f
Here it says explicitly that for the GKE Ingress controller this doesn’t work without TLS.
With my minimal working example greeter gRPC app I confirmed that the lb only forwards requests to the backend once a cert is added to the backend.
But I’m happy to try again ^^
Thanks for supporting btw 🙏 🙏
f
On this page it says:
Note: To ensure the load balancer can make a correct HTTP2 request to your backend, your backend must be configured with SSL. For more information on what types of certificates are accepted, see Encryption from the load balancer to the backends.
j
hmm maybe that app auto-generates certs then?
f
The one in the docs does yes, I checked this
You can see it in the logs.
With insecure=true and without ssl in the helm values, it works:
./bin/flytectl get projects --admin.endpoint=flyteadmin:81 --admin.insecure=true
INFO[0000] [0] Couldn't find a config file []. Relying on env vars and pflags. 
 ------------- ------------- ------------------------- 
| ID          | NAME        | DESCRIPTION             |
 ------------- ------------- ------------------------- 
| flytesnacks | flytesnacks | flytesnacks description |
 ------------- ------------- -------------------------
The load balancer won’t work but this shows that the rest of the config is likely fine.
j
👍
configmap:
  adminServer:
    server:
      httpPort: 8088
      grpcPort: 8089
      security:
        secure: true
        ssl:
          certificateFile: "/etc/tls/cert.pem"
          keyFile: "/etc/tls/key.pem"
is this what you are running?
if you just set secure: false, in-cluster connections work?
f
server:
  httpPort: 8088
  grpcPort: 8089
  security:
    secure: true
    ssl:
      certificateFile: "/etc/tls/cert.pem"
      keyFile: "/etc/tls/key.pem"
If I set secure: false and comment out ssl, in-cluster works. (I assume it would also work if I didn’t comment out ssl, since it shouldn’t have an effect.)
j
right
ok
ill have to play around with this
i suspect we can repro this in sandbox for faster iteration
can confirm that enabling ssl breaks flyteadmin grpc on sandbox as well.
not sure whats happening though.
h
@Drew OConnor
n
Has there been any movement on this issue? I'm seeing the same thing on an OpenShift deployment. After enabling the secure and ssl entries and using TLS passthrough on the OpenShift Route, I get effectively the same cert errors in the logs: 0.0.0.0:8088 is not a valid SAN. By policy I am unable to generate a DNS SAN of 0.0.0.0:8088, and an IP SAN of 0.0.0.0 has no effect, so I'm even further back on this issue.
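One possible explanation for why an IP SAN of 0.0.0.0 does nothing while DNS:0.0.0.0:8088 works: the error text suggests the client verified the literal string 0.0.0.0:8088, port included. That string does not parse as an IP address, so a verifier that behaves like Go's crypto/x509 hostname check (as I understand it) would never look at the IP SANs and would only compare it against DNS SANs. The sketch below is a simplified Python model of that behaviour, not the actual Go code. Wildcards and trailing dots are left out, and the function name is made up.

```python
import ipaddress


def verify_hostname(host, dns_sans=(), ip_sans=()):
    """Simplified model of hostname verification against a cert's SANs.

    Assumption: mirrors how Go's x509 check is understood to behave.
    - If host parses as an IP (optionally in [brackets]), only IP SANs count.
    - Otherwise host is treated as a name and only DNS SANs count
      (exact, case-insensitive match; wildcards omitted here).
    The certificate CN is never consulted.
    """
    candidate = host
    if len(host) >= 3 and host[0] == "[" and host[-1] == "]":
        candidate = host[1:-1]
    try:
        ip = ipaddress.ip_address(candidate)
    except ValueError:
        ip = None  # e.g. "0.0.0.0:8088": the port makes it a non-IP string

    if ip is not None:
        return any(ip == ipaddress.ip_address(san) for san in ip_sans)
    return host.lower() in {san.lower() for san in dns_sans}


# CN-only cert from the first attempt: no SANs at all.
print(verify_hostname("0.0.0.0:8088"))
# IP SAN only: not consulted, because the name includes the port.
print(verify_hostname("0.0.0.0:8088", ip_sans=["0.0.0.0"]))
# DNS SAN with the port, as in the update earlier in this thread.
print(verify_hostname("0.0.0.0:8088", dns_sans=["0.0.0.0:8088"]))
```

If this model is right, the real issue is the name the internal client asks for (address plus port) rather than the certificate itself, so a policy-compliant cert would need that name changed on the Flyte side. That part is a guess and is not confirmed anywhere in this thread.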