Hi Flyte team We see this issue with our flyte pod webhook T Flyte #flyte-deployment

adamant-sandwich-39749

09/14/2022, 10:17 PM

Hi Flyte team, we see this issue with our flyte-pod-webhook. The issue goes away if we restart the deployment.

kubectl logs flyte-pod-webhook-57cff5dd75-w596k
Defaulted container "webhook" out of: webhook, generate-secrets (init)
time="2022-09-14T22:11:09Z" level=info msg=------------------------------------------------------------------------
time="2022-09-14T22:11:09Z" level=info msg="App [flytepropeller], Version [unknown], BuildSHA [unknown], BuildTS [2022-09-14 22:11:09.195785843 +0000 UTC m=+0.480189926]"
time="2022-09-14T22:11:09Z" level=info msg=------------------------------------------------------------------------
time="2022-09-14T22:11:09Z" level=info msg="Detected: 64 CPU's\n"
{"json":{},"level":"fatal","msg":"Failed to create controller manager. Error: failed to initialize controller-runtime manager: error listening on :8080: listen tcp :8080: bind: address already in use","ts":"2022-09-14T22:11:13Z"}

freezing-airport-6809

09/14/2022, 11:04 PM

the certificate expired?

freezing-airport-6809

09/14/2022, 11:04 PM

cc @hallowed-mouse-14616

tall-lock-23197

09/15/2022, 5:13 AM

Dan filed an issue already: https://github.com/flyteorg/flyte/issues/2871

hallowed-mouse-14616

09/15/2022, 3:06 PM

@freezing-airport-6809 where are we seeing an expired cert? @adamant-sandwich-39749 can you provide a little more context? Is this a deployment using the helm chart? Did this occur on the initial deployment or sometime after?

little-cricket-84530

09/15/2022, 8:11 PM

We definitely have not had this running for a year, so I don’t think this is a case of the certificate expiring

freezing-airport-6809

09/15/2022, 11:46 PM

ohh sorry i assumed this was the “Frederick” from some other company. I see it now, bind address already in use

freezing-airport-6809

09/15/2022, 11:46 PM

that is weird?

freezing-airport-6809

09/16/2022, 12:25 AM

I have been staring at the message; this is weird. Why is the port already in use? Looking into the K8s controller manager core.

little-cricket-84530

09/21/2022, 5:27 PM

Hello… any updates? no pressure 🙂

freezing-airport-6809

09/21/2022, 5:34 PM

Cc @hallowed-mouse-14616 can you look into this

hallowed-mouse-14616

09/21/2022, 6:01 PM

@little-cricket-84530 I was unable to repro this locally. Has this occurred more than once? Can I get a little more context? maybe dump the pod yaml?

little-cricket-84530

09/21/2022, 6:01 PM

@adamant-sandwich-39749 ^

adamant-sandwich-39749

09/21/2022, 6:39 PM

We used the official flyte helm chart to deploy. here is the webhook pod yaml

flyte-webhook-pod.yaml

hallowed-mouse-14616

09/21/2022, 7:00 PM

@adamant-sandwich-39749 Has this occurred more than once? Just on the initial startup, or after a while?

hallowed-mouse-14616

09/23/2022, 1:24 PM

I am still unable to repro this locally. Diving through the codebase, 8080 is the default port for the controller manager (unsurprisingly!). There is a k8s leader election configuration that we can set on the controller manager, but it's not clear that would help because this is the only container in the Pod. From some searching, it sounds like ingress controllers can sometimes be problematic here; again, this is very difficult to debug without diving further. If this happens again, can we figure out what is on port 8080?

freezing-airport-6809

09/23/2022, 1:30 PM

@hallowed-mouse-14616 thank you for looking. @little-cricket-84530 we will take a look, is this blocking you?

adamant-sandwich-39749

09/23/2022, 5:50 PM

No, it's not a blocker. It does not happen very often.

little-cricket-84530

10/10/2022, 4:09 PM

This is now giving me a lot of grief.. I couldn’t get any workflows to succeed in the last couple of days even after restarting the pod multiple times 😞

little-cricket-84530

10/10/2022, 4:19 PM

@hallowed-mouse-14616 can you please look into this when you get a chance?

little-cricket-84530

10/10/2022, 4:19 PM

Please let us know if you need any further information

hallowed-mouse-14616

10/10/2022, 4:25 PM

@little-cricket-84530, so sorry about this. I will take a deep dive here immediately. Will update accordingly.

🙏🏼 1

hallowed-mouse-14616

10/10/2022, 4:46 PM

filed an issue here so we can track this a little more officially.

🙏🏼 1

hallowed-mouse-14616

10/10/2022, 4:56 PM

@great-school-54368 do you have any idea what would be binding to port 8080 here? What is a good route to debug?

hallowed-mouse-14616

10/10/2022, 5:01 PM

So the problem is obviously that something is already bound to port 8080. We first need to figure out what it is. Can you show the output of:

kubectl -n flyte exec flyte-pod-webhook-788c4c876c-6f58f -- netstat -tulpn

On my deployment locally, it shows propeller (i.e. pod-webhook) is correctly bound to 8080. It may be useful to see this output both on the defaulted `webhook` container and on the `generate-secrets` init container. So something like:

hamersaw@ragnarok:~$ kubectl -n flyte exec flyte-pod-webhook-788c4c876c-6f58f -- netstat -tulpn
Defaulted container "webhook" out of: webhook, generate-secrets (init)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 :::9443                 :::*                    LISTEN      1/flytepropeller
tcp        0      0 :::10254                :::*                    LISTEN      1/flytepropeller
tcp        0      0 :::8080                 :::*                    LISTEN      1/flytepropeller
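
To find out what owns a port from output like the above, the listing can be reduced to a port-to-process map. This is a minimal Python sketch, not part of Flyte: the `listeners` helper and the `SAMPLE` variable are hypothetical names, and the sample text is copied from the netstat output in this thread.

```python
from typing import Dict

# Sample copied from the netstat output shown in this thread.
SAMPLE = """\
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 :::9443                 :::*                    LISTEN      1/flytepropeller
tcp        0      0 :::10254                :::*                    LISTEN      1/flytepropeller
tcp        0      0 :::8080                 :::*                    LISTEN      1/flytepropeller
"""


def listeners(netstat_output: str) -> Dict[int, str]:
    """Map each listening TCP port to its 'PID/Program name' column."""
    result: Dict[int, str] = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Listening TCP rows: proto, recv-q, send-q, local, foreign, state, pid/program.
        if len(fields) < 7 or not fields[0].startswith("tcp") or fields[5] != "LISTEN":
            continue
        # The port is whatever follows the last colon of the local address.
        port = int(fields[3].rsplit(":", 1)[1])
        result[port] = " ".join(fields[6:])
    return result


owners = listeners(SAMPLE)
print(owners.get(8080, "nothing is listening on 8080"))  # prints: 1/flytepropeller
```

Run against output captured while the webhook is failing, an owner other than flytepropeller on 8080 would point at the conflicting process.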

little-cricket-84530

10/10/2022, 5:01 PM

In case it’s relevant, @adamant-sandwich-39749 mentioned that in our deployment we have `hostNetwork: true`, without which webhooks don’t work on EKS.

little-cricket-84530

10/10/2022, 5:04 PM

Right now after another restart the webhook is running. Let me run this command and give you the output once I run into this issue again

👍 1

hallowed-mouse-14616

10/10/2022, 5:13 PM

So if `hostNetwork: true` is set, the Pod will have access to the host network. In this case, what is probably happening is that something else is bound to :8080 on that host. This also explains why restarting the Pod sometimes helps: k8s schedules it to another host which has 8080 open. I'm wondering if there is another solution to enable networking here that will mitigate these issues; pinging internally.
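
The conflict described above can be reproduced without Kubernetes. The following is a minimal Python sketch, assuming two processes share one network namespace (as pods with hostNetwork: true do when they land on the same node). The helper name `try_bind` is hypothetical, and port 0 is used so the OS picks a free port instead of hard-coding 8080.

```python
import errno
import socket
from typing import Optional


def try_bind(port: int, host: str = "127.0.0.1") -> Optional[socket.socket]:
    """Return a listening socket on (host, port), or None if the address is in use."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None
        raise


# The first listener stands in for whatever already owns the port on the node.
first = try_bind(0)  # port 0 asks the OS for any free port
port = first.getsockname()[1]

# The second listener stands in for the webhook trying to bind the same port.
second = try_bind(port)
if second is None:
    print(f"port {port}: address already in use")
else:
    print("second bind succeeded")
    second.close()

first.close()
```

Without hostNetwork, each pod has its own network namespace, so the second bind would not see the first listener at all.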

hallowed-mouse-14616

10/10/2022, 5:20 PM

Can you elaborate on webhooks not working on EKS? Are you using a custom CNI?

adamant-sandwich-39749

10/10/2022, 6:13 PM

Yes, we are using EKS with the Calico CNI. I understand that with hostNetwork there can be port conflicts. But why is flyte-pod-webhook listening on 8080? I thought it used 9443.

hallowed-mouse-14616

10/10/2022, 7:08 PM

OK, so I was able to dive into this and figure it out a bit. You are correct that 9443 is the pod webhook listener, which k8s calls to update the Pod definition for secrets. Port 8080 is what the k8s controller-runtime manager metrics server binds to by default. The controller-runtime manager is started in the pod-webhook because it needs an informer, cache, etc. with the k8s API server to work correctly. For an immediate fix, we can set the metrics server port to 0 (which disables it). However, this will require a change to the FlytePropeller codebase. We're hoping to have this out quickly (by tomorrow), and it should alleviate the issue you're having. As an end goal, since disabling the metrics server means we lose access to some k8s Prometheus metrics, we would like to add these metrics to our Prometheus metrics server. This will require a little more work, but is certainly possible.

hallowed-mouse-14616

10/10/2022, 7:09 PM

I'm not well versed in the Calico CNI, so I can't offer much insight into alternative networking configurations. In the event there is no fix there, I think disabling the k8s controller-runtime manager's metrics server will mitigate this issue. Does this sound like it will work?

adamant-sandwich-39749

10/10/2022, 7:24 PM

Yes, we can disable the controller-runtime manager's metrics server. Let us know when we have the ability to do that.

hallowed-mouse-14616

10/10/2022, 11:17 PM

@adamant-sandwich-39749 I just opened a PR to disable this. Once it's merged (will push through quickly), we will automatically build a new flytepropeller image that can be used. I will update once it's ready.

hallowed-mouse-14616

10/10/2022, 11:17 PM

Then in a later PR (hopefully soon) we can update this code to serve the controller-runtime manager k8s metrics from the Prometheus endpoint that we're currently opening to serve Flyte metrics.

freezing-airport-6809

10/11/2022, 12:30 AM

@adamant-sandwich-39749 and @little-cricket-84530 thank you for the patience and helping us find a potential bug

hallowed-mouse-14616

10/12/2022, 4:59 PM

@adamant-sandwich-39749 @little-cricket-84530 OK we merged the PR to disable the k8s controller-runtime metrics server, the newer FlytePropeller image is "cr.flyte.org/flyteorg/flytepropeller:v1.1.42" as denoted by the release. You can update the flyte-pod-webhook deployment with this new image and should be able to verify that the pod does not open port 8080. Please let us know if you run into any problems!

little-cricket-84530

10/12/2022, 5:00 PM

Thank you so much.. We’ll keep you posted

🙏 1

adamant-sandwich-39749

10/14/2022, 5:31 PM

@hallowed-mouse-14616 after upgrading to v1.1.42, the flyte-pod-webhook is not coming up

kubectl logs flyte-pod-webhook-587d7d5977-lxwwb
Defaulted container "webhook" out of: webhook, generate-secrets (init)
time="2022-10-14T17:29:32Z" level=info msg=------------------------------------------------------------------------
time="2022-10-14T17:29:32Z" level=info msg="App [flytepropeller], Version [unknown], BuildSHA [unknown], BuildTS [2022-10-14 17:29:32.394880612 +0000 UTC m=+0.247511472]"
time="2022-10-14T17:29:32Z" level=info msg=------------------------------------------------------------------------
time="2022-10-14T17:29:32Z" level=info msg="Detected: 64 CPU's\n"
{"metrics-prefix":"flyte:","certDir":"/etc/webhook/certs","localCert":false,"listenPort":9443,"serviceName":"flyte-pod-webhook","servicePort":443,"secretName":"flyte-pod-webhook","secretManagerType":"K8s","awsSecretManager":{"sidecarImage":"docker.io/amazon/aws-secrets-manager-secret-sidecar:v0.1.4","resources":{"limits":{"cpu":"200m","memory":"500Mi"},"requests":{"cpu":"200m","memory":"500Mi"}}},"vaultSecretManager":{"role":"flyte","kvVersion":"2"}}
{"json":{},"level":"error","msg":"Failed to start profiling server. Error: listen tcp :10254: bind: address already in use","ts":"2022-10-14T17:29:34Z"}
{"json":{},"level":"fatal","msg":"Failed to Start profiling and metrics server. Error: failed to start profiling server, listen tcp :10254: bind: address already in use","ts":"2022-10-14T17:29:34Z"}

hallowed-mouse-14616

10/14/2022, 5:56 PM

So this should be a little easier to reconcile. 10254 is the default port that FlytePropeller uses to serve metrics, so it is opened in both the FlytePropeller pod and the flyte-pod-webhook pod. You can update this in the propeller configmap with this value. However, I do believe that the FlytePropeller pod and the flyte-pod-webhook pod use the same configmap by default, so to enforce two separate values you probably need to make a separate configmap for the flyte-pod-webhook.

adamant-sandwich-39749

10/14/2022, 6:57 PM

Is it possible to disable the profiler?

adamant-sandwich-39749

10/14/2022, 6:58 PM

If it's set to zero, can we disable the profiler?

hallowed-mouse-14616

10/14/2022, 7:02 PM

So currently that is not implemented, but I would support it as an option. If you filed an issue it would be a pretty quick change. This would be a great community contribution.

adamant-sandwich-39749

10/14/2022, 8:59 PM

created a ticket https://github.com/flyteorg/flyte/issues/2985

🙌 1
