Hi Flyte team, We see this issue with our flyte-po...
# flyte-deployment
f
Hi Flyte team, we see this issue with our flyte-pod-webhook. The issue goes away if we restart the deployment.
kubectl logs flyte-pod-webhook-57cff5dd75-w596k
Defaulted container "webhook" out of: webhook, generate-secrets (init)
time="2022-09-14T22:11:09Z" level=info msg=------------------------------------------------------------------------
time="2022-09-14T22:11:09Z" level=info msg="App [flytepropeller], Version [unknown], BuildSHA [unknown], BuildTS [2022-09-14 22:11:09.195785843 +0000 UTC m=+0.480189926]"
time="2022-09-14T22:11:09Z" level=info msg=------------------------------------------------------------------------
time="2022-09-14T22:11:09Z" level=info msg="Detected: 64 CPU's\n"
{"json":{},"level":"fatal","msg":"Failed to create controller manager. Error: failed to initialize controller-runtime manager: error listening on :8080: listen tcp :8080: bind: address already in use","ts":"2022-09-14T22:11:13Z"}
k
the certificate expired?
cc @Dan Rammer (hamersaw)
s
d
@Ketan (kumare3) where are we seeing an expired cert? @Fredrick can you provide a little more context? Is this a deployment using the helm chart? Did this occur on the initial deployment or sometime after?
r
We definitely have not had this running for a year, so I don’t think this is a case of the certificate expiring
k
ohh sorry i assumed this was the “Frederick” from some other company. I see it now, bind address already in use
that is weird?
I have been staring at the message and this is weird. Why is the port already in use? Looking into the K8s controller manager core.
r
Hello… any updates? no pressure 🙂
k
Cc @Dan Rammer (hamersaw) can you look into this
d
@Rupsha Chaudhuri I was unable to repro this locally. Has this occurred more than once? Can I get a little more context? maybe dump the pod yaml?
r
@Fredrick ^
d
@Fredrick Has this occurred more than once? just on the initial startup or after awhile?
I am still unable to repro this locally. Diving through the codebase, 8080 is the default port for the controller manager (unsurprisingly!). There is a k8s leader election configuration that we can set on the controller manager, but it's not clear that would help because this is the only container in the Pod. From some searching, it sounds like ingress controllers can sometimes be problematic here; again, this is very difficult to debug without diving further. If this happens again, can we figure out what is on port 8080?
k
@Dan Rammer (hamersaw) thank you for looking. @Rupsha Chaudhuri we will take a look, is this blocking you?
f
No, it's not a blocker. It does not happen very often.
r
This is now giving me a lot of grief.. I couldn’t get any workflows to succeed in the last couple of days even after restarting the pod multiple times 😞
@Dan Rammer (hamersaw) can you please look into this when you get a chance?
Please let us know if you need any further information
d
@Rupsha Chaudhuri, so sorry about this. I will take a deep dive here immediately. Will update accordingly.
filed an issue here so we can track this a little more officially.
@Yuvraj do you have any idea what would be binding to port 8080 here? What is a good route to debug?
So the problem is obviously that something is already bound to port 8080. We first need to figure out what it is, can you show the output of:
kubectl -n flyte exec flyte-pod-webhook-788c4c876c-6f58f -- netstat -tulpn
On my deployment locally it shows propeller (i.e. pod-webhook) is correctly bound to 8080. It may be useful to see this output both on the defaulted webhook container and on the generate-secrets init container. So something like:
hamersaw@ragnarok:~$ kubectl -n flyte exec flyte-pod-webhook-788c4c876c-6f58f -- netstat -tulpn
Defaulted container "webhook" out of: webhook, generate-secrets (init)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 :::9443                 :::*                    LISTEN      1/flytepropeller
tcp        0      0 :::10254                :::*                    LISTEN      1/flytepropeller
tcp        0      0 :::8080                 :::*                    LISTEN      1/flytepropeller
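If the output is long, the listener on a given port can be picked out programmatically. This is only a minimal Python sketch, assuming the `netstat -tulpn` column layout shown above; `find_listener` is a hypothetical helper and not part of Flyte or kubectl.

```python
def find_listener(netstat_output: str, port: int):
    """Return the "PID/Program name" column of the TCP socket listening on `port`, or None."""
    for line in netstat_output.splitlines():
        fields = line.split()
        # Listening TCP rows look like:
        #   tcp 0 0 :::8080 :::* LISTEN 1/flytepropeller
        if len(fields) < 7 or not fields[0].startswith("tcp") or fields[5] != "LISTEN":
            continue
        # The port is whatever follows the last colon, which covers
        # both 0.0.0.0:8080 and :::8080 style local addresses.
        if fields[3].rsplit(":", 1)[-1] == str(port):
            return " ".join(fields[6:])
    return None


# Sample rows copied from the output above.
sample = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 :::9443                 :::*                    LISTEN      1/flytepropeller
tcp        0      0 :::8080                 :::*                    LISTEN      1/flytepropeller
"""
print(find_listener(sample, 8080))
```

When the failing pod is inspected, a program name other than flytepropeller on the 8080 row would point at the process that holds the port.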
r
In case it’s relevant, @Fredrick mentioned that in our deployment we have hostNetwork: true, without which webhooks don’t work on EKS.
Right now after another restart the webhook is running. Let me run this command and give you the output once I run into this issue again
d
So if hostNetwork: true is set, the Pod has access to the host network. What is probably happening is that something else is bound to :8080 on that host. This also explains why restarting the Pod sometimes helps: k8s schedules it to another host which has 8080 open. I'm wondering if there is another solution to enable networking here that will mitigate these issues - pinging internally.
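The failure mode itself is easy to reproduce outside Kubernetes, since two listeners in the same network namespace cannot share a port. The following is a minimal Python sketch that assumes nothing about Flyte (`second_bind_fails` is a hypothetical name); it binds a throwaway local port twice and checks that the second bind fails with "address already in use".

```python
import errno
import socket


def second_bind_fails() -> bool:
    """Bind one port twice; report whether the second bind hits EADDRINUSE."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the OS for any free port, so the sketch does not
        # depend on a specific port such as 8080 being available.
        first.bind(("127.0.0.1", 0))
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE
        return False
    finally:
        first.close()
        second.close()


print(second_bind_fails())
```

With hostNetwork enabled the webhook shares the node's network namespace, so any process on the node that already holds the port plays the role of the first socket here.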
Can you elaborate on webhooks not working on EKS? Are you using a custom CNI?
f
Yes, we are using EKS with the Calico CNI. I understand that with hostNetwork there can be port conflicts. But why is flyte-pod-webhook listening on 8080? I thought it used 9443.
d
OK, so I was able to dive into this and figure it out a bit. You are correct that 9443 is the pod webhook listener, which k8s calls to update the Pod definition when injecting secrets. Port 8080 is what the k8s controller-runtime manager metrics server binds to by default. The controller-runtime manager is started in the pod-webhook because it needs an informer, cache, etc. with the k8s API server to work correctly. For an immediate fix, we can set the metrics server port to 0 (which disables it). However, this will require a change to the FlytePropeller codebase. We're hoping to have this out quickly (by tomorrow) and it should alleviate the issue you're having. The downside is that by disabling the metrics server we lose access to some k8s prometheus metrics. As an end goal, we would like to add these metrics to our own prometheus metrics server. This will require a little more work, but is certainly possible.
I'm not well versed in the Calico CNI, so I can't offer much insight into alternative networking configurations. In the event there is no fix there, I think disabling the k8s controller-runtime manager's metrics server will mitigate this issue. Does this sound like it will work?
f
Yes, we can disable the controller-runtime manager's metrics server. Let us know when we have the ability to do that.
d
@Fredrick I just opened a PR to disable this, once it's merged (will push through quickly) we will automatically build a new flytepropeller image that can be used. I will update once it's ready.
Then in a later PR (hopefully soon) we can update this code to serve the controller-runtime manager k8s metrics from the prometheus endpoint that we're currently opening to serve flyte metrics.
k
@Fredrick and @Rupsha Chaudhuri thank you for the patience and helping us find a potential bug
d
@Fredrick @Rupsha Chaudhuri OK we merged the PR to disable the k8s controller-runtime metrics server, the newer FlytePropeller image is "cr.flyte.org/flyteorg/flytepropeller:v1.1.42" as denoted by the release. You can update the flyte-pod-webhook deployment with this new image and should be able to verify that the pod does not open port 8080. Please let us know if you run into any problems!
r
Thank you so much.. We’ll keep you posted
f
@Dan Rammer (hamersaw) after upgrading to v1.1.42, the flyte-pod-webhook is not coming up
kubectl logs flyte-pod-webhook-587d7d5977-lxwwb
Defaulted container "webhook" out of: webhook, generate-secrets (init)
time="2022-10-14T17:29:32Z" level=info msg=------------------------------------------------------------------------
time="2022-10-14T17:29:32Z" level=info msg="App [flytepropeller], Version [unknown], BuildSHA [unknown], BuildTS [2022-10-14 17:29:32.394880612 +0000 UTC m=+0.247511472]"
time="2022-10-14T17:29:32Z" level=info msg=------------------------------------------------------------------------
time="2022-10-14T17:29:32Z" level=info msg="Detected: 64 CPU's\n"
{"metrics-prefix":"flyte:","certDir":"/etc/webhook/certs","localCert":false,"listenPort":9443,"serviceName":"flyte-pod-webhook","servicePort":443,"secretName":"flyte-pod-webhook","secretManagerType":"K8s","awsSecretManager":{"sidecarImage":"docker.io/amazon/aws-secrets-manager-secret-sidecar:v0.1.4","resources":{"limits":{"cpu":"200m","memory":"500Mi"},"requests":{"cpu":"200m","memory":"500Mi"}}},"vaultSecretManager":{"role":"flyte","kvVersion":"2"}}
{"json":{},"level":"error","msg":"Failed to start profiling server. Error: listen tcp :10254: bind: address already in use","ts":"2022-10-14T17:29:34Z"}
{"json":{},"level":"fatal","msg":"Failed to Start profiling and metrics server. Error: failed to start profiling server, listen tcp :10254: bind: address already in use","ts":"2022-10-14T17:29:34Z"}
d
So this should be a little easier to reconcile. 10254 is the default port that FlytePropeller uses to serve metrics, so it is opened in both the FlytePropeller pod and the flyte-pod-webhook pod. You can update this in the configuration in the propeller configmap with this value. However, I do believe that the flytepropeller pod and the flyte-pod-webhook pod use the same configmap by default, so to enforce two separate values you probably need to make a separate configmap for the flyte-pod-webhook.
f
Is it possible to disable the profiler?
If it's set to zero, can we disable the profiler?
d
So currently that is not implemented, but I would support it as an option. If you file an issue, it would be a pretty quick change. This would be a great community contribution.
f