# ask-the-community
n
Hi, we are seeing a cert issue with the webhook when running a workflow on a CronSchedule of 2 mins via a LaunchPlan. We are pulling in some k8s secrets in the tasks, and it looks like the webhook has something to do with that. The workflow failed a couple of times and then recovered.
RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "flyte.org")
Not sure if cert expiry is the issue, but it looks like there was a change to increase the cert expiry from 1 year to 99 years to fix this bug. We have pkg version 0.0.48; not sure if we have that fix. Will scaling the webhook help with this issue? How do we do it?
h
Hi @Nandakumar Raghu, great investigative skills right there 🙂 If it were related to expiry, you would have received a different, more specific error, I believe. So I'm inclined to say expiry is not the cause.
Were there many pods (that pull secrets) running at the same time? How many, if you had to guess?
Which chart are you using? flyte-binary? flyte-core?
n
I don't see any other executions at the time of this failure on the UI. This happened exactly starting 12.28 PM UTC and the last failure was at 12.28 PM UTC and then it recovered. We are using the flyte-binary chart.
@Kevin Su - thoughts?
@Haytham Abuelfutuh - I read in the cookbook about scaling the webhook pod, but I guess with flyte-binary, there is no separate webhook pod? The flyte binary (flyte-flyte-binary-877b879d5-pcwzb) pod is the webhook pod correct? We have 3 replicas of this pod running, but we have a launch plan that runs a workflow every 2 mins, that pulls some secrets. Do you think the number of replicas is enough or we should try increasing it? Or is there something else going on here with the cert issue?
h
3 should be plenty...
the same binary serves the webhook endpoint as well...
n
Any idea how I can debug this? We are running batch predictions on a 2 min cron schedule, and it fails for a couple of runs with this cert error and then comes back. We would like none of the runs to fail 🙂
@Haytham Abuelfutuh - I couldn't find any fix or reason for this, so I created a bug. Also, it would be good to be able to use cert-manager issued certs instead of self-signed ones for production, so I created a feature request for that as well.
k
Aah sounds like it - cc @Haytham Abuelfutuh does the webhook have leader election? Does it need it?
@Nandakumar Raghu you can always go to a fully deployment
n
@Ketan (kumare3) - Sorry, I don't understand what you mean by "fully deployment"
e
Single binary was not designed to support multiple replicas. What @Ketan (kumare3) is suggesting is that, instead of deploying single-binary, you should use the flyte-core Helm chart, which allows you to install the Flyte components separately.
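For anyone debugging the same "x509: certificate signed by unknown authority" error: it means the CA that the Kubernetes API server trusts for the webhook (the base64-encoded `caBundle` field of the MutatingWebhookConfiguration) did not sign the certificate the webhook pod presented. A minimal sketch of one way to check that, assuming you have already copied out the `caBundle` value and the CA certificate PEM your webhook pods actually serve with (where that PEM is stored depends on your deployment and is not shown here). The function names `pem_fingerprints` and `ca_bundle_matches` are hypothetical helpers for illustration, not part of Flyte or Kubernetes.

```python
import base64
import hashlib
import ssl

BEGIN = "-----BEGIN CERTIFICATE-----"
END = "-----END CERTIFICATE-----"


def pem_fingerprints(pem_text: str) -> list[str]:
    """Return the SHA-256 fingerprint of every certificate in a PEM bundle."""
    prints = []
    # Everything after each BEGIN marker, up to the next END marker, is one cert.
    for chunk in pem_text.split(BEGIN)[1:]:
        body = chunk.split(END)[0]
        der = ssl.PEM_cert_to_DER_cert(BEGIN + body + END)
        prints.append(hashlib.sha256(der).hexdigest())
    return prints


def ca_bundle_matches(ca_bundle_b64: str, serving_ca_pem: str) -> bool:
    """True if the webhook config's caBundle and the serving CA share a certificate.

    ca_bundle_b64: the caBundle value as stored in the webhook configuration
                   (base64-encoded PEM).
    serving_ca_pem: the CA certificate PEM used by the webhook pods
                    (assumed to be obtained separately).
    """
    bundle_pem = base64.b64decode(ca_bundle_b64).decode("ascii")
    trusted = set(pem_fingerprints(bundle_pem))
    serving = set(pem_fingerprints(serving_ca_pem))
    return bool(trusted & serving)
```

If `ca_bundle_matches` returns `False` during a failing window and `True` once runs recover, that would suggest the trusted CA and the serving certificate were briefly out of sync rather than expired. This only compares certificates byte for byte; it does not validate signatures or expiry dates.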