curved-petabyte-84246
06/26/2024, 2:13 PM
flyte-binary deployed on a k8s cluster hosted on DigitalOcean. It's been working great, but recently I wanted to change the release name from "flyte-backend" to "flyte-binary".
The rationale was simple: I just wanted short service names (in hindsight, I could have gone with "nameOverride" in the values).
After doing that, workflows fail to execute. The error was:
Workflow[flyte-tasks:development:services.workflows.example.hello_world_wf] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "<https://flyte-backend-flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s>": service "flyte-backend-flyte-binary-webhook" not found
Notice how it's trying to call the old name of the service. Of course, I uninstalled the previous deployment and reinstalled with the new name. I can see the k8s service objects and pods with the correct new names.
After spending several hours investigating, re-deploying, uninstalling, deleting all of the "flyte" namespaces (including the project+domain ones), and deleting all the tables from the database, it's still the same damn error!
It's unclear to me where it's getting the old service name from.
By now, I've pinpointed that it only happens when the task requests a secret. Any secret.
Even weirder: after uninstalling the new deployment, deleting the namespaces, dropping all tables in the database, and installing Flyte with the previous release name "flyte-backend", I now get the same error, but with the "new" service name:
Workflow[flytesnacks:development:hello_world.hello_world_wf] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "<https://flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s>": service "flyte-binary-webhook" not found
I really don't get it. The configmaps look OK. I shelled into the pod and the config under /etc/flyte/config.d looks OK.
After reverting to the old name, 000-core.yaml has the correct value under the webhook section:
...
webhook:
  certDir: /var/run/flyte/certs
  localCert: true
  secretName: flyte-backend-flyte-binary-webhook-secret
  serviceName: flyte-backend-flyte-binary-webhook
  servicePort: 443
...
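The mismatch described here (the config names one service, the error names another) can be checked mechanically. The following is a minimal sketch, assuming the error text has the `service "X" not found` shape quoted above; the function name `service_from_webhook_error` is hypothetical and not part of Flyte or Kubernetes.

```python
import re
from typing import Optional


def service_from_webhook_error(message: str) -> Optional[str]:
    """Return the service name from a 'service "X" not found' fragment, or None."""
    match = re.search(r'service "([^"]+)" not found', message)
    return match.group(1) if match else None


# Tail of the error reported after reverting to the old release name.
error = (
    'failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: '
    'Post "<https://flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s>": '
    'service "flyte-binary-webhook" not found'
)

# serviceName as it appears in the webhook section of 000-core.yaml.
configured = "flyte-backend-flyte-binary-webhook"

called = service_from_webhook_error(error)
mismatch = called is not None and called != configured
print(f"called={called!r} configured={configured!r} mismatch={mismatch}")
```

A mismatch like this suggests the name in the error is coming from somewhere other than the mounted Flyte config.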
curved-petabyte-84246
06/26/2024, 9:41 PM
kubectl get mutatingwebhookconfigurations
NAME                                      WEBHOOKS   AGE
cert-manager-webhook                      1          212d
flyte-backend-flyte-binary-webhook        1          171d
flyte-binary-webhook                      1          172m
kube-prometheus-stack-admission           1          212d
scaleops-mutating-webhook-configuration   1          20d
As you can see, there are 2 Flyte webhooks, one with the old name and the other with the new name. I suppose Flyte somehow selects a suitable webhook but fails calling it because the deployment with the old name no longer exists.
This resource is not managed by Helm so upgrade/uninstall won't modify it.
average-finland-92144
06/26/2024, 9:46 PM
average-finland-92144
06/26/2024, 9:52 PM
"This resource is not managed by Helm so upgrade/uninstall won't modify it."
Right! And as you noticed, it's created when your Task requests a Secret (see)
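The diagnosis in this exchange (a leftover MutatingWebhookConfiguration pointing at a service that no longer exists) can be sketched as a small check over the output of `kubectl get mutatingwebhookconfigurations -o json`. This is a minimal sketch, not Flyte tooling: the function name `stale_webhook_configs` is hypothetical, the sample JSON is hand-written and trimmed to the fields used, and it assumes each Flyte configuration references a service with the same name in the `flyte` namespace, as the error messages in this thread suggest.

```python
import json
from typing import List, Set, Tuple


def stale_webhook_configs(
    configs_json: str, existing_services: Set[Tuple[str, str]]
) -> List[str]:
    """Names of webhook configurations whose target service is not in existing_services."""
    data = json.loads(configs_json)
    stale = []
    for item in data.get("items", []):
        for hook in item.get("webhooks", []):
            svc = hook.get("clientConfig", {}).get("service")
            if svc is None:
                continue  # URL-based webhook: no in-cluster service to check
            if (svc["namespace"], svc["name"]) not in existing_services:
                stale.append(item["metadata"]["name"])
                break  # one missing service is enough to flag this configuration
    return stale


# Hand-written sample mirroring the two Flyte entries in the listing (assumed shape).
sample = json.dumps({
    "items": [
        {
            "metadata": {"name": "flyte-backend-flyte-binary-webhook"},
            "webhooks": [{"clientConfig": {"service": {
                "namespace": "flyte", "name": "flyte-backend-flyte-binary-webhook"}}}],
        },
        {
            "metadata": {"name": "flyte-binary-webhook"},
            "webhooks": [{"clientConfig": {"service": {
                "namespace": "flyte", "name": "flyte-binary-webhook"}}}],
        },
    ]
})

# Services that exist after reverting to the "flyte-backend" release name.
existing = {("flyte", "flyte-backend-flyte-binary-webhook")}

stale = stale_webhook_configs(sample, existing)
print(stale)
```

Any configuration flagged this way is a candidate for manual cleanup, for example with `kubectl delete mutatingwebhookconfiguration <name>`, since Helm will not remove it.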
average-finland-92144
06/26/2024, 9:53 PM
curved-petabyte-84246
06/26/2024, 10:11 PM
average-finland-92144
06/26/2024, 10:13 PM
thankful-minister-83577
curved-petabyte-84246
06/27/2024, 4:02 AM
curved-petabyte-84246
06/27/2024, 5:23 AM
curved-petabyte-84246
06/27/2024, 10:14 AM
thankful-minister-83577
thankful-minister-83577
thankful-minister-83577
curved-petabyte-84246
06/27/2024, 5:40 PM
thankful-minister-83577
thankful-minister-83577
wide-lion-54536
08/14/2024, 7:19 PM
wide-lion-54536
08/14/2024, 7:22 PM
MutatingWebhook in flyteorg/flyte github issues didn't turn this up, but huge upvote from us on prioritizing this issue.
curved-petabyte-84246
08/14/2024, 7:23 PM