Hi again folks, I'm using `flyte-binary` deployed...
# flyte-deployment
c
Hi again folks, I'm using `flyte-binary` deployed on a k8s cluster hosted on DigitalOcean. It's been working great, but recently I wanted to change the release name from "flyte-backend" to "flyte-binary". The rationale was simple: I just wanted shorter service names (in hindsight, I could have gone with "nameOverride" in the values). After doing that, workflows fail to execute. The error was:
Workflow[flyte-tasks:development:services.workflows.example.hello_world_wf] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "<https://flyte-backend-flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s>": service "flyte-backend-flyte-binary-webhook" not found
Notice how it's trying to call the old name of the service. Of course, I uninstalled the previous deployment and reinstalled with the new name. I can see the k8s service objects and pods with the correct new names. After spending several hours investigating, re-deploying, uninstalling, deleting all of the "flyte" namespaces (including the project+domain ones), and deleting all the tables from the database, I still get the same damn error! It's unclear to me where it's getting the old service name from. By now, I've pinpointed that it only happens when the task requests a secret. Any secret. Even weirder, after uninstalling the new deployment, deleting the namespaces, and dropping all tables in the database, installing Flyte with the previous release name "flyte-backend" now results in the same error, but with the "new" service name:
Workflow[flytesnacks:development:hello_world.hello_world_wf] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "<https://flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s>": service "flyte-binary-webhook" not found
I really don't get it. The configmaps look OK. I shelled into the pod and the config under `/etc/flyte/config.d` looks OK. After reverting back to the old name, `000-core.yaml` has the correct value under the `webhook` section:
...
webhook:
  certDir: /var/run/flyte/certs
  localCert: true
  secretName: flyte-backend-flyte-binary-webhook-secret
  serviceName: flyte-backend-flyte-binary-webhook
  servicePort: 443
...
RESOLVED As it turns out, Flyte creates a resource called a "*MutatingWebhookConfiguration*". E.g.:
kubectl get mutatingwebhookconfigurations
NAME                                      WEBHOOKS   AGE
cert-manager-webhook                      1          212d
flyte-backend-flyte-binary-webhook        1          171d
flyte-binary-webhook                      1          172m
kube-prometheus-stack-admission           1          212d
scaleops-mutating-webhook-configuration   1          20d
As you can see, there are 2 Flyte webhooks, one with the old name and the other with the new name. I suppose Flyte somehow selects a suitable webhook but fails calling it because the deployment with the old name no longer exists. This resource is not managed by Helm, so upgrade/uninstall won't modify it.
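One way to spot a leftover configuration like this is to compare the Service each webhook configuration points at with the Services that actually exist. The sketch below is only an illustration under a few assumptions: the helper name `stale_webhooks` and the `RUN_AGAINST_CLUSTER` switch are made up for this example, the namespace `flyte` is taken from the error message above, and only the first webhook entry of each configuration is inspected.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print webhook configurations whose backing Service is gone.
# stdin: lines of "<webhook-config-name> <service-name>"
# args:  names of the Services that currently exist
stale_webhooks() {
  local existing=" $* "
  local name svc
  while read -r name svc; do
    [ -z "$name" ] && continue
    [ -z "$svc" ] && continue          # URL-based webhooks have no Service; skip them
    case "$existing" in
      *" $svc "*) ;;                   # Service exists, configuration looks live
      *) printf '%s\n' "$name" ;;      # Service missing, configuration looks stale
    esac
  done
}

# Only touches a cluster when explicitly asked to (assumes kubectl access).
if [ "${RUN_AGAINST_CLUSTER:-0}" = "1" ]; then
  services=$(kubectl get svc -n flyte -o jsonpath='{.items[*].metadata.name}')
  # $services is left unquoted on purpose so each name becomes one argument.
  kubectl get mutatingwebhookconfigurations \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.webhooks[0].clientConfig.service.name}{"\n"}{end}' \
    | grep flyte | stale_webhooks $services
fi
```

If only the new Service exists, the old entry from the listing above is the one reported, and it can then be removed with `kubectl delete mutatingwebhookconfiguration flyte-backend-flyte-binary-webhook`.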
a
man! I was like checking the codebase and still not getting there. Glad you solved it!
This resource is not managed by Helm so upgrade/uninstall won't modify it.
Right! And as you noticed, it's created when your Task requests a Secret (see)
but I guess there are more streamlined ways to consume secrets
c
I don't really know what these webhooks are doing... and why one configuration is selected over another. Any way to disable it altogether, or is this important for something (events? metrics?)
a
I think it's used to read injected secrets. What if you just delete the old one?
t
The webhook is used to inject secrets, yes.
c
@average-finland-92144 yes, deleted the old one and now everything works as expected (as I said at the start - resolved 🙂 ) What other ways do I have to consume secrets? Actually, all I need is a way to communicate information to a task that's not via parameters. And not during registration.
Thanks @thankful-minister-83577!
@average-finland-92144 @thankful-minister-83577 Are you considering this a bug? I mean, I believe that Flyte should somehow remove such old webhooks, or somehow uninstall/update them when using Helm, no? In other words, right now, if you uninstall Flyte or modify service names, this issue breaks Flyte completely: all tasks that require secrets will not be able to run.
t
the bug is that helm uninstall doesn’t actually remove all components right?
yeah that is a bug, not sure when we will get to it, but definitely worth filing, mind putting in a ticket for that?
would be helpful to have the commands you ran as well
c
Sure, I'll do that. It's not really Helm's fault, because the webhook is created as part of service startup, so it's technically not part of the deployment life cycle.
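Since Helm does not track the resource, a small wrapper can pair the uninstall with an explicit cleanup. This is a sketch, not an official Flyte procedure: the function name `uninstall_flyte` and the `DRY_RUN` switch are hypothetical, and the webhook configuration name has to be supplied by hand (in this thread it matched the webhook Service name).

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: uninstall a release, then delete the webhook
# configuration that Helm does not manage.
# usage: uninstall_flyte <release> <namespace> <webhook-config-name>
uninstall_flyte() {
  local release="$1" namespace="$2" webhook="$3"
  # With DRY_RUN set, commands are printed instead of executed.
  local run="${DRY_RUN:+echo}"
  $run helm uninstall "$release" -n "$namespace" || return 1
  # MutatingWebhookConfiguration is cluster-scoped, so no namespace flag here.
  $run kubectl delete mutatingwebhookconfiguration "$webhook" --ignore-not-found
}
```

For example, `DRY_RUN=1 uninstall_flyte flyte-binary flyte flyte-binary-webhook` prints the two commands without running them, which is a cheap way to check the names before touching the cluster.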
t
maybe there’s a way to register it with helm though
but yeah, agreed
w
It'd be difficult to articulate how grateful I was to find this thread today. Thanks @curved-petabyte-84246!
A search for `MutatingWebhook` in flyteorg/flyte GitHub issues didn't turn this up, but huge upvote from us on prioritizing this issue.
c
I was supposed to open an issue but didn't get around to it... 🤐