We're seeing flytepropeller auth failures after installation/upgrade (via helm) with a message like this:
```
E1220 16:10:36.391892 1 workers.go:102] error syncing 'mandant1-development/f3359d6b5cd941830000': Workflow[] failed. ErrorRecordingError: failed to publish event, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Unauthenticated desc = token parse error [JWT_VERIFICATION_FAILED] Could not retrieve id token from metadata, caused by: rpc error: code = Unauthenticated desc = Request unauthenticated with IDToken]
```
This leaves the system in a weird state because there is no hard failure: flytepropeller is still running, so our monitoring does not trigger an alarm, but no workflow executions happen. After a flytepropeller restart, it suddenly starts working again.
So after some digging I found out why this happens in our setup:
`flyte-secret-auth` is populated with `.Values.secrets.adminOauthClientCredentials.clientSecret` during installation, which is set to the placeholder `foobar` in `values.yaml`. We set the secret value dynamically with a Helm hook during installation, because we need to fetch the real client secret from Keycloak. That only happens after flytepropeller is deployed, and it seems flytepropeller does not reload the secret on changes, so it keeps using the placeholder. And since `flyte-secret-auth` is managed by Helm, the placeholder is written back on every `helm upgrade`, so the problem recurs.
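To make that concrete, this is roughly the Secret the chart renders from the default values (a simplified sketch from our reading of the chart; the `client_secret` key name is our assumption and may differ between chart versions):

```yaml
# Simplified sketch of the Secret as rendered from the default values.
# The key name "client_secret" is our assumption, not verbatim chart code.
apiVersion: v1
kind: Secret
metadata:
  name: flyte-secret-auth
type: Opaque
stringData:
  # Comes from .Values.secrets.adminOauthClientCredentials.clientSecret;
  # defaults to the placeholder "foobar" in values.yaml. Our Helm hook
  # overwrites this value only after flytepropeller has already started.
  client_secret: foobar
```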
I see mainly two (non-exclusive) ways to improve this behavior:
- Remove the default `clientSecret` and only create `flyte-secret-auth` via Helm if the value is actually set; only mount `flyte-secret-auth` if external auth is enabled. That would cause flytepropeller to fail to start until `flyte-secret-auth` is created by other means (see the template sketch after this list).
- Trigger a flytepropeller reload when `flyte-secret-auth` changes (see the Reloader sketch after this list).
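For the first option, a minimal sketch of what the template guard could look like (the value path is the existing one from `values.yaml`; everything else is illustrative, not the chart's actual template):

```yaml
{{- /* Hypothetical guard: only render the Secret when a real value is
       provided, so a placeholder never ships with the release. */}}
{{- if .Values.secrets.adminOauthClientCredentials.clientSecret }}
apiVersion: v1
kind: Secret
metadata:
  name: flyte-secret-auth
type: Opaque
stringData:
  client_secret: {{ .Values.secrets.adminOauthClientCredentials.clientSecret | quote }}
{{- end }}
```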
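For the second option, note that the usual Helm checksum-annotation trick would not help here: the rendered template (and thus its checksum) never changes, because our hook modifies the live Secret out of band. Something that watches the Secret object itself, such as the stakater Reloader operator, would catch it. A sketch, assuming Reloader is installed in the cluster:

```yaml
# Annotation on the flytepropeller Deployment; requires the Reloader
# operator (https://github.com/stakater/Reloader) to be running.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flytepropeller
  annotations:
    # Restart this Deployment's pods whenever flyte-secret-auth changes.
    secret.reloader.stakater.com/reload: "flyte-secret-auth"
```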
Any thoughts on this? Happy to contribute here, but I'd like to discuss the best way forward with you first.