# flyte-deployment
c
Hi all, we have a self-hosted deployment of flyte-binary (on Oracle Cloud) and we started getting this error (only sometimes):
Workflow[flyte-tasks:production:some-workflow] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: worker error(s) encountered: [0]: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": EOF
It possibly happens mostly under higher load. I have a few questions:
1. Is there any way to increase or manage the retry attempts?
2. How can we find out why this actually failed? (Was the flyte-binary pod overloaded?)
3. What does the webhook actually do?
4. If the answer to [3] is not something important, can we disable it? (I noticed the `propeller.disableWebhook` configuration, or maybe we could delete the relevant webhook resource?)
Thanks!
f
Are you using secrets? If not, you can disable the webhook.
Also check resources, etc.
c
I didn't see anything specific with the resources: memory was high, but CPU was low.
Oddly, the flyte-binary pod has no resource requests/limits configured. How's that?
f
You did not configure it from the Helm chart?
Also, this explains why it's dying.
Please give it some memory.
a
@curved-petabyte-84246 running Flyte on OCI! I want to learn more 🙂 Seems like you're hitting `max-workflow-retries`, which is set to 10 by default. There must be a good reason the worker is running out of its retry budget, and I'd suggest using the Grafana dashboard to look for patterns, especially during high load, to understand it better. Adding more resources to the Pod can help, but the next question would be: how much to add? Let us know if that helps.
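For reference, a minimal sketch of how the retry budget mentioned above might be raised in a flyte-binary Helm values file. The nesting under `configuration.inline.propeller` is an assumption about where propeller overrides go in this chart, and the value 20 is purely illustrative; verify both against the chart version you deploy.

```yaml
# Hypothetical values override for the flyte-binary chart (not from the thread).
configuration:
  inline:
    propeller:
      # Default is 10 according to the message above; 20 is an example value only.
      max-workflow-retries: 20
```

A larger budget only gives transient webhook failures more chances to clear; it does not address why the webhook call fails in the first place.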
c
@average-finland-92144 thanks for getting back to me! I didn't find where I should add memory/cpu to the pod. I'm currently using the binary deployment and maybe with high load it's not good enough. Can I add replicas for the binary deployment?
a
> I didn't find where I should add memory/cpu to the pod
You can uncomment and adjust `deployment.resources` to override the default resources for the Pod: https://github.com/flyteorg/flyte/blob/6c4f8dbfc6d23a0cd7bf81480856e9ae1dfa1b27/charts/flyte-binary/values.yaml#L235-L240
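As a rough illustration, an uncommented `deployment.resources` block in a values override could look like the sketch below. The block follows the standard Kubernetes requests/limits shape; the specific CPU and memory figures are placeholder assumptions, not recommendations from the thread, and should be sized from observed usage.

```yaml
# Sketch of a flyte-binary values override; all numbers are illustrative.
deployment:
  resources:
    requests:
      cpu: 500m      # assumed starting point, tune from monitoring
      memory: 1Gi
    limits:
      memory: 2Gi    # memory was reported as high in this thread
```

Applying it would typically be a `helm upgrade` of the existing release with this values file, after which the Pod is recreated with the new requests and limits.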
> Can I add replicas for the binary deployment?
You can, but the single binary has no leader election mechanism enabled by default to properly handle multiple propeller instances. For scaling out, flyte-core has these mechanisms available.
c
@average-finland-92144 thanks! Re: leader-election - what do I care? there's a database no?
a
@curved-petabyte-84246 the leader election in propeller is not really about consistency: the controller itself is stateless and records execution state in etcd. The mechanism is used to ensure that, while there may be multiple replicas in the propeller deployment, only one instance is active at a time and the other(s) remain "warm" in case the leader fails. Without leader election, K8s would still recreate the propeller pod after a failure, but that could take a bit longer than just switching leaders. Also, the propeller replicas would be competing with each other, potentially trying to update the FlyteWorkflow CRD simultaneously.