Hi all we have a self hosted deployment of flyte binary on O Flyte #flyte-deployment

curved-petabyte-84246

10/09/2024, 9:18 AM

Hi all, we have a self-hosted deployment of flyte-binary (on Oracle Cloud) and we started getting this error (only sometimes):

Workflow[flyte-tasks:production:some-workflow] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: worker error(s) encountered: [0]: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": EOF

It's possible it happens mostly under higher load. I have a few questions:

1. Is there any way to increase or manage the retry attempts?
2. How can we understand why this actually failed? (Was the flyte-binary pod overloaded?)
3. What does the webhook actually do?
4. If the answer to [3] is not something important, can we disable it? (I noticed the `propeller.disableWebhook` configuration, or maybe we could delete the relevant webhook resource?)

Thanks!

freezing-airport-6809

10/09/2024, 1:49 PM

Are you using secrets? If not, you can disable the webhook.
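A minimal sketch of what disabling the webhook might look like in the flyte-binary Helm values, assuming no task relies on secrets. The `propeller.disableWebhook` key is the one mentioned in the question; nesting it under `configuration.inline` is an assumption about how the chart merges extra configuration, so verify it against your chart version before applying.

```yaml
# values.yaml override for the flyte-binary chart (sketch, not verified against a specific chart version)
configuration:
  inline:            # assumed location for extra Flyte configuration
    propeller:
      # Key taken from the question above; only reasonable if no task uses secrets
      disableWebhook: true
```

If secrets are introduced later, this override would need to be removed again.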

freezing-airport-6809

10/09/2024, 1:49 PM

Also check resources etc

curved-petabyte-84246

10/09/2024, 7:43 PM

I didn't see anything specific with the resources: memory was high, but CPU was low.

curved-petabyte-84246

10/09/2024, 7:43 PM

Oddly, the flyte-binary pod has no resource requests/limits configured. How's that?

freezing-airport-6809

10/10/2024, 5:07 AM

You did not configure it from the Helm chart?

freezing-airport-6809

10/10/2024, 5:07 AM

Also, this explains why it's dying.

freezing-airport-6809

10/10/2024, 5:07 AM

please give it some memory

average-finland-92144

10/16/2024, 10:32 AM

@curved-petabyte-84246 running Flyte on OCI! I want to learn more 🙂 Seems like you're hitting `max-workflow-retries`, which is set to 10 by default. There must be a good reason the worker is running out of its retry budget, and I'd suggest using the Grafana dashboard to look at patterns, especially during high load, to understand it better. Adding more resources to the Pod can help, but the next question would be: how much to add? Let us know if that helps.
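For the first question (managing the retry attempts), a hedged sketch of raising `max-workflow-retries` through the Helm values. The placement under `configuration.inline.propeller` is an assumption about the flyte-binary chart layout, and the value 20 is purely illustrative. Raising the budget only hides the symptom if the webhook calls keep failing.

```yaml
# values.yaml override (sketch; key placement is an assumption, check your chart version)
configuration:
  inline:
    propeller:
      # Default is 10 according to the message above; 20 is an arbitrary example value
      max-workflow-retries: 20
```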

curved-petabyte-84246

10/16/2024, 12:52 PM

@average-finland-92144 thanks for getting back to me! I didn't find where I should add memory/cpu to the pod. I'm currently using the binary deployment and maybe with high load it's not good enough. Can I add replicas for the binary deployment?

average-finland-92144

10/17/2024, 10:59 AM

> I didn't find where I should add memory/cpu to the pod

You can uncomment and adjust `deployment.resources` to override default resources for the Pod: https://github.com/flyteorg/flyte/blob/6c4f8dbfc6d23a0cd7bf81480856e9ae1dfa1b27/charts/flyte-binary/values.yaml#L235-L240
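As a rough sketch, an override of `deployment.resources` could look like the following. It uses the standard Kubernetes requests/limits structure; the numbers are placeholders, not recommendations, and should come from observed usage (for example, the Grafana dashboard mentioned earlier).

```yaml
# values.yaml override for the flyte-binary chart (values are illustrative placeholders)
deployment:
  resources:
    requests:
      cpu: 500m      # placeholder: size from observed CPU usage
      memory: 1Gi    # placeholder: size from observed memory usage
    limits:
      memory: 2Gi    # placeholder: the container is OOM-killed if it exceeds this
```

Setting a memory request helps the scheduler place the Pod on a node with enough headroom, which matters here because memory was reported as high.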

> Can I add replicas for the binary deployment?

You can, but in the single binary there is no leader election mechanism enabled by default to properly handle multiple propeller instances. For scaling out, flyte-core has these mechanisms available.

curved-petabyte-84246

10/31/2024, 3:12 PM

@average-finland-92144 thanks! Re: leader election, why should I care? There's a database, no?

average-finland-92144

10/31/2024, 5:06 PM

@curved-petabyte-84246 the leader election in propeller is not really about consistency, because the controller itself is stateless: it records execution state in `etcd`. The mechanism is used to ensure that, while there may be multiple `replicas` in the propeller deployment, only one instance is active at a time and the other(s) remain "warm" in case the leader fails. Without leader election, K8s would still recreate the propeller pod in case of a failure, but that could take a bit longer than just switching leaders. Also, the propeller replicas would be competing with each other, potentially trying to update the FlyteWorkflow CRD simultaneously.
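For illustration, a sketch of what a propeller leader-election section might look like in a flyte-core style configuration. The key names and durations below are recalled from FlytePropeller's configuration and are assumptions to verify against the version you run; the ConfigMap name and namespace are hypothetical.

```yaml
# Propeller configuration fragment (sketch; verify key names against your Flyte version)
propeller:
  leader-election:
    enabled: true
    lock-config-map:
      name: propeller-leader   # hypothetical lock object name
      namespace: flyte         # hypothetical namespace
    lease-duration: 15s        # how long a leader holds the lease before it must renew
    renew-deadline: 10s        # how long the leader keeps trying to renew before giving up
    retry-period: 2s           # how often non-leaders retry acquiring the lease
```

With settings like these, only the replica holding the lease processes workflows, which avoids the competing CRD updates described above.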
