curved-petabyte-84246
10/09/2024, 9:18 AM
Workflow[flyte-tasks:production:some-workflow] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: worker error(s) encountered: [0]: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": EOF
It may happen mostly under higher load.
I have a few questions:
1. Any way to increase or manage the retry attempts?
2. How can we understand why this actually failed? (the flyte-binary pod was overloaded?)
3. What does the webhook actually do?
4. If the answer to [3] is not something important, can we disable it? (I noticed the propeller.disableWebhook configuration, or maybe we could delete the relevant webhook resource?)
Thanks!
average-finland-92144
10/17/2024, 10:59 AM
> I didn't find where I should add memory/cpu to the pod
You can uncomment and adjust `deployment.resources` to override the default resources for the Pod: https://github.com/flyteorg/flyte/blob/6c4f8dbfc6d23a0cd7bf81480856e9ae1dfa1b27/charts/flyte-binary/values.yaml#L235-L240
> Can I add replicas for the binary deployment?
You can, but the single binary has no leader election mechanism enabled by default to properly handle multiple propeller instances. For scaling out, flyte-core has these mechanisms available.
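As a rough illustration of the `deployment.resources` override mentioned above, the sketch below builds the values structure in Python and prints it as JSON. It assumes the chart passes `deployment.resources` through as a standard Kubernetes `requests`/`limits` block; the helper name `build_resource_override` and all CPU and memory figures are placeholders, not recommendations from this thread.

```python
import json


def build_resource_override(cpu_request, memory_request, cpu_limit, memory_limit):
    """Return a Helm values override that sets deployment.resources.

    Assumption: the chart forwards this block unchanged to the container's
    Kubernetes resource spec (requests/limits with cpu and memory).
    """
    return {
        "deployment": {
            "resources": {
                "requests": {"cpu": cpu_request, "memory": memory_request},
                "limits": {"cpu": cpu_limit, "memory": memory_limit},
            }
        }
    }


# Placeholder sizes; pick real values from the pod's observed usage.
override = build_resource_override("1", "2Gi", "2", "4Gi")
override_json = json.dumps(override, indent=2)
print(override_json)
```

Since JSON is also valid YAML, the printed output could be saved to a file and passed to Helm with `-f` alongside the existing values file, which avoids editing the chart's values.yaml in place.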
curved-petabyte-84246
10/31/2024, 3:12 PM
average-finland-92144
10/31/2024, 5:06 PM
etcd
The mechanism is used to ensure that, while there may be multiple replicas in the propeller deployment, only one instance is active at a time and the other(s) remain "warm" in case the leader fails.
Without leader election, K8s would still recreate the propeller pod after a failure, but that could take a bit longer than just switching leaders. Also, the propeller replicas would compete with each other, potentially trying to update the FlyteWorkflow CRD simultaneously.
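The leader/standby behavior described above can be illustrated with a small lease-based simulation. This is only a sketch of the general technique, not propeller's actual implementation: the `Lease` and `Replica` classes, the replica names, and the 15-second lease duration are all hypothetical, and a real deployment would keep the lease in a shared store with atomic updates rather than in process memory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """Shared lock record; stands in for an object in a shared store."""
    holder: Optional[str] = None
    expires_at: float = 0.0


class Replica:
    """One hypothetical replica competing for the lease."""

    def __init__(self, name: str, lease: Lease, lease_duration: float) -> None:
        self.name = name
        self.lease = lease
        self.lease_duration = lease_duration

    def try_acquire_or_renew(self, now: float) -> bool:
        """Return True if this replica holds the lease at time `now`."""
        free = self.lease.holder is None
        mine = self.lease.holder == self.name
        expired = now >= self.lease.expires_at
        # A real implementation must make this check-and-set atomic.
        if free or mine or expired:
            self.lease.holder = self.name
            self.lease.expires_at = now + self.lease_duration
            return True
        return False  # stay warm and retry later


lease = Lease()
a = Replica("propeller-a", lease, lease_duration=15.0)
b = Replica("propeller-b", lease, lease_duration=15.0)

timeline = []
timeline.append((0, a.try_acquire_or_renew(0.0), b.try_acquire_or_renew(0.0)))
timeline.append((10, a.try_acquire_or_renew(10.0), b.try_acquire_or_renew(10.0)))
# propeller-a crashes after t=10 and stops renewing (None = not running).
timeline.append((20, None, b.try_acquire_or_renew(20.0)))
timeline.append((26, None, b.try_acquire_or_renew(26.0)))

for t, a_leads, b_leads in timeline:
    print(f"t={t:>2}s  a_leads={a_leads}  b_leads={b_leads}")
```

In this run only one replica ever holds the lease: the standby is refused while the leader keeps renewing, and it takes over once the last renewal has expired. Because every write goes through the single lease holder, the replicas never update shared state at the same time.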