Victor Delépine
11/08/2023, 6:54 PM
We have max_parallelism set to 500. We're sometimes seeing large spikes of errors in propeller, with various error messages:
Workflow[<redacted>] failed. RuntimeExecutionError: max number of system retry attempts [31/30] exhausted. Last known status message: [SystemError] failed to launch workflow [<redacted>], system error, caused by: rpc error: code = DeadlineExceeded desc = context deadline exceeded
This error causes the whole workflow to fail ^
We also see this:
failed Execute for node. Error: EventSinkError: Error sending event, caused by [rpc error: code = DeadlineExceeded desc = context deadline exceeded]
and these warnings, though maybe they're unrelated:
Failed to record taskEvent, error [AlreadyExists: Event already exists, caused by [rpc error: code = AlreadyExists desc = have already recorded task execution phase ABORTED (version: 0) for {resource_type:TASK project:\"<project>\" domain:\"production\" name:\"<workflow>\" version:\"6KLmT5WBECbwj7w_fSLTPw==\" node_id:\"n0-0-dn2121\" execution_id:\u003cproject:\"<project>\" domain:\"<domain>\" name:\"f0a0d716e3d924aa7a1e\" \u003e 1 {} [] 0}]]. Trying to record state: ABORTED. Ignoring this error!"
Does anyone know what could be the issue?
cc @Thomas Newton
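For context, max_parallelism caps how many nodes propeller will run concurrently within a single execution. Below is a minimal sketch of setting it on a launch plan, assuming the installed flytekit release supports the max_parallelism argument on LaunchPlan.get_or_create; the task and workflow names here are illustrative, not from this thread:

from flytekit import LaunchPlan, task, workflow


@task
def work(x: int) -> int:
    # Placeholder task; the real workflow in this thread is redacted.
    return x * 2


@workflow
def wide_workflow(x: int = 1) -> int:
    return work(x=x)


# Cap the number of nodes propeller runs concurrently for executions of this
# launch plan at 500 (assumption: max_parallelism is accepted by the installed
# flytekit release).
wide_lp = LaunchPlan.get_or_create(
    workflow=wide_workflow,
    name="wide_workflow_lp",
    max_parallelism=500,
)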
Thomas Newton
11/08/2023, 7:56 PM
Ketan (kumare3)
Thomas Newton
11/09/2023, 12:55 AM
Ketan (kumare3)
Thomas Newton
11/09/2023, 7:22 AM
Victor Delépine
11/09/2023, 7:23 AM
Thomas Newton
11/09/2023, 7:50 AM
{"json":{"exec_id":"anbwtd2v2x5p8px67rzt","ns":"nerf-test-production","routine":"worker-672"},"level":"error","msg":"Failed to update workflow. Error [etcdserver: request is too large]","ts":"2023-11-09T11:37:18Z"}
E1109 11:37:18.581775 1 workers.go:103] error syncing '<project-domain>/anbwtd2v2x5p8px67rzt': etcdserver: request is too large
David Espejo (he/him)
11/09/2023, 4:56 PM
Victor Delépine
11/09/2023, 5:00 PM
Thomas Newton
11/09/2023, 5:03 PM
Ketan (kumare3)
Thomas Newton
11/10/2023, 8:35 PM
@workflow(failure_policy=WorkflowFailurePolicy.FAIL_AFTER_EXECUTABLE_NODES_COMPLETE). Maybe I should have created a better example rather than giving you back the example you gave us 😅
kubectl get flyteworkflows f0319e66fb12740eaa4d -o yaml > etcd_sync_failure.yaml
The file size is 1.9MB. If I use a python script to delete all the message fields, the size is only 0.38MB. I think it's quite likely that tracebacks in the CRDs are pushing them over the edge to be too big. The useOffloadedWorkflowClosure option won't help us much because the majority of the CRD size is non-static given our use of dynamic workflows.
Is there any possibility of improving this? For example, I tried hacking something together to prevent the tracebacks from getting into the CRD. That helped a bit, but the resulting CRD was still much larger than one where everything succeeded, and it probably only allows maybe 4X greater scalability (based on the CRD size numbers I got above). Would it be possible to store the CRD in a more compressed format? My best guess is that currently the CRD data is stored as YAML or JSON, but if it was stored as a protobuf the size would probably be much smaller.
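A minimal sketch of the kind of script described above, which strips every message field (task error text and tracebacks) out of the dumped FlyteWorkflow YAML and compares file sizes. PyYAML and the output filename are assumptions; the actual script may differ:

import os
import yaml  # assumes PyYAML is installed


def strip_message_fields(obj):
    # Recursively drop any 'message' keys (e.g. task error tracebacks).
    if isinstance(obj, dict):
        return {k: strip_message_fields(v) for k, v in obj.items() if k != "message"}
    if isinstance(obj, list):
        return [strip_message_fields(v) for v in obj]
    return obj


with open("etcd_sync_failure.yaml") as f:
    crd = yaml.safe_load(f)

stripped = strip_message_fields(crd)
with open("etcd_sync_failure_stripped.yaml", "w") as f:
    yaml.safe_dump(stripped, f)

print("original:", os.path.getsize("etcd_sync_failure.yaml"), "bytes")
print("stripped:", os.path.getsize("etcd_sync_failure_stripped.yaml"), "bytes")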
Ketan (kumare3)
Thomas Newton
11/12/2023, 7:21 PM
The spec field is null when using offload, but in this case that is quite a small portion of the total CRD.
I think we will need to come up with a solution for this. Possibly we could make a contribution to Flyte. Would you mind explaining why etcd is used for storing this? What updates the CRD? Is it just flytepropeller, or does flyteadmin also make updates? Would it be feasible to use something like a k8s persistent volume claim?
It's quite likely I will experiment with some of my ideas, so if you can provide any insight on whether they are feasible that would be much appreciated 🙂.
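To see which parts of the dumped CRD dominate its size, a quick sketch that re-serializes each top-level field of the FlyteWorkflow object and prints the results, largest first. PyYAML and the etcd_sync_failure.yaml filename are assumptions:

import yaml  # assumes PyYAML is installed

with open("etcd_sync_failure.yaml") as f:
    crd = yaml.safe_load(f)

# Size of each top-level field (e.g. spec, status, ...) when re-serialized.
sizes = {key: len(yaml.safe_dump(value)) for key, value in crd.items()}
for key, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{key}: {size} bytes")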
Ketan (kumare3)
Is map task ideal? Map tasks are designed to automatically compress the storage usage in etcd.
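For reference, a minimal flytekit sketch of the map task pattern being suggested here; the task and input names are illustrative, not taken from the thread:

from typing import List

from flytekit import map_task, task, workflow


@task
def process_one(x: int) -> int:
    # Stand-in for the real per-item work, which is redacted in the thread.
    return x * 2


@workflow
def fanout_with_map(xs: List[int]) -> List[int]:
    # A single map-task node fans out over the whole list, so propeller keeps
    # one compact node entry in the CRD rather than thousands of separate nodes.
    return map_task(process_one)(x=xs)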
Thomas Newton
11/13/2023, 12:58 PM
In this execution (fdf9c218e2eaa400083d), the first layer of fanout spawned 2000 tasks. The second layer of fanout spawned 2124 copies of the core_workflow.
We've been able to run the same workflow structure successfully with 50,000 copies of the core_workflow, if the core_workflow always succeeds.
Ketan (kumare3)
Thomas Newton
11/13/2023, 5:50 PM
Ketan (kumare3)
Thomas Newton
11/13/2023, 7:09 PM
Regarding the context deadline exceeded errors we saw first: I'm reasonably confident these errors are coming from networking issues between our control plane cluster and the data plane cluster. When I run the same workflow all in the same cluster as the control plane I haven't been able to reproduce the issue. I'm hoping that scaling up our nginx ingress controller will resolve the context deadline exceeded errors.
Dan Rammer (hamersaw)
11/29/2023, 8:54 PM