flat-exabyte-79377
11/08/2023, 6:54 PM
We have max_parallelism set to 500. We're sometimes seeing large spikes of errors in propeller, with various error messages:
Workflow[<redacted>] failed. RuntimeExecutionError: max number of system retry attempts [31/30] exhausted. Last known status message: [SystemError] failed to launch workflow [<redacted>], system error, caused by: rpc error: code = DeadlineExceeded desc = context deadline exceeded
This error causes the whole workflow to fail ^
We also see this:
failed Execute for node. Error: EventSinkError: Error sending event, caused by [rpc error: code = DeadlineExceeded desc = context deadline exceeded]
and these warnings, but maybe that's unrelated:
Failed to record taskEvent, error [AlreadyExists: Event already exists, caused by [rpc error: code = AlreadyExists desc = have already recorded task execution phase ABORTED (version: 0) for {resource_type:TASK project:\"<project>\" domain:\"production\" name:\"<workflow>\" version:\"6KLmT5WBECbwj7w_fSLTPw==\" node_id:\"n0-0-dn2121\" execution_id:\u003cproject:\"<project>\" domain:\"<domain>\" name:\"f0a0d716e3d924aa7a1e\" \u003e 1 {} [] 0}]]. Trying to record state: ABORTED. Ignoring this error!"
Does anyone know what could be the issue?
cc @calm-pilot-2010
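(For reference, a minimal sketch of how a parallelism cap like this is usually attached to a launch plan in flytekit; the task, workflow, and launch plan names here are hypothetical stand-ins for the redacted ones.)
```python
from flytekit import LaunchPlan, task, workflow


@task
def step(i: int) -> int:
    # placeholder task standing in for the real (redacted) work
    return i * 2


@workflow
def my_wf(i: int = 0) -> int:
    return step(i=i)


# Cap the number of nodes propeller will run concurrently for executions of
# this launch plan, matching the max_parallelism of 500 mentioned above.
my_wf_lp = LaunchPlan.get_or_create(
    workflow=my_wf,
    name="my_wf_lp",
    max_parallelism=500,
)
```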

calm-pilot-2010
11/08/2023, 7:56 PM

freezing-airport-6809

calm-pilot-2010
11/09/2023, 12:55 AM

calm-pilot-2010
11/09/2023, 1:31 AM

freezing-airport-6809

calm-pilot-2010
11/09/2023, 7:22 AM

flat-exabyte-79377
11/09/2023, 7:23 AM

calm-pilot-2010
11/09/2023, 7:50 AM

calm-pilot-2010
11/09/2023, 11:39 AM
{"json":{"exec_id":"anbwtd2v2x5p8px67rzt","ns":"nerf-test-production","routine":"worker-672"},"level":"error","msg":"Failed to update workflow. Error [etcdserver: request is too large]","ts":"2023-11-09T11:37:18Z"}
E1109 11:37:18.581775 1 workers.go:103] error syncing '<project-domain>/anbwtd2v2x5p8px67rzt': etcdserver: request is too large

calm-pilot-2010
11/09/2023, 11:46 AM

average-finland-92144
11/09/2023, 4:56 PM

flat-exabyte-79377
11/09/2023, 5:00 PM

calm-pilot-2010
11/09/2023, 5:03 PM

calm-pilot-2010
11/09/2023, 8:31 PM

freezing-airport-6809

calm-pilot-2010
11/10/2023, 8:35 PM
@workflow(failure_policy=WorkflowFailurePolicy.FAIL_AFTER_EXECUTABLE_NODES_COMPLETE). Maybe I should have created a better example rather than giving you back the example you gave us 😅
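(A minimal sketch of the failure policy being discussed, assuming standard flytekit usage; `might_fail` and `tolerant_wf` are hypothetical names.)
```python
from flytekit import WorkflowFailurePolicy, task, workflow


@task
def might_fail(i: int) -> int:
    if i == 3:
        raise ValueError("simulated failure")
    return i * 2


# With FAIL_AFTER_EXECUTABLE_NODES_COMPLETE, one failed node does not abort the
# rest of the workflow: independently runnable nodes are allowed to finish
# before the workflow as a whole is marked failed.
@workflow(failure_policy=WorkflowFailurePolicy.FAIL_AFTER_EXECUTABLE_NODES_COMPLETE)
def tolerant_wf():
    for i in range(5):
        might_fail(i=i)
```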

calm-pilot-2010
11/10/2023, 8:37 PM

calm-pilot-2010
11/10/2023, 10:25 PM
kubectl get flyteworkflows f0319e66fb12740eaa4d -o yaml > etcd_sync_failure.yaml
The file size is 1.9MB. If I use a python script to delete all the message fields, the size is only 0.38MB. I think it's quite likely that tracebacks in the CRDs are pushing them over the edge to be too big.
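(A sketch of the kind of script that comparison could be done with, not necessarily the exact one used: recursively drop every message field, which is where error text and tracebacks appear to end up, and compare the re-serialized sizes.)
```python
# Assumes the CRD was dumped with:
#   kubectl get flyteworkflows f0319e66fb12740eaa4d -o yaml > etcd_sync_failure.yaml
import yaml


def strip_messages(obj):
    """Recursively drop 'message' keys (error text / tracebacks) from the CRD."""
    if isinstance(obj, dict):
        return {k: strip_messages(v) for k, v in obj.items() if k != "message"}
    if isinstance(obj, list):
        return [strip_messages(v) for v in obj]
    return obj


with open("etcd_sync_failure.yaml") as f:
    crd = yaml.safe_load(f)

original = yaml.safe_dump(crd)
stripped = yaml.safe_dump(strip_messages(crd))
print(f"original: {len(original) / 1e6:.2f} MB, "
      f"without message fields: {len(stripped) / 1e6:.2f} MB")
```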

calm-pilot-2010
11/12/2023, 7:06 PM
The useOffloadedWorkflowClosure option won't help us much because the majority of the CRD size is non-static, given our use of dynamic workflows.
Is there any possibility of improving this? For example, I tried hacking something together to prevent the tracebacks from getting into the CRD. That helped a bit, but the resulting CRD was still much larger than one where everything succeeded, and it probably only allows maybe 4X greater scalability (based on the CRD size numbers I got above). Would it be possible to store the CRD in a more compressed format? My best guess is that currently the CRD data is stored as YAML or JSON, but if it were stored as a protobuf the size would probably be much smaller.
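(A crude way to gauge what a more compact encoding might buy is to gzip the dumped CRD and compare sizes. This is only a proxy, since what matters is the object the API server writes to etcd, but it gives a rough upper bound on what compression alone could save.)
```python
import gzip
from pathlib import Path

# Compare the raw dump with a gzip-compressed copy as a rough proxy for how
# much a compressed/binary storage format could shrink the CRD.
raw = Path("etcd_sync_failure.yaml").read_bytes()
packed = gzip.compress(raw)
print(f"raw: {len(raw) / 1e6:.2f} MB, gzipped: {len(packed) / 1e6:.2f} MB")
```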

freezing-airport-6809

calm-pilot-2010
11/12/2023, 7:21 PM

calm-pilot-2010
11/12/2023, 11:30 PM
The spec field is null when using offload, but in this case that is quite a small portion of the total CRD.
I think we will need to come up with a solution for this. Possibly we could make a contribution to Flyte. Would you mind explaining why etcd is used for storing this? What updates the CRD? Is it just flytepropeller, or does flyteadmin also make updates? Would it be feasible to use something like a k8s persistent volume claim?
It's quite likely I will experiment with some of my ideas, so if you can provide any insight on whether they are feasible, that would be much appreciated 🙂.

freezing-airport-6809
Is map task ideal? Map tasks are designed to automatically compress the storage usage in etcd.
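(A minimal sketch of what that looks like in flytekit; `process` and `mapped_wf` are hypothetical names. A map task is a single node in the FlyteWorkflow CRD rather than one node per element, which is what keeps the etcd footprint small.)
```python
from typing import List

from flytekit import map_task, task, workflow


@task
def process(i: int) -> int:
    return i * 2


@workflow
def mapped_wf(xs: List[int]) -> List[int]:
    # One map-task node fans out over the whole list instead of creating a
    # separate node (and CRD entry) per element.
    return map_task(process)(i=xs)
```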

freezing-airport-6809

calm-pilot-2010
11/13/2023, 12:58 PM
In this execution (fdf9c218e2eaa400083d), the first layer of fan-out spawned 2000 tasks. The second layer of fan-out spawned 2124 copies of the core_workflow.
We've been able to run the same workflow structure successfully with 50,000 copies of the core_workflow if the core_workflow always succeeds.
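(Roughly the shape being described, simplified to a single layer of fan-out and with hypothetical names, since the real workflow is redacted.)
```python
from typing import List

from flytekit import dynamic, task, workflow


@task
def core_step(i: int) -> int:
    # stand-in for the real work inside core_workflow
    return i * 2


@workflow
def core_workflow(i: int) -> int:
    return core_step(i=i)


@dynamic
def fan_out(copies: int) -> List[int]:
    # Each iteration adds a core_workflow sub-workflow node under the parent,
    # which is why error messages across thousands of copies add up quickly.
    return [core_workflow(i=i) for i in range(copies)]


@workflow
def top_level(copies: int = 2124) -> List[int]:
    return fan_out(copies=copies)
```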

freezing-airport-6809

calm-pilot-2010
11/13/2023, 5:50 PM

freezing-airport-6809

calm-pilot-2010
11/13/2023, 7:09 PM

calm-pilot-2010
11/17/2023, 10:10 AM

calm-pilot-2010
11/21/2023, 2:15 PM
An update on the context deadline exceeded errors we saw first:
I'm reasonably confident these context deadline exceeded errors are coming from networking issues between our control plane cluster and the data plane cluster. When I run the same workflow all in the same cluster as the control plane, I haven't been able to reproduce the issue. I'm hoping that scaling up our nginx ingress controller will resolve the context deadline exceeded errors.

calm-pilot-2010
11/29/2023, 7:53 PM

hallowed-mouse-14616
11/29/2023, 8:54 PM

hallowed-mouse-14616
11/29/2023, 8:56 PM