rough-rose-81585
08/31/2022, 1:33 AM
Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution
I feel like it is related to this ticket. My main question, however, is:
When something like this happens, what is the best way to reset flytepropeller? I tried restarting the pod, but it seems like it is a problem with the database state?
hallowed-mouse-14616
08/31/2022, 2:19 AM
hallowed-mouse-14616
08/31/2022, 2:20 AM
hallowed-mouse-14616
08/31/2022, 2:21 AM
hallowed-mouse-14616
08/31/2022, 2:25 AM
"Failed to update workflow. Error [%v]" or "Failed storing workflow to the store, reason: %s"? I'm wondering if the task failed, the CRD was too large to update (or failed to update for some other reason), and then you manually aborted. This sequence could explain what you're seeing.
rough-rose-81585
08/31/2022, 11:41 AM
kubectl get fly -o yaml.
I do not see either of those lines in our flytepropeller/flyteadmin logs.
Some other info about the run.
• Contour seems to have had an issue pulling images during the run. I suspect this was because we scaled out to 900 nodes fairly quickly and were being throttled. We’ve since updated our deployment to have the contour image (and all external images) copied into our internal registry. Not sure if the contour failure caused Flyte to get into this strange state, but seems plausible.
• This workflow is fairly large and has multiple dynamic workflows and tasks. One dynamic workflow in particular spawns 1280 tasks per input file.
• I’ve limited the number of input files per workflow invocation to 10. For this particular run we had 150 input files.
• So we had 15 launchplan invocations, 10 files per workflow, 1280+ tasks per workflow. This utilized 10,000 CPUs across ~900 nodes. It was supposed to take ~7 hours to complete.
gifted-raincoat-59712
08/31/2022, 4:48 PM
rough-rose-81585
08/31/2022, 7:14 PM
fly crds that are causing the log spam.
Two attempt to delete but get hung waiting for the flyte-finalizer to complete.
One gets an error from etcd that the request is too large:
kubectl delete fly f8114eedda3854878b11
Error from server: etcdserver: request is too large
rough-rose-81585
08/31/2022, 7:23 PM
hallowed-mouse-14616
08/31/2022, 9:30 PM
hallowed-mouse-14616
08/31/2022, 9:31 PM
gifted-raincoat-59712
09/01/2022, 7:01 PM
kubectl delete fly -n dpp-default f8114eedda3854878b11
Error from server: etcdserver: request is too large
kubectl get fly -n dpp-default f8114eedda3854878b11 -o yaml > flyte-fly.yaml
stat -f %z flyte-fly.yaml
2825235
wc -l flyte-fly.yaml
61110 flyte-fly.yaml
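For a sense of scale on the size above: etcd rejects requests larger than its --max-request-bytes setting, which defaults to 1.5 MiB (1,572,864 bytes) unless the cluster overrides it. The YAML dump is not byte-for-byte what etcd stores, so this is only a rough comparison, but it suggests the object is roughly 1.8 times the default limit. A minimal sketch of that check, where the constant and function name are illustrative and not part of Flyte or kubectl:

```python
# Assumption: etcd's default --max-request-bytes (1.5 MiB). A cluster may
# be configured with a different value, so treat this as a rough guide.
ETCD_DEFAULT_MAX_REQUEST_BYTES = 1536 * 1024


def exceeds_etcd_limit(size_bytes, limit=ETCD_DEFAULT_MAX_REQUEST_BYTES):
    """Return True when a payload of size_bytes is larger than the limit."""
    return size_bytes > limit


# Size reported by `stat` for flyte-fly.yaml in the message above.
crd_size = 2825235

print("over the default limit:", exceeds_etcd_limit(crd_size))
print("ratio to the default limit:", round(crd_size / ETCD_DEFAULT_MAX_REQUEST_BYTES, 2))
```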
gifted-raincoat-59712
09/01/2022, 7:01 PM
status.nodeStatus.subNodeStatus
gifted-raincoat-59712
09/01/2022, 7:03 PM
kubectl edit fly -n dpp-default f8114eedda3854878b11
# set status.nodeStatus.subNodeStatus: {}
# delete original content
# save
flyteworkflow.flyte.lyft.com/f8114eedda3854878b11 edited
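The manual edit above can be tried offline on a JSON dump of the manifest (for example from `kubectl get fly ... -o json`) to see how much clearing that one field shrinks the object. This is only a sketch under assumptions: the helper names are made up, the miniature manifest is invented for illustration, and the path is taken literally from the message, so the real layout under status.nodeStatus may be nested differently.

```python
import copy
import json


def clear_path(manifest, path):
    """Return a copy of manifest with the value at path replaced by {}."""
    trimmed = copy.deepcopy(manifest)
    node = trimmed
    for key in path[:-1]:
        node = node[key]
    node[path[-1]] = {}
    return trimmed


def serialized_size(obj):
    """Size in bytes of obj when serialized as JSON."""
    return len(json.dumps(obj).encode("utf-8"))


# Invented stand-in for the real manifest: many sub-node entries under
# the field named in the message above.
manifest = {
    "metadata": {"name": "f8114eedda3854878b11", "namespace": "dpp-default"},
    "status": {
        "nodeStatus": {
            "subNodeStatus": {f"n{i}": {"phase": 5} for i in range(1280)},
        },
    },
}

trimmed = clear_path(manifest, ["status", "nodeStatus", "subNodeStatus"])
print(serialized_size(manifest), "->", serialized_size(trimmed))
```

The original dictionary is left untouched, so the before and after sizes can be compared before deciding whether to apply the same change to the live object.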
gifted-raincoat-59712
09/01/2022, 7:03 PM
kubectl delete fly -n dpp-default f8114eedda3854878b11
flyteworkflow.flyte.lyft.com "f8114eedda3854878b11" deleted
gifted-raincoat-59712
09/01/2022, 7:06 PM
{"json":{"exec_id":"f8114eedda3854878b11","ns":"dpp-default","routine":"worker-43","src":"passthrough.go:95"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com \"f8114eedda3854878b11\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2022-09-01T18:59:08Z"}
E0901 18:59:08.375477 1 workers.go:102] error syncing 'dpp-default/f8114eedda3854878b11': Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com "f8114eedda3854878b11": the object has been modified; please apply your changes to the latest version and try again
{"json":{"exec_id":"f8114eedda3854878b11","node":"n0","ns":"dpp-default","res_ver":"71132693","routine":"worker-43","src":"handler.go:338","wf":"dpp:default:msat.level2.workflow.level2_wf"},"level":"warning","msg":"No plugin found for Handler-type [python-task], defaulting to [container]","ts":"2022-09-01T18:59:08Z"}
{"json":{"exec_id":"f8114eedda3854878b11","ns":"dpp-default","routine":"worker-45","src":"passthrough.go:39"},"level":"warning","msg":"Workflow not found in cache.","ts":"2022-09-01T18:59:59Z"}
{"json":{"exec_id":"f8114eedda3854878b11","ns":"dpp-default","routine":"worker-45","src":"handler.go:176"},"level":"warning","msg":"Workflow namespace[dpp-default]/name[f8114eedda3854878b11] not found, may be deleted.","ts":"2022-09-01T18:59:59Z"}
hallowed-mouse-14616
09/01/2022, 8:46 PM
rough-rose-81585
09/02/2022, 4:58 PM
hallowed-mouse-14616
09/02/2022, 5:04 PM