Hi we recently had a few large workflows get in a bad state Flyte #flyte-deployment

rough-rose-81585

08/31/2022, 1:33 AM

Hi we recently had a few large workflows get in a bad state. I let them run for a long time and eventually had to abort a few. Nothing is running anymore but the flytepropeller logs are flooded with the following

Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution

I feel like it is related to this ticket. My main question, however, is: when something like this happens, what is the best way to reset flytepropeller? I tried restarting the pod, but it seems like it is a problem with the database state?

hallowed-mouse-14616

08/31/2022, 2:19 AM

Hey @rough-rose-81585! This is interesting. Is the FlyteWorkflow CRD still around? It would be interesting to look at a dump of this. Basically, what is happening here is that FlytePropeller is attempting to abort a task in the workflow. As part of that operation it sends an event to FlyteAdmin to update the task state in the UI. FlyteAdmin is responding that the task has already FAILED (as stored in the DB), so it can not be updated to ABORTED (an invalid transition).
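
A rough sketch of the rejection described above. This is not FlyteAdmin's actual code: the `Phase` enum, `TERMINAL_PHASES`, `InvalidPhaseChange` and `validate_transition` are hypothetical names, and the only rule modelled is the one stated in the message, namely that a task execution already stored in a terminal phase cannot be moved to a different phase.

```python
from enum import Enum


class Phase(Enum):
    # Hypothetical subset of task phases, for illustration only.
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    ABORTED = "ABORTED"


# Assumption: these phases are final and cannot be overwritten.
TERMINAL_PHASES = {Phase.SUCCEEDED, Phase.FAILED, Phase.ABORTED}


class InvalidPhaseChange(Exception):
    """Raised when an event would overwrite a stored terminal phase."""


def validate_transition(stored: Phase, incoming: Phase) -> None:
    """Reject an event whose phase conflicts with a stored terminal phase."""
    if stored in TERMINAL_PHASES and incoming != stored:
        raise InvalidPhaseChange(
            f"invalid phase change from {stored.value} to {incoming.value}"
        )


# Accepted: RUNNING is not terminal, so it may still move to ABORTED.
validate_transition(Phase.RUNNING, Phase.ABORTED)

# Rejected: the stored phase is already FAILED.
try:
    validate_transition(Phase.FAILED, Phase.ABORTED)
except InvalidPhaseChange as err:
    print(err)  # invalid phase change from FAILED to ABORTED
```

Under this model, a retry of the same abort event can never succeed, which would match the repeated log line as long as the FlyteWorkflow object keeps asking for the abort.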

hallowed-mouse-14616

08/31/2022, 2:20 AM

If you can verify that all of the Pods in the workflow have been aborted and deleted (finalizers removed if necessary), then you should be able to delete the FlyteWorkflow CRD manually and that will stop FlytePropeller from attempting to re-abort the workflow.

hallowed-mouse-14616

08/31/2022, 2:21 AM

However, this is very unintended behavior and I would like to figure out how we entered this state to ensure it doesn't happen in the future.

hallowed-mouse-14616

08/31/2022, 2:25 AM

Is there anything in the logs about failing to update the workflow CRD? Something like `Failed to update workflow. Error [%v]` or `Failed storing workflow to the store, reason: %s`? I'm wondering if the task failed, the CRD was too large to update (or failed to update for some other reason), and then you manually aborted. This sequence could explain what you're seeing.

rough-rose-81585

08/31/2022, 11:41 AM

Hi Dan, thanks for the info. I’ve sent you one of the CRDs I retrieved with `kubectl get fly -o yaml`.

I do not see either of those lines in our flytepropeller/flyteadmin logs. Some other info about the run:
• Contour seems to have had an issue pulling images during the run. I suspect this was because we scaled out to 900 nodes fairly quickly and were being throttled. We’ve since updated our deployment to have the contour image (and all external images) copied into our internal registry. Not sure if the contour failure caused Flyte to get into this strange state, but it seems plausible.
• This workflow is fairly large and has multiple dynamic workflows and tasks. One dynamic workflow in particular spawns 1280 tasks per input file.
• I’ve limited the number of input files per workflow invocation to 10. For this particular run we had 150 input files.
• So we had 15 launchplan invocations, 10 files per workflow, 1280+ tasks per workflow. This utilized 10,000 CPU across ~900 nodes. It was supposed to take around 7 hours to complete.
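
The scale of the run can be worked out from the figures above. This is only a back-of-the-envelope sketch: it takes the "1280 tasks per input file" figure at face value and counts only that one dynamic workflow, so it is a lower bound, and the "1280+ tasks per workflow" in the summary may have been meant differently. The variable names are made up for the example.

```python
# Figures quoted in the message above.
input_files = 150
files_per_invocation = 10
tasks_per_file = 1280  # from the one dynamic workflow that fans out per input file

# Derived quantities (assumption: every file triggers the full fan-out).
invocations = input_files // files_per_invocation
tasks_per_workflow = files_per_invocation * tasks_per_file
total_tasks = invocations * tasks_per_workflow

print(invocations)         # 15
print(tasks_per_workflow)  # 12800
print(total_tasks)         # 192000
```

On this reading each FlyteWorkflow object would carry status for at least 12,800 task nodes.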

gifted-raincoat-59712

08/31/2022, 4:48 PM

@hallowed-mouse-14616 Did you find a cause on Flyte side? Can I try to provide more debugging info?

rough-rose-81585

08/31/2022, 7:14 PM

I tried to delete the three `fly` CRDs that are causing the log spam. Two attempt to delete but hang waiting for the `flyte-finalizer` to complete; one gets an error from etcd that the request is too large.

kubectl delete fly f8114eedda3854878b11
Error from server: etcdserver: request is too large

rough-rose-81585

08/31/2022, 7:23 PM

the two that appeared hung eventually did delete. Not sure what to do about f811 if etcdserver says it is too large

hallowed-mouse-14616

08/31/2022, 9:30 PM

OK, so to clean up, for the immediate fix, we need to delete the CRD that is too large - @great-school-54368 do you know how we can do this?

hallowed-mouse-14616

08/31/2022, 9:31 PM

@rough-rose-81585 of course this doesn't fix the issue of how this happened. Do you mind filing an issue for this? I have a feeling it's going to take some digging, so it will be worth it to track in an issue.

gifted-raincoat-59712

09/01/2022, 7:01 PM

the flyte CRD that we cannot delete is 61K lines and 2.8 MB.

kubectl delete fly -n dpp-default f8114eedda3854878b11                   
Error from server: etcdserver: request is too large

kubectl get fly -n dpp-default f8114eedda3854878b11 -o yaml > flyte-fly.yaml

stat -f %z flyte-fly.yaml  
2825235

wc -l flyte-fly.yaml     
   61110 flyte-fly.yaml

gifted-raincoat-59712

09/01/2022, 7:01 PM

most of this is the flyte node status under `status.nodeStatus.subNodeStatus`
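
One way to find which part of a large manifest dominates its size is to serialize each subtree and rank the results. The sketch below is a generic helper, not part of Flyte or kubectl; `subtree_sizes` and `largest_subtrees` are hypothetical names, and the example data is a made-up stand-in for a real FlyteWorkflow object.

```python
import json


def subtree_sizes(obj, prefix="", max_depth=3):
    """Map dotted key paths to the serialized size (in bytes) of each subtree."""
    sizes = {}
    if isinstance(obj, dict) and max_depth > 0:
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            sizes[path] = len(json.dumps(value).encode("utf-8"))
            sizes.update(subtree_sizes(value, path, max_depth - 1))
    return sizes


def largest_subtrees(manifest, top=5, max_depth=3):
    """Return the `top` largest (path, bytes) pairs, biggest first."""
    sizes = subtree_sizes(manifest, max_depth=max_depth)
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:top]


# Tiny stand-in for a real manifest; the field contents are made up.
example = {
    "metadata": {"name": "example"},
    "spec": {"id": "wf"},
    "status": {
        "nodeStatus": {
            "subNodeStatus": {f"n{i}": {"phase": 5} for i in range(200)}
        }
    },
}

for path, size in largest_subtrees(example, top=3):
    print(path, size)
```

To run it against real data, the object could be exported with `-o json` instead of `-o yaml` and loaded with `json.load`, which avoids needing a YAML library.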

gifted-raincoat-59712

09/01/2022, 7:03 PM

so i hacked that attribute to zero it out.

kubectl edit fly -n dpp-default f8114eedda3854878b11
# set status.nodeStatus.subNodeStatus: {}
# delete original content
# save
flyteworkflow.flyte.lyft.com/f8114eedda3854878b11 edited
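
The effect of that manual edit can be estimated offline before touching the cluster. The following is a sketch only: `clear_path` and `serialized_size` are hypothetical helpers, the example manifest is made up, and nothing here talks to Kubernetes or applies the change.

```python
import copy
import json


def clear_path(manifest, dotted_path):
    """Return a deep copy of `manifest` with the value at `dotted_path` set to {}."""
    result = copy.deepcopy(manifest)
    *parents, leaf = dotted_path.split(".")
    node = result
    for key in parents:
        node = node[key]  # raises KeyError if the path does not exist
    node[leaf] = {}
    return result


def serialized_size(obj):
    """Size in bytes of the compact JSON encoding of `obj`."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


# Made-up stand-in for an exported FlyteWorkflow object.
example = {
    "metadata": {"name": "example"},
    "status": {
        "nodeStatus": {
            "phase": 5,
            "subNodeStatus": {f"n{i}": {"phase": 5} for i in range(200)},
        }
    },
}

trimmed = clear_path(example, "status.nodeStatus.subNodeStatus")
print(serialized_size(example), "->", serialized_size(trimmed))
```

The original dictionary is left untouched, so the before and after sizes can be compared directly.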

gifted-raincoat-59712

09/01/2022, 7:03 PM

then i was able to delete the monster

kubectl delete fly -n dpp-default f8114eedda3854878b11
flyteworkflow.flyte.lyft.com "f8114eedda3854878b11" deleted

gifted-raincoat-59712

09/01/2022, 7:06 PM

that calmed the flytepropeller logs down. flytepropeller noticed the change to the flyteworkflow CRD, then forgot about it after i removed it.

{"json":{"exec_id":"f8114eedda3854878b11","ns":"dpp-default","routine":"worker-43","src":"passthrough.go:95"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com \"f8114eedda3854878b11\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2022-09-01T18:59:08Z"}
E0901 18:59:08.375477       1 workers.go:102] error syncing 'dpp-default/f8114eedda3854878b11': Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com "f8114eedda3854878b11": the object has been modified; please apply your changes to the latest version and try again
{"json":{"exec_id":"f8114eedda3854878b11","node":"n0","ns":"dpp-default","res_ver":"71132693","routine":"worker-43","src":"handler.go:338","wf":"dpp:default:msat.level2.workflow.level2_wf"},"level":"warning","msg":"No plugin found for Handler-type [python-task], defaulting to [container]","ts":"2022-09-01T18:59:08Z"}
{"json":{"exec_id":"f8114eedda3854878b11","ns":"dpp-default","routine":"worker-45","src":"passthrough.go:39"},"level":"warning","msg":"Workflow not found in cache.","ts":"2022-09-01T18:59:59Z"}
{"json":{"exec_id":"f8114eedda3854878b11","ns":"dpp-default","routine":"worker-45","src":"handler.go:176"},"level":"warning","msg":"Workflow namespace[dpp-default]/name[f8114eedda3854878b11] not found, may be deleted.","ts":"2022-09-01T18:59:59Z"}

hallowed-mouse-14616

09/01/2022, 8:46 PM

@gifted-raincoat-59712 oh great! Thanks for diving into this, I had it on my "follow up" list today. So in the near-term it sounds like the plan is to reduce the size of these workflows by restricting them to 10 files rather than 150? I know I have had discussions with @rough-rose-81585 about using launchplans to scale by distributing nodes over multiple CRDs, and I believe this is what you're already doing. And again, the goal is to ensure flyte workflow CRDs are a manageable size - but in the case that they're too large it shouldn't leave the system in this broken state. If you want to file an issue I'm sure this is something that we can work on to iron out a bit.

rough-rose-81585

09/02/2022, 4:58 PM

Hi @hallowed-mouse-14616 yes I was trying to keep the CRDs at a manageable size by capping each launchplan at 10 files max, but we do cram a lot of info into each input so I’ll reduce it to either 1 or 2 files for now, and that should resolve the issue in the short term

🙌 1

hallowed-mouse-14616

09/02/2022, 5:04 PM

Fantastic! We just implemented a feature to offload static portions (workflow spec, etc) of the CRD to the blobstore which will reduce CRD size, but it's not clear this will help in your use case as dynamic tasks are already offloading the spec. Rather, managing the large status seems to be the issue. This is known and we have a few thoughts about how we can further reduce CRD sizes by stripping already completed statuses from the CRD (ground truth maintained in flyteadmin DB), but they require a lot of effort and testing. I'm hoping to devote some time to this later this year.
