# flyte-deployment
n
Hi, we recently had a few large workflows get into a bad state. I let them run for a long time and eventually had to abort a few. Nothing is running anymore, but the flytepropeller logs are flooded with the following:
Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution
I feel like it is related to this ticket. My main question, however, is: when something like this happens, what is the best way to reset flytepropeller? I tried restarting the pod, but it seems like it is a problem with the database state?
d
Hey @Nicholas LoFaso! This is interesting. Is the FlyteWorkflow CRD still around? It would be interesting to look at a dump of this. Basically, what is happening here is that FlytePropeller is attempting to abort a task in the workflow. As part of that operation it sends an event to FlyteAdmin to update the task state in the UI. FlyteAdmin is responding that the task has already FAILED (as stored in the DB), so it can not be updated to ABORTED (an invalid transition).
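The rejection described above can be sketched as a small terminal-phase check. This is only an illustration of the rule implied by the error message, not FlyteAdmin's actual code: the function name, the exception class, and the exact set of terminal phases are assumptions.

```python
# Illustrative sketch only. The phase names FAILED and ABORTED come from the
# log line in this thread; the set below and the rule are assumptions, not
# FlyteAdmin's implementation.
TERMINAL_PHASES = {"SUCCEEDED", "FAILED", "ABORTED"}


class InvalidPhaseChange(Exception):
    """Raised when an event tries to move a task out of a terminal phase."""


def apply_task_event(stored_phase: str, incoming_phase: str) -> str:
    """Return the phase to store, or raise if the stored phase is already terminal."""
    if stored_phase in TERMINAL_PHASES and incoming_phase != stored_phase:
        raise InvalidPhaseChange(
            f"invalid phase change from {stored_phase} to {incoming_phase}"
        )
    return incoming_phase
```

Under this sketch, an event moving a task from RUNNING to FAILED is accepted, while a later ABORTED event for the same task is rejected every time it is retried, which would match the repeated log line.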
If you can verify that all of the Pods in the workflow have been aborted and deleted (finalizers removed if necessary), then you should be able to delete the FlyteWorkflow CRD manually and that will stop FlytePropeller from attempting to re-abort the workflow.
However, this is very unintended behavior and I would like to figure out how we entered this state to ensure it doesn't happen in the future.
Is there anything in the logs about failing to update the workflow CRD? Something like `Failed to update workflow. Error [%v]` or `Failed storing workflow to the store, reason: %s`? I'm wondering if the task failed, the CRD was too large to update (or failed to update for some other reason), and then you manually aborted. This sequence could explain what you're seeing.
n
Hi Dan, thanks for the info. I’ve sent you one of the CRDs I retrieved with `kubectl get fly -o yaml`.
I do not see either of those lines in our flytepropeller/flyteadmin logs. Some other info about the run:
• Contour seems to have had an issue pulling images during the run. I suspect this was because we scaled out to 900 nodes fairly quickly and were being throttled. We’ve since updated our deployment to have the Contour image (and all external images) copied into our internal registry. Not sure if the Contour failure caused Flyte to get into this strange state, but it seems plausible.
• This workflow is fairly large and has multiple dynamic workflows and tasks. One dynamic workflow in particular spawns 1280 tasks per input file.
• I’ve limited the number of input files per workflow invocation to 10. For this particular run we had 150 input files.
• So we had 15 launchplan invocations, 10 files per workflow, 1280+ tasks per workflow. This utilized 10,000 CPUs across ~900 nodes. It was supposed to take around 7 hours to complete.
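As a rough check of the numbers above, the arithmetic can be written out. The per-invocation task count assumes the 1280-tasks-per-file fan-out applies to each of the 10 files in an invocation; the message also says "1280+ tasks per workflow", so treat that figure as an estimate rather than a measured value.

```python
import math

# Figures quoted in the message above.
input_files = 150
files_per_invocation = 10
tasks_per_file = 1280      # fan-out of one dynamic workflow, per input file
total_cpus = 10_000
nodes = 900                # approximate

invocations = math.ceil(input_files / files_per_invocation)
# Assumption: the per-file fan-out applies to every file in an invocation.
fanout_tasks_per_invocation = files_per_invocation * tasks_per_file
cpus_per_node = total_cpus / nodes

print(invocations)                  # 15
print(fanout_tasks_per_invocation)  # 12800
print(round(cpus_per_node, 1))      # 11.1
```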
j
@Dan Rammer (hamersaw) Did you find a cause on the Flyte side? Can I try to provide more debugging info?
n
I tried to delete the three `fly` CRDs that are causing the log spam. Two of the delete attempts hang waiting for the `flyte-finalizer` to complete; the third gets an error from etcd that the request is too large:
kubectl delete fly f8114eedda3854878b11
Error from server: etcdserver: request is too large
The two that appeared hung eventually did delete. Not sure what to do about f811 if etcdserver says it is too large.
d
OK, so to clean up, for the immediate fix, we need to delete the CRD that is too large. @Yuvraj do you know how we can do this?
@Nicholas LoFaso of course this doesn't fix the issue of how this happened. Do you mind filing an issue for this? I have a feeling it's going to take some digging, so it will be worth it to track in an issue.
j
The Flyte CRD that we cannot delete is 61K lines and 2.8 MB.
kubectl delete fly -n dpp-default f8114eedda3854878b11                   
Error from server: etcdserver: request is too large

kubectl get fly -n dpp-default f8114eedda3854878b11 -o yaml > flyte-fly.yaml

stat -f %z flyte-fly.yaml  
2825235

wc -l flyte-fly.yaml     
   61110 flyte-fly.yaml
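For context on the error, the dump size can be compared against etcd's default request limit. etcd's `--max-request-bytes` defaults to 1.5 MiB (1,572,864 bytes); whether this cluster overrides it is not stated in the thread, and the YAML dump size is only a proxy for what the API server actually sends to etcd. A minimal sketch under those assumptions:

```python
# etcd's default --max-request-bytes is 1.5 MiB; a cluster may override it.
ETCD_DEFAULT_MAX_REQUEST_BYTES = 1_572_864


def exceeds_limit(size_bytes: int, limit: int = ETCD_DEFAULT_MAX_REQUEST_BYTES) -> bool:
    """Return True when an object of size_bytes would not fit under limit."""
    return size_bytes > limit


dump_size = 2_825_235  # bytes, from the stat output above

print(exceeds_limit(dump_size))                              # True
print(round(dump_size / ETCD_DEFAULT_MAX_REQUEST_BYTES, 2))  # 1.8
```

Note that `stat -f %z` is the BSD/macOS form; `wc -c < flyte-fly.yaml` gives the same byte count on Linux as well.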
Most of this is the Flyte node status under `status.nodeStatus.subNodeStatus`, so I hacked that attribute to zero it out.
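One way to confirm which part of an oversized CRD dominates is to measure the serialized size of each subtree. The sketch below assumes a JSON dump (for example from `kubectl get fly ... -o json`); the function names and the `flyte-fly.json` file name are hypothetical, and nothing here is part of Flyte or kubectl.

```python
import json


def subtree_sizes(node, path=""):
    """Yield (dotted_path, compact-JSON byte size) for each dict or list under node."""
    if isinstance(node, (dict, list)):
        size = len(json.dumps(node, separators=(",", ":")).encode("utf-8"))
        yield path or ".", size
        items = node.items() if isinstance(node, dict) else enumerate(node)
        for key, child in items:
            child_path = f"{path}.{key}" if path else str(key)
            yield from subtree_sizes(child, child_path)


def largest_subtrees(doc, top=5):
    """Return the `top` largest subtrees, biggest first."""
    return sorted(subtree_sizes(doc), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical usage against a dump saved as flyte-fly.json:
# with open("flyte-fly.json") as f:
#     for path, size in largest_subtrees(json.load(f), top=10):
#         print(f"{size:>10}  {path}")
```

Every subtree is re-serialized, so this is quadratic in the worst case, which is acceptable for a one-off inspection of a few megabytes.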
kubectl edit fly -n dpp-default f8114eedda3854878b11
# set status.nodeStatus.subNodeStatus: {}
# delete original content
# save
flyteworkflow.flyte.lyft.com/f8114eedda3854878b11 edited
Then I was able to delete the monster:
kubectl delete fly -n dpp-default f8114eedda3854878b11
flyteworkflow.flyte.lyft.com "f8114eedda3854878b11" deleted
That calmed the flytepropeller logs down. Flytepropeller noticed the change to the FlyteWorkflow CRD, then forgot about it after I removed it.
{"json":{"exec_id":"f8114eedda3854878b11","ns":"dpp-default","routine":"worker-43","src":"passthrough.go:95"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com \"f8114eedda3854878b11\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2022-09-01T18:59:08Z"}
E0901 18:59:08.375477       1 workers.go:102] error syncing 'dpp-default/f8114eedda3854878b11': Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com "f8114eedda3854878b11": the object has been modified; please apply your changes to the latest version and try again
{"json":{"exec_id":"f8114eedda3854878b11","node":"n0","ns":"dpp-default","res_ver":"71132693","routine":"worker-43","src":"handler.go:338","wf":"dpp:default:msat.level2.workflow.level2_wf"},"level":"warning","msg":"No plugin found for Handler-type [python-task], defaulting to [container]","ts":"2022-09-01T18:59:08Z"}
{"json":{"exec_id":"f8114eedda3854878b11","ns":"dpp-default","routine":"worker-45","src":"passthrough.go:39"},"level":"warning","msg":"Workflow not found in cache.","ts":"2022-09-01T18:59:59Z"}
{"json":{"exec_id":"f8114eedda3854878b11","ns":"dpp-default","routine":"worker-45","src":"handler.go:176"},"level":"warning","msg":"Workflow namespace[dpp-default]/name[f8114eedda3854878b11] not found, may be deleted.","ts":"2022-09-01T18:59:59Z"}
d
@Justin Tyberg oh great! Thanks for diving into this, I had it on my "follow up" list today. So in the near term it sounds like the plan is to reduce the size of these workflows by restricting them to 10 files rather than 150? I know I have had discussions with @Nicholas LoFaso about using launchplans to scale by distributing nodes over multiple CRDs; I believe this is what you're already doing. And again, the goal is to ensure Flyte workflow CRDs are a manageable size, but in the case that they're too large it shouldn't leave the system in this broken state. If you want to file an issue, I'm sure this is something that we can work on to iron out a bit.
n
Hi @Dan Rammer (hamersaw) yes I was trying to keep the CRDs at a manageable size by capping each launchplan at 10 files max, but we do cram a lot of info into each input so I’ll reduce it to either 1 or 2 files for now, and that should resolve the issue in the short term
d
Fantastic! We just implemented a feature to offload static portions (workflow spec, etc.) of the CRD to the blobstore, which will reduce CRD size, but it's not clear this will help in your use case, as dynamic tasks are already offloading the spec. Rather, managing the large status seems to be the issue. This is known, and we have a few thoughts about how we can further reduce CRD sizes by stripping already completed statuses from the CRD (ground truth maintained in the flyteadmin DB), but they require a lot of effort and testing. I'm hoping to devote some time to this later this year.