Calvin Leather
07/26/2022, 7:20 PM
[0]: code:"ResourceDeletedExternally" message:"resource not found, name [e2e-workflows-development/fb2xnzxy-n2-0-0]. reason: pods \"fb2xnzxy-n2-0-0\" not found"
We then checked the control plane logs, and they suggested the pod was being evicted due to memory pressure (exit code 137 = 128 + 9, i.e. the container was SIGKILLed, which is what a k8s OOM kill looks like):
"containerStatuses": [
  {
    "name": "fb2xnzxy-n2-0-0",
    "state": {
      "terminated": {
        "exitCode": 137,
        ....
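For anyone reading the exit code for the first time: by shell convention, a status above 128 usually means the process was killed by signal number (status - 128). A minimal sketch of that decoding follows; the helper name `describe_exit_code` is made up for illustration and is not part of Flyte or Kubernetes.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode a container exit code using the 128 + signal-number convention."""
    if code > 128:
        try:
            # e.g. 137 - 128 = 9, which is SIGKILL on Linux
            sig = signal.Signals(code - 128)
            return f"killed by {sig.name}"
        except ValueError:
            # The remainder does not map to a known signal on this platform.
            return f"exited with status {code} (no matching signal)"
    return f"exited normally with status {code}"


print(describe_exit_code(137))
```

SIGKILL on its own does not prove an OOM kill; it only says the process was force-killed, so the reason field in the pod status is still worth checking.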
However, when we looked at Grafana, memory used was really low, way below requests/limits, while the memory cache was quite high. We then found a k8s issue about memory cache being incorrectly counted as "used" memory by the kubelet when it evaluates memory pressure.
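A rough sketch of why cache can matter here, assuming the issue in question is the working-set one: the kubelet's memory-pressure signal is based on the working set, which subtracts only the inactive file cache from usage, so active page cache still counts as in use. The function names and the sample numbers below are hypothetical, and the `memory.stat` text is cgroup v1 style.

```python
def parse_memory_stat(text: str) -> dict:
    """Parse cgroup memory.stat content ('key value' per line) into a dict of ints."""
    stats = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats


def working_set_bytes(usage_bytes: int, inactive_file_bytes: int) -> int:
    """Working set = usage minus inactive file cache, floored at zero."""
    return max(usage_bytes - inactive_file_bytes, 0)


# Hypothetical numbers: 3 GiB usage, of which most is page cache.
sample_stat = """
total_rss 209715200
total_active_file 2147483648
total_inactive_file 536870912
"""
stats = parse_memory_stat(sample_stat)
usage = 3 * 1024**3
ws = working_set_bytes(usage, stats["total_inactive_file"])
print(f"rss={stats['total_rss']} working_set={ws}")
```

With numbers like these, a dashboard that plots RSS shows about 200 MiB, while the working set that feeds the pressure calculation is about 2.5 GiB.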
Not quite a Flyte issue, more of a k8s issue, but the log was a bit mysterious and we're still figuring out a resolution.
Louis DiNatale
07/26/2022, 7:21 PM
Calvin Leather
07/26/2022, 7:24 PM
Ketan (kumare3)
Calvin Leather
07/26/2022, 9:06 PM
Ketan (kumare3)
Eduardo Apolinario (eapolinario)
07/27/2022, 5:12 PM
Dan Rammer (hamersaw)
07/27/2022, 6:20 PM
Louis DiNatale
07/27/2022, 7:14 PM
Calvin Leather
07/27/2022, 8:23 PM
--memory and appears to reserve a bunch of memory if you don't override the defaults. We're still tracing the source code to figure out what this flag does (and why the RAM is showing up as cached), but it definitely doesn't seem to be a Flyte problem.
Ketan (kumare3)
Calvin Leather
07/27/2022, 9:44 PM
Ketan (kumare3)
Calvin Leather
07/27/2022, 9:45 PM
Ketan (kumare3)
Calvin Leather
07/27/2022, 9:46 PM
Mike Zhong
07/29/2022, 4:58 PM
I0729 16:10:23.041121 4106 kuberuntime_manager.go:484] "No sandbox for pod can be found. Need to start a new one" pod="e2e-workflows-development/fnrte65a-n3-0-108"
but no other meaningful logs between that and the time it is deleted and removed.
We've included screenshots of the finalizers in the configmap and being applied to the pods, as well as the ResourceDeletedExternally error. Any thoughts on what could be happening here or where else we could look for insight?
Ketan (kumare3)
Mike Zhong
07/29/2022, 6:24 PM
ResourceDeletedExternally from the Flyte console. We are assuming this indicates that flytepropeller is unable to gather the logs and the pod is being cleaned up by the kubelet despite the finalizers.
Dan Rammer (hamersaw)
07/29/2022, 6:35 PM
Calvin Leather
07/29/2022, 7:14 PM
Ketan (kumare3)
Calvin Leather
08/18/2022, 2:40 PM
Ketan (kumare3)
Calvin Leather
08/18/2022, 4:03 PM