Hey, we ran into an interesting problem today: we have a map task that calls a C binary, which reads a large reference dataset from disk to do computations on a new, smaller dataset. It keeps failing with a rather mysterious error:
[0]: code:"ResourceDeletedExternally" message:"resource not found, name [e2e-workflows-development/fb2xnzxy-n2-0-0]. reason: pods \"fb2xnzxy-n2-0-0\" not found"
We then checked the control plane logs, which suggested the pod was being evicted due to memory pressure (exit code 137 = 128 + SIGKILL, the usual Kubernetes OOM-kill signature):
"containerStatuses": [
{
"name": "fb2xnzxy-n2-0-0",
"state": {
"terminated": {
"exitCode": 137,
....
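As an aside, exit codes above 128 conventionally mean "terminated by signal (code - 128)", which is how you can decode 137 without guessing. A quick sketch (the numbers here are just the arithmetic, not pulled from our pod):

```shell
# Exit code 137 = 128 + 9, i.e. the process was killed by SIGKILL,
# which is what the kernel OOM killer and kubelet eviction deliver.
code=137
sig=$((code - 128))
echo "signal $sig"   # signal 9
kill -l "$sig"       # KILL
```

Seeing 137 alone doesn't distinguish an OOM kill from any other SIGKILL, which is why we went to the control plane logs to confirm the eviction reason.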
However, when we look at Grafana, memory used is really low, way below requests/limits. What we did notice is that the memory (page) cache was quite high, and we then found a Kubernetes issue about file cache being incorrectly counted as "used" memory by the kubelet when it evaluates memory pressure.
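The mechanics, as we understand them: kubelet's eviction signal is based on the cgroup's working set, which is total memory usage minus *inactive* file cache. Active file cache (say, a large reference dataset being read repeatedly) stays in the working set even though the kernel could reclaim it under pressure. A toy sketch of that formula, with made-up numbers:

```python
# Hypothetical cgroup readings, in bytes -- not from our actual pod.
usage_bytes = 7_500_000_000    # total cgroup memory usage, including page cache
inactive_file = 500_000_000    # inactive (easily reclaimable) file cache

# kubelet's working-set formula: only inactive file cache is subtracted.
working_set = usage_bytes - inactive_file

# Active file cache from repeatedly scanning a big on-disk dataset is NOT
# subtracted, so the working set can sit near the limit even while
# "memory used" dashboards (which often exclude all cache) look low.
print(working_set)  # 7000000000
```

This would explain the mismatch we saw: our dashboards exclude cache, but the kubelet's pressure calculation effectively doesn't exclude the active part of it.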
Not quite a Flyte issue, more of a Kubernetes issue, but the log was a bit mysterious and we're still figuring out a resolution.