# ask-the-community
c
Hey we ran into an interesting problem today... we have a map task that calls a C binary that uses a large reference dataset from disk to do some computations on a new smaller dataset. It keeps failing with a rather mysterious error:
[0]: code:"ResourceDeletedExternally" message:"resource not found, name [e2e-workflows-development/fb2xnzxy-n2-0-0]. reason: pods \"fb2xnzxy-n2-0-0\" not found"
We then checked the control plane logs, and they suggested the pod was being evicted due to memory pressure (exit code 137 = 128 + 9, i.e. the container was killed with SIGKILL, which is what a k8s OOM kill looks like):
"containerStatuses": [
                {
                    "name": "fb2xnzxy-n2-0-0",
                    "state": {
                        "terminated": {
                            "exitCode": 137,
....
However, when we look at grafana, we see that memory used is really low, way below requests/limits. The memory cache, on the other hand, was quite high. We then found a k8s issue about memory cache being incorrectly counted as "used" memory by the kubelet when it evaluates memory pressure. Not quite a Flyte issue, more of a k8s issue, but the log was a bit mysterious and we're still figuring out a resolution.
👀 1
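For future searchers, a minimal sketch of how the 137 above can be read, assuming the usual shell convention that a process killed by a signal reports 128 plus the signal number. The helper name `describe_exit_code` is made up for illustration and is not part of Flyte or k8s.

```python
import signal


def describe_exit_code(exit_code: int) -> str:
    """Explain a container exit code using the 128 + signal convention."""
    if exit_code == 0:
        return "exited normally"
    if exit_code > 128:
        signum = exit_code - 128
        try:
            # Look up the symbolic name of the signal on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with error status {exit_code}"


print(describe_exit_code(137))
```

On Linux this reports SIGKILL for 137. SIGKILL alone does not prove an OOM kill, so it is worth checking the `reason` field of the terminated state as well.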
@Louis DiNatale can you share that grafana screenshot from earlier?
l
message has been deleted
thx 1
c
We're testing out https://github.com/Feh/nocache to see if it can allow us to reduce cache usage to work around this
k
hmm but Flyte has a solution for this @Calvin Leather
you have to enable finalizers
👀 1
what's happening is the number of pods is too high
so k8s will randomly delete pods
c
Interesting, we'll look into this. We didn't override this default, and it looks like values-eks.yaml doesn't either.
We're wrapping up our work day, will investigate this setting more tomorrow and read the source to understand what this finalizer is doing. The error from this was stochastic, so we just retried our way around it for today.
Thanks for giving us some nice direction here!
k
absolutely
e
cc: @Dan Rammer (hamersaw)
d
@Calvin Leather, I understand you have a few questions regarding how injecting the finalizer works. Basically, k8s will not garbage collect any resource that has a finalizer on it until the finalizer is removed. So the error you're seeing is a result of the following sequence of events:
(1) Flyte creates the subtask Pod
(2) Pod is OOM deleted - which only marks the Pod as deleted rather than actually deleting anything
(3) k8s garbage collects the Pod
(4) Flyte attempts to get the Pod status to determine the task state. It does not exist, so Flyte throws an error that the resource cannot be found.
In the scenario where the finalizer is injected, the sequence is a little different:
(1) Flyte creates the subtask Pod with a finalizer
(2) Pod is OOM deleted - which only marks the Pod as deleted rather than actually deleting anything
(3) Flyte retrieves the Pod status to determine task state and detects that the Pod has been deleted by an external entity. It then marks the task as a retryable failure, which subsequently removes the Pod finalizer
(4) k8s garbage collects the Pod
So basically, injecting the finalizer will not stop the Pod from being OOM deleted. However, Flyte may be able to provide a better error message as to why the Pod was deleted, because the Pod still exists and Flyte is responsible for OKing the Pod deletion.
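The two sequences above can be sketched with a toy in-memory model. This is only an illustration of the ordering, not Flyte or k8s code: `FakeCluster`, `check_subtask`, the status strings, and the finalizer name `flyte/example-finalizer` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    finalizers: list = field(default_factory=list)
    deletion_requested: bool = False


class FakeCluster:
    """Toy stand-in for the k8s API: models only deletion and garbage collection."""

    def __init__(self):
        self.pods = {}

    def create(self, pod):
        self.pods[pod.name] = pod

    def delete(self, name):
        # Deletion only marks the Pod; nothing is removed yet.
        self.pods[name].deletion_requested = True

    def garbage_collect(self):
        # Remove marked Pods, but only once no finalizer remains.
        doomed = [n for n, p in self.pods.items()
                  if p.deletion_requested and not p.finalizers]
        for name in doomed:
            del self.pods[name]

    def get(self, name):
        return self.pods.get(name)


def check_subtask(cluster, name):
    """Hypothetical status check, loosely following the steps described above."""
    pod = cluster.get(name)
    if pod is None:
        return "ResourceDeletedExternally"
    if pod.deletion_requested:
        pod.finalizers.clear()  # release the Pod so it can be garbage collected
        return "RetryableFailure"
    return "Running"


def run(with_finalizer):
    cluster = FakeCluster()
    finalizers = ["flyte/example-finalizer"] if with_finalizer else []
    cluster.create(Pod("subtask-0", finalizers))
    cluster.delete("subtask-0")   # e.g. the OOM deletion
    cluster.garbage_collect()     # GC runs before the status check
    status = check_subtask(cluster, "subtask-0")
    cluster.garbage_collect()     # GC runs again after the check
    return status, cluster.get("subtask-0") is None


print(run(with_finalizer=False))
print(run(with_finalizer=True))
```

In both runs the Pod ends up gone. The only difference in this model is whether the status check still finds it and can report a retryable failure.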
👀 1
l
We set the finalizer to true and we can see it in our propeller config map, but it seems to have had no impact on our job. It still failed with OOM.
Ok, so I think I understand this better then: the finalizer isn't the solution to the mem cache, but it will keep the pods around longer so you can view the error?
c
Thanks for that explanation Dan!
So the finalizer is the fix for the pod getting deleted after the OOM, not the fix for the OOM itself
e.g., Flyte will correctly register the OOM, instead of this error: `reason: pods \"fb2xnzxy-n2-0-0\" not found`
Thank you!
We're still figuring out the OOM issue. It's confusing because it seems like the RAM cache is causing us to get OOM problems (actually malloc'd memory is way below our limit, but actually used memory + cache is up to the limit, which we think is causing K8s to evict/OOM)
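A rough sketch of why cache can count toward memory pressure, assuming the working-set calculation described in the Kubernetes node-pressure eviction docs (usage minus inactive file cache, so active file cache still counts). The numbers below are invented for illustration and are not taken from our grafana dashboards.

```python
GIB = 1024 ** 3


def working_set_bytes(usage_bytes: int, inactive_file_bytes: int) -> int:
    """Working set = total usage minus inactive file cache, floored at zero."""
    return max(usage_bytes - inactive_file_bytes, 0)


# Illustrative container: 1 GiB of malloc'd memory plus 6 GiB of page cache,
# of which 4 GiB is "active" (recently touched) and 2 GiB is "inactive".
rss = 1 * GIB
active_file = 4 * GIB
inactive_file = 2 * GIB
usage = rss + active_file + inactive_file

ws = working_set_bytes(usage, inactive_file)
print(f"rss={rss / GIB:.0f} GiB, working set={ws / GIB:.0f} GiB")
```

Under these assumptions a process that repeatedly reads a large reference dataset can show a working set well above its malloc'd memory.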
Hmm, we may have found the issue... gotta love bioinformatics libraries... one of the C binaries we use to analyze data has a flag called `--memory` that appears to reserve a bunch of memory if you don't override the defaults. We're still tracing the source code to figure out what this flag does (and why the RAM is showing up as cached), but it's definitely not a Flyte problem, it seems
Or k8s problem probably
k
@Calvin Leather - TBH - it is possible to submit a memory profile as part of FlyteDecks
if you are interested in contributing
that could help debug these situations?
c
Ooh
That would be awesome!
We have grafana + prometheus set up
But we got led astray by the outstanding k8s issue w/ RAM caching
(that was a red herring, it was just a regular OOM)
Traces in flyte deck + jupyter notebook papermill task outputs as flyte decks are both interesting to us, would be fun to contribute both/either of those
k
ohh the papermill one is really simple
c
Yeah we've been doing the deck as a follow up task
k
it's already created as an html, should be simple to add
✔️ 1
@Calvin Leather ❤️ all contributions are loved
c
I may grab that jupyter notebook deck one when I have a spare minute, seems like a good first contribution
🙇 1
m
Thanks for providing that information on finalizers @Dan Rammer (hamersaw) and @Ketan (kumare3). Just to follow-up on this thread, we have enabled finalizers in our flyte deployment and have confirmed the configuration in the configmap. However, we are still encountering the error. We queried the pod names in the dataplane log group and see no obvious indication as to why the pod is being killed. We do see
I0729 16:10:23.041121 4106 kuberuntime_manager.go:484] "No sandbox for pod can be found. Need to start a new one" pod="e2e-workflows-development/fnrte65a-n3-0-108"
but no other meaningful logs between that and the time it is deleted and removed. We’ve included screenshots of the finalizers in the configmap and being applied to the pods, as well as the `ResourceDeletedExternally` error. Any thoughts on what could be happening here or where else we could look for insight?
k
this is a kubelet / k8s configuration. Sorry, Flyte folks won't be able to help. I think Flyte is doing the right thing - the pod is being held on to now?
thx 1
m
We see the `ResourceDeletedExternally` from the flyte console. We are assuming this indicates that flytepropeller is unable to gather the logs and the pod is being cleaned up by the kubelet, despite the finalizers
d
Correct, I know exactly where this happens in the code. When retrieving the Pod (to check status) k8s returns "object does not exist", but we know that it should exist because of the node status (i.e. creating it succeeded earlier). Injecting finalizers is the only thing that Flyte can do to ensure a Pod is not deleted externally, but I know we've run into scenarios where a cleanup mechanism will not recognize the finalizers and will delete and garbage collect the Pods regardless.
thx 1
c
Thanks for the info! We're investigating now (edit: what caused the pod to terminate/be terminated + garbage collected), will post back here when we figure something out (for future searchers, if we don't also just add to the AWS deploy docs in the unlikely case this is some issue with how we have EKS configured)
Okay, this took us entirely too long (couldn't reproduce outside of prod, only happened in map tasks at decent scale). We're pretty sure this is some kind of interaction between flyte propeller, eks node-manager, and ASG Availability Zone rebalancing.
The ultimate error message seems to vary depending on some details I'm still not 100% sure of, but it shows up as "resource not found", "resource manually deleted", or a "panic when executing a plugin [k8s-array]" (typically with the stack trace showing LaunchAndCheckSubTasksState or something else near go/tasks/plugins/array/k8s/management.go:24, i.e., while propeller is trying to check on the state of the task).
We explored the control plane and K8s API logs in more detail, and discovered that these failures seem to always occur after an eviction request by a lambda function that is owned by AWS (i.e., part of the managed EKS node group). On more exploration, we suspect (waiting on AWS support to confirm) that this lambda function (AWSWesleyClusterManagerLambda) is the one responsible for "graceful" eviction after AZ rebalancing activities. We also confirmed in the ASG logs that an AZ rebalancing occurred before each of these errors, and that this lambda sent an eviction request to the K8s API.
The missing piece here (to me anyway) is why the finalizers don't prevent this. We've confirmed using kubectl that the finalizers are getting applied correctly. I would expect the eviction to respect them (need to read up more on k8s internals around eviction, I suppose).
We disabled AZ rebalancing on the ASG yesterday, and this particular error seems to have stopped (time will tell if it's fully eradicated).
Also, one thing I'm less confident about: this issue seems to occur more (maybe only) on the last/terminal retry of a task. I.e. if we have 2 retries on a task, we get this error when an AZ rebalancing disrupts the 2nd retry. This occurs rarely (we're running map tasks with 100-1000 elements, and it'll often be only 1 or 2 pods that are on the 2nd/final retry).
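For anyone wanting to try the same workaround, one possible way to disable AZ rebalancing is to suspend the `AZRebalance` process on the Auto Scaling group with boto3. This is a sketch under assumptions: the thread does not say how the setting was actually changed, the function names are invented, and the group name in the usage note is a placeholder.

```python
def az_rebalance_suspension_request(asg_name: str) -> dict:
    """Build the keyword arguments for the Auto Scaling suspend_processes call."""
    return {
        "AutoScalingGroupName": asg_name,
        "ScalingProcesses": ["AZRebalance"],
    }


def suspend_az_rebalance(asg_name: str) -> None:
    """Suspend AZ rebalancing on one Auto Scaling group (needs AWS credentials)."""
    # Imported lazily so the request builder above works without boto3 installed.
    import boto3

    client = boto3.client("autoscaling")
    client.suspend_processes(**az_rebalance_suspension_request(asg_name))
```

Usage would look like `suspend_az_rebalance("my-eks-node-group-asg")`, where the name is hypothetical. Since the group here belongs to a managed EKS node group, it is worth checking with AWS whether changing it directly is supported.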
k
Can you capture this on an issue
I do not want to lose this info
c
Yes totally!
❤️ 1
I'll get it into an issue today and include cloudwatch screenshots
k
Perfect helps on Google search
👍 1
c
Moved to https://github.com/flyteorg/flyte/issues/2788, will keep adding detail (it's still a bit light on detail)
🙌 2