Hey we ran into an interesting problem today we have a map t Flyte #flyte-support

shy-holiday-15500

07/26/2022, 7:20 PM

Hey we ran into an interesting problem today... we have a map task that calls a C binary that uses a large reference dataset from disk to do some computations on a new smaller dataset. It keeps failing with a rather mysterious error:

[0]: code:"ResourceDeletedExternally" message:"resource not found, name [e2e-workflows-development/fb2xnzxy-n2-0-0]. reason: pods \"fb2xnzxy-n2-0-0\" not found"

We then checked the control plane logs, and they suggested the pod was being evicted due to memory pressure (exit code 137 = 128 + 9, i.e. the container was killed with SIGKILL, which is what an OOM-killed container reports):

"containerStatuses": [
                {
                    "name": "fb2xnzxy-n2-0-0",
                    "state": {
                        "terminated": {
                            "exitCode": 137,
....
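
For future readers, a minimal sketch of how that exit code can be decoded, assuming the usual convention that a container killed by a signal exits with 128 plus the signal number. The helper name `describe_exit_code` is made up for illustration and is not part of Flyte or Kubernetes.

```python
import signal


def describe_exit_code(exit_code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if exit_code > 128:
        signum = exit_code - 128
        try:
            # Look up the symbolic name of the signal on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {exit_code}"


# 137 - 128 = 9, which is SIGKILL on Linux.
print(describe_exit_code(137))
```

Note that SIGKILL only says the process was killed, not by whom. The OOM killer is the most common cause in a pod, but the exit code alone does not prove it.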

However, when we look at Grafana, we see that memory used is really low, way below requests/limits. What we did find is that the memory cache was quite high. We then found a k8s issue about memory cache being incorrectly counted as "used" memory by the kubelet when it evaluates memory pressure. Not quite a Flyte issue, more of a k8s issue, but the log was a bit mysterious and we're still figuring out a resolution.

👀 1

shy-holiday-15500

07/26/2022, 7:20 PM

@nice-zebra-99977 can you share that grafana screenshot from earlier?

nice-zebra-99977

07/26/2022, 7:21 PM

message has been deleted

thx 1

shy-holiday-15500

07/26/2022, 7:24 PM

We're testing out https://github.com/Feh/nocache to see if it can allow us to reduce cache usage to work around this

freezing-airport-6809

07/26/2022, 9:04 PM

hmm but Flyte has a solution for this @shy-holiday-15500

freezing-airport-6809

07/26/2022, 9:04 PM

you have to enable finalizers

👀 1

freezing-airport-6809

07/26/2022, 9:04 PM

whats happening is number of pods is too high

freezing-airport-6809

07/26/2022, 9:04 PM

so k8s will randomly delete pods

freezing-airport-6809

07/26/2022, 9:04 PM

https://docs.flyte.org/en/latest/deployment/cluster_config/scheduler_config.html#inject-finalizer-bool

👀 1

shy-holiday-15500

07/26/2022, 9:06 PM

Interesting, we'll look into this. We didn't override this default, and it looks like values-eks.yaml doesn't either.

shy-holiday-15500

07/26/2022, 9:08 PM

We're wrapping up our work day, will investigate this setting more tomorrow and read the source to understand what this finalizer is doing. The error from this was stochastic, so we just retried our way around it for today.

shy-holiday-15500

07/26/2022, 9:12 PM

Thanks for giving us some nice direction here!

freezing-airport-6809

07/26/2022, 9:12 PM

absolutely

high-accountant-32689

07/27/2022, 5:12 PM

cc: @hallowed-mouse-14616

hallowed-mouse-14616

07/27/2022, 6:20 PM

@shy-holiday-15500, I understand you have a few questions regarding how injecting the finalizer works. Basically, k8s will not garbage collect any resource that has a finalizer on it until the finalizer is removed. So the error you're seeing is a result of the following sequence of events:

(1) Flyte creates the subtask Pod
(2) Pod is OOM deleted - which only marks the Pod as deleted rather than actually deleting anything
(3) k8s garbage collects the Pod
(4) Flyte attempts to get the Pod status to determine the task state. It does not exist, so Flyte throws an error that the resource cannot be found.

In the scenario where the finalizer is injected, the sequence will be a little different:

(1) Flyte creates the subtask Pod with a finalizer
(2) Pod is OOM deleted - which only marks the Pod as deleted rather than actually deleting anything
(3) Flyte retrieves the Pod status to determine the task state and detects that the Pod has been deleted by an external entity. It then marks the task as a retryable failure, which subsequently removes the Pod finalizer
(4) k8s garbage collects the Pod

So basically, injecting the finalizer will not stop the Pod from being OOM deleted. However, Flyte may be able to provide a better error message as to why the Pod was deleted, because the Pod still exists and Flyte is responsible for OKing the Pod deletion.
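
The two sequences above can be sketched as a toy model. This is only an in-memory simulation of the finalizer rule (an object marked for deletion is removed only once its finalizer list is empty). It does not call the Kubernetes API or any Flyte code, and `ToyCluster`, `observe`, and the finalizer name are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical finalizer name, for illustration only.
FINALIZER = "example.com/toy-finalizer"


@dataclass
class Pod:
    name: str
    finalizers: list = field(default_factory=list)
    deletion_requested: bool = False


class ToyCluster:
    """In-memory stand-in for the API server's delete/garbage-collect behaviour."""

    def __init__(self):
        self.pods = {}

    def create(self, pod: Pod) -> None:
        self.pods[pod.name] = pod

    def delete(self, name: str) -> None:
        # Deleting only marks the pod; removal happens in _collect.
        self.pods[name].deletion_requested = True
        self._collect(name)

    def remove_finalizer(self, name: str, finalizer: str) -> None:
        self.pods[name].finalizers.remove(finalizer)
        self._collect(name)

    def get(self, name: str):
        return self.pods.get(name)

    def _collect(self, name: str) -> None:
        pod = self.pods[name]
        # Garbage collect only when marked for deletion AND no finalizers remain.
        if pod.deletion_requested and not pod.finalizers:
            del self.pods[name]


def observe(cluster: ToyCluster, name: str) -> str:
    """What a controller checking on the pod would conclude."""
    pod = cluster.get(name)
    if pod is None:
        return "ResourceDeletedExternally: pod not found"
    if pod.deletion_requested:
        # The controller records the failure, then releases the pod.
        cluster.remove_finalizer(name, FINALIZER)
        return "retryable failure: pod deleted externally"
    return "running"


without = ToyCluster()
without.create(Pod("subtask-0"))
without.delete("subtask-0")
print(observe(without, "subtask-0"))

with_finalizer = ToyCluster()
with_finalizer.create(Pod("subtask-0", finalizers=[FINALIZER]))
with_finalizer.delete("subtask-0")
print(observe(with_finalizer, "subtask-0"))
```

In the second run the pod is still visible when the controller looks at it, and it disappears only after the finalizer is removed.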

👀 1

nice-zebra-99977

07/27/2022, 7:14 PM

We set the finalizer to true and we can see it in our propeller config map, but it seems to have had no impact on our job. It still failed with OOM.

nice-zebra-99977

07/27/2022, 7:16 PM

Ok so I think i understand this better then, the finalizer isnt the solution to mem cache but it will keep the pods around longer so you can view the error?

shy-holiday-15500

07/27/2022, 8:23 PM

Thanks for that explanation Dan!

shy-holiday-15500

07/27/2022, 8:23 PM

So the finalizer is the fix for the pod getting deleted after the OOM, not the fix for the OOM itself

shy-holiday-15500

07/27/2022, 8:23 PM

e.g., Flyte will correctly register the OOM, instead of this error: `reason: pods \"fb2xnzxy-n2-0-0\" not found`

shy-holiday-15500

07/27/2022, 8:23 PM

Thank you!

shy-holiday-15500

07/27/2022, 8:44 PM

We're still figuring out the OOM issue. It's confusing because it seems like the RAM cache is causing our OOM problems (actual malloc'd memory is way below our limit, but actually used memory + cache is up to the limit, which we think is causing K8s to evict/OOM).
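
A rough sketch of how "used" versus "used + cache" can be separated from inside a container, assuming the commonly described working-set calculation of total usage minus inactive file cache. The function names are made up, the sample numbers are invented, and the key names differ by cgroup version (`total_inactive_file` in v1 `memory.stat`, `inactive_file` in v2).

```python
def parse_memory_stat(text: str) -> dict:
    """Parse the 'key value' lines of a cgroup memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def working_set_bytes(usage_bytes: int, stats: dict) -> int:
    """Usage minus inactive file cache; active file cache still counts."""
    inactive_file = stats.get("total_inactive_file", stats.get("inactive_file", 0))
    return max(usage_bytes - inactive_file, 0)


GIB = 1024 ** 3

# Invented sample values, not taken from the job in this thread.
sample_stat = f"""\
cache {6 * GIB}
rss {2 * GIB}
total_inactive_file {5 * GIB}
total_active_file {1 * GIB}
"""

stats = parse_memory_stat(sample_stat)
usage = 8 * GIB  # e.g. read from memory.usage_in_bytes (v1) or memory.current (v2)
print(working_set_bytes(usage, stats) / GIB)
```

With these sample numbers the working set is 3 GiB even though only 2 GiB is RSS, because active file cache is not subtracted. That is the kind of gap that makes a dashboard showing RSS alone look healthy.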

shy-holiday-15500

07/27/2022, 9:35 PM

Hmm, we may have found the issue... gotta love bioinformatics libraries... one of the C binaries we use to analyze data has a flag called `--memory` that appears to reserve a bunch of memory if you don't override the defaults. We're still tracing the source code to figure out what this flag does (and why the RAM is showing up as cached), but it's definitely not a Flyte problem, it seems

shy-holiday-15500

07/27/2022, 9:35 PM

Or k8s problem probably

freezing-airport-6809

07/27/2022, 9:43 PM

@shy-holiday-15500 - TBH - it is possible to submit a memory profile as part of FlyteDecks

freezing-airport-6809

07/27/2022, 9:44 PM

if you are interested in contributing

freezing-airport-6809

07/27/2022, 9:44 PM

that could help debug these situations?

shy-holiday-15500

07/27/2022, 9:44 PM

Ooh

shy-holiday-15500

07/27/2022, 9:44 PM

That would be awesome!

shy-holiday-15500

07/27/2022, 9:44 PM

We have grafana + prometheus set up

shy-holiday-15500

07/27/2022, 9:44 PM

But we got led astray by the outstanding k8s issue w/ RAM caching

shy-holiday-15500

07/27/2022, 9:44 PM

(that was a red herring, it was just a regular OOM)

shy-holiday-15500

07/27/2022, 9:45 PM

Traces in flyte deck + jupyter notebook papermill task outputs as flyte decks are both interesting to us, would be fun to contribute both/either of those

freezing-airport-6809

07/27/2022, 9:45 PM

ohh papermill one is reallly simple

shy-holiday-15500

07/27/2022, 9:45 PM

Yeah we've been doing the deck as a follow up task

freezing-airport-6809

07/27/2022, 9:45 PM

its already created as an html, should be simple to add

✔️ 1

freezing-airport-6809

07/27/2022, 9:46 PM

@shy-holiday-15500 ❤️ all contributions are loved

shy-holiday-15500

07/27/2022, 9:46 PM

I may grab that jupyter notebook deck one when I have a spare minute, seems like a good first contribution

🙇 1

thousands-area-8239

07/29/2022, 4:58 PM

Thanks for providing that information on finalizers @hallowed-mouse-14616 and @freezing-airport-6809. Just to follow-up on this thread, we have enabled finalizers in our flyte deployment and have confirmed the configuration in the configmap. However, we are still encountering the error. We queried the pod names in the dataplane log group and see no obvious indication as to why the pod is being killed. We do see

I0729 16:10:23.041121 4106 kuberuntime_manager.go:484] "No sandbox for pod can be found. Need to start a new one" pod="e2e-workflows-development/fnrte65a-n3-0-108"

but no other meaningful logs between that and the time it is deleted and removed. We've included screenshots of the finalizers in the configmap and being applied to the pods, as well as the `ResourceDeletedExternally` error. Any thoughts on what could be happening here or where else we could look for insight?

freezing-airport-6809

07/29/2022, 6:02 PM

This is a kubelet / k8s configuration issue. Sorry, Flyte folks won't be able to help. I think Flyte is doing the right thing - the pod is being held on to now?

thx 1

thousands-area-8239

07/29/2022, 6:24 PM

We see the `ResourceDeletedExternally` error in the Flyte console. We are assuming this indicates that flytepropeller is unable to gather the logs and the pod is being cleaned up by the kubelet, despite the finalizers

hallowed-mouse-14616

07/29/2022, 6:35 PM

Correct, I know exactly where this happens in the code. When retrieving the Pod (to check status), k8s returns "object does not exist", but we know that it should exist because of the node status (i.e. creating it succeeded earlier). Injecting finalizers is the only thing that Flyte can do to ensure a Pod is not deleted externally, but I know we've run into scenarios where a cleanup mechanism will not recognize the finalizers and will delete and garbage collect the Pods regardless.

thx 1

shy-holiday-15500

07/29/2022, 7:14 PM

Thanks for the info! We're investigating now (edit: what caused the pod to terminate/be terminated and garbage collected), and will post back here when we figure something out (for future searchers, unless we just add it to the AWS deploy docs, in the unlikely case this is some issue with how we have EKS configured)

shy-holiday-15500

08/18/2022, 11:24 AM

Okay, this took us entirely too long (couldn't reproduce outside of prod, only happened in map tasks at decent scale).

We're pretty sure this is some kind of interaction between flyte propeller, eks node-manager, and ASG Availability Zone rebalancing. The ultimate error message seems to vary depending on some details I'm still not 100% sure of, but it shows up as either "resource not found", "resource manually deleted", or a "panic when executing a plugin [k8s-array]" (typically with the stack trace showing LaunchAndCheckSubTasksState or something else near go/tasks/plugins/array/k8s/management.go:24, i.e., while propeller is trying to check on the state of the task).

We explored the control plane and K8s API logs in more detail, and discovered that these failures seem to always occur after an eviction request by a lambda function that is owned by AWS (i.e., part of the managed EKS node group). On more exploration, we suspect (waiting on AWS support to confirm) that this lambda function (AWSWesleyClusterManagerLambda) is the one responsible for "graceful" eviction after AZ rebalancing activities. We also confirmed in the ASG logs that an AZ rebalancing occurred before each of these errors, and that this lambda sent an eviction request to the K8s API.

The missing piece here (to me anyway) is why the finalizers don't prevent this. We've confirmed using kubectl that the finalizers are getting applied correctly. I would expect the eviction to respect them (need to read up more on k8s internals around eviction, I suppose).

We disabled AZ rebalancing on the ASG yesterday, and this particular error seems to have stopped (time will tell if that fully eradicates it).

Also, one thing I'm less confident about: this issue seems to occur more (maybe only) on the last/terminal retry of a task. I.e., if we have 2 retries on a task, we get this error when an AZ rebalancing disrupts the 2nd retry. This occurs rarely (we're running map tasks with 100-1000 elements, and it'll often be only 1 or 2 pods that are on the 2nd/final retry).
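
For anyone wanting to repeat the correlation described above, a minimal sketch of the check: for each failure timestamp, find the most recent rebalancing event within a window before it. The function name, the 10-minute window, and the timestamps are all made up for illustration, and pulling the real events out of the ASG and K8s API logs is left out.

```python
from datetime import datetime, timedelta


def failures_after_rebalance(failures, rebalances, window=timedelta(minutes=10)):
    """Map each failure time to the latest rebalance within `window` before it, or None."""
    matched = {}
    for failure in failures:
        prior = [r for r in rebalances if timedelta(0) <= failure - r <= window]
        matched[failure] = max(prior) if prior else None
    return matched


# Invented timestamps, not taken from the logs discussed in this thread.
rebalances = [datetime(2022, 8, 1, 10, 0), datetime(2022, 8, 1, 14, 30)]
failures = [datetime(2022, 8, 1, 10, 4), datetime(2022, 8, 1, 12, 0)]

for failure, rebalance in failures_after_rebalance(failures, rebalances).items():
    print(failure.isoformat(), "<-", rebalance.isoformat() if rebalance else "no rebalance nearby")
```

A failure with no rebalance nearby would be a hint that something other than AZ rebalancing is also deleting pods.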

freezing-airport-6809

08/18/2022, 2:39 PM

Can you capture this on an issue

freezing-airport-6809

08/18/2022, 2:39 PM

I do not want to lose this info

shy-holiday-15500

08/18/2022, 2:40 PM

Yes totally!

❤️ 1

shy-holiday-15500

08/18/2022, 2:40 PM

I'll get it into an issue today and include cloudwatch screenshots

freezing-airport-6809

08/18/2022, 2:40 PM

Perfect helps on Google search

👍 1

shy-holiday-15500

08/18/2022, 4:03 PM

Moved to https://github.com/flyteorg/flyte/issues/2788, will keep adding to it (it's still a bit light on detail)

🙌 2
