hi team happy Monday We are getting `terminated in the backg Flyte #announcements

orange-hairdresser-63684

04/18/2022, 6:30 PM

hi team, happy Monday. We are getting `terminated in the background, manually` intermittently. No errors are reported in the pod log. Any suggestions on how to troubleshoot?

high-accountant-32689

04/18/2022, 6:35 PM

@orange-hairdresser-63684, what service is this?

hallowed-mouse-14616

04/18/2022, 6:38 PM

Hi Alex, so this happens in FlytePropeller when it detects a pod that has the deletion timestamp set but the pod is not in one of Flyte's terminal states (e.g. success, failure). This means that something other than Flyte deleted the pod.
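For readers following along, the condition described above can be sketched roughly as follows. This is an illustration, not FlytePropeller's actual code (which is written in Go). It assumes pods are plain dicts shaped like `kubectl get pod -o json` output, and it assumes the Kubernetes pod phases `Succeeded` and `Failed` stand in for Flyte's terminal states. The function name and the sample pods are hypothetical.

```python
# Illustrative only: the real check lives in FlytePropeller (Go) and may differ.
# Pods are modelled as plain dicts shaped like `kubectl get pod -o json` output.

# Assumption: Kubernetes pod phases stand in for Flyte's terminal states.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def deleted_outside_flyte(pod: dict) -> bool:
    """Return True when the pod is marked for deletion but has not finished."""
    marked_for_deletion = pod.get("metadata", {}).get("deletionTimestamp") is not None
    phase = pod.get("status", {}).get("phase")
    return marked_for_deletion and phase not in TERMINAL_PHASES


# Hypothetical pods for illustration.
running_but_deleted = {
    "metadata": {"name": "task-a", "deletionTimestamp": "2022-04-18T18:00:00Z"},
    "status": {"phase": "Running"},
}
finished_and_deleted = {
    "metadata": {"name": "task-b", "deletionTimestamp": "2022-04-18T18:00:00Z"},
    "status": {"phase": "Succeeded"},
}
still_running = {"metadata": {"name": "task-c"}, "status": {"phase": "Running"}}
```

Only the first pod matches: it carries a deletion timestamp while still running, which is the situation reported as `terminated in the background, manually`.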

hallowed-mouse-14616

04/18/2022, 6:38 PM

As I understand it, we have seen this in a number of different scenarios. The simplest is somebody deleting the pod by hand, but if a kubelet fails it can mark its pods as deleted as well.

hallowed-mouse-14616

04/18/2022, 6:39 PM

Can you say anything more about the setup that might help debugging? It is certainly a difficult scenario.

orange-hairdresser-63684

04/18/2022, 6:43 PM

we keep seeing this on some workflows, but haven't found a correlation yet

orange-hairdresser-63684

04/18/2022, 6:45 PM

most recently, a task was running for about 15 hours and then failed with this error. But no errors were reported in the container log. Then 3 attempts to restart the task failed immediately (it looks like they show `terminated in the background, manually` and no pod logs at all)

hallowed-mouse-14616

04/18/2022, 6:49 PM

Are you saying that the Flyte error doesn't have any information about the pod logs? Or that the k8s pod logs (i.e. `kubectl describe pod`, `kubectl logs ...`) show nothing interesting?

orange-hairdresser-63684

04/18/2022, 6:53 PM

both. The only error that Flyte shows is `terminated in the background, manually`. We send logs to Stackdriver and there is nothing interesting there. I can't do describe/logs because by the time we learn about the issue the pod is gone

hallowed-mouse-14616

04/18/2022, 6:57 PM

Ah ok. So FlytePropeller has a few config options that might help debug this.

hallowed-mouse-14616

04/18/2022, 6:58 PM

--plugins.k8s.delete-resource-on-finalize   Instructs the system to delete the resource on finalize. This ensures that no resources are kept around (potentially consuming cluster resources). This, however, will cause k8s log links to expire as soon as the resource is finalized.
--plugins.k8s.inject-finalizer              Instructs the plugin to inject a finalizer on startTask and remove it on task termination.

hallowed-mouse-14616

04/18/2022, 6:58 PM

if you enable inject-finalizer then propeller will add a Flyte finalizer onto each pod it creates, and disabling delete-resource-on-finalize means that propeller will not delete the pod when it finds a terminal state.

hallowed-mouse-14616

04/18/2022, 6:59 PM

using these you should be able to make sure k8s doesn't delete the pod before you can view the logs. i suspect that this is the best path to figuring out why the pods are being deleted.
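To make the effect of a finalizer concrete, here is a toy in-memory model of the general Kubernetes rule: a delete request on an object that still has finalizers only sets its deletion timestamp, and the object is removed once the finalizer list is empty. Everything here is a simplified sketch; `ToyApiServer`, `Pod`, and the finalizer name are hypothetical and are not Flyte or Kubernetes APIs, and the model ignores grace periods and container shutdown.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional

# Hypothetical finalizer name, not the one Flyte actually injects.
HOLD = "example.com/hold-for-debugging"


@dataclass
class Pod:
    name: str
    finalizers: List[str] = field(default_factory=list)
    deletion_timestamp: Optional[datetime] = None


class ToyApiServer:
    """Tiny in-memory stand-in for the Kubernetes API server."""

    def __init__(self) -> None:
        self.pods: Dict[str, Pod] = {}

    def create(self, pod: Pod) -> None:
        self.pods[pod.name] = pod

    def delete(self, name: str) -> None:
        pod = self.pods[name]
        # A delete request only marks the object for deletion...
        pod.deletion_timestamp = datetime.now(timezone.utc)
        self._reap(name)

    def remove_finalizer(self, name: str, finalizer: str) -> None:
        pod = self.pods[name]
        pod.finalizers.remove(finalizer)
        self._reap(name)

    def _reap(self, name: str) -> None:
        pod = self.pods[name]
        # ...and it is removed only once no finalizers remain.
        if pod.deletion_timestamp is not None and not pod.finalizers:
            del self.pods[name]


api = ToyApiServer()
api.create(Pod("plain"))
api.create(Pod("held", finalizers=[HOLD]))

api.delete("plain")  # removed immediately
api.delete("held")   # marked for deletion but still present
```

In this model the `held` pod stays in `api.pods` with a deletion timestamp until `api.remove_finalizer("held", HOLD)` is called, which is the window in which the object can still be inspected.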

orange-hairdresser-63684

04/18/2022, 7:00 PM

i see that we have `inject-finalizer` enabled

orange-hairdresser-63684

04/18/2022, 7:02 PM

sorry, i'm not sure i fully understand what these do. We have `inject-finalizer` enabled, are you saying the pods should not be deleted?

freezing-airport-6809

04/18/2022, 7:04 PM

hmm, this usually means that the pod was reclaimed without telling Flyte about it

freezing-airport-6809

04/18/2022, 7:04 PM

so inject-finalizer should prevent this

freezing-airport-6809

04/18/2022, 7:04 PM

@orange-hairdresser-63684 do you want to hop on a call in a bit?

freezing-airport-6809

04/18/2022, 7:04 PM

i want to see something

orange-hairdresser-63684

04/18/2022, 7:04 PM

"disabling the delete resource on finalize means that propeller will not delete the pod when it finds a terminal state."

but as you said earlier:

"This means that something that is not Flyte deletes the pod."

orange-hairdresser-63684

04/18/2022, 7:04 PM

@freezing-airport-6809 sure

freezing-airport-6809

04/18/2022, 7:05 PM

https://meet.google.com/iio-ejno-zrs

freezing-airport-6809

04/18/2022, 7:05 PM

@orange-hairdresser-63684 / @hallowed-mouse-14616 ^?

freezing-airport-6809

04/18/2022, 8:31 PM

cc @hallowed-mouse-14616 did you guys solve it?

hallowed-mouse-14616

04/18/2022, 8:57 PM

There was nothing interesting in the logs around the failures. Just propeller detecting the deleted pod and aborting the workflow. @orange-hairdresser-63684 was going to keep us in the loop about checking in the garbage collector.

orange-hairdresser-63684

04/22/2022, 6:01 PM

@freezing-airport-6809 @hallowed-mouse-14616 - just FYI, @astonishing-lizard-78628 found one cause of the issue. There's a bug in k8s that prevents mounting a volume with AWS creds used by OIDC: https://github.com/kubernetes/kubernetes/issues/100047 This, however, only explains why some tasks won't start, not why tasks fail while running. For the latter case we are exploring two suspects: our custom GC and resource leaks
