#announcements

Alex Pozimenko

04/18/2022, 6:30 PM
hi team, happy Monday. We are getting `terminated in the background, manually` intermittently. No errors are reported in the pod log. Any suggestions on how to troubleshoot?

Eduardo Apolinario (eapolinario)

04/18/2022, 6:35 PM
@Alex Pozimenko, what service is this?

Dan Rammer (hamersaw)

04/18/2022, 6:38 PM
Hi Alex, so this happens in FlytePropeller when it detects a pod that has the deletion timestamp set but the pod is not in one of Flyte's terminal states (e.g. success, failure). This means that something that is not Flyte deletes the pod.
As I understand it, we have seen this in a number of different scenarios. The simplest is somebody manually deleting the pod, but if a kubelet fails, the pods can end up marked as deleted as well.
Can you say anything more about the setup that might help debugging? It is certainly a difficult scenario.
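
The condition described above can be sketched as a small predicate. This is a hedged illustration in Python, not FlytePropeller's actual code: the function name `deleted_in_background` is hypothetical, and mapping Flyte's terminal states onto the Kubernetes pod phases `Succeeded` and `Failed` is an assumption made for the example. The input is a single pod object shaped like the output of `kubectl get pod <name> -o json`.

```python
# Assumption: Flyte's terminal states are approximated here by k8s pod phases.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def deleted_in_background(pod: dict) -> bool:
    """Return True if the pod is marked for deletion but never reached a terminal phase."""
    metadata = pod.get("metadata", {})
    phase = pod.get("status", {}).get("phase")
    # Kubernetes sets metadata.deletionTimestamp once a delete has been requested.
    marked_for_deletion = metadata.get("deletionTimestamp") is not None
    return marked_for_deletion and phase not in TERMINAL_PHASES
```

A pod that is still `Running` with a deletion timestamp matches; a pod that already finished, or one with no deletion timestamp, does not.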

Alex Pozimenko

04/18/2022, 6:43 PM
we keep seeing this on some workflows, but haven't found a correlation yet
most recently, a task was running for about 15 hours and then failed with this error, but no errors were reported in the container log. Then 3 attempts to restart the task failed immediately (it looks like they show `terminated in the background, manually` and no pod logs at all)

Dan Rammer (hamersaw)

04/18/2022, 6:49 PM
Are you saying that the Flyte error doesn't have any information about the pod logs? Or that the k8s pod logs (i.e. `kubectl describe pod` / `kubectl logs ...`) show nothing interesting?

Alex Pozimenko

04/18/2022, 6:53 PM
both. The only error that Flyte shows is `terminated in the background, manually`. We send logs to Stackdriver and there is nothing interesting there. I can't do describe/logs because by the time we learn about the issue the pod is gone

Dan Rammer (hamersaw)

04/18/2022, 6:57 PM
Ah ok. So FlytePropeller has a few config options that might help debug this.
      --plugins.k8s.delete-resource-on-finalize    Instructs the system to delete the resource on finalize. This ensures that no resources are kept around (potentially consuming cluster resources). This, however, will cause k8s log links to expire as soon as the resource is finalized.
      --plugins.k8s.inject-finalizer               Instructs the plugin to inject a finalizer on startTask and remove it on task termination.
if you enable inject-finalizer, propeller will add a Flyte finalizer onto each pod it creates, and disabling the delete resource on finalize means that propeller will not delete the pod when it finds a terminal state.
using these, you should be able to make sure k8s doesn't delete the pod before you can view the logs. i suspect that this is the best path to figuring out why the pods are being deleted.
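
With a finalizer in place, a pod that something tries to delete should remain visible with `metadata.deletionTimestamp` set until the finalizer is removed (this is standard Kubernetes finalizer behavior). A minimal sketch of how that window might be used: the helper below takes the parsed output of `kubectl get pods -o json` and builds the describe, logs, and events commands to run for each pod that is held this way. The function name `inspection_commands` is hypothetical, and the helper does not check which finalizer is holding the pod.

```python
import shlex
from typing import List


def inspection_commands(pod_list: dict) -> List[str]:
    """Build kubectl commands for pods that are marked for deletion but held by a finalizer."""
    commands = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # Skip pods that are not being deleted, or that no finalizer is holding.
        if not meta.get("deletionTimestamp") or not meta.get("finalizers"):
            continue
        name = shlex.quote(meta["name"])
        namespace = shlex.quote(meta.get("namespace", "default"))
        commands.append(f"kubectl describe pod {name} -n {namespace}")
        commands.append(f"kubectl logs {name} -n {namespace} --all-containers")
        commands.append(
            f"kubectl get events -n {namespace} --field-selector involvedObject.name={name}"
        )
    return commands
```

The returned strings are only printed or reviewed, never executed by the helper, so it is safe to run against a production cluster's pod listing.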

Alex Pozimenko

04/18/2022, 7:00 PM
i see that we have `inject-finalizer` enabled
sorry, i'm not sure i fully understand what these do. We have `inject-finalizer` enabled, are you saying the pods should not be deleted?

Ketan (kumare3)

04/18/2022, 7:04 PM
hmm, this usually means that the pod was reclaimed without telling Flyte about it
so inject-finalizer should prevent this
@Alex Pozimenko do you want to hop on a call in a bit?
i want to see something

Alex Pozimenko

04/18/2022, 7:04 PM
"disabling the delete resource on finalize means that propeller will not delete the pod when it finds a terminal state."
but as you said earlier: "This means that something that is not Flyte deletes the pod."
@Ketan (kumare3) sure

Ketan (kumare3)

04/18/2022, 7:05 PM
@Alex Pozimenko / @Dan Rammer (hamersaw) ^?
cc @Dan Rammer (hamersaw) did you guys solve it?

Dan Rammer (hamersaw)

04/18/2022, 8:57 PM
There was nothing interesting in the logs around the failures. Just propeller detecting the deleted pod and aborting the workflow. @Alex Pozimenko was going to keep us in the loop about checking in the garbage collector.

Alex Pozimenko

04/22/2022, 6:01 PM
@Ketan (kumare3) @Dan Rammer (hamersaw) - just FYI, @Alex Bain found one cause of the issue. There's a bug in k8s that prevents mounting a volume with AWS creds used by OIDC: https://github.com/kubernetes/kubernetes/issues/100047 This, however, only explains why some tasks won't start, not why tasks fail while running. For the latter case we are exploring two suspects: our custom GC and resource leaks