# flyte-github
#354 Improve demystifying GKE spot node preemption #patch
Pull request opened by bstadlbauer

**TL;DR**
Fixes a bug where propeller would incorrectly label spot node preemption as user error.

**Type**
☑︎ Bug Fix
☐ Feature
☐ Plugin

**Are all requirements met?**
☐ Code completed
☐ Smoke tested
☑︎ Unit tests added
☐ Code documentation added
☐ Any pending items have an associated Issue

**Complete description**
We've had a lot of issues recently where tasks running on spot instances would not be re-scheduled onto regular instances after they've been preempted by GKE. We've been able to consistently replicate this by:

1. Starting an interruptible task with some retries.
2. As soon as it's up, going to the VM instances page and stopping (not deleting) the instance. According to the Google Cloud docs, this is the same as preemption (a scriptable equivalent is sketched after this list).

Doing this, none of the retry attempts were scheduled onto non-spot instances.
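Since stopping a Spot VM is documented as equivalent to preemption, the console step in item 2 can also be scripted. A minimal sketch; the instance name and zone are hypothetical placeholders, not values from the PR:

```sh
# Stop (not delete) the spot node's VM to simulate GKE preemption.
# Instance name and zone are hypothetical placeholders.
gcloud compute instances stop gke-spot-pool-1234 --zone=us-central1-a
```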
I've debugged this by running a local instance of `flytepropeller` with a local version of `flyteplugins`, and used this instance instead of the one usually running in our cluster.
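The PR doesn't spell out how the local wiring was done; one common way to point propeller at a local plugins checkout (an assumption, not the author's stated setup) is a `replace` directive in flytepropeller's `go.mod`:

```
// In the local flytepropeller checkout's go.mod; the relative path
// is a hypothetical example.
replace github.com/flyteorg/flyteplugins => ../flyteplugins
```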
I then set a breakpoint here: `flyteplugins/go/tasks/pluginmachinery/flytek8s/pod_helper.go`, line 625 at commit [`8a2f8ca`](https://github.com/flyteorg/flyteplugins/commit/8a2f8ca2e723d067c4915b8a9ec2960eb4ff6526), and could see that the reason is `"Terminated"` instead of `"Shutdown"`. Once I added `"Terminated"`, things worked as expected.

All checks have passed (7/7 successful checks).
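To make the failure mode concrete: propeller classifies a failed pod by matching the reported reason string, and on GKE spot preemption that string can be `"Terminated"` rather than the `"Shutdown"` the check originally looked for. A minimal Go sketch of that shape (illustrative names, not the actual `pod_helper.go` code):

```go
package main

import "fmt"

// Reason strings a kubelet may set on a pod that was killed because its
// node is going away. The existing check only matched "Shutdown"; on GKE
// spot preemption the author observed "Terminated" instead.
const (
	reasonShutdown   = "Shutdown"
	reasonTerminated = "Terminated"
)

// isNodeShutdown reports whether a pod failure reason indicates node
// preemption/shutdown (a retryable system error) rather than a user
// error. Illustrative sketch only, not the flyteplugins implementation.
func isNodeShutdown(reason string) bool {
	return reason == reasonShutdown || reason == reasonTerminated
}

func main() {
	for _, r := range []string{"Shutdown", "Terminated", "OOMKilled"} {
		fmt.Printf("reason=%q -> node shutdown: %v\n", r, isNodeShutdown(r))
	}
}
```

Under this reading, once `"Terminated"` is accepted alongside `"Shutdown"`, preempted attempts are treated as system errors and retries can land on non-spot instances, which matches the behavior the author observed after the change.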