#354 Improve demystifying GKE spot node preemption #patch
Pull request opened by bstadlbauer
TL;DR
Fixes a bug where propeller would incorrectly label spot node preemption as a user error.
Type
☑︎ Bug Fix
☐ Feature
☐ Plugin
Are all requirements met?
☐ Code completed
☐ Smoke tested
☑︎ Unit tests added
☐ Code documentation added
☐ Any pending items have an associated Issue
Complete description
We've had a lot of issues recently where tasks running on spot instances would not be re-scheduled onto regular instances after they've been preempted by GKE.
We've been able to consistently replicate this by:
1. Starting an interruptible task with some retries
2. As soon as it's up, going to the VM instances page and stopping (not deleting) the instance. According to the Google Cloud docs, this is equivalent to preemption.
With this procedure, none of the retry attempts were scheduled onto non-spot instances.
I've debugged this by running a local instance of `flytepropeller` with a local version of `flyteplugins`, in place of the one usually running in our cluster. I then set a breakpoint at `flyteplugins/go/tasks/pluginmachinery/flytek8s/pod_helper.go`, line 625 (commit 8a2f8ca),
and could see that the error code was "Terminated" instead of "Shutdown". Once I added "Terminated" to the handled codes, things worked as expected.
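For context, here is a minimal sketch of the classification this fix targets, assuming the demystify logic keys off the pod's status reason; the helper name below is hypothetical and not the actual flyteplugins code:

```go
package demystify

import v1 "k8s.io/api/core/v1"

// isNodeShutdownReason is a hypothetical helper illustrating the fix:
// a failed pod whose status reason signals that the node was shut down
// underneath it should be classified as a retriable system error, not
// a user error, so retries can be promoted to non-spot instances.
func isNodeShutdownReason(status v1.PodStatus) bool {
	switch status.Reason {
	case "Shutdown", // set by the kubelet's graceful node shutdown
		"Terminated": // observed on GKE when a spot node is preempted
		return true
	}
	return false
}
```

Because the pre-fix check matched "Shutdown" only, GKE preemptions fell through to the user-error path, which is why none of the retries were rescheduled onto non-spot instances.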
flyteorg/flyteplugins: ✅ All checks have passed (7/7 successful).