#announcements

Sören Brunk

07/26/2022, 6:22 AM
Hi, I was surprised that my long-running interruptible training task was scheduled onto a non-spot instance after a failure (probably due to preemption). Thinking about it, it makes sense as the default behavior, and it is also documented:
If your task gets preempted, Flyte will retry your task on a non-spot (regular) instance. This retry will not count towards a retry that a user sets.
Still, I'd like to configure Flyte to try one more time on a spot instance. There's an interruptible-failure-threshold setting ("number of failures for a node to be still considered interruptible"). Is this the right config to tweak that behavior?

Jay Ganbat

07/26/2022, 8:11 AM
@jeev this is the same behavior we were seeing, right?

Dan Rammer (hamersaw)

07/26/2022, 12:02 PM
@Sören Brunk yes, that is the correct configuration. It looks like here we compare the number of system failures (the designation for interruptions) with the interruptible-failure-threshold and unset interruptible if the number of failures exceeds the threshold. Intuitively, you should be able to set the interruptible-failure-threshold to a number higher than the number of retries to ensure the task is only executed on spot instances.
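
To make the threshold comparison concrete, here is a minimal Python sketch of the decision described above. It is not Flyte's code: the function names `placement` and `placements` are hypothetical, and the exact comparison (treating a failure count equal to the threshold as no longer interruptible) is an assumption chosen to match the documented behavior of falling back to a regular instance after a preemption.

```python
def placement(system_failures: int, threshold: int) -> str:
    """Return where the next attempt would run under the assumed rule.

    Assumption: the attempt stays interruptible (spot) only while the
    number of system failures is below the threshold.
    """
    return "spot" if system_failures < threshold else "regular"


def placements(max_failures: int, threshold: int) -> list[str]:
    """Placement for each system-failure count from 0 to max_failures."""
    return [placement(n, threshold) for n in range(max_failures + 1)]


if __name__ == "__main__":
    # With a threshold of 1, the first preemption moves the retry off spot.
    print(placements(2, threshold=1))
    # With a threshold above the retry count, every attempt stays on spot.
    print(placements(2, threshold=5))
```

Under these assumptions, raising the threshold to 2 would give exactly one more attempt on a spot instance before falling back to a regular one.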

Sören Brunk

07/26/2022, 12:13 PM
Awesome! Thanks for the explanation, @Dan Rammer (hamersaw)