Hi, I was surprised that my long running interrupt...
# announcements
s
Hi, I was surprised that my long-running interruptible training task was scheduled onto a non-spot instance after a failure (probably due to preemption). Thinking about it, it makes sense as default behavior, and it is also documented:
If your task gets preempted, Flyte will retry your task on a non-spot (regular) instance. This retry will not count towards a retry that a user sets.
Still, I'd like to configure Flyte to try one more time on a spot instance. There's an interruptible-failure-threshold setting ("number of failures for a node to be still considered interruptible"). Is this the right config to tweak that behavior?
j
@jeev this is the same behavior we were seeing, right?
d
@Sören Brunk yes, that is the correct configuration. It looks like here we compare the number of system failures (the designation for interruptions) with the interruptible-failure-threshold and unset interruptible if the number of failures exceeds the threshold. Intuitively, you should be able to set the interruptible-failure-threshold to a number higher than the number of retries to ensure the task is only executed on spot instances.
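To make the comparison concrete, here is a rough Python sketch of the rule as described above. It is not Flyte's actual code. The function names are hypothetical, and the exact boundary (whether the node stops being interruptible once failures reach the threshold or only once they exceed it) is an assumption worth checking against the propeller source.

```python
def still_interruptible(system_failures: int, threshold: int) -> bool:
    """Hypothetical helper: is the next attempt still scheduled as interruptible?

    Assumption: the node stays interruptible until the number of system
    failures reaches the threshold. The real boundary may differ by one.
    """
    return system_failures < threshold


def placements(threshold: int, max_attempts: int) -> list[str]:
    """Simulate where each attempt would land if every attempt is preempted."""
    return [
        "spot" if still_interruptible(failures, threshold) else "on-demand"
        for failures in range(max_attempts)
    ]


if __name__ == "__main__":
    # With a threshold of 2, the first retry would still go to a spot instance.
    print(placements(threshold=2, max_attempts=3))
    # With a threshold above the number of attempts, every attempt stays on spot.
    print(placements(threshold=4, max_attempts=3))
```

Under that assumption, a threshold of 2 gives one extra try on spot before falling back, and a threshold larger than the total number of attempts keeps every attempt on spot.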
s
Awesome! Thanks for the explanation @Dan Rammer (hamersaw)