Hi I was surprised that my long running interruptible traini Flyte #announcements

boundless-pizza-95864

07/26/2022, 6:22 AM

Hi, I was surprised that my long-running interruptible training task was scheduled onto a non-spot instance after a failure (probably due to preemption). Thinking about it, this makes sense as the default behavior, and it is also documented:

If your task gets preempted, Flyte will retry your task on a non-spot (regular) instance. This retry will not count towards a retry that a user sets.

Still, I'd like to configure Flyte to try one more time on a spot instance. There's an interruptible-failure-threshold setting ("number of failures for a node to be still considered interruptible"). Is this the right config to tweak that behavior?

magnificent-teacher-86590

07/26/2022, 8:11 AM

@freezing-boots-56761 this is the same behavior we were seeing, right?

hallowed-mouse-14616

07/26/2022, 12:02 PM

@boundless-pizza-95864 yes, that is the correct configuration. It looks like here we compare the number of system failures (the designation for interruptions) with the interruptible-failure-threshold and unset interruptible if the number of failures exceeds the threshold. Intuitively, you should be able to set interruptible-failure-threshold to a number higher than the number of retries to ensure the task is only executed on spot instances.
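To make the comparison described above concrete, here is a minimal Python sketch of the decision. It is only a model, not Flyte's code: the function names are hypothetical, and the exact boundary (whether the node stops being interruptible once failures reach the threshold or only once they go past it) is an assumption that should be checked against the propeller source. The sketch assumes the node stays interruptible while the system failure count is below the threshold, which matches the documented behavior of one preemption followed by a non-spot retry when the threshold is 1.

```python
from typing import List


def is_still_interruptible(system_failures: int, threshold: int) -> bool:
    """Hypothetical model of the check discussed in the thread.

    Assumption: the node stays interruptible while the number of system
    failures is strictly below the threshold.
    """
    return system_failures < threshold


def placement_per_attempt(attempts: int, threshold: int) -> List[str]:
    """Model the worst case where every earlier attempt was preempted.

    Attempt i therefore starts with i system failures already recorded.
    """
    return [
        "spot" if is_still_interruptible(failures, threshold) else "on-demand"
        for failures in range(attempts)
    ]


# Threshold of 1: the first preemption moves the retry off spot.
print(placement_per_attempt(3, 1))
# Threshold of 2: one more try on spot before falling back.
print(placement_per_attempt(3, 2))
# Threshold above the number of attempts: every attempt stays on spot.
print(placement_per_attempt(3, 4))
```

Under this model, raising the threshold by one buys one additional spot attempt, and a threshold larger than the total number of attempts keeps the task on spot instances throughout.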

boundless-pizza-95864

07/26/2022, 12:13 PM

Awesome! Thanks for the explanation @hallowed-mouse-14616

👍 1
