# flyte-support
m
We make heavy use of interruptible tasks so that we can run on spot instances for cost savings. One issue I'd like to solve is that it seems the interruptible mechanism can't really distinguish between a spot instance being taken away by AWS (in which case a retry is appropriate) or when the task OOMs, in which case a retry is not going to help. Is it possible to configure interruptible tasks to distinguish these? From flyte's perspective, I think it just sees "this task was terminated from outside" and so retries -- sometimes resulting in multiple multi-hour runs that end in OOM and end up being more expensive than just making it not interruptible in the first place.
I do see this documentation on the topic. OOM errors usually come with the word USER in the error message:
```
[3/3] currentAttempt done. Last Error: USER::
[primary] terminated with exit code (137). Reason [OOMKilled].
```
so flyte should not retry on these due to "interruptible" logic unless we've specifically configured retries=X in the task decorator -- which we do, because I thought we had to in order to get retries on spot-instance reclaims. But those docs seem to indicate otherwise. Continuing to investigate...
m
Ok, here is my current conclusion: the retries parameter in a flyte task does not really control the retries associated with interruptible instance reclaim, despite what the docs here seem to imply. I had retries set to 2, and the docs say:
> An interruptible task with retries=n will be attempted n times on an interruptible instance. If it still fails after n attempts, the final (n+1) retry will be done on the fallback on-demand instance.
Instead, I found that spot instance reclaims (which I forced via AWS tools) result in 3 failed attempts on spot instances, followed by an attempt on an on-demand instance. This was true whether I had retries=2, retries=0, or omitted the retries param altogether in the task decorator. I think this is the result of Propeller's default settings, specifically these, and in particular `max-node-retries-system-failures`:
```
default-deadlines:
  node-active-deadline: 0s
  node-execution-deadline: 0s
  workflow-active-deadline: 0s
default-max-attempts: 1
enable-cr-debug-metadata: false
ignore-retry-cause: false
interruptible-failure-threshold: -1
max-node-retries-system-failures: 3
```
On the other hand, when the task fails with an OOM, retries=2 means I get 3 total attempts, because OOM is a USER error, not a SYSTEM error.
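So my working mental model, sketched as annotated config (the comments are my own reading of the behavior above, not something taken from the docs):
```
node-config:
  # governs SYSTEM failures such as spot reclaims: with the default of 3,
  # I see 3 attempts on spot, then the next attempt falls back to on-demand
  max-node-retries-system-failures: 3

# USER failures such as OOMKilled are budgeted separately by the
# retries=N argument on the @task decorator (N+1 total attempts)
```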
w
Here are some knobs that might be of interest to you. For us, we just cranked up "interruptible-failure-threshold" 😛
m
The behavior I want is to never retry on USER errors, which includes OOM, and to have interruptible tasks retry ONCE on another interruptible instance before going on-demand -- not twice for a total of 3 spot attempts, which seems to be the default behavior. I assume this means I should NOT use the @task retries param (or should set it to 0) and should turn `max-node-retries-system-failures` down to 1, though I'm not really clear how `interruptible-failure-threshold` interacts with this, or what it really does. The docs say:
> Additionally, the interruptible-failure-threshold option in the node-config key defines how many system-level retries are considered interruptible. This is particularly useful for tasks running on preemptible instances.
"Considered interruptible"? The task is either interruptible or not, based on whether it was scheduled as such and landed on a spot instance.
@worried-airplane-87065 - thanks for your responses.
@worried-airplane-87065 - did you manage to configure this for the single flyte binary (which is what we use), or do you have the more involved install with separate components?
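For what it's worth, my untested guess for the single binary is that the same node-config block would go under the flyte-binary chart's `configuration.inline` passthrough -- I haven't verified that this is picked up by propeller in the single-binary deployment, and I'm guessing at the exact nesting:
```
configuration:
  inline:
    # nesting of node-config here is my guess, not verified
    propeller:
      node-config:
        max-node-retries-system-failures: 1
```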
w
We use Flyte core so we did this:
```
core:
  propeller:
    node-config:
      max-node-retries-system-failures: 5
```
m
How does that value relate to `max-node-retries-system-failures`? What is the effect of setting it to 5, as opposed to the default of -1? Since you're not overriding it, I think you are still running with `max-node-retries-system-failures` = 3.
w
Ah yes, sorry, we set `max-node-retries-system-failures` 😛 edited the comment above
m
So I assume this means that when you run interruptible jobs, they make 5 attempts on spot instances before running on-demand -- in my case, with the default of 3, I see 3 attempts on spot, and then a final attempt on on-demand.
I just realized that Flyte Core implies no backend (from my quick research as to what that is!), so you're probably not using cloud instances.
w
Oh sorry, I meant we use the flyte-core helm chart. We run it on GCP.
m
Ok, cool, thanks.