# flyte-support
m
We make heavy use of interruptible tasks so that we can run on spot instances for cost savings. One issue I'd like to solve is that it seems the interruptible mechanism can't really distinguish between a spot instance being taken away by AWS (in which case a retry is appropriate) or when the task OOMs, in which case a retry is not going to help. Is it possible to configure interruptible tasks to distinguish these? From flyte's perspective, I think it just sees "this task was terminated from outside" and so retries -- sometimes resulting in multiple multi-hour runs that end in OOM and end up being more expensive than just making it not interruptible in the first place.
I do see this documentation on the topic. OOM errors usually come with the word USER in the error message:
```
[3/3] currentAttempt done. Last Error: USER::
[primary] terminated with exit code (137). Reason [OOMKilled].
```
so flyte should not retry on these due to "interruptible" logic unless we've specifically configured retries=X in the task decorator -- which we do, because I thought we had to in order to get retries on spot-instance reclaims. But those docs seem to indicate otherwise. Continuing to investigate...
m
Ok, here is my current conclusion: the retries parameter in a flyte task does not really control the retries associated with interruptible instance reclaim, despite what the docs here seem to imply. I had retries set to 2, and the docs say:
> An interruptible task with retries=n will be attempted n times on an interruptible instance. If it still fails after n attempts, the final (n+1) retry will be done on the fallback on-demand instance.
Instead, I found that spot instance reclaims (which I forced via AWS tools) result in 3 failed attempts on spot instances, followed by an attempt on an on-demand instance. This was true whether I had retries=2, retries=0, or omitted the retries param altogether in the task decorator. I think this is the result of Propeller's default settings, specifically these, and in particular `max-node-retries-system-failures`:
```
default-deadlines:
  node-active-deadline: 0s
  node-execution-deadline: 0s
  workflow-active-deadline: 0s
default-max-attempts: 1
enable-cr-debug-metadata: false
ignore-retry-cause: false
interruptible-failure-threshold: -1
max-node-retries-system-failures: 3
```
On the other hand, when the task fails with an OOM, retries=2 means I get 3 total attempts, because OOM is a USER error, not a SYSTEM error.
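So my working mental model, sketched as annotated config (the comments are my own reading of the behavior above, not something taken from the docs):
```
node-config:
  # governs SYSTEM failures such as spot reclaims: with the default of 3,
  # I see 3 attempts on spot, then the next attempt falls back to on-demand
  max-node-retries-system-failures: 3

# USER failures such as OOMKilled are budgeted separately by the
# retries=N argument on the @task decorator (N+1 total attempts)
```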
w
Here are some knobs that might be of interest to you. For us, we just cranked up "interruptible-failure-threshold" 😛
m
The behavior I want is to never retry on USER errors, which includes OOM, and to have interruptible tasks retry ONCE on another interruptible instance before going on-demand -- not twice for a total of 3 spot attempts, which seems to be the default behavior. I assume this means I should NOT use the @task retries param (or should set it to 0) and should turn `max-node-retries-system-failures` down to 1, though I'm not really clear how `interruptible-failure-threshold` interacts with this, or what it really does. The docs say:
> Additionally, the interruptible-failure-threshold option in the node-config key defines how many system-level retries are considered interruptible. This is particularly useful for tasks running on preemptible instances.
"Considered interruptible"? The task is either interruptible or not, based on whether it was scheduled as such and landed on a spot instance.
@worried-airplane-87065 - thanks for your responses.
@worried-airplane-87065 - did you manage to configure this for the single flyte binary (which is what we use), or do you have the more involved install with separate components?
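For what it's worth, my untested guess for the single binary is that the same node-config block would go under the flyte-binary chart's `configuration.inline` passthrough -- I haven't verified that this is picked up by propeller in the single-binary deployment, and I'm guessing at the exact nesting:
```
configuration:
  inline:
    # nesting of node-config here is my guess, not verified
    propeller:
      node-config:
        max-node-retries-system-failures: 1
```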
w
We use Flyte core so we did this:
```
core:
  propeller:
    node-config:
      max-node-retries-system-failures: 5
```
m
How does that value relate to `max-node-retries-system-failures`? What is the effect of setting it to 5, as opposed to the default of -1? Since you're not overriding it, I think you are still running with `max-node-retries-system-failures` = 3.
w
Ah yes, sorry, we set `max-node-retries-system-failures` 😛 edited the comment above
m
So I assume this means that when you run interruptible jobs, they make 5 attempts on spot instances before running on-demand -- in my case, with the default of 3, I see 3 attempts on spot, and then a final attempt on on-demand.
I just realized that Flyte Core implies no backend (from my quick research as to what that is!), so you're probably not using cloud instances.
w
Oh sorry, I meant we use the flyte-core helm chart. We run it on GCP.
m
Ok, cool, thanks.