microscopic-furniture-57275
05/09/2025, 3:20 PM
microscopic-furniture-57275
05/09/2025, 6:14 PM
[3/3] currentAttempt done. Last Error: USER::
[primary] terminated with exit code (137). Reason [OOMKilled].
So Flyte should not retry on these due to the "interruptible" logic unless we've specifically configured retries=X in the task decorator -- which we do, because I thought we had to in order to get retries on spot-instance reclaims. But those docs seem to indicate otherwise.
Continuing to investigate...
worried-airplane-87065
05/09/2025, 8:39 PM
microscopic-furniture-57275
05/09/2025, 8:40 PM
An interruptible task with retries=n will be attempted n times on an interruptible instance. If it still fails after n attempts, the final (n+1) retry will be done on the fallback on-demand instance.
Instead, I found that spot instance reclaims (which I forced via AWS tools) result in 3 failed attempts on spot instances, followed by an attempt on an on-demand instance. This was true whether I had retries=2, or retries=0, or omitted the retries param altogether in the task decorator.
I think this is the result of default settings for Propeller, specifically these, and most specifically the `max-node-retries-system-failures`:
default-deadlines:
  node-active-deadline: 0s
  node-execution-deadline: 0s
  workflow-active-deadline: 0s
default-max-attempts: 1
enable-cr-debug-metadata: false
ignore-retry-cause: false
interruptible-failure-threshold: -1
max-node-retries-system-failures: 3
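A minimal sketch of the behavior observed in this thread, not Propeller's actual code. It models system failures (such as spot reclaims) as drawing on `max-node-retries-system-failures`, and user failures (such as OOMKilled) as drawing on the task's `retries`. The class and function names are hypothetical, and the handling of a negative `interruptible-failure-threshold` (counting back from the last attempt) is an assumption that happens to match the 3 spot + 1 on-demand pattern seen with the defaults.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeConfig:
    # Defaults copied from the node-config listing in this message.
    max_node_retries_system_failures: int = 3
    interruptible_failure_threshold: int = -1


def system_failure_attempts(cfg: NodeConfig) -> List[str]:
    """Instance type used for each attempt of an interruptible task when
    every attempt ends in a system failure (for example a spot reclaim)."""
    threshold = cfg.interruptible_failure_threshold
    if threshold < 0:
        # ASSUMPTION: a negative threshold counts back from the end, so -1
        # makes only the final system retry non-interruptible.
        threshold = cfg.max_node_retries_system_failures + threshold + 1
    total_attempts = cfg.max_node_retries_system_failures + 1
    return [
        "spot" if failures_so_far < threshold else "on-demand"
        for failures_so_far in range(total_attempts)
    ]


def user_failure_attempts(retries: int) -> int:
    """Total attempts when every attempt ends in a user failure such as
    OOMKilled: the first try plus the task-level retries."""
    return retries + 1


print(system_failure_attempts(NodeConfig()))
# ['spot', 'spot', 'spot', 'on-demand']
print(user_failure_attempts(2))
# 3
```

Under this model the task-level `retries` value never enters the system-failure path, which would explain why retries=2, retries=0, and no retries param all produced the same sequence of spot and on-demand attempts.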
On the other hand, when I fail with an OOM, retries=2 means I get 3 total attempts, because this is a USER error, not a SYSTEM error.
worried-airplane-87065
05/09/2025, 8:40 PM
microscopic-furniture-57275
05/09/2025, 8:47 PM
`max-node-retries-system-failures` down to 1, though I'm not really clear how `interruptible-failure-threshold` interacts with this, or what it really does. The docs say
> Additionally, the `interruptible-failure-threshold` option in the `node-config` key defines how many system-level retries are considered interruptible. This is particularly useful for tasks running on preemptible instances.
"considered interruptible"? The task is either interruptible or not, based on if it was scheduled as such and landed on a spot instance.
microscopic-furniture-57275
05/09/2025, 8:48 PM
microscopic-furniture-57275
05/09/2025, 9:33 PM
worried-airplane-87065
05/09/2025, 9:35 PM
core:
  propeller:
    node-config:
      max-node-retries-system-failures: 5
microscopic-furniture-57275
05/09/2025, 9:37 PM
worried-airplane-87065
05/09/2025, 9:39 PM
microscopic-furniture-57275
05/09/2025, 9:42 PM
microscopic-furniture-57275
05/09/2025, 9:45 PM
worried-airplane-87065
05/09/2025, 9:46 PM
microscopic-furniture-57275
05/09/2025, 9:47 PM