# ask-the-community
e
Hey! I'm seeing some differences in how spot revocation / preemption is handled between tasks and `map_task`. The behavior I want to see is where spot revocations are automatically retried (without adding the `retries=N` decorator) while user errors cause the task to fail and not retry. This works correctly for `task` and `dynamic` but not for tasks spawned through `map_task`. For `map_task`, a spot revocation is handled like a user error and the task fails without a retry. In this case, the error message looks like this:
`[4]: code:"UnexpectedObjectDeletion" message:"object [flytetester-development/eric-zkb6y8cmtmyzj2sjefbsvw-n0-0-dn0-0-dn0-0-4] terminated in the background, manually"`
In contrast, the error message I see for `task`/`dynamic` after the max number of retryable spot revocations is the following:
`[10/10] currentAttempt done. Last Error: SYSTEM::object [flytetester-development/eric-b0iozgurekfuo5ookspxq-n0-0-dn0-10] terminated in the background, manually`
I tried adding `interruptible=True` and `retries=N` but that causes all failures to retry, including user errors (which I want to exclude from being retried). Does anyone know how to get subtasks in `map_task` to behave like `task`/`dynamic` with respect to spot revocations?
s
As per https://docs.flyte.org/projects/cookbook/en/latest/auto_examples/containerization/spot_instances.html#setting-interruptible, `retries` must be set. Not sure how your `task`s are automatically being retried on spot instances without `retries` being set. cc @Dan Rammer (hamersaw)
e
Is there a way to add `retries` but have it only retry spot interruptions and not real errors?
"Not sure how your `task`s are automatically being retried on spot instances without `retries` being set."
I was confused by this too, but I think this is how it is currently behaving. The task that I'm spot revoking has a barebones `task` definition like so:
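import logging
from time import sleep

from flytekit import task


@task
def example_task(input: str) -> str:
    logging.info('Starting example_task')

    # Loop forever so the task keeps running until it is spot-revoked.
    i = 0
    while True:
        logging.info(f'Task Iteration - {i}')
        sleep(10)
        i += 1

    return "hi"  # unreachable; the loop above never exits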
d
@Eric Song TL;DR interruptible should work the same with maptasks as with regular python tasks - the `retries` field in the task annotation shouldn't be necessary.
So right now Flyte differentiates between errors with two separate types, namely `SYSTEM` and `USER`. Each type has its own retry budget, so a task can be executed allowing, for example, 3 `SYSTEM` errors and 1 `USER` error. The `SYSTEM` budget is configured in propeller with the `max-node-retries-system-failures` config value under the `node-config` key. There is also an `interruptible-failure-threshold` option, which defines the number of system-level retries that will be considered interruptible. So you can say allow 3 retries, but on the last one (i.e. `2` for the failure threshold) do not label the Pod as interruptible. Alternatively, the `USER` budget is set by defining `retries` in the task decorator.
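To make the two budgets concrete, here is a toy Python model. It is only a sketch of the behavior described above, not propeller's actual code: `RetryBudget` and its fields are made-up names that stand in for the config values, and the exact off-by-one of the interruptible threshold is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RetryBudget:
    """Toy model of separate SYSTEM and USER retry budgets (hypothetical names)."""

    max_system_failures: int      # stands in for max-node-retries-system-failures
    interruptible_threshold: int  # stands in for interruptible-failure-threshold
    user_retries: int = 0         # stands in for retries=N in the task decorator
    system_failures: int = 0
    user_failures: int = 0

    def next_attempt_interruptible(self) -> bool:
        # Assumption: attempts stay interruptible until the number of
        # SYSTEM failures reaches the threshold.
        return self.system_failures < self.interruptible_threshold

    def record_failure(self, kind: str) -> bool:
        """Count a failure against its own budget; return True if a retry is allowed."""
        if kind == "SYSTEM":
            self.system_failures += 1
            return self.system_failures <= self.max_system_failures
        if kind == "USER":
            self.user_failures += 1
            return self.user_failures <= self.user_retries
        raise ValueError(f"unknown error kind: {kind!r}")


if __name__ == "__main__":
    budget = RetryBudget(max_system_failures=3, interruptible_threshold=2)
    for attempt in range(5):
        interruptible = budget.next_attempt_interruptible()
        retry = budget.record_failure("SYSTEM")  # pretend every attempt is spot-revoked
        print(f"attempt={attempt} interruptible={interruptible} retry_allowed={retry}")
        if not retry:
            break
```

In this model a spot revocation only ever spends the `SYSTEM` budget, so with `user_retries=0` a user error fails immediately while revocations are still retried. That is the behavior asked for at the top of the thread.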
Two other things: (1) This will be fixed with the addition of `ArrayNode` as an experimental feature - https://github.com/flyteorg/flytepropeller/pull/550 (2) There is also ongoing work on adding a flag to unify the retry budget. There can be a lot of confusion with handling errors with separate retry budgets - https://github.com/flyteorg/flyte/pull/3902
That all being said, feel free to file an issue for this. With the hopeful move of the ArrayNode work to GA, I'm not sure this can be prioritized, but it should be a pretty quick fix!
e
thanks! are there any docs/examples on how to enable/use `ArrayNode`? I can try that out
y
instructions on how to try array node should be in the release notes for the coming release