Hey I m seeing some differences in how spot revocation preem Flyte #flyte-support


square-boots-41503

08/08/2023, 4:12 PM

Hey! I'm seeing some differences in how spot revocation / preemption is handled between tasks and `map_task`. The behavior I want to see is where spot revocations are automatically retried (without adding the `retries=N` decorator) while user errors cause the task to fail and not retry. This works correctly for `task` and `dynamic` but not for tasks spawned through `map_task`. For `map_task`, a spot revocation is handled like a user error and the task fails without a retry. In this case, the error message looks like this:

`[4]: code:"UnexpectedObjectDeletion" message:"object [flytetester-development/eric-zkb6y8cmtmyzj2sjefbsvw-n0-0-dn0-0-dn0-0-4] terminated in the background, manually"`

In contrast, the error message I see for `task`/`dynamic` after the max number of retryable spot revocations is the following:

`[10/10] currentAttempt done. Last Error: SYSTEM::object [flytetester-development/eric-b0iozgurekfuo5ookspxq-n0-0-dn0-10] terminated in the background, manually`

I tried adding `interruptible=True` and `retries=N` but that causes all failures to retry, including user errors (which I want to exclude from being retried). Does anyone know how to get subtasks in `map_task` to behave like `task`/`dynamic` with respect to spot revocations?

tall-lock-23197

08/09/2023, 7:20 AM

As per https://docs.flyte.org/projects/cookbook/en/latest/auto_examples/containerization/spot_instances.html#setting-interruptible, `retries` must be set. Not sure how your `task`s are automatically being retried on spot instances without `retries` being set. cc @hallowed-mouse-14616

square-boots-41503

08/09/2023, 1:24 PM

Is there a way to add `retries` but have it only retry spot interruptions and not real errors?

square-boots-41503

08/09/2023, 2:36 PM

"Not sure how your `task`s are automatically being retried on spot instances without `retries` being set."

I was confused by this too, but I think this is how it is currently behaving. The task that I'm spot revoking has a barebones `task` definition like so:


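import logging
from time import sleep

from flytekit import task


@task
def example_task(input: str) -> str:
    logging.info('Starting example_task')

    # Loops forever; the pod only ends when it is killed from outside
    # (for example by a spot revocation).
    i = 0
    while True:
        logging.info(f'Task Iteration - {i}')
        sleep(10)
        i += 1

    return "hi"  # never reached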

hallowed-mouse-14616

08/09/2023, 4:24 PM

@square-boots-41503 TL;DR interruptible should work the same with maptasks as with regular python tasks - the `retries` field in the task annotation shouldn't be necessary.

hallowed-mouse-14616

08/09/2023, 4:29 PM

So right now Flyte differentiates between errors with two separate types, namely `SYSTEM` and `USER`. Each type has its own retry budget, so a task can be executed allowing, for example, 3 `SYSTEM` errors and 1 `USER` error. The `SYSTEM` error budget is configured in propeller with the `max-node-retries-system-failures` config value under the `node-config` key. Alongside it there is an `interruptible-failure-threshold` option, which defines the number of system-level retries that will be considered interruptible. So you can say allow 3 retries, but on the last one (ie for the failure threshold) do not label the Pod as interruptible. Alternatively, the `USER` budget is set by defining `retries` in the task decorator.
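To make the two-budget idea above concrete, here is a small pure-Python toy model. It is not Flyte or propeller code: the `RetryBudget` class, its field names, and the exact counting rules are assumptions made for illustration, and the real `interruptible-failure-threshold` option may count differently. It only shows why a spot revocation that gets classified as `USER` fails immediately when `retries` is unset, while the same revocation classified as `SYSTEM` is retried.

```python
from dataclasses import dataclass


@dataclass
class RetryBudget:
    """Toy model of separate SYSTEM and USER retry budgets (hypothetical, not Flyte code)."""

    max_system_failures: int      # stands in for max-node-retries-system-failures
    max_user_retries: int         # stands in for retries=N in the task decorator
    interruptible_threshold: int  # stands in for interruptible-failure-threshold
    system_failures: int = 0
    user_failures: int = 0

    def next_attempt_is_interruptible(self) -> bool:
        # Assumption: attempts stay interruptible until the number of
        # system failures reaches the threshold.
        return self.system_failures < self.interruptible_threshold

    def record_failure(self, kind: str) -> bool:
        """Record one failure and return True if another attempt is allowed."""
        if kind == "SYSTEM":
            self.system_failures += 1
            return self.system_failures <= self.max_system_failures
        if kind == "USER":
            self.user_failures += 1
            return self.user_failures <= self.max_user_retries
        raise ValueError(f"unknown failure kind: {kind!r}")


# A revocation counted against the SYSTEM budget is retried...
as_system = RetryBudget(max_system_failures=3, max_user_retries=0, interruptible_threshold=3)
print(as_system.record_failure("SYSTEM"))

# ...while the same event counted against an empty USER budget is terminal.
as_user = RetryBudget(max_system_failures=3, max_user_retries=0, interruptible_threshold=3)
print(as_user.record_failure("USER"))
```

In this model, raising `max_user_retries` makes the second case retry too, but it also retries genuine user errors, which is the trade-off described earlier in the thread.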

hallowed-mouse-14616

08/09/2023, 4:34 PM

Two other things: (1) This will be fixed with the addition of `ArrayNode` as an experimental feature - https://github.com/flyteorg/flytepropeller/pull/550 (2) There is also ongoing work on adding a flag to unify the retry budget. There can be a lot of confusion with handling errors with separate retry budgets - https://github.com/flyteorg/flyte/pull/3902

hallowed-mouse-14616

08/09/2023, 4:34 PM

That all being said, feel free to file an issue for this. With the hoped-for move of the ArrayNode work to GA, I'm not sure this can be prioritized, but it should be a pretty quick fix!

square-boots-41503

08/09/2023, 5:44 PM

thanks! are there any docs/examples on how to enable/use `ArrayNode`? I can try that out

thankful-minister-83577

08/09/2023, 8:22 PM

instructions on how to try array node should be in the release notes for the coming release

👍 1
