square-boots-41503
08/08/2023, 4:12 PM
`retries=N` decorator) while user errors cause the task to fail and not retry. This works correctly for `task` and `dynamic` but not for tasks spawned through `map_task`.
For `map_task`, a spot revocation is handled like a user error and the task fails without a retry. In this case, the error message looks like this: `[4]: code:"UnexpectedObjectDeletion" message:"object [flytetester-development/eric-zkb6y8cmtmyzj2sjefbsvw-n0-0-dn0-0-dn0-0-4] terminated in the background, manually"`.
In contrast, the error message I see for `task`/`dynamic` after the max number of retryable spot revocations is the following: `[10/10] currentAttempt done. Last Error: SYSTEM::object [flytetester-development/eric-b0iozgurekfuo5ookspxq-n0-0-dn0-10] terminated in the background, manually`.
I tried adding `interruptible=True` and `retries=N`, but that causes all failures to retry, including user errors (which I want to exclude from being retried).
Does anyone know how to get subtasks in `map_task` to behave like `task`/`dynamic` with respect to spot revocations?
tall-lock-23197
`retries` must be set. Not sure how your `task`s are automatically being retried on spot instances without `retries` being set.
cc @hallowed-mouse-14616
square-boots-41503
08/09/2023, 1:24 PM
`retries` but have it only retry spot interruptions and not real errors?
square-boots-41503
08/09/2023, 2:36 PM
"Not sure how your `task`s are automatically being retried on spot instances without `retries` being set."
I was confused by this too, but I think this is how it is currently behaving. The task that I'm spot revoking has a barebones `task` definition like so:
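import logging
from time import sleep

from flytekit import task


@task
def example_task(input: str) -> str:
    logging.info('Starting example_task')
    i = 0
    # Loop indefinitely; the task only ends when its pod is terminated.
    while True:
        logging.info(f'Task Iteration - {i}')
        sleep(10)
        i += 1
    # Never reached; kept to match the declared return type.
    return "hi"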
hallowed-mouse-14616
08/09/2023, 4:24 PM
`retries` field in the task annotation shouldn't be necessary.
hallowed-mouse-14616
08/09/2023, 4:29 PM
`SYSTEM` and `USER`. When encountering an error, both have separate budgets, so a task can be executed allowing 3 `SYSTEM` errors and 1 `USER` error, for example. The `SYSTEM` error configuration is in propeller, with the `max-node-retries-system-failures` config value under the `node-config` key. The same key also has an `interruptible-failure-threshold` option, which defines the number of system-level retries that will be considered interruptible. So you can say allow 3 retries, but on the last one (i.e. `2` for the failure threshold) do not label the Pod as interruptible. Alternatively, the `USER` budget is set by defining `retries` in the task decorator.
hallowed-mouse-14616
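The separate `SYSTEM` and `USER` retry budgets described above can be sketched as a toy model. This is only an illustration under stated assumptions, not propeller's actual logic: `RetryBudgets`, `ErrorKind`, and `simulate` are hypothetical names, a budget of N is assumed to tolerate N failures of that kind, and an attempt is assumed to stay interruptible until the number of `SYSTEM` failures seen so far reaches the threshold.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class ErrorKind(Enum):
    SYSTEM = "SYSTEM"
    USER = "USER"


@dataclass
class RetryBudgets:
    """Hypothetical stand-in for the settings named in the thread (not a Flyte API)."""
    max_system_retries: int       # plays the role of max-node-retries-system-failures
    interruptible_threshold: int  # plays the role of interruptible-failure-threshold
    user_retries: int             # plays the role of retries=N in the task decorator


def simulate(budgets: RetryBudgets, failures: List[ErrorKind]) -> Tuple[str, List[bool]]:
    """Replay failures in order, then assume one final attempt succeeds.

    Returns the outcome and, per attempt, whether it ran as interruptible.
    """
    system_failures = 0
    user_failures = 0
    interruptible_flags: List[bool] = []

    def next_attempt_is_interruptible() -> bool:
        # Assumption: attempts stay interruptible until the number of SYSTEM
        # failures seen so far reaches the threshold.
        return system_failures < budgets.interruptible_threshold

    for kind in failures:
        interruptible_flags.append(next_attempt_is_interruptible())
        if kind is ErrorKind.SYSTEM:
            # SYSTEM failures only consume the SYSTEM budget.
            system_failures += 1
            if system_failures > budgets.max_system_retries:
                return "FAILED_SYSTEM", interruptible_flags
        else:
            # USER failures only consume the USER budget.
            user_failures += 1
            if user_failures > budgets.user_retries:
                return "FAILED_USER", interruptible_flags

    interruptible_flags.append(next_attempt_is_interruptible())
    return "SUCCEEDED", interruptible_flags
```

With `RetryBudgets(max_system_retries=3, interruptible_threshold=2, user_retries=0)`, three `SYSTEM` failures in a row still end in success under this model, with only the first two attempts marked interruptible, while a single `USER` failure fails the task immediately because its budget is independent.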
08/09/2023, 4:34 PM
`ArrayNode` as an experimental feature - https://github.com/flyteorg/flytepropeller/pull/550
(2) There is also ongoing work on adding a flag to unify the retry budget. There can be a lot of confusion with handling errors with separate retry budgets - https://github.com/flyteorg/flyte/pull/3902
square-boots-41503
08/09/2023, 5:44 PM
`ArrayNode`? I can try that out
thankful-minister-83577