square-boots-41503
08/08/2023, 4:12 PM
`retries=N` decorator) while user errors cause the task to fail and not retry. This works correctly for `task` and `dynamic` but not for tasks spawned through `map_task`.
For `map_task`, a spot revocation is handled like a user error and the task fails without a retry. In this case, the error message looks like this [4]: code:"UnexpectedObjectDeletion" message:"object [flytetester-development/eric-zkb6y8cmtmyzj2sjefbsvw-n0-0-dn0-0-dn0-0-4] terminated in the background, manually".
In contrast, the error message I see for `task`/`dynamic` after the max number of retryable spot revocations is the following: [10/10] currentAttempt done. Last Error: SYSTEM::object [flytetester-development/eric-b0iozgurekfuo5ookspxq-n0-0-dn0-10] terminated in the background, manually.
I tried adding `interruptible=True` and `retries=N`, but that causes all failures to retry, including user errors (which I want to exclude from being retried).
Does anyone know how to get subtasks in `map_task` to behave like `task`/`dynamic` with respect to spot revocations?
tall-lock-23197
retries must be set. Not sure how your `task`s are automatically being retried on spot instances without retries being set.
cc @hallowed-mouse-14616
square-boots-41503
08/09/2023, 1:24 PM
`retries` but have it only retry spot interruptions and not real errors?
square-boots-41503
08/09/2023, 2:36 PM
"Not sure how your `task`s are automatically being retried on spot instances without `retries` being set."
I was confused by this too.. but I think this is how it is currently behaving. The task that I'm spot revoking has a barebones task definition like so:
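import logging
from time import sleep

from flytekit import task

@task
def example_task(input: str) -> str:
    logging.info('Starting example_task')
    i = 0
    # Loop indefinitely so the task keeps running until it is spot-revoked.
    while True:
        logging.info(f'Task Iteration - {i}')
        sleep(10)
        i += 1
    return "hi"  # never reached; the loop above does not exit
hallowed-mouse-14616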
08/09/2023, 4:24 PM
`retries` field in the task annotation shouldn't be necessary.
hallowed-mouse-14616
08/09/2023, 4:29 PM
`SYSTEM` and `USER`. When encountering an error, both have separate budgets, so a task can be executed allowing 3 `SYSTEM` errors and 1 `USER` error, for example. The `SYSTEM` error configuration is in propeller with the `max-node-retries-system-failures` config value under the `node-config` key. This option also has an `interruptible-failure-threshold` option, which defines the number of system-level retries that will be considered interruptible. So you can say allow 3 retries, but on the last one (i.e. 2 for the failure threshold) do not label the Pod as interruptible. Alternatively, the `USER` budget is set by defining `retries` in the task decorator.
hallowed-mouse-14616
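The two separate retry budgets described above can be sketched as a small toy model. This is only an illustration under stated assumptions, not FlytePropeller's implementation: the class `RetryBudgets` and all of its fields and methods are hypothetical names, and the rule that an attempt stays interruptible while the count of system failures is below the threshold is one reading of the description in the message.

```python
from dataclasses import dataclass


@dataclass
class RetryBudgets:
    """Toy model of separate SYSTEM and USER retry budgets (hypothetical names)."""

    max_system_retries: int       # stands in for max-node-retries-system-failures
    user_retries: int             # stands in for retries=N in the task decorator
    interruptible_threshold: int  # stands in for interruptible-failure-threshold
    system_failures: int = 0
    user_failures: int = 0

    def next_attempt_is_interruptible(self) -> bool:
        # Assumption: attempts stay interruptible while the number of
        # system failures seen so far is below the threshold.
        return self.system_failures < self.interruptible_threshold

    def record_failure(self, kind: str) -> bool:
        """Record one failure and return True if another attempt is allowed."""
        if kind == "SYSTEM":
            self.system_failures += 1
            return self.system_failures <= self.max_system_retries
        if kind == "USER":
            self.user_failures += 1
            return self.user_failures <= self.user_retries
        raise ValueError(f"unknown error kind: {kind!r}")


# Example: 3 SYSTEM retries, no USER retries, threshold of 2.
budgets = RetryBudgets(max_system_retries=3, user_retries=0, interruptible_threshold=2)
retry_after_revocation = budgets.record_failure("SYSTEM")  # draws on the SYSTEM budget
retry_after_user_error = budgets.record_failure("USER")    # draws on the USER budget
```

In this model a spot revocation counted as `SYSTEM` is retried without touching the `USER` budget, while a `USER` error with `user_retries=0` ends the task immediately.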
08/09/2023, 4:34 PM
`ArrayNode` as an experimental feature - https://github.com/flyteorg/flytepropeller/pull/550
(2) There is also ongoing work on adding a flag to unify the retry budget. There can be a lot of confusion with handling errors with separate retry budgets - https://github.com/flyteorg/flyte/pull/3902
hallowed-mouse-14616
08/09/2023, 4:34 PM
square-boots-41503
08/09/2023, 5:44 PM
`ArrayNode`? I can try that out
thankful-minister-83577