Flyte enables production-grade orchestration for machine learning workflows and data processing created to accelerate local workflows to production.

Has anyone seen behavior where a) you have a map task with retries > 0 and b) the first attempt of a child task fails and the second succeeds, but the entire child task is still marked as failed?

Also, separately, if anyone has tips on how to detect that a task is being retried, so I can come up with a reproducible example, I would appreciate it.

```
from collections import defaultdict
from flytekit import map_task, task, workflow


NUM_ATTEMPTS = defaultdict(int)


@task(retries=1)
def flakey_map_task(*, task_id: int) -> None:
    NUM_ATTEMPTS[task_id] += 1
    if NUM_ATTEMPTS[task_id] == 1:
        raise ValueError("Bad luck, this one failed")


@workflow
def flakey_map_workflow() -> None:
    task_ids = list(range(100))
    map_task(flakey_map_task)(task_id=task_ids)
```
^ this is my repro attempt. It doesn't work, I'm guessing because Flyte doesn't actually persist global variables across pod restarts, so the counter starts from zero again on every retry

can you try using the newer style map tasks to see if the issue still persists?  `from flytekit.experimental import map_task`

that version is getting deployed as the default in the next release

Last time I tried that experimental map_task it failed with inscrutable errors… will see if I feel motivated to try this

<https://github.com/flyteorg/flytesnacks/blob/debug/examples/mems/map_retries.py>

changed your example to use the experimental map_task (which again is going live soon).  it works as we expect at least.

and at least with the newer map task, the attempt number is correctly published, so you can use that to fail if you want to write an integration test or something
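For anyone wanting to try this, here is a minimal sketch of reading the attempt number inside a task. It assumes the attempt is exposed to the container through the `FLYTE_ATTEMPT_NUMBER` environment variable and that it is zero-based; check both against your deployment. The helper names `current_attempt` and `is_retry` are made up for illustration and are not part of flytekit.

```python
import os
from typing import Mapping, Optional


def current_attempt(environ: Optional[Mapping[str, str]] = None) -> int:
    """Return the attempt number of the running task.

    Assumption: Flyte sets FLYTE_ATTEMPT_NUMBER in the task container and the
    first attempt is 0. Falls back to 0 when the variable is absent, e.g. when
    running locally.
    """
    env = os.environ if environ is None else environ
    return int(env.get("FLYTE_ATTEMPT_NUMBER", "0"))


def is_retry(environ: Optional[Mapping[str, str]] = None) -> bool:
    """True when the current execution is a second or later attempt."""
    return current_attempt(environ) > 0
```

Unlike a module-level counter, this does not depend on state surviving between pods, so it can be used inside a task body to fail deliberately on the first attempt only.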

one thing to note is that you need to use the recoverable exception.
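A rough sketch of what that looks like, assuming the exception meant here is `FlyteRecoverableException` from `flytekit.exceptions.user`. The function `flaky_body` is a hypothetical stand-in for the body of a task declared with `@task(retries=1)`, and `attempt` is passed in explicitly so the sketch can be run outside a cluster. The fallback class is only there so the snippet runs without flytekit installed.

```python
try:
    from flytekit.exceptions.user import FlyteRecoverableException
except ImportError:
    # Stand-in so this sketch runs without flytekit; not part of Flyte.
    class FlyteRecoverableException(Exception):
        pass


def flaky_body(task_id: int, attempt: int) -> int:
    """Hypothetical body of a task declared with @task(retries=1).

    Raising a plain ValueError would be treated as a non-retryable user
    error, so the first attempt raises the recoverable exception instead.
    """
    if attempt == 0:
        raise FlyteRecoverableException(
            f"task {task_id}: simulated failure on attempt {attempt}"
        )
    return task_id
```

In the original repro, swapping `ValueError` for the recoverable exception is the relevant change; the rest of the task can stay as it was.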

(that was a decision made a long time ago… we probably shouldn’t change the default behaviour, but there’s a long-standing ticket to improve error logging, so we should add whether or not caught exceptions are retryable)

the next flyte release is going out very soon so if you just pick up the next release it should just work

Is it possible that map_task could start before the inputs from the prior task are available? We're seeing "Access denied" that is resolved when adding retries