[Attachments: two screenshots, taken 2024-11-21 at 11:00 and 11:02]

Hi there!

We've been encountering an unusual issue where occasionally creating executions with `overwrite_cache=True` doesn't actually run new tasks and overwrite the cache as expected - instead Flyte tries to recover previous outputs and typically hits an `OutputsNotFound` error due to some mismatch between the previous and new run.

We're using FlyteRemote to programmatically create these executions as a way of relaunching previously failed executions - we do this instead of clicking 'relaunch' on the Flyte UI so that we can customise the execution name as well as relaunch executions in bulk.

Has anyone else had this issue, and is there perhaps something we're missing about our relaunching setup? I'll add a code snippet to the thread...

```
from flyteidl.admin.execution_pb2 import ExecutionCreateRequest, ExecutionSpec

# `remote` (a FlyteRemote), `domain`, `execution_name` and
# `execution_retry_name` are defined elsewhere in our script.
execution = remote.fetch_execution(domain=domain, name=execution_name)

execution_id = execution.id
inputs = remote.client.get_execution_data(execution_id).full_inputs.to_flyte_idl()
labels = execution.spec.labels.to_flyte_idl()
launch_plan = remote.fetch_launch_plan(domain=domain, name=execution.spec.launch_plan.name)

# New spec for the relaunch: same launch plan, metadata and labels as the original.
execution_spec = ExecutionSpec(
    launch_plan=launch_plan.id.to_flyte_idl(),
    metadata=execution.spec.metadata.to_flyte_idl(),
    labels=labels,
    overwrite_cache=True,
)

remote.client.raw.create_execution(
    create_execution_request=ExecutionCreateRequest(
        project=execution_id.project,
        domain=execution_id.domain,
        name=execution_retry_name,
        spec=execution_spec,
        inputs=inputs,
    )
)
```

It looks like instead of recreating the execution, you hit the "recover" endpoint... which does exactly as you described...

Thanks <@UNW4VP36V>! How does the remote decide to connect this to the 'recover' endpoint - could it be related to the metadata that gets copied over?

This is weird... recover is a separate endpoint... (`recover_execution`)...

cc <@UPBBNMXD1> <@U0265RTUJ5B>

this is correct! you should be able to call `remote.client.raw.recover_execution` instead: <https://github.com/flyteorg/flytekit/blob/master/flytekit/clients/raw.py#L368>

<@UPBBNMXD1> it's the other way around, we don't want to call recover in this case, we do want to call the regular create_execution but it seems (from the screenshot) that recover is what ended up getting called

Yep, that's it <@UNW4VP36V>! We've been experimenting with leaving out the `metadata` line, and we haven't seen any unexpected recoveries since then :crossed_fingers: It's still not 100% clear whether there's some other reason we'd want to re-include the original execution metadata, or whether we're safe to leave it out.

What has likely happened is that you recovered once, and because you've been copying the `metadata` field, that mode (RecoveryMode) gets copied to new executions. You should probably build the metadata field yourself and not set the mode or the recovery execution id.
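
For anyone finding this later, here's a rough sketch of the "build the metadata yourself" idea: copy only a whitelist of fields from the old metadata and let everything else fall back to its default. The helper name `rebuild_metadata` and the `FakeMetadata` stand-in are made up for illustration, and the field names (`mode`, `principal`, `reference_execution`) are my assumption about what the execution metadata message looks like, so check them against your installed flyteidl version before relying on this.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Optional


def rebuild_metadata(
    original: Any,
    metadata_factory: Callable[..., Any],
    keep: Iterable[str] = ("principal",),
) -> Any:
    """Build a new metadata object, copying only the whitelisted fields.

    Anything not listed in `keep` (for example the execution mode or a
    reference to the execution being recovered) is left at the factory's
    default instead of being carried over from `original`.
    """
    kwargs = {name: getattr(original, name) for name in keep if hasattr(original, name)}
    return metadata_factory(**kwargs)


@dataclass
class FakeMetadata:
    """Hypothetical stand-in for the real metadata message (assumed field names)."""

    mode: int = 0
    principal: str = ""
    reference_execution: Optional[str] = None


if __name__ == "__main__":
    # Pretend this came from an execution that had been recovered once;
    # 5 is just an arbitrary non-default mode value for the demo.
    recovered = FakeMetadata(mode=5, principal="alice", reference_execution="old-exec")
    fresh = rebuild_metadata(recovered, FakeMetadata)
    print(fresh)
```

In the real script, the idea would be to pass the flyteidl metadata class as `metadata_factory` and `execution.spec.metadata.to_flyte_idl()` as `original`, then hand the result to `ExecutionSpec(metadata=...)` in place of the copied metadata. Simply omitting `metadata` altogether, as mentioned above, is the more minimal option if nothing from the old metadata is needed.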