# flyte-support
a
Hi there! We've been encountering an unusual issue where occasionally creating executions with overwrite_cache=True doesn't actually run new tasks and overwrite the cache as expected. Instead, Flyte tries to recover previous outputs and typically hits an OutputsNotFound error due to some mismatch between the previous and new run. We're using FlyteRemote to programmatically create these executions as a way of relaunching previously failed executions. We do this instead of clicking 'relaunch' in the Flyte UI so that we can customise the execution name and relaunch executions in bulk. Has anyone else had this issue, and is there perhaps something we're missing about our relaunching setup? I'll add a code snippet to the thread...
from flyteidl.admin.execution_pb2 import ExecutionCreateRequest, ExecutionSpec

execution = remote.fetch_execution(domain=domain, name=execution_name)

execution_id = execution.id
inputs = remote.client.get_execution_data(execution_id).full_inputs.to_flyte_idl()
labels = execution.spec.labels.to_flyte_idl()
launch_plan = remote.fetch_launch_plan(domain=domain, name=execution.spec.launch_plan.name)

execution_spec = ExecutionSpec(
    launch_plan=launch_plan.id.to_flyte_idl(),
    metadata=execution.spec.metadata.to_flyte_idl(),
    labels=labels,
    overwrite_cache=True,
)

remote.client.raw.create_execution(
    create_execution_request=ExecutionCreateRequest(
        project=execution_id.project,
        domain=execution_id.domain,
        name=execution_retry_name,
        spec=execution_spec,
        inputs=inputs,
    )
)
h
It looks like instead of recreating the execution, you hit the "recover" endpoint... which does exactly as you described...
a
Thanks @high-park-82026! How does the remote decide to connect this to the 'recover' endpoint - could it be related to the metadata that gets copied over?
h
This is weird... recover is a separate endpoint (recover_execution)... cc @acceptable-policeman-57188 @high-accountant-32689
a
this is correct! you should be able to call remote.client.raw.recover_execution instead: https://github.com/flyteorg/flytekit/blob/master/flytekit/clients/raw.py#L368
h
@acceptable-policeman-57188 it's the other way around, we don't want to call recover in this case, we do want to call the regular create_execution but it seems (from the screenshot) that recover is what ended up getting called
a
Yep, that's it @high-park-82026! We've been experimenting with leaving out the metadata line, and we haven't seen any unexpected recoveries since then 🤞 It's still not 100% clear whether there's another reason we might want to re-include the original execution metadata, or if we're safe to leave it out.
h
What has likely happened is that you recovered an execution once, and because you have been copying the metadata field, that mode (RecoveryMode) gets copied to every new execution. You should probably build the metadata field yourself and not set the mode or the recovery execution id.
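One way to follow that advice is to carry over only the fields you actually want and drop anything that ties the new run to an earlier one. The sketch below is a minimal, hypothetical helper rather than anything from flytekit: `carry_over_metadata` and `RECOVERY_FIELDS` are made-up names, `reference_execution` is an assumed name for the recovery execution id field, and a `SimpleNamespace` stands in for `execution.spec.metadata` so the snippet runs without a Flyte backend.

```python
from types import SimpleNamespace

# Fields that link a new run to an earlier one. "mode" comes from the thread;
# "reference_execution" is an ASSUMED name for the recovery execution id,
# so check it against the flyteidl version you have installed.
RECOVERY_FIELDS = frozenset({"mode", "reference_execution"})


def carry_over_metadata(original, field_names):
    """Return the requested metadata fields as a dict, skipping recovery-related ones."""
    return {
        name: getattr(original, name)
        for name in field_names
        if name not in RECOVERY_FIELDS and hasattr(original, name)
    }


# Stand-in for execution.spec.metadata of an execution that was recovered once.
old_metadata = SimpleNamespace(
    mode="RECOVERED",
    reference_execution="exec-123",
    principal="alice",
    nesting=0,
)

kwargs = carry_over_metadata(
    old_metadata, ["mode", "principal", "nesting", "reference_execution"]
)
print(kwargs)  # {'principal': 'alice', 'nesting': 0}
```

The resulting dict could then be passed as keyword arguments when constructing a fresh metadata message for the ExecutionSpec, instead of reusing execution.spec.metadata wholesale. Which fields are worth keeping is a judgment call for your setup. Leaving the metadata line out entirely, as tried earlier in the thread, is the simpler option if none of them matter.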