# ask-the-community
k
Hi, did anyone encounter this before:
```
Workflow[...] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: 0: failed at Node[n7]. CatalogCallFailed: failed to release reservation, caused by: failed to read inputs when trying to query catalog: [READ_FAILED] failed to read data from dataDir [s3://flytemeta/.../n7/data/0/n7/inputs.pb]., caused by: path:s3://flytemeta/.../n7/data/0/n7/inputs.pb: not found
```
Background: A workflow execution failed at task A while writing its outputs to remote storage. When the workflow was relaunched, however, task A was skipped, because according to the UI its result was read from cache, even though the task had failed. Task B, which depends on task A, then failed with the above error, since the outputs of task A did not actually exist. My suspicion is that something went wrong while writing data to the cache. Another strange thing is that the subworkflow of the relaunched workflow, where the failure occurred, remains in the Running state, even though the main workflow is in the Failed state, and there is no way to kill the workflow entirely.
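For context, the caching behaviour discussed here is declared per task in flytekit via cache and cache_version; a minimal illustrative sketch (not the actual workflow in question):
```python
from flytekit import task

# Illustrative cached task: results are looked up in the data catalog by the
# task signature and cache_version. If a cache entry points at outputs that
# were never fully written to the blob store, a downstream consumer can fail
# with a "not found" error like the one above.
@task(cache=True, cache_version="1")
def task_a(x: int) -> int:
    return x * 2
```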
b
What propeller version are you on?
And has the same propeller version performed the cache write? And is task A a map task?
k
Propeller (flyte-binary): version 1.6.1, flytekit: version 1.7.1b1. Task A is a normal Python task.
To the second question: yes, the deployment has not been changed in between.
s
Have you been able to resolve this issue, Klemens? If not, can you try disabling the cache and running your workflow again?
k
Hi, invalidating the cache solved the problem for this execution. I just wanted to get to the root cause of it, since this should not happen in production. Hopefully it does not happen again.
s
Got it. Let us know if you come across this issue again.
k
Just had the same error again:
```
Workflow[...] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: 0: failed at Node[n13]. CatalogCallFailed: failed to release reservation, caused by: failed to read inputs when trying to query catalog: [READ_FAILED] failed to read data from dataDir [s3://flytemeta/metadata/propeller/.../n2/data/0/n13/inputs.pb]., caused by: path:s3://flytemeta/metadata/propeller/.../n2/data/0/n13/inputs.pb: not found
```
This time the previous task did not fail; instead, the task failed when reading its inputs (which in fact do not exist on the object store). What mechanisms are involved in creating inputs.pb, and what could be potential root causes? Could it be a network problem or an instability of the object store? The failed task in this case was a map_task.
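One way to confirm whether such an inputs.pb object really exists on the object store is a direct head request against the S3 API; a hedged sketch using boto3 (endpoint, bucket, and key are placeholders; the real key is the truncated path from the error message):
```python
import boto3
from botocore.exceptions import ClientError

# Placeholder endpoint for a self-managed S3-compatible store (e.g. Ceph RGW).
s3 = boto3.client("s3", endpoint_url="http://ceph-rgw.example:7480")

def object_exists(bucket: str, key: str) -> bool:
    """Return True if the object exists, False on a 404, re-raise anything else."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise

# Key truncated as in the error message above; substitute the full path.
print(object_exists("flytemeta", "metadata/propeller/.../n2/data/0/n13/inputs.pb"))
```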
s
@Dan Rammer (hamersaw) @Yee
k
For information: we are using an on-prem setup with the flyte-binary chart 1.8.1 and a self-managed storage cluster (Ceph) with an S3 API. Flytekit for this particular workflow was 1.7.1b1.
Maybe this is somehow related. The error message here indicates that the outputs of the previous task are missing (not the inputs, as in the above error). The error message itself makes sense, but the second task should not have succeeded.
d
@Klemens Kasseroller it's looking like there are a few issues here that are probably related, but I just want to clear things up to help debugging. (1) The first issue is failing to release the catalog reservation because the inputs.pb value could not be read. Did this occur as part of an abort because a different task failed? (2) It looks like the second error is the same, namely failing to release the catalog reservation because the inputs.pb file does not exist. You have confirmed that it indeed does not exist in the blob store, right? Again, is this part of the abort sequence? And are these map tasks? (3) The image you linked is failing because propeller has a configurable limit on data sizes that are read from S3. The config value is maxDownloadMBs.
k
Hi Dan, thanks for your reply! There should not have been any abort, but it is very well possible that there were some cluster issues involved (e.g. network issues). The tasks failing with inputs.pb not found are indeed map tasks.

The error in the image I posted is clear to me; what I find strange is that the second task succeeded according to the UI, while the failing task indicates that the outputs of that succeeded task are missing. When relaunching the workflow, the second task also reruns, even though caching is enabled (yet the UI indicates that the cache was disabled). Also, no outputs are displayed in the web UI for any of the 3 cases above, even though the task is displayed as succeeded (remember, it always fails at the next task, while the succeeded task is missing outputs according to the UI).

Setting maxDownloadMBs fixes the problem in this particular situation, but it does not really explain why the second task succeeded even though its outputs are missing. I am also a bit puzzled about the difference between max-output-size-bytes and maxDownloadMBs. Why do we need separate settings for upload and download?
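For reference, the two settings live in different sections of the configuration: maxDownloadMBs is a storage-layer read limit, while max-output-size-bytes caps the size of the outputs a task may produce. A hedged sketch of where they might sit in a flyte-binary values file (exact nesting and defaults depend on your chart version):
```yaml
# Sketch only: approximate placement under the flyte-binary inline configuration.
configuration:
  inline:
    storage:
      limits:
        maxDownloadMBs: 10              # max data size propeller will read back from the blob store
    propeller:
      max-output-size-bytes: 10485760   # max size of the outputs a task is allowed to produce
```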
d
Can you give me a brief outline of the workflow that causes this error? It would really help to have a minimal repro. I'm having a very difficult time determining if this is a bug or a deployment issue.
k
I am not able to reproduce the original error, but the example from the image above can be reproduced with the following code:
```python
import os
from dataclasses import dataclass
from typing import List

import flytekit
from dataclasses_json import dataclass_json
from flytekit import task, workflow
from flytekit.types.file import FlyteFile


@dataclass_json
@dataclass
class FileStruct:
    a: List[FlyteFile]
    b: List[FlyteFile]
    c: FlyteFile


@task(cache=True, cache_version="1")
def return_large_number_of_files(n_files: int) -> List[FlyteFile]:
    # Write n small files into the task's working directory.
    all_files = []
    for i in range(n_files):
        file_path = os.path.join(flytekit.current_context().working_directory, f"file_{i}.txt")
        with open(file_path, "w+") as f:
            f.write("something")
        all_files.append(file_path)
    return [FlyteFile(file) for file in all_files]


@task(cache=True, cache_version="1")
def restructure_these_files(files: List[FlyteFile]) -> List[FileStruct]:
    # Every struct repeats the full file list twice, so the output grows quadratically.
    return [FileStruct(files, files, file) for file in files]


@task(cache=True, cache_version="1")
def do_something_with_all_files(inp: List[FileStruct]) -> None:
    print(inp[0].c)  # inp is a list; index into it before accessing a field


@workflow
def demo_wf(n_files: int) -> None:
    files = return_large_number_of_files(n_files=n_files)
    structs = restructure_these_files(files=files)
    do_something_with_all_files(inp=structs)
```
This probably depends on your cluster config, but for me it fails with n_files=200, with the error indicating that the output is above the allowed limit of 2MB, yet the task shows as succeeded in the console.
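A rough back-of-the-envelope estimate (the bytes-per-URI figure is an assumption) of why the restructured output blows past a 2MB limit even though every file is tiny: each FileStruct repeats all n FlyteFile URIs twice, and n structs are returned, so the output literal grows quadratically with n_files.
```python
# Rough size estimate for the output literal of restructure_these_files.
n_files = 200
bytes_per_uri = 150                              # assumption: typical s3:// URI length in the literal
uris_in_output = n_files * (2 * n_files + 1)     # n structs x (list a + list b + single file c)
approx_mb = uris_in_output * bytes_per_uri / (1024 * 1024)
print(f"~{approx_mb:.1f} MB")                    # ~11.5 MB, well above the 2MB limit from the error
```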