white-teacher-47376
07/14/2023, 8:01 AM
Workflow[...] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: 0: failed at Node[n7]. CatalogCallFailed: failed to release reservation, caused by: failed to read inputs when trying to query catalog: [READ_FAILED] failed to read data from dataDir [<s3://flytemeta/.../n7/data/0/n7/inputs.pb>]., caused by: path:<s3://flytemeta/.../n7/data/0/n7/inputs.pb>: not found
Background:
A workflow execution failed at a task A while writing its outputs to remote storage. On relaunching the workflow, however, task A was skipped: according to the UI its result was read from cache, even though the task had failed. Task B, which depends on task A, then failed with the above error, since the outputs of task A do not actually exist. My suspicion is that something went wrong while writing data to the cache. Another strange thing is that the subworkflow of the relaunched workflow, where the failure occurred, remains in the Running state even though the main workflow is in the Failed state, and there is no way to kill the workflow entirely.
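For reference, a minimal flytekit sketch of the kind of dependency being described here; the task and workflow names are made up for illustration and are not the actual project code:

import flytekit
from flytekit import task, workflow

@task(cache=True, cache_version="1")
def task_a() -> str:
    # On success, Flyte stores this output in the blob store and records a cache
    # entry for it; on relaunch, a cache hit skips re-execution of this task.
    return "result written to remote storage"

@task
def task_b(a_result: str) -> str:
    # Task B only ever sees whatever the cache entry for task A points at.
    return a_result.upper()

@workflow
def relaunch_demo() -> str:
    return task_b(a_result=task_a())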
white-teacher-47376
08/10/2023, 6:14 AM
Workflow[...] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: 0: failed at Node[n13]. CatalogCallFailed: failed to release reservation, caused by: failed to read inputs when trying to query catalog: [READ_FAILED] failed to read data from dataDir [<s3://flytemeta/metadata/propeller/.../n2/data/0/n13/inputs.pb>]., caused by: path:<s3://flytemeta/metadata/propeller/.../n2/data/0/n13/inputs.pb>: not found
hallowed-mouse-14616
08/10/2023, 2:51 PM
(1) The first error is that the inputs.pb value could not be read. Did this occur as part of an abort because a different task failed?
(2) It looks like the second error is the same - namely, failing to release the catalog reservation because the inputs.pb file does not exist. You have confirmed that it indeed does not exist in the blobstore, right? Again, is this part of the abort sequence? And are these map tasks?
(3) The image you linked is failing because propeller has a configurable limit on data sizes that are read from s3. The config value is maxDownloadMBs.
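One rough way to gauge up front whether a task's output literal (the metadata blob propeller downloads, not the file contents themselves) is approaching that configured limit is to serialize it locally with flytekit's TypeEngine. This is only a sketch: the bucket name and file count are placeholders, remote-looking URIs are used on the assumption that flytekit passes them through as references rather than uploading anything, and details may differ between flytekit versions.

from typing import List

from flytekit.core.context_manager import FlyteContextManager
from flytekit.core.type_engine import TypeEngine
from flytekit.types.file import FlyteFile

def estimate_literal_size_mb(value, python_type) -> float:
    # Convert a Python value to the Flyte literal propeller would read back
    # and measure its serialized protobuf size.
    ctx = FlyteContextManager.current_context()
    literal_type = TypeEngine.to_literal_type(python_type)
    literal = TypeEngine.to_literal(ctx, value, python_type, literal_type)
    return len(literal.to_flyte_idl().SerializeToString()) / (1024 * 1024)

# Placeholder URIs, assumed to be passed through as references.
files = [FlyteFile(f"s3://some-bucket/file_{i}.txt") for i in range(200)]
print(f"{estimate_literal_size_mb(files, List[FlyteFile]):.2f} MB")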
white-teacher-47376
08/21/2023, 2:33 PM
import os
from dataclasses import dataclass
from typing import List

import flytekit
from dataclasses_json import dataclass_json
from flytekit import task, workflow
from flytekit.types.file import FlyteFile


@dataclass_json
@dataclass
class FileStruct:
    a: List[FlyteFile]
    b: List[FlyteFile]
    c: FlyteFile


@task(cache=True, cache_version="1")
def return_large_number_of_files(n_files: int) -> List[FlyteFile]:
    # Write n_files small text files into the task's working directory.
    all_files = []
    for i in range(n_files):
        file_path = os.path.join(flytekit.current_context().working_directory, f"file_{i}.txt")
        with open(file_path, "w+") as f:
            f.write("something")
        all_files.append(file_path)
    return [FlyteFile(file) for file in all_files]


@task(cache=True, cache_version="1")
def restructure_these_files(files: List[FlyteFile]) -> List[FileStruct]:
    # Each FileStruct repeats the full file list twice, so the output literal grows quadratically.
    return [FileStruct(files, files, file) for file in files]


@task(cache=True, cache_version="1")
def do_something_with_all_files(inp: List[FileStruct]) -> None:
    # inp is a list; inspect the first element.
    print(inp[0].c)


@workflow
def demo_wf(n_files: int) -> None:
    files = return_large_number_of_files(n_files=n_files)
    structs = restructure_these_files(files=files)
    do_something_with_all_files(inp=structs)
It probably depends on your cluster config, but for me it fails with n_files=200, with an error indicating that the output is above the allowed limit of 2MB - yet the task shows as succeeded in the console.
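For a rough sense of why file handles alone can blow past a 2MB metadata limit (my own back-of-envelope estimate, not a measurement from the thread): with n_files=200, restructure_these_files returns 200 FileStruct instances, each carrying two full copies of the 200-element file list plus one extra file, i.e. about 80,200 file references in a single output literal.

# Back-of-envelope only; the bytes-per-reference figure is an assumption.
n_files = 200
refs_per_struct = 2 * n_files + 1          # fields a, b and c of FileStruct
total_refs = n_files * refs_per_struct     # one FileStruct per file
bytes_per_ref = 60                         # assumed average URI/metadata size
print(total_refs, "refs,", round(total_refs * bytes_per_ref / 1024 / 1024, 2), "MB")

At roughly 60 bytes of URI and metadata per reference that is already around 4-5MB, so it is the quadratic fan-out in restructure_these_files, not the file contents, that trips the limit.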