boundless-policeman-47000 (08/01/2024, 8:49 PM):
We've been running map_task in 1.13 for very large jobs (>60,000 tasks) and have found an interesting failure case. It seems like once the job size crosses a certain threshold, we consistently see failures on the same map_task batch. The error is a generic:

[324-325][328-329][332-333][338-339]: [1/1] currentAttempt done. Last Error: UNKNOWN::Outputs not generated by task execution

The relevant part of the execution graph is attached in a thread. Each _run_experiments_batch runs 3 sets of map_task, where each map_task runs 400 tasks.
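(Editor's note: the bracketed prefixes on that error line appear to be index ranges of the failing subtasks within the map_task. A small helper for expanding them, to cross-reference against inputs — this is a hypothetical convenience function, not part of flytekit:)

```python
import re

def failed_indices(err: str) -> list[int]:
    """Expand bracketed index ranges like "[324-325][328-329]"
    into a flat list of subtask indices."""
    indices = []
    for lo, hi in re.findall(r"\[(\d+)-(\d+)\]", err):
        indices.extend(range(int(lo), int(hi) + 1))
    return indices

err = ("[324-325][328-329][332-333][338-339]: [1/1] currentAttempt done. "
       "Last Error: UNKNOWN::Outputs not generated by task execution")
print(failed_indices(err))  # [324, 325, 328, 329, 332, 333, 338, 339]
```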
boundless-policeman-47000 (08/01/2024, 8:55 PM):
dn9

boundless-policeman-47000 (08/01/2024, 8:56 PM):
map_task on that node instead of the 3rd.

flat-area-42876 (08/02/2024, 12:53 AM):
freezing-airport-6809
boundless-policeman-47000 (08/02/2024, 3:52 AM):
Each map_task runs 400 tasks, and each one of those dynamics in the screencap runs 3 of those map_tasks in sequence, so roughly 1200 tasks per dynamic. I then batch up the dynamics so that no more than 5 are running concurrently.
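(Editor's note: the batching scheme described above, sketched as plain Python with hypothetical names; in a real pipeline the concurrency cap would be enforced by however the dynamics are launched:)

```python
def batches(items, max_concurrent=5):
    """Yield successive groups of at most max_concurrent items,
    so only one group of dynamics is in flight at a time."""
    for i in range(0, len(items), max_concurrent):
        yield items[i:i + max_concurrent]

# 50 dynamics, each running ~1200 tasks (figures from this thread)
dynamics = [f"dynamic-{n}" for n in range(50)]
groups = list(batches(dynamics, max_concurrent=5))
print(len(groups), max(len(g) for g in groups))  # 10 5
```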
boundless-policeman-47000 (08/02/2024, 4:11 AM):
We use FlyteRemote to distribute our workload across separate pipeline executions because we couldn't hit nearly this many tasks within one execution. I'm testing to see if the (noticeably improved!) scalability in 1.13 will allow us to move our entire workset into a single execution.
boundless-policeman-47000 (08/02/2024, 7:54 PM):
Each map_task is a single ArrayNode, right? In my case I'm running into issues with 50 dynamic workflows that each have 3 map_tasks and 2 tasks to materialize results, so ~250 nodes.
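(Editor's note: a quick sanity check of those counts, using only the figures quoted in this thread:)

```python
num_dynamics = 50
map_tasks_per_dynamic = 3
materialize_tasks_per_dynamic = 2
tasks_per_map = 400

# Top-level nodes: each dynamic contributes 3 ArrayNodes + 2 plain tasks.
nodes = num_dynamics * (map_tasks_per_dynamic + materialize_tasks_per_dynamic)
# Individual subtasks across all map_tasks -- the ">60,000 tasks" figure.
total_map_subtasks = num_dynamics * map_tasks_per_dynamic * tasks_per_map

print(nodes, total_map_subtasks)  # 250 60000
```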
flat-area-42876 (08/02/2024, 8:11 PM):
"Last Error: UNKNOWN::Outputs not generated by task execution" is tied to etcd limitations. This error occurs when the task is successful but then propeller's storage client gets a "Not found" when trying to read the output file. Can you check if the output file exists in your blob store?
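(Editor's note: a minimal sketch of the kind of check being asked for, assuming a filesystem-mounted metadata store such as a local sandbox; for S3/GCS you would use your blob-store client's equivalent "object exists" call. The outputs.pb filename is what propeller reads back after a task completes; the directory layout above it is deployment-specific, so the example path in the docstring is hypothetical:)

```python
import os

def node_outputs_exist(node_data_dir: str) -> bool:
    """Return True if the node's outputs file was actually written.

    node_data_dir is the node's data directory in the metadata store
    (layout is deployment-specific; copy the real path from the failing
    node's metadata rather than guessing it).
    """
    return os.path.exists(os.path.join(node_data_dir, "outputs.pb"))
```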