boundless-policeman-47000 (08/01/2024, 8:49 PM):
We're running map_task in 1.13 for very large jobs (>60,000 tasks) and have found an interesting failure case. Once the job size crosses a certain threshold, we consistently see failures on the same map_task batch. The error is a generic:

[324-325][328-329][332-333][338-339]: [1/1] currentAttempt done. Last Error: UNKNOWN::Outputs not generated by task execution

The relevant part of the execution graph is attached in a thread. Each _run_experiments_batch runs 3 sets of map_task, where each map_task runs 400 tasks.
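The fan-out described above can be sanity-checked with a small, self-contained sketch (plain Python, not the poster's actual Flyte workflow; the 60,000-task figure is the job size from the first message):

```python
import math

# Fan-out described in the thread: each _run_experiments_batch dynamic
# runs 3 map_tasks, and each map_task fans out over 400 inputs.
MAP_TASKS_PER_BATCH = 3
TASKS_PER_MAP_TASK = 400

def tasks_per_batch() -> int:
    """Task executions produced by one _run_experiments_batch."""
    return MAP_TASKS_PER_BATCH * TASKS_PER_MAP_TASK

def batches_for(total_tasks: int) -> int:
    """Batches needed to cover a job of the given size."""
    return math.ceil(total_tasks / tasks_per_batch())

print(tasks_per_batch())    # 1200 tasks per batch
print(batches_for(60_000))  # 50 batches for the >60,000-task job size
```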
boundless-policeman-47000 (08/01/2024, 8:53 PM):
[no message text captured]
boundless-policeman-47000 (08/01/2024, 8:55 PM):
dn9
boundless-policeman-47000 (08/01/2024, 8:56 PM):
[start of message missing] ...map_task on that node instead of the 3rd.
boundless-policeman-47000 (08/01/2024, 8:56 PM):
[no message text captured]
flat-area-42876 (08/02/2024, 12:53 AM):
[no message text captured]
freezing-airport-6809 (08/02/2024, timestamp not captured):
[no message text captured]

flat-area-42876 (08/02/2024, 2:01 AM):
[no message text captured]
boundless-policeman-47000 (08/02/2024, 3:52 AM):
Each map_task runs 400 tasks, and each of those dynamics in the screencap runs 3 of those map_tasks in sequence, so roughly 1,200 per dynamic. I then batch up the dynamics so that no more than 5 are running concurrently.
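The concurrency cap described here amounts to simple chunking of the dynamic launches; a minimal sketch (the labels are hypothetical, not the poster's code):

```python
from typing import Iterator, List, TypeVar

T = TypeVar("T")

def batches(items: List[T], limit: int) -> Iterator[List[T]]:
    """Yield successive groups of at most `limit` items, so that only one
    group's worth of dynamic workflows is launched at a time."""
    for i in range(0, len(items), limit):
        yield items[i : i + limit]

# Hypothetical labels for the 50 dynamics discussed later in the thread.
dynamics = [f"dynamic-{i}" for i in range(50)]
groups = list(batches(dynamics, 5))
print(len(groups))  # 10 groups, each capped at 5 concurrent dynamics
```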
boundless-policeman-47000 (08/02/2024, 3:56 AM):
[no message text captured]
boundless-policeman-47000 (08/02/2024, 3:59 AM):
[no message text captured]
boundless-policeman-47000 (08/02/2024, 4:11 AM):
We previously used FlyteRemote to distribute our workload across separate pipeline executions because we couldn't hit nearly this many tasks within one execution. I'm testing to see whether the (noticeably improved!) scalability in 1.13 will let us move our entire workset into a single execution.
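A minimal sketch of the pre-1.13 pattern being described, assuming the workset is a flat list: the chunking below is the testable part, and each chunk would then be passed as the input to a separate FlyteRemote execution (the remote call itself is not shown, and all names here are hypothetical):

```python
from typing import List, Sequence

def partition_workset(workset: Sequence[str], per_execution: int) -> List[List[str]]:
    """Split the full workset into chunks, one chunk per pipeline execution.
    With FlyteRemote, each chunk would be submitted as its own execution
    instead of running everything inside a single workflow."""
    return [list(workset[i : i + per_execution])
            for i in range(0, len(workset), per_execution)]

workset = [f"item-{i}" for i in range(60_000)]  # hypothetical work items
chunks = partition_workset(workset, 1_200)
print(len(chunks))  # 50 separate executions for the full workset
```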
boundless-policeman-47000 (08/02/2024, 4:16 AM):
[no message text captured]
boundless-policeman-47000 (08/02/2024, 4:56 PM):
[no message text captured]
flat-area-42876 (08/02/2024, 6:01 PM):
[no message text captured]
boundless-policeman-47000 (08/02/2024, 7:54 PM):
Each map_task is a single ArrayNode, right? In my case I'm running into issues with 50 dynamic workflows that each have 3 map_tasks and 2 tasks to materialize results, so roughly 250 nodes.
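The node count quoted here checks out arithmetically (plain Python, using only the figures given in the message):

```python
# 50 dynamic workflows, each containing 3 map_task nodes (each a single
# ArrayNode) plus 2 result-materialization task nodes.
DYNAMICS = 50
MAP_TASK_NODES = 3
MATERIALIZE_NODES = 2

nodes = DYNAMICS * (MAP_TASK_NODES + MATERIALIZE_NODES)
print(nodes)  # 250 nodes, matching the "~250 nodes" in the message
```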
flat-area-42876 (08/02/2024, 8:11 PM):
Last Error: UNKNOWN::Outputs not generated by task execution is tied to etcd limitations. This error occurs when the task succeeds but propeller's storage client then gets a "Not found" when trying to read the output file.
Can you check whether the output file exists in your blob store?
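One way to locate that file, assuming an S3-backed metadata store: build the URI propeller would read from, then list it with your blob-store CLI. The bucket name and the metadata/propeller path layout below are assumptions about a typical deployment, not details from this thread; copy the real output URI from the task's detail view if it differs.

```python
# Construct the URI where propeller is assumed to look for a node's
# outputs.pb after the task reports success. The path layout is an
# assumption about a default Flyte deployment -- verify against yours.
def outputs_uri(bucket: str, exec_id: str, node_id: str, attempt: int = 0) -> str:
    return (f"s3://{bucket}/metadata/propeller/"
            f"{exec_id}/{node_id}/data/{attempt}/outputs.pb")

# Hypothetical identifiers, for illustration only.
uri = outputs_uri("my-flyte-bucket", "myproject-development-abc123", "n0")
print(uri)  # then check existence with e.g. `aws s3 ls <uri>`
```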
flat-area-42876 (08/02/2024, 8:32 PM):
[no message text captured]