# flyte-support
b
Hello - we've been experimenting with using the ArrayNode `map_task` in 1.13 for very large jobs (>60,000 tasks) and have found an interesting failure case. It seems like once the job size crosses a certain threshold, we consistently see failures on the same `map_task` batch. The error is a generic:
```
[324-325][328-329][332-333][338-339]: [1/1] currentAttempt done. Last Error: UNKNOWN::Outputs not generated by task execution
```
The relevant part of the execution graph is attached in a thread. Each `_run_experiments_batch` runs 3 sets of `map_task`, where each `map_task` runs 400 tasks.
[Attachment: Screenshot from 2024-08-01 16-51-22.png]
What's interesting is that whether we run a job of this size or double it, in both cases the failure happens at node `dn9`. In each case, it's always the 3rd `map_task` that fails on this node with the above error. When I attempt to view the logs in CloudWatch, I see the logs of the 1st (of 3) `map_task` on that node instead of the 3rd.
Recovering the job appears to have it complete successfully.
f
I think we only ran scale testing for ArrayNode up to ~5000 subtasks. And just to confirm, are you running an ArrayNode with 60,000 subtasks? I figured we'd run into etcd size limitations with that many, but that should lead to a different error. Had you run workloads of similar scale on the legacy map tasks? Also, what concurrency were you running the map task with? What Flyte version are you running for propeller, and what flytekit version? We should probably enrich the error message to include the file path. I'm unsure how/if the output prefix/file path could differ between when the subtask pod is built and when the output writer is created after the task succeeds.
f
5k is the limit; if you want 60k, you will have to run 12 copies of them. The state explodes and etcd cannot handle it
f
yup, and by breaking it into 12 copies, it'd be executing over 12 different launch plans. However, I'm unsure if the workflow is actually running 60k tasks. From my understanding, if the workflow state gets too large for etcd to handle, there should be an error at the controller level that fails the workflow with "Workflow execution state is too large for Flyte to handle" and then fails any running tasks
b
Each `map_task` runs 400 tasks, and each one of those dynamics in the screencap runs 3 of those `map_task`s in sequence, so roughly 1200 per dynamic. I then batch up the dynamics so that no more than 5 are running concurrently.
The 60,000+ number is from distributing the work in batches across those dynamic workflows.
So to recap - each ArrayNode individually does not run more than 400 tasks, each dynamic workflow should run roughly 1200, but each Flyte execution should be running on the order of ~60,000 total, with roughly no more than 2000 individual tasks running concurrently
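Roughly, the shape of it in code is something like this (a simplified sketch, not our actual code - task names, bodies, and the chunking logic are placeholders), using flytekit's `@dynamic` and `map_task`:

```python
from flytekit import dynamic, map_task, task, workflow


@task
def run_experiment(cfg: int) -> int:
    # placeholder for the real per-item work
    return cfg


@dynamic
def _run_experiments_batch(configs: list[int]):
    # up to 3 map_tasks of <=400 items each, chained so they run in sequence
    # (~1200 subtasks per dynamic node)
    prev = None
    for i in range(0, len(configs), 400):
        node = map_task(run_experiment)(cfg=configs[i:i + 400])
        if prev is not None:
            prev >> node  # enforce sequential ordering between the map_tasks
        prev = node


@workflow
def experiments_wf(configs: list[int]):
    # the real execution fans out ~50 of these dynamics, batched so at most 5 run at once
    _run_experiments_batch(configs=configs)
```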
@flat-area-42876 we just recently upgraded flytekit and propeller to 1.13 from 1.10.7. Before, we were using `FlyteRemote` to distribute our workload across separate pipeline executions because we couldn't hit nearly this many within one execution. I'm testing to see if the (noticeably improved!) scalability in 1.13 will allow us to move our entire workset into a single execution
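The `FlyteRemote` fan-out was roughly this shape (sketch only - the project, domain, launch plan name, and batch data below are placeholders, not our real values):

```python
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(
    Config.auto(),                   # picks up the local Flyte config
    default_project="my-project",    # placeholder
    default_domain="development",    # placeholder
)

lp = remote.fetch_launch_plan(name="experiments.pipeline")  # placeholder name

batched_configs = [[1, 2, 3], [4, 5, 6]]  # stand-in for the real batches

# one execution per batch instead of one giant execution
for configs in batched_configs:
    remote.execute(lp, inputs={"configs": configs})
```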
I bumped the resource requests on the dynamic workflows (does this have an impact?) and on a task we use to materialize results, and was able to complete 2 runs of ~60,000 total tasks. I'm testing now with a very large run of ~150,000 tasks (same structure as above, but with the ArrayNodes bumped from 400 to 500). If I want to complete very large runs like this from a single execution, should I expect to need to bump the ArrayNodes up into the thousands? One of our task types has variable duration, so I've been aiming to keep the ArrayNodes fairly constrained so that a small number of long-running tasks don't block progress
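(By "bumped the resource requests on the dynamic workflows" I mean something along these lines - assuming `@dynamic` accepts the same `requests`/`limits` arguments as `@task`; the values are placeholders:)

```python
from flytekit import Resources, dynamic


@dynamic(requests=Resources(cpu="2", mem="8Gi"), limits=Resources(cpu="4", mem="16Gi"))
def _run_experiments_batch(configs: list[int]):
    ...  # same body as before
```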
That very large run failed - what I'm looking for is to understand whether there is an upper bound on the total number of nodes in an execution or in a single dynamic
f
Upper bound should be around 5k nodes in the execution before you start running into etcd issues. You could increase the scale of a single execution by kicking off external workflows from a parent workflow
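Something like this, for example (a sketch with placeholder names) - calling a LaunchPlan from inside a workflow launches it as its own child execution, so each child carries its own workflow state:

```python
from flytekit import LaunchPlan, dynamic, task, workflow


@task
def run_chunk(configs: list[int]) -> int:
    # hypothetical stand-in for the real work
    return len(configs)


@workflow
def child_wf(configs: list[int]) -> int:
    return run_chunk(configs=configs)


child_lp = LaunchPlan.get_or_create(child_wf, name="child_wf_lp")  # placeholder name


@dynamic
def parent(all_configs: list[list[int]]):
    for configs in all_configs:
        child_lp(configs=configs)  # each call becomes a separate child execution
```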
b
An entire `map_task` is a single ArrayNode, right? In my case I'm running into issues with 50 dynamic workflows that each have 3 `map_task`s and 2 tasks to materialize results, so ~250 nodes
f
Yup, it's a single node. I'm skeptical that the `Last Error: UNKNOWN::Outputs not generated by task execution` is tied to etcd limitations. This error occurs when the task is successful but then propeller's storage client gets a "Not found" when trying to read the output file. Can you check if the output file exists in your blob store?
Are you using the default stow_store DataStore client as well?
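For the blob store check, a quick way is something like the following, assuming S3 (the bucket and key below are placeholders - use the real outputs URI from the failed node's details):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket = "my-flyte-bucket"                   # placeholder
key = "metadata/propeller/.../0/outputs.pb"  # placeholder; copy the real output prefix

try:
    s3.head_object(Bucket=bucket, Key=key)
    print("outputs file exists")
except ClientError as e:
    if e.response["Error"]["Code"] in ("404", "NoSuchKey"):
        print("outputs file is missing")
    else:
        raise
```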