Nicholas Roberson
05/10/2023, 10:37 PM
map_task
Rahul Mehta
05/10/2023, 10:43 PM
dynamic) vs map_task. We have some high fan-out dynamic workflows right now that will spin up ~1k pods, but nothing close to that scale
Yee
05/10/2023, 10:43 PM
Ketan (kumare3)
05/11/2023, 12:28 AM
Nicholas Roberson
05/11/2023, 1:33 PM
map_task would be the ideal performance, we can limit the concurrency there to protect us a bit (no need to spin up that many at once), but that many over 1hr is still reasonable.
"there are ways of increasing performance - spec offloading, increasing kube client limits and running sharded propeller"
I'd be interested to know more about this if you have any docs or examples you can point me to.
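A minimal sketch of what "limit the concurrency" means in practice: many items are submitted, but only a fixed number are in flight at any moment. This is plain standard-library Python, not Flyte code; flytekit's map_task accepts a concurrency argument that plays a similar role. The names run_with_cap and work are hypothetical, and the doubling step is only a placeholder for real task logic.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_with_cap(items, concurrency):
    """Process every item with at most `concurrency` in flight.

    Returns (results, peak), where peak is the highest number of
    items observed running at the same time.
    """
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def work(item):
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        try:
            return item * 2  # placeholder for the real task body
        finally:
            with lock:
                in_flight -= 1

    # max_workers is the cap: the pool never runs more than this many at once.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(work, items))  # map() preserves input order
    return results, peak
```

Calling run_with_cap(range(1000), 100) processes all 1000 items while the observed peak never exceeds 100, which is the protection being described: the total amount of work is unchanged, only the burst size is bounded.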
Dan Rammer (hamersaw)
05/11/2023, 1:34 PM
"No more than 150,000 total pods." Might be something to note.
Nicholas Roberson
05/11/2023, 1:36 PM
Dan Rammer (hamersaw)
05/11/2023, 1:38 PM
Nicholas Roberson
05/11/2023, 1:38 PM
Dan Rammer (hamersaw)
05/11/2023, 1:39 PM
"we would hit the limits of the Kubernetes API first before we max out Flyte"
That's what we expect. If you're seeing anything different it would be great to dive into!
Nicholas Roberson
05/11/2023, 1:39 PM
Dan Rammer (hamersaw)
05/11/2023, 1:40 PM
Ketan (kumare3)
05/11/2023, 1:41 PM
Nicholas Roberson
05/11/2023, 1:51 PM
[1/1] currentAttempt done. Last Error: USER::[1/1] currentAttempt done. Last Error: USER::containers with unready status: [nicholas-0vr23dznsximza1mggnq1a-n0-0-dn313-0]|Back-off pulling image "772228263286.dkr.ecr.us-west-2.amazonaws.com/flyte_run_parallel_workflow:b48da3d087bede34e4097373a549e5ffdad4f4e988156d1f4b667246c1863f5a"
I ran a dynamic workflow with 1000 tasks and max_parallelism set to 100. Very bare bones.
Yee
05/11/2023, 11:15 PM
Nicholas Roberson
05/14/2023, 4:08 PM
map_task to handle this helps, however if we have 10 workflows with 50 concurrency let's say, we will still blow past the image pull rate (if the post I sent is accurate)
containers with unready status: [anjmtfxbkkzndb4sm9kj-n0-0-dn810-0]|Back-off pulling image "*************.dkr.ecr.us-west-2.amazonaws.com/flyte_run_parallel_workflow:155b30ac808f5bef6fd005479c0c6670be6584a4323ca9752c68931fe17647fc"
and ran this a bunch of times (calling docker rmi ... to remove in between)
docker pull *************.dkr.ecr.us-west-2.amazonaws.com/flyte_run_parallel_workflow:155b30ac808f5bef6fd005479c0c6670be6584a4323ca9752c68931fe17647fc
which successfully downloaded the image to my local. I wonder if this is the limits on the propeller we are hitting and the pods can't spin up -> increasing our burst setting may help here.
Rahul Mehta
05/14/2023, 5:23 PM
"Essentially these jobs are going to come in really fast in real life, however they don't need to all be processed immediately and can sit waiting for multiple days if we can't get to them. We are largely trying to see how things behave under this scenario. High throughput, large queue for the scheduler/propeller to manage."
Have you thought about using SQS or Kafka to manage the queuing and then have a scheduled workflow that pops batches of messages off the queue and launches them? Seems like relying on the “workflow controller” (ie propeller) as your queue may not be precisely what you want here / you’ll have a greater degree of control over the concurrency of the tasks.
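The queue-plus-scheduled-dispatcher idea can be sketched roughly as follows. This is an illustration under stated assumptions, not a Flyte or SQS integration: an in-memory deque stands in for SQS or Kafka, each loop iteration stands in for one run of the scheduled workflow, and the names pop_batch, simulate and finish_per_tick are hypothetical. The point it shows is that a single dispatcher enforces one global cap on in-flight work, whereas 10 workflows with 50 concurrency each could reach 10 x 50 = 500 pods at once.

```python
from collections import deque


def pop_batch(queue, batch_size, budget):
    """Pop up to batch_size messages, but never more than the remaining budget."""
    take = min(batch_size, max(0, budget), len(queue))
    return [queue.popleft() for _ in range(take)]


def simulate(total_jobs, batch_size, max_in_flight, finish_per_tick):
    """Drain a queue of jobs in scheduled ticks under a global in-flight cap.

    Returns (ticks, peak): how many scheduler runs it took and the
    highest number of jobs that were in flight at once.
    """
    queue = deque(range(total_jobs))  # stand-in for SQS / Kafka
    in_flight = 0
    peak = 0
    ticks = 0
    while queue or in_flight:
        # Some previously launched jobs complete before this tick.
        in_flight -= min(in_flight, finish_per_tick)
        # Launch only as many as the global cap still allows.
        batch = pop_batch(queue, batch_size, max_in_flight - in_flight)
        in_flight += len(batch)  # launching would happen here
        peak = max(peak, in_flight)
        ticks += 1
    return ticks, peak
```

Jobs that do not fit under the cap simply stay in the queue until a later tick, which matches the requirement that work can wait for days rather than being scheduled immediately.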
Ketan (kumare3)
05/14/2023, 5:24 PM
Nicholas Roberson
05/14/2023, 5:26 PM
Rahul Mehta
05/14/2023, 5:28 PM
Nicholas Roberson
05/14/2023, 5:29 PM
Rahul Mehta
05/14/2023, 5:31 PM
Nicholas Roberson
05/14/2023, 5:32 PM
Rahul Mehta
05/14/2023, 5:39 PM
Ketan (kumare3)
05/14/2023, 5:43 PM
Nicholas Roberson
05/14/2023, 5:46 PM
Ketan (kumare3)
05/14/2023, 5:54 PM
Nicholas Roberson
05/14/2023, 5:57 PM
Ketan (kumare3)
05/14/2023, 5:59 PM
Nicholas Roberson
05/14/2023, 6:00 PM
Ketan (kumare3)
05/14/2023, 6:01 PM
Nicholas Roberson
05/14/2023, 6:01 PM