Nicholas Roberson  05/10/2023, 10:37 PM
Rahul Mehta  05/10/2023, 10:43 PM
) vs map_task. We have some high fan-out workflows right now that will spin up ~1k pods, but nothing close to that scale.
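(For context, a minimal flytekit sketch of the two fan-out patterns being compared here, a @dynamic workflow vs. map_task. The task and workflow names are illustrative, and the concurrency kwarg assumes a reasonably recent flytekit release:)
```python
from typing import List

from flytekit import task, workflow, dynamic, map_task


@task
def process(item: int) -> int:
    # placeholder unit of work; each invocation becomes its own pod
    return item * 2


@dynamic
def fan_out_dynamic(items: List[int]) -> List[int]:
    # @dynamic compiles one node per task at runtime,
    # so ~1k items means ~1k nodes for propeller to track
    return [process(item=i) for i in items]


@workflow
def fan_out_map(items: List[int]) -> List[int]:
    # map_task runs the same task over the list as a single (array) node
    # and lets you cap how many pods run at once
    return map_task(process, concurrency=50)(item=items)
```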
Nicholas Roberson  05/11/2023, 1:33 PM
would be the ideal performance; we can limit the concurrency there to protect us a bit (no need to spin up that many at once), but that many over 1 hr is still reasonable.
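(A hedged sketch of one way to apply that kind of cap from flytekit: an execution-level max_parallelism on a launch plan, which limits how many task nodes propeller runs at once per execution. The workflow and names are illustrative, and kwarg availability depends on your flytekit version:)
```python
from typing import List

from flytekit import LaunchPlan, task, workflow, map_task


@task
def process(item: int) -> int:
    return item * 2


@workflow
def fan_out(items: List[int]) -> List[int]:
    return map_task(process)(item=items)


# Hypothetical launch plan: max_parallelism caps how many task nodes
# propeller will run concurrently for executions started from this plan.
capped = LaunchPlan.get_or_create(
    workflow=fan_out,
    name="fan_out_capped",
    max_parallelism=50,
)
```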
> there are ways of increasing performance - spec offloading, increasing kube client limits and running sharded propeller
I'd be interested to know more about this if you have any docs or examples you can point me to.
Dan Rammer (hamersaw)  05/11/2023, 1:34 PM
Nicholas Roberson  05/11/2023, 1:36 PM
Nicholas Roberson  05/11/2023, 1:38 PM
Dan Rammer (hamersaw)  05/11/2023, 1:39 PM
> we would hit the limits of the Kubernetes API first before we max out Flyte
That's what we expect. If you're seeing anything different it would be great to dive into!
Nicholas Roberson  05/11/2023, 1:39 PM
Dan Rammer (hamersaw)  05/11/2023, 1:40 PM
Nicholas Roberson  05/11/2023, 1:51 PM
I ran a dynamic workflow with 1000 tasks and set to 100. Very bare bones.
[1/1] currentAttempt done. Last Error: USER::[1/1] currentAttempt done. Last Error: USER::containers with unready status: [nicholas-0vr23dznsximza1mggnq1a-n0-0-dn313-0]|Back-off pulling image "772228263286.dkr.ecr.us-west-2.amazonaws.com/flyte_run_parallel_workflow:b48da3d087bede34e4097373a549e5ffdad4f4e988156d1f4b667246c1863f5a"
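(For reference, a bare-bones version of that kind of test: a @dynamic workflow fanning out 1000 trivial tasks. This assumes the "100" above refers to an execution-level parallelism cap such as max_parallelism; the names are illustrative:)
```python
from typing import List

from flytekit import task, dynamic


@task
def noop(i: int) -> int:
    # trivial task body; the point of the test is pod scheduling, not work
    return i


@dynamic
def run_parallel_test(n: int = 1000) -> List[int]:
    # spins up n task nodes; propeller schedules them subject to
    # whatever parallelism cap the execution was launched with
    return [noop(i=i) for i in range(n)]
```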
Nicholas Roberson  05/14/2023, 4:08 PM
to handle this helps; however, if we have 10 workflows with 50 concurrency, let's say, we will still blow past the image pull rate (if the post I sent is accurate).
containers with unready status: [anjmtfxbkkzndb4sm9kj-n0-0-dn810-0]|Back-off pulling image "*************.dkr.ecr.us-west-2.amazonaws.com/flyte_run_parallel_workflow:155b30ac808f5bef6fd005479c0c6670be6584a4323ca9752c68931fe17647fc"
docker pull *************.dkr.ecr.us-west-2.amazonaws.com/flyte_run_parallel_workflow:155b30ac808f5bef6fd005479c0c6670be6584a4323ca9752c68931fe17647fc
and ran this a bunch of times (calling docker rmi ... to remove in between), which successfully downloaded the image to my local. I wonder if this is the limits on the propeller we are hitting and the pods can't spin up -> increasing our setting may help here.
Rahul Mehta  05/14/2023, 5:23 PM
> Essentially these jobs are going to come in really fast in real life, however they don't need to all be processed immediately and can sit waiting for multiple days if we can't get to them. We are largely trying to see how things behave under this scenario. High throughput, large queue for the scheduler/propeller to manage.
Have you thought about using SQS or Kafka to manage the queuing, and then having a scheduled workflow that pops batches of messages off the queue and launches them? Relying on the "workflow controller" (i.e. propeller) as your queue may not be precisely what you want here, and you'd have a greater degree of control over the concurrency of the tasks.
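(A rough sketch of that pattern, assuming SQS via boto3 and flytekit's FlyteRemote; the queue URL, project/domain, and launch plan name are placeholders:)
```python
import json

import boto3
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

# Placeholders -- substitute your own queue and Flyte project/domain.
QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/pending-jobs"

sqs = boto3.client("sqs", region_name="us-west-2")
remote = FlyteRemote(
    Config.auto(), default_project="flytesnacks", default_domain="development"
)

# Fetch the launch plan for the workflow that processes one queued job.
lp = remote.fetch_launch_plan(name="process_job_lp")

# Pop a small batch off the queue and launch one execution per message,
# so concurrency is bounded by how often/how much this scheduled job pulls.
resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
for msg in resp.get("Messages", []):
    payload = json.loads(msg["Body"])
    remote.execute(lp, inputs=payload)
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```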
Nicholas Roberson  05/14/2023, 5:26 PM
Rahul Mehta  05/14/2023, 5:28 PM
Nicholas Roberson  05/14/2023, 5:29 PM
Rahul Mehta  05/14/2023, 5:31 PM
Nicholas Roberson  05/14/2023, 5:32 PM
Rahul Mehta  05/14/2023, 5:39 PM
Nicholas Roberson  05/14/2023, 5:46 PM
Nicholas Roberson  05/14/2023, 5:57 PM
Nicholas Roberson  05/14/2023, 6:00 PM
Nicholas Roberson  05/14/2023, 6:01 PM