# flyte-deployment
n
@Yee we have a large job coming up and want to stress test what our Flyte deployment / Flyte scheduler can manage. We were considering a stress test with an increasing number of (very simple) tasks, starting somewhere around 10k pods and increasing on subsequent runs to roughly 200k pods. All will be spun up in a short amount of time, and we want to see whether it kills our deployment, or what the performance impacts are. Any ideas on how to better stress test, or things to look out for that could help us manage this massive amount of parallelism?
Initial thoughts are, even at 10k, to create them in batches so we don't DDoS the Flyte API. But past that, once they all exist, we want to see how the scheduler manages under this load.
We also want to see what the difference may be when just creating them in a for loop vs. using `map_task`
r
Very interested in the for loops (i.e. in `dynamic`) vs `map_task`. We have some high fan-out `dynamic` workflows right now that will spin up ~1k pods, but nothing close to that scale
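For reference, a minimal flytekit sketch of the two fan-out styles under discussion - the task and workflow names, and the `concurrency` kwarg on `map_task`, are illustrative and depend on your flytekit version:
```python
from typing import List

from flytekit import dynamic, map_task, task, workflow


@task
def noop(x: int) -> int:
    # stand-in for the "very simple" stress-test task
    return x


@dynamic
def fan_out_with_loop(n: int) -> None:
    # each loop iteration compiles into its own node (and pod)
    for i in range(n):
        noop(x=i)


@workflow
def fan_out_with_map(xs: List[int]) -> List[int]:
    # a single map/array node; concurrency caps how many pods run at once
    return map_task(noop, concurrency=100)(x=xs)
```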
y
cc @Dan Rammer (hamersaw) and @Eduardo Apolinario (eapolinario)
we’ve definitely run stress tests in the past, i don’t quite remember the numbers though.
@Ketan (kumare3) may have a better idea.
k
@Nicholas Roberson - so map tasks are more efficient compared to dynamic workflows or other types of pod-based tasks
also propeller will try to protect itself and the kube API aggressively, so you may not get very high throughput, as there is max parallelism (you should tweak it),
there are ways of increasing performance - spec offloading, increasing kube client limits and running sharded propeller
n
@Ketan (kumare3) Interesting, and yeah we suspected that `map_task` would give the ideal performance. We can limit the concurrency there to protect us a bit (no need to spin up that many at once), but that many over 1 hr is still reasonable.
> there are ways of increasing performance - spec offloading, increasing kube client limits and running sharded propeller
I'd be interested to know more about this if you have any docs or examples you can point me to.
d
@Nicholas Roberson very interested in discussing the results of this. There are a number of knobs in Flyte (mostly between propeller and admin) that can be tuned for higher throughput. As suggested, in executions of this scale we typically see the kube apiserver as the bottleneck.
also, per the k8s docs - "No more than 150,000 total pods" - might be something to note.
n
@Dan Rammer (hamersaw) Good to know, so in reality we would hit the limits of the Kubernetes API first before we max out Flyte?
We will probably update the max to 100k then, but I assume we will either bring something down or Flyte will slow us down to protect itself.
n
Also, in reality we can have N pods process M jobs each (not every job has to be its own pod), so I would imagine we would probably not set M = 1 if what I am reading from you all is true.
d
> we would hit the limits of the Kubernetes API first before we max out Flyte
That's what we expect. If you're seeing anything different, it would be great to dive into!
n
Solid, I'll add this to the stress test doc I have internally and share it with the infra team as well as other stakeholders.
d
And then here's the PR for adding kubeclient config to admin; in admin this just covers creating new FlyteWorkflow CRs, so if you're using map tasks this shouldn't be an issue. However, similar configuration exists in flytepropeller (don't have the docs on hand), which will affect creation of Pods, etc. At larger scales this will probably require some tweaking.
k
I think we should just have this in the docs
n
Ok thanks all, I appreciate the feedback. Typically, what would be the best metrics to look at while running this? We have DataDog + kube metrics scraped by Prometheus and ETLed to a reporting DB.
I mean #1 is always "did we crash anything", but past that?
It's likely we are not going to try to run all 100k at the same time (but we will submit them in a short time); we will increase the parallelism to a certain point and then watch how the remaining 90k, let's say, stress the system. Essentially these jobs are going to come in really fast in real life, however they don't need to all be processed immediately and can sit waiting for multiple days if we can't get to them. We are largely trying to see how things behave under this scenario: high throughput, large queue for the scheduler/propeller to manage.
Can someone help me interpret this error:
```
[1/1] currentAttempt done. Last Error: USER::[1/1] currentAttempt done. Last Error: USER::containers with unready status: [nicholas-0vr23dznsximza1mggnq1a-n0-0-dn313-0]|Back-off pulling image "772228263286.dkr.ecr.us-west-2.amazonaws.com/flyte_run_parallel_workflow:b48da3d087bede34e4097373a549e5ffdad4f4e988156d1f4b667246c1863f5a"
```
I ran a dynamic workflow with 1000 tasks and `max_parallelism` set to 100. Very bare bones.
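For context, `max_parallelism` can also be pinned on a launch plan rather than set per execution; a rough flytekit sketch with hypothetical names (the kwarg's availability depends on your flytekit version):
```python
from flytekit import LaunchPlan, task, workflow


@task
def noop(x: int) -> int:
    return x


@workflow
def stress_test_wf(x: int) -> int:
    return noop(x=x)


# caps how many task nodes propeller runs concurrently within a single
# execution launched from this (hypothetical) plan
stress_lp = LaunchPlan.get_or_create(
    workflow=stress_test_wf,
    name="stress_test_lp",
    max_parallelism=100,
)
```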
y
that error message is probably correct. you’re probably actually hitting the rate-limit on ecr. (assuming the image exists.)
n
Hm, yeah seems like it; for individual repositories it looks like the image pull is rate limited (on top of ECR limits for all images). Is there any way to cache the image in a workflow so we don't have to consistently pull it past the first time?
The overall limit on image pulls is really high; spinning up 1k pods should not hit it. Using `map_task` to handle this helps, however if we have 10 workflows with 50 concurrency, let's say, we will still blow past the image pull rate (if the post I sent is accurate).
I was able to reproduce the error in Flyte
```
containers with unready status: [anjmtfxbkkzndb4sm9kj-n0-0-dn810-0]|Back-off pulling image "*************.dkr.ecr.us-west-2.amazonaws.com/flyte_run_parallel_workflow:155b30ac808f5bef6fd005479c0c6670be6584a4323ca9752c68931fe17647fc"
```
and ran this a bunch of times (calling `docker rmi ...` to remove it in between)
```
docker pull *************.dkr.ecr.us-west-2.amazonaws.com/flyte_run_parallel_workflow:155b30ac808f5bef6fd005479c0c6670be6584a4323ca9752c68931fe17647fc
```
which successfully downloaded the image to my local machine. I wonder if this is a limit on the propeller that we are hitting and the pods can't spin up -> increasing our `burst` setting may help here.
r
> Essentially these jobs are going to come in really fast in real life, however they don't need to all be processed immediately and can sit waiting for multiple days if we can't get to them. We are largely trying to see how things behave under this scenario. High throughput, large queue for the scheduler/propeller to manage.
Have you thought about using SQS or Kafka to manage the queuing and then having a scheduled workflow that pops batches of messages off the queue and launches them? It seems like relying on the "workflow controller" (i.e. propeller) as your queue may not be precisely what you want here, and you'll have a greater degree of control over the concurrency of the tasks.
k
This is not propeller - it's getting throttled pulling images
But need to understand more, ideally you should not hit this as it will reuse a downloaded image
n
Yeah the current setup we have now is doing something similar to that, in the name of simplification of some of our tools we wanted to try to move to a Flyte only implementation. That being said, we could use a new redis queue to do this.
1. On job submit, push 10k messages to a queue + spin up a workflow to process that queue.
2. The workflow loops with a delay and creates async tasks to process messages on the queue.
3. Once all messages are processed and the queue is empty, the workflow ends.
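A minimal sketch of that pattern, assuming SQS and a flytekit `@dynamic` body doing the queue read as plain Python (queue URL, batch size, and task names are hypothetical); the "loop with a delay" from step 2 could be an outer driver that relaunches this workflow, or the cron variant discussed below:
```python
from typing import List

import boto3
from flytekit import dynamic, task


@task
def process(job: str) -> None:
    # the actual per-job work; each call becomes its own pod
    ...


def pop_batch(queue_url: str, batch_size: int) -> List[str]:
    # plain helper (not a Flyte task): drain up to batch_size messages from SQS
    sqs = boto3.client("sqs")
    jobs: List[str] = []
    while len(jobs) < batch_size:
        resp = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=1
        )
        msgs = resp.get("Messages", [])
        if not msgs:
            break
        for m in msgs:
            jobs.append(m["Body"])
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=m["ReceiptHandle"])
    return jobs


@dynamic
def process_queue_batch(queue_url: str, batch_size: int) -> None:
    # a @dynamic body runs as ordinary Python at execution time, so we can
    # read the queue here and fan out one task node per message
    for job in pop_batch(queue_url, batch_size):
        process(job=job)
```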
r
Is there a reason you prefer a single workflow execution to handle the entire set of messages? Rather than a workflow itself handling the loop could a cron schedule handle that?
Ie, workflow on a cron schedule pops k messages off the queue and dispatches tasks to process those, and ends when processing that batch is finished
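For what it's worth, a hedged sketch of that cron variant, reusing the `process_queue_batch` workflow from the earlier sketch (schedule string, inputs, and names are illustrative; `CronSchedule` parameter names vary by flytekit version):
```python
from flytekit import CronSchedule, LaunchPlan

# run the batch-drain workflow every 10 minutes; queue URL and batch size
# are placeholders
scheduled_lp = LaunchPlan.get_or_create(
    workflow=process_queue_batch,
    name="process_queue_every_10m",
    schedule=CronSchedule(schedule="*/10 * * * *"),
    fixed_inputs={
        "queue_url": "https://sqs.us-west-2.amazonaws.com/123456789012/jobs",
        "batch_size": 500,
    },
)
```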
n
We are pushing for better traceability at the moment - tracking logs, cost, etc. for large jobs. If we can do it in one workflow, that would be ideal; however, if there are too many complications as a result we can move to a hybrid solution.
Having all work under a single umbrella (workflow) makes things easy to track for devs as well as users wondering about the status of their job since they can just go to one place in the console to track.
r
I see, in general we’ve found that smaller units of work make reasoning about retries, etc easier. But understood on tracking workflow execution state.
n
Yeah I agree with you 100%, we're trying to balance that with this implementation.
Let me do some noodling, but at the moment the only barrier seems to be the rate limit we are hitting in ECR, and if we want to have thousands of pods pulling the same image we will need to figure out how to insulate our setup from hitting that as best we can.
r
That sounds like you need to tune image pull policies & ensure that tasks pack onto the same nodes so you can reuse the image (depending on your cluster autoscaler - i.e. Karpenter, Autopilot, etc. - this is fairly simple)
We hit something similar when our autoscaler was launching a large number of small nodes - by configuring it to prefer larger instances we could pack more tasks onto a single node and avoid the unnecessary pulls.
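A rough sketch of the image-pull side of that, assuming flytekit's `PodTemplate` support (the default primary container name and merge behavior may differ across versions):
```python
from flytekit import PodTemplate, task
from kubernetes.client import V1Container, V1PodSpec

# partial pod spec that flytekit merges into the task's pod; "primary" is
# assumed to be flytekit's default primary container name. IfNotPresent
# lets pods on a node that already holds the image skip ECR entirely.
cached_image_template = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[V1Container(name="primary", image_pull_policy="IfNotPresent")],
    ),
)


@task(pod_template=cached_image_template)
def simple_job(x: int) -> int:
    return x
```
Note that for tags other than `:latest`, kubelet already defaults to IfNotPresent, so packing more tasks per node (fewer, larger instances) is usually the bigger lever against the per-repository pull limit.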
k
So @Nicholas Roberson / @Rahul Mehta, one of the completely undocumented features in Flyte is the inbuilt resource control system that uses Redis
Also you are right Rahul - at Lyft we had massive machines with SSDs, and at scale we did not even spin many of them down
Also happy to jump on a call sometime to talk more about things
@Nicholas Roberson not a sales thing, but Union Cloud has a different architecture - it has queues built in that can store workflows and distribute them
n
That would be really helpful sometime during the week, let me get back and see if infra would want to join and we can chat.
k
Also great to hear that you are centralizing on Flyte
n
100%! We looked at a few options, however considering that we have a team of scientists-who-code but aren't full developers, this was the best tool to enable them to write remote workflows with minimal load on the engineering team - helping them move faster while not requiring us to do a ton of work to production-ize their code.
k
We would love to call it a platform haha
n
It is a full-fledged platform hahaha don't take my wording to heart
k
I am joking - gotta have a funny bone when in the community
n
Amen to that