# flyte-deployment
n
@Yee we have a large job coming up and want to stress test what our Flyte deployment / Flyte scheduler can manage. We were considering a stress test with an increasing number of (very simple) tasks, starting somewhere around 10k pods and increasing on subsequent runs to roughly 200k pods. All will be spun up in a short amount of time, and we want to see whether it kills our deployment, or what the performance impacts are. Any ideas on how to better stress test, or things to look out for that could help us manage this massive amount of parallelism?
Initial thoughts are, even at 10k, to create them in batches so we don't DDoS the Flyte API. But past that, once they all exist, we want to see how the scheduler manages under this load.
We also want to see what the difference may be when just creating them in a for loop vs. using `map_task`
r
Very interested in the for loops (i.e. in `dynamic`) vs `map_task`. We have some high fan-out `dynamic` workflows right now that will spin up ~1k pods, but nothing close to that scale
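For reference, a minimal flytekit sketch of the two fan-out styles under discussion - the task and workflow names, and the `concurrency` kwarg on `map_task`, are illustrative and depend on your flytekit version:
```python
from typing import List

from flytekit import dynamic, map_task, task, workflow


@task
def noop(x: int) -> int:
    # stand-in for the "very simple" stress-test task
    return x


@dynamic
def fan_out_with_loop(n: int) -> None:
    # each loop iteration compiles into its own node (and pod)
    for i in range(n):
        noop(x=i)


@workflow
def fan_out_with_map(xs: List[int]) -> List[int]:
    # a single map/array node; concurrency caps how many pods run at once
    return map_task(noop, concurrency=100)(x=xs)
```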
y
cc @Dan Rammer (hamersaw) and @Eduardo Apolinario (eapolinario)
we’ve definitely run stress tests in the past, i don’t quite remember the numbers though.
@Ketan (kumare3) may have a better idea.
k
@Nicholas Roberson - so map tasks are more efficient compared to dynamic workflows or other types of pod-based tasks
also propeller will try to protect itself and the kube API aggressively, so you may not get very high throughput, as there is max parallelism (you should tweak it),
there are ways of increasing performance - spec offloading, increasing kube client limits and running sharded propeller
n
@Ketan (kumare3) Interesting, and yeah we suspected that `map_task` would give the ideal performance. We can limit the concurrency there to protect us a bit (no need to spin up that many at once), but that many over 1 hr is still reasonable.
> there are ways of increasing performance - spec offloading, increasing kube client limits and running sharded propeller
I'd be interested to know more about this if you have any docs or examples you can point me to.
d
@Nicholas Roberson very interested in discussing the results of this. There are a number of knobs in Flyte (mostly between propeller and admin) that can be tuned for higher throughput. As suggested, in executions of this scale we typically see the kube apiserver as the bottleneck.
also, per the k8s docs - "No more than 150,000 total pods" - might be something to note.
n
@Dan Rammer (hamersaw) Good to know, so in reality we would hit the limits of the Kubernetes API first before we max out Flyte?
We will probably update the max to 100k then, but I assume we will either bring something down or Flyte will slow us down to protect itself.
n
Also, in reality we can have N pods process M jobs each (not every job has to be its own pod), so I would imagine we would probably not set M = 1 if what I am reading from you all is true.
d
> we would hit the limits of the Kubernetes API first before we max out Flyte
That's what we expect. If you're seeing anything different, it would be great to dive into!
n
Solid, I'll add this to the stress test doc I have internally and share it with the infra team as well as other stakeholders.
d
And then here's the PR for adding kubeclient config to admin; in admin this just covers creating new FlyteWorkflow CRs, so if you're using map tasks this shouldn't be an issue. However, similar configuration exists in flytepropeller (don't have the docs on hand), which will affect creation of Pods, etc. At larger scales this will probably require some tweaking.
k
I think we should just have this in the docs
n
Ok thanks all, I appreciate the feedback. Typically, what would be the best metrics to look at while running this? We have DataDog + kube metrics scraped by Prometheus and ETLed to a reporting DB.
I mean #1 is always "did we crash anything", but past that?
It's likely we are not going to try to run all 100k at the same time (but we will submit them in a short time); we will increase the parallelism to a certain point and then watch how the remaining 90k, let's say, stress the system. Essentially these jobs are going to come in really fast in real life, however they don't need to all be processed immediately and can sit waiting for multiple days if we can't get to them. We are largely trying to see how things behave under this scenario: high throughput, large queue for the scheduler/propeller to manage.
Can someone help me interpret this error:
```
[1/1] currentAttempt done. Last Error: USER::[1/1] currentAttempt done. Last Error: USER::containers with unready status: [nicholas-0vr23dznsximza1mggnq1a-n0-0-dn313-0]|Back-off pulling image "772228263286.dkr.ecr.us-west-2.amazonaws.com/flyte_run_parallel_workflow:b48da3d087bede34e4097373a549e5ffdad4f4e988156d1f4b667246c1863f5a"
```
I ran a dynamic workflow with 1000 tasks and `max_parallelism` set to 100. Very bare bones.
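For context, `max_parallelism` can also be pinned on a launch plan rather than set per execution; a rough flytekit sketch with hypothetical names (the kwarg's availability depends on your flytekit version):
```python
from flytekit import LaunchPlan, task, workflow


@task
def noop(x: int) -> int:
    return x


@workflow
def stress_test_wf(x: int) -> int:
    return noop(x=x)


# caps how many task nodes propeller runs concurrently within a single
# execution launched from this (hypothetical) plan
stress_lp = LaunchPlan.get_or_create(
    workflow=stress_test_wf,
    name="stress_test_lp",
    max_parallelism=100,
)
```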
y
that error message is probably correct. you’re probably actually hitting the rate-limit on ecr. (assuming the image exists.)
n
Hm, yeah seems like it; for individual repositories it looks like the image pull is rate limited (on top of ECR limits for all images). Is there any way to cache the image in a workflow so we don't have to consistently pull it past the first time?
The overall limit on image pulls is really high; spinning up 1k pods should not hit it. Using `map_task` to handle this helps, however if we have 10 workflows with 50 concurrency, let's say, we will still blow past the image pull rate (if the post I sent is accurate).
I was able to reproduce the error in Flyte
```
containers with unready status: [anjmtfxbkkzndb4sm9kj-n0-0-dn810-0]|Back-off pulling image "*************.dkr.ecr.us-west-2.amazonaws.com/flyte_run_parallel_workflow:155b30ac808f5bef6fd005479c0c6670be6584a4323ca9752c68931fe17647fc"
```
and ran this a bunch of times (calling `docker rmi ...` to remove it in between)
```
docker pull *************.dkr.ecr.us-west-2.amazonaws.com/flyte_run_parallel_workflow:155b30ac808f5bef6fd005479c0c6670be6584a4323ca9752c68931fe17647fc
```
which successfully downloaded the image to my local machine. I wonder if this is a limit on the propeller that we are hitting and the pods can't spin up -> increasing our `burst` setting may help here.
r
> Essentially these jobs are going to come in really fast in real life, however they don't need to all be processed immediately and can sit waiting for multiple days if we can't get to them. We are largely trying to see how things behave under this scenario. High throughput, large queue for the scheduler/propeller to manage.
Have you thought about using SQS or Kafka to manage the queuing and then having a scheduled workflow that pops batches of messages off the queue and launches them? It seems like relying on the "workflow controller" (i.e. propeller) as your queue may not be precisely what you want here, and you'll have a greater degree of control over the concurrency of the tasks.
k
This is not propeller - it's getting throttled pulling images
But need to understand more, ideally you should not hit this as it will reuse a downloaded image
n
Yeah the current setup we have now is doing something similar to that, in the name of simplification of some of our tools we wanted to try to move to a Flyte only implementation. That being said, we could use a new redis queue to do this.
1. On job submit, push 10k messages to a queue + spin up a workflow to process that queue.
2. The workflow loops with a delay and creates async tasks to process messages on the queue.
3. Once all messages are processed and the queue is empty, the workflow ends.
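A minimal sketch of that pattern, assuming SQS and a flytekit `@dynamic` body doing the queue read as plain Python (queue URL, batch size, and task names are hypothetical); the "loop with a delay" from step 2 could be an outer driver that relaunches this workflow, or the cron variant discussed below:
```python
from typing import List

import boto3
from flytekit import dynamic, task


@task
def process(job: str) -> None:
    # the actual per-job work; each call becomes its own pod
    ...


def pop_batch(queue_url: str, batch_size: int) -> List[str]:
    # plain helper (not a Flyte task): drain up to batch_size messages from SQS
    sqs = boto3.client("sqs")
    jobs: List[str] = []
    while len(jobs) < batch_size:
        resp = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=1
        )
        msgs = resp.get("Messages", [])
        if not msgs:
            break
        for m in msgs:
            jobs.append(m["Body"])
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=m["ReceiptHandle"])
    return jobs


@dynamic
def process_queue_batch(queue_url: str, batch_size: int) -> None:
    # a @dynamic body runs as ordinary Python at execution time, so we can
    # read the queue here and fan out one task node per message
    for job in pop_batch(queue_url, batch_size):
        process(job=job)
```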
r
Is there a reason you prefer a single workflow execution to handle the entire set of messages? Rather than a workflow itself handling the loop could a cron schedule handle that?
Ie, workflow on a cron schedule pops k messages off the queue and dispatches tasks to process those, and ends when processing that batch is finished
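For what it's worth, a hedged sketch of that cron variant, reusing the `process_queue_batch` workflow from the earlier sketch (schedule string, inputs, and names are illustrative; `CronSchedule` parameter names vary by flytekit version):
```python
from flytekit import CronSchedule, LaunchPlan

# run the batch-drain workflow every 10 minutes; queue URL and batch size
# are placeholders
scheduled_lp = LaunchPlan.get_or_create(
    workflow=process_queue_batch,
    name="process_queue_every_10m",
    schedule=CronSchedule(schedule="*/10 * * * *"),
    fixed_inputs={
        "queue_url": "https://sqs.us-west-2.amazonaws.com/123456789012/jobs",
        "batch_size": 500,
    },
)
```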
n
We are pushing for better traceability at the moment - tracking logs, cost, etc. for large jobs. If we can do it in one workflow, that would be ideal; however, if there are too many complications as a result we can move to a hybrid solution.
Having all work under a single umbrella (workflow) makes things easy to track for devs as well as users wondering about the status of their job since they can just go to one place in the console to track.
r
I see, in general we’ve found that smaller units of work make reasoning about retries, etc easier. But understood on tracking workflow execution state.
n
Yeah I agree with you 100%, we're trying to balance that with this implementation.
Let me do some noodling, but at the moment the only barrier seems to be the rate limit we are hitting in ECR, and if we want to have thousands of pods pulling the same image we will need to figure out how to insulate our setup from hitting that as best we can.
r
That sounds like you need to tune image pull policies & ensure that tasks pack onto the same nodes so you can reuse the image (depending on your cluster autoscaler - i.e. Karpenter, Autopilot, etc. - this is fairly simple)
We hit something similar when our autoscaler was launching a large number of small nodes - by configuring it to prefer larger instances we could pack more tasks onto a single node and avoid the unnecessary pulls.
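A rough sketch of the image-pull side of that, assuming flytekit's `PodTemplate` support (the default primary container name and merge behavior may differ across versions):
```python
from flytekit import PodTemplate, task
from kubernetes.client import V1Container, V1PodSpec

# partial pod spec that flytekit merges into the task's pod; "primary" is
# assumed to be flytekit's default primary container name. IfNotPresent
# lets pods on a node that already holds the image skip ECR entirely.
cached_image_template = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[V1Container(name="primary", image_pull_policy="IfNotPresent")],
    ),
)


@task(pod_template=cached_image_template)
def simple_job(x: int) -> int:
    return x
```
Note that for tags other than `:latest`, kubelet already defaults to IfNotPresent, so packing more tasks per node (fewer, larger instances) is usually the bigger lever against the per-repository pull limit.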
k
So @Nicholas Roberson / @Rahul Mehta, one of the completely undocumented features in Flyte is the inbuilt resource control system that uses Redis
Also you are right Rahul - at Lyft we had massive machines with SSDs, and at scale we did not even spin many of them down
Also happy to jump on a call sometime to talk more about things
@Nicholas Roberson not a sales thing, but Union Cloud has a different architecture - it has queues built in that can store workflows and distribute them
n
That would be really helpful sometime during the week, let me get back and see if infra would want to join and we can chat.
k
Also great to hear that you are centralizing on Flyte
n
100%! We looked at a few options, however considering that we have a team of scientists-who-code but aren't full developers, this was the best tool to enable them to write remote workflows with minimal load on the engineering team - helping them move faster while not requiring us to do a ton of work to production-ize their code.
k
We would love to call it a platform haha
n
It is a full-fledged platform hahaha don't take my wording to heart
k
I am joking - gotta have a funny bone when in the community
n
Amen to that