# ask-the-community
v
Hey there! Let's say I have a team that wants to trigger 100k instances of a workflow to run, but we only want 1000 of them to run at the same time.
1. Would Flyte "enqueue" all 100k of them and let us set some sort of max concurrency parameter? The goal is for them to "fire and forget", so that they don't have to wait when triggering such a large number of jobs.
2. If all of those workflows are extremely similar, is it better to run them as map tasks?
f
Maybe max_parallelism would work for this?
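For context, a minimal sketch of where that knob can be set from flytekit (the task/workflow names here are hypothetical, and this assumes `LaunchPlan.get_or_create` exposes `max_parallelism`):
```python
from flytekit import LaunchPlan, task, workflow

@task
def process(i: int) -> int:
    return i * 2

@workflow
def my_wf(i: int) -> int:
    return process(i=i)

# Cap how many nodes of a single execution FlytePropeller will
# evaluate/run at the same time (hypothetical names).
capped_lp = LaunchPlan.get_or_create(
    workflow=my_wf,
    name="my_wf_capped",
    max_parallelism=1000,
)
```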
v
You can set a different max_parallelism per workflow/execution, so I don’t think that’s it. From the page you linked:
The worst case for FlytePropeller is workflows that have an extremely large fan-out. This is because FlytePropeller implements a greedy traversal algorithm that tries to evaluate all the unblocked nodes within a workflow in every round. A solution for this is to limit the maximum number of nodes that can be evaluated. This can be done by setting max-parallelism for an execution. This can be done in multiple ways.
It seems to be a setting for limiting the number of nodes that can run concurrently within an execution
f
You can launch those 100k as subworkflows, in which case this `max_parallelism` would apply
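A rough sketch of that shape (hypothetical names; assuming the fan-out happens in a dynamic workflow so the child workflows become nodes of one parent execution that `max_parallelism` throttles):
```python
from flytekit import dynamic, task, workflow

@task
def process(i: int) -> int:
    return i * 2

@workflow
def child_wf(i: int) -> int:
    return process(i=i)

@dynamic
def fan_out(n: int):
    # Each iteration adds a subworkflow node; the execution's
    # max_parallelism bounds how many are evaluated at once.
    for i in range(n):
        child_wf(i=i)

@workflow
def parent_wf(n: int):
    fan_out(n=n)
```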
v
That’s probably a bad idea, for the reason I quoted above:
The worst case for FlytePropeller is workflows that have an extremely large fan-out.
I’ve also had issues with Flyte console being unable to display large execution graphs sometimes (https://github.com/flyteorg/flyte/issues/3803)
f
About rendering: I don’t think that having 100k perfectly rendered workflows would be easier to follow; with that many tasks, I would not rely heavily on the console. About performance: it’s true, it will take longer to update the workflows. Let’s wait and see if someone else has a better idea; otherwise this would be the way to go
v
About the queue handling 100,000 executions, I see there’s an option to configure queue parameters in the flyte propeller configmap (or from the helm values.yaml):
queue:
  batch-size: -1
  batching-interval: 2s
  queue:
    base-delay: 5s
    capacity: 1000
    max-delay: 120s
    rate: 100
    type: maxof
  sub-queue:
    capacity: 1000
    rate: 100
    type: bucket
So you should be able to configure a higher capacity for the queue to keep your 100,000 executions queued. But this does not limit the number of executions that can run at the same time
Looks like setting a limit on the number of executions was not supported as of 5 months ago: https://discuss.flyte.org/t/9661845/is-there-a-way-to-configure-the-scheduler-or-flyte-in-some-w In that discussion Ketan also mentioned it’s a low priority, and I could not find a feature request for it on GitHub
If your tasks only run on cloud Kubernetes, then you can limit the number of concurrent tasks by setting a limit on the autoscaling of your node pool (GKE) or node group (EKS). Once this limit is reached, other pods will remain “Pending” until they can be scheduled. You will need to increase the task timeouts by a lot to allow this, and it has to be configured for each node group that is used
v
Thanks a lot for your suggestions!
f
But this would impact other workflows, right? So if another workflow is launched, it will count toward that limit, and we might not scale to 100k. Or am I wrong? And how would autoscaling at the node-pool/node-group level work, given that, depending on resource usage, one node might host n pods?
v
Because these 100,000 task executions are all similar, you can check how many nodes run concurrently in a single execution and set the autoscaling limit to 1000 * that amount. Other tasks will still compete for the node quota, so I think it is a good idea to create a separate Flyte cluster for these 100,000. It’s still a workaround; it would be great if we had a feature to limit executions per project/workflow natively in Flyte, one that would also count non-Kubernetes tasks. By the way, this number of executions comes close to the recommended limits for a single Kubernetes cluster: https://kubernetes.io/docs/setup/best-practices/cluster-large/
• No more than 110 pods per node
• No more than 5,000 nodes
• No more than 150,000 total pods
• No more than 300,000 total containers
So this should probably run on a multi-cluster setup: https://docs.flyte.org/en/latest/deployment/deployment/multicluster.html
v
Thanks, yes we're setting this up as multi-cluster 🙂 Do you think the new ArrayNode map tasks could be a good fit as well? Maybe they would fit within the `max_parallelism` limit (as opposed to regular map tasks)? https://flyte.org/blog/flyte-1-9-arraynode-execution-tags-new-navigation-and-more I was wondering if making a workflow that triggers 1000 map tasks of 1000 tasks each could help, with a max parallelism of 1 (since it seems like map-task nodes count for 1 unit of parallelism).
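Something like this sketch is what I have in mind (hypothetical names; assuming the map task's `concurrency` argument is what bounds in-flight items inside each mapped node):
```python
from typing import List

from flytekit import dynamic, map_task, task, workflow

@task
def process(i: int) -> int:
    return i * 2

@dynamic
def fan_out(chunks: List[List[int]]):
    # One mapped node per chunk of ~1000 inputs. Per the discussion above,
    # each mapped node seems to count as a single unit of max_parallelism,
    # while `concurrency` bounds how many items run at once inside it.
    for chunk in chunks:
        map_task(process, concurrency=1000)(i=chunk)

@workflow
def big_wf(chunks: List[List[int]]):
    fan_out(chunks=chunks)
```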
k
I would write a parent workflow that launches your launch plans
This way it will scale to multiple clusters, and you can then limit the number of executions with max parallelism
We have never created a 100k parallel workflow - it should work but will need a few config tweaks - like workflow offloading
You can simply create a static workflow of 100 nodes, each launching a workflow that launches 1,000 nodes in parallel
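A rough sketch of that layout (hypothetical names; assuming launch plans can be invoked as nodes inside a static workflow, which spawns separate child executions):
```python
from flytekit import LaunchPlan, dynamic, task, workflow

@task
def process(i: int) -> int:
    return i * 2

@dynamic
def child_fan_out(n: int):
    # ~1,000 task nodes inside each child execution.
    for i in range(n):
        process(i=i)

@workflow
def child_wf(n: int):
    child_fan_out(n=n)

# Launch plan for the child; its own max_parallelism caps the child's fan-out.
child_lp = LaunchPlan.get_or_create(
    workflow=child_wf,
    name="child_wf_1k",
    max_parallelism=100,
)

@workflow
def parent_wf():
    # 100 launch-plan nodes x 1,000 child nodes = 100k task runs.
    # Each child runs as its own execution, so it can be routed to a
    # different cluster in a multi-cluster setup.
    for _ in range(100):
        child_lp(n=1000)
```
If I'm reading the suggestion right, the parent execution's max parallelism would then bound how many child executions are kicked off at once.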