# flyte-support
c
We have users executing workflows with heavy fanout and it seems that they are confused as to why their executions seem to take so long (it looks like a green succeeded bar on the timeline that took several hours). They appear to be hitting parallelism limits but it is not obvious to them. Just curious if any other folks have these issues or we're interpreting things wrong. I'm guessing the task node is being run, so it's active, but propeller is just skipping it due to parallelism limits.
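A quick back-of-the-envelope model of what those users are seeing: under a parallelism cap, tasks run in waves, so a workflow whose tasks each finish quickly can still take hours end to end. A hedged pure-Python sketch (the cap of 25 below mirrors the commonly cited flyteadmin default max parallelism; verify the actual value for your deployment):

```python
import math

def waves(total_tasks: int, max_parallelism: int) -> int:
    """How many scheduling 'waves' a fanout needs when at most
    max_parallelism task nodes may run at once."""
    return math.ceil(total_tasks / max_parallelism)

def wall_clock_minutes(total_tasks: int, max_parallelism: int,
                       task_minutes: float) -> float:
    """Rough lower bound on wall-clock time: queued tasks simply
    wait for a free slot, even though each one succeeds quickly."""
    return waves(total_tasks, max_parallelism) * task_minutes

# 10,000 fanned-out 2-minute tasks under a cap of 25:
# 400 waves, so roughly 800 minutes of wall clock.
```

This is why the timeline shows a long green "succeeded" bar: every individual task is fast, but most of them spent their lifetime waiting for a slot.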
f
look into Union; Union has much better observability and a 5x faster engine
happy to do a demo
r
i have been disappointed by the scaling limits of flyte thus far - is there any guide on achieving maximal performance, or case studies on actual high-scale production setups? i've been digging through the docs but really cannot find good info about how to make it scale properly. in my previous role, we routinely handled workflows with over 100,000 tasks using Metaflow without any special configuration needed. the contrast with Flyte at the current job has been quite stark - we're struggling to reliably execute even a few hundred tasks.
c
These are the best docs I've found: https://docs.flyte.org/en/latest/deployment/configuration/performance.html We've been able to scale Flyte but the ux (ui) degrades with scale imo
r
what is the scale that you are able to get to?
c
We scale tested 25k concurrent individual tasks (no fanout), IIRC. But you absolutely have to tune the config for queue size and workers
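For reference, a sketch of the knobs usually meant by "queue size and workers". Key names follow the Flyte performance docs; defaults and exact values differ by version, so treat the numbers as illustrative, not a recommendation:

```yaml
propeller:
  workers: 100                  # reconcile workers; raise for many live workflows
  workflow-reeval-duration: 30s # how often a workflow is re-evaluated
  queue:
    type: batch
    batching-interval: 2s       # how often queued items are drained
    batch-size: -1              # -1 means no cap on a single batch
```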
r
i see - maybe i will push our team to go back to metaflow then 🙂
f
@red-farmer-96033 are you using single binary?
100k tasks
And UX is the UI @red-farmer-96033
r
We run it in a single cluster. Are you saying the UI won't work at 100k tasks? That's unfortunate. We have a hard time scaling Flyte to reliably run more than 500 concurrent tasks for a single workflow. Does Lyft still use Flyte? Or any other company using it at significant scale?
f
lots of companies use it at pretty high scale
Please share an example of 500 tasks that does not work. If you are talking about Metaflow with AWS Batch array nodes, then just use the AWS Batch plugin in Flyte
Look at Wayve etc btw -

https://www.youtube.com/watch?v=TNUlsAXH2Qc&list=PL-OJo2SeWc8L0SgBmzxIH0bSYPsMYTBXD&index=1

r
Thank you! - let me have our data scientists follow up here with an example. Metaflow doesn’t have integration with AWS Batch array jobs, we were using Metaflow with Kubernetes.
f
ohh, are you using `parallel.map` in Metaflow?
if so, then i understand
that is not individual pods/containers
r
No - we are using metaflow’s foreach that launches individual pods
f
got it
r
We have to run a lot of simulations, and concurrency is the key bottleneck. At the moment the key concern is whether we can run multiple workflows (many per researcher) where each workflow is a collection of 10k-100k concurrent containers. Our current Flyte setup isn't scaling well; we know Metaflow scales to that limit, but somebody would have to take on the burden of rewriting our existing workflows. The question in my mind is: is it easier to scale Flyte, or easier to rewrite our work in Metaflow?
f
we do have folks with 100k containers, but not directly as `map`; you will have to use the `@eager` model.
This is new, but it essentially gives you the control to launch lots and lots of workflows
that way you can write a `for` loop and submit lots and lots of executions; the state is managed separately, so it's not much of a problem
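That driver-loop idea can be sketched without any Flyte dependency. `StubRemote` below is a hypothetical stand-in for a real launch client (with flytekit you would call something like `FlyteRemote.execute` per iteration); it only illustrates the shape of the pattern, not a real API:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StubRemote:
    """Hypothetical stand-in for a launch client; illustration only."""
    submitted: List[str] = field(default_factory=list)

    def execute(self, workflow_name: str, inputs: dict) -> str:
        # A real client would submit to the backend and return an execution id.
        exec_id = f"{workflow_name}-{len(self.submitted)}"
        self.submitted.append(exec_id)
        return exec_id

def submit_in_chunks(remote: StubRemote, total: int, chunk: int) -> List[str]:
    """Launch one execution per chunk instead of one giant fanout;
    each execution's state is tracked by the backend, not the driver."""
    ids = []
    for start in range(0, total, chunk):
        inputs = {"start": start, "end": min(start + chunk, total)}
        ids.append(remote.execute("fanout_wf", inputs))
    return ids

# 100,000 items in 5,000-item chunks -> 20 separate executions
```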
r
I see. At what scale do you recommend using `@eager`? We wouldn't want our users to worry about details of the tool for an embarrassingly parallel fanout and re-architect their work, but I understand that there might be fundamental limitations that are hard to work around.
f
there are ways of doing it, let me share a pattern with the current setup
I will send it tomorrow
r
Thank you 🙏 while I have you here - does Flyte work well with systems like Kueue or Armada?
f
in fact Jason uses Armada; he has not upstreamed the plugin
Kueue will work, but Kueue may have scaling challenges
c
If you are interested in using Flyte + Armada we can certainly help.
f
@clean-glass-36808 upstream it
Also here is the example i gave him
```python
import typing

import flytekit as fl

@fl.task
def five_x(x: int) -> int:
    return 5 * x

@fl.dynamic
def fanout(x: int) -> typing.List[typing.List[int]]:
    # create chunked lists of up to 5000 elements each, covering 0..x
    outputs = []
    for i in range(0, x, 5000):
        l = list(range(i, min(i + 5000, x)))  # clamp the last chunk so it never overshoots x
        outputs.append(fl.map_task(five_x)(x=l))
    return outputs

@fl.workflow
def fanout_wf(x: int) -> typing.List[typing.List[int]]:
    return fanout(x=x)
```
I don't love the pattern, but it allows you to scale to a large list
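The chunking arithmetic in the example above is easy to sanity-check in plain Python (no flytekit needed); a small sketch assuming the same 5000-element chunk size:

```python
def chunks(total: int, size: int = 5000):
    """Yield the same index lists the dynamic fanout maps over."""
    for start in range(0, total, size):
        yield list(range(start, min(start + size, total)))

parts = list(chunks(100_000))
# 20 chunks of 5000, covering every index from 0 to 99,999 exactly once
```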
i am running a test (albeit on union). Union has a different engine and is more scalable
@red-farmer-96033 I will let you know how it goes 😄
(screenshot attached: Screenshot 2025-02-20 at 10.29.32 PM.png)
I do think `@eager` might in fact be better for this, but eager is newer and today it will create new executions for each
r
Got it. And this is still only limited to 5k - the hard limit you mentioned, correct?
f
No this is 100k
I ran it with
```shell
union run --remote massive_fanout.py fanout_wf --x 100000
```
as you can see it's 100k, each map task for 5k
we could wrap it in
```python
my_map_recipe = fanout
```
r
I see. So 20 sub workflows of 5k mapped tasks each?
Let me know if it succeeds. Would have been great to see if this is indeed possible on Flyte too or not. Is the 5k limit a hard limit if we don’t care about the UI too much?
c
I was wondering if the 5k was for etcd limits cuz I think we hit those.
r
Is there any automatic state offloading to s3 etc. like metaflow to work around this?
f
5k limit is etcd limit
state is not offloaded, workflows can be offloaded
you have to turn that on
dynamic workflows are automatically offloaded
state is not offloaded, for perf and consistency reasons
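For context on why the ceiling lands near 5k: etcd's default request size limit is about 1.5 MiB, and the FlyteWorkflow CRD keeps per-node state inline in a single object, so node count times bytes-per-node has to fit under that limit. The ~300 bytes per node below is an illustrative assumption, not a measured figure; real node entries vary with metadata:

```python
ETCD_DEFAULT_MAX_BYTES = int(1.5 * 1024 * 1024)  # etcd --max-request-bytes default
BYTES_PER_NODE = 300  # illustrative assumption, not a measured value

def max_nodes_per_crd(bytes_per_node: int = BYTES_PER_NODE,
                      limit: int = ETCD_DEFAULT_MAX_BYTES) -> int:
    """Rough ceiling on how many node entries fit in one workflow object."""
    return limit // bytes_per_node

# Under these assumptions the ceiling comes out in the low thousands,
# which is consistent with a per-workflow cap around 5k nodes.
```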
anyways i need to go to sleep
i have run this example, will let you know tomorrow
I see 20k completed
so most likely 100k should work
i am also seeing pretty large concurrency
but ymmv
r
Yes, concurrency is the most important factor for us because that determines goodput and how quickly things finish. I don't doubt that most systems can handle high parallelism; concurrency is where things get interesting 😀
f
We designed for large concurrency of workflows; some of them may sit forever, waiting for a training job to complete
But we have something cooking - hopefully I get enough time to build it
@red-farmer-96033 i just wanted to check in: more than 70k done
r
Would love to see what’s cooking on the concurrency front. For parallelism - what is the expectation for OSS Flyte?
f
You can run as many concurrent pods as you need; you have to increase the kube client limits: https://docs.flyte.org/en/latest/deployment/configuration/performance.html
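Concretely, the kube client limits referred to here live in the propeller config. A sketch only; key names follow that performance page, and the values are illustrative rather than recommended:

```yaml
propeller:
  kube-client-config:
    qps: 100      # steady-state requests per second to the kube-apiserver
    burst: 50     # short bursts allowed above qps
    timeout: 30s  # per-request timeout
```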
@red-farmer-96033 / @clean-glass-36808 my 100k task execution worked fine
i had a small cluster - i was able to run through all of it in under an hour
s
To anyone who finds this thread, the Pela Kith user above is a false identity used by one of the technical leaders of Outerbounds (proprietors of Metaflow), and the comments above are designed to spread fear and doubt about Flyte's capabilities. It's really sad that this has happened (multiple times, and I will comment on the other threads as well). Integrity and honest information sharing are extremely important, especially given the breadth of important applications powered today by Flyte. As always, please bring your honest questions, feedback, and knowledge in this channel so we can all learn and improve!