# flyte-support
c
We have users executing workflows with heavy fanout and it seems that they are confused as to why their executions seem to take so long (it looks like a green succeeded bar on the timeline that took several hours). They appear to be hitting parallelism limits but it is not obvious to them. Just curious if any other folks have these issues or we're interpreting things wrong. I'm guessing the task node is being run, so it's active, but propeller is just skipping it due to parallelism limits.
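A quick back-of-the-envelope model of what those users are seeing: under a parallelism cap, tasks run in waves, so a workflow whose tasks each finish quickly can still take hours end to end. A hedged pure-Python sketch (the cap of 25 below mirrors the commonly cited flyteadmin default max parallelism; verify the actual value for your deployment):

```python
import math

def waves(total_tasks: int, max_parallelism: int) -> int:
    """How many scheduling 'waves' a fanout needs when at most
    max_parallelism task nodes may run at once."""
    return math.ceil(total_tasks / max_parallelism)

def wall_clock_minutes(total_tasks: int, max_parallelism: int,
                       task_minutes: float) -> float:
    """Rough lower bound on wall-clock time: queued tasks simply
    wait for a free slot, even though each one succeeds quickly."""
    return waves(total_tasks, max_parallelism) * task_minutes

# 10,000 fanned-out 2-minute tasks under a cap of 25:
# 400 waves, so roughly 800 minutes of wall clock.
```

This is why the timeline shows a long green "succeeded" bar: every individual task is fast, but most of them spent their lifetime waiting for a slot.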
f
look into Union; Union has much better observability and a 5x faster engine
happy to do a demo
r
i have been disappointed by the scaling limits of flyte thus far - is there any guide on achieving maximal performance, or case studies on actual high-scale production setups? i've been digging through the docs but really cannot find good info about how to make it scale properly. in my previous role, we routinely handled workflows with over 100,000 tasks using Metaflow without any special configuration needed. the contrast with Flyte at the current job has been quite stark - we're struggling to reliably execute even a few hundred tasks.
c
These are the best docs I've found: https://docs.flyte.org/en/latest/deployment/configuration/performance.html We've been able to scale Flyte but the ux (ui) degrades with scale imo
r
what is the scale that you are able to get to?
c
We scale tested 25k concurrent individual tasks (no fanout), IIRC. But you absolutely have to tune the config for queue size and workers
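For reference, a sketch of the knobs usually meant by "queue size and workers". Key names follow the Flyte performance docs; defaults and exact values differ by version, so treat the numbers as illustrative, not a recommendation:

```yaml
propeller:
  workers: 100                  # reconcile workers; raise for many live workflows
  workflow-reeval-duration: 30s # how often a workflow is re-evaluated
  queue:
    type: batch
    batching-interval: 2s       # how often queued items are drained
    batch-size: -1              # -1 means no cap on a single batch
```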
r
i see - maybe i will push our team to go back to metaflow then 🙂
f
@red-farmer-96033 are you using single binary?
100k tasks
And UX is the UI @red-farmer-96033
r
We run it in a single cluster. Are you saying the UI won't work at 100k tasks? That's unfortunate. We have a hard time scaling Flyte to reliably run more than 500 concurrent tasks for a single workflow. Does Lyft still use Flyte? Or any other company using it at significant scale?
f
lots of companies use it at pretty high scale
Please share an example of 500 tasks that does not work. If you are talking about Metaflow with AWS Batch array nodes, then just use the AWS Batch plugin in Flyte
Look at Wayve etc btw -

https://www.youtube.com/watch?v=TNUlsAXH2Qc&list=PL-OJo2SeWc8L0SgBmzxIH0bSYPsMYTBXD&index=1

r
Thank you! - let me have our data scientists follow up here with an example. Metaflow doesn’t have integration with AWS Batch array jobs, we were using Metaflow with Kubernetes.
f
ohh, are you using `parallel.map` in Metaflow?
if so, then i understand
that is not individual pods/containers
r
No - we are using metaflow’s foreach that launches individual pods
f
got it
r
We have to run a lot of simulations, and concurrency is the key bottleneck. At the moment the key concern is whether we can run multiple workflows (many per researcher) where each workflow is a collection of 10k-100k concurrent containers. Our current Flyte setup isn't scaling well; we know Metaflow scales to that limit, but somebody would have to take on the burden of rewriting our existing workflows. The question in my mind is: is it easier to scale Flyte, or easier to rewrite our work in Metaflow?
f
we do have folks with 100k containers, but not directly as `map`; you will have to use the `@eager` model.
This is new, but it essentially gives you the control to launch lots and lots of workflows
that way you can write a `for` loop and submit lots and lots of executions; the state is managed separately, so it's not much of a problem
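That driver-loop idea can be sketched without any Flyte dependency. `StubRemote` below is a hypothetical stand-in for a real launch client (with flytekit you would call something like `FlyteRemote.execute` per iteration); it only illustrates the shape of the pattern, not a real API:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StubRemote:
    """Hypothetical stand-in for a launch client; illustration only."""
    submitted: List[str] = field(default_factory=list)

    def execute(self, workflow_name: str, inputs: dict) -> str:
        # A real client would submit to the backend and return an execution id.
        exec_id = f"{workflow_name}-{len(self.submitted)}"
        self.submitted.append(exec_id)
        return exec_id

def submit_in_chunks(remote: StubRemote, total: int, chunk: int) -> List[str]:
    """Launch one execution per chunk instead of one giant fanout;
    each execution's state is tracked by the backend, not the driver."""
    ids = []
    for start in range(0, total, chunk):
        inputs = {"start": start, "end": min(start + chunk, total)}
        ids.append(remote.execute("fanout_wf", inputs))
    return ids

# 100,000 items in 5,000-item chunks -> 20 separate executions
```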
r
I see. At what scale do you recommend using `@eager`? We wouldn't want our users to worry about details of the tool for an embarrassingly parallel fanout and re-architect their work, but I understand that there might be fundamental limitations that are hard to work around.
f
there are ways of doing it, let me share a pattern with the current setup
I will send it tomorrow
r
Thank you 🙏 while I have you here - does Flyte work well with systems like Kueue or Armada?
f
in fact Jason uses Armada; he has not upstreamed the plugin
Kueue will work, but Kueue may have scaling challenges
c
If you are interested in using Flyte + Armada we can certainly help.
f
@clean-glass-36808 upstream it
Also here is the example i gave him
```python
import typing

import flytekit as fl

@fl.task
def five_x(x: int) -> int:
    return 5 * x

@fl.dynamic
def fanout(x: int) -> typing.List[typing.List[int]]:
    # create chunked lists of up to 5000 elements each, covering 0..x
    outputs = []
    for i in range(0, x, 5000):
        l = list(range(i, min(i + 5000, x)))  # clamp the last chunk so it never overshoots x
        outputs.append(fl.map_task(five_x)(x=l))
    return outputs

@fl.workflow
def fanout_wf(x: int) -> typing.List[typing.List[int]]:
    return fanout(x=x)
```
I don't love the pattern, but it allows you to scale to a large list
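The chunking arithmetic in the example above is easy to sanity-check in plain Python (no flytekit needed); a small sketch assuming the same 5000-element chunk size:

```python
def chunks(total: int, size: int = 5000):
    """Yield the same index lists the dynamic fanout maps over."""
    for start in range(0, total, size):
        yield list(range(start, min(start + size, total)))

parts = list(chunks(100_000))
# 20 chunks of 5000, covering every index from 0 to 99,999 exactly once
```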
i am running a test (albeit on union). Union has a different engine and is more scalable
@red-farmer-96033 I will let you know how it goes 😄
(screenshot attached: Screenshot 2025-02-20 at 10.29.32 PM.png)
I do think `@eager` might in fact be better for this, but eager is newer and today it will create new executions for each
r
Got it. And this is still only limited to 5k - the hard limit you mentioned, correct?
f
No this is 100k
I ran it with
```shell
union run --remote massive_fanout.py fanout_wf --x 100000
```
as you can see it's 100k, each map task for 5k
we could wrap it in
```python
my_map_recipe = fanout
```
r
I see. So 20 sub workflows of 5k mapped tasks each?
Let me know if it succeeds. Would have been great to see if this is indeed possible on Flyte too or not. Is the 5k limit a hard limit if we don’t care about the UI too much?
c
I was wondering if the 5k was for etcd limits cuz I think we hit those.
r
Is there any automatic state offloading to s3 etc. like metaflow to work around this?
f
5k limit is etcd limit
state is not offloaded, workflows can be offloaded
you have to turn that on
dynamic workflows are automatically offloaded
state is not offloaded, for perf and consistency reasons
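For context on why the ceiling lands near 5k: etcd's default request size limit is about 1.5 MiB, and the FlyteWorkflow CRD keeps per-node state inline in a single object, so node count times bytes-per-node has to fit under that limit. The ~300 bytes per node below is an illustrative assumption, not a measured figure; real node entries vary with metadata:

```python
ETCD_DEFAULT_MAX_BYTES = int(1.5 * 1024 * 1024)  # etcd --max-request-bytes default
BYTES_PER_NODE = 300  # illustrative assumption, not a measured value

def max_nodes_per_crd(bytes_per_node: int = BYTES_PER_NODE,
                      limit: int = ETCD_DEFAULT_MAX_BYTES) -> int:
    """Rough ceiling on how many node entries fit in one workflow object."""
    return limit // bytes_per_node

# Under these assumptions the ceiling comes out in the low thousands,
# which is consistent with a per-workflow cap around 5k nodes.
```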
anyways i need to go to sleep
i have run this example, will let you know tomorrow
I see 20k completed
so most likely 100k should work
i am also seeing pretty large concurrency
but ymmv
r
Yes, concurrency is the most important factor for us because that determines goodput and how quickly things finish. I don't doubt that most systems can handle high parallelism; concurrency is where things get interesting 😀
f
We designed for large concurrency of workflows; some of them may sit forever, waiting for a training job to complete
But we have something cooking - hopefully I get enough time to build it
@red-farmer-96033 i just wanted to check in: more than 70k done
r
Would love to see what’s cooking on the concurrency front. For parallelism - what is the expectation for OSS Flyte?
f
You can run as many concurrent pods as you need; you have to increase the kube client limits: https://docs.flyte.org/en/latest/deployment/configuration/performance.html
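Concretely, the kube client limits referred to here live in the propeller config. A sketch only; key names follow that performance page, and the values are illustrative rather than recommended:

```yaml
propeller:
  kube-client-config:
    qps: 100      # steady-state requests per second to the kube-apiserver
    burst: 50     # short bursts allowed above qps
    timeout: 30s  # per-request timeout
```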
@red-farmer-96033 / @clean-glass-36808 my 100k task execution worked fine
i had a small cluster - i was able to run through all of it in under an hour
s
To anyone who finds this thread, the Pela Kith user above is a false identity used by one of the technical leaders of Outerbounds (proprietors of Metaflow), and the comments above are designed to spread fear and doubt about Flyte's capabilities. It's really sad that this has happened (multiple times, and I will comment on the other threads as well). Integrity and honest information sharing are extremely important, especially given the breadth of important applications powered today by Flyte. As always, please bring your honest questions, feedback, and knowledge in this channel so we can all learn and improve!