Flyte enables production-grade orchestration for machine learning workflows and data processing, and is designed to take local workflows to production faster.

Flyte

Two questions about deploying Flyte:
1. Is it safe to do a rolling restart of Flyte propeller and/or Flyte admin while workflows are running? Are there any bad states the system can get into?
2. What are the suggested Kubernetes CPU and memory requests for admin and propeller on a medium-sized cluster (500 concurrent workflows)?
Thanks!

Yes, it is absolutely safe. You will see some backoffs, but everything should resume correctly, and long-running jobs will not be interrupted.

500 is low. Propeller is memory hungry, and admin also likes slightly higher memory.

Admin at 2 GB and propeller at 4 GB should keep you comfortable for a while, maybe up to double that load.

Thanks for the quick response. What does admin store in memory? I assume propeller has some sort of cache for inputs & outputs.

Any pointers regarding CPU consumption? Are they mainly doing I/O?

Yes, mainly I/O. Both are written in Go, so they are optimized for high I/O.

Got it. So the 2 GB and 4 GB memory suggestions are for running stably at 500 concurrent workflows? Would these numbers roughly double for 1000 concurrent workflows? I assume other things like apiserver rate limits would come into play here as well.

<@U06PDL7UAL9> propeller (Flyte's execution engine) uses goroutines to handle workflow executions. Each goroutine starts with a 2 kB stack (<https://go.dev/doc/go1.4|source>), so 500 concurrent workflows would account for only about 1 MB of goroutine stack memory. As you mentioned, though, there are other sources of overhead, including the Kubernetes API rate limits, potential lag from the Informer cache, and more.
The current performance docs have some guidance: <https://docs.flyte.org/en/latest/deployment/configuration/performance.html#optimizing-round-latency>
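
To make the goroutine arithmetic concrete, here is a rough back-of-the-envelope sketch in Go. It assumes one goroutine per concurrent workflow (the framing used in this thread, not a confirmed description of propeller's internals) and the 2048-byte starting stack from the Go 1.4 release notes. The function name `estimateStackBytes` is made up for illustration. Stacks grow on demand, so treat the result as a lower bound on stack memory only, not as a sizing recommendation.

```go
package main

import "fmt"

// initialStackBytes is the starting goroutine stack size given in the
// Go 1.4 release notes. Stacks grow as needed, so this is a minimum.
const initialStackBytes = 2048

// estimateStackBytes returns a lower bound on goroutine stack memory,
// assuming (hypothetically) one goroutine per concurrent workflow.
func estimateStackBytes(goroutines int) int {
	return goroutines * initialStackBytes
}

func main() {
	for _, n := range []int{500, 1000} {
		b := estimateStackBytes(n)
		fmt.Printf("%d goroutines: %d bytes (~%.2f MiB)\n", n, b, float64(b)/(1<<20))
	}
	// Prints:
	// 500 goroutines: 1024000 bytes (~0.98 MiB)
	// 1000 goroutines: 2048000 bytes (~1.95 MiB)
}
```

Either figure is tiny next to the 2 GB and 4 GB suggestions above, which points to memory use being dominated by things other than goroutine stacks, such as caches and the objects held per workflow.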

I'm exploring how all the pieces fit together and hopefully that page will receive a revamp soon :slightly_smiling_face: