One question about the scalability of Flyte, what’...
# ask-the-community
x
One question about the scalability of Flyte: what’s the recommended max number of concurrent tasks running in a cluster, and how do I set it (or do I need to set it)? I’m trying to run a distributed ML system on Flyte. It’s a single-task workflow, and it needs to run at least 500 workflows or tasks (each task needs 30G RAM, 4 CPU, 1 GPU) in parallel, in addition to other misc workflows. Just wanted to make sure Flyte can handle that scale (I believe it can)?
d
@Xinzhou Liu I wouldn't be concerned, this is very small scale with regard to Flyte. Users are running 10k+ concurrent workflows where each workflow contains many tasks. Scalability is one of the main advantages of Flyte and we're always improving, but we haven't yet seen a scale that is too large!
x
Awesome, that’s great to know! My system currently relies on an SQS queue to load balance across a dedicated set of workers (500), each worker processing an arbitrary number of messages until the queue is empty. I’m thinking of getting rid of the SQS queue and leveraging Flyte’s internal capabilities to schedule jobs. Btw, does Flyte’s scalability also depend on the cluster configuration? @Dan Rammer (hamersaw)
d
Btw, does Flyte’s scalability also depend on the cluster configuration?
Can you say a little more about this? Flyte just relies on a k8s cluster; the number of nodes and the corresponding resource availability will put restrictions on how fast things run. But when scaling, the Flyte core isn't typically the bottleneck; the k8s cluster is, in which case we do support multi-cluster as well (i.e. scheduling tasks on multiple k8s clusters). So there isn't necessarily a specific k8s configuration for scaling; it's more a matter of adding resources.
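To put rough numbers on the resource-availability point for the workload described above (500 concurrent tasks at 4 CPU, 30G RAM, 1 GPU each), here is a back-of-the-envelope sketch. The node shape (32 CPU, 256 GB, 8 GPU) is purely an assumption for illustration, and the calculation ignores capacity reserved for system pods and the other misc workflows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSpec:
    cpu: int
    mem_gb: int
    gpu: int


def total_demand(per_task: ResourceSpec, concurrent: int) -> ResourceSpec:
    """Aggregate resources needed to run `concurrent` tasks at once."""
    return ResourceSpec(
        cpu=per_task.cpu * concurrent,
        mem_gb=per_task.mem_gb * concurrent,
        gpu=per_task.gpu * concurrent,
    )


def nodes_needed(per_task: ResourceSpec, concurrent: int, node: ResourceSpec) -> int:
    """Nodes required when each node is limited by its scarcest resource."""
    fits = [
        have // need
        for need, have in (
            (per_task.cpu, node.cpu),
            (per_task.mem_gb, node.mem_gb),
            (per_task.gpu, node.gpu),
        )
        if need > 0  # skip resources the task does not request
    ]
    per_node = min(fits) if fits else 0
    if per_node == 0:
        raise ValueError("task requests nothing or does not fit on this node shape")
    return -(-concurrent // per_node)  # ceiling division


# Per-task request taken from the question above.
TASK = ResourceSpec(cpu=4, mem_gb=30, gpu=1)
# Hypothetical node shape, chosen only for illustration.
NODE = ResourceSpec(cpu=32, mem_gb=256, gpu=8)

if __name__ == "__main__":
    print(total_demand(TASK, 500))        # 2000 CPU, 15000 GB RAM, 500 GPU
    print(nodes_needed(TASK, 500, NODE))  # 63
```

With that assumed node shape, each node fits 8 such tasks, so the 500 tasks alone need about 63 nodes before counting anything else running in the cluster.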
x
We got a bunch of gRPC errors like this, but we're not sure what they mean:
Debug string UNKNOWN:Error received from peer ipv4:10.79.160.119:443 {created_time:"2023-05-01T20:09:03.49763948+00:00", grpc_status:14, grpc_message:"unavailable"}
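For reading a debug string like this, `grpc_status:14` corresponds to `UNAVAILABLE` in gRPC's standard status codes, meaning the client could not reach or get a response from the peer. A minimal sketch for pulling the fields out of such a line follows; the helper name `parse_grpc_debug` is hypothetical, and the regexes assume the exact layout shown above.

```python
import re

# Standard gRPC status codes (0-16).
GRPC_STATUS_NAMES = {
    0: "OK", 1: "CANCELLED", 2: "UNKNOWN", 3: "INVALID_ARGUMENT",
    4: "DEADLINE_EXCEEDED", 5: "NOT_FOUND", 6: "ALREADY_EXISTS",
    7: "PERMISSION_DENIED", 8: "RESOURCE_EXHAUSTED", 9: "FAILED_PRECONDITION",
    10: "ABORTED", 11: "OUT_OF_RANGE", 12: "UNIMPLEMENTED", 13: "INTERNAL",
    14: "UNAVAILABLE", 15: "DATA_LOSS", 16: "UNAUTHENTICATED",
}


def parse_grpc_debug(line: str) -> dict:
    """Extract peer, numeric status, status name and message from a debug string."""
    peer = re.search(r"peer\s+(\S+)", line)
    status = re.search(r"grpc_status:(\d+)", line)
    message = re.search(r'grpc_message:"([^"]*)"', line)
    code = int(status.group(1)) if status else None
    return {
        "peer": peer.group(1) if peer else None,
        "status": code,
        "status_name": GRPC_STATUS_NAMES.get(code),
        "message": message.group(1) if message else None,
    }


# The debug string from the message above.
DEBUG_LINE = (
    'Debug string UNKNOWN:Error received from peer ipv4:10.79.160.119:443 '
    '{created_time:"2023-05-01T20:09:03.49763948+00:00", grpc_status:14, '
    'grpc_message:"unavailable"}'
)

if __name__ == "__main__":
    print(parse_grpc_debug(DEBUG_LINE))
```

The `peer` field shows which address and port the client was actually talking to, which is usually the first thing to check against the intended endpoint.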
d
Which logs are you seeing this in?
x
[20230501 20:09:04  ERROR]	Registering against flyte.company.xyz
[20230501 20:09:04  ERROR]	Detected Root /app/path/to/__main__, using this to create deployable package...
[20230501 20:09:04  ERROR]	No output path provided, using a temporary directory at /tmp/tmpvtnuhkjl instead
[20230501 20:09:04  ERROR]	Failed with Exception: Reason: SYSTEM:Unknown
[20230501 20:09:04  ERROR]	RPC Failed, with Status: StatusCode.UNAVAILABLE
[20230501 20:09:04  ERROR]		details: unavailable
[20230501 20:09:04  ERROR]		Debug string UNKNOWN:Error received from peer ipv4:10.79.160.119:443 {created_time:"2023-05-01T20:09:03.49763948+00:00", grpc_status:14, grpc_message:"unavailable"}
Registration logs I guess?
d
So this is happening during something like
pyflyte run ...
then?
x
Yeah, it happened during
pyflyte register
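gRPC documents `UNAVAILABLE` as most likely a transient condition that can be corrected by retrying with a backoff. The thread does not establish the actual cause here, so the following is only a generic sketch of that retry pattern, not a fix for this registration failure. `Unavailable` and `retry_with_backoff` are hypothetical names, and the wrapped call stands in for whatever client call is failing. If the error persists across retries, the endpoint or network configuration is the more likely place to look.

```python
import time


class Unavailable(Exception):
    """Hypothetical stand-in for a gRPC UNAVAILABLE (status 14) failure."""


def retry_with_backoff(call, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run `call`, retrying on Unavailable with exponentially growing delays.

    Delays are base_delay, 2*base_delay, 4*base_delay, ...; the last failure
    is re-raised once `attempts` calls have been made.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Unavailable:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    outcomes = iter([Unavailable(), Unavailable(), "registered"])

    def flaky_call():
        # Fails twice, then succeeds, to exercise the retry path.
        result = next(outcomes)
        if isinstance(result, Exception):
            raise result
        return result

    print(retry_with_backoff(flaky_call, base_delay=0.1))
```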