# flyte-support
c
I am evaluating Flyte, but we have some challenging scheduling requirements. Are these supported? We have a hundred to a few thousand tasks that need to update a database. Each task updates multiple partitions of the database, and the list of partitions (each partition has a clear id) is known when the task is added to the task graph. Only one task may update a partition at a time. Each task needs about half an hour. We want to run as many tasks in parallel as possible.
• Can Flyte schedule tasks such that the mutually exclusive tasks do not run at the same time?
• Is it possible to optimize scheduling by scheduling tasks with the most in-common partitions first?
a
Hey @clever-shampoo-31949
• Can Flyte schedule tasks such that the mutually exclusive tasks do not run at the same time?
IIUC you'd need to set concurrency limits at the launchplan or workflow level. This is being spec'd out as we speak (see RFC and feel free to comment there).
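What exists today at the workflow/execution level is `max_parallelism` on a launch plan, which caps how many task nodes of a single execution run at once; limits across executions are what the RFC covers. A rough sketch, assuming flytekit (the workflow and task names here are just placeholders):

```python
from typing import List

from flytekit import LaunchPlan, task, workflow


@task
def update_partitions(partition_ids: List[str]) -> None:
    ...  # write to the listed database partitions


@workflow
def update_all_partitions() -> None:
    update_partitions(partition_ids=["p1", "p2"])
    update_partitions(partition_ids=["p2", "p3"])


# Cap how many nodes of one execution of this workflow run at the same time.
lp = LaunchPlan.get_or_create(
    workflow=update_all_partitions,
    name="update_all_partitions_lp",
    max_parallelism=25,
)
```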
f
@average-finland-92144 @clever-shampoo-31949 I think for making sure only one task of a kind runs at a time you will have to use cache serialization
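Roughly, cache serialization is enabled per task alongside caching, something like this sketch (the task name and `cache_version` value are arbitrary):

```python
from typing import List

from flytekit import task


@task(cache=True, cache_serialize=True, cache_version="1.0")
def update_partitions(partition_ids: List[str]) -> None:
    # With cache_serialize=True, identical concurrent invocations are
    # serialized: one instance runs, the others wait for it to finish
    # and then reuse the cached result.
    ...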
c
Thanks @average-finland-92144 and @freezing-airport-6809. We want to run as many tasks in parallel as compute allows, so setting a concurrency limit is not what we're after. But the cache serialization thing looks like what we need. (The name is strange though, will read the docs a bit more to understand.)
From the cache serializing docs:
Using this mechanism, Flyte ensures that during multiple concurrent executions of a task only a single instance is evaluated and all others wait until completion and reuse the resulting cached outputs.
So unfortunately, this is not what we need. All our tasks need to run, even if they touch the same database partition.
@average-finland-92144 This could work with a concurrency limit per database partition. I will add a comment to the RFC. (There are many partitions though.)
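In the meantime I could probably express the mutual exclusion myself by chaining tasks that share a partition, so only independent tasks run in parallel. A sketch of the idea, assuming a @dynamic workflow and flytekit's `>>` chaining operator (the greedy "run after the previous writer of each partition" rule is just an illustration, not an optimized schedule):

```python
from typing import List

from flytekit import dynamic, task


@task
def update_partitions(task_id: str, partition_ids: List[str]) -> str:
    ...  # the ~30 minute database update
    return task_id


@dynamic
def schedule_updates(tasks: List[List[str]]) -> None:
    # tasks[i] is the list of partition ids that task i touches.
    last_writer = {}  # partition id -> promise of the last task that touched it
    for i, partitions in enumerate(tasks):
        node = update_partitions(task_id=str(i), partition_ids=partitions)
        for p in partitions:
            if p in last_writer:
                last_writer[p] >> node  # run after the previous writer of p
            last_writer[p] = node
```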
f
That is at the workflow level
There is another mechanism in Flyte - but not documented