:wave::skin-tone-4: I have a general question on h...
# ask-the-community
v
👋🏽 I have a general question on how to maximize parallelism within Flyte. Let's say I have workflows with the following structure:
Copy code
# This workflow does some work to fetch three pieces of data: A, B, C
@workflow
def fetch_data() -> Tuple(A, B, C):
   ...

# This workflow processes each fetched dataset
@workflow
def process_data(a: A, b: B, c: C):
   first_processing_task(a)
   second_processing_task(b)
   third_processing_task(c)
   ...

# This workflow puts it all together
@workflow
def main():
   a, b, c = fetch_data()
   return process_data(a=a, b=b, c=c)
Because each processing task runs independently, it's theoretically possible for parts of the
process_data
workflow to run even if other parts cannot. For example, let's say that fetching datasets A and B is very quick (say, 5 sec) but C takes a very long time (say, 10min). In this case, it would be ideal for any downstream work relying exclusively on A or B to be launched even while C is still being fetched. Is there any way to configure Flyte to greedily kick off the tasks for processing A and B in the
process_data
workflow while we're waiting for output C from the
fetch_data
workflow? Or is it an ironclad rule within Flyte that the tasks of a downstream workflow can only be launched once the upstream workflow(s) have all fully succeeded?
d
Flyte has a hard rule that a task can not begin until all of it's input values are available. However, in this scenario could you not restructure the workflow to allow this to satisfy your requirements?
Copy code
@workflow
def fetch_data_a() -> A:
   ...

@workflow
def fetch_data_b() -> B:
   ...

@workflow
def fetch_data_c() -> C:
   ...

@workflow
def process_data_a(a: A):
   first_processing_task(a)
   ...

@workflow
def main():
    a = fetch_data_a()
    a2 = process_data_a(a=a)
    b = fetch_data_b()
    b2 = process_data_b(b=b)
    c = fetch_data_c()
    c2 = process_data_c(c=c)

    return a2, b2, c2
If there are no dependencies between fetching / processing A,B, and C they should probably be separate tasks?
v
Thanks for the quick response! Yes your suggested approach will get us the parallelism we're looking for, and we're considering similar options on how we can update our task / workflow orchestration going forward. In our specific case, the service we're designing stitches Flyte workflows / pipelines defined across multiple disparate teams. It's common for these teams to define workflows that declare all of their required inputs from upstreams and then run their respective processing, and it's often the case that each individual processing step only requires a subset of the declared inputs. So I was just wondering if there was a way to configure / tune Flyte to kick off pieces of a workflow that are safe to run, even if the entire workflow itself is not ready. Your response makes sense though, and I figured that this would be a hard Flyte constraint. Thanks!
j
could most of these functions just be decorated with
@task
?
@workflow
seems like overkill.