wave skin tone 4 I have a general question on how to maximi Flyte #flyte-support

:wave::skin-tone-4: I have a general question on h...

lively-sundown-82704

08/21/2023, 3:11 PM

👋🏽 I have a general question on how to maximize parallelism within Flyte. Let's say I have workflows with the following structure:

Copy code

# This workflow does some work to fetch three pieces of data: A, B, C
@workflow
def fetch_data() -> Tuple(A, B, C):
   ...

# This workflow processes each fetched dataset
@workflow
def process_data(a: A, b: B, c: C):
   first_processing_task(a)
   second_processing_task(b)
   third_processing_task(c)
   ...

# This workflow puts it all together
@workflow
def main():
   a, b, c = fetch_data()
   return process_data(a=a, b=b, c=c)

Because each processing task runs independently, it's theoretically possible for parts of the

process_data

workflow to run even if other parts cannot. For example, let's say that fetching datasets A and B is very quick (say, 5 sec) but C takes a very long time (say, 10min). In this case, it would be ideal for any downstream work relying exclusively on A or B to be launched even while C is still being fetched. Is there any way to configure Flyte to greedily kick off the tasks for processing A and B in the

process_data

workflow while we're waiting for output C from the

fetch_data

workflow? Or is it an ironclad rule within Flyte that the tasks of a downstream workflow can only be launched once the upstream workflow(s) have all fully succeeded?

hallowed-mouse-14616

08/21/2023, 3:23 PM

Flyte has a hard rule that a task can not begin until all of it's input values are available. However, in this scenario could you not restructure the workflow to allow this to satisfy your requirements?

Copy code

@workflow
def fetch_data_a() -> A:
   ...

@workflow
def fetch_data_b() -> B:
   ...

@workflow
def fetch_data_c() -> C:
   ...

@workflow
def process_data_a(a: A):
   first_processing_task(a)
   ...

@workflow
def main():
    a = fetch_data_a()
    a2 = process_data_a(a=a)
    b = fetch_data_b()
    b2 = process_data_b(b=b)
    c = fetch_data_c()
    c2 = process_data_c(c=c)

    return a2, b2, c2

If there are no dependencies between fetching / processing A,B, and C they should probably be separate tasks?

☝️ 1

lively-sundown-82704

08/21/2023, 3:32 PM

Thanks for the quick response! Yes your suggested approach will get us the parallelism we're looking for, and we're considering similar options on how we can update our task / workflow orchestration going forward. In our specific case, the service we're designing stitches Flyte workflows / pipelines defined across multiple disparate teams. It's common for these teams to define workflows that declare all of their required inputs from upstreams and then run their respective processing, and it's often the case that each individual processing step only requires a subset of the declared inputs. So I was just wondering if there was a way to configure / tune Flyte to kick off pieces of a workflow that are safe to run, even if the entire workflow itself is not ready. Your response makes sense though, and I figured that this would be a hard Flyte constraint. Thanks!

freezing-boots-56761

08/22/2023, 1:48 AM

could most of these functions just be decorated with

@task

@workflow

seems like overkill.

👍 1

4 Views

Open in Slack

Previous Next