Hi Flyte community! We're running Flyte on GKE using ...
# flyte-support
w
Hi Flyte community! We're running Flyte on GKE using the flyte-core deployment. I have a workflow that does GPU inference followed by CPU post-processing, and I want to invoke it on the order of 100-1500 times from a single workflow. Currently we're using @dynamic workflows to fan out with max-parallelism=200, but we're seeing a great deal of latency in workflow progress. Looking at https://www.union.ai/docs/flyte/deployment/flyte-configuration/performance/ and https://www.union.ai/docs/flyte/user-guide/core-concepts/workflows/subworkflows-and-sub-launch-plans/, it looks like we can achieve similar concurrency by invoking a sub-launch plan 100-1500 times and increasing the free worker count for FlytePropeller. We have explored map_tasks, but it's a little too restrictive for our use case. Has anyone been in a similar situation and would be willing to share how they approached the fanout issue?
Using sub-launch plans for fanout:
import flytekit as fl


@fl.task
def my_gpu_task() -> None:
    pass

@fl.task
def my_cpu_task() -> None:
    pass

@fl.workflow
def my_workflow() -> None:
    my_gpu_task() >> my_cpu_task()

my_workflow_lp = fl.LaunchPlan.get_or_create(my_workflow)


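@fl.dynamic
def dynamic_lp(num_fanout: int) -> None:
    # my_workflow returns nothing, so there are no outputs to collect;
    # each call just adds one sub-launch-plan node to the dynamic workflow.
    for _ in range(num_fanout):
        my_workflow_lp()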
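
For a rough sense of scale, here is a back-of-envelope sketch of how many sequential "waves" a fanout needs under a parallelism cap. It assumes every invocation takes about the same time and that the cap applies to concurrently running invocations. Both are simplifications, and `fanout_waves` is a hypothetical helper, not part of flytekit.

```python
import math


def fanout_waves(num_fanout: int, max_parallelism: int) -> int:
    """Lower bound on sequential waves needed to run num_fanout invocations
    when at most max_parallelism of them run at once (assumed semantics)."""
    if num_fanout < 0 or max_parallelism <= 0:
        raise ValueError("num_fanout must be >= 0 and max_parallelism > 0")
    return math.ceil(num_fanout / max_parallelism)


# Fanout sizes and cap taken from the numbers quoted above.
for n in (100, 1500):
    print(f"{n} invocations at max-parallelism=200: {fanout_waves(n, 200)} wave(s)")
```

Under these assumptions, total wall-clock time is at least the number of waves times the duration of one invocation, before any scheduling overhead.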
a
Hey Chris, have you had a chance to better identify the source of the latency? Without resorting to map tasks, I think tweaking parameters like the worker count may help, but it's best to identify the bottleneck first. The Grafana propeller dashboard tracks latency and workers.
Also, hovering over the timeline view gives you an indication of the phase where the majority of the time is spent.
There, at least, we could determine whether the bottleneck is in the container bootstrap step, in code execution, etc.
w
Yeah, we have the Grafana propeller dashboard set up. During the workflow execution we see "Round traverse latency per workflow" peak at ~4 min. Based on my reading of the docs/code, does that mean the propeller worker isn't able to poll the state of the workflow tasks fast enough?
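
To put that number in context, here is a rough estimate of what it could cost per invocation. It assumes propeller only notices a node's state change the next time it evaluates the workflow, so each sequential step (GPU task, then CPU task) can wait up to one full round before the next step is scheduled. That is a simplified reading of propeller's behavior, and `worst_case_round_overhead_s` is an illustrative name.

```python
def worst_case_round_overhead_s(sequential_steps: int, round_latency_s: float) -> float:
    """Upper-bound scheduling delay if every sequential step waits one full
    evaluation round before its successor is picked up (assumed model)."""
    if sequential_steps < 0 or round_latency_s < 0:
        raise ValueError("inputs must be non-negative")
    return sequential_steps * round_latency_s


# Two sequential tasks per invocation, ~4 min peak round latency (from the dashboard).
overhead_s = worst_case_round_overhead_s(sequential_steps=2, round_latency_s=4 * 60)
print(f"worst-case added delay per invocation: {overhead_s / 60:.0f} min")
```

If this model is roughly right, the added delay grows with round latency rather than with task runtime, which would fit slow workflow progress even when the tasks themselves are fast.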
The timeline view unfortunately doesn't load (maybe due to high fanout).