• Rahul Mehta

    3 months ago
    Hey folks, currently investigating Flyte as an alternative to our current setup which uses Kubeflow/Argo -- we have several pipelines with a very high degree of fan-out (on the order of ~700-1000 individual tasks). Does anyone have data points/benchmarks on how Flyte scales wrt the size of the workflow? Many of the features in the SDK (especially how flytekit works w/ native python types) look quite promising and could eliminate a lot of workarounds we've implemented to use Kubeflow.
  • Dan Rammer (hamersaw)

    3 months ago
    Hey @Rahul Mehta, this is a very good question. I can say a few things. Currently Flyte stores workflows as k8s CRDs, so the 1.5MB limit on etcd means that workflow size is bounded. Generally, this means a single workflow can fanout to 10k+, but this is not a hard number as it is contingent on task ID lengths, etc.
  • That being said, through the use of launch plans you can break a workflow into multiple CRDs in the backend. So by layering launch plans, a single workflow is essentially unbounded in size.
  • Rahul Mehta

    3 months ago
    Ah fantastic -- to confirm, does this mean a subworkflow will use a separate CRD? Also, Argo supports offloading large workflows to an external DB to get around the 1.5MB limit for K8s resources; does Flyte support something similar?
  • Dan Rammer (hamersaw)

    3 months ago
    If you launch a subworkflow directly, it will not. If you launch it via a launch plan, it will. This is terminology we are actively trying to clarify, since the impact of the distinction is not obvious.
  • Rahul Mehta

    3 months ago
    Gotcha, good to know. I'm planning on running some stress tests to compare vanilla Argo, Kubeflow (on Argo), Metaflow (on Argo) and Flyte soon so will keep that in mind when doing so.
  • Dan Rammer (hamersaw)

    3 months ago
    Currently we are exploring options to offload to an external DB (actually hoping to transition completely away from etcd because of the aforementioned limitations). Starting with large workflows only may be a good stepping stone.
  • Rahul Mehta

    3 months ago
  • Dan Rammer (hamersaw)

    3 months ago
    Oh absolutely, please reach out with any questions! And we would love to see the stress-test results. We're seeing a lot of interest in these large-scale use cases, so there is plenty of interesting work to do to support them.
  • Ketan (kumare3)

    3 months ago
    also @Rahul Mehta & @Dan Rammer (hamersaw) we do support the concept of map tasks, which are built specifically for fan-outs - Examples here
  • there are cases in which we have one workflow with multiple map tasks, each of which can have thousands (maybe around 5k) of tasks
  • in the backend they are optimized for low-overhead storage, so this allows you to scale quite a bit
  • Rahul Mehta

    3 months ago
    That sounds like exactly what we're looking for (fwiw, our use-case is backtesting models w/ quarterly data, which requires running a full model training pipeline for each period's worth of data)
  • Ketan (kumare3)

    3 months ago
    ya we used to do that all the time
  • Rahul let me know if you want to have a quick chat
  • Rahul Mehta

    3 months ago
    Happy to, will DM you.
  • Ketan (kumare3)

    3 months ago
    also, you will really like Caching (memoization)
  • and recover mode