Flyte #flyte-support

sparse-pencil-33953

07/22/2023, 5:59 PM

Hi everyone! My team is evaluating Flyte as a candidate for some of our use cases. One of our use cases is exposing some pre-trained models for on-demand inference. If I understand the documentation correctly, Flyte does a complete teardown of the containers. Does anyone have a solution for reusing a pool of "hot" containers for tasks? I suppose we could have a task call out to a pool of preheated inference VMs, but I worry we'd be essentially undermining the point of Flyte's scheduling. All thoughts are welcome, thanks!

elegant-australia-91422

07/22/2023, 8:51 PM

The whole point is that workflow/task executions are ephemeral -- it's possible to keep a fixed pool of nodes warm so that scheduling the pods is fast. What's the motivation behind keeping a container "warm"? If you truly need to, I'd suggest just running them as a Deployment that can scale according to capacity (possibly w/ an HPA) and expose inference via an endpoint (either sending the inputs over the wire or passing a pointer to the inputs in some object store/db).
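To make the two request styles concrete (inputs sent over the wire vs. a pointer to inputs in an object store), here is a minimal sketch of how a task might build the request body. The field names `model_version`, `inputs`, and `input_uri`, and the example bucket path, are hypothetical and not part of Flyte or any particular model server.

```python
import json


def build_inference_request(model_version, inline_inputs=None, input_uri=None):
    """Build a JSON request body for a (hypothetical) inference endpoint.

    Exactly one of `inline_inputs` (data sent over the wire) or `input_uri`
    (a pointer to data in an object store/db) must be provided.
    """
    if (inline_inputs is None) == (input_uri is None):
        raise ValueError("provide exactly one of inline_inputs or input_uri")

    body = {"model_version": model_version}
    if input_uri is not None:
        # Large inputs: the service fetches the data itself from this location.
        body["input_uri"] = input_uri
    else:
        # Small inputs: ship the data directly in the request.
        body["inputs"] = inline_inputs
    return json.dumps(body).encode("utf-8")


# Small payload sent inline.
inline_body = build_inference_request("v3", inline_inputs=[0.1, 0.2, 0.3])

# Large payload passed by reference (the bucket path is made up).
pointer_body = build_inference_request("v3", input_uri="s3://example-bucket/audio/clip.wav")
```

Passing a pointer keeps request sizes small for large inputs such as audio files, at the cost of the service needing read access to the same store.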

👍 1

👀 1

elegant-australia-91422

07/22/2023, 8:53 PM

W/ the appropriate networking setup Flyte tasks can hit those models - we've actually followed this pattern where we have online & batch/offline workloads that require the same version of a trained model

👀 1

sparse-pencil-33953

07/22/2023, 9:20 PM

What's the motivation behind keeping a container "warm"?

- Well, some models can take upwards of a minute just to load into memory. We would like to be able to dynamically schedule DAGs of operations which may include expensive inference tasks. For those expensive tasks it would be nice to reuse "warm" containers.

elegant-australia-91422

07/22/2023, 9:27 PM

Makes sense. Flyte (& most other container orchestration systems like Airflow/Argo/Luigi) is best for batch workloads where all tasks are ephemeral. If the init cost is high, I would probably recommend a service. The only thing that may be hairy is cost attribution if it’s being shared between different pipelines, but you could instrument requests to the model server with the pipeline/task ID to proportionally back out cost for a shared inference service. Also, if you’re doing autoscaling for the inference service, you’ll likely need to include some retry logic with an appropriate backoff, either at the HTTP client or task level (i.e. if a new set of requests triggers a scale-up & the model server takes >1m to initialize, clients need to be resilient to that).
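As a rough illustration of the client-side version of this, the sketch below retries a request with capped exponential backoff and tags each request with pipeline/task IDs for later cost attribution. It uses only the Python standard library. The header names `X-Pipeline-Id` and `X-Task-Id`, the retryable status codes, and the default timings are assumptions, not anything Flyte or a specific model server prescribes.

```python
import json
import random
import time
import urllib.error
import urllib.request

# Assumed set of statuses worth retrying while the service scales up.
RETRYABLE_STATUS = {429, 502, 503, 504}


def backoff_delays(max_attempts, base=1.0, cap=60.0):
    """Return the wait before each retry: base * 2**n seconds, capped at `cap`."""
    return [min(cap, base * (2 ** n)) for n in range(max_attempts - 1)]


def call_model_server(url, payload, pipeline_id, task_id,
                      max_attempts=8, base=1.0, cap=60.0, jitter=1.0,
                      sleep=time.sleep, opener=urllib.request.urlopen):
    """POST `payload` as JSON, retrying transient failures with backoff.

    With the defaults the waits sum to about two minutes, which is meant to
    outlast a model server that needs more than a minute to initialize.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Hypothetical headers the service could log for cost attribution.
            "X-Pipeline-Id": pipeline_id,
            "X-Task-Id": task_id,
        },
        method="POST",
    )
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(max_attempts):
        try:
            with opener(request, timeout=30) as response:
                return json.loads(response.read().decode("utf-8"))
        except urllib.error.HTTPError as err:
            # Client errors (other than 429) will not succeed on retry.
            if err.code not in RETRYABLE_STATUS:
                raise
            last_error = err
        except (urllib.error.URLError, TimeoutError) as err:
            # Connection refused, DNS failure, or timeout: likely still starting.
            last_error = err
        if attempt == max_attempts - 1:
            raise last_error
        # Random jitter keeps many clients from retrying in lockstep.
        sleep(delays[attempt] + random.uniform(0, jitter))
```

The `sleep` and `opener` parameters exist only so the function can be exercised without a real server or real waiting. The same loop could instead live at the task level, using the orchestrator's own retry settings.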

👍 1

sparse-pencil-33953

07/22/2023, 9:33 PM

Thanks, I appreciate these comments. My team is largely arriving at a similar conclusion, but we wanted to reach out and see if there is an established, common solution to this problem.

elegant-australia-91422

07/22/2023, 9:52 PM

Totally! Also FWIW, a few minutes of init latency feels relatively minor in the context of a large batch pipeline, so it might be less complexity to just eat the init cost if it’s dominated by the other steps. Don’t know all the specifics of your use case though.

elegant-australia-91422

07/22/2023, 9:53 PM

The only reason we do it is to avoid annoying ABI (Python pickle) breaks when the inference pipeline is running with a different version of the underlying libs than when the model was trained, and we’re hoping to unwind it soon.

👀 2

freezing-airport-6809

07/22/2023, 10:38 PM

I would recommend looking into the flytekit agent framework as an option to either call out to services or write agents for these models.

👀 1

freezing-airport-6809

07/22/2023, 10:38 PM

Also there are ways of speeding up startup

freezing-airport-6809

07/22/2023, 10:39 PM

We would love to understand the type of use case

freezing-airport-6809

07/22/2023, 10:40 PM

Also, if overall inference latency is not a concern but you have volume, I agree with Rahul’s proposal to simply forget

sparse-pencil-33953

07/24/2023, 4:01 PM

@freezing-airport-6809 It's not clear how many details I should be sharing, but basically there will be user-defined graphs/chains of audio processing functions that will be triggered client side but executed server side. Some operations will be significantly heavier than others. Many of the trained audio models are very large and probably don't make sense in an ephemeral context.
