# ask-the-community
Hi everyone! My team is evaluating Flyte as a candidate for some of our use cases. One of them is exposing some pre-trained models for on-demand inference. If I understand the documentation correctly, Flyte does a complete teardown of the containers. Does anyone have a solution for reusing a pool of containers for tasks? I suppose we could have a task call out to a pool of preheated inference VMs, but I worry we'd essentially be undermining the point of Flyte's scheduling. All thoughts are welcome, thanks!
The whole point is that workflow/task executions are ephemeral; it's possible to keep a fixed pool of nodes warm so that scheduling the pods is fast. What's the motivation behind keeping a container "warm"? If you truly need to, I'd suggest just running them as a service that can scale according to capacity (possibly w/ an HPA) and expose inference via an endpoint (either sending the inputs over the wire or passing a pointer to the inputs in some object store/DB).
W/ the appropriate networking setup Flyte tasks can hit those models - we've actually followed this pattern where we have online & batch/offline workloads that require the same version of a trained model
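For illustration, here is a minimal sketch of the two ways of handing inputs to such an endpoint: inline over the wire, or as a pointer into an object store. Everything specific in it is an assumption rather than something from the thread: the URL, the `inputs` / `input_uri` field names, and the JSON request/response shape are hypothetical, and a real model server will define its own contract.

```python
import json
import urllib.request
from typing import Any, Optional

# Hypothetical in-cluster address of the model server; not from the thread.
INFERENCE_URL = "http://model-server.internal:8080/predict"


def build_payload(inputs: Optional[Any] = None, input_uri: Optional[str] = None) -> bytes:
    """Encode either the inputs themselves or a pointer to them, never both."""
    if (inputs is None) == (input_uri is None):
        raise ValueError("pass exactly one of inputs or input_uri")
    if inputs is not None:
        body = {"inputs": inputs}        # small inputs: send them over the wire
    else:
        body = {"input_uri": input_uri}  # large inputs: send only a pointer
    return json.dumps(body).encode("utf-8")


def predict(payload: bytes, url: str = INFERENCE_URL, timeout: float = 30.0) -> Any:
    """POST the payload to the model server and return its decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

From a Flyte task body this would be called like any other Python function, e.g. `predict(build_payload(input_uri="s3://my-bucket/clip.wav"))`, where the bucket path is made up. The pointer variant keeps large inputs such as audio out of the request body, at the cost of the model server needing read access to the store.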
What's the motivation behind keeping a container "warm"?
- Well, some models can take a minute or more just to load into memory. We would like to be able to dynamically schedule DAGs of operations which may include expensive inference tasks. For those expensive tasks it would be nice to reuse
Makes sense. Flyte (& most other container orchestration systems like Airflow/Argo/Luigi) is best for batch workloads where all tasks are ephemeral. If the init cost is high, I’d probably recommend a service. The only thing that may be hairy is cost attribution if it’s shared between different pipelines, but you could instrument requests to the model server with the pipeline/task ID to proportionally back out cost for a shared inference service. Also, if you’re autoscaling the inference service you’ll likely need to include some retry logic with an appropriate backoff at either the HTTP client or task level (i.e. if a new set of requests triggers a scale-up & the model server takes >1m to initialize, clients need to be resilient to that).
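A rough sketch of both ideas, under stated assumptions: `call_with_backoff` is client-side retry with capped exponential backoff and jitter, and `attribute_cost` splits a shared bill in proportion to the number of requests logged per pipeline/task ID. The function names, the default delays, the set of exceptions treated as retryable, and the choice of request count as the cost proxy are all illustrative assumptions, not anything Flyte or the thread prescribes.

```python
import random
import time
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 8,       # placeholder values: tune to the server's
    base_delay: float = 2.0,     # actual cold-start time
    max_delay: float = 60.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient failures with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Nominal delays of 2s, 4s, 8s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter (50-100% of the delay) keeps clients from retrying in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
    raise AssertionError("unreachable")


def attribute_cost(total_cost: float, caller_ids: Iterable[str]) -> Dict[str, float]:
    """Split total_cost across callers in proportion to their request counts.

    caller_ids holds one pipeline/task ID per logged request.
    """
    counts = Counter(caller_ids)
    total_requests = sum(counts.values())
    if total_requests == 0:
        return {}
    return {caller: total_cost * n / total_requests for caller, n in counts.items()}
```

Request count is the simplest proxy for usage; if requests differ a lot in size, weighting by something like processing time per request would likely be fairer.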
Thanks, I appreciate these comments. My team is largely arriving at a similar conclusion, but we wanted to reach out and see if there is an established, common solution to this problem.
Totally! Also FWIW, a few minutes of init latency feels relatively minor in the context of a large batch pipeline, so it might be less complexity to just eat the init cost if it’s dominated by the other steps. Don’t know all the specifics of your use case though.
The only reason we do it is to avoid annoying ABI (Python pickle) breaks when the inference pipeline is running with a different version of the underlying libs than when the model was trained, and we’re hoping to unwind it soon.
I would recommend looking into the flytekit agent framework as an option, either to call out to services or to write agents for these models
Also there are ways of speeding up startup
We would love to understand the type of use case
Also, if overall inference latency is not a concern but you have volume, I agree with Rahul’s proposal to simply forget
@Ketan (kumare3) It's not clear how many details I should be sharing, but basically there will be user-defined graphs/chains of audio processing functions that will be triggered client side but executed server side. Some operations will be significantly heavier than others. Many of the trained audio models are very large and probably don't make sense in an ephemeral context.