Flyte enables production-grade orchestration for machine learning workflows and data processing, and is designed to take local workflows to production faster.

Flyte

Hello everyone, I'm trying to wrap my head around how to approach a greenfield deployment of Flyte in an on-prem environment. Everything should stay on-prem.
We have 50TB of training data and multiple training servers with 100TB of local storage each. We also have NetApp network storage with 500TB.
The links between them are at least 100G.

Any ideas on how to approach such a setup in the most efficient way and avoid bottlenecks caused by data transfer? Data streaming would probably be the obvious answer, but I'm curious how you handle use cases where so much data is involved.
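
For a rough sense of scale, here is a back-of-the-envelope sketch of how long one full copy of the dataset would take over the link. It assumes "100G" means 100 Gbit/s and uses decimal terabytes; the 70% efficiency factor is an illustrative assumption, not a measured value.

```python
def transfer_time_seconds(data_bytes: float, link_bits_per_s: float,
                          efficiency: float = 1.0) -> float:
    """Seconds needed to move data_bytes over a link at a given efficiency."""
    return data_bytes * 8 / (link_bits_per_s * efficiency)


TB = 10**12    # decimal terabyte, in bytes (assumption)
GBIT = 10**9   # gigabit, in bits

dataset_bytes = 50 * TB   # 50TB of training data, from the thread
link_bps = 100 * GBIT     # "100G" read as 100 Gbit/s (assumption)

ideal = transfer_time_seconds(dataset_bytes, link_bps)
# 0.7 is an assumed derating for protocol and storage overhead
derated = transfer_time_seconds(dataset_bytes, link_bps, efficiency=0.7)

print(f"line rate: {ideal / 60:.1f} min, at 70% efficiency: {derated / 60:.1f} min")
```

Under these assumptions a single full pass over the data is on the order of an hour or two, and the dataset fits within one server's 100TB of local storage, so the real question is likely whether the storage backend can sustain that rate rather than the link itself.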

hey <@U08LJBR264R> Is this for training?
Data transfer optimization depends on the use case

<@UNZB4NW3S> Yes it will be mostly focused on model training. In the future we might need distributed training (across the nodes), so this is also a topic to consider.

for model training, streaming is definitely the way to go! we recently added <https://www.union.ai/docs/byoc/tutorials/language-models/data-streaming/|this tutorial> to our docs that walks through how to do streaming + multi-node training. this works on union today. haven’t tested it on flyte yet, but i believe it should work there too.

also, on the netapp side, do you mind if i ask whether you’re using multi-protocol storage systems? i mean object, block, or file?

<@U08LJBR264R> I think the approach should also involve multiple tactics:

1. Leverage caching as much as possible
2. Consider streaming (<https://www.union.ai/docs/flyte/user-guide/data-input-output/flyte-file-and-flyte-directory/#streaming|example with FlyteFile>)
3. Profile your I/O reads vs. writes to leverage acceleration features in your storage array
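
As a starting point for the profiling step, here is a minimal sketch that times a sequential write and read at a given path using only the Python standard library. The function name `profile_io` and its parameters are hypothetical. The read pass will likely be served from the OS page cache right after the write, so treat the read figure as an upper bound; a dedicated tool is a better fit for serious benchmarking.

```python
import os
import tempfile
import time


def profile_io(path: str, size_mb: int = 64, block_kb: int = 1024) -> dict:
    """Write then read roughly size_mb of data at path; report MB/s for each."""
    block = os.urandom(block_kb * 1024)
    n_blocks = size_mb * 1024 // block_kb

    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n_blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # push the data to the device before stopping the clock
    write_s = max(time.perf_counter() - start, 1e-9)

    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        # Sequential read in the same block size used for writing
        while chunk := f.read(block_kb * 1024):
            total += len(chunk)
    read_s = max(time.perf_counter() - start, 1e-9)

    return {
        "bytes": total,
        "write_mb_s": size_mb / write_s,
        "read_mb_s": size_mb / read_s,
    }


# Point the directory at the mount you want to test (local disk, NFS export, ...)
with tempfile.TemporaryDirectory() as tmp:
    result = profile_io(os.path.join(tmp, "probe.bin"), size_mb=8)
print(result)
```

Running it once against local storage and once against the network mount gives a first comparison of the two paths.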

<@U01J90KBSU9> The example you provided will work only for text-based data, right? What if my data is video or images?
The NetApp side is not configured yet, but it will probably be file storage. Any recommendations on that part?

<@U08LJBR264R> the example uses text data, but you can extend it to support images or videos as well: <https://docs.mosaicml.com/projects/streaming/en/stable/preparing_datasets/dataset_format.html>. i’ve used mosaic streaming to stream the data, but you can use any streaming library.
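
To illustrate why the data type matters little to a streaming loader, here is a minimal sketch of a shard format that stores each sample as opaque bytes with a length prefix. This is a hypothetical format for illustration only, not the MosaicML Streaming format; `write_shard` and `read_shard` are made-up names. Encoded images or video chunks would go through it the same way as text.

```python
import io
import struct
from typing import BinaryIO, Iterable, Iterator

_HEADER = struct.Struct("<Q")  # 8-byte little-endian length prefix per sample


def write_shard(out: BinaryIO, samples: Iterable[bytes]) -> int:
    """Append length-prefixed samples to out; return how many were written."""
    count = 0
    for sample in samples:
        out.write(_HEADER.pack(len(sample)))
        out.write(sample)
        count += 1
    return count


def read_shard(src: BinaryIO) -> Iterator[bytes]:
    """Yield samples one at a time without loading the whole shard."""
    while True:
        header = src.read(_HEADER.size)
        if not header:
            return  # clean end of shard
        if len(header) < _HEADER.size:
            raise ValueError("truncated shard header")
        (length,) = _HEADER.unpack(header)
        payload = src.read(length)
        if len(payload) != length:
            raise ValueError("truncated sample")
        yield payload


# Stand-ins for encoded image bytes (PNG and JPEG magic numbers plus filler)
fake_images = [b"\x89PNG" + bytes(range(10)), b"\xff\xd8\xff" + bytes(5)]
buf = io.BytesIO()
written = write_shard(buf, fake_images)
buf.seek(0)
decoded = list(read_shard(buf))
print(written, [len(d) for d in decoded])
```

Decoding the bytes into tensors would then happen in the data loader, after the sample has been streamed.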

> net app side is not configured yet, but probably as file storage. Any recommendations on that part?
no recommendations from my end, just wanted to check. file storage should work fine, though object storage would be supported out of the box. if you think file is the most performant for your use case, then yes, it makes sense to use it.

<@U08LJBR264R> moving our convo here

I wonder if something like Longhorn would be needed at all, considering it's block-only and you already have a NetApp box.
I'd go with NetApp for object storage (S3-compatible, which is what Union/Flyte uses by default) and local storage for block-based PVs in case you have tasks that require persistent storage. Where is your 50TB of training data stored currently?

<@U08LJBR264R> so usually when you train, you want to saturate your GPU. If there is no preprocessing, streaming batches and caching them on local SSD should usually suffice, as long as you can feed them to the GPU without needing much CPU.
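
The stream-and-cache pattern can be sketched with the standard library alone: a background thread copies shards from the remote mount into a local cache directory ahead of the consumer, and later epochs reuse the cached copies. All names here (`fetch_to_cache`, `stream_shards`, the `prefetch` parameter) are hypothetical, and a plain file copy stands in for whatever transport the storage backend actually uses.

```python
import queue
import shutil
import tempfile
import threading
from pathlib import Path
from typing import Iterator, List


def fetch_to_cache(remote: Path, cache_dir: Path) -> Path:
    """Copy a shard into the local cache unless it is already there."""
    local = cache_dir / remote.name
    if not local.exists():
        partial = local.with_name(local.name + ".part")
        shutil.copyfile(remote, partial)
        partial.replace(local)  # rename so readers never see a half-written file
    return local


def stream_shards(remotes: List[Path], cache_dir: Path,
                  prefetch: int = 2) -> Iterator[bytes]:
    """Yield shard contents in order while a background thread fetches ahead."""
    ready: queue.Queue = queue.Queue(maxsize=prefetch)

    def producer() -> None:
        try:
            for remote in remotes:
                ready.put(fetch_to_cache(remote, cache_dir))
        finally:
            ready.put(None)  # sentinel: always tell the consumer we are done

    threading.Thread(target=producer, daemon=True).start()
    while (local := ready.get()) is not None:
        yield local.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    remote_dir, cache_dir = Path(tmp, "remote"), Path(tmp, "cache")
    remote_dir.mkdir()
    cache_dir.mkdir()
    shards = []
    for i in range(3):
        shard = remote_dir / f"shard-{i}.bin"
        shard.write_bytes(bytes([i]) * 4)
        shards.append(shard)
    first_epoch = list(stream_shards(shards, cache_dir))
    second_epoch = list(stream_shards(shards, cache_dir))  # reuses cached copies
    cached_names = sorted(p.name for p in cache_dir.iterdir())
print(cached_names)
```

The bounded queue keeps the fetcher only a couple of shards ahead, so the cache fills at the pace training consumes it. A real setup would also need cache eviction and error reporting, which this sketch leaves out.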

A streaming data loader should work for all formats.

We can walk through the use case in more depth over a call