Question regarding data flow in Flyte | Flyte #flyte-support

sticky-angle-28419

04/21/2023, 3:59 PM

Hi, I have a question regarding data flow in Flyte. I read the docs on this and I understand that Flyte only passes around references to files, but if I have a large dataset in S3, for example, does that mean each task that has to process or use that data will download the entire dataset into its own container's memory?

glamorous-carpet-83516

04/21/2023, 4:10 PM

Yes, you can either download the entire dataset or read the file as a stream.

sticky-angle-28419

04/21/2023, 4:12 PM

OK, thanks! If the dataset is really large though, wouldn't it take a long time to download and process it in each task?

glamorous-carpet-83516

04/21/2023, 5:23 PM

If you use TensorFlow or PyTorch, they have features that let you train a model on batches of data, which means the data can be downloaded and processed at the same time.

glamorous-carpet-83516

04/21/2023, 5:27 PM

For now, Flyte saves all intermediate data (task outputs) to S3. We're thinking of adding an in-memory object store to Flyte, which would let you write some task outputs to that store. If you're interested, I could write a one-pager about it, and then we can discuss further.

sticky-angle-28419

04/21/2023, 5:38 PM

That's interesting, but a more basic question first: if I have image data in S3 and I supply the S3 URL as a task input, I still need logic to download the dataset in the task code, right (e.g. via the `requests` lib)? And then if I process it and send it to another task, Flyte will take care of creating the reference for the S3 data and I can just use it in the second task as if it's already in memory?

glamorous-carpet-83516

04/21/2023, 6:06 PM

Flyte uses lazy download. It downloads the file only when you call `file.download()`, `file.open()`, or `__fspath__`.

glamorous-carpet-83516

04/21/2023, 6:07 PM

`file.open()` returns a file handle, so you can read the data as a stream.
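To illustrate the streaming idea above, here is a minimal sketch that processes a file handle in fixed-size chunks so the whole file never has to sit in memory at once. It uses only the standard library; `iter_chunks` and `streaming_checksum` are hypothetical helper names, and the assumption is that the handle you get back from `file.open()` behaves like an ordinary binary file object.

```python
import hashlib
import io
from typing import BinaryIO, Iterator, Tuple


def iter_chunks(handle: BinaryIO, chunk_size: int = 1024 * 1024) -> Iterator[bytes]:
    """Yield successive chunks from a binary handle until it is exhausted."""
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        yield chunk


def streaming_checksum(handle: BinaryIO, chunk_size: int = 1024 * 1024) -> Tuple[int, str]:
    """Return (total bytes read, SHA-256 hex digest) without loading the whole file."""
    digest = hashlib.sha256()
    total = 0
    for chunk in iter_chunks(handle, chunk_size):
        digest.update(chunk)
        total += len(chunk)
    return total, digest.hexdigest()


if __name__ == "__main__":
    # Stand-in for a remote file: any binary file-like object works here.
    # Inside a Flyte task, the assumption is that you would pass the handle
    # obtained from the file's open() call instead of this BytesIO.
    fake_remote = io.BytesIO(b"abc" * 1000)
    size, sha = streaming_checksum(fake_remote, chunk_size=7)
    print(size, sha)
```

The chunk size is a trade-off: larger chunks mean fewer reads against the object store, smaller chunks keep peak memory lower.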

sticky-angle-28419

04/21/2023, 6:18 PM

Oh ok didn’t know that - thanks!

sticky-angle-28419

04/21/2023, 6:37 PM

Oh and is there an example with s3 as a data source?

sticky-angle-28419

04/21/2023, 6:37 PM

I can’t seem to find any in the tutorials

glamorous-carpet-83516

04/21/2023, 10:03 PM

https://docs.flyte.org/projects/cookbook/en/latest/auto/core/flyte_basics/files.html

glamorous-carpet-83516

04/21/2023, 10:03 PM

The example with a local file works the same way with S3.

sticky-angle-28419

04/21/2023, 10:05 PM

OK will check it out thank you!

sticky-angle-28419

04/22/2023, 1:25 AM

Oh, one more thing: if I'm not wrong, fsspec can cache downloaded files to local storage. If I specify storage via `@task`, do all Flyte tasks in the same workflow share that storage?

glamorous-carpet-83516

04/23/2023, 1:01 AM

No, only the task that specifies the storage can use it.

sticky-angle-28419

04/23/2023, 1:04 AM

OK, so if I have `preprocess_task` and `train_task`, and I download in `preprocess_task` to do some data processing and pass the result over to `train_task`, then will Flyte basically upload the processed data from the first task to S3 and then download that data in `train_task`?

glamorous-carpet-83516

04/23/2023, 1:04 AM

Yes, correct

sticky-angle-28419

04/23/2023, 1:05 AM

OK, I see. So there's currently no functionality for sharing storage between tasks, then.

sticky-angle-28419

04/23/2023, 1:05 AM

I guess for larger datasets, I just need to stream them directly from the source instead of downloading the whole thing in each task.

sticky-angle-28419

04/23/2023, 1:06 AM

Thanks for explaining this!
