Hi I have a question regarding data flow in flyte ...
# ask-the-community
s
Hi, I have a question regarding data flow in Flyte - I read the docs on this and I understand that Flyte only passes around references to files, but if I have a large dataset in s3, for example, does that mean each task that has to process/use that data will download the entire dataset into its own container’s memory?
k
Yes, you can either download the entire dataset or read the file as a stream.
s
OK thanks! If the dataset is really large though, wouldn’t this take a long time to download and process the dataset for each task?
k
If you use TensorFlow or PyTorch, they have features that let you train a model on batches of data, which means you can download and process the data at the same time.
For now, Flyte saves all intermediate data (task outputs) to s3. We’re thinking of adding an in-memory object store to Flyte, which would let you write some task outputs to that store. If you’re interested, I could write a one-pager about it and then we can discuss further.
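To make the batch idea above concrete, here is a minimal sketch in plain Python rather than the TensorFlow or PyTorch loaders themselves. The helper name `batched` and the in-memory `io.StringIO` source are assumptions for illustration only; in a real task the lines would come from a remote file handle.

```python
import io
from itertools import islice
from typing import Iterable, Iterator, List


def batched(records: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` records, pulling from the source lazily."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical stand-in for a remote dataset: ten newline-separated records.
stream = io.StringIO("".join(f"sample-{i}\n" for i in range(10)))

# Only one batch is held at a time while iterating; here the batches are
# collected into a list purely so the result can be inspected.
batches = [[line.strip() for line in batch] for batch in batched(stream, 4)]
```

Because the generator only reads what the next batch needs, processing can start before the whole dataset has been read.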
s
That’s interesting, but a more basic question first - if I have image data in s3 and I supply the s3 url as task input, I still need logic to download the dataset in the task code, right (e.g. via the `requests` lib, etc.)? And then if I process it and send it to another task, Flyte will take care of creating the reference for the s3 data and I can just use it in the second task as if it’s already in memory?
k
Flyte uses lazy download. It downloads the file only when you call `file.download()`, `file.open()`, or `__fspath__`. `file.open()` returns a file handle, so you can read the data as a stream.
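As a rough sketch of what reading streaming data from a file handle can look like, the helper below consumes any binary handle in fixed-size chunks, so memory use stays bounded regardless of file size. The function name `digest_stream` and the `io.BytesIO` stand-in are assumptions for illustration; inside a Flyte task the handle would be whatever `file.open()` gives you.

```python
import hashlib
import io
from typing import BinaryIO, Tuple


def digest_stream(handle: BinaryIO, chunk_size: int = 1024 * 1024) -> Tuple[str, int]:
    """Read `handle` chunk by chunk; return (SHA-256 hex digest, total bytes read)."""
    sha = hashlib.sha256()
    total = 0
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        sha.update(chunk)
        total += len(chunk)
    return sha.hexdigest(), total


# Hypothetical stand-in for a remote object; a real task would pass the
# handle obtained from the file reference instead.
fake_remote = io.BytesIO(b"abc" * 1000)
digest, size = digest_stream(fake_remote, chunk_size=256)
```

The same loop works for any per-chunk processing (parsing, decoding, filtering); hashing is used here only because it is easy to check.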
s
Oh ok didn’t know that - thanks!
Oh and is there an example with s3 as a data source?
I can’t seem to find any in the tutorials
k
The example with local files works the same way with s3.
s
OK will check it out thank you!
Oh, one more thing: if I’m not wrong, fsspec can cache downloaded files to local storage. If I specify storage via `@task`, do all Flyte tasks in the same workflow share that storage?
k
No, only the task that specifies the storage gets it.
s
OK, so if I have `preprocess_task` -> `train_task`, and I download the data in `preprocess_task` to do some processing and pass it over to `train_task`, will Flyte basically upload the processed data from the first task to s3 and then download that data in `train_task`?
k
Yes, correct
s
OK I see - so currently there’s no functionality for sharing storage between tasks then
I guess for larger datasets, just need to stream them directly from the source instead of downloading the whole thing in each task
Thanks for explaining this!