# flyte-github
#3334 [Core feature] Iterable FlyteDirectory to support downloading individual files on the fly
Issue created by cosmicBboy

Motivation: Why do you think this is important?
With compute-heavy, acceleration-dependent workloads like deep learning model training, it's desirable to start training as soon as possible so as to avoid wasting time downloading the full dataset. This becomes more important the larger the dataset is. Assuming that the full dataset can fit on disk, it would improve cost efficiency to be able to start training on batches of data as soon as the machine is available.

In the ML training use case, datasets are often organized as files, where each file is a data point. For example, imagine a dataset of images, where a special `labels.txt` file contains the labels of each image `example_*.png`.
dataset /
    labels.txt
    example_abc.png
    example_xyz.png
    ...
Goal: What should the final outcome look like, ideally?
As a Flyte user, I should be able to lazily iterate over a `FlyteDirectory` of such a dataset, so that I don't have to download the entire directory and can instead start training as soon as the first batch of data is available on the running Pod.

Requirements
• Should support iteration over files in the directory in a random order
• Potentially support iteration over batches of files in a random order

Describe alternatives you've considered
Users would have to create their own workaround to:
1. store the filenames for all the examples in a custom Flyte type (probably a `dataclass`)
2. create their own iterable downloader by combining the root FlyteDirectory with the filenames from (1), using the FileAccessProvider to fetch individual files
3. iterate over the files in the user-defined dataloader

Propose: Link/Inline OR Additional context
No response

Are you sure this issue hasn't been raised already?
☑︎ Yes

Have you read the Code of Conduct?
☑︎ Yes

flyteorg/flyte
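To make the workaround and the requirements above concrete, here is a minimal sketch of what such a user-defined lazy downloader might look like. It is not flytekit API: `DatasetManifest`, `iter_batches`, and the `fetch` callable are hypothetical names introduced for illustration. In a Flyte task, `fetch` would presumably wrap the FileAccessProvider mentioned above; the sketch deliberately takes it as a parameter so that it does not assume any particular flytekit signature.

```python
import random
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterator, List, Optional


@dataclass
class DatasetManifest:
    """Hypothetical custom type: the directory's remote root plus example filenames."""
    remote_root: str
    filenames: List[str]


def iter_batches(
    manifest: DatasetManifest,
    fetch: Callable[[str, str], None],  # assumed shape: fetch(remote_path, local_path)
    local_dir: str,
    batch_size: int = 1,
    seed: Optional[int] = None,
) -> Iterator[List[Path]]:
    """Download files lazily in random order, yielding one batch of local paths at a time."""
    order = list(manifest.filenames)
    random.Random(seed).shuffle(order)  # random iteration order, reproducible via seed
    Path(local_dir).mkdir(parents=True, exist_ok=True)
    root = manifest.remote_root.rstrip("/")
    for start in range(0, len(order), batch_size):
        batch = []
        for name in order[start:start + batch_size]:
            local_path = Path(local_dir) / name
            # Only this one file is fetched; the rest of the directory stays remote.
            fetch(f"{root}/{name}", str(local_path))
            batch.append(local_path)
        yield batch
```

Because `iter_batches` is a generator, nothing is downloaded until the first batch is requested, so a training loop could begin consuming data after `batch_size` files have arrived instead of waiting for the whole directory. A first-class iterable `FlyteDirectory`, as requested in this issue, would remove the need for users to maintain the filename list themselves.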