Flyte enables production-grade orchestration for machine learning workflows and data processing, built to accelerate the path from local workflows to production.

Flyte

<https://github.com/flyteorg/flyte/issues/3334|#3334 [Core feature] Iterable FlyteDirectory to support downloading individual files on the fly>
Issue created by <https://github.com/cosmicBboy|cosmicBboy>
*Motivation: Why do you think this is important?*

With compute-heavy, acceleration-dependent workloads like deep learning model training, it's desirable to start training as soon as possible rather than waiting for the full dataset to download. This matters more as the dataset grows.

Assuming that the full dataset can fit on disk, it would improve cost efficiency to be able to start training on batches of data as soon as the machine is available. In the ML training use case, datasets are often organized as files, where each file is a data point. For example, imagine a dataset of images, where a special `labels.txt` file contains the label for each image `example_*.png`.

```
dataset/
    labels.txt
    example_abc.png
    example_xyz.png
    ...
```

*Goal: What should the final outcome look like, ideally?*

As a Flyte user, I should be able to lazily iterate over a `FlyteDirectory` holding such a dataset, so that I don't have to download the entire directory and can instead start training as soon as the first batch of data is available on the running Pod.

*Requirements*

• Should support iteration over files in the directory in a random order
• Potentially support iteration of batches of files in a random order
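
To make the two requirements above concrete, here is a minimal sketch of random-order batching over a list of filenames. It is plain Python and not part of flytekit; the function name `shuffled_batches` and its parameters are hypothetical, and it assumes the list of filenames is already known without downloading the files themselves.

```python
import random
from typing import Iterator, List, Optional, Sequence


def shuffled_batches(
    filenames: Sequence[str],
    batch_size: int = 1,
    seed: Optional[int] = None,
) -> Iterator[List[str]]:
    """Yield batches of filenames in a random order (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Copy so the caller's sequence is left untouched, then shuffle the copy.
    order = list(filenames)
    random.Random(seed).shuffle(order)
    # Slice the shuffled list into consecutive batches; the last may be shorter.
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


if __name__ == "__main__":
    names = ["example_abc.png", "example_xyz.png", "example_123.png"]
    for batch in shuffled_batches(names, batch_size=2, seed=0):
        print(batch)
```

With `batch_size=1` this covers the first requirement (single files in random order); larger values cover the second.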

*Describe alternatives you've considered*

Users would have to create their own workaround to:

1. Store the filenames for all the examples in a custom Flyte type (probably a `dataclass`).
2. Create their own iterable downloader by combining the root `FlyteDirectory` with the filenames from (1), using the `FileAccessProvider` to fetch individual files.
3. Iterate over the files in the user-defined dataloader.
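
The workaround steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not flytekit's API: `DatasetIndex`, `iter_files`, and the `fetch` callable are hypothetical names. In a real task, `fetch` would presumably wrap whatever single-file download call the `FileAccessProvider` exposes; here it is injected so the sketch runs against the local filesystem with `shutil.copyfile`.

```python
import os
import shutil
import tempfile
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass
class DatasetIndex:
    """Step 1: the directory root plus the filenames it contains (hypothetical type)."""
    root: str
    filenames: List[str]


def iter_files(
    index: DatasetIndex,
    fetch: Callable[[str, str], object],
    local_dir: str,
) -> Iterator[str]:
    """Steps 2 and 3: download one file at a time and yield its local path.

    `fetch(remote_path, local_path)` is an assumed interface; nothing is
    downloaded until the consumer asks for the next item.
    """
    os.makedirs(local_dir, exist_ok=True)
    for name in index.filenames:
        remote_path = f"{index.root.rstrip('/')}/{name}"
        local_path = os.path.join(local_dir, name)
        fetch(remote_path, local_path)
        yield local_path


if __name__ == "__main__":
    # Stand-in for remote storage: a temporary directory with two small files.
    with tempfile.TemporaryDirectory() as remote, tempfile.TemporaryDirectory() as local:
        for name in ["example_abc.png", "example_xyz.png"]:
            with open(os.path.join(remote, name), "wb") as f:
                f.write(b"fake image bytes")
        index = DatasetIndex(root=remote, filenames=["example_abc.png", "example_xyz.png"])
        for path in iter_files(index, shutil.copyfile, local):
            print("ready:", path)
```

Because `iter_files` is a generator, training code can consume the first file while later ones have not been fetched yet, which is the behavior the issue asks `FlyteDirectory` to provide natively.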

*Propose: Link/Inline OR Additional context*

_No response_

*Are you sure this issue hasn't been raised already?*

☑︎ Yes

*Have you read the Code of Conduct?*

☑︎ Yes
<https://github.com/flyteorg/flyte|flyteorg/flyte>