# flyte-github
#3219 [Core feature] Default parquet-to-pandas encoder/decoder should support iterable read. Issue created by cosmicBboy

Motivation: Why do you think this is important?
The purpose of this issue is to support the use case where I can load `StructuredDataset`s iteratively, as in:
```python
structured_dataset.open(pd.DataFrame).iter()
```
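For context on the mechanics involved: pyarrow can already read a parquet file in record-batch chunks, which is the kind of partial read an iterable decoder could build on. A minimal sketch of that pattern (the file name and batch size are made up for illustration):

```python
import pandas as pd
import pyarrow.parquet as pq

# Hypothetical local parquet file, used only to illustrate chunked reads.
parquet_file = pq.ParquetFile("example.parquet")

# iter_batches() yields pyarrow RecordBatch objects without materializing
# the whole file; each batch converts to a partial pandas DataFrame.
for batch in parquet_file.iter_batches(batch_size=10_000):
    partial_df: pd.DataFrame = batch.to_pandas()
    print(partial_df.shape)
```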
Goal: What should the final outcome look like, ideally?
The end user should be able to specify the partition column(s) when they output a structured dataset:
```python
import random
import string

import pandas as pd
from flytekit import StructuredDataset, task


@task
def make_df() -> StructuredDataset:
    df = pd.DataFrame.from_records([
        {
            "id": i,
            "partition": (i % 10) + 1,
            "name": "".join(
                random.choices(string.ascii_uppercase + string.digits, k=10)
            ),
        }
        for i in range(1000)
    ])
    return StructuredDataset(dataframe=df, partition_cols=["partition"])  # or ["partition1", "partition2"]
```
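For reference, the proposed `partition_cols` argument mirrors what pandas (with the pyarrow engine) already accepts when writing a partitioned parquet dataset, so the encoder could presumably pass it straight through. A small illustrative example (output path and data are made up):

```python
import pandas as pd

# Tiny stand-in for the frame built in make_df() above.
df = pd.DataFrame({"id": range(6), "partition": [1, 1, 2, 2, 3, 3]})

# Writing with partition_cols produces a Hive-style directory layout:
#   out_dir/partition=1/..., out_dir/partition=2/..., etc.
df.to_parquet("out_dir", engine="pyarrow", partition_cols=["partition"])
```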
And then consume it like so:
```python
@task
def use_df(dataset: StructuredDataset) -> pd.DataFrame:
    output = []
    for dd in dataset.open(pd.DataFrame).iter():
        print("This is a partial dataframe")
        print(dd.head(3))
        output.append(dd)
    return pd.concat(output)
```
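Until the default handler supports this, the behaviour has to come from a custom decoder registered for `pd.DataFrame`. The sketch below is illustrative only: it assumes flytekit's `StructuredDatasetDecoder` plugin interface and reads the stored dataset back in batches with pyarrow, but constructor and `decode` signatures vary across flytekit versions, so refer to the linked demo implementation below rather than copying this verbatim.

```python
import typing

import pandas as pd
import pyarrow.dataset as pads
from flytekit import FlyteContext
from flytekit.models import literals
from flytekit.models.literals import StructuredDatasetMetadata
from flytekit.types.structured.structured_dataset import (
    PARQUET,
    StructuredDatasetDecoder,
    StructuredDatasetTransformerEngine,
)


class IterablePandasParquetDecoder(StructuredDatasetDecoder):
    """Decoder that yields partial DataFrames instead of one large frame."""

    def __init__(self):
        # Handle pandas DataFrames stored as parquet (any protocol).
        super().__init__(pd.DataFrame, None, PARQUET)

    def decode(
        self,
        ctx: FlyteContext,
        flyte_value: literals.StructuredDataset,
        current_task_metadata: StructuredDatasetMetadata,
    ) -> typing.Iterator[pd.DataFrame]:
        # Open the (possibly partitioned) parquet data and stream it
        # batch by batch instead of loading everything at once.
        dataset = pads.dataset(flyte_value.uri, format="parquet")
        for batch in dataset.to_batches():
            yield batch.to_pandas()


# Registering the decoder is what lets dataset.open(pd.DataFrame).iter()
# pick it up; depending on the flytekit version, extra register() arguments
# may be needed to take precedence over the built-in parquet handler.
StructuredDatasetTransformerEngine.register(IterablePandasParquetDecoder())
```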
Describe alternatives you've considered
The user needs to implement their own encoder/decoder for this use case.

Propose: Link/Inline OR Additional context
There is a working implementation of this here: https://github.com/flyteorg/flyte-demos/blob/main/flyte_demo/workflows/data_iter.py#L101

Are you sure this issue hasn't been raised already? ☑︎ Yes
Have you read the Code of Conduct? ☑︎ Yes

flyteorg/flyte