acoustic-carpenter-78188
01/09/2023, 8:51 PMstructured_dataset.open(pd.Dataframe).iter()
Goal: What should the final outcome look like, ideally?
The end user should be able to specify the partition column when they output a structured dataset:
@task
def make_df() -> StructuredDataset:
df = pd.DataFrame.from_records([
{
"id": i,
"partition": (i % 10) + 1,
"name": "".join(
random.choices(string.ascii_uppercase + string.digits, k=10)
)
}
for i in range(1000)
])
return StructuredDataset(dataframe=df, partition_cols=["partition"]) # or ["partition1", "partition2"]
And then consume it like so:
@task
def use_df(dataset: StructuredDataset) -> pd.DataFrame:
output = []
for dd in dataset.open(pd.DataFrame).iter():
print(f"This is a partial dataframe")
print(dd.head(3))
output.append(dd)
return pd.concat(output)
Describe alternatives you've considered
The user needs to implement their own encoder/decoder for this use case.
Propose: Link/Inline OR Additional context
There is a working implementation of this here: https://github.com/flyteorg/flyte-demos/blob/main/flyte_demo/workflows/data_iter.py#L101
Are you sure this issue hasn't been raised already?
☑︎ Yes
Have you read the Code of Conduct?
☑︎ Yes
flyteorg/flyte