Flyte enables production-grade orchestration for machine learning workflows and data processing, and is built to accelerate the move from local workflows to production.

Flyte

<https://github.com/flyteorg/flyteidl/pull/364|#364 add partition_columns to StructuredDatasetType>
Pull request opened by <https://github.com/cosmicBboy|cosmicBboy>
Signed-off-by: Niels Bantilan <mailto:niels.bantilan@gmail.com|niels.bantilan@gmail.com>

*Add `partition_columns` to `StructuredDatasetType`*

Partially addresses <https://github.com/flyteorg/flyte/issues/3219|flyteorg/flyte#3219>

*TL;DR*

This PR adds a property to the `StructuredDatasetType` protobuf definition so that it can carry metadata about which columns in the dataset (some kind of DataFrame object) are used to partition the dataset into chunks, for example when a `pandas.DataFrame` is serialized as a parquet file.
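
To make "partitioning the dataset into chunks" concrete, here is a minimal, standard-library-only sketch of how a list of partition columns can split rows into chunks keyed by a Hive-style path. The `partition_rows` helper, the sample rows, and the `column=value` path layout are illustrative assumptions, not code from this PR or from flytekit. In pandas itself the closest equivalent is passing `partition_cols` to `DataFrame.to_parquet`.

```python
from collections import defaultdict


def partition_rows(rows, partition_columns):
    """Group rows into chunks keyed by a Hive-style relative path.

    Hypothetical helper for illustration only.
    """
    chunks = defaultdict(list)
    for row in rows:
        # Build a key such as "region=eu/year=2023" from the partition column values.
        key = "/".join(f"{col}={row[col]}" for col in partition_columns)
        # The partition values live in the path, so drop them from the stored row.
        chunks[key].append(
            {k: v for k, v in row.items() if k not in partition_columns}
        )
    return dict(chunks)


# Made-up sample data.
rows = [
    {"region": "eu", "year": 2023, "sales": 10},
    {"region": "eu", "year": 2024, "sales": 12},
    {"region": "us", "year": 2023, "sales": 7},
    {"region": "eu", "year": 2023, "sales": 3},
]
chunks = partition_rows(rows, ["region", "year"])
```

Each key in `chunks` corresponds to one directory a partitioned writer might produce, which is why a reader benefits from knowing the partition columns up front.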

*Type*

☐ Bug Fix
☑︎ Feature
☐ Plugin

*Are all requirements met?*

☑︎ Code completed
☐ Smoke tested
☐ Unit tests added
☐ Code documentation added
☐ Any pending items have an associated Issue

*Complete description*

This change is required to store additional metadata about which columns are used for partitioning. Currently this only meaningfully affects the serialization/deserialization of parquet files, but in the future we could support the partitioning of other serialization formats.
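
As a rough illustration of why this metadata matters on the deserialization side, the sketch below uses a hypothetical `DatasetTypeSketch` dataclass (a stand-in, not the generated flyteidl message) and a hypothetical `partition_values` helper to recover partition column values from a Hive-style file path. The path layout and all names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetTypeSketch:
    """Hypothetical stand-in for the type metadata; not the flyteidl message."""
    format: str = "parquet"
    partition_columns: List[str] = field(default_factory=list)


def partition_values(relative_path: str, dataset_type: DatasetTypeSketch) -> Dict[str, str]:
    """Recover partition values from a path like 'region=eu/year=2023/part-0.parquet'.

    Values come back as strings; casting them is left to the caller.
    """
    values = {}
    for segment in relative_path.split("/"):
        name, sep, value = segment.partition("=")
        # Only keep segments that name a declared partition column.
        if sep and name in dataset_type.partition_columns:
            values[name] = value
    missing = [c for c in dataset_type.partition_columns if c not in values]
    if missing:
        raise ValueError(f"path is missing partition columns: {missing}")
    return values


meta = DatasetTypeSketch(partition_columns=["region", "year"])
restored = partition_values("region=eu/year=2023/part-0.parquet", meta)
```

Without the declared `partition_columns`, a reader would have to guess which path segments are partition keys and which are ordinary directory names.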

*Tracking Issue*

Partly addresses <https://github.com/flyteorg/flyte/issues/3219|flyteorg/flyte#3219>

*Follow-up issue*

NA
<https://github.com/flyteorg/flyteidl|flyteorg/flyteidl>
:white_check_mark: All checks have passed
13/13 successful checks