Flyte enables production-grade orchestration for machine learning workflows and data processing, and is designed to accelerate the move from local workflows to production.

Flyte

<https://github.com/flyteorg/flyte/issues/3037|#3037 [Core feature] [Flytekit] Add support for HDF5 and Arrow in flyteplugins-vaex >
Issue created by <https://github.com/ryankarlos|ryankarlos>
*Motivation: Why do you think this is important?*

Currently, `flyteplugins-vaex` supports automatic serialization and deserialization of vaex dataframes between consecutive tasks using Parquet: <https://github.com/flyteorg/flytekit/pull/1230|flyteorg/flytekit#1230>

It would be good to extend this to HDF5 and Arrow for performance and interoperability when datasets are too large to fit into memory: <https://vaex.readthedocs.io/en/latest/faq.html#What-is-the-optimal-file-format-to-use-with-vaex|https://vaex.readthedocs.io/en/latest/faq.html#What-is-the-optimal-file-format-to-use-with-vaex>

*Goal: What should the final outcome look like, ideally?*

Register extra handlers, `VaexDataFrameToHDF5EncodingHandler` and `VaexDataFrameToArrowEncodingHandler`, so that users can use `Annotated` to override the default format:

```
@task
def t1(f: vaex.dataframe.DataFrameLocal) -> Annotated[StructuredDataset, HDF5]:
    ...

@task
def t2(f: vaex.dataframe.DataFrameLocal) -> Annotated[StructuredDataset, Arrow]:
    ...
```
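
To illustrate how a format tag in `Annotated` could select a handler, here is a rough, standard-library-only sketch. It is not flytekit's actual API: the `StructuredDataset` class, the `ENCODERS` registry, `register`, `resolve_encoder`, and the string tags are all hypothetical stand-ins, and the parquet handler name is assumed from the naming pattern of the two handlers proposed above.

```python
from typing import Annotated, Callable, Dict, get_args, get_origin, get_type_hints

# Hypothetical format tags: parquet is the current default, HDF5 and Arrow are proposed.
PARQUET = "parquet"
HDF5 = "hdf5"
ARROW = "arrow"


class StructuredDataset:
    """Stand-in for flytekit's StructuredDataset, used only as a type marker here."""


# Toy registry mapping a format tag to a handler name (an assumption, not flytekit's registry).
ENCODERS: Dict[str, str] = {}


def register(fmt: str, handler_name: str) -> None:
    """Record which handler serves a given format tag."""
    ENCODERS[fmt] = handler_name


register(PARQUET, "VaexDataFrameToParquetEncodingHandler")
register(HDF5, "VaexDataFrameToHDF5EncodingHandler")
register(ARROW, "VaexDataFrameToArrowEncodingHandler")


def resolve_encoder(fn: Callable) -> str:
    """Pick a handler from the return annotation, falling back to parquet."""
    ret = get_type_hints(fn, include_extras=True).get("return")
    fmt = PARQUET
    if get_origin(ret) is Annotated:
        # get_args yields (base_type, *metadata); use the first registered tag.
        for meta in get_args(ret)[1:]:
            if isinstance(meta, str) and meta in ENCODERS:
                fmt = meta
                break
    return ENCODERS[fmt]


def t0() -> StructuredDataset: ...
def t1() -> Annotated[StructuredDataset, HDF5]: ...
def t2() -> Annotated[StructuredDataset, ARROW]: ...
```

In this sketch, `resolve_encoder(t1)` would pick the HDF5 handler and `resolve_encoder(t0)` would fall back to parquet, which mirrors the requested behaviour of keeping parquet as the default unless the user annotates otherwise.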

*Describe alternatives you've considered*

N/A

*Propose: Link/Inline OR Additional context*

See the discussion thread here: <https://github.com/flyteorg/flytekit/pull/1230#discussion_r1006645274|flyteorg/flytekit#1230 (comment)>

*Are you sure this issue hasn't been raised already?*

☑︎ Yes

*Have you read the Code of Conduct?*

☑︎ Yes
<https://github.com/flyteorg/flyte|flyteorg/flyte>

<https://github.com/flyteorg/flyte/issues/3037|#3037 [Core feature] [Flytekit] Add support for HDF5 and Arrow in flyteplugins-vaex >
Issue reopened by <https://github.com/eapolinario|eapolinario>
<https://github.com/flyteorg/flyte|flyteorg/flyte>