cuddly-jelly-27016
05/14/2025, 12:13 AM
flyteplugins-vaex supports automatic serialization and deserialization of vaex DataFrames between consecutive tasks using Parquet (flyteorg/flytekit#1230).
It would be good to extend this to HDF5 and Arrow for performance and interoperability, particularly when datasets are too large to fit into memory. See https://vaex.readthedocs.io/en/latest/faq.html#What-is-the-optimal-file-format-to-use-with-vaex
### Goal: What should the final outcome look like, ideally?
Register extra handlers, VaexDataFrameToHDF5EncodingHandler and VaexDataFrameToArrowEncodingHandler, so users can use Annotated to override the default format:
@task
def t1(f: vaex.dataframe.DataFrameLocal) -> Annotated[StructuredDataset, HDF5]: ...
@task
def t2(f: vaex.dataframe.DataFrameLocal) -> Annotated[StructuredDataset, Arrow]: ...
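
To make the requested behavior concrete, here is a minimal stdlib-only sketch of how a format marker inside Annotated could select an encoding handler, falling back to Parquet when no marker is given. It is not flytekit's API: the StructuredDataset stand-in, the format constants, the ENCODERS registry, and the helpers resolve_format and pick_encoder are all hypothetical names made up for illustration. The registry stores handler names as plain strings, where the Parquet handler name is an assumption and the other two are the names proposed in this issue.

```python
from typing import Annotated, Dict, get_args, get_origin, get_type_hints

# Hypothetical format markers (assumption: formats are identified by strings).
PARQUET = "parquet"
HDF5 = "hdf5"
ARROW = "arrow"


class StructuredDataset:
    """Stand-in for flytekit's StructuredDataset, for illustration only."""


# Hypothetical registry mapping a format marker to a handler name.
# The HDF5 and Arrow handler names are the ones proposed in this issue.
ENCODERS: Dict[str, str] = {
    PARQUET: "VaexDataFrameToParquetEncodingHandler",
    HDF5: "VaexDataFrameToHDF5EncodingHandler",
    ARROW: "VaexDataFrameToArrowEncodingHandler",
}


def resolve_format(annotation, default: str = PARQUET) -> str:
    """Return the first known format marker in an Annotated type, else the default."""
    if get_origin(annotation) is Annotated:
        # get_args yields (base_type, *metadata); skip the base type.
        for meta in get_args(annotation)[1:]:
            if isinstance(meta, str) and meta in ENCODERS:
                return meta
    return default


def pick_encoder(fn) -> str:
    """Look up the handler name for a function's return annotation."""
    hints = get_type_hints(fn, include_extras=True)  # keep Annotated metadata
    return ENCODERS[resolve_format(hints["return"])]


def t1() -> Annotated[StructuredDataset, HDF5]: ...
def t2() -> Annotated[StructuredDataset, ARROW]: ...
def t3() -> StructuredDataset: ...  # no marker: default format applies


if __name__ == "__main__":
    for fn in (t1, t2, t3):
        print(fn.__name__, "->", pick_encoder(fn))
```

In the real plugin the handlers would be registered with flytekit's structured dataset machinery rather than a local dictionary; the sketch only shows the dispatch idea of keeping Parquet as the default while letting the annotation opt in to HDF5 or Arrow.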
### Describe alternatives you've considered
N/A
### Propose: Link/Inline OR Additional context
See the discussion thread at flyteorg/flytekit#1230 (comment).
### Are you sure this issue hasn't been raised already?
• Yes
### Have you read the Code of Conduct?
• Yes
flyteorg/flyte