acoustic-carpenter-78188
05/22/2023, 6:49 PM
Motivation: Why do you think this is important?
There is currently no way to pass library-specific read/write options through StructuredDataset. This can be very limiting in cases where we want to take advantage of specific features from a given library.
Some common use-cases (sketched below with each library's native API):
• Predicate pushdown when reading hive-style partitions
• Writing partitioned tables
• Appending to a BigQuery table instead of overwriting it
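For reference, a minimal sketch of how these features look when calling the libraries directly; pandas with the pyarrow engine plus pandas-gbq are assumed, and the paths and table names are illustrative:
import pandas as pd

# predicate pushdown: only partitions matching the filter are read
df = pd.read_parquet("s3://bucket/events/", filters=[("year", ">=", 2021)])

# write a hive-style partitioned table
df.to_parquet("s3://bucket/events_out/", partition_cols=["year", "month", "day"])

# append to a BigQuery table instead of overwriting it (requires pandas-gbq)
df.to_gbq("my_dataset.events", if_exists="append")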
Goal: What should the final outcome look like, ideally?
I would like to take advantage of StructuredDataset's portability without losing important features of the underlying libraries.
Describe alternatives you've considered
The alternative would be using FlyteDirectory or FlyteFile:
from flytekit import task
from flytekit.types.file import FlyteFile
from flytekit.types.structured import StructuredDataset
import pyspark.sql

@task
def t(sd: StructuredDataset) -> FlyteFile:
    # materialize the incoming dataset as a pyspark DataFrame
    df = sd.open(pyspark.sql.DataFrame).all()
    # write it back out as hive-partitioned parquet under a new remote path
    ff = FlyteFile.new_remote_file()
    df.write.parquet(ff.path, partitionBy=["country"])
    return ff
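This works, but the task now returns an opaque FlyteFile, so downstream tasks lose the dataframe typing and automatic conversion that StructuredDataset provides.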
Propose: Link/Inline OR Additional context
Some examples...
# filter a hive-partitioned parquet dataset with pandas to avoid loading it all into memory
def t_pandas(sd: StructuredDataset):
    df = sd.open(pd.DataFrame).all(filters=[("year", ">=", 2021)])

# write a hive-partitioned table
def t_write_partitioned():
    ...
    return StructuredDataset(dataframe=df, partition_cols=['year', 'month', 'day'])

# append to a table instead of overwriting it
def t_write_bq():
    ...
    return StructuredDataset(df, uri="bq:/", append=True)

# support polars streaming
def t_polars_stream():
    ...
    return StructuredDataset(dataframe=df, streaming=True)

# read with a low-memory setting
def t_polars_low_mem(sd: StructuredDataset):
    df = sd.open(pl.DataFrame).all(low_mem=True)
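One way this could plumb through, as a purely hypothetical sketch: flytekit could forward keyword arguments from open(...).all() to the registered decoder. StructuredDatasetDecoder is flytekit's existing extension point, but the **kwargs forwarding shown here is the proposal, not current behavior:
import pandas as pd
from flytekit.types.structured.structured_dataset import StructuredDatasetDecoder

class PandasParquetDecoder(StructuredDatasetDecoder):
    # hypothetical signature: extra kwargs from all() arrive here
    def decode(self, ctx, flyte_value, current_task_metadata, **kwargs):
        # e.g. kwargs == {"filters": [("year", ">=", 2021)]}, passed
        # straight through to the underlying library
        return pd.read_parquet(flyte_value.uri, **kwargs)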
Are you sure this issue hasn't been raised already?
☑︎ Yes
Have you read the Code of Conduct?
☑︎ Yes