# #3708 [Core feature] StructuredDataset read/write kwargs

Issue created by peridotml

**Motivation: Why do you think this is important?**

It seems like we cannot configure read/write settings for the individual dataframe types when they are passed as a StructuredDataset. This can be very limiting in cases where we want to take advantage of specific features from a given library. Some common use-cases:

• Predicate pushdown when reading hive-style partitions
• Writing partitioned tables
• Appending to a BigQuery table instead of overwriting it

**Goal: What should the final outcome look like, ideally?**

I would like to take advantage of StructuredDataset's portability without losing important features of the underlying libraries.

**Describe alternatives you've considered**

The alternative would be using FlyteDirectory or FlyteFile:
```python
from flytekit import task
from flytekit.types.file import FlyteFile
from flytekit.types.structured import StructuredDataset
import pyspark.sql


@task
def t(sd: StructuredDataset) -> FlyteFile:
    # load the input as a pyspark DataFrame
    df = sd.open(pyspark.sql.DataFrame).all()

    # write it out manually so we can pass partitionBy
    ff = FlyteFile.new_remote_file()
    df.write.parquet(ff.path, partitionBy=["country"])
    return ff
```
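For context on the first two use-cases above: the underlying libraries already expose these settings directly, so a minimal pandas sketch of what we would like to reach through StructuredDataset might look like this (the `s3://bucket/...` paths are placeholders, and reading them assumes pyarrow and an fsspec filesystem like s3fs are installed):

```python
import pandas as pd

# predicate pushdown: with the pyarrow engine, `filters` is forwarded to
# pyarrow, so only matching partitions / row groups are loaded into memory
df = pd.read_parquet("s3://bucket/input/", filters=[("year", ">=", 2021)])

# partitioned write: produces a hive-style directory layout keyed on the columns
df.to_parquet("s3://bucket/output/", partition_cols=["year", "month", "day"])
```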
**Propose: Link/Inline OR Additional context**

Some examples:

```python
# filter hive-partitioned parquet with pandas to avoid loading it all into memory
def t_pandas(sd: StructuredDataset):
    df = sd.open(pd.DataFrame).all(filters=[("year", ">=", 2021)])

# write a hive-partitioned table
def t_pandas_write():
    ...
    return StructuredDataset(dataframe=df, partition_cols=["year", "month", "day"])

# append to a table instead of overwriting it
def t_write_bq():
    ...
    return StructuredDataset(df, uri="bq:/", append=True)

# support polars streaming
def t_polars_stream():
    ...
    return StructuredDataset(dataframe=df, streaming=True)

# read with a low-memory setting
def t_polars_low_mem(sd: StructuredDataset):
    df = sd.open(pl.DataFrame).all(low_mem=True)
```
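Until per-call kwargs are supported, one heavier workaround is flytekit's StructuredDataset extension point: registering a custom decoder with the desired read settings baked in. A rough sketch, assuming a recent flytekit (decoder signatures have changed across versions, and `FilteredParquetToPandasDecoder` is a made-up name):

```python
import pandas as pd
from flytekit import FlyteContext
from flytekit.models import literals
from flytekit.models.literals import StructuredDatasetMetadata
from flytekit.types.structured.structured_dataset import (
    StructuredDatasetDecoder,
    StructuredDatasetTransformerEngine,
)


class FilteredParquetToPandasDecoder(StructuredDatasetDecoder):
    """Decodes parquet StructuredDatasets on s3 into pandas with a fixed filter."""

    def __init__(self):
        super().__init__(pd.DataFrame, "s3", "parquet")

    def decode(
        self,
        ctx: FlyteContext,
        flyte_value: literals.StructuredDataset,
        current_task_metadata: StructuredDatasetMetadata,
    ) -> pd.DataFrame:
        # read straight from the remote uri; the filter is hardcoded here
        # because there is currently no way to pass it per call
        return pd.read_parquet(flyte_value.uri, filters=[("year", ">=", 2021)])


StructuredDatasetTransformerEngine.register(FilteredParquetToPandasDecoder(), override=True)
```

The obvious drawback is that the setting is fixed at registration time for the whole type/format/protocol combination rather than chosen per call, which is exactly what the proposed kwargs would address.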
**Are you sure this issue hasn't been raised already?** ☑︎ Yes

**Have you read the Code of Conduct?** ☑︎ Yes

flyteorg/flyte