# flyte-support
g
Hi! Is there any way to avoid the extra read and write when passing a StructuredDataset to a task if the data already exists on S3? I.e., can I create the StructuredDataset without a task that effectively does a copy?
f
Yes - take a StructuredDataset as input and simply return it as output without calling open()
This should just pass the pointers through, so nothing gets copied
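Roughly like this - a minimal sketch assuming flytekit's StructuredDataset API; the task name is just illustrative:
```python
from flytekit import task
from flytekit.types.structured import StructuredDataset


@task
def passthrough(sd: StructuredDataset) -> StructuredDataset:
    # No .open() call here: Flyte forwards the existing reference,
    # so the underlying data is never read or rewritten.
    return sd
```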
g
Ok nice, I’ll give that a try!
b
I got this to work for Parquet datasets written by Spark, but not for JSON and CSV datasets from Spark, using:
```python
from flytekit import task
from flytekit.types.structured import StructuredDataset
from pyspark.sql import DataFrame


@task
def sd_creator() -> StructuredDataset:
    return StructuredDataset(uri="s3a://my_bucket/path/to/json/")


@task
def sd_worker(sd_json: StructuredDataset):
    sd_json.open(DataFrame).all()
    # ...
```
as it still attempts to read the data as Parquet (even when specifying file_format="json" to StructuredDataset). Do you have a working example for this? I did get it to work with an explicit read in Spark, returning a StructuredDataset without a uri, but that adds overhead.
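For reference, a rough sketch of that explicit-read workaround, assuming flytekitplugins-spark is installed; the task name, spark_conf, and path are placeholders:
```python
from flytekit import current_context, task
from flytekit.types.structured import StructuredDataset
from flytekitplugins.spark import Spark


@task(task_config=Spark(spark_conf={}))
def sd_creator_explicit() -> StructuredDataset:
    # Read the JSON into a Spark DataFrame, then hand it back to Flyte,
    # which serializes it out again - this is the extra read/write overhead.
    spark = current_context().spark_session
    df = spark.read.json("s3a://my_bucket/path/to/json/")
    return StructuredDataset(dataframe=df)
```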