(Chat thread from 12/14/2022 between Rupsha Chaudhuri, Dan Rammer (hamersaw), Niels Bantilan, and Yee. The message bodies from 6:09 PM through 7:26 PM did not survive extraction; only speaker names and timestamps remain. The recoverable discussion follows.)
Rupsha Chaudhuri (12/14/2022, 7:27 PM): i think if the data is easily chunk-able and is slightly CPU intensive, i would opt for the map task approach

Niels Bantilan (12/14/2022, 10:05 PM): agreed! basically you'll want the task that produces the data to output 2 things: (i) the `StructuredDataset` itself with the chunked parquet file and (ii) a list of filenames for each chunk. Then, you'll want to map over a dataclass that contains a reference to the `StructuredDataset` in addition to the filename of the chunk you want to process for a particular map task
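A rough sketch of the layout Niels describes, with the flytekit specifics stripped out so the data flow is visible. The names `ChunkRef`, `produce_chunks`, and `process_chunk` are hypothetical, and the `@task`/`@workflow`/`map_task` decorators that would wrap these functions in real Flyte code are omitted here:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ChunkRef:
    # What each map-task instance receives: a pointer to the whole
    # dataset plus the single chunk file this instance should process.
    dataset_uri: str
    chunk_file: str

def produce_chunks(n_chunks: int) -> Tuple[str, List[str]]:
    # Stands in for the Flyte task that writes the chunked-parquet
    # StructuredDataset. It returns the two outputs from the chat:
    # (i) the dataset location and (ii) the per-chunk filenames.
    dataset_uri = "s3://my-bucket/dataset"  # placeholder URI
    filenames = [f"chunk-{i:05d}.parquet" for i in range(n_chunks)]
    return dataset_uri, filenames

def process_chunk(ref: ChunkRef) -> str:
    # Body of the mapped task: works on exactly one chunk.
    return f"processed {ref.dataset_uri}/{ref.chunk_file}"

# Workflow-level wiring: build one ChunkRef per chunk, then map over the
# list. In Flyte the list comprehension over process_chunk would be a
# map_task call instead of a plain Python loop.
uri, files = produce_chunks(3)
refs = [ChunkRef(dataset_uri=uri, chunk_file=f) for f in files]
results = [process_chunk(r) for r in refs]
```

The point of the dataclass is that a map task takes a single list input, so the dataset reference and the per-instance chunk name have to travel together in one element.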
Rupsha Chaudhuri (12/14/2022, 10:08 PM): With a `StructuredDataset` as input in a map task, is there a way for me to only download one of the chunks onto the map task pod?
Niels Bantilan (12/14/2022, 10:21 PM): You can use `current_context()` to use the `file_access` API to download a specific chunk
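A local stand-in for that idea, assuming the download call takes a remote path and a local destination (in flytekit it would be reached through the current context's `file_access` provider, roughly `ctx.file_access.get_data(remote, local)` — treat that spelling as an assumption, not a verified signature). The sketch fakes the "remote" store with a temp directory so the selective-download behavior is visible:

```python
import os
import shutil
import tempfile

def get_data(remote_path: str, local_path: str) -> None:
    # Stand-in for the file_access download call; here "remote" is just
    # another local directory, so the download is a file copy.
    shutil.copy(remote_path, local_path)

# A fake "remote" StructuredDataset directory holding three parquet chunks.
remote_dir = tempfile.mkdtemp(prefix="dataset-")
for i in range(3):
    with open(os.path.join(remote_dir, f"chunk-{i}.parquet"), "w") as f:
        f.write(f"rows for chunk {i}")

# Inside one map-task instance: fetch ONLY this instance's chunk instead
# of letting the whole StructuredDataset be materialized onto the pod.
pod_dir = tempfile.mkdtemp(prefix="pod-")
my_chunk = "chunk-1.parquet"
get_data(os.path.join(remote_dir, my_chunk), os.path.join(pod_dir, my_chunk))

downloaded = os.listdir(pod_dir)  # only the requested chunk is present
```

Combined with the `ChunkRef` pattern above the fold, each map-task instance would pass its own `chunk_file` to the download call, keeping per-pod storage and transfer proportional to one chunk.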
(The thread's closing messages, 10:22–10:23 PM, were not captured.)