# ask-the-community
y
Hi, I have a question regarding the concept of `StructuredDataset`.
• It seems that Flyte supports two default ways to pass data between tasks (i.e., between k8s pods): `Parquet` and `Pickle`. However, Parquet only works with `pandas.DataFrame`, which Flyte automatically treats as a default dataset type called `schema`. All other data types fall back to Pickle.
• Flyte introduced a concept called `StructuredDataset` to support custom data types with Parquet. The docs introduce a way to support a numpy array as a `StructuredDataset` by creating classes like `Encoder`, `Decoder`, and `Renderer`.
• Is my understanding correct?
• And what is the real benefit of using Parquet? I know it may have better performance and a more efficient compression rate than Pickle. How about static type checking? Will Flyte do type checks for `schema` and `StructuredDataset` between tasks that are connected within one workflow at compile time (see the sketch below)? Thanks.
s
`StructuredDataset` is a Flyte-native type that lets you specify any 2D data type: a 2D numpy array, a pandas DataFrame, and all kinds of other dataframes. Parquet is the serialization format used to store a pandas DataFrame. To understand this better, consider an example: if task A returns a pandas DataFrame, Flyte converts it into a Parquet file and stores it in blob storage (like S3 or GCS); when task B accepts task A's output as an input, Flyte converts the Parquet data back into a pandas DataFrame. You needn't worry about serialization and deserialization because Flyte handles both automatically.
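A minimal sketch of that flow (the task and column names here are just for illustration); the Parquet round trip through blob storage happens behind the scenes:

```python
import pandas as pd
from flytekit import task, workflow

@task
def task_a() -> pd.DataFrame:
    # Flyte serializes this return value to Parquet and uploads it to blob storage.
    return pd.DataFrame({"name": ["ada", "bob"], "score": [95, 82]})

@task
def task_b(df: pd.DataFrame) -> int:
    # Flyte downloads the Parquet file and deserializes it back into a DataFrame.
    return len(df)

@workflow
def wf() -> int:
    return task_b(df=task_a())
```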
> Will Flyte do type checks ...
What kind of type checks are you referring to?