# ask-the-community
y
Hi, I have a question regarding the concept of `StructuredDataset`.
• It seems that Flyte supports two default ways to pass data between tasks (i.e., between k8s pods): `Parquet` and `Pickle`. However, Parquet only works with `pandas.DataFrame`, which Flyte automatically treats as a default dataset type called `schema`. All other data types fall back to Pickle.
• Flyte introduced a concept called `StructuredDataset` to support custom data types with Parquet. The docs introduce a way to support a numpy array as a `StructuredDataset` by creating classes like `Encoder`, `Decoder`, and `Renderer`.
• Is my understanding correct?
• And what is the real benefit of using Parquet? I know it may have better performance and a more efficient compression rate than Pickle. How about static type checking? Will Flyte do type checks for `schema` and `StructuredDataset` between tasks that are connected within one workflow at compile time (see the sketch below)? Thanks.
s
`StructuredDataset` is a Flyte-native type that lets you specify any 2D data type: a 2D numpy array, a pandas DataFrame, and all kinds of other dataframes. Parquet is the serialization format used to store a pandas DataFrame. To understand this better, consider an example: if task A returns a pandas DataFrame, Flyte converts it into a Parquet file and stores it in blob storage (like S3 or GCS); when task B accepts task A's output as an input, Flyte converts the Parquet data back into a pandas DataFrame. You needn't worry about serialization and deserialization because Flyte handles both automatically.
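A minimal sketch of that flow (the task and column names here are just for illustration); the Parquet round trip through blob storage happens behind the scenes:

```python
import pandas as pd
from flytekit import task, workflow

@task
def task_a() -> pd.DataFrame:
    # Flyte serializes this return value to Parquet and uploads it to blob storage.
    return pd.DataFrame({"name": ["ada", "bob"], "score": [95, 82]})

@task
def task_b(df: pd.DataFrame) -> int:
    # Flyte downloads the Parquet file and deserializes it back into a DataFrame.
    return len(df)

@workflow
def wf() -> int:
    return task_b(df=task_a())
```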
> Will Flyte do type checks ...
What kind of type checks are you referring to?