cuddly-jelly-27016
05/14/2025, 12:12 AM
TypeTransformers in flytekit today effectively result in type erasure at runtime. Higher-level types are converted to the underlying Flyte types, and on retrieval the information about the source type is lost. This works in theory because the receiving SDK has the right types defined. It also makes it easy to cast a value into any of its derivative types. This technique has been deployed successfully in Java and on the JVM.
Examples of type derivatives
Convert from Spark DataFrame -> Flyte.Schema -> pandas DataFrame.
However, it is desirable to keep the source type available so that we can recover it even when the caller does not explicitly request it.
Example:
remote.get().outputs.x -> can be cast correctly if the source type is available
Moreover, one problem with type erasure is the loss of static type checking across languages and across different tasks.
To overcome this problem, this issue proposes introducing a new type called LogicalType, which keeps information about both the source type and the associated transport type.
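A minimal sketch of the difference, in plain Python rather than flytekit. `StoredValue`, `transport_type`, `source_urn` and `recover_type` are hypothetical names used only for illustration; they are not part of any Flyte API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StoredValue:
    """Hypothetical stand-in for a stored literal plus its type metadata."""
    transport_type: str               # what survives today, e.g. "SCHEMA"
    source_urn: Optional[str] = None  # what the proposal would additionally keep


def recover_type(value: StoredValue) -> str:
    """Return the most specific type name a reader can recover unaided."""
    return value.source_urn or value.transport_type


erased = StoredValue(transport_type="SCHEMA")
logical = StoredValue(transport_type="SCHEMA", source_urn="pandas.DataFrame")

print(recover_type(erased))   # only the transport type is known
print(recover_type(logical))  # the source type can be recovered
```

With erasure, the reader only learns the transport type and has to be told which Python type to cast to. With the source type stored next to it, the cast can be chosen automatically.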
Goal: What should the final outcome look like, ideally?
Users can specify new types, and we can reverse engineer those types from the stored definition. This helps with debugging, static type assertions, and optimizations, and improves extensibility.
Describe alternatives you've considered
What exists today - type erasure!
[Optional] Propose: Link/Inline OR Additional context
-- from @kanterov
A logical type is a type alias for an existing LiteralType, and values of logical types are represented with the existing Literal. Logical types can correspond to built-in or user-defined types in an SDK. A logical type is defined as follows (this approach is inspired by the Apache Beam proto):
message LogicalType {
  // Required. Unique resource name for LogicalType.
  // There is a list of well-known logical types supported by SDKs,
  // and users can add their own.
  string urn = 1;

  // Required. Existing LiteralType used to represent values of LogicalType.
  LiteralType representation = 2;

  // Optional. Additional argument for logical type.
  // May be used to serialize additional information.
  Literal argument = 3;

  // Optional. Type of argument.
  LiteralType argument_type = 4;
}
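For readers more at home in Python, the message above could be mirrored roughly as follows. This is a sketch only: `LiteralType` here is a simplified stand-in that carries just a simple type name, not the real flyteidl class, and the `"SCHEMA"` label is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class LiteralType:
    """Simplified stand-in for flyteidl's LiteralType (simple types only)."""
    simple: str


@dataclass(frozen=True)
class LogicalType:
    """Python mirror of the proposed proto message (illustrative only)."""
    urn: str                                     # required
    representation: LiteralType                  # required
    argument: Optional[Any] = None               # optional
    argument_type: Optional[LiteralType] = None  # optional

    def __post_init__(self) -> None:
        # The proposal marks urn as required, so reject an empty one.
        if not self.urn:
            raise ValueError("urn is required")


# FIXEDBYTES(16): the argument carries the length of the byte array.
fixed_bytes_16 = LogicalType(
    urn="FIXEDBYTES",
    representation=LiteralType("BINARY"),
    argument=16,
    argument_type=LiteralType("INTEGER"),
)

# A user-facing type with no argument at all.
pandas_df = LogicalType(urn="pandas.DataFrame", representation=LiteralType("SCHEMA"))
```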
Example of urn
pandas.DataFrame, pyspark.DataFrame
Semantics
Type t1 is a supertype of logical type t2 iff any of the following holds:
• t1 is strictly equal to t2
• t1 is a supertype of t3, and t3 is a supertype of t2
• t1 is a supertype of t2.representation
This allows us to read unknown logical types using their representation. E.g. if task_1 produces output: LogicalType(representation=INTEGER) and task_2 has input of INTEGER, it’s possible to bind task_2.input to task_1.output. However, it isn’t possible to do the opposite: use any INTEGER as LogicalType(representation=INTEGER).
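The supertype rules can be sketched as a short recursive check. `Simple`, `Logical` and `is_supertype` are hypothetical names for this example; the real type checker lives elsewhere and is not shown here. Walking down the `representation` chain covers both the transitive rule and the representation rule.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Simple:
    """Stand-in for a built-in type such as INTEGER."""
    name: str


@dataclass(frozen=True)
class Logical:
    """Stand-in for a logical type with its representation."""
    urn: str
    representation: "TypeT"


TypeT = Union[Simple, Logical]


def is_supertype(t1: TypeT, t2: TypeT) -> bool:
    """Return True if a value of type t2 may be bound where t1 is expected."""
    if t1 == t2:
        return True  # strict equality
    if isinstance(t2, Logical):
        # t1 is a supertype of t2 if it is a supertype of t2.representation;
        # recursing also handles chains of logical types (transitivity).
        return is_supertype(t1, t2.representation)
    return False


INTEGER = Simple("INTEGER")
INT32 = Logical("INT32", INTEGER)

print(is_supertype(INTEGER, INT32))  # task_2(INTEGER) can read task_1's INT32
print(is_supertype(INT32, INTEGER))  # an arbitrary INTEGER is not an INT32
```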
SDKs have a list of well-known logical types that are mapped to built-in or custom types. flyteconsole or flytectl can have a special behaviour for well-known logical types.
flytepropeller shouldn’t introduce special behaviour for well-known logical types when doing type-checking. This limitation of logical types allows new logical types to be introduced without every Flyte component being aware of them. When there is an unknown logical type, it should be safe for an implementation to fall back to its representation.
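On the SDK side, the fallback could look roughly like this. The registry `WELL_KNOWN` and the function `decode` are assumptions made for the sketch, not flytekit APIs; the point is only that an unknown urn yields the raw representation value instead of an error.

```python
from datetime import date, datetime
from typing import Any, Callable, Dict

# Hypothetical registry: urn -> function that turns the representation
# value into the richer SDK-side value.
WELL_KNOWN: Dict[str, Callable[[Any], Any]] = {
    "LOCAL DATE": lambda dt: dt.date(),  # represented as DATETIME
}


def decode(urn: str, raw: Any) -> Any:
    """Decode a value, falling back to its representation for unknown urns."""
    decoder = WELL_KNOWN.get(urn)
    return decoder(raw) if decoder is not None else raw


print(decode("LOCAL DATE", datetime(2025, 5, 14, 0, 12)))  # known urn
print(decode("acme.Unknown", 7))                           # unknown urn
```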
Examples of well-known logical types
• INT32 (represented as INTEGER)
• FIXEDBYTES(N) (represented as BINARY): argument_type is INTEGER, representing the length of the fixed byte array
• LOCAL DATE (represented as DATETIME): date without timezone
• LOCAL DATETIME (represented as DATETIME): datetime without timezone
• DECIMAL(P, D) (represented as BYTES): argument_type is {p: INTEGER, d: INTEGER}, where p is the precision and d is the number of digits after the decimal point
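As an illustration of how an argument constrains values, here is a sketch for FIXEDBYTES(N). `check_fixed_bytes` is a hypothetical helper; the proposal does not say where such validation would run.

```python
def check_fixed_bytes(raw: bytes, n: int) -> bytes:
    """Accept raw only if it has exactly n bytes, as FIXEDBYTES(n) requires."""
    if len(raw) != n:
        raise ValueError(f"expected {n} bytes, got {len(raw)}")
    return raw


digest = check_fixed_bytes(b"\x00" * 16, 16)  # accepted as FIXEDBYTES(16)
```

The value itself is still plain binary, so a reader that does not know FIXEDBYTES can consume it unchanged through its representation.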
Example: introducing INT32
flyteidl has an INTEGER type that is a 64-bit integer. It’s natural for SDK users to use 32-bit integers unless they need 64 bits. In Java, there are two separate types, Integer and Long, representing 32-bit and 64-bit integers. However, this creates a problem because a 64-bit value can overflow when a task tries to fit it into a 32-bit integer. Introducing a logical type for INT32 allows tasks to read INT32 only if the input is bound to a literal that is known to be INT32.
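The overflow can be shown with a few lines of Python. `wrap_int32` imitates what a narrowing cast such as Java's `(int)` does to a 64-bit value, and `to_int32` is the kind of check a producer of INT32 could apply; both names are made up for this sketch.

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def wrap_int32(value: int) -> int:
    """Imitate a narrowing cast: keep the low 32 bits, two's complement."""
    return (value + 2**31) % 2**32 - 2**31


def to_int32(value: int) -> int:
    """Reject values that do not fit into a signed 32-bit integer."""
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError(f"{value} does not fit into INT32")
    return value


big = 2**31                # a valid INTEGER (64-bit) value
print(wrap_int32(big))     # silently becomes a negative number
print(to_int32(12345))     # fits, passes through unchanged
```

If the literal is known to be INT32, the consumer never has to make the narrowing cast on an arbitrary 64-bit value in the first place.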
flyteorg/flyte