#1144 [Plugin][Flytekit] Support for TFRecord as loadable schema type
Issue created by kumare3
Why would this plugin be helpful to the Flyte community
Users often want to process data with Spark and then pass it to a TensorFlow training process. Parquet and other columnar formats are highly inefficient for training. The TF community has done some work to solve this problem. It would be wonderful if we could perform this conversion automatically depending on the context.
E.g., if a task accepts a TFRecord (data format) input as a Spark DataFrame, we can convert it automatically; similarly, if the user writes out a Spark DataFrame but annotates it as TFRecord, we can auto-convert on write.
Likewise, if the user reads the Spark DataFrame into a downstream process as TFRecords, we can perform the conversion there.
This library provides this capability:
https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-connector
LinkedIn has since published an improved version of this library:
https://github.com/linkedin/spark-tfrecord
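For context on what such a plugin would read and write, below is a minimal stdlib-only sketch of the TFRecord on-disk framing: each record is an 8-byte little-endian length, a 4-byte masked CRC-32C of those length bytes, the payload, and a 4-byte masked CRC-32C of the payload. The function names here are my own illustration, not part of any library; real plugin code would delegate to `tf.io` or the Spark connector rather than hand-rolling this.

```python
# Illustrative sketch of the TFRecord file framing (not plugin code).
import struct


def _make_crc32c_table():
    # CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ 0x82F63B78 if c & 1 else c >> 1
        table.append(c)
    return table


_CRC_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for b in data:
        crc = _CRC_TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def _masked_crc(data: bytes) -> int:
    # TFRecord stores CRCs "masked": rotate right by 15, add 0xA282EAD8.
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF


def write_records(path, payloads):
    """Write raw byte payloads as length-prefixed, CRC-guarded records."""
    with open(path, "wb") as f:
        for payload in payloads:
            header = struct.pack("<Q", len(payload))
            f.write(header)
            f.write(struct.pack("<I", _masked_crc(header)))
            f.write(payload)
            f.write(struct.pack("<I", _masked_crc(payload)))


def read_records(path):
    """Read records back, verifying both CRCs for each record."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break
            (length,) = struct.unpack("<Q", header)
            assert struct.unpack("<I", f.read(4))[0] == _masked_crc(header)
            payload = f.read(length)
            assert struct.unpack("<I", f.read(4))[0] == _masked_crc(payload)
            records.append(payload)
    return records
```

In practice each payload would be a serialized `tf.train.Example` proto; the Spark connectors above handle that serialization and the DataFrame schema mapping, which is the part the plugin would expose through Flyte's type system.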
Type of Plugin
☑︎ Python/Java interface only plugin
☐ Web Service (e.g. AWS Sagemaker, GCP DataFlow, Qubole etc...)
☐ Kubernetes Operator (e.g. TfOperator, SparkOperator, FlinkK8sOperator, etc...)
☐ Customized Plugin using native kubernetes constructs
☐ Other
Can you help us with the implementation?
☐ Yes
☐ No
flyteorg/flyte