# hacktoberfest-2022
s
Also @Niels Bantilan, regarding https://github.com/flyteorg/flytekit/pull/1240, are there any use cases you could think of?
n
not sure I understand the question… I assumed `.tfrecord` is a TF-specific file format that we wanted to support out-of-the-box?
s
This PR adds a `tf.train.Example` type, and it internally stores the data in a TFRecord file for serialization. I've been wondering why someone might use `tf.train.Example` as a data type.
r
@Samhita Alla This is the pattern the TF docs https://www.tensorflow.org/tutorials/load_data/tfrecord suggest for serialising to `.tfrecord`, unless I misunderstood the requirements of the original issue. The main use cases seem to be image data and training on TPUs (see next comment in thread). The alternative, I guess, is to provide support for a `tf.train.Features` type, which then gets converted to an `Example` type and serialised and stored as `.tfrecord` - see the images I/O example at the end of https://www.tensorflow.org/tutorials/load_data/tfrecord
There are similar examples in the Keras documentation using `tf.Example` <-> TFRecord, and quoting from those docs regarding the use case: “An important use case of the TFRecord data format is training on TPUs. First, TPUs are fast enough to benefit from optimized I/O operations. In addition, TPUs require data to be stored remotely (e.g. on Google Cloud Storage) and using the TFRecord format makes it easier to load the data without batch-downloading.” https://keras.io/examples/keras_recipes/creating_tfrecords/
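For reference, here is a minimal sketch of the serialisation pattern those docs describe: pack `tf.train.Feature`s into a `tf.train.Example`, write it with `tf.io.TFRecordWriter`, and read it back via `tf.data.TFRecordDataset`. The feature names are made up for illustration.

```python
import tensorflow as tf

def serialize_example(label: int, image_bytes: bytes) -> bytes:
    # Pack scalar features into a tf.train.Example and serialize the proto.
    features = tf.train.Features(
        feature={
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        }
    )
    return tf.train.Example(features=features).SerializeToString()

# Write a couple of records to a .tfrecord file ...
with tf.io.TFRecordWriter("data.tfrecord") as writer:
    writer.write(serialize_example(0, b"\x00" * 16))
    writer.write(serialize_example(1, b"\xff" * 16))

# ... and read them back as a tf.data.Dataset of serialized protos.
for raw_record in tf.data.TFRecordDataset(["data.tfrecord"]):
    print(tf.train.Example.FromString(raw_record.numpy()))
```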
s
@Ryan Nazareth, makes sense. I have a couple of suggestions:
• There's no `tf.data` <-> TFRecord conversion available. We may need to add support for `tf.data.Dataset` as well.
• IMO, this Flyte type should only perform conversion from `tf.train.Example`/`tf.data` to a TFRecord file (like the Flyte ONNX types), because the TFRecord format per se can be used while training the model, but not vice versa (see the sketch below). We could also have a task to read the TFRecord file, but that can involve a lot of customization, e.g., https://www.tensorflow.org/tutorials/load_data/tfrecord#read_the_tfrecord_file & https://keras.io/examples/keras_recipes/creating_tfrecords/#train-a-simple-model-using-the-generated-tfrecords, so I'm not sure we want a task to read the TFRecord file. cc: @Niels Bantilan @Ketan (kumare3)
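A hypothetical usage sketch of that one-way boundary, assuming the transformer from PR #1240 registers `tf.train.Example` as a task I/O type and writes it out as a TFRecord blob behind the scenes (the actual API in the PR may differ):

```python
import tensorflow as tf
from flytekit import task

@task
def make_example(label: int) -> tf.train.Example:
    # The task only deals with the in-memory proto; the Flyte type
    # transformer is responsible for serialising it to .tfrecord on upload.
    return tf.train.Example(
        features=tf.train.Features(
            feature={
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            }
        )
    )
```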
r
@Samhita Alla sure, I can make the change and add support for `tf.data`. Also, regarding your second point, I'm assuming in that case I should create a new FlyteFile type `TfRecordFile = FlyteFile[typing.TypeVar('tfrecord')]`, which would then be returned as `TfRecordFile(path=lv.scalar.blob.uri)` in the `def to_python_value(self, ...)` method of the TypeTransformer, similar to the PyTorch ONNX implementation? Let me know once @Niels Bantilan and @Ketan (kumare3) have confirmed.
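A rough sketch of that `to_python_value` direction, modelled on the ONNX transformers in flytekit; the class name and the omitted methods are placeholders, not the actual PR code:

```python
import typing

from flytekit import FlyteContext
from flytekit.extend import TypeTransformer
from flytekit.models.literals import Literal
from flytekit.types.file import FlyteFile

TfRecordFile = FlyteFile[typing.TypeVar("tfrecord")]

class TfRecordFileTransformer(TypeTransformer[TfRecordFile]):
    # get_literal_type / to_literal omitted for brevity.

    def to_python_value(
        self, ctx: FlyteContext, lv: Literal, expected_python_type: typing.Type[TfRecordFile]
    ) -> TfRecordFile:
        # Hand back a FlyteFile pointing at the stored .tfrecord blob;
        # the caller decides how (or whether) to parse it.
        return TfRecordFile(path=lv.scalar.blob.uri)
```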
n
So after discussing with @Eduardo Apolinario (eapolinario) and thinking about this a little more, I think we still need to discuss the core question: how do people actually use `tf.train.Example`, `TFRecord`s, and `tf.data.Dataset` in real life, and how do we create reasonable I/O and serialization/deserialization boundaries between these objects, again in the context of how people actually use them, and in a way that provides value to users? I created a draft proposal on the PR: https://github.com/flyteorg/flytekit/pull/1240#issuecomment-1292623521
@Samhita Alla @Ryan Nazareth @Ketan (kumare3) @Eduardo Apolinario (eapolinario) @Yee would appreciate your thoughts/feedback
s
Thanks, @Niels Bantilan! Added a suggestion.
d
Related to this discussion, I've been thinking about how to add support for `tf.data.Dataset` to Flyte (and whether it's even a good idea) and created this feature request issue: https://github.com/flyteorg/flyte/issues/3038
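Purely as a point of reference for that issue, a minimal sketch of one way a `tf.data.Dataset` can already be persisted and reloaded today, using `tf.data.experimental.save`/`load`; whether Flyte-native support should use this, TFRecord files, or something else is exactly what the issue is meant to settle:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# Persist the dataset to a directory and reload it.
tf.data.experimental.save(ds, "/tmp/my_dataset")
restored = tf.data.experimental.load("/tmp/my_dataset", element_spec=ds.element_spec)

print(list(restored.as_numpy_iterator()))  # [0, 1, ..., 9]
```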