I am looking for a way to pass NumPy arrays (ndarray) and PyTorch/TensorFlow tensors as Flyte task inputs/outputs. I haven’t come across any examples yet. I’m aware of the native support for DataFrames, but converting ndarrays/tensors back and forth through DataFrames seems inefficient.
How are folks handling this?
acceptable-policeman-57188
05/20/2022, 7:48 PM
cc @broad-monitor-993 @high-accountant-32689
broad-monitor-993
05/20/2022, 7:57 PM
unfortunately the flytekit TypeEngine doesn’t have native support for numpy arrays or pytorch/tensorflow tensors… would you mind opening up an issue for that @straight-laptop-71325?
Currently there are 3 paths to doing this:
1. passing dataframes around (as you’ve suggested)
2. passing List[int] or List[float] and reconstituting your arrays/tensors at the beginning of the next task
3. using a np.ndarray or torch.Tensor annotation purely for human-readability. Under the hood this will pickle your array/tensor and unpickle it on the other side.
(3) is convenient, but you run the risk of deserialization issues if you happen to use different versions of python/numpy/pytorch/tensorflow across your tasks that are not cross-compatible. (2) is really for smaller data use cases since these are stored as FlyteIDL literals. (1) is nice because flyte understands this and stores dataframes as parquet files, which is a more efficient/reliable storage format than pickle.
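For reference, a minimal sketch of path (2), assuming a small float array. The function names produce_array and consume_array are hypothetical, and they are plain Python functions here so the snippet runs on its own; in an actual workflow each would carry flytekit's @task decorator. Passing the shape alongside the flat values is an assumption of this sketch, since a flat list alone loses the array's dimensions.

```python
from typing import List, Tuple

import numpy as np


# Hypothetical upstream step; in a Flyte workflow this would be a @task.
def produce_array() -> Tuple[List[float], List[int]]:
    """Flatten the array and return its values together with its shape."""
    arr = np.arange(6, dtype=np.float64).reshape(2, 3)
    return arr.ravel().tolist(), list(arr.shape)


# Hypothetical downstream step; in a Flyte workflow this would be a @task.
def consume_array(values: List[float], shape: List[int]) -> float:
    """Rebuild the array from the flat list before doing any work on it."""
    arr = np.asarray(values, dtype=np.float64).reshape(shape)
    return float(arr.sum())


values, shape = produce_array()
total = consume_array(values=values, shape=shape)
```

Since everything crosses the task boundary as plain lists, this only makes sense for small arrays, as noted above.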
straight-laptop-71325
05/20/2022, 8:11 PM
Thanks @broad-monitor-993. This is helpful!
I’ll open an issue for this.