# ask-the-community
r
Quick question about the dask plugin (this also applies more broadly to spark) - how is serialization handled between tasks that use a pandas vs dask dataframe (or spark vs dask, etc.)? For example, task A processes a dataset with spark and returns a spark df. Task B is a dask task that loads the processed dataset to train a model. My current understanding is that the types need to line up properly in order for the workflow to type check, but the return type of A would be a pyspark DF while the argument in B would be a dask dataframe. How are folks handling this case where we might want to mix spark and dask in pipelines? Does StructuredDataset figure into this? Any clarification would be appreciated.
f
As far as I understand, yes: StructuredDataset should solve exactly this issue 🤔 I've never mixed dask and spark in a Flyte pipeline myself, though.
r
So the idea is that all tasks use a StructuredDataset instead of the respective dataframe types from the various libraries, and the type engine handles unmarshalling the DF into the correct type? Edit: ah, looks like there's StructuredDataset.open
k
Yes, or if you type the input with the dataframe type you want, it should work automatically
But I don't think we have a dask dataframe plugin / transformer? Can you add one?
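A minimal sketch of that pattern, assuming flytekit's built-in pandas Parquet handler on the producer side and a registered dask handler on the consumer side (a dask handler is exactly the plugin discussed in this thread, so treat the `dd.DataFrame` decode as an assumption rather than something flytekit ships today):

```python
import dask.dataframe as dd
import pandas as pd
from flytekit import task, workflow
from flytekit.types.structured import StructuredDataset


@task
def make_dataset() -> StructuredDataset:
    # Any dataframe with a registered encoder can be wrapped; the type engine
    # serializes it (e.g. to Parquet on blob storage) behind the literal.
    df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
    return StructuredDataset(dataframe=df)


@task
def train(ds: StructuredDataset) -> float:
    # The consumer asks the type engine for whatever frame type it wants;
    # opening as dd.DataFrame assumes a dask decoder has been registered.
    ddf = ds.open(dd.DataFrame).all()
    return float(ddf["y"].mean().compute())


@workflow
def pipeline() -> float:
    return train(ds=make_dataset())
```

Alternatively, per the reply above, annotating the input of `train` directly as `dd.DataFrame` (and the output of `make_dataset` as `pd.DataFrame`) lets the type engine do the conversion without touching `StructuredDataset` explicitly, at the cost of the mypy mismatch mentioned later in the thread.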
r
Sure, shouldn't be hard, and I'm going to need one as part of some work we're doing on distributed training.
The issue with typing inputs differently (i.e. passing the output of a task that has return annotation pd.DataFrame to a task that expects a pyspark or dask DF) is that it breaks our mypy validation.
(We'd done some work using ParamSpecs to ensure we can type-check workflows at CI time, which I shared with @Eduardo Apolinario (eapolinario).)
So it appears that structured datasets are the path forward here to preserve type safety and have pandas/dask/spark interop.
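For reference, a rough sketch of what a dask StructuredDataset handler could look like, modeled on the pattern of flytekit's pandas Parquet encoder/decoder pair. The class names here are made up, the exact base-class signatures vary between flytekit versions, and passing `None` for the protocol assumes a version where that registers the handler for all storage protocols:

```python
import dask.dataframe as dd
from flytekit import FlyteContext
from flytekit.models import literals
from flytekit.models.literals import StructuredDatasetMetadata
from flytekit.models.types import StructuredDatasetType
from flytekit.types.structured.structured_dataset import (
    PARQUET,
    StructuredDataset,
    StructuredDatasetDecoder,
    StructuredDatasetEncoder,
    StructuredDatasetTransformerEngine,
)


class DaskToParquetEncodingHandler(StructuredDatasetEncoder):
    def __init__(self):
        # Handle dd.DataFrame task outputs and persist them as Parquet.
        super().__init__(dd.DataFrame, None, PARQUET)

    def encode(
        self,
        ctx: FlyteContext,
        structured_dataset: StructuredDataset,
        structured_dataset_type: StructuredDatasetType,
    ) -> literals.StructuredDataset:
        uri = structured_dataset.uri or ctx.file_access.get_random_remote_directory()
        # dask writes a directory of Parquet part files at the target URI.
        structured_dataset.dataframe.to_parquet(uri)
        return literals.StructuredDataset(
            uri=uri,
            metadata=StructuredDatasetMetadata(structured_dataset_type=structured_dataset_type),
        )


class ParquetToDaskDecodingHandler(StructuredDatasetDecoder):
    def __init__(self):
        super().__init__(dd.DataFrame, None, PARQUET)

    def decode(
        self,
        ctx: FlyteContext,
        flyte_value: literals.StructuredDataset,
        current_task_metadata: StructuredDatasetMetadata,
    ) -> dd.DataFrame:
        # Read the Parquet files back lazily as a dask dataframe.
        return dd.read_parquet(flyte_value.uri)


# Registering the handlers makes dd.DataFrame usable as a task input/output type
# and as a target for StructuredDataset.open(dd.DataFrame).
StructuredDatasetTransformerEngine.register(DaskToParquetEncodingHandler())
StructuredDatasetTransformerEngine.register(ParquetToDaskDecodingHandler())
```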