Quick question about the dask plugin (this also applies more broadly to spark as well) - how is serialization handled between tasks that use a pandas vs dask dataframe (or spark vs dask, etc)?
For ex, task A processes a dataset with spark and returns a spark df. Task B is a dask task that loads the processed dataset to train a model. My current understanding is that the types need to properly line up in order for the workflow to type check, but the return type of A would be a pyspark DF while the argument in B would be a dask dataframe.
How are folks handling this case where we might want to mix spark and dask in pipelines? Does structured dataset figure into this? Any clarification would be appreciated
08/28/2023, 11:08 AM
As far as I understand yes, structured dataset should solve exactly this issue 🤔 Have never mixed dask and spark in a flyte pipeline myself though.
08/28/2023, 11:10 AM
So the idea is all tasks use a StructuredDataset as opposed to the respective dataframe types from the various libraries and the type engine will handle unmarshalling the DF in the correct type?
Edit: ah, looks like there’s StructuredDataset.open
08/28/2023, 2:27 PM
Yes, or if you type the input it should work automatically
But I don’t think we have a dask dataframe plugin / transformer? Can you add one
08/28/2023, 2:28 PM
sure shouldn't be hard, and I'm going to need to as part of some work we're doing on distributed training
The issue w/ typing inputs differently (ie. passing a task that has return annotation pd.DataFrame to a task that expects a pyspark or dask DF) is that breaks our mypy validation
(We'd done some work around using paramspecs to ensure we can type check workflows at CI-time that I shared w/ @Eduardo Apolinario (eapolinario))
So it appears that structured datasets are the path forward here to preserve type safety and have pandas/dask/spark interop