Quick question about the dask plugin this also applies more Flyte #flyte-support

Quick question about the dask plugin (this also ap...

elegant-australia-91422

08/28/2023, 11:07 AM

Quick question about the dask plugin (this also applies more broadly to spark as well) - how is serialization handled between tasks that use a pandas vs dask dataframe (or spark vs dask, etc)? For ex, task A processes a dataset with spark and returns a spark df. Task B is a dask task that loads the processed dataset to train a model. My current understanding is that the types need to properly line up in order for the workflow to type check, but the return type of A would be a pyspark DF while the argument in B would be a dask dataframe. How are folks handling this case where we might want to mix spark and dask in pipelines? Does structured dataset figure into this? Any clarification would be appreciated

cool-lifeguard-49380

08/28/2023, 11:08 AM

As far as I understand yes, structured dataset should solve exactly this issue 🤔 Have never mixed dask and spark in a flyte pipeline myself though.

elegant-australia-91422

08/28/2023, 11:10 AM

So the idea is all tasks use a StructuredDataset as opposed to the respective dataframe types from the various libraries and the type engine will handle unmarshalling the DF in the correct type? Edit: ah, looks like there’s StructuredDataset.open

freezing-airport-6809

08/28/2023, 2:27 PM

Yes, or if you type the input it should work automatically

freezing-airport-6809

08/28/2023, 2:28 PM

But I don’t think we have a dask dataframe plugin / transformer? Can you add one

elegant-australia-91422

08/28/2023, 2:28 PM

sure shouldn't be hard, and I'm going to need to as part of some work we're doing on distributed training

🔥 1

elegant-australia-91422

08/28/2023, 2:29 PM

The issue w/ typing inputs differently (ie. passing a task that has return annotation pd.DataFrame to a task that expects a pyspark or dask DF) is that breaks our mypy validation

elegant-australia-91422

08/28/2023, 2:29 PM

(We'd done some work around using paramspecs to ensure we can type check workflows at CI-time that I shared w/ @high-accountant-32689)

elegant-australia-91422

08/28/2023, 2:30 PM

So it appears that structured datasets are the path forward here to preserve type safety and have pandas/dask/spark interop

12 Views

Open in Slack

Previous Next