acoustic-carpenter-78188
12/24/2022, 3:53 PM
StructuredDataset
and then trying to load them into a polars dataframe and a hugging face dataset.
It resulted in the following error for both plugins:
No such file or directory: /var/folders/wq/3hjh3ms916b6dj56zx0f_x000000gq/T/flyte-69d2tww2/sandbox/local_flytekit/95bac8efeb64a8d10d34c73b66df7051/00000
However, it did work for pandas.
It seems like the polars and huggingface transformers add 00000 to the path, and the spark transformer does not.
• polars: https://github.com/flyteorg/flytekit/blob/master/plugins/flytekit-polars/flytekitplugins/polars/sd_transformers.py#L43
• spark: https://github.com/flyteorg/flytekit/blob/master/plugins/flytekit-spark/flytekitplugins/spark/sd_transformers.py#L29
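To make the suspected mismatch concrete, here is a minimal sketch using only the standard library. It does not call flytekit, polars, or Spark. The file names inside the directory are illustrative assumptions about what a Spark parquet write leaves behind, and single_file_reader_path is a hypothetical stand-in for a reader that expects the data at <uri>/00000.

```python
import os
import tempfile


def simulate_spark_write(uri: str) -> None:
    """Create a directory that resembles Spark parquet output.

    The file names are illustrative assumptions, not taken from a real run.
    """
    os.makedirs(uri, exist_ok=True)
    for name in ("_SUCCESS", "part-00000-example.snappy.parquet"):
        with open(os.path.join(uri, name), "wb"):
            pass  # empty placeholder file


def single_file_reader_path(uri: str) -> str:
    """Hypothetical: the path a reader opens if it assumes <uri>/00000."""
    return os.path.join(uri, "00000")


with tempfile.TemporaryDirectory() as tmp:
    uri = os.path.join(tmp, "structured_dataset")
    simulate_spark_write(uri)
    expected = single_file_reader_path(uri)
    directory_exists = os.path.isdir(uri)
    file_exists = os.path.exists(expected)
    print(f"directory exists: {directory_exists}")
    print(f"{expected} exists: {file_exists}")
```

If that is what is happening, the dataset directory is there but no file named 00000 is inside it, which would match the "No such file or directory" error above.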
Expected behavior
I would expect to be able to use a StructuredDataset from spark with dataframe libraries from all plugins.
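One way to picture that expectation is a reader that tolerates both layouts. The sketch below is only an illustration under stated assumptions: resolve_read_path is a hypothetical helper, not part of flytekit or any plugin, and the part file name is made up.

```python
import os
import tempfile


def resolve_read_path(uri: str) -> str:
    """Hypothetical helper: prefer <uri>/00000, else fall back to the directory."""
    candidate = os.path.join(uri, "00000")
    if os.path.exists(candidate):
        return candidate
    if os.path.isdir(uri):
        return uri
    raise FileNotFoundError(uri)


with tempfile.TemporaryDirectory() as tmp:
    # Layout A: a single file named 00000 inside the dataset directory.
    single = os.path.join(tmp, "single")
    os.makedirs(single)
    open(os.path.join(single, "00000"), "wb").close()

    # Layout B: a Spark-style directory of part files (illustrative name).
    multi = os.path.join(tmp, "multi")
    os.makedirs(multi)
    open(os.path.join(multi, "part-00000-example.parquet"), "wb").close()

    single_ok = resolve_read_path(single) == os.path.join(single, "00000")
    multi_ok = resolve_read_path(multi) == multi
    print(single_ok, multi_ok)
```

Whether the actual fix belongs on the reading side or the writing side is a separate question; this only shows that both layouts can be told apart from the path alone.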
Additional context to reproduce
import datasets
import flytekit
import pandas as pd
import polars as pl
from flytekit import task, StructuredDataset
from flytekitplugins.spark.task import Spark


@task(task_config=Spark())
def spark_task(path: str) -> StructuredDataset:
    # Read the parquet file with the task's Spark session and return it as a StructuredDataset.
    sess = flytekit.current_context().spark_session
    df = sess.read.parquet(path)
    return StructuredDataset(dataframe=df)


df = spark_task(path="./ratings_100k.parquet")

# polars: fails with "No such file or directory"
try:
    df.open(pl.DataFrame).all().head()
except Exception as e:
    print(e)

# Hugging Face datasets: fails with the same error
try:
    df.open(datasets.Dataset).all().head()
except Exception as e:
    print(e)

# pandas: works
df.open(pd.DataFrame).all().head()
Screenshots
Screen Shot 2022-12-24 at 10 54 40 AM