# ask-the-community
f
Happy New Year! I have a large machine-learning feature dataset stored as many Parquet files under an AWS S3 folder (key). I have a Flyte task that reads the data in and returns it as a pandas DataFrame. Due to the large data size, I'd prefer to use a Flyte Spark task to read the data. Sample code:
import flytekit
import pandas
from flytekit import task
from flytekitplugins.spark import Spark

@task(
    container_image="xyz.dkr.ecr.us-east-1.amazonaws.com/flyte-pyspark:latest",
    task_config=Spark(
        spark_conf={...
        }
    ),
)
def read_spark_df() -> pandas.DataFrame:
    sess = flytekit.current_context().spark_session
    # toPandas() already returns a pandas DataFrame; no extra conversion needed
    df = sess.read.parquet("s3a://bucket/key.parquet").toPandas()
    return df
n
Hi Frank, this should work. Ensure that you have the Spark plugin enabled on your Flyte backend: https://docs.flyte.org/en/latest/deployment/plugin_setup/k8s/index.html#deployment-plugin-setup-k8s
y
i’m a little confused here though… when you return the dataframe, flytekit will again serialize it back to s3 as one parquet file.
I/O is persisted between task runs
it will not pass in memory from one task to the next.
f
@Yee, I got your point now. Thanks
n
yeah if it’s memory you’re concerned about then you can use a plain
@task
and request more resources: https://docs.flyte.org/projects/cookbook/en/latest/auto/deployment/customizing_resources.html#sphx-glr-auto-deployment-customizing-resources-py
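A minimal sketch of that suggestion, assuming the Resources API from the linked docs (the resource amounts and S3 path are placeholders): a plain `@task` with larger memory requests can read the Parquet data directly with pandas, with no Spark setup.

```python
import pandas
from flytekit import Resources, task

@task(
    requests=Resources(cpu="4", mem="32Gi"),  # hypothetical sizes; tune to your data
    limits=Resources(mem="64Gi"),
)
def read_df() -> pandas.DataFrame:
    # pandas can read a whole directory of Parquet files in one call;
    # the path below is a placeholder for your actual bucket/key.
    return pandas.read_parquet("s3://bucket/key.parquet")
```

This only raises the pod's resource envelope; the read itself still happens in a single process.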
f
@Niels Bantilan, that’s a nice solution that bypasses the Spark setup and makes things simpler. I will try that. The concern is that pandas reads Parquet files in a single thread, if I understand it correctly.
n
Yep, using modin/dask to read in the Parquet files in a multithreaded/multiprocess manner would work too