# ask-the-community
f
Happy New Year! I have a large machine-learning feature dataset stored as many Parquet files under an AWS S3 folder (key). I have a Flyte task that reads the data in and returns it as a pandas DataFrame. Due to the large data size, I'd prefer to use a Flyte Spark task to read the data. Sample code:
import flytekit
import pandas
from flytekit import task
from flytekitplugins.spark import Spark

@task(
    container_image="xyz.dkr.ecr.us-east-1.amazonaws.com/flyte-pyspark:latest",
    task_config=Spark(
        spark_conf={...
        }
    ),
)
def read_spark_df() -> pandas.DataFrame:
    sess = flytekit.current_context().spark_session
    # toPandas() already returns a pandas DataFrame; no extra conversion needed
    df = sess.read.parquet("s3a://bucket/key.parquet").toPandas()
    return df
n
Hi Frank, this should work. Ensure that you have the Spark plugin enabled on your Flyte backend: https://docs.flyte.org/en/latest/deployment/plugin_setup/k8s/index.html#deployment-plugin-setup-k8s
y
i’m a little confused here though… when you return the dataframe, flytekit will again serialize it back to s3 as one parquet file.
I/O is persisted between task runs
it will not pass in memory from one task to the next.
f
@Yee, I got your point now. Thanks
n
yeah if it’s memory you’re concerned about then you can use a plain
@task
and request more resources: https://docs.flyte.org/projects/cookbook/en/latest/auto/deployment/customizing_resources.html#sphx-glr-auto-deployment-customizing-resources-py
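A minimal sketch of that suggestion, assuming the Resources API from the linked docs (the resource amounts and S3 path are placeholders): a plain `@task` with larger memory requests can read the Parquet data directly with pandas, with no Spark setup.

```python
import pandas
from flytekit import Resources, task

@task(
    requests=Resources(cpu="4", mem="32Gi"),  # hypothetical sizes; tune to your data
    limits=Resources(mem="64Gi"),
)
def read_df() -> pandas.DataFrame:
    # pandas can read a whole directory of Parquet files in one call;
    # the path below is a placeholder for your actual bucket/key.
    return pandas.read_parquet("s3://bucket/key.parquet")
```

This only raises the pod's resource envelope; the read itself still happens in a single process.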
f
@Niels Bantilan, that’s a nice solution that bypasses the Spark setup and makes things simpler. I will try that. The concern is that pandas reads Parquet files in a single thread, if I understand it correctly.
n
Yep, using modin/dask to read in the Parquet files in a multithreaded/multiprocess manner would work too