Frank Shen

01/06/2023, 6:51 PM
Could anyone explain how to setup local env to read from S3 using spark?

Kevin Su

01/06/2023, 7:08 PM
You have to add AWS credentials to the Spark context:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKeyId)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretAccessKey)
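In PySpark (which is what the Flyte task below uses), a minimal equivalent sketch looks like the following; the fs.s3a.* properties are the current counterparts of the older fs.s3n.* ones above, and the credential values and bucket path are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The Hadoop configuration lives on the JVM side; reach it through the
# JavaSparkContext wrapper (_jsc). fs.s3a.* are the current equivalents
# of the older fs.s3n.* properties.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "<ACCESS_KEY_ID>")      # placeholder
hadoop_conf.set("fs.s3a.secret.key", "<SECRET_ACCESS_KEY>")  # placeholder

# S3 paths then use the s3a:// scheme; this bucket is hypothetical.
df = spark.read.parquet("s3a://my-bucket/some/prefix/")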

Frank Shen

01/06/2023, 7:11 PM
@Kevin Su, thanks. I don’t remember installing Spark locally, so how does the following Flyte Spark task work? (It runs, but reading from S3 doesn’t.)
# Imports assumed for this snippet (not shown in the original message):
import flytekit
from flytekit import StructuredDataset, task
from flytekitplugins.spark import Spark
from typing_extensions import Annotated

# `columns` is the column-type annotation for the StructuredDataset;
# its definition was elided from the original message.

@task(
    task_config=Spark(
        spark_conf={...},  # conf elided in the original message
    ),
)
def create_spark_df() -> Annotated[StructuredDataset, columns]:
    sess = flytekit.current_context().spark_session
    return StructuredDataset(
        dataframe=sess.createDataFrame(
            [
                ("Alice", 5),
                ("Bob", 10),
                ("Charlie", 15),
            ],
            ["name", "age"],
        )
    )
What mechanism does Flyte use to run Spark locally?
I take it back. I do have Spark version 3.3.1 installed.

Kevin Su

01/06/2023, 7:27 PM
IIRC, you don’t need a full Spark distribution to run PySpark; you only need to install the pyspark package and Java. Py4J in PySpark bridges the Python process to a JVM it launches, and the Spark code runs there. If you run PySpark locally, the job will not run on a cluster; all the work happens in that single local JVM process.
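A minimal sketch of what local-mode execution amounts to (the builder options here are illustrative, not flytekit’s exact internals):

from pyspark.sql import SparkSession

# "local[*]" runs the driver and executors inside one JVM on this
# machine, using all available cores. Py4J launches and talks to that
# JVM, so only the pyspark package and Java are needed; no cluster.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("local-spark-demo")  # hypothetical app name
    .getOrCreate()
)

df = spark.createDataFrame([("Alice", 5), ("Bob", 10)], ["name", "age"])
print(df.count())  # runs entirely in the local JVM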

Frank Shen

01/06/2023, 7:30 PM
@Kevin Su, I saw this in the log, and it concurs with what you are saying:
/env/lib/python3.8/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o39.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
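That ClassNotFoundException means the hadoop-aws module, which provides org.apache.hadoop.fs.s3a.S3AFileSystem, is not on the Spark classpath. A sketch of one common fix is to pull it in through spark.jars.packages in the task’s spark_conf; the version below is an assumption that must be matched to the Hadoop version bundled with your PySpark install:

import flytekit
from flytekit import task
from flytekitplugins.spark import Spark


@task(
    task_config=Spark(
        spark_conf={
            # hadoop-aws supplies S3AFileSystem; its version must match the
            # Hadoop bundled with PySpark (3.3.2 pairs with Spark 3.3.1;
            # verify against your install before relying on this).
            "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.2",
        },
    ),
)
def read_parquet_from_s3(path: str) -> int:  # hypothetical task
    sess = flytekit.current_context().spark_session
    return sess.read.parquet(path).count()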