Could anyone explain how to set up a local env to read from S3 using Spark?
# ask-the-community
f
Could anyone explain how to set up a local environment to read from S3 using Spark?
k
You have to add AWS credentials to the Spark context.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKeyId)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretAccessKey)
f
@Kevin Su, thanks. I don't remember installing Spark locally. So how could the following Flyte Spark task work (it runs, but reading from S3 doesn't)?
from typing_extensions import Annotated

import flytekit
from flytekit import task
from flytekit.types.structured import StructuredDataset
from flytekitplugins.spark import Spark

# `columns` (the Annotated column spec) and the contents of spark_conf are
# defined elsewhere in the original code and elided here.
@task(
    task_config=Spark(
        spark_conf={...},
    ),
)
def create_spark_df() -> Annotated[StructuredDataset, columns]:
    # Flyte injects a Spark session into the task's execution context.
    sess = flytekit.current_context().spark_session
    return StructuredDataset(
        dataframe=sess.createDataFrame(
            [
                ("Alice", 5),
                ("Bob", 10),
                ("Charlie", 15),
            ],
            ["name", "age"],
        )
    )
What mechanism does Flyte use to run Spark locally?
I take it back. I do have Spark version 3.3.1 installed.
k
IIRC, you don't need a separate Spark distribution (spark jars) to run PySpark; you only need to install pyspark and Java. Py4J, which ships with pyspark, launches a JVM and forwards the Python API calls into it. If you run pyspark locally, the job will not run on a cluster; all the code runs in that single local JVM process.
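A small sketch of what that looks like in practice, assuming only pyspark and Java are installed; the app name is arbitrary:

# Running pyspark locally: no cluster, no separate Spark install. Py4J launches
# a JVM on this machine and proxies calls into it.
from pyspark import SparkContext

sc = SparkContext(master="local[*]", appName="py4j-demo")

# sc._jvm is the Py4J proxy into that local JVM; this call executes in Java.
print(sc._jvm.java.lang.System.getProperty("java.version"))

sc.stop()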
f
@Kevin Su, I saw this in the log, and it is consistent with what you are saying.
/env/lib/python3.8/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o39.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
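For context: that ClassNotFoundException means the S3A filesystem classes, which live in the hadoop-aws artifact, are not on the local JVM's classpath. A minimal sketch of one way to pull them in for a local run; the artifact version must match the Hadoop version bundled with your pyspark install and is an assumption here, as are the credential values and S3 path:

# Pull hadoop-aws (which contains org.apache.hadoop.fs.s3a.S3AFileSystem) into
# the local JVM via spark.jars.packages. 3.3.2 matches the Hadoop line that
# pyspark 3.3.x ships with, but verify against your install.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.2")
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY_ID>")      # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_ACCESS_KEY>")  # placeholder
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/some/prefix/")  # hypothetical path

The same settings can also go into the Flyte task's spark_conf dict for a local run, e.g. "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.2".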