# flyte-support
s
Could anyone explain how to set up a local env to read from S3 using Spark? When this task runs remotely, it runs in an image that has the Flyte/Spark setup. When running locally, it doesn't use that image, so the Spark jars are missing.
g
You have to add AWS credentials to the Spark context.
```
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKeyId)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretAccessKey)
```
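A PySpark equivalent of that snippet might look roughly like the sketch below; the credential variables are placeholders, and it targets the s3a connector (the one the error later in this thread refers to) rather than the older s3n scheme.
```python
# Minimal sketch: set AWS credentials on the Hadoop configuration of a
# local PySpark session so that s3a:// paths can be read.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", access_key_id)      # placeholder variable
hadoop_conf.set("fs.s3a.secret.key", secret_access_key)  # placeholder variable
```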
s
@glamorous-carpet-83516, thanks. I don't remember installing Spark locally. So how could the following Flyte Spark task work? (It runs, but reading from S3 doesn't.)
```python
import flytekit
from flytekit import kwtypes, task
from flytekit.types.structured import StructuredDataset
from flytekitplugins.spark import Spark
from typing_extensions import Annotated

# Assumed column schema for the Annotated return type.
columns = kwtypes(name=str, age=int)


@task(
    task_config=Spark(
        spark_conf={...},
    ),
)
def create_spark_df() -> Annotated[StructuredDataset, columns]:
    sess = flytekit.current_context().spark_session
    return StructuredDataset(
        dataframe=sess.createDataFrame(
            [
                ("Alice", 5),
                ("Bob", 10),
                ("Charlie", 15),
            ],
            ["name", "age"],
        )
    )
```
What mechanism does Flyte use to run Spark locally?
I take it back. I do have Spark 3.3.1 installed.
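A quick way to confirm which PySpark version the local environment actually resolves to (a generic check, nothing Flyte-specific):
```python
# Print the locally installed PySpark version.
import pyspark
print(pyspark.__version__)  # e.g. "3.3.1"
```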
g
IIRC, you don't need the Spark jars to run PySpark; you only need to install pyspark and Java. Py4J in PySpark bridges the Python calls to the JVM, where they actually run. If you run PySpark locally, the job won't run on a cluster; all the code runs in a single local JVM process.
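In other words, local execution boils down to an in-process, local-mode Spark session. A rough sketch of what that amounts to (this is not flytekit's actual plugin code, just the general shape):
```python
# Rough sketch of a local-mode PySpark session: everything runs in one local
# JVM process launched via py4j, with no cluster involved.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")           # run in-process, using local cores
    .appName("local-spark-task")  # arbitrary name, for illustration only
    .getOrCreate()
)
df = spark.createDataFrame([("Alice", 5), ("Bob", 10)], ["name", "age"])
df.show()
```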
s
@glamorous-carpet-83516, I saw this in the log, and it agrees with what you're saying.
```
/env/lib/python3.8/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o39.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
```
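That `ClassNotFoundException` means the hadoop-aws / S3A connector jars are not on the local Spark classpath. One common workaround (an assumption for this thread, not something confirmed in it) is to pull them in through `spark.jars.packages`, either in the task's `spark_conf` or on the session builder when testing locally; the hadoop-aws version has to match the Hadoop build bundled with the local PySpark.
```python
# Hypothetical spark_conf entries that add the S3A connector for local runs.
# The hadoop-aws version below is an assumption; match it to your local Hadoop.
spark_conf = {
    "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.2",
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
}
```
If the package resolves, the jars are downloaded when the session starts and the `S3AFileSystem` class should then be found.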