Frank Shen

01/06/2023, 6:51 PM
Could anyone explain how to setup local env to read from S3 using spark?

Kevin Su

01/06/2023, 7:08 PM
You have to add AWS credentials to the Spark context:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKeyId)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretAccessKey)
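In PySpark (which is what the Flyte task below uses), a minimal equivalent sketch looks like the following; the fs.s3a.* properties are the current counterparts of the older fs.s3n.* ones above, and the credential values and bucket path are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The Hadoop configuration lives on the JVM side; reach it through the
# JavaSparkContext wrapper (_jsc). fs.s3a.* are the current equivalents
# of the older fs.s3n.* properties.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "<ACCESS_KEY_ID>")      # placeholder
hadoop_conf.set("fs.s3a.secret.key", "<SECRET_ACCESS_KEY>")  # placeholder

# S3 paths then use the s3a:// scheme; this bucket is hypothetical.
df = spark.read.parquet("s3a://my-bucket/some/prefix/")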

Frank Shen

01/06/2023, 7:11 PM
@Kevin Su, thanks. I don’t remember installing Spark locally, so how does the following Flyte Spark task work? (It runs, but reading from S3 doesn’t.)
# Imports assumed for this snippet (not shown in the original message):
import flytekit
from flytekit import StructuredDataset, task
from flytekitplugins.spark import Spark
from typing_extensions import Annotated

# `columns` is the column-type annotation for the StructuredDataset;
# its definition was elided from the original message.

@task(
    task_config=Spark(
        spark_conf={...},  # conf elided in the original message
    ),
)
def create_spark_df() -> Annotated[StructuredDataset, columns]:
    sess = flytekit.current_context().spark_session
    return StructuredDataset(
        dataframe=sess.createDataFrame(
            [
                ("Alice", 5),
                ("Bob", 10),
                ("Charlie", 15),
            ],
            ["name", "age"],
        )
    )
What mechanism does Flyte use to run Spark locally?
I take it back. I do have Spark version 3.3.1 installed.

Kevin Su

01/06/2023, 7:27 PM
IIRC, you don’t need a full Spark distribution to run PySpark; you only need to install the pyspark package and Java. Py4J in PySpark bridges the Python process to a JVM it launches, and the Spark code runs there. If you run PySpark locally, the job will not run on a cluster; all the work happens in that single local JVM process.
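A minimal sketch of what local-mode execution amounts to (the builder options here are illustrative, not flytekit’s exact internals):

from pyspark.sql import SparkSession

# "local[*]" runs the driver and executors inside one JVM on this
# machine, using all available cores. Py4J launches and talks to that
# JVM, so only the pyspark package and Java are needed; no cluster.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("local-spark-demo")  # hypothetical app name
    .getOrCreate()
)

df = spark.createDataFrame([("Alice", 5), ("Bob", 10)], ["name", "age"])
print(df.count())  # runs entirely in the local JVM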

Frank Shen

01/06/2023, 7:30 PM
@Kevin Su, I saw this in the log, and it concurs with what you are saying:
/env/lib/python3.8/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o39.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
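That ClassNotFoundException means the hadoop-aws module, which provides org.apache.hadoop.fs.s3a.S3AFileSystem, is not on the Spark classpath. A sketch of one common fix is to pull it in through spark.jars.packages in the task’s spark_conf; the version below is an assumption that must be matched to the Hadoop version bundled with your PySpark install:

import flytekit
from flytekit import task
from flytekitplugins.spark import Spark


@task(
    task_config=Spark(
        spark_conf={
            # hadoop-aws supplies S3AFileSystem; its version must match the
            # Hadoop bundled with PySpark (3.3.2 pairs with Spark 3.3.1;
            # verify against your install before relying on this).
            "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.2",
        },
    ),
)
def read_parquet_from_s3(path: str) -> int:  # hypothetical task
    sess = flytekit.current_context().spark_session
    return sess.read.parquet(path).count()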