How to use S3 with Flyte Spark?
# ask-the-community
Aleksander Lempinen
How to use S3 with Flyte Spark? Spark with Flyte itself works fine with the spark service account (the spark-pi example from the Flyte docs), and I tested the spark service account with a Spark task that simply uses boto3. But I can't seem to make something like spark.read.parquet("s3://<bucket>/<path to file>") work. I tested it with pyspark using the workflow Dockerfile, and it works if I run aws configure first.
An error occurred while calling o125.parquet.
: java.nio.file.AccessDeniedException: s3://<bucket>/<path>: getFileStatus on s3://<bucket>/<path>: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden;
My spark config in values.yaml is:
spark-config-default:
  # We override the default credentials provider chain for Hadoop so that
  # it can use the serviceAccount-based IAM role or EC2 metadata-based credentials.
  # This is more in line with how AWS works.
  - spark.hadoop.fs.s3a.aws.credentials.provider: "com.amazonaws.auth.DefaultAWSCredentialsProviderChain"
  - spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2"
  - spark.kubernetes.allocation.batch.size: "50"
  - spark.hadoop.fs.s3a.acl.default: "BucketOwnerFullControl"
  - spark.hadoop.fs.s3n.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
  - spark.hadoop.fs.AbstractFileSystem.s3n.impl: "org.apache.hadoop.fs.s3a.S3A"
  - spark.hadoop.fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
  - spark.hadoop.fs.AbstractFileSystem.s3.impl: "org.apache.hadoop.fs.s3a.S3A"
  - spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
  - spark.hadoop.fs.AbstractFileSystem.s3a.impl: "org.apache.hadoop.fs.s3a.S3A"
  - spark.hadoop.fs.s3a.multipart.threshold: "536870912"
  - spark.blacklist.enabled: "true"
  - spark.blacklist.timeout: "5m"
  - spark.task.maxfailures: "8"
I can specify the AWS credentials manually and use "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider" in the Spark config (sketched below), but I was wondering whether IRSA can be used instead, to avoid handling static credentials.
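For reference, a rough sketch of that manual-credentials workaround as a flytekit Spark task; the task name, the per-task config override, and the literal key placeholders are illustrative, not from my actual setup:

import flytekit
from flytekit import task
from flytekitplugins.spark import Spark


@task(
    task_config=Spark(
        spark_conf={
            # Override the credentials provider for this task only and pass
            # static keys (placeholders below) instead of relying on IRSA.
            "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
            "spark.hadoop.fs.s3a.access.key": "<access-key>",
            "spark.hadoop.fs.s3a.secret.key": "<secret-key>",
        }
    )
)
def read_parquet(path: str) -> int:
    # The Spark session is injected by the Flyte Spark plugin at runtime.
    spark = flytekit.current_context().spark_session
    return spark.read.parquet(path).count()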
d
@Aleksander Lempinen not sure about the Spark integration. But in general you could use IRSA, just take into account that every Flyte project will be a separate namespace, so you'll have to add each to the IAM policy
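Roughly, the IAM role's trust policy needs one subject entry per project-domain namespace. A sketch of what that could look like; the account ID, OIDC provider ID, namespaces, and the spark service account name are all illustrative:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<account-id>:oidc-provider/oidc.eks.<region>.amazonaws.com/id/<oidc-id>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.<region>.amazonaws.com/id/<oidc-id>:sub": [
            "system:serviceaccount:flytesnacks-development:spark",
            "system:serviceaccount:flytesnacks-staging:spark",
            "system:serviceaccount:flytesnacks-production:spark"
          ]
        }
      }
    }
  ]
}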
k
have you tried structured dataset?
from flytekit import StructuredDataset
from pyspark.sql import DataFrame


def t1():
    sd = StructuredDataset(uri="s3://bucket/key")
    df = sd.open(DataFrame).all()