How to use S3 with Flyte Spark?
# ask-the-community
Aleksander Lempinen
How to use S3 with Flyte Spark? Spark with Flyte itself works fine with the spark service account (the spark-pi example from the Flyte docs), and I tested the spark service account with a Spark task that simply uses boto3. But I can't seem to make something like spark.read.parquet("s3://<bucket>/<path to file>") work. I tested it with pyspark using the workflow Dockerfile, and it works if I run aws configure first.
An error occurred while calling o125.parquet.
: java.nio.file.AccessDeniedException: s3://<bucket>/<path>: getFileStatus on s3://<bucket>/<path>: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden;
My spark config in values.yaml is:
spark-config-default:
  # We override the default credentials provider chain for Hadoop so that
  # it can use the serviceAccount-based IAM role or EC2 metadata-based credentials.
  # This is more in line with how AWS works.
  - spark.hadoop.fs.s3a.aws.credentials.provider: "com.amazonaws.auth.DefaultAWSCredentialsProviderChain"
  - spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2"
  - spark.kubernetes.allocation.batch.size: "50"
  - spark.hadoop.fs.s3a.acl.default: "BucketOwnerFullControl"
  - spark.hadoop.fs.s3n.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
  - spark.hadoop.fs.AbstractFileSystem.s3n.impl: "org.apache.hadoop.fs.s3a.S3A"
  - spark.hadoop.fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
  - spark.hadoop.fs.AbstractFileSystem.s3.impl: "org.apache.hadoop.fs.s3a.S3A"
  - spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
  - spark.hadoop.fs.AbstractFileSystem.s3a.impl: "org.apache.hadoop.fs.s3a.S3A"
  - spark.hadoop.fs.s3a.multipart.threshold: "536870912"
  - spark.blacklist.enabled: "true"
  - spark.blacklist.timeout: "5m"
  - spark.task.maxfailures: "8"
I can specify the AWS credentials manually and use "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider" in the Spark config (sketched below), but I was wondering whether IRSA can be used instead, to avoid handling static credentials.
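For reference, a rough sketch of that manual-credentials workaround as a flytekit Spark task; the task name, the per-task config override, and the literal key placeholders are illustrative, not from my actual setup:

import flytekit
from flytekit import task
from flytekitplugins.spark import Spark


@task(
    task_config=Spark(
        spark_conf={
            # Override the credentials provider for this task only and pass
            # static keys (placeholders below) instead of relying on IRSA.
            "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
            "spark.hadoop.fs.s3a.access.key": "<access-key>",
            "spark.hadoop.fs.s3a.secret.key": "<secret-key>",
        }
    )
)
def read_parquet(path: str) -> int:
    # The Spark session is injected by the Flyte Spark plugin at runtime.
    spark = flytekit.current_context().spark_session
    return spark.read.parquet(path).count()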
d
@Aleksander Lempinen not sure about the Spark integration. But in general you could use IRSA, just take into account that every Flyte project will be a separate namespace, so you'll have to add each to the IAM policy
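Roughly, the IAM role's trust policy needs one subject entry per project-domain namespace. A sketch of what that could look like; the account ID, OIDC provider ID, namespaces, and the spark service account name are all illustrative:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<account-id>:oidc-provider/oidc.eks.<region>.amazonaws.com/id/<oidc-id>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.<region>.amazonaws.com/id/<oidc-id>:sub": [
            "system:serviceaccount:flytesnacks-development:spark",
            "system:serviceaccount:flytesnacks-staging:spark",
            "system:serviceaccount:flytesnacks-production:spark"
          ]
        }
      }
    }
  ]
}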
k
have you tried structured dataset?
from flytekit import StructuredDataset
from pyspark.sql import DataFrame


def t1():
    sd = StructuredDataset(uri="s3://bucket/key")
    df = sd.open(DataFrame).all()