# ask-the-community

Frank Shen

01/07/2023, 12:43 AM
@Yee, @Kevin Su, @Ketan (kumare3), I got it working in plain pyspark. However, it fails in Flyte with this error:
Copy code
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.io.IOException: Failed to connect to nyxmmedina01741.wmad.warnermedia.com/10.217.173.85:55808

Ketan (kumare3)

01/07/2023, 12:46 AM
interesting
did you set this - conf.set("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.2") in SparkConf?
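For reference, a minimal sketch of the plain-pyspark setup being discussed; the app name and S3 path are placeholders, not values from this thread:
```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Pull in the S3A connector so Spark can read s3a:// paths.
conf = SparkConf().setAppName("local-s3-read")  # placeholder app name
conf.set("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.2")

spark = SparkSession.builder.config(conf=conf).getOrCreate()
df = spark.read.parquet("s3a://my-bucket/some/prefix/")  # placeholder path
```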

Frank Shen

01/07/2023, 12:47 AM
Yes.
Plain pyspark code can now read from S3, but running the same code and config in a Flyte task fails.

Yee

01/07/2023, 12:59 AM
Is this a local run?
Why is it trying to connect to a live server? What's at that DNS name?

Ketan (kumare3)

01/07/2023, 12:59 AM
metadata server?

Frank Shen

01/07/2023, 1:01 AM
Local run, yes. Why it connects to wmad, I have no idea. In both cases the same pyspark is invoked in the venv, and the two .py files are in the same dir.
Plain pyspark creates a SparkContext() directly:
Copy code
conf = SparkConf()
Flyte manages the Spark context differently:
Copy code
flytekit.current_context().spark_session
That's the difference.
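To make the Flyte side concrete, a hedged sketch of a Spark task; the task name and bucket path are placeholders:
```python
import flytekit
from flytekit import task
from flytekitplugins.spark import Spark

@task(
    task_config=Spark(
        spark_conf={"spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.2"},
    ),
)
def read_counts() -> int:  # hypothetical task name
    # Inside a Spark task, flytekit exposes the session it built on the execution context.
    spark = flytekit.current_context().spark_session
    return spark.read.parquet("s3a://my-bucket/some/prefix/").count()  # placeholder path
```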

Ketan (kumare3)

01/07/2023, 1:12 AM
Interesting. Session vs context?

Frank Shen

01/07/2023, 1:21 AM
Copy code
flytekit.current_context()
The additional post-processing happens underneath that, and then
user_params.builder().add_attr
is what makes the Spark session available in the object returned by current_context().

Samhita Alla

01/09/2023, 6:58 AM
@Frank Shen, can you run the following command and try again?
Copy code
export SPARK_LOCAL_IP="127.0.0.1"
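If you would rather keep it in code than in the shell, the same variable can be set from Python before the Spark context starts; this is an alternative, not part of the suggestion above:
```python
import os

# Must run before the SparkContext / JVM is created in the same process.
os.environ["SPARK_LOCAL_IP"] = "127.0.0.1"
```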

Frank Shen

01/09/2023, 6:24 PM
@Samhita Alla, yay, your advice works! Thanks a lot!
@Samhita Alla, can I add SPARK_LOCAL_IP to .flyte/config.yaml?
Copy code
admin:
  endpoint: dns:///flyte.dev.dap.warnermedia.com
SPARK_LOCAL_IP: 127.0.0.1
Also, can I move the extra spark_conf properties needed for S3 from the task decorator to .flyte/config.yaml, OS env variables, or a Spark config file? Right now they are here:
Copy code
@task(
    task_config=Spark(
        spark_conf={
...
            # The following is needed only when running the Spark task on a dev's local PC. Also run this locally first: export SPARK_LOCAL_IP="127.0.0.1"
            "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.2",
            "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
            "spark.hadoop.fs.s3a.access.key": "",
            "spark.hadoop.fs.s3a.secret.key": "",
            "spark.hadoop.fs.s3a.session.token": "",
        },

Samhita Alla

01/10/2023, 5:24 AM
Adding to Kevin's suggestion: take a look at step 4 (Spark operator) in the https://docs.flyte.org/en/latest/deployment/plugin_setup/k8s/index.html#deployment-plugin-setup-k8s guide. You can edit the Spark config in the relevant config file.
can I add SPARK_LOCAL_IP to .flyte/config.yaml?
I don't think so. You can, however, add it to the bash or zsh profile since it's an env variable.

Frank Shen

01/10/2023, 6:23 PM
@Kevin Su, thank you. Do I need to set up a local Flyte sandbox cluster first in order to add the default Spark config? I need to run the Spark task purely locally: pyflyte run (without --remote).

Kevin Su

01/10/2023, 6:26 PM
yes, that default config is used for running spark on k8s.

Frank Shen

01/10/2023, 6:26 PM
Thanks
@Samhita Alla, thank you.

Kevin Su

01/10/2023, 6:28 PM
You can create a spark-defaults.conf file and add it to your env; pyspark will use the default config in it. https://stackoverflow.com/a/71214326/9574775
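For example, a spark-defaults.conf along these lines; which keys you need depends on your setup, and the remaining credential properties from the task decorator could be added the same way:
```
# Placed in the conf/ dir of your Spark distribution or pyspark install
spark.jars.packages                           org.apache.hadoop:hadoop-aws:3.3.2
spark.hadoop.fs.s3a.aws.credentials.provider  org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
```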

Frank Shen

01/10/2023, 6:34 PM
@Kevin Su, RE
Copy code
The spark-defaults.conf file should be located in:

$SPARK_HOME/conf
Do you think the env var SPARK_HOME should be set to my venv's pyspark install location, i.e. ../env/lib/python3.8/site-packages/pyspark?

Kevin Su

01/10/2023, 6:40 PM
It should point to your Apache Spark distribution. If you don't have one downloaded, I think it's fine to create an arbitrary directory.

Frank Shen

01/10/2023, 6:46 PM
I do have Spark installed. However, when I run the Flyte Spark task via pyspark, it doesn't use the local Spark installation. I'm afraid that setting SPARK_HOME there will unexpectedly impact the current pyspark behavior. Do you know?
@Kevin Su

Kevin Su

01/10/2023, 6:49 PM
I'm not sure. Could you try running it and see if there is any unexpected behavior?

Frank Shen

01/10/2023, 6:49 PM
Will do.
@Kevin Su, I knew about this post but didn't try it because I was full of doubt, until you encouraged me. I just added conf/spark-defaults.conf to the pyspark install and it worked like a charm. Thanks a lot!
I didn’t set SPARK_HOME.
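For anyone reproducing this, one way to locate the venv's pyspark install dir (where the conf/ folder goes); this one-liner is a generic suggestion, not from the thread:
```python
import os
import pyspark

# Prints the pyspark install dir; spark-defaults.conf goes under <that dir>/conf/.
print(os.path.join(os.path.dirname(pyspark.__file__), "conf"))
```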

Ketan (kumare3)

01/11/2023, 2:11 PM
@Frank Shen can you please help with the docs for this? It would help many others.

Frank Shen

01/11/2023, 5:37 PM
I am trying to do that. @Ketan (kumare3) what is the mechanism for adding this knowledge?

Kevin Su

01/11/2023, 6:45 PM
@Frank Shen could you help update this doc?
Just open a PR to flytesnacks.

Frank Shen

01/11/2023, 6:59 PM
My solution involves passing the AWS account and credentials to the local Flyte install. I don't have a solution that would result in a PR for that. However, I do want to share my steps on a Flyte wiki page if you can point me to the right URL.
@Kevin Su

Yee

01/11/2023, 7:02 PM
(we may enable it at some point in the future, but we can do the work of porting this at that time)

Frank Shen

01/11/2023, 7:41 PM