# flyte-support
s
@thankful-minister-83577, @glamorous-carpet-83516, @freezing-airport-6809, I made it working in pyspark by providing the correct hadoop-aws jar:
Copy code
conf = SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.2')
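For reference, a fuller sketch of this local pyspark setup, assuming a placeholder S3 path and the TemporaryAWSCredentialsProvider setting that shows up later in the thread:
Copy code
# Minimal local pyspark S3 read; the bucket/path below are placeholders.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.2')
conf.set('spark.hadoop.fs.s3a.aws.credentials.provider',
         'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')

spark = SparkSession.builder.config(conf=conf).getOrCreate()
df = spark.read.parquet('s3a://my-bucket/some/prefix/')  # placeholder path
df.show()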
However, it failed in flyte with error:
Copy code
23/01/06 16:42:33 ERROR SparkContext: Error initializing SparkContext.
java.io.IOException: Failed to connect to nyxmmedina01741.wmad.warnermedia.com/10.217.173.85:55808

py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.io.IOException: Failed to connect to nyxmmedina01741.wmad.warnermedia.com/10.217.173.85:55808
f
interesting
did you set this - conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.2') in SparkConf?
s
yes
Now the pyspark code can read from S3, but running the same code and config in a flyte task fails.
t
this is a local run?
why is it trying to connect to a live server? what’s on that dns?
f
metadata server?
s
Local run, yes. Why it connects to wmad, I have no idea. In both cases the same pyspark is invoked in the venv, and the two .py files are in the same dir.
pyspark is creating a SparkContext() directly:
Copy code
conf = SparkConf()
flyte manages spark context differently.
Copy code
flytekit.current_context().spark_session
That’s the difference.
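For comparison, a rough sketch of the Flyte side, assuming flytekitplugins-spark is installed; the task body and return value are illustrative only:
Copy code
# Illustrative Flyte Spark task: the session comes from flytekit,
# not from a hand-built SparkContext.
import flytekit
from flytekit import task
from flytekitplugins.spark import Spark

@task(
    task_config=Spark(
        spark_conf={
            "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.2",
        },
    ),
)
def my_spark_task() -> int:
    sess = flytekit.current_context().spark_session
    return sess.sparkContext.parallelize(range(10)).count()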
f
interesting. session vs context?
s
Copy code
flytekit.current_context()
the additional post-processing happens underneath that, and then user_params.builder().add_attr is what makes the spark session available in the object returned by current_context()
t
@salmon-refrigerator-32115, can you run the following command and try again?
Copy code
export SPARK_LOCAL_IP="127.0.0.1"
👍 1
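For a purely local run, that would look roughly like this (the workflow file and task name here are hypothetical):
Copy code
# Hypothetical file/task names; SPARK_LOCAL_IP and pyflyte run are from the thread.
export SPARK_LOCAL_IP="127.0.0.1"
pyflyte run my_spark_wf.py my_spark_task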
s
@tall-lock-23197, yay, your advice works! Thanks a lot!
🎉 1
@tall-lock-23197, can I add SPARK_LOCAL_IP to .flyte/config.yaml?
Copy code
admin:
  endpoint: dns:///flyte.dev.dap.warnermedia.com
SPARK_LOCAL_IP:
   127.0.0.1
Also, can I move the extra spark_conf properties needed for S3 from the task decorator to .flyte/config.yaml, OS env variables, or a spark config file? Right now they are here. I don’t want to do this for every task.
Copy code
@task(
    task_config=Spark(
        spark_conf={
...
            # The following is needed only when running spark task in dev's local PC. Also need to do this locally: export SPARK_LOCAL_IP="127.0.0.1"
            "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.2",
            "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
            "spark.hadoop.fs.s3a.access.key": "",
            "spark.hadoop.fs.s3a.secret.key": "",
            "spark.hadoop.fs.s3a.session.token": "",
        },
g
t
Adding to Kevin's suggestion: take a look at step 4 (spark operator) in the https://docs.flyte.org/en/latest/deployment/plugin_setup/k8s/index.html#deployment-plugin-setup-k8s guide. You can edit the spark config in the relevant config file.
can I add SPARK_LOCAL_IP to .flyte/config.yaml?
I don't think so. You can, however, add it to the bash or zsh profile since it's an env variable.
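For example, a sketch of persisting it in a zsh profile (bash users would use ~/.bashrc instead):
Copy code
# Persist the env variable so every local run picks it up.
echo 'export SPARK_LOCAL_IP="127.0.0.1"' >> ~/.zshrc
source ~/.zshrc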
s
@glamorous-carpet-83516, thank you. Do I need to set up a local flyte sandbox cluster first in order to add the default spark config? I need to run the spark task purely locally: pyflyte run (without --remote).
g
yes, that default config is used for running spark on k8s.
s
Thanks
@tall-lock-23197, thank you.
g
you can create a file spark-defaults.conf and add it to the env; pyspark will use the default config in it. https://stackoverflow.com/a/71214326/9574775
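As a sketch, the S3A properties from the @task snippet above could move into such a file (credential values left out, as in the original snippet):
Copy code
# Example spark-defaults.conf carrying the S3A settings from the task decorator.
spark.jars.packages                           org.apache.hadoop:hadoop-aws:3.3.2
spark.hadoop.fs.s3a.aws.credentials.provider  org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
# Fill in the credential values for your environment:
# spark.hadoop.fs.s3a.access.key      ...
# spark.hadoop.fs.s3a.secret.key      ...
# spark.hadoop.fs.s3a.session.token   ...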
s
@glamorous-carpet-83516, RE
Copy code
The spark-defaults.conf file should be located in:

$SPARK_HOME/conf
Do you think the env var SPARK_HOME should be set to my venv’s pyspark install location, i.e. ../env/lib/python3.8/site-packages/pyspark ?
g
it should point to your Apache Spark distribution. If you haven’t downloaded it, I think it’s fine to create an arbitrary directory.
s
I do have Spark installed. However, when I run the flyte spark task via pyspark, it didn’t use the local spark system. I am afraid that setting SPARK_HOME there will unexpectedly impact the current pyspark behavior. Do you know?
@glamorous-carpet-83516
g
I’m not sure. Could you try to run it and see if there is any unexpected behavior?
s
Will do.
@glamorous-carpet-83516, I knew about this post but didn’t try it because I was full of doubt, until you encouraged me. I just added conf/spark-defaults.conf to the pyspark install and it worked like a charm. Thanks a lot!
I didn’t set SPARK_HOME.
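A sketch of what that final setup looks like; the python one-liner is just one way to locate the pyspark install inside the venv:
Copy code
# Locate the pyspark package in the venv and drop the defaults file into its
# conf/ directory; no SPARK_HOME needed for this setup.
PYSPARK_DIR=$(python -c "import pyspark, os; print(os.path.dirname(pyspark.__file__))")
mkdir -p "$PYSPARK_DIR/conf"
cp spark-defaults.conf "$PYSPARK_DIR/conf/spark-defaults.conf"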
f
@salmon-refrigerator-32115 can you please help with the docs for this? It would help many others.
👍 1
s
I am trying to do that. @freezing-airport-6809 what is the mechanism for adding this knowledge?
g
@salmon-refrigerator-32115 could you help update this doc?
just open a PR to flytesnacks
s
My solution involves passing the AWS account and credentials to the flyte local install. I don’t have a solution that will result in a PR for that. However, I do want to share my steps in a flyte wiki page if you can point me to the right URL.
@glamorous-carpet-83516
t
(we may enable it at some point in the future, but we can do the work of porting this at that time)
s
@thankful-minister-83577 @glamorous-carpet-83516 shared https://github.com/flyteorg/flyte/discussions/3229
👍 2
❤️ 2