# ask-the-community
f
@Yee, @Kevin Su, @Ketan (kumare3), I made it work in pyspark. However, it failed in flyte with this error:
```
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.io.IOException: Failed to connect to nyxmmedina01741.wmad.warnermedia.com/10.217.173.85:55808
```
k
interesting
did you set this - `conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.2')` in sparkConf?
f
yes
now pyspark code can read from S3, but the same code and config fail in a flyte task.
y
this is a local run?
why is it trying to connect to a live server? what’s on that dns?
k
metadata server?
f
local run, yes. Why it connects to wmad, I have no idea. In both cases, the same pyspark is invoked in the venv, and the two .py files are in the same dir.
pyspark is creating a SparkContext() directly:
```
conf = SparkConf()
```
flyte manages the spark context differently:
```
flytekit.current_context().spark_session
```
That's the difference.
k
interesting. session vs context?
f
```
flytekit.current_context()
```
The additional post-processing happens underneath that, and then `user_params.builder().add_attr` is what makes it available in the object returned by `current_context()`.
s
@Frank Shen, can you run the following command and try again?
```
export SPARK_LOCAL_IP="127.0.0.1"
```
f
@Samhita Alla, yay, your advice works! Thanks a lot!
@Samhita Alla, can I add SPARK_LOCAL_IP to .flyte/config.yaml?
```
admin:
  endpoint: dns:///flyte.dev.dap.warnermedia.com
SPARK_LOCAL_IP: 127.0.0.1
```
Also, can I move the extra spark_conf properties needed for S3 from the task decorator to .flyte/config.yaml, OS env variables, or a spark config file? Right now they are here:
```
@task(
    task_config=Spark(
        spark_conf={
            # ...
            # The following is needed only when running a spark task on a dev's local PC.
            # Also need to do this locally: export SPARK_LOCAL_IP="127.0.0.1"
            "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.2",
            "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
            "spark.hadoop.fs.s3a.access.key": "",
            "spark.hadoop.fs.s3a.secret.key": "",
            "spark.hadoop.fs.s3a.session.token": "",
        },
    )
)
```
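Hard-coding credentials in the decorator could be avoided with a small helper that pulls them from the standard AWS environment variables. This is a sketch, not part of the thread's solution; `build_local_s3_spark_conf` is a hypothetical name, and the s3a property names are the ones from the snippet above:

```python
import os


def build_local_s3_spark_conf(extra_conf=None):
    """Assemble a spark_conf dict for a local Spark task run.

    Hypothetical helper: reads temporary AWS credentials from the
    standard env variables instead of hard-coding them in @task.
    """
    conf = {
        "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.2",
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "spark.hadoop.fs.s3a.access.key": os.environ.get("AWS_ACCESS_KEY_ID", ""),
        "spark.hadoop.fs.s3a.secret.key": os.environ.get("AWS_SECRET_ACCESS_KEY", ""),
        "spark.hadoop.fs.s3a.session.token": os.environ.get("AWS_SESSION_TOKEN", ""),
    }
    if extra_conf:
        conf.update(extra_conf)
    return conf
```

The decorator would then become something like `@task(task_config=Spark(spark_conf=build_local_s3_spark_conf()))`, assuming the credentials are exported in the shell before `pyflyte run`.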
s
Adding to Kevin's suggestion: take a look at step 4 (spark operator) in the https://docs.flyte.org/en/latest/deployment/plugin_setup/k8s/index.html#deployment-plugin-setup-k8s guide. You can edit the spark config in the relevant config file.
> can I add SPARK_LOCAL_IP to .flyte/config.yaml?
I don't think so. You can, however, add it to the bash or zsh profile since it's an env variable.
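For example, persisting it in a bash profile could look like this (assuming bash; zsh users would append to `~/.zshrc` instead):

```shell
# Persist SPARK_LOCAL_IP for future shells (bash shown; use ~/.zshrc for zsh)
echo 'export SPARK_LOCAL_IP="127.0.0.1"' >> "$HOME/.bashrc"
# Apply it to the current shell as well
export SPARK_LOCAL_IP="127.0.0.1"
```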
f
@Kevin Su, thank you. Do I need to set up a local flyte sandbox cluster first in order to add the default spark config? I need to run the spark task purely locally: `pyflyte run` (without `--remote`).
k
yes, that default config is used for running spark on k8s.
f
Thanks
@Samhita Alla, thank you.
k
you can create a `spark-defaults.conf` file and add it to your env; pyspark will use the default config in it. https://stackoverflow.com/a/71214326/9574775
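For the s3a setup discussed above, the file might look something like this (the jars line mirrors the decorator config earlier in the thread; actual credentials are better kept in env variables than in this plaintext file):

```
# <Spark or pyspark install dir>/conf/spark-defaults.conf
spark.jars.packages                           org.apache.hadoop:hadoop-aws:3.3.2
spark.hadoop.fs.s3a.aws.credentials.provider  org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
```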
f
@Kevin Su, RE:
```
The spark-defaults.conf file should be located in:

$SPARK_HOME/conf
```
Do you think the env var SPARK_HOME should be set to my venv's pyspark install location, i.e. ../env/lib/python3.8/site-packages/pyspark?
k
it should point to your Apache Spark distribution. If you haven't downloaded it, I think it's fine to create an arbitrary directory.
f
I do have Spark installed. However, when I run a flyte spark task via pyspark, it doesn't use the system Spark install. I am afraid that setting SPARK_HOME there will unexpectedly impact the current pyspark behavior. Do you know?
@Kevin Su
k
I'm not sure. Could you try to run it and see if there is any unexpected behavior?
f
Will do.
@Kevin Su, I knew about this post but didn't try it because I was full of doubt, until you encouraged me. I just added conf/spark-defaults.conf to the pyspark install and it worked like a charm. Thanks a lot!
I didn’t set SPARK_HOME.
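The manual step described above could be scripted roughly like this; `write_spark_defaults` is a hypothetical helper, and in practice it would be pointed at the venv's pyspark package directory:

```python
import os


def write_spark_defaults(spark_home, properties):
    """Write conf/spark-defaults.conf under the given Spark (or pyspark
    package) directory. Hypothetical helper mirroring the manual steps
    described in the thread."""
    conf_dir = os.path.join(spark_home, "conf")
    os.makedirs(conf_dir, exist_ok=True)
    path = os.path.join(conf_dir, "spark-defaults.conf")
    with open(path, "w") as f:
        for key, value in properties.items():
            f.write(f"{key} {value}\n")
    return path


# In practice, point it at the venv's pyspark package, e.g.:
#   import pyspark
#   write_spark_defaults(os.path.dirname(pyspark.__file__), {...})
```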
k
@Frank Shen, can you please help update the docs for this? This would help many others.
f
I am trying to do that. @Ketan (kumare3) what is the mechanism for adding this knowledge?
k
@Frank Shen could you help update this doc
just open a PR to flytesnacks
f
My solution involves passing the AWS account and credentials to the flyte local install. I don't have a solution that would result in a PR for that. However, I do want to share my steps on a flyte wiki page if you can point me to the right URL.
@Kevin Su
y
(we may enable it at some point in the future, but we can do the work of porting this at that time)
f