# ask-the-community

Frank Shen

01/07/2023, 12:43 AM
@Yee, @Kevin Su, @Ketan (kumare3), I got it working in plain pyspark. However, it fails in Flyte with this error:
Copy code
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.io.IOException: Failed to connect to nyxmmedina01741.wmad.warnermedia.com/10.217.173.85:55808

Ketan (kumare3)

01/07/2023, 12:46 AM
interesting
did you set this - conf.set("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.2") in SparkConf?
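For reference, a minimal sketch of the plain-pyspark setup being discussed; the app name and S3 path are placeholders, not values from this thread:
```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Pull in the S3A connector so Spark can read s3a:// paths.
conf = SparkConf().setAppName("local-s3-read")  # placeholder app name
conf.set("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.2")

spark = SparkSession.builder.config(conf=conf).getOrCreate()
df = spark.read.parquet("s3a://my-bucket/some/prefix/")  # placeholder path
```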

Frank Shen

01/07/2023, 12:47 AM
Yes.
Plain pyspark code can now read from S3, but running the same code and config in a Flyte task fails.

Yee

01/07/2023, 12:59 AM
Is this a local run?
Why is it trying to connect to a live server? What's at that DNS name?

Ketan (kumare3)

01/07/2023, 12:59 AM
metadata server?

Frank Shen

01/07/2023, 1:01 AM
Local run, yes. Why it connects to wmad, I have no idea. In both cases the same pyspark is invoked in the venv, and the two .py files are in the same dir.
Plain pyspark creates a SparkContext() directly:
Copy code
conf = SparkConf()
Flyte manages the Spark context differently:
Copy code
flytekit.current_context().spark_session
That's the difference.
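To make the Flyte side concrete, a hedged sketch of a Spark task; the task name and bucket path are placeholders:
```python
import flytekit
from flytekit import task
from flytekitplugins.spark import Spark

@task(
    task_config=Spark(
        spark_conf={"spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.2"},
    ),
)
def read_counts() -> int:  # hypothetical task name
    # Inside a Spark task, flytekit exposes the session it built on the execution context.
    spark = flytekit.current_context().spark_session
    return spark.read.parquet("s3a://my-bucket/some/prefix/").count()  # placeholder path
```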

Ketan (kumare3)

01/07/2023, 1:12 AM
Interesting. Session vs context?

Frank Shen

01/07/2023, 1:21 AM
Copy code
flytekit.current_context()
The additional post-processing happens underneath that, and then
user_params.builder().add_attr
is what makes the Spark session available in the object returned by current_context().

Samhita Alla

01/09/2023, 6:58 AM
@Frank Shen, can you run the following command and try again?
Copy code
export SPARK_LOCAL_IP="127.0.0.1"
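If you would rather keep it in code than in the shell, the same variable can be set from Python before the Spark context starts; this is an alternative, not part of the suggestion above:
```python
import os

# Must run before the SparkContext / JVM is created in the same process.
os.environ["SPARK_LOCAL_IP"] = "127.0.0.1"
```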

Frank Shen

01/09/2023, 6:24 PM
@Samhita Alla, yay, your advice works! Thanks a lot!
@Samhita Alla, can I add SPARK_LOCAL_IP to .flyte/config.yaml?
Copy code
admin:
  endpoint: dns:///flyte.dev.dap.warnermedia.com
SPARK_LOCAL_IP: 127.0.0.1
Also, can I move the extra spark_conf properties needed for S3 from the task decorator to .flyte/config.yaml, OS env variables, or a Spark config file? Right now they are here:
Copy code
@task(
    task_config=Spark(
        spark_conf={
...
            # The following is needed only when running the Spark task on a dev's local PC. Also run this locally first: export SPARK_LOCAL_IP="127.0.0.1"
            "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.2",
            "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
            "spark.hadoop.fs.s3a.access.key": "",
            "spark.hadoop.fs.s3a.secret.key": "",
            "spark.hadoop.fs.s3a.session.token": "",
        },

Samhita Alla

01/10/2023, 5:24 AM
Adding to Kevin's suggestion: take a look at step 4 (Spark operator) in the https://docs.flyte.org/en/latest/deployment/plugin_setup/k8s/index.html#deployment-plugin-setup-k8s guide. You can edit the Spark config in the relevant config file.
can I add SPARK_LOCAL_IP to .flyte/config.yaml?
I don't think so. You can, however, add it to the bash or zsh profile since it's an env variable.

Frank Shen

01/10/2023, 6:23 PM
@Kevin Su, thank you. Do I need to set up a local Flyte sandbox cluster first in order to add the default Spark config? I need to run the Spark task purely locally: pyflyte run (without --remote).

Kevin Su

01/10/2023, 6:26 PM
yes, that default config is used for running spark on k8s.

Frank Shen

01/10/2023, 6:26 PM
Thanks
@Samhita Alla, thank you.

Kevin Su

01/10/2023, 6:28 PM
You can create a spark-defaults.conf file and add it to your env; pyspark will use the default config in it. https://stackoverflow.com/a/71214326/9574775
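For example, a spark-defaults.conf along these lines; which keys you need depends on your setup, and the remaining credential properties from the task decorator could be added the same way:
```
# Placed in the conf/ dir of your Spark distribution or pyspark install
spark.jars.packages                           org.apache.hadoop:hadoop-aws:3.3.2
spark.hadoop.fs.s3a.aws.credentials.provider  org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
```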

Frank Shen

01/10/2023, 6:34 PM
@Kevin Su, RE
Copy code
The spark-defaults.conf file should be located in:

$SPARK_HOME/conf
Do you think the env var SPARK_HOME should be set to my venv's pyspark install location, i.e. ../env/lib/python3.8/site-packages/pyspark?

Kevin Su

01/10/2023, 6:40 PM
It should point to your Apache Spark distribution. If you don't have one downloaded, I think it's fine to create an arbitrary directory.

Frank Shen

01/10/2023, 6:46 PM
I do have Spark installed. However, when I run the Flyte Spark task via pyspark, it doesn't use the local Spark installation. I'm afraid that setting SPARK_HOME there will unexpectedly impact the current pyspark behavior. Do you know?
@Kevin Su

Kevin Su

01/10/2023, 6:49 PM
I'm not sure. Could you try running it and see if there is any unexpected behavior?

Frank Shen

01/10/2023, 6:49 PM
Will do.
@Kevin Su, I knew about this post but didn't try it because I was full of doubt, until you encouraged me. I just added conf/spark-defaults.conf to the pyspark install and it worked like a charm. Thanks a lot!
I didn’t set SPARK_HOME.
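For anyone reproducing this, one way to locate the venv's pyspark install dir (where the conf/ folder goes); this one-liner is a generic suggestion, not from the thread:
```python
import os
import pyspark

# Prints the pyspark install dir; spark-defaults.conf goes under <that dir>/conf/.
print(os.path.join(os.path.dirname(pyspark.__file__), "conf"))
```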

Ketan (kumare3)

01/11/2023, 2:11 PM
@Frank Shen can you please help with the docs for this? It would help many others.

Frank Shen

01/11/2023, 5:37 PM
I am trying to do that. @Ketan (kumare3) what is the mechanism for adding this knowledge?

Kevin Su

01/11/2023, 6:45 PM
@Frank Shen could you help update this doc?
Just open a PR to flytesnacks.

Frank Shen

01/11/2023, 6:59 PM
My solution involves passing the AWS account and credentials to the local Flyte install. I don't have a solution that would result in a PR for that. However, I do want to share my steps on a Flyte wiki page if you can point me to the right URL.
@Kevin Su

Yee

01/11/2023, 7:02 PM
(we may enable it at some point in the future, but we can do the work of porting this at that time)

Frank Shen

01/11/2023, 7:41 PM