# flyte-deployment
j
Hello, do we have any reference for how to configure the Spark plugin + k8s connector on GKE? I can see only AWS-related docs here: https://docs.flyte.org/en/latest/deployment/plugins/k8s/index.html#deployment-plugin-setup-k8s
d
in this case, this is more related to the Helm chart you're using. If you used `deploy-flyte`, then it's `flyte-core`. You should then add the config specified in the docs for flyte-core to your `values-gcp-core.yaml` file and upgrade your Helm deployment (just by running `terraform apply`)
j
@David Espejo (he/him) Thanks for the response. Yes, i have used deploy-flyte to set up flyte-core. I am stuck at this step since the docs refer to AWS.
ah ok. keeping everything else as-is from the doc?
d
let me check, because the `spark-config-default` has some keys that are AWS-specific
j
yeah..
d
@Jegadesh Thirumeni try with the following
```yaml
spark-config-default:
  - spark.eventLog.enabled: "true"
  - spark.eventLog.dir: "{{ Values.userSettings.bucketName }}/spark-events"
  - spark.driver.cores: "1"
  - spark.executorEnv.HTTP2_DISABLE: "true"
  - spark.hadoop.fs.AbstractFileSystem.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
  - spark.kubernetes.allocation.batch.size: "50"
  - spark.kubernetes.driverEnv.HTTP2_DISABLE: "true"
  - spark.network.timeout: 600s
  - spark.executorEnv.KUBERNETES_REQUEST_TIMEOUT: 100000
  - spark.executor.heartbeatInterval: 60s
```
j
sure.
getting this
```
│ Error: template: flyte-core/templates/propeller/webhook.yaml:33:27: executing "flyte-core/templates/propeller/webhook.yaml" at <include (print .Template.BasePath "/propeller/configmap.yaml") .>: error calling include: template: flyte-core/templates/propeller/configmap.yaml:41:19: executing "flyte-core/templates/propeller/configmap.yaml" at <tpl (toYaml .) $>: error calling tpl: error during tpl function execution for "plugins:\n  spark:\n    spark-config-default:\n    - spark.eventLog.enabled: \"true\"\n    - spark.eventLog.dir: '{{ Values.userSettings.bucketName }}/spark-events'\n    - spark.driver.cores: \"1\"\n    - spark.executorEnv.HTTP2_DISABLE: \"true\"\n    - spark.hadoop.fs.AbstractFileSystem.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS\n    - spark.kubernetes.allocation.batch.size: \"50\"\n    - spark.kubernetes.driverEnv.HTTP2_DISABLE: \"true\"\n    - spark.network.timeout: 600s\n    - spark.executorEnv.KUBERNETES_REQUEST_TIMEOUT: 100000\n    - spark.executor.heartbeatInterval: 60s": parse error at (flyte-core/templates/propeller/webhook.yaml:5): function "Values" not defined
│
```
I think it's due to spark.eventLog.dir
d
sorry, missing dot at the beginning. It should be `.Values.userSettings.bucketName`
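Putting the fix together, the corrected block (same values as before, with the leading dot so Helm's `tpl` function can resolve `.Values`) would look like this:
```yaml
spark-config-default:
  - spark.eventLog.enabled: "true"
  - spark.eventLog.dir: "{{ .Values.userSettings.bucketName }}/spark-events"
  - spark.driver.cores: "1"
  - spark.executorEnv.HTTP2_DISABLE: "true"
  - spark.hadoop.fs.AbstractFileSystem.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
  - spark.kubernetes.allocation.batch.size: "50"
  - spark.kubernetes.driverEnv.HTTP2_DISABLE: "true"
  - spark.network.timeout: 600s
  - spark.executorEnv.KUBERNETES_REQUEST_TIMEOUT: 100000
  - spark.executor.heartbeatInterval: 60s
```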
j
trying...
getting some IAM permission issue
```
forbidden, Reason: "IAM", UserMessage: "Unable to generate access token; IAM returned 403 Forbidden: Permission 'iam.serviceAccounts.getAccessToken' denied on resource (or it may not exist).\nThis error could be caused by a missing IAM policy binding on the target IAM service account.\nFor more information, refer to the Workload Identity documentation:\n\thttps://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to\n", started at 2024-03-07 16:51:08.984515318 +0000 UTC m=+183187.278580063
[conn-id:22c33a1d8d8ce5a9 ip:172.16.0.61 pod:flytesnacks-development/fef289c19c0fb4703b69-n0-0-driver rpc-id:3396cac83aeffbf5] "/computeMetadata/v1/instance/service-accounts/flyte-gcp-flyteworkers@<projectid>.iam.gserviceaccount.com/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdevstorage.full_control" HTTP/403: generic::permission_denied: loading: GenerateAccessToken("flyte-gcp-flyteworkers@<projectid>.iam.gserviceaccount.com", ""): googleapi: Error 403: Permission 'iam.serviceAccounts.getAccessToken' denied on resource (or it may not exist).
```
d
is this trying to execute a Spark task?
j
d
ok, let's try a non-Spark task first. I want to rule out possible issues with IAM
this was specific to spark i guess
d
are you specifying the execution to use the `spark` Service Account? I mean, doing something like `pyflyte run --remote <your-workflow> --service-account=spark`?
j
no, i am just simply trying like this
```shell
pyflyte run --remote pyspark_pi.py my_spark
```
should i try with service account ?
d
yes please
j
getting no such option error
d
sorry, it's a positional argument, so it should go right after `--remote`, like `pyflyte run --remote --service-account=spark ...`
j
getting same error
d
can you confirm that the `spark` SA has the IAM role annotation?
```shell
kubectl describe sa spark -n <spark-operator-namespace>
```
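For context (not from this thread): with GKE Workload Identity, a correctly set-up Kubernetes service account carries an `iam.gke.io/gcp-service-account` annotation pointing at the Google service account it impersonates. A correctly annotated `spark` KSA might look roughly like this; the namespace is illustrative and the GSA name is taken from the error log above:
```yaml
# Illustrative KSA manifest; name, namespace, and GSA are placeholders to adapt.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: flytesnacks-development
  annotations:
    iam.gke.io/gcp-service-account: flyte-gcp-flyteworkers@<projectid>.iam.gserviceaccount.com
```
If the annotation is missing from the `kubectl describe sa` output, the token exchange against the metadata server will fail with exactly the 403 seen above.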
j
so `<spark-operator-namespace>` here should be the namespace where the spark operator is installed, right?
d
oh, that variable seems to point to the ns where you installed flyte, which is `flyte` on `deploy-flyte`, so:
```shell
kubectl describe sa spark -n flyte
```
j
okay, because my spark operator is in the `spark-operator` namespace whereas flyte is in the `flyte` namespace.
```
Error from server (NotFound): serviceaccounts "spark" not found
```
d
sorry, what about
```shell
kubectl get sa -n spark-operator
```
j
[image: output of `kubectl get sa -n spark-operator`]
my `values-core-gcp.yaml` looks like this https://gist.github.com/jegadesh-google/be2a44026525cb41a868462cb1cf384b and i used the commands below to set up the spark operator
```shell
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm install spark-operator spark-operator/spark-operator --namespace spark-operator --create-namespace
```
d
it's likely that the SA that the spark job uses is not part of the binding that the `iam.tf` module performs. One last check: isn't there a `spark` SA in the `flyte` ns?
```shell
kubectl get sa -n flyte
```
j
@David Espejo (he/him) i don't see a `spark` sa.
d
I think I'll have to reproduce this config and will let you know what I find
j
sure. that would be great @David Espejo (he/him)
any luck @David Espejo (he/him)
d
I'll let you know the results. I already have a base Flyte core env on GCP
j
awesome. thanks for your help. stuck on this part...
@David Espejo (he/him) I think this part is missing in the spark setup: https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#kubectl, as i am getting this `iam.serviceAccounts.getAccessToken` error.
but even after doing that, i am facing the same issue..
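For reference, the Workload Identity binding step from that doc is roughly the command below. This is a sketch, not verified against this exact setup: the GSA name comes from the error log earlier in the thread, and the `flytesnacks-development/spark` member assumes the KSA lives in that namespace; substitute your own project and names.
```shell
# Allow the KSA "spark" in namespace "flytesnacks-development" to impersonate
# the Google service account. <projectid> is a placeholder.
gcloud iam service-accounts add-iam-policy-binding \
  flyte-gcp-flyteworkers@<projectid>.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:<projectid>.svc.id.goog[flytesnacks-development/spark]"
```
Note that this binding alone is not sufficient if the KSA is missing entirely or lacks the `iam.gke.io/gcp-service-account` annotation, which matches what was found above.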
d
So @Jegadesh Thirumeni I think I made good progress on this. This is what I did so far:
1. Added a number of configuration options to the values file (some of them not covered in the docs yet). Attached is the file I used.
2. Updated `flyte.tf` to reference the new values file (optional, just in case you want to keep this config in a different file).
3. Ran `terraform apply` and verified that a `spark` SA is created in the `flytesnacks-development` namespace, that it's annotated with the GSA, and that a `spark-role` role is created in the same namespace.
4. Added the `"spark"` SA to this array in `iam.tf`: https://github.com/unionai-oss/deploy-flyte/blob/6a6765cd4cb92fad46bb4b6466edf8f5a766bbb4/environments/gcp/flyte-core/iam.tf#L2
5. Added the `iam.serviceAccounts.signBlob` permission to this role: https://github.com/unionai-oss/deploy-flyte/blob/6a6765cd4cb92fad46bb4b6466edf8f5a766bbb4/environments/gcp/flyte-core/iam.tf#L80
6. Saved and ran `terraform apply` again.
7. Completed the steps to use Artifact Registry and specified my repo name (like `registry=<my-repo>`) here.

As you may have noticed, I removed the ResourceQuota because the scheduler will fail admission if there are neither requests nor limits, and profiling the recommended base resources is a separate effort. With this, I don't have any permission issues as of now. I'm getting an annoying "No module found" type of error, but that's probably on the user side. Please let me know if this works for you, since an update to the docs is needed.
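To make the `iam.tf` change above concrete: it amounts to adding `"spark"` to the list of Kubernetes service accounts that receive the Workload Identity binding. The sketch below uses a hypothetical local name, since the actual variable name in the `deploy-flyte` module may differ:
```hcl
# Hypothetical sketch of the deploy-flyte iam.tf edit; the real local/variable
# name and the other entries in the list may differ in the actual module.
locals {
  flyte_ksas = ["flyteadmin", "flytepropeller", "datacatalog", "flyte-pod-webhook", "spark"]
}
```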
j
yes this helps. Thank you so much. I was able to move past this issue