# flyte-deployment

Jegadesh Thirumeni

03/07/2024, 10:06 AM
Hello, do we have any reference of how to configure Spark plugin + k8s connector on GKE ? i can see only aws related doc here https://docs.flyte.org/en/latest/deployment/plugins/k8s/index.html#deployment-plugin-setup-k8s

David Espejo (he/him)

03/07/2024, 3:53 PM
in this case, this is more related to the Helm chart you're using. If you used `deploy-flyte`, then it's `flyte-core`. You should then add the config specified in the docs for `flyte-core` to your `values-gcp-core.yaml` file and upgrade your Helm deployment (just by running `terraform apply`)

Jegadesh Thirumeni

03/07/2024, 4:01 PM
@David Espejo (he/him) Thanks for the response. Yes, I have used deploy-flyte to set up flyte-core. I am stuck at this step since the docs refer to AWS.

David Espejo (he/him)

03/07/2024, 4:02 PM

Jegadesh Thirumeni

03/07/2024, 4:04 PM
ah ok. And keep everything else as-is from the doc?

David Espejo (he/him)

03/07/2024, 4:06 PM
let me check, because the `spark-config-default` has some keys that are AWS-specific

Jegadesh Thirumeni

03/07/2024, 4:06 PM
yeah..

David Espejo (he/him)

03/07/2024, 4:26 PM
@Jegadesh Thirumeni try with the following:
```yaml
spark-config-default:
  - spark.eventLog.enabled: "true"
  - spark.eventLog.dir: "{{ Values.userSettings.bucketName }}/spark-events"
  - spark.driver.cores: "1"
  - spark.executorEnv.HTTP2_DISABLE: "true"
  - spark.hadoop.fs.AbstractFileSystem.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
  - spark.kubernetes.allocation.batch.size: "50"
  - spark.kubernetes.driverEnv.HTTP2_DISABLE: "true"
  - spark.network.timeout: 600s
  - spark.executorEnv.KUBERNETES_REQUEST_TIMEOUT: 100000
  - spark.executor.heartbeatInterval: 60s
```

Jegadesh Thirumeni

03/07/2024, 4:41 PM
sure.
getting this:
```
Error: template: flyte-core/templates/propeller/webhook.yaml:33:27: executing "flyte-core/templates/propeller/webhook.yaml" at <include (print .Template.BasePath "/propeller/configmap.yaml") .>: error calling include: template: flyte-core/templates/propeller/configmap.yaml:41:19: executing "flyte-core/templates/propeller/configmap.yaml" at <tpl (toYaml .) $>: error calling tpl: error during tpl function execution for "plugins:\n  spark:\n    spark-config-default:\n    - spark.eventLog.enabled: \"true\"\n    - spark.eventLog.dir: '{{ Values.userSettings.bucketName }}/spark-events'\n    - spark.driver.cores: \"1\"\n    - spark.executorEnv.HTTP2_DISABLE: \"true\"\n    - spark.hadoop.fs.AbstractFileSystem.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS\n    - spark.kubernetes.allocation.batch.size: \"50\"\n    - spark.kubernetes.driverEnv.HTTP2_DISABLE: \"true\"\n    - spark.network.timeout: 600s\n    - spark.executorEnv.KUBERNETES_REQUEST_TIMEOUT: 100000\n    - spark.executor.heartbeatInterval: 60s": parse error at (flyte-core/templates/propeller/webhook.yaml:5): function "Values" not defined
```
I think it's due to `spark.eventLog.dir`
d

David Espejo (he/him)

03/07/2024, 4:43 PM
sorry, missing dot at the beginning. It should be `.Values.userSettings.bucketName`
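With the dot added, the event-log entry should render like this (keeping the rest of the block unchanged):

```yaml
spark-config-default:
  - spark.eventLog.enabled: "true"
  - spark.eventLog.dir: "{{ .Values.userSettings.bucketName }}/spark-events"
```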

Jegadesh Thirumeni

03/07/2024, 4:44 PM
trying...
getting some IAM permission issue:
```
forbidden, Reason: "IAM", UserMessage: "Unable to generate access token; IAM returned 403 Forbidden: Permission 'iam.serviceAccounts.getAccessToken' denied on resource (or it may not exist).\nThis error could be caused by a missing IAM policy binding on the target IAM service account.\nFor more information, refer to the Workload Identity documentation:\n\thttps://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to\n", started at 2024-03-07 16:51:08.984515318 +0000 UTC m=+183187.278580063
[conn-id:22c33a1d8d8ce5a9 ip:172.16.0.61 pod:flytesnacks-development/fef289c19c0fb4703b69-n0-0-driver rpc-id:3396cac83aeffbf5] "/computeMetadata/v1/instance/service-accounts/flyte-gcp-flyteworkers@<projectid>.iam.gserviceaccount.com/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdevstorage.full_control" HTTP/403: generic::permission_denied: loading: GenerateAccessToken("flyte-gcp-flyteworkers@<projectid>.iam.gserviceaccount.com", ""): googleapi: Error 403: Permission 'iam.serviceAccounts.getAccessToken' denied on resource (or it may not exist).
```

David Espejo (he/him)

03/07/2024, 4:55 PM
is this trying to execute a Spark task?

Jegadesh Thirumeni

03/07/2024, 4:56 PM

David Espejo (he/him)

03/07/2024, 4:57 PM
ok, let's first try a non-Spark task. I want to rule out possible issues with IAM
this was specific to Spark, I guess

David Espejo (he/him)

03/07/2024, 5:18 PM
are you specifying the execution to use the `spark` Service Account?
I mean, doing something like `pyflyte run --remote <your-workflow> --service-account=spark`?

Jegadesh Thirumeni

03/07/2024, 5:20 PM
no, I am just simply trying like this:
```shell
pyflyte run --remote pyspark_pi.py my_spark
```
should I try with the service account?

David Espejo (he/him)

03/07/2024, 5:20 PM
yes please

Jegadesh Thirumeni

03/07/2024, 5:21 PM
getting a "no such option" error

David Espejo (he/him)

03/07/2024, 5:23 PM
sorry, it's an option to `run`, so it has to come before the workflow file, right after `--remote`, like
`pyflyte run --remote --service-account=spark ...`

Jegadesh Thirumeni

03/07/2024, 5:27 PM
getting the same error

David Espejo (he/him)

03/07/2024, 5:32 PM
can you confirm that the `spark` SA has the IAM role annotation?
`kubectl describe sa spark -n <spark-operator-namespace>`

Jegadesh Thirumeni

03/07/2024, 5:36 PM
so should all of these namespaces be the spark-operator namespace?

David Espejo (he/him)

03/07/2024, 5:37 PM
oh, that variable seems to point to the ns where you installed Flyte, which is `flyte` on `deploy-flyte`, so:
`kubectl describe sa spark -n flyte`
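For reference, on GKE with Workload Identity the Kubernetes service account is expected to carry an annotation pointing at the Google service account, roughly like this (the GSA and project names here are placeholders; match them to your deploy-flyte setup):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: flyte
  annotations:
    # <gsa-name> and <project-id> are placeholders
    iam.gke.io/gcp-service-account: <gsa-name>@<project-id>.iam.gserviceaccount.com
```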

Jegadesh Thirumeni

03/07/2024, 5:38 PM
okay. Because my Spark operator is in the `spark-operator` namespace, whereas Flyte is in the `flyte` namespace.
```
Error from server (NotFound): serviceaccounts "spark" not found
```

David Espejo (he/him)

03/07/2024, 5:42 PM
sorry, what about `kubectl get sa -n spark-operator`

Jegadesh Thirumeni

03/07/2024, 5:43 PM
(attached: image.png)
my `values-core-gcp.yaml` looks like this: https://gist.github.com/jegadesh-google/be2a44026525cb41a868462cb1cf384b and I used the commands below to set up the Spark operator:
```shell
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm install spark-operator spark-operator/spark-operator --namespace spark-operator --create-namespace
```

David Espejo (he/him)

03/07/2024, 6:07 PM
it's likely that the SA that the Spark job uses is not part of the binding that the `iam.tf` module performs. One last check: isn't there a `spark` SA in the `flyte` ns?
`kubectl get sa -n flyte`

Jegadesh Thirumeni

03/08/2024, 1:59 AM
@David Espejo (he/him) I don't see a `spark` SA.

David Espejo (he/him)

03/08/2024, 3:38 PM
I think I'll have to reproduce this config and will let you know what I find

Jegadesh Thirumeni

03/08/2024, 11:21 PM
sure, that would be great @David Espejo (he/him)
any luck @David Espejo (he/him)?

David Espejo (he/him)

03/11/2024, 10:32 PM
I'll let you know the results. I already have a base Flyte core env on GCP

Jegadesh Thirumeni

03/11/2024, 11:25 PM
awesome, thanks for your help. stuck on this part...
@David Espejo (he/him) I think this part is missing in the Spark setup: https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#kubectl since I am getting this error: `iam.serviceAccounts.getAccessToken`
but even after doing that, I am facing the same issue..
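(For context: the Workload Identity step linked above binds the Kubernetes SA to the Google SA with `roles/iam.workloadIdentityUser`, along these lines; all names below are placeholders to be replaced with your own project and service-account names:)

```shell
gcloud iam service-accounts add-iam-policy-binding \
  <gsa-name>@<project-id>.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:<project-id>.svc.id.goog[flytesnacks-development/spark]"
```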

David Espejo (he/him)

03/12/2024, 9:14 PM
So @Jegadesh Thirumeni I think I made good progress on this. This is what I did so far:
1. Added a number of configuration options to the values file (some of them not covered in the docs yet). Attached is the file I used.
2. Updated `flyte.tf` to reference the new values file (optional, just in case you want to keep this config in a different file).
3. Ran `terraform apply` and verified that a `spark` SA is created in the `flytesnacks-development` namespace, that it's annotated with the GSA, and that there's a `spark-role` role created in the same namespace.
4. Added the `"spark"` SA to this array in `iam.tf`: https://github.com/unionai-oss/deploy-flyte/blob/6a6765cd4cb92fad46bb4b6466edf8f5a766bbb4/environments/gcp/flyte-core/iam.tf#L2
5. Added the `iam.serviceAccounts.signBlob` permission to this role: https://github.com/unionai-oss/deploy-flyte/blob/6a6765cd4cb92fad46bb4b6466edf8f5a766bbb4/environments/gcp/flyte-core/iam.tf#L80
6. Saved and ran `terraform apply` again.
7. I had to complete the steps to use Artifact Registry and specify my repo name (like `registry=<my-repo>`) here.

As you may have noticed, I removed the ResourceQuota because the scheduler will fail admission if there are neither requests nor limits, and that's another effort: profiling the recommended base resources. With this, I don't have any permission issues as of now. I'm getting an annoying `No Module found` type of error, but that's probably on the user side. Please let me know if this works for you, as an update to the docs is needed

Jegadesh Thirumeni

03/13/2024, 12:13 PM
yes this helps. Thank you so much. Was able to move past this issue