# flyte-deployment
t
Hi! I'm just getting started with a test Flyte deployment on GCP, with the goal of running pyflyte with the remote flag to start. I'm using the automated GCP setup with Opta. Is it possible to deploy and run Flyte on GCP without providing a registered domain name? I'd like to try out the deployment but do not wish to register a domain at this stage unless absolutely necessary. I was able to deploy Flyte by commenting out the domain fields and disabling ingress in the config, but I was wondering whether at that stage I can port-forward something to get everything needed to run "pyflyte run --remote"?
k
Cc @Prafulla Mahindrakar ? @jeev ? @Sören Brunk ?
p
yes that should be possible if you just port-forward flyteadmin and use that endpoint in the pyflyte config.yaml eg:
admin:
  # For GRPC endpoints you might want to use dns:///flyte.myexample.com
  endpoint: dns:///port-forwarded-uri
  authType: Pkce
  insecure: true
j
will break forwarding to flyteconsole probably in case he's trying to monitor. our local sandbox uses an envoy proxy for this exact purpose.
p
yeah that's true, but we can monitor from flytectl as well. Jeev, do you have the steps with the envoy proxy documented somewhere that you can share?
👍 1
j
this is the config we use:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flyte-proxy
  labels:
    app: flyte-proxy
spec:
  selector:
    matchLabels:
      app: flyte-proxy
  template:
    metadata:
      labels:
        app: flyte-proxy
    spec:
      containers:
      - name: proxy
        image: envoyproxy/envoy:v1.21.1
        args:
        - envoy
        - -c /etc/envoy/config.yaml
        ports:
        - name: http
          containerPort: 8000
        volumeMounts:
        - name: config-volume
          mountPath: /etc/envoy
      volumes:
      - name: config-volume
        configMap:
          name: flyte-proxy-config
will probably want to drop the unnecessary stuff like minio, k8s dashboard, etc.
and you can just
kubectl port-forward deploy/flyte-proxy 8000
perhaps worth PR'ing this @Prafulla Mahindrakar? we don't use this on our prod deploys, but it's nice for sandboxes.
i think the current flyte sandbox just uses contour ingress controller instead though.
but that might not be ideal to deploy to the prod cluster
p
I think this can additionally be done after the opta deployment, and not used for prod deployments. This can be part of the docs once we get it working with opta-deployed flyte. For sandbox we already have contour and that should be sufficient for local trials.
t
@Prafulla Mahindrakar re:
yes that should be possible if you just port-forward flyteadmin and use that endpoint in the pyflyte config.yaml
I tried something similar (I believe) and got some RPC errors. First I did:
kubectl -n flyte port-forward service/flyteadmin 30081:81
Then I set my
.flyte/config.yaml
to:
admin:
  # For GRPC endpoints you might want to use dns:///flyte.myexample.com
  endpoint: dns:///localhost:30081
  authType: Pkce
  insecure: true
logger:
  show-source: true
  level: 0
Then I ran:
$ pyflyte run --remote core/flyte_basics/basic_workflow.py my_wf --a 5 --b hello
The error I got is in this snippet.
debug_error_string = "{"created":"@1658972857.863942000","description":"Error received from peer ipv6:[::1]:30081","file":"src/core/lib/surface/call.cc","file_line":904,"grpc_message":"failed to create a signed url. Error: unable to sign bytes: googleapi: Error 403: The caller does not have permission","grpc_status":2}"
I'm brand new here, so it's very possible I missed a setup step somewhere.
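For anyone reading along: the `debug_error_string` in the snippet above is JSON, so the underlying message can be pulled out programmatically. Here is a minimal Python sketch, assuming the string is JSON-formatted as shown here (other gRPC versions may format it differently). The helper name `grpc_error_summary` is made up for illustration, and the sample is copied from the error above.

```python
import json
from typing import Tuple


def grpc_error_summary(debug_error_string: str) -> Tuple[int, str]:
    """Return (grpc_status, grpc_message) from a JSON debug_error_string."""
    payload = json.loads(debug_error_string)
    return payload["grpc_status"], payload["grpc_message"]


# Sample copied from the error in this thread.
SAMPLE = (
    '{"created":"@1658972857.863942000",'
    '"description":"Error received from peer ipv6:[::1]:30081",'
    '"file":"src/core/lib/surface/call.cc","file_line":904,'
    '"grpc_message":"failed to create a signed url. Error: unable to sign bytes: '
    'googleapi: Error 403: The caller does not have permission",'
    '"grpc_status":2}'
)

status, message = grpc_error_summary(SAMPLE)
print(status, message)
```

Read this way, the failure points at a missing permission on the flyteadmin side (signing the upload URL) rather than at the port-forward itself.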
@jeev re:
will break forwarding to flyteconsole probably in case he's trying to monitor. our local sandbox uses a envoy proxy for this exact purpose.
1. I'm still learning. Can you describe why it breaks forwarding to flyteconsole? Naively, I did try to additionally forward flyteconsole along with the above using:
kubectl -n flyte port-forward service/flyteconsole 30080:80
I saw the console when I navigated to
localhost:30080/console
, but there was an error displayed. I'm curious why. Thank you.
2. Silly question, but once everything is deployed via Opta, how do you apply that envoy k8s config and layer in the envoy config.yaml?
p
Can you check if the service account for flyteadmin and the corresponding linked GCP service account have permission to create signed URLs?
t
@Prafulla Mahindrakar sure. I destroyed the env last night (EDT) before going to sleep. I'll re-deploy via opta right now and check that.
@Prafulla Mahindrakar Deployed. Hmm. Seems like the service accounts were never created after running:
opta apply -c flyte.yaml
. Tried re-running and no resources changed. The output says:
adminflyteaccount_service_account_email = "gsa-flyteadmin@<GCP_PROJECT>.iam.gserviceaccount.com"
adminflyteaccount_service_account_id = "gsa-flyteadmin"
bucket_id = "<NAME>-service-flyte"
bucket_name = "<NAME>-service-flyte"
datacatalogaccount_service_account_email = "gsa-datacatalog@<GCP_PROJECT>.iam.gserviceaccount.com"
datacatalogaccount_service_account_id = "gsa-datacatalog"
flytedevelopmentaccount_service_account_email = "gsa-development@<GCP_PROJECT>.iam.gserviceaccount.com"
flytedevelopmentaccount_service_account_id = "gsa-development"
flyteproductionaccount_service_account_email = "gsa-production@<GCP_PROJECT>.iam.gserviceaccount.com"
flyteproductionaccount_service_account_id = "gsa-production"
flytepropelleraccount_service_account_email = "gsa-flytepropeller@<GCP_PROJECT>.iam.gserviceaccount.com"
flytepropelleraccount_service_account_id = "gsa-flytepropeller"
flytescheduleraccount_service_account_email = "gsa-flytescheduler@<GCP_PROJECT>.iam.gserviceaccount.com"
flytescheduleraccount_service_account_id = "gsa-flytescheduler"
flytestagingaccount_service_account_email = "gsa-staging@<GCP_PROJECT>.iam.gserviceaccount.com"
flytestagingaccount_service_account_id = "gsa-staging"
However I don't see any of the
gsa-*
service accounts in my project IAM settings. I only see one new one:
opta-<NAME>-ep63@<GCP_PROJECT>.iam.gserviceaccount.com
🤔
p
hmm that's strange, did the opta apply complete? that seems odd since it should have created the gsa's.
adding opta folks on this thread @JD Palomino
t
It said it completed. The above is from the end of the run where it reports out everything.
Here's the output from the last step (the helm chart run).
Oh. I see:
Warnings:

- Applied changes may be incomplete

To see the full warning notes, run Terraform without -compact-warnings.
That may be related. Is there a way to re-run the Opta (or terraform directly) to give full warnings?
p
It seems it only applied the helm chart and did not recreate any env stuff. can you destroy the env and retry
t
env.yaml? flyte.yaml? or both?
p
both
t
Ok will do.
p
Also please capture the logs from opta runs so that the opta team can also have a look in case of issues.
👍 1
t
Here are the logs from me re-deploying env.yaml and flyte.yaml. I had to re-run once or twice due to internet connection dropouts. Same outcome. No
gsa-*
service accounts in my GCP IAM page like before. The output is using just
opta apply -c flyte.yaml
. If there is a way to get more verbose logs, I can re-run
Ah. I do see a
--detailed-plan
option. I'll destroy and re-apply flyte.yaml with that option to print more out.
☝️ update. 🤦 I see the service accounts now. I either did not properly refresh the page or checked at the wrong time. Disregard the opta issues above. I'll continue on checking service account permissions from here when I have a moment to poke. Apologies for the confusion!
p
ahh i see. np. check the bucket write permissions. Check if you have this permission: iam.serviceAccounts.signBlob. It's also mentioned in the manual deployment doc here https://docs.flyte.org/en/latest/deployment/gcp/manual.html#permissions
t
Interesting. It doesn't seem any of those Roles were created or applied. The account does have access to the buckets though.
I'm currently trying to see where
iam.serviceAccounts.signBlob
permission is provided to the flyteadmin service in the Opta configuration.
p
that permission is probably missing from opta since the signed url feature came in later. Do you mind trying to add a role that contains that permission and add
serviceAccount:gsa-flyteadmin@urbn-data-science.iam.gserviceaccount.com
as a member
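In IAM policy terms, that change amounts to one extra binding on the project policy. Here is a rough Python sketch of the shape, assuming the JSON layout that `gcloud projects get-iam-policy --format=json` prints. The custom role ID `flyteSignBlob` and the `<GCP_PROJECT>` placeholder are hypothetical, and the helper only edits a dict in memory; it does not call any Google API.

```python
import copy


def add_member(policy: dict, role: str, member: str) -> dict:
    """Return a copy of an IAM policy with `member` bound to `role`."""
    updated = copy.deepcopy(policy)
    bindings = updated.setdefault("bindings", [])
    for binding in bindings:
        if binding["role"] == role:
            # Role already bound: add the member only if it is missing.
            if member not in binding["members"]:
                binding["members"].append(member)
            return updated
    bindings.append({"role": role, "members": [member]})
    return updated


# Hypothetical names: a custom role that holds iam.serviceAccounts.signBlob.
ROLE = "projects/<GCP_PROJECT>/roles/flyteSignBlob"
MEMBER = "serviceAccount:gsa-flyteadmin@<GCP_PROJECT>.iam.gserviceaccount.com"

policy = {"bindings": [], "version": 1}
policy = add_member(policy, ROLE, MEMBER)
print(policy["bindings"])
```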
t
added. I'm able to successfully submit a run:
$ pyflyte run --remote core/flyte_basics/basic_workflow.py my_wf --a 5 --b hello
Go to http://localhost:30081/console/projects/flytesnacks/domains/development/executions/f75075fdeda774b358b4 to see execution in the console.
Naturally
localhost:30081
doesn't have the console UI for reasons discussed above (still curious why port forwarding 30080 doesn't work. that's for another time. 😉 ) But I can access the logs using flytectl
$ flytectl get execution -p flytesnacks -d development
 ---------------------- ---------------------------------------- -------------------------- ------------- -------- ---------------- -------------------------------- --------------- -------------------- ----------------------------------------------------------------------------------------------- 
| NAME                 | LAUNCH PLAN NAME                       | VERSION                  | TYPE        | PHASE  | SCHEDULED TIME | STARTED                        | ELAPSED TIME  | ABORT DATA (TRUNC) | ERROR DATA (TRUNC)                                                                            |
 ---------------------- ---------------------------------------- -------------------------- ------------- -------- ---------------- -------------------------------- --------------- -------------------- ----------------------------------------------------------------------------------------------- 
| f75075fdeda774b358b4 | core.flyte_basics.basic_workflow.my_wf | 5UFDB8TsvDDDvRjqYjRC5w== | LAUNCH_PLAN | FAILED |                | 2022-07-28T14:42:51.395613125Z | 34.973618876s |                    | |1/1] currentAttempt done. Last Error: USER::Pod failed. No message received from kubernetes.
|                      |                                        |                          |             |        |                |                                |               |                    | [f750                                                                                         |
 ---------------------- ---------------------------------------- -------------------------- ------------- -------- ---------------- -------------------------------- --------------- -------------------- ----------------------------------------------------------------------------------------------- 
1 rows
Still learning the tools, so I just used kubectl instead in order to see the pod error. Looks like a
storage.objects.get
permissions issue now.
p
ohh. can you try giving all the roles mentioned in the manual gcp doc
t
I'm wondering if that's because of perhaps missing workload identity step in that doc
p
workload identity is enabled by opta automatically for gcp so i doubt that could be an issue.
t
Ok. I was curious because I didn't see the mentioned role
roles/iam.workloadIdentityUser
in my project anywhere
trying out the roles. taking a moment since creating the roles from scratch.
p
can you check which service account the pod used and also describe that service account? With workload identity you should see an annotation on the service account which maps the gsa, and that should have the right roles
for your execution i think it should show up annotation for this
serviceAccount:gsa-development@urbn-data-science.iam.gserviceaccount.com
t
Would you be able to describe how to get that info?
I did a
k -n development describe pods f75075fdeda774b358b4-n0-0
and got this output. But didn't see a service account there.
p
kubectl get pod -n flytesnacks-development <name-of-exec-pod> -o yaml | grep service
And then use the name of the serviceAccount to do a get:
kubectl get sa -n flytesnacks-development <serviceAccount-name> -o yaml
t
Oh my mistake. I saw only a
default
service account on the k8s cluster and assumed I should have something different.
Output from the pod describe:
serviceAccount: default
  serviceAccountName: default
Output from service account describe:
$ k -n development get sa default -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    iam.gke.io/gcp-service-account: gsa-development@urbn-data-science.iam.gserviceaccount.com
  creationTimestamp: "2022-07-28T13:38:02Z"
  name: default
  namespace: development
  resourceVersion: "31160"
  uid: 0bda0bec-49e4-4b32-991d-7fd706315c77
secrets:
- name: default-token-zfswb
p
ok so the mapped gsa annotation looks correct .
iam.gke.io/gcp-service-account: gsa-development@urbn-data-science.iam.gserviceaccount.com
This particular gsa needs to have those storage roles
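The same check can be scripted. Here is a small Python sketch that reads the workload identity annotation out of a ServiceAccount manifest, assuming the JSON form of the output above (`kubectl get sa ... -o json`). The helper name `mapped_gsa` is made up, and the sample is a trimmed-down version of the manifest shown in this thread.

```python
import json
from typing import Optional

GSA_ANNOTATION = "iam.gke.io/gcp-service-account"


def mapped_gsa(service_account_json: str) -> Optional[str]:
    """Return the GCP service account a Kubernetes ServiceAccount maps to, if any."""
    manifest = json.loads(service_account_json)
    annotations = manifest.get("metadata", {}).get("annotations") or {}
    return annotations.get(GSA_ANNOTATION)


# Trimmed-down sample of the ServiceAccount shown above, in JSON form.
SAMPLE = json.dumps({
    "apiVersion": "v1",
    "kind": "ServiceAccount",
    "metadata": {
        "name": "default",
        "namespace": "development",
        "annotations": {
            GSA_ANNOTATION: "gsa-development@urbn-data-science.iam.gserviceaccount.com"
        },
    },
})

print(mapped_gsa(SAMPLE))
```

A `None` result would mean the Kubernetes service account has no workload identity mapping at all, which is a different problem from the mapped gsa lacking storage roles.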
t
Just finished adding the roles and applying them to the SAs. Will try re-running pyflyte
Bummer. No luck with all those roles and permissions applied manually. Same error as above.
I'll need to park it here to take care of some other things. Hopefully will get back to this experiment later today. 🙂
k
Would It be easier to get on a call
t
Yeah that may be easier. Unfortunately I have meetings most of rest of today (EDT) so I can circle back another time to coordinate. 👍
p
there's probably something wrong with the workload identity setup. You can try this tutorial from gcp, which lets you create a wi test pod and call into the metadata server to check if it's using the right identity: https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity You can try to use the default service account in the flytesnacks-development ns later, if you are able to test out the same in another test ns
Also, since we got past the signed url permissions on admin with your role changes, it might be worth checking what's different with the user pod service account and their roles. They should ideally work the same way.
t
I had a moment to poke this afternoon. Thanks for the link @Prafulla Mahindrakar. Here's the current state of affairs on my end:
1. 🟢 Confirmed the tutorial's workload identity process works. I walked through that tutorial and confirmed that I can see the WI-annotated GCP service account from inside the test pod:
$ kubectl exec -it pod/workload-identity-test   --namespace test-wi   -- /bin/bash
root@workload-identity-test:/# curl -H "Metadata-Flavor: Google" http://169.254.169.254/computeMetadata/v1/instance/service-accounts/
default/
test-wi-gsa@urbn-data-science.iam.gserviceaccount.com/
2. 🟢 Confirmed I can read the bucket, with a test pod, inside the
development
flyte namespace
Using this test spec, I was able to read the Flyte bucket in the container by executing:
gsutil ls gs://flyte-ts-temp-service-flyte
in the pod: Spec, flight-test.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: flyte-manual-test
  namespace: development
spec:
  containers:
    # - image: ghcr.io/flyteorg/flytekit:py3.8-1.0.3
    - image: google/cloud-sdk:slim
      name: flyte-manual-test
      command: ["sleep", "infinity"]
      resources:
        limits:
          cpu: 500m
          memory: 500Mi
        requests:
          cpu: 500m
          memory: 500Mi
Output:
root@workload-identity-test:/# gsutil ls gs://flyte-ts-temp-service-flyte
gs://flyte-ts-temp-service-flyte/metadata/
gs://flyte-ts-temp-service-flyte/t2/
This is using
gsa-development@urbn-data-science.iam.gserviceaccount.com/
as the GCP mapped SA.
3. 🔴 Unable to run pyflyte, or even the same spec, with the flytekit image. pyflyte runs still result in permissions errors like above. Another interesting note: if I swap the image in the above spec from
google/cloud-sdk:slim
to
ghcr.io/flyteorg/flytekit:py3.8-1.0.3
, I get the same permission errors. Output when using `ghcr.io/flyteorg/flytekit:py3.8-1.0.3` in flyte-test.yaml:
root@flight-test-ts-temp:~# gsutil ls gs://flyte-ts-temp-service-flyte
ServiceException: 401 Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket.
(same error) I confirmed the service account in that container is the same as above by hitting the endpoint shown in #1 (from the tutorial, but using python requests since curl isn't in that flytekit image):
root@flyte-manual-test:~# python3
>>> import requests
>>> r = requests.get("http://169.254.169.254/computeMetadata/v1/instance/service-accounts/", headers={"Metadata-Flavor": "Google"})
>>> print(r.content.decode())
default/
gsa-development@urbn-data-science.iam.gserviceaccount.com/
It also uses
gsa-development@urbn-data-science.iam.gserviceaccount.com/
I also pulled the spec from the pyflyte execution and attempted a manual
gsutil
on the running pod (entering with a sleep) and got the same error.
Will park here for now again since didn't have too much time to play again. 🙂
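Since curl isn't in the flytekit image, the identity check in #1 can also be done with only the Python standard library. This is a sketch under the same assumptions as the session above (same metadata URL and header); the function names are made up, and `fetch_accounts` only works from inside a GCP workload, so the example call runs the parser on the listing shown earlier.

```python
import urllib.request
from typing import List

METADATA_URL = (
    "http://169.254.169.254/computeMetadata/v1/instance/service-accounts/"
)


def parse_accounts(listing: str) -> List[str]:
    """Turn the metadata server's listing into names without trailing slashes."""
    return [line.rstrip("/") for line in listing.splitlines() if line.strip()]


def fetch_accounts(timeout: float = 2.0) -> List[str]:
    """Query the metadata server; only reachable from inside a GCP workload."""
    request = urllib.request.Request(
        METADATA_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_accounts(response.read().decode())


# Offline example using the listing shown above.
accounts = parse_accounts(
    "default/\ngsa-development@urbn-data-science.iam.gserviceaccount.com/\n"
)
print(accounts)
```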
s
@Tom Szumowski this might be related to standalone
gsutil
not being able to authenticate without additional config (compared to google-cloud-sdk installed gsutil). I ran into this a while ago and we actually had a thread here I can’t find anymore probably due to the Slack history limit.
Looks like the flytekit image installs standalone gsutil via pip.
Let me look for a Dockerfile with that additional configuration needed for gsutil.
Ok found it. This is what we do. The boto config is the important part.
RUN curl https://storage.googleapis.com/pub/gsutil.tar.gz | tar xfz - -C /opt && ln -s /opt/gsutil/gsutil /bin/gsutil
RUN echo '[GoogleCompute]\nservice_account = default' > /etc/boto.cfg # Required for gsutil to work with workload-identity
I guess it should also work with standalone gsutil installed via pip so perhaps try to derive an image from
ghcr.io/flyteorg/flytekit:py3.8-1.0.3
with the second line added to check if that makes any difference.
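To make clear what that second line is meant to produce: once the shell turns `\n` into a newline, `/etc/boto.cfg` is a two-line INI-style file. Here is a quick Python sketch that checks the intended content, assuming the standard library's `configparser` is close enough to how the boto config is read to serve as an illustration. The helper name `parse_boto_cfg` is made up.

```python
import configparser

# Intended content of /etc/boto.cfg once "\n" has become a real newline.
BOTO_CFG = "[GoogleCompute]\nservice_account = default\n"


def parse_boto_cfg(text: str) -> configparser.ConfigParser:
    """Parse INI-style boto config text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


cfg = parse_boto_cfg(BOTO_CFG)
print(cfg["GoogleCompute"]["service_account"])
```

If the derived image still fails, one thing worth checking is that the file really contains two lines (for example with `cat /etc/boto.cfg` in the running pod), since some shells leave `\n` literal in `echo`.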
t
@Sören Brunk interesting. Thank you for the advice. Will try it out when back on the laptop. So in your case, do you have a custom GCP deployment of Flyte that doesn't use (or extends from) the flytekit image?
k
@Sören Brunk did you not find it better to use
flytekitplugins-data-fsspec
and then install GCS for fsspec?
so @Tom Szumowski the storage layer is configurable
👍 1
s
@Tom Szumowski we have something similar to the manual deployment in the docs but with terraform for better integration into our existing infra. And yes we use a custom flytekit image as well, but mostly for historical reasons (the official one didn’t exist back then).
@Ketan (kumare3) I think we had an issue with fsspec back then which caused us to use gsutil again. Not sure if it’s still there. I have to try again with current flytekit.
t
@Sören Brunk success! 🎉 I used this Dockerfile
FROM ghcr.io/flyteorg/flytekit:py3.8-1.0.3

# Required for gsutil to work with workload-identity
RUN echo '[GoogleCompute]\nservice_account = default' > /etc/boto.cfg
Pushed it as image name:
gcr.io/urbn-data-science/flytekit-test-wrapper:latest
Then ran pyflyte:
pyflyte run --image gcr.io/urbn-data-science/flytekit-test-wrapper:latest --remote core/flyte_basics/basic_workflow.py my_wf --a 5 --b hello
And got a successful run on the GKE cluster:
$ flytectl get execution ffe50f73fa4564737bf6 -p flytesnacks -d development 
 ---------------------- ---------------------------------------- -------------------------- ------------- ----------- ---------------- -------------------------------- --------------- -------------------- -------------------- 
| NAME                 | LAUNCH PLAN NAME                       | VERSION                  | TYPE        | PHASE     | SCHEDULED TIME | STARTED                        | ELAPSED TIME  | ABORT DATA (TRUNC) | ERROR DATA (TRUNC) |
 ---------------------- ---------------------------------------- -------------------------- ------------- ----------- ---------------- -------------------------------- --------------- -------------------- -------------------- 
| ffe50f73fa4564737bf6 | core.flyte_basics.basic_workflow.my_wf | 00jRSrIIdnwryVi5J7STWw== | LAUNCH_PLAN | SUCCEEDED |                | 2022-07-28T23:52:37.007821370Z | 78.295864832s |                    |                    |
 ---------------------- ---------------------------------------- -------------------------- ------------- ----------- ---------------- -------------------------------- --------------- -------------------- --------------------
Thank you everyone in this thread for the fantastic support! I think with this tweak, that concludes the investigation for the original goal, i.e. getting flyte to run on GCP without a domain. The only open item on my end is to try out that envoy config @jeev provided in order to see the GUI (perhaps tomorrow).
🎉 3
To summarize some "issues" discovered:
1. Opta install did not appear to install the roles, permissions, and workload identity permissions -- I had to follow manual instructions for that to apply.
2. The line
'[GoogleCompute]\nservice_account = default'
is required in
/etc/boto.cfg
for the flytekit docker image to work. Otherwise the pod dies with a bucket permission error -- though since this is cloud-specific, I am guessing it should be handled somewhere outside the Dockerfile, maybe?
Do you all suggest I report these as GitHub issues? Or are they known already and a report not required?
k
Yes please
👍 1
Thank you Tom
t
No problem. This was fun!
🙏 1
j
@Tom Szumowski let me know if you can't get envoy working. it should work oob i think 🙂
t
@jeev success! Worked oob. Just needed to create the configmap, then deploy to flyte namespace and I was able to port-forward 8000. Thank you!
@jeev hmm. I may have celebrated too soon. I am able to access the console, but unable to get pyflyte to connect to flyteadmin and get a workflow submitted. I port-forward envoy with:
kubectl -n flyte port-forward deployment/flyte-proxy 30080:8000
I then set my
~/.flyte/config.yaml
to:
admin:
  # For GRPC endpoints you might want to use dns:///flyte.myexample.com
  endpoint: dns:///localhost:30080
...
And when I run I get the error:
"Non-auth RPC error <_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNIMPLEMENTED
...
(full trace attached) I can still get it to work if I separately port-forward flyteadmin with:
kubectl -n flyte port-forward service/flyteadmin 30081:81
and set config to
30081
. When you use this, are you able to access flyteadmin and the console with one port-forward?
j
yes a single port-forward should suffice
it's possible that the config is missing a necessary endpoint.
if you are able to browse your projects in flyteconsole, you are connected to flyteadmin as well.
let's just figure out the grpc endpoint and we should be good
t
Thanks. My guess is that in hacking away at the Opta config a bit, I may have inadvertently removed an endpoint. I'll poke around at some point but trust the envoy config is good to go, and in the meantime I can always keep that second port forwarded to test other things. Thanks again!
j
the envoy config might be missing an endpoint that i just haven't used yet. might be entirely possible.
alternatively if you already created an ingress, you can just refer to that
🙏 1