# ask-the-community
a
Ok, I’m making some progress now, but I’m stuck on the pods having permissions to pull images from our GCP artifact registry. I went through the process of patching the default k8s service account, but I’m still getting this 403 when the tasks try to run.
```
task submitted to K8s

[ContainersNotReady|ContainerCreating]: containers with unready status: [f787e4fb7001d459f851-n0-0]|

[ContainersNotReady|ErrImagePull]: containers with unready status: [f787e4fb7001d459f851-n0-0]|rpc error: code = Unknown desc = failed to pull and unpack image "<location>-docker.pkg.dev/<project>/<repository>/<image_name>:MHon8F_9TgvC55qoS5mUtw..": failed to resolve reference "<location>-docker.pkg.dev/<project>/<repository>/<image_name>:MHon8F_9TgvC55qoS5mUtw..": failed to authorize: failed to fetch oauth token: unexpected status from GET request to https://<location>-docker.pkg.dev/v2/token?scope=repository%3A<project>%2F<repository>%2F<image_name>%3Apull&service=<location>-docker.pkg.dev: 403 Forbidden
```
Here are the patch steps I followed:
• Created a service account and downloaded the .json key file
• Created the registry secret and patched the default service account:
```
kubectl create secret docker-registry artifact-json-key \
  --docker-server=pkg.dev \
  --docker-username=_json_key \
  --docker-password=(cat artifact_auth.json | string collect) \
  --docker-email=<email>

kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "artifact-json-key"}]}'
```
(The `--docker-password` argument uses fish shell syntax.)
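Before digging into IAM, a quick sanity check on the patch itself can help (a sketch, not from the thread; assumes the default namespace):

```shell
# Confirm the default KSA actually references the pull secret:
# kubectl get serviceaccount default -o jsonpath='{.imagePullSecrets[*].name}'
# Expected output: artifact-json-key
#
# And confirm the secret decodes to a valid .dockerconfigjson:
# kubectl get secret artifact-json-key \
#   -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
```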
s
@Andrew can you take a look at this response on stackoverflow? https://stackoverflow.com/a/36286707
a
I did try the answer below that, doing it all with kubectl, and it didn’t end up working. Maybe I’ll have to give that one a try as well. But when I did it the kubectl way, I checked and the default service account did have the secret on there, so I’m not sure what’s off; I’m guessing it’s something weird with the way the secret was set up. Also, do you know if you’d have to manually patch the default service account under each namespace, e.g. for “flytesnacks.development”, “flytesnacks.staging”, etc.? Or is there a way to patch all of them and future ones? I tried patching manually for now to test, but with no luck.
s
> Maybe I’ll have to give that one a try as well.
Okay, let me know if it still results in a failure.
> Also, do you know if you’d have to manually patch the default service account under each namespace, for “flytesnacks.development”, “flytesnacks.staging” etc. as an example? Or if there’s a way to patch all of them and future ones? I tried manually for now to test the patch but with no luck.
I'm not so sure. @David Espejo (he/him) do you know if this is a possibility?
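For what it's worth, patching every project-domain namespace in one pass could look something like this (a sketch: the `<project>-<domain>` naming and the `artifact-json-key` secret name are taken from this thread, the domain list is an assumption, and the `kubectl` call is left commented so the loop dry-runs safely):

```shell
# Emit one namespace per project-domain pair, following Flyte's
# "<project>-<domain>" naming convention.
flyte_namespaces() {
  for p in "$@"; do
    for d in development staging production; do
      echo "${p}-${d}"
    done
  done
}

# Dry-run the patch over every namespace; uncomment kubectl to apply for real.
# Note the secret itself must also exist in each namespace.
for ns in $(flyte_namespaces flytesnacks); do
  echo "would patch serviceaccount default in ${ns}"
  # kubectl -n "$ns" patch serviceaccount default \
  #   -p '{"imagePullSecrets": [{"name": "artifact-json-key"}]}'
done
```

This still would not cover future namespaces automatically; those would need the loop re-run (or admission tooling outside the scope of this thread).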
d
@Andrew it could be that the GSA that the `default` KSA uses is missing a role. Could you try adding the `"artifactregistry.reader"` role to this list? Save, `terraform apply`, and report back please 🙂
a
It got an error saying that permission is not valid, I think because it’s a role, not a permission. I can try adding `"artifactregistry.dockerimages.get"`, though I’m not sure if that’s the only one it would need. Might need to add all the permissions under that role. Trying this now.
@David Espejo (he/him) Ok, I tried that, and it’s still getting a 403 error. Here are the individual permissions I added:
```
"artifactregistry.dockerimages.get",
"artifactregistry.dockerimages.list",
"artifactregistry.files.get",
"artifactregistry.files.list",
"artifactregistry.locations.get",
"artifactregistry.locations.list",
"artifactregistry.mavenartifacts.get",
"artifactregistry.mavenartifacts.list",
"artifactregistry.npmpackages.get",
"artifactregistry.npmpackages.list",
"artifactregistry.packages.get",
"artifactregistry.packages.list",
"artifactregistry.projectsettings.get",
"artifactregistry.pythonpackages.get",
"artifactregistry.pythonpackages.list",
"artifactregistry.repositories.downloadArtifacts",
"artifactregistry.repositories.get",
"artifactregistry.repositories.list",
"artifactregistry.repositories.listEffectiveTags",
"artifactregistry.repositories.listTagBindings",
"artifactregistry.repositories.readViaVirtualRepository",
"artifactregistry.tags.get",
"artifactregistry.tags.list",
"artifactregistry.versions.get",
#"artifactregistry.versions.list ",
```
For some reason just that last one caused a 400 error saying it was an invalid permission (possibly because of the trailing space inside the quotes), but all of these are listed under the role you mentioned.
after I terraform applied that, is there a way to verify permissions of the workers?
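One hedged way to answer the verification question: inspect the IAM bindings on the workers' GSA, or try an authenticated registry call with the key itself. The GSA email below follows the naming used later in this thread; substitute your own values:

```shell
# List every role bound to the flyteworkers GSA in the project:
# gcloud projects get-iam-policy <project> \
#   --flatten="bindings[].members" \
#   --filter="bindings.members:serviceAccount:flyte-gcp-flyteworkers@<project>.iam.gserviceaccount.com" \
#   --format="value(bindings.role)"
#
# Or exercise the permission directly with the downloaded key:
# gcloud auth activate-service-account --key-file=artifact_auth.json
# gcloud artifacts docker images list <location>-docker.pkg.dev/<project>/<repository>
```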
@Samhita Alla unfortunately I’m running into a bunch of issues trying that other method from Stack Overflow. It seems to be outdated now: several of the commands don’t work, and some flags no longer exist.
d
I'll be working to reproduce, isolate and fix this problem and will report back here
ok @Andrew I think I have it:
1. Just pushed a commit adding the permission that the `flyteworkers` GSA needs so you can use it to build and push images to Artifact Registry: https://github.com/davidmirror-ops/deploy-flyte/blob/8209aaf69c07cc1162af51c2a965b90477646337/environments/gcp/flyte-core/iam.tf#L123-L128 Add this, then do `terraform plan` and `terraform apply`.
2. The GKE cluster deployed by these modules doesn't use the default Compute Engine SA, so you'll need to:
   a. Generate a key for the `flyteworkers` GSA: `gcloud iam service-accounts keys create gcp-artifact.key --iam-account=flyte-gcp-flyteworkers@<your-project>.iam.gserviceaccount.com`
   b. Log in to Docker using the GSA: `cat gcp-artifact.key | docker login -u _json_key --password-stdin https://<region>-docker.pkg.dev`
   c. Complete the rest of the process described here, with a couple of nits:
      • add `--namespace flytesnacks-development` to the `kubectl create secret...` command
      • patch the `default` SA in the `flytesnacks-development` namespace
3. Your ImageSpec definition was rendering errors. Here is a description of how to specify the base image. In summary, I just tested it a couple of times using:
```python
misc_image_spec = ImageSpec(
    name="example-v2",
    base_image="ghcr.io/flyteorg/flytekit:py3.10-1.10.0",  # flytekit 1.10.0 on Python 3.10
    packages=["pendulum==2.1.2"],
    # apt_packages=["git"],
    registry="<region>-docker.pkg.dev/<project>/flyte",
)
```
Considering that you will have a `default` SA for every `project-domain` combination, I'm not sure this is the best approach, but I'm still getting familiar with the interesting IAM model on GCP.
a
Awesome, thank you so much! I can give this a try soon, hopefully tomorrow or early next week. So in summary, this method means I’d have to duplicate it for each `project-domain` combo, but there may be a way around that later?
d
You'd have to repeat the SA patching for every namespace, yes. I suspect there's a better way, but I'm still not sure.
I mean, using tokens instead of SA keys would be better from a security pov, but rotating them would have to be part of the process
a
Just a thought on a potential way to manage it across namespaces: would a custom pod template work for that? I’m not sure if that’s easily accessible in Flyte, but the image-pulling doc page mentions it, and it seems like it could work for all pods.
s
yes, you can use a custom pod template.
a
How does that fit into flyte, though? Is there a specific way you’re supposed to do that?
s
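Not from the thread, but a sketch of how that could fit together: flytepropeller's k8s plugin can apply a named `PodTemplate` to task pods via its `default-pod-template-name` setting. The template name, the noop image, and the secret name below are assumptions to adapt:

```shell
# Hypothetical PodTemplate carrying the pull secret; generated as a string so
# it can be reviewed before applying with kubectl.
pod_template_yaml=$(cat <<'EOF'
apiVersion: v1
kind: PodTemplate
metadata:
  name: flyte-default-template
template:
  metadata: {}
  spec:
    imagePullSecrets:
      - name: artifact-json-key
    containers:
      - name: default   # a container named "default" is merged into the task's primary container
        image: docker.io/rwgrim/docker-noop
EOF
)
echo "$pod_template_yaml"
# Assumed flow: create this PodTemplate in each Flyte namespace, then set
# plugins.k8s.default-pod-template-name: flyte-default-template in the
# flytepropeller configuration.
```

This still leaves the per-namespace creation problem, but it removes the need to patch the `default` KSA itself.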
a
@David Espejo (he/him) I should have time to look at this tomorrow, I’ve been a bit delayed by other things. Just wanted to see if you’ve had any new thoughts, or if I should move forward with this new terraform and other steps for now?
Quick update on this. I tried it out, and in the little log window in the top right it got this
```
12/5/2023 4:44:58 AM UTC task submitted to K8s

12/5/2023 4:44:58 AM UTC Scheduling

12/5/2023 4:44:58 AM UTC [ContainersNotReady|ContainerCreating]: containers with unready status: [f036bb5e57f094cc3883-n0-0]|
```
And in the task logs it got this:
```
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "pyflyte-fast-execute": executable file not found in $PATH: unknown
```
So it looks like it may have gotten past the 403 error, but I’m not exactly sure how to interpret these errors.
s
could you share your imagespec definition? looks like the image doesn't have flytekit installed.
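A quick local check along those lines (not from the thread; `<image>` is a placeholder for whatever tag ImageSpec pushed):

```shell
# Verify the built image actually contains flytekit's fast-execute entrypoint:
# docker run --rm --entrypoint sh <image> -c 'command -v pyflyte-fast-execute'
# Empty output means flytekit isn't installed in the image, e.g. because the
# base_image isn't a flytekit image and "flytekit" isn't listed in packages.
```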
a
ah, gotcha, yeah I can in the morning, and i’ll see if I changed something with that
d
I saw that problem too whenever I ran `pyflyte` outside an active venv
a
Ok, it’s working now! I’ll use that solution for now and see if I can get my own workflow moved over
d
@Andrew also, a slight change in the process:
• The TF modules now create a GSA `"${local.name_prefix}-registrywriter"` that you or a CI system can use to get a token and do a `docker login` as described here
• The workers now have the `artifactregistry.reader` role to be able to pull images
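A hedged sketch of that token-based login (the exact GSA email depends on your `name_prefix`; an assumed value is shown):

```shell
# Log in to Artifact Registry with a short-lived access token instead of a
# long-lived JSON key:
# gcloud auth print-access-token \
#   --impersonate-service-account=flyte-gcp-registrywriter@<project>.iam.gserviceaccount.com \
#   | docker login -u oauth2accesstoken --password-stdin https://<region>-docker.pkg.dev
```

Tokens expire after roughly an hour, so as noted above, refreshing them has to be part of the CI process.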
a
@David Espejo (he/him) I have some clarification questions about the notes found here. Number 2 says to include permissions for each of those, so when I create the GSA that includes any permissions I need inside my tasks, will I also need to include flyteadmin and flytepropeller permissions? And when I bind the KSA (`default`) to the GSA (custom, created by me), will I have to do that again in each namespace, for each project-environment pair?
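For reference, the KSA-to-GSA Workload Identity binding usually looks like this (a sketch, one binding per namespace; the names follow this thread's conventions):

```shell
# Allow the "default" KSA in flytesnacks-development to act as the workers GSA
# (repeat the --member for each project-domain namespace):
# gcloud iam service-accounts add-iam-policy-binding \
#   flyte-gcp-flyteworkers@<project>.iam.gserviceaccount.com \
#   --role roles/iam.workloadIdentityUser \
#   --member "serviceAccount:<project>.svc.id.goog[flytesnacks-development/default]"
#
# Then annotate the KSA in that namespace:
# kubectl -n flytesnacks-development annotate serviceaccount default \
#   iam.gke.io/gcp-service-account=flyte-gcp-flyteworkers@<project>.iam.gserviceaccount.com
```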
d
so, you need to create an additional GSA?
a
Well, I may be misunderstanding. Did the TF already create a GSA for workers (code running inside tasks) where I can set any required permissions I need, and that will already be set up with Workload Identity?
I think I may have misunderstood. I think I’m seeing now that the flyteworkers GSA is what runs in there so I can add permissions to that
d
Right, the flyteworkers GSA already carries the minimum permissions required by the workers. You can add more if needed. Does that answer your question?
a
Yes! Sorry for the confusion