# flyte-deployment
k
Hi all, working on setting up Flyte on GKE as part of a side project. Does anyone have a document on how to deploy the helm chart on a GKE cluster using the helm_release resource block of terraform? I'm trying to make sure all the changes are tracked via terraform so I can easily pull apart the infra after use.
v
I’m doing exactly the same thing right now. I’m not yet done with it, but we can rely on the GCP/GKE manual installation page for guidance: https://docs.flyte.org/en/v1.0.0/deployment/gcp/manual.html
You’ll need 7 components:
• IAM permissions
• TLS (cert-manager)
• ingress-nginx LB
• DNS (your provider of choice: GCP Cloud DNS, Cloudflare (has a terraform provider), etc.)
• CloudSQL database (PostgreSQL)
• GCS bucket
• Flyte itself
You’ll need 3 helm charts: flyte-core, ingress-nginx, cert-manager.
For the IAM permissions:
• google_service_account - service account
• google_project_iam_custom_role - role
• google_project_iam_member - attach role to service account
You may want to use for_each with these permissions based on the guide:
Copy code
locals {
  service_accounts = {
    flyteadmin = [
      "iam.serviceAccounts.signBlob",
      "storage.buckets.get",
      "storage.objects.create",
      "storage.objects.delete",
      "storage.objects.get",
      "storage.objects.getIamPolicy",
      "storage.objects.update",
    ],
    flytepropeller = [
      "storage.buckets.get",
      "storage.objects.create",
      "storage.objects.delete",
      "storage.objects.get",
      "storage.objects.getIamPolicy",
      "storage.objects.update",
    ],
    flytescheduler = [
      "storage.buckets.get",
      "storage.objects.create",
      "storage.objects.delete",
      "storage.objects.get",
      "storage.objects.getIamPolicy",
      "storage.objects.update",
    ],
    datacatalog = [
      "storage.buckets.get",
      "storage.objects.create",
      "storage.objects.delete",
      "storage.objects.get",
      "storage.objects.update",
    ],
    flyteworkflow = [
      "storage.buckets.get",
      "storage.objects.create",
      "storage.objects.delete",
      "storage.objects.get",
      "storage.objects.list",
      "storage.objects.update",
    ],
  }
}
The 3 helm charts:
• ingress-nginx - helm_release from https://kubernetes.github.io/ingress-nginx, chart name ingress-nginx (I use version 4.0.13)
• cert-manager - helm_release from https://charts.jetstack.io, chart name cert-manager, version v1.12.0. Note that cert-manager here is 1.12.0 instead of the 0.12.0 used in the documentation example; that’s because we need it to be compatible with newer versions of Kubernetes.
• flyte-core - helm_release from https://flyteorg.github.io/flyte, chart name flyte-core, with your preferred Flyte version; I use 1.7.0. Use the values from https://github.com/flyteorg/flyte/blob/master/charts/flyte-core/values-gcp.yaml
I recommend passing helm values using templatefile to allow dynamic configuration based on terraform values, here’s my example:
Copy code
values = templatefile("../infra-root-modules/helm-values/flyte.yaml", {
    project_id     = var.gcp_project
    db_host        = module.flyte-psql-instance[0].private_ip_address
    db_password    = sensitive(var.flyte_cluster_secrets["${var.environment}/flyte_sql_root_pw"]) # I use the carlpett sops provider for secrets, handle this however you prefer
    storage_bucket = module.flyte-storage[0].name
    host_name      = "flyte.${var.environment}.${var.domain}"
  })
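To answer the original helm_release question directly, here’s a rough sketch of how the flyte-core release itself might consume those values (the resource names, namespace, and the cert-manager release it depends on are assumptions from my setup, adjust to yours):
Copy code
resource "helm_release" "flyte_core" {
  name             = "flyte"
  repository       = "https://flyteorg.github.io/flyte"
  chart            = "flyte-core"
  version          = "1.7.0"
  namespace        = "flyte"
  create_namespace = true

  # helm_release expects a list of rendered values documents
  values = [
    templatefile("../infra-root-modules/helm-values/flyte.yaml", {
      project_id     = var.gcp_project
      db_host        = module.flyte-psql-instance[0].private_ip_address
      db_password    = sensitive(var.flyte_cluster_secrets["${var.environment}/flyte_sql_root_pw"])
      storage_bucket = module.flyte-storage[0].name
      host_name      = "flyte.${var.environment}.${var.domain}"
    })
  ]

  # cert-manager must be installed first, see the Certificate/CRD notes below
  depends_on = [helm_release.cert_manager]
}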
From my initial observation it seems that Flyte will automatically create the Certificate resource, but that’s a custom resource installed by cert-manager, so make sure you pass the helm value installCRDs: true to your cert-manager and have the flyte helm_release depends_on the cert-manager helm_release so it will be able to create the Certificate. You’ll also need to set up an Issuer first; in my case I prefer a ClusterIssuer because it lets you separate the cert-manager and flyte namespaces. Use the kubectl provider or the kubernetes_manifest resource with the kubernetes provider to create something like this (in my case I used a templatefile, so there are some placeholder values):
Copy code
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ${email}
    privateKeySecretRef:
      name: letsencrypt-production
    solvers:
    - selector: {}
      http01:
        ingress:
          class: ${ingress_class}
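If you go the kubernetes_manifest route, the wiring could look roughly like this (file path, resource name, and variable names are assumptions; also note that kubernetes_manifest validates against the cluster at plan time, so the cert-manager CRDs need to exist before this can be planned, which the kubectl provider is more forgiving about):
Copy code
resource "kubernetes_manifest" "letsencrypt_cluster_issuer" {
  # render the ClusterIssuer template above and decode it into an object
  manifest = yamldecode(templatefile("../infra-root-modules/manifests/cluster-issuer.yaml", {
    email         = var.acme_email
    ingress_class = "nginx"
  }))

  depends_on = [helm_release.cert_manager]
}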
If using GKE Autopilot, you’ll need to set this in the cert-manager values file as well (replace the placeholder):
Copy code
global:
  leaderElection:
    namespace: ${certmanager_namespace}
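For reference, a minimal sketch of what the cert-manager helm_release could look like with installCRDs and these values passed in (resource name, namespace, and values file path are assumptions):
Copy code
resource "helm_release" "cert_manager" {
  name             = "cert-manager"
  repository       = "https://charts.jetstack.io"
  chart            = "cert-manager"
  version          = "v1.12.0"
  namespace        = "cert-manager"
  create_namespace = true

  # Flyte's ingress needs the cert-manager CRDs (Certificate, ClusterIssuer) to exist
  set {
    name  = "installCRDs"
    value = "true"
  }

  # needed on GKE Autopilot so leases are created outside kube-system
  values = [
    templatefile("../infra-root-modules/helm-values/cert-manager.yaml", {
      certmanager_namespace = "cert-manager"
    })
  ]
}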
The leaderElection override lets cert-manager create leases in a namespace other than kube-system, because GKE Autopilot restricts access to the kube-system namespace. You will also probably need to configure at least 500m CPU requests in your flyte helm values.yaml, because Flyte uses pod anti-affinity, which requires a minimum of 500m CPU requests when using GKE Autopilot.
Your helm install will probably fail until DNS is set up; it seems necessary for the ingress to work, which Flyte also uses. So set up DNS however you like, I use Cloudflare for this.
Next there’s the CloudSQL database. Create a google_sql_database_instance, create a google_sql_database named ‘flyteadmin’, and create a google_sql_user. Pass the user’s name and the host IP address outputs from terraform to flyte using templatefile on your values.yaml file. It seems we can’t use DNS names here (or the connection name), at least not with private IPs, so I am using a static IP address for now.
Next there’s the GCS bucket, a simple google_storage_bucket.
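A rough sketch of those resources, assuming private-IP connectivity to the cluster’s VPC (names, tier, and region are placeholders, not my exact config; private IP also requires a service networking peering, not shown):
Copy code
resource "google_sql_database_instance" "flyte" {
  name             = "flyte-${var.environment}"
  database_version = "POSTGRES_14"
  region           = var.gcp_region
  settings {
    tier = "db-custom-1-3840"
    ip_configuration {
      ipv4_enabled    = false
      private_network = var.vpc_self_link # the VPC your GKE cluster runs in
    }
  }
  deletion_protection = false # easier to tear the side project down later
}

resource "google_sql_database" "flyteadmin" {
  name     = "flyteadmin"
  instance = google_sql_database_instance.flyte.name
}

resource "google_sql_user" "flyte" {
  name     = "flyteadmin"
  instance = google_sql_database_instance.flyte.name
  password = var.flyte_db_password
}

resource "google_storage_bucket" "flyte" {
  name                        = "${var.gcp_project}-flyte"
  location                    = "US"
  uniform_bucket_level_access = true
}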
I currently have everything set up except for the DNS; I’m connecting it at the moment. I’ll update here if there’s anything else worth mentioning about this process. Hope this helps
k
Thanks @Victor Churikov for the detailed writeup. I'm using the flyte-binary chart, but the steps should mostly be the same, and I think you mentioned the TLS (cert-manager) component to fix the certificate verification failure while the helm chart is being deployed, which makes sense.
I'm currently banging my head against fixing the TLS error; once that is sorted out, this will be an achievement for sure 🙌
Reading through the writeup now to figure out where I'm going wrong
v
I got it working like this:
Copy code
resource "cloudflare_record" "flyte" {
  zone_id  = var.cloudflare_zone_id
  name     = local.flyte_host
  value    = data.kubernetes_service.nginx-lb.status[0].load_balancer[0].ingress[0].ip
  type     = "A"
  ttl      = 3600
  priority = 10
  proxied  = false
}

data "kubernetes_service" "nginx-lb" {
  metadata {
    name      = "${module.nginx-ingress[0].release_name}-ingress-nginx-controller"
    namespace = module.nginx-ingress[0].namespace
  }
  depends_on = [module.nginx-ingress]
}
Note that I used modules wrapping the charts instead of the helm_release directly. You don’t need to do this; you can use the helm_release directly. This is something unique to my own use case because of other, unrelated requirements. The idea is that you add a data source for the ingress-nginx load balancer service, configured with the service name (which will have the release name as its prefix) and the namespace where you expect it to be. Then you can get its IP address using data.kubernetes_service.nginx-lb.status[0].load_balancer[0].ingress[0].ip
If using a ClusterIssuer, you have to make sure you configure the annotations of the ingress resources in the flyte helm values accordingly. Example from my flyte-core values.yaml templatefile:
Copy code
common:
  ingress:
    host: "{{ .Values.userSettings.hostName }}"
    tls:
      enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
      nginx.ingress.kubernetes.io/ssl-redirect: "true"
      cert-manager.io/cluster-issuer: "letsencrypt-production"
      nginx.ingress.kubernetes.io/whitelist-source-range: ${whitelisted_cidrs}
Use either the cert-manager.io/cluster-issuer or the cert-manager.io/issuer annotation depending on which kind you created with the kubernetes_manifest; it has to match.
k
Thanks Victor for this, although it's a bit over my head as I'm not that good at infra, certificates, and DNS; trying to wrap my head around it. I think it makes sense to add the cert-manager one as a manifest, as I was reading documentation on how to issue client TLS certificates for the k8s cluster
Since the helm chart for flyte-binary (which is a single-cluster setup) is failing during helm install via helm_release, this makes sense as the way to solve the 'Kubernetes cluster unreachable' error
Meanwhile I was able to get the helm CLI to deploy the flyte-binary chart without any values
And it's now waiting on the DB since I didn't set any values, so I'm checking how to do it via terraform to avoid exposing DB details
v
In the end I gave up on using GKE Autopilot because of this bug in GCP that makes it unsuitable for ML workflows: https://issuetracker.google.com/issues/227162588
More on the GCP IAM permissions: google_container_cluster should be configured with a workload_identity_config block like this:
Copy code
resource "google_container_cluster" "gke" {
...
  workload_identity_config {
    workload_pool = "${var.gcp_project}.svc.id.goog"
  }
...
}
Then create the service accounts for each flyte component by looping with for_each over the local map I shared above:
Copy code
resource "google_service_account" "flyte_sa" {
  for_each = local.service_accounts
  account_id   = each.key
  display_name = each.key
  project      = var.gcp_project
}
Add the custom role for each:
Copy code
resource "google_project_iam_custom_role" "flyte_role" {
  for_each = local.service_accounts
  title       = each.key
  project     = var.gcp_project
  permissions = each.value
  role_id = "${each.key}_${random_string.role_id_suffix.id}", #roles are not deleted immediately behind the scenes so name should be unique, use random_string resource to generate a suffix
}
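The random_string referenced above isn’t defined in the thread; a minimal version might look like this (length is arbitrary, and the characters must stay within what a role_id allows):
Copy code
resource "random_string" "role_id_suffix" {
  length  = 6
  special = false # role_id only allows letters, digits, underscores and periods
  upper   = false
}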
Bind the gcp roles to the gcp service accounts:
Copy code
resource "google_project_iam_member" "membership" {
  for_each = local.service_accounts
  project = var.gcp_project
  role    = google_project_iam_custom_role.flyte_role[each.key].name
  member  = "serviceAccount:${google_service_account.flyte_sa[each.key].email}"
}
Create bindings to allow kubernetes serviceaccounts (kind: ServiceAccount) to use workload identity permissions:
Copy code
resource "google_service_account_iam_member" "flyteworkflow_sa_binding" {
  for_each = toset(["development", "staging", "production"])
  service_account_id = google_service_account.flyte_sa["flyteworkflow"].id
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.gcp_project}.svc.id.goog[${each.key}/default]"
}
These all loop over the same map, so it’s a good idea to put them in a terraform module and do the for_each once on the module call; I’m not doing that here to keep the example simple. With this IAM setup, flyte works for me from terraform
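For reference, the module-call shape I mean is roughly this (the module path and inputs are hypothetical, whatever your wrapper module exposes):
Copy code
module "flyte_iam" {
  source   = "./modules/flyte-iam" # hypothetical module containing the SA, custom role, and bindings above
  for_each = local.service_accounts

  gcp_project = var.gcp_project
  name        = each.key
  permissions = each.value
}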
k
Makes sense, no issues. I finally found a way to get helm to work via terraform for deploying flyte-binary, as my data pipelines won't be that huge in number. But this is helpful, thanks Victor