Something weird going on with our flytepropeller. ...
# flyte-deployment
a
Something weird going on with our flytepropeller. Will probably try to fully redeploy.
Got the log error below when trying to run a workflow that needs to access a secret. Propeller tries to get the webhook to do its work but it seems to think the hostname for the webhook is `flyte-backend-flyte-binary-webhook.flyte.svc`.
{
  "json": {
    "exec_id": "ap69gmldszgqd5xc94rk",
    "node": "n0",
    "ns": "...",
    "res_ver": "208312048",
    "routine": "worker-2",
    "tasktype": "python-task",
    "wf": "..."
  },
  "level": "error",
  "msg": "Failed to launch job, system error. err: Internal error occurred: failed calling webhook \"flyte-pod-webhook.flyte.org\": Post \"https://flyte-backend-flyte-binary-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s\": service \"flyte-backend-flyte-binary-webhook\" not found",
  "ts": "2023-02-22T18:26:12Z"
}
The `webhook` section of the `core.yaml` configmap for the propeller has
webhook:
  certDir: /etc/webhook/certs
  serviceName: flyte-pod-webhook
Previously, I ran the `flyte-binary` deployment. But I eventually tore it down.
I’ve restarted the k8s deployments but haven’t done any redeploy of the chart.
`flyte-backend-flyte-binary-webhook.flyte.svc` looks like the host names `kubefwd` uses and I did use that tool at one point so I wonder if there was some weird alchemy that got things mixed up.
d
@Alex Papanicolaou so you used `kubefwd` as a kinda-Ingress before?
a
yea, but we got the ingress worked out. I’m currently redeploying the helm chart
d
ok, did you uninstall the previous `flyte-binary` chart? We should try a clean install if possible
a
yea, doing a clean install. removing everything
wow, still getting the error message
did `helm uninstall`. Then went through and cleared out the namespaces
d
and I guess the service that the chart actually created is `flyte-pod-webhook.flyte.svc`, right?
a
`core.yaml` in the running pod:
/etc/flyte/config $ cat core.yaml 
manager:
  pod-application: flytepropeller
  pod-template-container-name: flytepropeller
  pod-template-name: flytepropeller-template
propeller:
  downstream-eval-duration: 30s
  enable-admin-launcher: true
  gc-interval: 12h
  kube-client-config:
    burst: 25
    qps: 100
    timeout: 30s
  leader-election:
    enabled: true
    lease-duration: 15s
    lock-config-map:
      name: propeller-leader
      namespace: flyte
    renew-deadline: 10s
    retry-period: 2s
  limit-namespace: all
  max-workflow-retries: 50
  metadata-prefix: metadata/propeller
  metrics-prefix: flyte
  prof-port: 10254
  queue:
    batch-size: -1
    batching-interval: 2s
    queue:
      base-delay: 5s
      capacity: 1000
      max-delay: 120s
      rate: 100
      type: maxof
    sub-queue:
      capacity: 1000
      rate: 100
      type: bucket
    type: batch
  rawoutput-prefix: s3://infima-flyte/raw/
  workers: 40
  workflow-reeval-duration: 30s
webhook:
  certDir: /etc/webhook/certs
  serviceName: flyte-pod-webhook
/etc/flyte/config $
Services are named right:
For more clarity, here is what I did:
1. `helm fetch --untar --untardir . flyteorg/flyte-core`
2. Made two changes to the `flyteadmin` and `clusterresourcesync` `deployment.yaml`. To get the cluster config working, I had to add these volume mounts to the spec:
{{- with .Values.flyteadmin.additionalVolumeMounts -}}
          {{ tpl (toYaml .) $ | nindent 10 }}
          {{- end }}
3. Ran through the deploy, i.e.
helm upgrade flyte \
    ./flyte-core \
    --install \
    --values values.yaml \
    --values values-eks.yaml \
    --values values-cluster-config.yaml \
    --values values-ingress.yaml \
    --create-namespace \
    --namespace flyte
Caveat: I start with the data plane and do some secret updating. But it’s basically that command
here’s something I’m noticing though
the tags in the repo, for instance `1.1.72` for `flyteadmin`, are not the same as the tags that I get when I download the chart. https://github.com/flyteorg/flyte/blob/master/charts/flyte-core/values.yaml
the chart has
flyteadmin:
  enabled: true
  # -- Replicas count for Flyteadmin deployment
  replicaCount: 1
  image:
    # -- Docker image for Flyteadmin deployment
    repository: cr.flyte.org/flyteorg/flyteadmin-release # FLYTEADMIN_IMAGE
    tag: v1.3.0 # FLYTEADMIN_TAG
    pullPolicy: IfNotPresent
I copied the values files from the repo to start with instead of starting with the ones in the downloaded chart.
well, that change didn’t make a difference. Updated the `values.yaml` to use the tags in the helm chart, so for instance
`cr.flyte.org/flyteorg/flytepropeller-release:v1.3.0`
instead of
`cr.flyte.org/flyteorg/flytepropeller:v1.1.62`
a
I’ve hit a dead end. I can’t figure out the source of the host name the propeller is using. I can’t parse the propeller code to understand where it’s coming up with it. The only source for that name is this helper in the binary helm chart; in the manifests, I don’t see that name anywhere. https://github.com/flyteorg/flyte/blob/d60c9af85a59ebb4c2265f76cb082b992078a309/charts/flyte-binary/templates/_helpers.tpl#L159
k
Cc @jeev
j
the repo is always gonna be ahead of the stable chart release @Alex Papanicolaou
it looks like you’re using the flyte-core chart but linking the flyte-binary chart. can you clarify?
ok i think i understand. you had installed the flyte-binary chart before, but now uninstalling it and installing flyte-core?
i think you need to delete the old mutating webhook from the flyte namespace
a
> ok i think i understand. you had installed the flyte-binary chart before, but now uninstalling it and installing flyte-core?
this is correct
> i think you need to delete the old mutating webhook from the flyte namespace
when I’ve been doing clean installs, i’ve completely wiped everything flyte related. the namespaces are all deleted.
j
hmm can you do `kubectl get mutatingwebhookconfigurations`
a
!
╰─❯ kubectl get mutatingwebhookconfigurations
NAME                                   WEBHOOKS   AGE
flyte-backend-flyte-binary-webhook     1          7d22h
flyte-pod-webhook                      1          65d
good catch.
j
can you delete the first one?
these aren’t namespaced
a
hmmm, perhaps a limitation of Lens.
j
it’s created by the flyte deployment. and not namespaced. easy to miss :)
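For anyone hitting the same error: the failure mode in this thread is a cluster-scoped MutatingWebhookConfiguration whose `clientConfig.service` points at a Service that no longer exists. Below is a minimal Python sketch of that check, assuming you feed it the parsed JSON from `kubectl get mutatingwebhookconfigurations -o json` plus a set of (namespace, name) pairs for the Services that exist. The function name `dangling_webhooks` and the sample data are hypothetical, modeled on the names seen above.

```python
from typing import Dict, List, Set, Tuple


def dangling_webhooks(
    webhook_configs: Dict, existing_services: Set[Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """Find webhooks whose backing Service is missing.

    webhook_configs: parsed JSON list of MutatingWebhookConfiguration objects.
    existing_services: (namespace, name) pairs of Services present in the cluster.
    Returns (configuration name, webhook name, "namespace/service") tuples.
    """
    dangling = []
    for config in webhook_configs.get("items", []):
        for hook in config.get("webhooks", []):
            service = hook.get("clientConfig", {}).get("service")
            if service is None:
                continue  # URL-based webhook, there is no Service to check
            key = (service["namespace"], service["name"])
            if key not in existing_services:
                dangling.append(
                    (config["metadata"]["name"], hook["name"], "/".join(key))
                )
    return dangling


# Hypothetical sample mirroring this thread: two configurations, but only the
# flyte-pod-webhook Service still exists after the flyte-binary teardown.
SAMPLE_CONFIGS = {
    "items": [
        {
            "metadata": {"name": "flyte-backend-flyte-binary-webhook"},
            "webhooks": [
                {
                    "name": "flyte-pod-webhook.flyte.org",
                    "clientConfig": {
                        "service": {
                            "namespace": "flyte",
                            "name": "flyte-backend-flyte-binary-webhook",
                        }
                    },
                }
            ],
        },
        {
            "metadata": {"name": "flyte-pod-webhook"},
            "webhooks": [
                {
                    "name": "flyte-pod-webhook.flyte.org",
                    "clientConfig": {
                        "service": {"namespace": "flyte", "name": "flyte-pod-webhook"}
                    },
                }
            ],
        },
    ]
}
SAMPLE_SERVICES = {("flyte", "flyte-pod-webhook")}

for config_name, hook_name, target in dangling_webhooks(SAMPLE_CONFIGS, SAMPLE_SERVICES):
    print(f"{config_name}: webhook {hook_name} -> missing service {target}")
```

A configuration flagged this way can then be removed with `kubectl delete mutatingwebhookconfiguration <name>`; no `-n` flag is needed because the resource is not namespaced.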
k
@Alex Papanicolaou this is unfortunate - but how can we avoid this in the future? It is sadly super easy to blame Flyte
j
can you add owner references to webhooks?
a
`helm uninstall` doesn’t seem to tear down everything
j
tricky though. we’d need to make the deployment the owner but the pod won’t know what the deployment is
a
I tried to track down things that I missed in the teardown. I ran `k get all -A` but clearly that wasn’t good enough.
It didn’t get everything
j
right
it’s not a helm owned resource unfortunately
a
well, I found a command that really gets everything.
kubectl api-resources --verbs=list -o name \
  | xargs -n 1 kubectl get --show-kind --ignore-not-found
Thanks for the help @jeev and @Ketan (kumare3)! Hopefully it’s a smooth flyte from here.
k
@Alex Papanicolaou also please help others in the future
a
Will do. Definitely becoming an expert on this.
d
I just ran into the very same problem and the solution worked for me. Probably the only missing piece would be to update the docs as they point to a different service name
`kubectl -n flyte port-forward service/flyte-binary 8088:8088 8089:8089`
@Alex Papanicolaou would you like to contribute that piece? I can assist you if you need
I mean, it should be
`kubectl -n flyte port-forward service/flyte-backend-flyte-binary 8088:8088 8089:8089`
j
@David Espejo (he/him) that name varies by the name of the helm release. in this case the helm release is named flyte-backend. it will likely be different from user to user.
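To illustrate why the service name differs from user to user, here is a small Python sketch. It assumes the flyte-binary chart follows the naming logic of the standard `helm create` fullname helper: the release name alone if it already contains the chart name, otherwise `<release>-<chart>`, truncated to 63 characters. That assumption is not confirmed in this thread beyond the `flyte-backend-flyte-binary` name observed above, and `helm_fullname` and `port_forward_command` are hypothetical helpers.

```python
def helm_fullname(release: str, chart: str, max_len: int = 63) -> str:
    """Mimic the default `helm create` fullname helper (an assumption here)."""
    name = release if chart in release else f"{release}-{chart}"
    name = name[:max_len]
    # The Helm helper trims a single trailing "-" left over after truncation.
    return name[:-1] if name.endswith("-") else name


def port_forward_command(release: str, namespace: str = "flyte") -> str:
    """Build the port-forward command from the docs for a given release name."""
    service = helm_fullname(release, "flyte-binary")
    return f"kubectl -n {namespace} port-forward service/{service} 8088:8088 8089:8089"


# Release named "flyte-backend", as in this thread.
print(port_forward_command("flyte-backend"))
# Release named after the chart itself, which would match the docs' command.
print(port_forward_command("flyte-binary"))
```

If the release name is unknown, `helm list -n flyte` shows it.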
a
@David Espejo (he/him) I was thinking of removing some key info from my repo and publishing it on github with a guide (that followed along the “AWS manual setup” guide in the old version of the docs). Also making a PR for this: https://flyte-org.slack.com/archives/C01P3B761A6/p1677172098125769?thread_ts=1676558819.041309&cid=C01P3B761A6
d
@Alex Papanicolaou that would be great! Please let us know if you require any help