
Eric Hsiao

08/12/2022, 9:09 PM
Hi all - having an issue injecting secrets from k8s into my task. Here's how I've set it up. I created a test secret like this to try and see if a task can read it
kubectl -n flytesnacks-development create secret generic common-secrets --from-literal=TEST_SECRET=blah
kubectl -n flytesnacks-development get secret/common-secrets -o json | jq '.data | to_entries | map(.value= (.value | @base64d))'
>> [
  {
    "key": "TEST_SECRET",
    "value": "blah"
  }
]
Within the flytesnacks-development namespace, I'm running a task like this
@task(secret_requests=[Secret(group="common-secrets", key="TEST_SECRET")])
def print_secret() -> str:
    secrets = current_context().secrets
    return secrets.get("common-secrets", "TEST_SECRET")
However, this fails with the following
Unable to find secret for key TEST_SECRET in group common-secrets in Env Var:_FSEC_COMMON-SECRETS_TEST_SECRET and FilePath: /root/secrets/common-secrets/test_secret
From looking at the code in the SecretManager, it looks like it only checks the ENV variable or a file path (which does not exist because I'm using k8s secrets as an additional provider). Am I missing something? I've checked that the k8s secret is in the same namespace as the task being run
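For reference, a minimal sketch of the two lookup locations named in that error. The naming rules here (an `_FSEC_` prefix with upper-cased group and key, and a lower-cased `<dir>/<group>/<key>` file path) are inferred from the error text above, not copied from flytekit, and `secret_lookup_locations` is a hypothetical helper.

```python
import posixpath
from typing import Tuple


def secret_lookup_locations(
    group: str,
    key: str,
    default_dir: str = "/etc/secrets",
    env_prefix: str = "_FSEC_",
) -> Tuple[str, str]:
    """Return the (env var name, file path) pair suggested by the error message.

    Assumption: these rules are inferred from the error text in this thread,
    not taken from flytekit's source.
    """
    env_var = f"{env_prefix}{group.upper()}_{key.upper()}"
    file_path = posixpath.join(default_dir, group.lower(), key.lower())
    return env_var, file_path


# The error above reports /root/secrets as the secrets directory.
env_var, file_path = secret_lookup_locations("common-secrets", "TEST_SECRET", "/root/secrets")
print(env_var)
print(file_path)
```

If neither the environment variable nor the file is present inside the task container, the lookup fails with the message shown above.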

Ketan (kumare3)

08/12/2022, 11:11 PM
Hmm the secret should be Injected by the Flyte secrets injector

Eric Hsiao

08/15/2022, 1:42 AM
Do you know how I can check when / how that runs? It says in the docs that it should be injected to the pod when the task starts but that doesn't seem to be the case

Eric Hsiao

08/15/2022, 2:41 PM
Yes that's the document I'm using. I checked the webhook pod and things seem to be configured correctly
➜  kubectl logs pod/flyte-pod-webhook-595f7b6858-7qxdt -n flyte
time="2022-08-12T00:31:32Z" level=info msg=------------------------------------------------------------------------
time="2022-08-12T00:31:32Z" level=info msg="App [flytepropeller], Version [unknown], BuildSHA [unknown], BuildTS [2022-08-12 00:31:32.456181799 +0000 UTC m=+0.033941479]"
time="2022-08-12T00:31:32Z" level=info msg=------------------------------------------------------------------------
time="2022-08-12T00:31:32Z" level=info msg="Detected: 4 CPU's\n"
{"metrics-prefix":"flyte:","certDir":"/etc/webhook/certs","localCert":false,"listenPort":9443,"serviceName":"flyte-pod-webhook","servicePort":443,"secretName":"flyte-pod-webhook","secretManagerType":"K8s","awsSecretManager":{"sidecarImage":"docker.io/amazon/aws-secrets-manager-secret-sidecar:v0.1.4","resources":{"limits":{"cpu":"200m","memory":"500Mi"},"requests":{"cpu":"200m","memory":"500Mi"}}},"vaultSecretManager":{"role":"flyte","kvVersion":"2"}}

Ketan (kumare3)

08/15/2022, 3:17 PM
Hmm interesting, so let's try to look at one of the launched pods - cc @Yee you recently saw this. What annotation should we look for?

Eric Hsiao

08/15/2022, 4:57 PM
Here's the output for describing a task pod that starts up
➜  Documents kb describe pod a2wf2sqs2bfw6qr4l92d-n0-0 -n flytesnacks-development
Name:         a2wf2sqs2bfw6qr4l92d-n0-0
Namespace:    flytesnacks-development
Priority:     0
Node:         ip-10-15-147-196.ec2.internal/10.15.147.196
Start Time:   Mon, 15 Aug 2022 12:52:24 -0400
Labels:       domain=development
              execution-id=a2wf2sqs2bfw6qr4l92d
              inject-flyte-secrets=true
              interruptible=false
              node-id=n0
              project=flytesnacks
              shard-key=13
              task-name=flyte-workflows-hello-world-print-secret
              workflow-name=flyte-workflows-hello-world-wf
Annotations:  cluster-autoscaler.kubernetes.io/safe-to-evict: false
              flyte.secrets/s0: m4zg54lqhiqcey2pnvww52rnonswg3tforzsectlmv3tuibclfavercjl4jukq1sivkcecq
              kubernetes.io/psp: eks.privileged
Status:       Succeeded
IP:           10.15.128.66
IPs:
  IP:           10.15.128.66
Controlled By:  flyteworkflow/a2wf2sqs2bfw6qr4l92d
Containers:
  a2wf2sqs2bfw6qr4l92d-n0-0:
    Container ID:  docker://7e388ada4103f5fa7c5a0c1a673b3c61f79c441bcede9b1798bed8f4db128e65
    Image:         XXX
    Image ID:      XXX 
    Port:          <none>
    Host Port:     <none>
    Args:
      pyflyte-execute
      --inputs
      s3://predictap-tyson-flyte/metadata/propeller/flytesnacks-development-a2wf2sqs2bfw6qr4l92d/n0/data/inputs.pb
      --output-prefix
      s3://predictap-tyson-flyte/metadata/propeller/flytesnacks-development-a2wf2sqs2bfw6qr4l92d/n0/data/0
      --raw-output-data-prefix
      s3://predictap-tyson-flyte/jh/a2wf2sqs2bfw6qr4l92d-n0-0
      --checkpoint-path
      s3://predictap-tyson-flyte/jh/a2wf2sqs2bfw6qr4l92d-n0-0/_flytecheckpoints
      --prev-checkpoint
      ""
      --resolver
      flytekit.core.python_auto_container.default_task_resolver
      --
      task-module
      flyte.workflows.hello_world
      task-name
      print_secret
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 15 Aug 2022 12:52:25 -0400
      Finished:     Mon, 15 Aug 2022 12:52:28 -0400
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1000Mi
    Requests:
      cpu:     1
      memory:  1000Mi
    Environment:
      FLYTE_INTERNAL_EXECUTION_WORKFLOW:  flytesnacks:development:flyte.workflows.hello_world.wf
      FLYTE_INTERNAL_EXECUTION_ID:        a2wf2sqs2bfw6qr4l92d
      FLYTE_INTERNAL_EXECUTION_PROJECT:   flytesnacks
      FLYTE_INTERNAL_EXECUTION_DOMAIN:    development
      FLYTE_ATTEMPT_NUMBER:               0
      FLYTE_INTERNAL_TASK_PROJECT:        flytesnacks
      FLYTE_INTERNAL_TASK_DOMAIN:         development
      FLYTE_INTERNAL_TASK_NAME:           flyte.workflows.hello_world.print_secret
      FLYTE_INTERNAL_TASK_VERSION:        v0.0.5
      FLYTE_INTERNAL_PROJECT:             flytesnacks
      FLYTE_INTERNAL_DOMAIN:              development
      FLYTE_INTERNAL_NAME:                flyte.workflows.hello_world.print_secret
      FLYTE_INTERNAL_VERSION:             v0.0.5
      AWS_STS_REGIONAL_ENDPOINTS:         regional
      AWS_DEFAULT_REGION:                 us-east-1
      AWS_REGION:                         us-east-1
      AWS_ROLE_ARN:                       XXX
      AWS_WEB_IDENTITY_TOKEN_FILE:        /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7kr64 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  kube-api-access-7kr64:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s

Ketan (kumare3)

08/15/2022, 10:34 PM
this is weird - it has the right secret annotation
flyte.secrets/s0: m4zg54lqhiqcey2pnvww52rnonswg3tforzsectlmv3tuibclfavercjl4jukq1sivkcecq
cc @Kevin Su / @Yee does either of you know the problem here? is `-` a problem?
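For reference, the annotation value can be decoded to check which group and key the pod was asked to inject. This is a sketch under an assumption: the value looks like unpadded base32 with a lowercase alphabet of `a`-`z` followed by `1`-`6`, and under that alphabet its first bytes decode to readable text beginning with `group: "common-`. `decode_flyte_secret_annotation` is a hypothetical helper, not a Flyte API.

```python
import base64

# Assumed alphabet (inferred, see the note above) mapped onto the RFC 4648 alphabet.
ASSUMED_ALPHABET = "abcdefghijklmnopqrstuvwxyz123456"
STANDARD_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
_TO_STANDARD = str.maketrans(ASSUMED_ALPHABET, STANDARD_ALPHABET)


def decode_flyte_secret_annotation(value: str) -> bytes:
    """Decode an unpadded annotation value into raw bytes."""
    standard = value.translate(_TO_STANDARD)
    # base64.b32decode requires the input length to be a multiple of 8.
    standard += "=" * (-len(standard) % 8)
    return base64.b32decode(standard)


annotation = "m4zg54lqhiqcey2pnvww52rnonswg3tforzsectlmv3tuibclfavercjl4jukq1sivkcecq"
print(decode_flyte_secret_annotation(annotation))
```

The printed bytes show the requested group and key exactly as the task declared them, which makes it easy to compare their casing against the key stored in the k8s secret.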

Yee

08/15/2022, 10:47 PM
i’m still debugging this
i think it’s just the caps
and not really sure why
failure to mount volume… this part is a bug.
if you specify a secret that doesn’t exist, it shouldn’t hang

Ketan (kumare3)

08/15/2022, 10:51 PM
ohh so it cannot be caps?

Yee

08/15/2022, 10:51 PM
Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  6m29s (x28 over 106m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[mnxw12lpnywxgzldojsxi316], unattached volumes=[kube-api-access-pbzq2 mnxw12lpnywxgzldojsxi316 aws-iam-token]: timed out waiting for the condition
  Warning  FailedMount  2m13s (x60 over 108m)  kubelet  MountVolume.SetUp failed for volume "mnxw12lpnywxgzldojsxi316" : references non-existent secret key: test_secret
i think this shouldn’t be lower

Eric Hsiao

08/15/2022, 11:29 PM
oh should I try it out with a lower case secret?
I added a lower case version of the secret
➜  symphony_hall git:(main) ✗ kubectl -n flytesnacks-development get secret/common-secrets -o json | jq '.data | to_entries | map(.value= (.value | @base64d))' 
[
  {
    "key": "TEST_SECRET",
    "value": "blah"
  },
  {
    "key": "test_secret",
    "value": "some_value"
  }
]
Code change
@task(secret_requests=[Secret(group="common-secrets", key="test_secret")])
def print_secret() -> str:
    secrets = current_context().secrets
    return secrets.get("common-secrets", "test_secret")
but still got the same error
Unable to find secret for key test_secret in group common-secrets in Env Var:_FSEC_COMMON-SECRETS_TEST_SECRET and FilePath: /root/secrets/common-secrets/test_secret
I wonder if the `-` in `common-secrets` is causing issues. I also changed the secrets default directory in the Dockerfile. Could that cause problems? I know the default is `/etc/secrets`
ENV FLYTE_SECRETS_DEFAULT_DIR /root/secrets

Yee

08/16/2022, 3:04 AM
$ alias ksd
ksd='kubectl -n flytesnacks-development'

$ ksd create secret generic common-secrets --from-literal=test-secret=sosecret
secret/common-secrets created
with code (assuming the flytesnacks repo in cookbook/)
$ cat core/secret_example.py
from flytekit import Secret, task, workflow, current_context

@task(secret_requests=[Secret(group="common-secrets", key="test-secret")])
def print_secret():
    secrets = current_context().secrets
    s = secrets.get("common-secrets", "test-secret")
    print(s)


@workflow
def my_print():
    print_secret()
$ ksd logs f0cc52c90e3ae47a8876-n0-0
tar: Removing leading `/' from member names

sosecret
can you try again?
now i’m not able to repro
keep in mind that when you change the key you unfortunately have to change it in two places: the secret name in the task decorator, and again in the get call itself.
let me know if you’re still having issues and we can hop on a screenshare tomorrow.
we will also fix the casing bug.
@Dan Rammer (hamersaw) https://github.com/flyteorg/flytepropeller/pull/472/files - can you take a look at this tomorrow?

Eric Hsiao

08/16/2022, 12:52 PM
@Yee I took your code and packaged it up but it unfortunately still didn't find the secret. very odd
I ssh'ed into the task running container and tried to see if the secret was mounted but it doesn't look like it (at least from the envs and file paths it's looking at)
root@alxtm7pdc9zchkqttvtp-n0-0:~# printenv
KUBERNETES_SERVICE_PORT_HTTPS=443
FLYTE_INTERNAL_EXECUTION_DOMAIN=development
PYTHON_VERSION=3.9.13
FLYTE_INTERNAL_TASK_PROJECT=flytesnacks
FLYTE_INTERNAL_VERSION=v0.0.12
FLYTE_INTERNAL_PROJECT=flytesnacks
FLYTE_INTERNAL_TASK_NAME=flyte.workflows.hello_world.print_secret
FLYTE_SECRETS_DEFAULT_DIR=/root/secrets
FLYTE_INTERNAL_TASK_DOMAIN=development
FLYTE_INTERNAL_EXECUTION_PROJECT=flytesnacks

Yee

08/16/2022, 4:08 PM
can you `ls -lRa /root/secrets`
btw are you on gcp?

Eric Hsiao

08/16/2022, 4:26 PM
I'm on aws
`/root/secrets` didn't get created since the secrets weren't moved over

Yee

08/16/2022, 4:34 PM
what do you mean moved over? could you dump that pod spec?
actually can you hop on screenshare?

Eric Hsiao

08/16/2022, 4:39 PM
yep!

Dan Rammer (hamersaw)

08/17/2022, 6:45 PM
Hey @Eric Hsiao, quick update here. I think I'm able to reproduce this locally - can you provide some versioning information? specifically flytepropeller and k8s.

Eric Hsiao

08/17/2022, 7:01 PM
yes! I'm in a meeting right now but will send those over in ~15 minutes
^ @Chris Antenesse

Chris Antenesse

08/17/2022, 7:17 PM
flytepropeller is v1.1.0
k8s version is v1.22

Ketan (kumare3)

08/17/2022, 8:40 PM
@Dan Rammer (hamersaw) is this because of the all CAPS Secret name `TEST_SECRET`?

Chris Antenesse

08/17/2022, 8:44 PM
@Eric Hsiao i added a new secret called `test_secret`. case issue should be quick to verify on our end.

Ketan (kumare3)

08/17/2022, 8:45 PM
can you please try that
i think this is something to do with case sensitivity
we should be fixing to support both - cc @Yee did some investigations i feel

Eric Hsiao

08/17/2022, 8:47 PM
I verified that before but that seems like a separate issue

Dan Rammer (hamersaw)

08/17/2022, 8:47 PM
@Yee and I discussed the case sensitivity issues earlier today. We have a PR out which should fix that. I want to make sure it's not a bigger issue with the webhook deployment. Can you change the mutating webhook configuration to fail rather than ignore failures?
this can be done by setting `failurePolicy: Fail`
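For reference, a small sketch of that change applied to a parsed manifest. With `Ignore`, a failed webhook call lets the pod through unmodified; with `Fail`, pod creation is rejected instead. `with_failure_policy` is a hypothetical helper, and the manifest below is trimmed to the fields named in this thread.

```python
import copy


def with_failure_policy(config: dict, policy: str = "Fail") -> dict:
    """Return a copy of a MutatingWebhookConfiguration with failurePolicy set on every webhook."""
    if policy not in ("Fail", "Ignore"):
        raise ValueError(f"unsupported failurePolicy: {policy!r}")
    updated = copy.deepcopy(config)
    for webhook in updated.get("webhooks", []):
        webhook["failurePolicy"] = policy
    return updated


# Trimmed manifest for illustration only; a real one carries more fields.
manifest = {
    "apiVersion": "admissionregistration.k8s.io/v1",
    "kind": "MutatingWebhookConfiguration",
    "metadata": {"name": "flyte-pod-webhook"},
    "webhooks": [{"name": "flyte-pod-webhook.flyte.org", "failurePolicy": "Ignore"}],
}
patched = with_failure_policy(manifest)
print(patched["webhooks"][0]["failurePolicy"])
```

The helper copies the input so the original manifest stays available for comparison before anything is applied to the cluster.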

Eric Hsiao

08/17/2022, 8:50 PM
Yep will try that

Dan Rammer (hamersaw)

08/17/2022, 8:50 PM
This causes FlytePropeller to fail in creating the Pod if the k8s API server is unable to call the mutating webhook (which handles secret injection). In my test case in the UI the task error was:
Workflow[flytesnacks:development:.flytegen.core.containerization.use_secrets.secret_task] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[corecontainerizationusesecretssecrettask]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": Post "https://flyte-pod-webhook.all.svc:9443/mutate--v1-pod?timeout=10s": service "flyte-pod-webhook" not found
but i just removed the webhook service - this should tell us if the webhook is active.

Yee

08/17/2022, 8:52 PM
can we add that to the default deployment @Dan Rammer (hamersaw)

Dan Rammer (hamersaw)

08/17/2022, 8:52 PM
^^ was just thinking the exact same thing

Yee

08/17/2022, 8:53 PM
though i suspect it’s not failing… i feel like it’s not getting called at all

Chris Antenesse

08/17/2022, 9:11 PM
where do we change the `failurePolicy`? i thought the mutating webhook config was at the service level, but not seeing it there?

Dan Rammer (hamersaw)

08/17/2022, 9:12 PM
oh sorry - it has its own resource
kubectl -n flyte edit mutatingwebhookconfigurations flyte-pod-webhook -o yaml

Chris Antenesse

08/17/2022, 9:12 PM
ok got it!
that’s set on our end

Eric Hsiao

08/17/2022, 9:13 PM
I'll rerun the test flyte workflow

Dan Rammer (hamersaw)

08/17/2022, 9:15 PM
already set to "Fail"? then like Yee suggested, the webhook is not even being called. so there is a mismatch between the webhook criteria and the pod definitions.

Chris Antenesse

08/17/2022, 9:15 PM
my bad. i set it to `Fail` just now
👍 1
it was set to `Ignore` previously 🙂

Dan Rammer (hamersaw)

08/17/2022, 9:16 PM
yeah, we'll have to update this in the default configuration.

Eric Hsiao

08/17/2022, 9:16 PM
I'm seeing the task I'm kicking off hanging now (~3mins)

Dan Rammer (hamersaw)

08/17/2022, 9:18 PM
ok, i'm guessing there are webhook call errors and flytepropeller is internally retrying. there is a 10s timeout on the mutating webhook by default i think, so we'll have to incur that a few times maybe?
it should produce an end error that will (hopefully 🙏) be a bit more descriptive on what is happening.
is there anything in the propeller logs like `failed calling webhook`? on my local testing (above) the propeller logs are littered with:
{
  "json": {
    "exec_id": "agbb24545n6vbwknrxqs",
    "node": "corecontainerizationusesecretssecrettask",
    "ns": "flytesnacks-development",
    "res_ver": "5504",
    "routine": "worker-4",
    "src": "handler.go:222",
    "wf": "flytesnacks:development:.flytegen.core.containerization.use_secrets.secret_task"
  },
  "level": "error",
  "msg": "handling parent node failed with error: failed at Node[corecontainerizationusesecretssecrettask]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook \"flyte-pod-webhook.flyte.org\": Post \"https://flyte-pod-webhook.all.svc:9443/mutate--v1-pod?timeout=10s\": service \"flyte-pod-webhook\" not found",
  "ts": "2022-08-17T15:45:24-05:00"
}
for each individual retry.
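For reference, a sketch of scanning propeller logs for these entries. It assumes one JSON record per line with a `msg` field, as in the record above, and skips anything that is not JSON. `webhook_failures` is a hypothetical helper, and the sample lines are abbreviated.

```python
import json
from typing import Iterable, Iterator


def webhook_failures(lines: Iterable[str], needle: str = "failed calling webhook") -> Iterator[dict]:
    """Yield parsed JSON log records whose msg field mentions a webhook call failure."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip non-JSON lines, e.g. entries starting with "E0817 ..."
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed records
        if isinstance(record, dict) and needle in str(record.get("msg", "")):
            yield record


# Abbreviated sample lines for illustration; real messages are much longer.
sample = [
    '{"json": {"exec_id": "agbb24545n6vbwknrxqs"}, "level": "error", '
    '"msg": "handling parent node failed with error: failed calling webhook: service not found"}',
    "E0817 21:22:16.167997       1 workers.go:102] error syncing ...",
    '{"json": {"exec_id": "agbb24545n6vbwknrxqs"}, "level": "warning", "msg": "Workflow not found in cache."}',
]
for record in webhook_failures(sample):
    print(record["json"]["exec_id"], record["level"])
```

In practice the lines would come from the output of `kubectl logs` for the flytepropeller pod.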

Chris Antenesse

08/17/2022, 9:23 PM
yea
one sec, will grab one…
{"json":{"exec_id":"ancq4kjld5lf6cv88bd8","ns":"flytesnacks-development","res_ver":"10680755","routine":"worker-8","wf":"flytesnacks:development:flyte.workflows.hello_world.wf"},"level":"error","msg":"Error when trying to reconcile workflow. Error [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook \"flyte-pod-webhook.flyte.org\": failed to call webhook: Post \"https://flyte-pod-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s\": context deadline exceeded]. Error Type[*errors.NodeErrorWithCause]","ts":"2022-08-17T21:22:16Z"}
E0817 21:22:16.167997       1 workers.go:102] error syncing 'flytesnacks-development/ancq4kjld5lf6cv88bd8': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-pod-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": context deadline exceed

Dan Rammer (hamersaw)

08/17/2022, 9:32 PM
OK, so it looks like it's being called, but there is some issue with k8s communication internally. I'm doing some searching and see potential firewall issues are common.

Eric Hsiao

08/17/2022, 9:34 PM
This part is a little suspicious: `failed calling webhook "flyte-pod-webhook.flyte.org"`

Dan Rammer (hamersaw)

08/17/2022, 9:34 PM
It sounds like the API server is unable to connect to the webhook endpoint. but i'm not sure the easiest way to debug this.

Eric Hsiao

08/17/2022, 9:34 PM
Should that be pointed at flyte.org?

Dan Rammer (hamersaw)

08/17/2022, 9:36 PM
so i think that's just the webhook name. from the k8s resource:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  # omitted
webhooks:
  # omitted
  name: flyte-pod-webhook.flyte.org
basically in that configuration the api server will try to call the service defined at:
service:
      name: flyte-pod-webhook
      namespace: all
      path: /mutate--v1-pod
      port: 9443

Chris Antenesse

08/17/2022, 9:38 PM
so `https://flyte-pod-webhook.flyte.svc:443/mutate--v1-pod`?
in our case
i can exec to the propeller pod and access that hostname:port
kubectl exec -n flyte -it flytepropeller-74bf956f6c-5zfbh -- /bin/sh
nc -z flyte-pod-webhook.flyte.svc 443
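For reference, the same reachability check as a Python sketch, for images that do not ship `nc`. `can_connect` is a hypothetical helper. Like `nc -z`, it only shows that the machine running it can open a TCP connection; it says nothing about whether another caller, such as the kube API server, can reach the same endpoint.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened, similar to `nc -z`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Local demonstration against a throwaway listener on 127.0.0.1.
with socket.socket() as server:
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    print(can_connect("127.0.0.1", server.getsockname()[1]))
```

Inside the cluster the call would be `can_connect("flyte-pod-webhook.flyte.svc", 443)`, using the host and port from this thread.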

Dan Rammer (hamersaw)

08/17/2022, 9:53 PM
yeah, i am kind of running into a blank here. we know that the failure is in calling the webhook. it's important to make the distinction that the request to the webhook is not coming from propeller, but from the kube api server i believe. and propeller just sees the error when trying to create a Pod.

Chris Antenesse

08/17/2022, 9:54 PM
ah, ok. that makes sense.

Dan Rammer (hamersaw)

08/17/2022, 9:54 PM
can we take a quick look at the propeller config map? should be something like:
apiVersion: v1
kind: ConfigMap
metadata:
  name: flyte-propeller-config
  namespace: flyte
and the pod-webhook service:
apiVersion: v1
kind: Service
metadata:
  name: flyte-pod-webhook
  namespace: flyte

Chris Antenesse

08/17/2022, 9:54 PM
sure
the configmap
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: flyte
    meta.helm.sh/release-namespace: flyte
  creationTimestamp: "2022-08-12T00:31:18Z"
  labels:
    app.kubernetes.io/instance: flyte
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flyteadmin
    helm.sh/chart: flyte-core-v1.1.0
  name: flyte-propeller-config
  namespace: flyte
and pod-webhook service
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: flyte
    meta.helm.sh/release-namespace: flyte
    projectcontour.io/upstream-protocol.h2c: grpc
  creationTimestamp: "2022-08-12T00:31:18Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: flyte-pod-webhook
  namespace: flyte

Dan Rammer (hamersaw)

08/17/2022, 9:57 PM
can you paste the data section on the webhook as well? so the full definition.

Chris Antenesse

08/17/2022, 9:57 PM
oh, yes

Dan Rammer (hamersaw)

08/17/2022, 9:57 PM
i mean the service - that you just pasted
i've finally lost it.

Chris Antenesse

08/17/2022, 9:59 PM
here’s the output of
kubectl get  svc/flyte-pod-webhook -o yaml -n flyt
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: flyte
    meta.helm.sh/release-namespace: flyte
    projectcontour.io/upstream-protocol.h2c: grpc
  creationTimestamp: "2022-08-12T00:31:18Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: flyte-pod-webhook
  namespace: flyte
  resourceVersion: "8525789"
  uid: 0ffd7a94-14d1-44db-ae76-34d17dccb0cd
spec:
  clusterIP: 172.27.91.248
  clusterIPs:
  - 172.27.91.248
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: 9443
  selector:
    app: flyte-pod-webhook
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
do you want output of the configmap too?

Dan Rammer (hamersaw)

08/17/2022, 9:59 PM
umm we might be alright.
do i see the service port 9443 being mapped to 443 in the pod? do we just have the wrong port in the MutatingWebhookConfiguration? can you try to update it to 9443 rather than 443
service:
      name: flyte-pod-webhook
      namespace: all
      path: /mutate--v1-pod
      port: 9443
mutatingwebhookconfigurations flyte-pod-webhook

Chris Antenesse

08/17/2022, 10:02 PM
sure
from
service:
      name: flyte-pod-webhook
      namespace: flyte
      path: /mutate--v1-pod
      port: 443
to
service:
      name: flyte-pod-webhook
      namespace: flyte
      path: /mutate--v1-pod
      port: 9443
that is complete on my end

Dan Rammer (hamersaw)

08/17/2022, 10:06 PM
yeah, if this doesn't work i'm going to have to defer to our in-house k8s expert. the only issue is he is on the other side of the world, so his hours can be a bit difficult to overlap.

Chris Antenesse

08/17/2022, 10:08 PM
got it
should we try and run the task again? and show some logs?

Eric Hsiao

08/17/2022, 10:09 PM
yeah I'm kicking it off and getting this
{"json":{"exec_id":"aknhc8ds5wn5z7sql6jb","ns":"flytesnacks-development","res_ver":"10692263","routine":"worker-5","wf":"flytesnacks:development:flyte.workflows.hello_world.wf"},"level":"error","msg":"Error when trying to reconcile workflow. Error [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook \"flyte-pod-webhook.flyte.org\": failed to call webhook: Post \"https://flyte-pod-webhook.flyte.svc:9443/mutate--v1-pod?timeout=10s\": no service port 9443 found for service \"flyte-pod-webhook\"]. Error Type[*errors.NodeErrorWithCause]","ts":"2022-08-17T22:07:16Z"}
E0817 22:07:16.516247       1 workers.go:102] error syncing 'flytesnacks-development/aknhc8ds5wn5z7sql6jb': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-pod-webhook.flyte.svc:9443/mutate--v1-pod?timeout=10s": no service port 9443 found for service "flyte-pod-webhook"
{"json":{"exec_id":"ancq4kjld5lf6cv88bd8","ns":"flytesnacks-development","routine":"worker-4"},"level":"warning","msg":"Workflow not found in cache.","ts":"2022-08-17T22:08:06Z"}
{"json":{"exec_id":"ancq4kjld5lf6cv88bd8","ns":"flytesnacks-development","routine":"worker-4"},"level":"warning","msg":"Workflow namespace[flytesnacks-development]/name[ancq4kjld5lf6cv88bd8] not found, may be deleted.","ts":"2022-08-17T22:08:06Z"}
no service port 9443 found for service
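For reference, that error is consistent with how a webhook's `service` reference is resolved: the `port` in the webhook configuration is looked up among the Service's `port` values, and the Service then forwards to its `targetPort`. A minimal sketch of that check, using the Service ports pasted earlier in the thread; `webhook_port_problem` is a hypothetical helper.

```python
from typing import List, Optional


def webhook_port_problem(service_ports: List[dict], webhook_port: int) -> Optional[str]:
    """Return a description of the mismatch, or None if webhook_port is a Service port."""
    exposed = [p["port"] for p in service_ports]
    if webhook_port in exposed:
        return None
    targets = [p.get("targetPort") for p in service_ports]
    if webhook_port in targets:
        return f"{webhook_port} is a targetPort, not a Service port; use one of {exposed}"
    return f"no service port {webhook_port}; the Service exposes {exposed}"


# Ports section of the flyte-pod-webhook Service shown above.
service_ports = [{"name": "https", "port": 443, "protocol": "TCP", "targetPort": 9443}]
print(webhook_port_problem(service_ports, 9443))
print(webhook_port_problem(service_ports, 443))
```

With this Service definition, 443 is the value the webhook configuration has to reference, while 9443 is only where the webhook container listens.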

Dan Rammer (hamersaw)

08/17/2022, 10:11 PM
oh sure, that's interesting because that's how our configuration is setup - using 9443 as the target port in service and then mutating configuration endpoint.
@Yuvraj any chance you are seeing something here? I'm sure you'll take one look and it will be obvious!

Chris Antenesse

08/18/2022, 4:05 PM
we were able to solve this on our end. the k8s API was unable to connect to the webhook pod. after a security group change allowing the traffic, we were able to inject secrets.
👍 2

Yee

08/18/2022, 4:22 PM
oh all good?
everything is resolved? would you mind describing how you were able to debug?

Chris Antenesse

08/18/2022, 4:23 PM
yep, things are working as intended on our end. at least for that piece 🙂

Yee

08/18/2022, 4:24 PM
like how did you determine that, did you look at cloudwatch logs, what testing did you do? (sorry if some of this was covered in the messages i missed above)

Chris Antenesse

08/18/2022, 4:28 PM
sure - i think there are a few important points:
• being able to have the more descriptive logs for flytepropeller was a big help. however this line tripped us up a bit. it seemed like there was a connectivity issue from flytepropeller to the webhook pod, when in fact it was an issue from the k8s API nodes to the webhook pod
Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-pod-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": context deadline exceed
• we were able to determine that it was an issue from the k8s API -> webhook pod by looking at this diagram
❤️ 1
once we knew that the k8s API nodes were trying to talk to the webhook pods on 443, it was a matter of finding the security group that handled that access and adding a rule to allow that traffic. once i did that, i kicked off the workflow and things worked as intended.
there is one more hump, unrelated to this, that we need to get over. which has to do with private registry access. i’ll start a new conversation about that in #onboarding. but we’re good on secrets injection for now. thanks for the help. lmk if there is any other information that would be helpful on our side.

Dan Rammer (hamersaw)

08/18/2022, 4:42 PM
@Chris Antenesse @Eric Hsiao this is great to hear! glad we were able to get this resolved. we're going to update the configuration failure policy to fail by default and that should bubble up these issues into the console quicker in the future. once we know the webhook is unable to be called it should ease debugging. thanks for being so patient with this fix!
🙏 2
Also, I know we have some sparse documentation on private image registries, we can explore it a bit further in the new thread you're planning on starting. it would be great to fill this out a bit more with any questions you have!

Chris Antenesse

08/18/2022, 4:45 PM
nice! i’ve been through the docs and hitting some speed bumps. will surface those today.
🙌 1