Hi all - having an issue injecting secrets from k8...
# ask-the-community
e
Hi all - having an issue injecting secrets from k8s into my task. Here's how I've set it up. I created a test secret like this to try and see if a task can read it
kubectl -n flytesnacks-development create secret generic common-secrets --from-literal=TEST_SECRET=blah
kubectl -n flytesnacks-development get secret/common-secrets -o json | jq '.data | to_entries | map(.value= (.value | @base64d))'
>> [
  {
    "key": "TEST_SECRET",
    "value": "blah"
  }
]
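For anyone without jq, the same check can be sketched in Python. This is only an illustration: `decode_secret_data` is a hypothetical helper name, and the sample input stands in for the JSON that `kubectl ... get secret/common-secrets -o json` prints.

```python
import base64
import json
from typing import Dict, List


def decode_secret_data(secret_json: str) -> List[Dict[str, str]]:
    """Mirror the jq filter: list each key in .data with its base64-decoded value."""
    secret = json.loads(secret_json)
    return [
        {"key": key, "value": base64.b64decode(value).decode("utf-8")}
        for key, value in secret.get("data", {}).items()
    ]


# Stand-in for the kubectl output (assumption: only the .data field matters here).
sample = json.dumps(
    {"data": {"TEST_SECRET": base64.b64encode(b"blah").decode("ascii")}}
)
print(decode_secret_data(sample))
```

Kubernetes stores Secret values base64-encoded under `.data`, which is why both the jq filter and this sketch decode them before printing.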
Within the flytesnacks-development namespace, I'm running a task like this
@task(secret_requests=[Secret(group="common-secrets", key="TEST_SECRET")])
def print_secret() -> str:
    secrets = current_context().secrets
    return secrets.get("common-secrets", "TEST_SECRET")
However, this fails with the following
Unable to find secret for key TEST_SECRET in group common-secrets in Env Var:_FSEC_COMMON-SECRETS_TEST_SECRET and FilePath: /root/secrets/common-secrets/test_secret
From looking at the code in the SecretManager, it looks like it only checks the ENV variable or a file path (which does not exist because I'm using k8s secrets as an additional provider). Am I missing something? I've checked that the k8s secret is in the same namespace as the task being run
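The two locations named in the error message can be reproduced with a small sketch. The upper/lower-casing rules below are inferred from the error text only, and the function names are hypothetical, so treat this as an approximation rather than flytekit's actual implementation.

```python
import posixpath

ENV_PREFIX = "_FSEC_"  # prefix as it appears in the error message


def secret_env_var(group: str, key: str) -> str:
    # Assumption: group and key are upper-cased for the environment variable.
    return f"{ENV_PREFIX}{group.upper()}_{key.upper()}"


def secret_file_path(base_dir: str, group: str, key: str) -> str:
    # Assumption: group and key are lower-cased for the file lookup.
    return posixpath.join(base_dir, group.lower(), key.lower())


print(secret_env_var("common-secrets", "TEST_SECRET"))
print(secret_file_path("/root/secrets", "common-secrets", "TEST_SECRET"))
```

If neither the variable nor the file exists inside the container, nothing was injected into the pod, which points at the injection step rather than at the task code.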
k
Hmm the secret should be Injected by the Flyte secrets injector
e
Do you know how I can check when / how that runs? It says in the docs that it should be injected to the pod when the task starts but that doesn't seem to be the case
e
Yes that's the document I'm using. I checked the webhook pod and things seem to be configured correctly
➜  kubectl logs pod/flyte-pod-webhook-595f7b6858-7qxdt -n flyte
time="2022-08-12T00:31:32Z" level=info msg=------------------------------------------------------------------------
time="2022-08-12T00:31:32Z" level=info msg="App [flytepropeller], Version [unknown], BuildSHA [unknown], BuildTS [2022-08-12 00:31:32.456181799 +0000 UTC m=+0.033941479]"
time="2022-08-12T00:31:32Z" level=info msg=------------------------------------------------------------------------
time="2022-08-12T00:31:32Z" level=info msg="Detected: 4 CPU's\n"
{"metrics-prefix":"flyte:","certDir":"/etc/webhook/certs","localCert":false,"listenPort":9443,"serviceName":"flyte-pod-webhook","servicePort":443,"secretName":"flyte-pod-webhook","secretManagerType":"K8s","awsSecretManager":{"sidecarImage":"docker.io/amazon/aws-secrets-manager-secret-sidecar:v0.1.4","resources":{"limits":{"cpu":"200m","memory":"500Mi"},"requests":{"cpu":"200m","memory":"500Mi"}}},"vaultSecretManager":{"role":"flyte","kvVersion":"2"}}
k
Hmm interesting, so let's try to look at one of the launched pods- cc @Yee you recently saw this. What annotation should we look for
e
Here's the output for describing a task pod that starts up
➜  Documents kb describe pod a2wf2sqs2bfw6qr4l92d-n0-0 -n flytesnacks-development
Name:         a2wf2sqs2bfw6qr4l92d-n0-0
Namespace:    flytesnacks-development
Priority:     0
Node:         ip-10-15-147-196.ec2.internal/10.15.147.196
Start Time:   Mon, 15 Aug 2022 12:52:24 -0400
Labels:       domain=development
              execution-id=a2wf2sqs2bfw6qr4l92d
              inject-flyte-secrets=true
              interruptible=false
              node-id=n0
              project=flytesnacks
              shard-key=13
              task-name=flyte-workflows-hello-world-print-secret
              workflow-name=flyte-workflows-hello-world-wf
Annotations:  cluster-autoscaler.kubernetes.io/safe-to-evict: false
              flyte.secrets/s0: m4zg54lqhiqcey2pnvww52rnonswg3tforzsectlmv3tuibclfavercjl4jukq1sivkcecq
              kubernetes.io/psp: eks.privileged
Status:       Succeeded
IP:           10.15.128.66
IPs:
  IP:           10.15.128.66
Controlled By:  flyteworkflow/a2wf2sqs2bfw6qr4l92d
Containers:
  a2wf2sqs2bfw6qr4l92d-n0-0:
    Container ID:  docker://7e388ada4103f5fa7c5a0c1a673b3c61f79c441bcede9b1798bed8f4db128e65
    Image:         XXX
    Image ID:      XXX 
    Port:          <none>
    Host Port:     <none>
    Args:
      pyflyte-execute
      --inputs
      s3://predictap-tyson-flyte/metadata/propeller/flytesnacks-development-a2wf2sqs2bfw6qr4l92d/n0/data/inputs.pb
      --output-prefix
      s3://predictap-tyson-flyte/metadata/propeller/flytesnacks-development-a2wf2sqs2bfw6qr4l92d/n0/data/0
      --raw-output-data-prefix
      s3://predictap-tyson-flyte/jh/a2wf2sqs2bfw6qr4l92d-n0-0
      --checkpoint-path
      s3://predictap-tyson-flyte/jh/a2wf2sqs2bfw6qr4l92d-n0-0/_flytecheckpoints
      --prev-checkpoint
      ""
      --resolver
      flytekit.core.python_auto_container.default_task_resolver
      --
      task-module
      flyte.workflows.hello_world
      task-name
      print_secret
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 15 Aug 2022 12:52:25 -0400
      Finished:     Mon, 15 Aug 2022 12:52:28 -0400
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1000Mi
    Requests:
      cpu:     1
      memory:  1000Mi
    Environment:
      FLYTE_INTERNAL_EXECUTION_WORKFLOW:  flytesnacks:development:flyte.workflows.hello_world.wf
      FLYTE_INTERNAL_EXECUTION_ID:        a2wf2sqs2bfw6qr4l92d
      FLYTE_INTERNAL_EXECUTION_PROJECT:   flytesnacks
      FLYTE_INTERNAL_EXECUTION_DOMAIN:    development
      FLYTE_ATTEMPT_NUMBER:               0
      FLYTE_INTERNAL_TASK_PROJECT:        flytesnacks
      FLYTE_INTERNAL_TASK_DOMAIN:         development
      FLYTE_INTERNAL_TASK_NAME:           flyte.workflows.hello_world.print_secret
      FLYTE_INTERNAL_TASK_VERSION:        v0.0.5
      FLYTE_INTERNAL_PROJECT:             flytesnacks
      FLYTE_INTERNAL_DOMAIN:              development
      FLYTE_INTERNAL_NAME:                flyte.workflows.hello_world.print_secret
      FLYTE_INTERNAL_VERSION:             v0.0.5
      AWS_STS_REGIONAL_ENDPOINTS:         regional
      AWS_DEFAULT_REGION:                 us-east-1
      AWS_REGION:                         us-east-1
      AWS_ROLE_ARN:                       XXX
      AWS_WEB_IDENTITY_TOKEN_FILE:        /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7kr64 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  kube-api-access-7kr64:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
k
this is weird - it has the right secret annotation
flyte.secrets/s0: m4zg54lqhiqcey2pnvww52rnonswg3tforzsectlmv3tuibclfavercjl4jukq1sivkcecq
cc @Kevin Su / @Yee does anyone of you know the problem here? is `-` a problem?
y
i’m still debugging this
i think it’s just the caps
and not really sure why
failure to mount volume… this part is a bug.
if you specify a secret that doesn’t exist, it shouldn’t hang
k
ohh so it cannot be caps?
y
Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  6m29s (x28 over 106m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[mnxw12lpnywxgzldojsxi316], unattached volumes=[kube-api-access-pbzq2 mnxw12lpnywxgzldojsxi316 aws-iam-token]: timed out waiting for the condition
  Warning  FailedMount  2m13s (x60 over 108m)  kubelet  MountVolume.SetUp failed for volume "mnxw12lpnywxgzldojsxi316" : references non-existent secret key: test_secret
i think this shouldn’t be lower
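The mismatch in that FailedMount event can be illustrated with a tiny sketch. Kubernetes Secret keys are case-sensitive, so a request for `test_secret` does not match a Secret that only holds `TEST_SECRET`. The helper names are hypothetical, and the lower-casing step is an assumption drawn from the event message.

```python
from typing import Iterable, List, Optional


def find_key(available: Iterable[str], requested: str) -> Optional[str]:
    """Return the requested key only on an exact, case-sensitive match."""
    return requested if requested in set(available) else None


def case_insensitive_matches(available: Iterable[str], requested: str) -> List[str]:
    """List keys that differ from the request only by letter case."""
    return sorted(k for k in available if k.lower() == requested.lower())


keys_in_secret = ["TEST_SECRET"]
requested = "TEST_SECRET".lower()  # assumption: the mount asks for the lower-cased key

print(find_key(keys_in_secret, requested))                  # None
print(case_insensitive_matches(keys_in_secret, requested))  # ['TEST_SECRET']
```

An empty exact match combined with a non-empty case-insensitive match is the signature of the casing problem discussed here.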
e
oh should I try it out with a lower case secret?
I added a lower case version of the secret
➜  symphony_hall git:(main) ✗ kubectl -n flytesnacks-development get secret/common-secrets -o json | jq '.data | to_entries | map(.value= (.value | @base64d))' 
[
  {
    "key": "TEST_SECRET",
    "value": "blah"
  },
  {
    "key": "test_secret",
    "value": "some_value"
  }
]
Code change
@task(secret_requests=[Secret(group="common-secrets", key="test_secret")])
def print_secret() -> str:
    secrets = current_context().secrets
    return secrets.get("common-secrets", "test_secret")
but still got the same error
Unable to find secret for key test_secret in group common-secrets in Env Var:_FSEC_COMMON-SECRETS_TEST_SECRET and FilePath: /root/secrets/common-secrets/test_secret
I wonder if the `-` in `common-secrets` is causing issues. I also changed the secrets default directory in the Dockerfile. Could that cause problems? I know the default is `/etc/secrets`
ENV FLYTE_SECRETS_DEFAULT_DIR /root/secrets
y
$ alias ksd
ksd='kubectl -n flytesnacks-development'

$ ksd create secret generic common-secrets --from-literal=test-secret=sosecret
secret/common-secrets created
with code (assuming the flytesnacks repo in cookbook/)
$ cat core/secret_example.py
from flytekit import Secret, task, workflow, current_context

@task(secret_requests=[Secret(group="common-secrets", key="test-secret")])
def print_secret():
    secrets = current_context().secrets
    s = secrets.get("common-secrets", "test-secret")
    print(s)


@workflow
def my_print():
    print_secret()
$ ksd logs f0cc52c90e3ae47a8876-n0-0
tar: Removing leading `/' from member names

sosecret
can you try again?
now i’m not able to repro
keep in mind that when you change the key you unfortunately have to change it in two places: in the secret name in the task decorator, and again in the get call itself.
let me know if you’re still having issues and we can hop on a screenshare tomorrow.
we will also fix the casing bug.
@Dan Rammer (hamersaw) https://github.com/flyteorg/flytepropeller/pull/472/files - can you take a look at this tomorrow?
e
@Yee I took your code and packaged it up but it unfortunately still didn't find the secret. very odd
I ssh'ed into the task running container and tried to see if the secret was mounted but it doesn't look like it (at least from the envs and file paths it's looking at)
root@alxtm7pdc9zchkqttvtp-n0-0:~# printenv
KUBERNETES_SERVICE_PORT_HTTPS=443
FLYTE_INTERNAL_EXECUTION_DOMAIN=development
PYTHON_VERSION=3.9.13
FLYTE_INTERNAL_TASK_PROJECT=flytesnacks
FLYTE_INTERNAL_VERSION=v0.0.12
FLYTE_INTERNAL_PROJECT=flytesnacks
FLYTE_INTERNAL_TASK_NAME=flyte.workflows.hello_world.print_secret
FLYTE_SECRETS_DEFAULT_DIR=/root/secrets
FLYTE_INTERNAL_TASK_DOMAIN=development
FLYTE_INTERNAL_EXECUTION_PROJECT=flytesnacks
y
can you `ls -lRa /root/secrets`
btw are you on gcp?
e
I'm on aws
`/root/secrets` didn't get created since the secrets weren't moved over
y
what do you mean moved over? could you dump that pod spec?
actually can you hop on screenshare?
e
yep!
d
Hey @Eric Hsiao, quick update here. I think I'm able to reproduce this locally - can you provide some versioning information? specifically flytepropeller and k8s.
e
yes! I'm in a meeting right now but will send those over in ~15 minutes
^ @Chris Antenesse
c
flytepropeller is v1.1.0
k8s version is v1.22
k
@Dan Rammer (hamersaw) is this because of the all CAPS Secret name `TEST_SECRET`
c
@Eric Hsiao i added a new secret called `test_secret`. case issue should be quick to verify on our end.
k
can you please try that
i think this is something to do with case sensitivity
we should be fixing to support both - cc @Yee did some investigations i feel
e
I verified that before but that seems like a separate issue
d
@Yee and I discussed the case sensitivity issues earlier today. We have a PR out which should fix that. I want to make sure it's not a bigger issue with the webhook deployment. Can you change the mutating webhook configuration to fail rather than ignore failures?
this can be done by setting `failurePolicy: Fail`
e
Yep will try that
d
This causes FlytePropeller to fail in creating the Pod if the k8s API server is unable to call the mutating webhook (which handles secret injection). In my test case in the UI the task error was:
Workflow[flytesnacks:development:.flytegen.core.containerization.use_secrets.secret_task] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[corecontainerizationusesecretssecrettask]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": Post "https://flyte-pod-webhook.all.svc:9443/mutate--v1-pod?timeout=10s": service "flyte-pod-webhook" not found
but i just removed the webhook service - this should tell us if the webhook is active.
y
can we add that to the default deployment @Dan Rammer (hamersaw)
d
^^ was just thinking the exact same thing
y
though i suspect it’s not failing… i feel like it’s not getting called at all
c
where do we change the `failurePolicy`? i thought the mutating webhook config was at the service level, but not seeing it there?
d
oh sorry - it has its own resource
kubectl -n flyte edit mutatingwebhookconfigurations flyte-pod-webhook -o yaml
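Before editing, it can help to see what the policy is currently set to. The sketch below reads the JSON form of the resource (for example from `kubectl get mutatingwebhookconfigurations flyte-pod-webhook -o json`) and reports or rewrites `failurePolicy` per webhook. The helper names and the sample document are illustrative assumptions; only the `webhooks[].failurePolicy` field comes from the Kubernetes API.

```python
import json
from typing import Dict


def failure_policies(config: Dict) -> Dict[str, str]:
    """Map each webhook name to its failurePolicy ('<unset>' if absent)."""
    return {
        hook.get("name", "<unnamed>"): hook.get("failurePolicy", "<unset>")
        for hook in config.get("webhooks", [])
    }


def with_fail_policy(config: Dict) -> Dict:
    """Return a copy of the config with failurePolicy set to Fail everywhere."""
    patched = json.loads(json.dumps(config))  # deep copy via JSON round-trip
    for hook in patched.get("webhooks", []):
        hook["failurePolicy"] = "Fail"
    return patched


# Minimal stand-in for the real resource (assumption: one webhook, set to Ignore).
sample = {
    "kind": "MutatingWebhookConfiguration",
    "webhooks": [{"name": "flyte-pod-webhook.flyte.org", "failurePolicy": "Ignore"}],
}
print(failure_policies(sample))
print(failure_policies(with_fail_policy(sample)))
```

With `Ignore`, a failed webhook call is silently skipped and the pod is created without secrets; with `Fail`, pod creation is rejected and the error surfaces in the propeller logs.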
c
ok got it!
that’s set on our end
e
I'll rerun the test flyte workflow
d
already set to "Fail"? then like Yee suggested, the webhook is not even being called. so there is a mismatch between the webhook criteria and the pod definitions.
c
my bad. i set it to `Fail` just now
👍 1
it was set to `Ignore` previously 🙂
d
yeah, we'll have to update this in the default configuration.
e
I'm seeing the task I'm kicking off hanging now (~3mins)
d
ok, i'm guessing there are webhook call errors and flytepropeller is internally retrying. there is a 10s timeout on the mutating webhook by default i think, so we'll have to incur that a few times maybe?
it should produce an end error that will (hopefully 🙏) be a bit more descriptive on what is happening.
is there anything in the propeller logs like `failed calling webhook`? on my local testing (above) the propeller logs are littered with:
{
  "json": {
    "exec_id": "agbb24545n6vbwknrxqs",
    "node": "corecontainerizationusesecretssecrettask",
    "ns": "flytesnacks-development",
    "res_ver": "5504",
    "routine": "worker-4",
    "src": "handler.go:222",
    "wf": "flytesnacks:development:.flytegen.core.containerization.use_secrets.secret_task"
  },
  "level": "error",
  "msg": "handling parent node failed with error: failed at Node[corecontainerizationusesecretssecrettask]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook \"flyte-pod-webhook.flyte.org\": Post \"https://flyte-pod-webhook.all.svc:9443/mutate--v1-pod?timeout=10s\": service \"flyte-pod-webhook\" not found",
  "ts": "2022-08-17T15:45:24-05:00"
}
for each individual retry.
c
yea
one sec, will grab one…
{"json":{"exec_id":"ancq4kjld5lf6cv88bd8","ns":"flytesnacks-development","res_ver":"10680755","routine":"worker-8","wf":"flytesnacks:development:flyte.workflows.hello_world.wf"},"level":"error","msg":"Error when trying to reconcile workflow. Error [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook \"flyte-pod-webhook.flyte.org\": failed to call webhook: Post \"https://flyte-pod-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s\": context deadline exceeded]. Error Type[*errors.NodeErrorWithCause]","ts":"2022-08-17T21:22:16Z"}
E0817 21:22:16.167997       1 workers.go:102] error syncing 'flytesnacks-development/ancq4kjld5lf6cv88bd8': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-pod-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": context deadline exceed
d
OK, so it looks like it's being called, but there is some issue with k8s communication internally. I'm doing some searching and see potential firewall issues are common.
e
This part is a little suspicious: `failed calling webhook "flyte-pod-webhook.flyte.org"`
d
It sounds like the API server is unable to connect to the webhook endpoint. but i'm not sure the easiest way to debug this.
e
Should that be pointed at flyte.org?
d
so i think that's just the webhook name. from the k8s resource:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  # omitted
webhooks:
  # omitted
  name: flyte-pod-webhook.flyte.org
basically in that configuration the api server will try to call the service defined at:
service:
      name: flyte-pod-webhook
      namespace: all
      path: /mutate--v1-pod
      port: 9443
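The address in the error messages follows directly from that `service` block. A rough sketch of the mapping, with `webhook_url` as a hypothetical helper name, shows how the name, namespace, port, and path combine into the URL the API server posts to:

```python
from typing import Dict


def webhook_url(service: Dict) -> str:
    """Build the in-cluster URL the API server calls for a webhook service reference."""
    name = service["name"]
    namespace = service["namespace"]
    port = service.get("port", 443)  # Kubernetes defaults this port to 443
    path = service.get("path", "")
    return f"https://{name}.{namespace}.svc:{port}{path}"


local_test = {"name": "flyte-pod-webhook", "namespace": "all",
              "path": "/mutate--v1-pod", "port": 9443}
print(webhook_url(local_test))
```

For the block above this prints `https://flyte-pod-webhook.all.svc:9443/mutate--v1-pod`, which matches the URL in the local test error apart from the `?timeout=10s` query string. The `flyte-pod-webhook.flyte.org` string is only the webhook's name, not a host that gets contacted.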
c
so `https://flyte-pod-webhook.flyte.svc:443/mutate--v1-pod`?
in our case
i can exec to the propeller pod and access that hostname:port
kubectl exec -n flyte -it flytepropeller-74bf956f6c-5zfbh -- /bin/sh
nc -z flyte-pod-webhook.flyte.svc 443
d
yeah, i am kind of running into a blank here. we know that the failure is in calling the webhook. it's important to make the distinction that the request to the webhook is not coming from propeller, but from the kube api server i believe. and propeller just sees the error when trying to create a Pod.
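A Python stand-in for the `nc -z` probe is sketched below, in case `nc` is not available in an image. `can_connect` is a hypothetical helper name. As noted above, a successful probe from the propeller pod only proves that path works; the call that matters originates from the kube API server, so the same kind of check would need to run from wherever the API server's traffic comes from.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call, equivalent in spirit to `nc -z flyte-pod-webhook.flyte.svc 443`:
# can_connect("flyte-pod-webhook.flyte.svc", 443)
```

A timeout here, as opposed to an immediate refusal, usually suggests traffic is being dropped somewhere on the network path rather than rejected by the target.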
c
ah, ok. that makes sense.
d
can we take a quick look at the propeller config map? should be something like:
apiVersion: v1
kind: ConfigMap
metadata:
  name: flyte-propeller-config
  namespace: flyte
and the pod-webhook service:
apiVersion: v1
kind: Service
metadata:
  name: flyte-pod-webhook
  namespace: flyte
c
sure
the configmap
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: flyte
    meta.helm.sh/release-namespace: flyte
  creationTimestamp: "2022-08-12T00:31:18Z"
  labels:
    app.kubernetes.io/instance: flyte
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flyteadmin
    helm.sh/chart: flyte-core-v1.1.0
  name: flyte-propeller-config
  namespace: flyte
and pod-webhook service
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: flyte
    meta.helm.sh/release-namespace: flyte
    projectcontour.io/upstream-protocol.h2c: grpc
  creationTimestamp: "2022-08-12T00:31:18Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: flyte-pod-webhook
  namespace: flyte
d
can you paste the data section on the webhook as well? so the full definition.
c
oh, yes
d
i mean the service - that you just pasted
i've finally lost it.
c
here’s the output of
kubectl get  svc/flyte-pod-webhook -o yaml -n flyt
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: flyte
    meta.helm.sh/release-namespace: flyte
    projectcontour.io/upstream-protocol.h2c: grpc
  creationTimestamp: "2022-08-12T00:31:18Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: flyte-pod-webhook
  namespace: flyte
  resourceVersion: "8525789"
  uid: 0ffd7a94-14d1-44db-ae76-34d17dccb0cd
spec:
  clusterIP: 172.27.91.248
  clusterIPs:
  - 172.27.91.248
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: 9443
  selector:
    app: flyte-pod-webhook
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
do you want output of the configmap too?
d
umm we might be alright.
do i see the service port 9443 being mapped to 443 in the pod? do we just have the wrong port in the MutatingWebhookConfiguration? can you try to update it to 9443 rather than 443
service:
      name: flyte-pod-webhook
      namespace: all
      path: /mutate--v1-pod
      port: 9443
mutatingwebhookconfigurations flyte-pod-webhook
c
sure
from
service:
      name: flyte-pod-webhook
      namespace: flyte
      path: /mutate--v1-pod
      port: 443
to
service:
      name: flyte-pod-webhook
      namespace: flyte
      path: /mutate--v1-pod
      port: 9443
that is complete on my end
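A quick way to sanity-check a port change like this is to compare it against the Service definition. In Kubernetes, the port in a webhook's `service` reference is matched against the Service's `port`, not its `targetPort`. The sketch below uses a hypothetical helper, `service_exposes_port`, and a trimmed copy of the Service pasted earlier in the thread.

```python
from typing import Dict


def service_exposes_port(service: Dict, port: int) -> bool:
    """True if the Service lists `port` under spec.ports[].port (targetPort is ignored)."""
    ports = service.get("spec", {}).get("ports", [])
    return any(entry.get("port") == port for entry in ports)


# Trimmed from the flyte-pod-webhook Service shown above.
svc = {
    "spec": {
        "ports": [
            {"name": "https", "port": 443, "protocol": "TCP", "targetPort": 9443}
        ]
    }
}
print(service_exposes_port(svc, 443))   # True
print(service_exposes_port(svc, 9443))  # False
```

With this Service, 443 is the port a webhook configuration can reference; 9443 is only where the Service forwards traffic inside the pod.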
d
yeah, if this doesn't work i'm going to have to defer to our in-house k8s expert. the only issue is he is on the other side of the world, so his hours can be a bit difficult to overlap.
c
got it
should we try and run the task again? and show some logs?
e
yeah I'm kicking it off and getting this
{"json":{"exec_id":"aknhc8ds5wn5z7sql6jb","ns":"flytesnacks-development","res_ver":"10692263","routine":"worker-5","wf":"flytesnacks:development:flyte.workflows.hello_world.wf"},"level":"error","msg":"Error when trying to reconcile workflow. Error [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook \"flyte-pod-webhook.flyte.org\": failed to call webhook: Post \"https://flyte-pod-webhook.flyte.svc:9443/mutate--v1-pod?timeout=10s\": no service port 9443 found for service \"flyte-pod-webhook\"]. Error Type[*errors.NodeErrorWithCause]","ts":"2022-08-17T22:07:16Z"}
E0817 22:07:16.516247       1 workers.go:102] error syncing 'flytesnacks-development/aknhc8ds5wn5z7sql6jb': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-pod-webhook.flyte.svc:9443/mutate--v1-pod?timeout=10s": no service port 9443 found for service "flyte-pod-webhook"
{"json":{"exec_id":"ancq4kjld5lf6cv88bd8","ns":"flytesnacks-development","routine":"worker-4"},"level":"warning","msg":"Workflow not found in cache.","ts":"2022-08-17T22:08:06Z"}
{"json":{"exec_id":"ancq4kjld5lf6cv88bd8","ns":"flytesnacks-development","routine":"worker-4"},"level":"warning","msg":"Workflow namespace[flytesnacks-development]/name[ancq4kjld5lf6cv88bd8] not found, may be deleted.","ts":"2022-08-17T22:08:06Z"}
no service port 9443 found for service
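Three different webhook failures have shown up in this thread so far, and each points somewhere different. A rough triage sketch for propeller log lines follows. The substrings come from the errors quoted in this thread, while the hints and the `triage` helper are an interpretation added for illustration, not Flyte output.

```python
import json
from typing import Optional

# Substring -> hint, checked in order.
WEBHOOK_HINTS = [
    ("context deadline exceed",
     "timeout: the API server could not reach the webhook endpoint in time"),
    ("no service port",
     "port mismatch: the webhook config names a port the Service does not expose"),
    ("not found",
     "missing service: the Service named in the webhook config does not exist"),
]


def triage(log_line: str) -> Optional[str]:
    """Return a hint for a webhook call failure, or None for unrelated lines."""
    try:
        parsed = json.loads(log_line)
        message = parsed.get("msg", "") if isinstance(parsed, dict) else log_line
    except ValueError:
        message = log_line  # plain-text line, e.g. the "E0817 ..." format
    if "failed calling webhook" not in message:
        return None
    for needle, hint in WEBHOOK_HINTS:
        if needle in message:
            return hint
    return "unclassified webhook failure"


print(triage('failed calling webhook "x": no service port 9443 found for service "y"'))
```

The timeout case is the one that tends to indicate a network path problem between the API server and the webhook pod, while the other two are configuration mismatches.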
d
oh sure, that's interesting because that's how our configuration is setup - using 9443 as the target port in service and then mutating configuration endpoint.
@Yuvraj any chance you are seeing something here? I'm sure you'll take one look and it will be obvious!
c
we were able to solve this on our end. the k8s API was unable to connect to the webhook pod. after a security group change allowing the traffic, we were able to inject secrets.
👍 2
y
oh all good?
everything is resolved? would you mind describing how you were able to debug?
c
yep, things are working as intended on our end. at least for that piece 🙂
y
like how did you determine that, did you look at cloudwatch logs, what testing did you do? (sorry if some of this was covered in the messages i missed above)
c
sure - i think there are a few important points:
• being able to have the more descriptive logs for flytepropeller was a big help. however, this line tripped us up a bit. it seemed like there was a connectivity issue from flytepropeller to the webhook pod, when in fact it was an issue from the k8s API nodes to the webhook pod
Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-pod-webhook.flyte.svc:443/mutate--v1-pod?timeout=10s": context deadline exceed
• we were able to determine that it was an issue from the k8s API -> webhook pod by looking at this diagram
❤️ 1
once we knew that the k8s API nodes were trying to talk to the webhook pods on 443, it was a matter of finding the security group that handled that access and adding a rule to allow that traffic. once i did that, i kicked off the workflow and things worked as intended.
there is one more hump, unrelated to this, that we need to get over. which has to do with private registry access. i’ll start a new conversation about that in #onboarding. but we’re good on secrets injection for now. thanks for the help. lmk if there is any other information that would be helpful on our side.
d
@Chris Antenesse @Eric Hsiao this is great to hear! glad we were able to get this resolved. we're going to update the configuration failure policy to fail by default and that should bubble up these issues into the console quicker in the future. once we know the webhook is unable to be called it should ease debugging. thanks for being so patient with this fix!
🙏 2
Also, I know we have some sparse documentation on private image registries, we can explore it a bit further in the new thread you're planning on starting. it would be great to fill this out a bit more with any questions you have!
c
nice! i’ve been through the docs and hitting some speed bumps. will surface those today.
🙌 1