Slackbot
10/03/2023, 9:20 AM
Brian Tang
10/03/2023, 9:22 AM
flyte-binary logs don’t show anything substantial
flyte-binary container is missing AWS env vars like the AWS_ROLE_ARN
volumes is also missing aws-iam-token
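The check described in this message can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `missing_irsa_pieces` is a hypothetical helper, the pod is assumed to be available as a dict (for example parsed from `kubectl get pod <name> -o json`), and `AWS_WEB_IDENTITY_TOKEN_FILE` is included because the EKS pod identity webhook normally injects it alongside `AWS_ROLE_ARN`, although the thread only names `AWS_ROLE_ARN` and the `aws-iam-token` volume.

```python
# Hypothetical helper, not part of Flyte or kubectl.
# AWS_WEB_IDENTITY_TOKEN_FILE is an assumption; the thread only mentions AWS_ROLE_ARN.
EXPECTED_ENV = ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE")
EXPECTED_VOLUME = "aws-iam-token"


def missing_irsa_pieces(pod: dict, container_name: str) -> list:
    """Return a list of human-readable problems; an empty list means nothing is missing."""
    spec = pod.get("spec", {})
    containers = [c for c in spec.get("containers", []) if c.get("name") == container_name]
    if not containers:
        return [f"container {container_name!r} not found"]

    problems = []
    env_names = {e.get("name") for e in containers[0].get("env", [])}
    for name in EXPECTED_ENV:
        if name not in env_names:
            problems.append(f"env var {name} missing")

    volume_names = {v.get("name") for v in spec.get("volumes", [])}
    if EXPECTED_VOLUME not in volume_names:
        problems.append(f"volume {EXPECTED_VOLUME} missing")
    return problems


# Made-up pod resembling the broken one described above.
broken_pod = {
    "spec": {
        "containers": [{"name": "flyte-binary", "env": [{"name": "POD_NAME"}]}],
        "volumes": [{"name": "config"}],
    }
}

for problem in missing_irsa_pieces(broken_pod, "flyte-binary"):
    print(problem)
```

If the list is non-empty for a freshly restarted pod, the IAM role injection presumably never happened for that pod, which matches the symptoms reported here.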
David Espejo (he/him)
10/03/2023, 9:26 PM
Brian Tang
10/04/2023, 3:09 AM
1.7.0 to 1.8.1, but had an error about some label immutability after applying helm upgrade, which I think is related to this.
After which we did a rollback, but the rollback deployment pods had the dreaded connection refused error to 8088.
Couldn’t figure it out, and finally decided to do a complete reinstall of flyte-binary to 1.9. I could do this because it was in our dev cluster, but it still resulted in downtime for a few of our devs.
Running without issues, but it was failing the readiness probe (from kubectl describe po). Port-forwarding to the http service was also failing.
I “figured” out how to debug it by sheer luck: kubectl rollout restart deployment and immediately port-forward the service. I went to the Flyte console in the browser and could see our workflows, but the metadata details page showed a 403 error when hitting S3. After which (around 20-30 secs later), the port-forwarding would fail. This was when I realised the missing annotations.
David Espejo (he/him)
10/05/2023, 6:32 PM
Brian Tang
10/18/2023, 4:00 AM
v1.7.0 to v1.9.1 and I can confirm the error from above is recurring:
$ helm history flyte-backend -n flyte
REVISION  UPDATED                   STATUS      CHART                APP VERSION  DESCRIPTION
1         Tue May  9 15:29:51 2023  superseded  flyte-binary-v1.5.0  1.16.0       Install complete
2         Tue Jul 11 12:49:23 2023  deployed    flyte-binary-v1.7.0  1.16.0       Upgrade complete
3         Wed Oct 18 11:55:31 2023  failed      flyte-binary-v1.9.1  1.16.0       Upgrade "flyte-backend" failed: cannot patch "flyte-backend-flyte-binary" with kind Deployment: Deployment.apps "flyte-backend-flyte-binary" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"flyte-binary", "app.kubernetes.io/instance":"flyte-backend", "app.kubernetes.io/name":"flyte-binary"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
helm upgrade flyte-backend flyteorg/flyte-binary --namespace flyte -f flyte-prod-values.yaml --version v1.9.1 --debug
the output:
upgrade.go:144: [debug] preparing upgrade for flyte-backend
upgrade.go:152: [debug] performing update for flyte-backend
upgrade.go:324: [debug] creating upgraded release for flyte-backend
client.go:396: [debug] checking 10 resources for changes
client.go:684: [debug] Patch ServiceAccount "flyte-backend-flyte-binary" in namespace flyte
client.go:417: [debug] Created a new Secret called "flyte-backend-flyte-binary-config-secret" in flyte
client.go:684: [debug] Patch ConfigMap "flyte-backend-flyte-binary-cluster-resource-templates" in namespace flyte
client.go:684: [debug] Patch ConfigMap "flyte-backend-flyte-binary-config" in namespace flyte
client.go:684: [debug] Patch ClusterRole "flyte-backend-flyte-binary-cluster-role" in namespace
client.go:684: [debug] Patch ClusterRoleBinding "flyte-backend-flyte-binary-cluster-role-binding" in namespace
client.go:684: [debug] Patch Service "flyte-backend-flyte-binary-grpc" in namespace flyte
client.go:684: [debug] Patch Service "flyte-backend-flyte-binary-http" in namespace flyte
client.go:684: [debug] Patch Service "flyte-backend-flyte-binary-webhook" in namespace flyte
client.go:684: [debug] Patch Deployment "flyte-backend-flyte-binary" in namespace flyte
client.go:428: [debug] error updating the resource "flyte-backend-flyte-binary":
cannot patch "flyte-backend-flyte-binary" with kind Deployment: Deployment.apps "flyte-backend-flyte-binary" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"flyte-binary", "app.kubernetes.io/instance":"flyte-backend", "app.kubernetes.io/name":"flyte-binary"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
upgrade.go:436: [debug] warning: Upgrade "flyte-backend" failed: cannot patch "flyte-backend-flyte-binary" with kind Deployment: Deployment.apps "flyte-backend-flyte-binary" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"flyte-binary", "app.kubernetes.io/instance":"flyte-backend", "app.kubernetes.io/name":"flyte-binary"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
Error: UPGRADE FAILED: cannot patch "flyte-backend-flyte-binary" with kind Deployment: Deployment.apps "flyte-backend-flyte-binary" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"flyte-binary", "app.kubernetes.io/instance":"flyte-backend", "app.kubernetes.io/name":"flyte-binary"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
helm.go:84: [debug] cannot patch "flyte-backend-flyte-binary" with kind Deployment: Deployment.apps "flyte-backend-flyte-binary" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"flyte-binary", "app.kubernetes.io/instance":"flyte-backend", "app.kubernetes.io/name":"flyte-binary"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
helm.sh/helm/v3/pkg/kube.(*Client).Update
helm.sh/helm/v3/pkg/kube/client.go:441
helm.sh/helm/v3/pkg/action.(*Upgrade).releasingUpgrade
helm.sh/helm/v3/pkg/action/upgrade.go:378
runtime.goexit
runtime/asm_amd64.s:1598
UPGRADE FAILED
main.newUpgradeCmd.func2
helm.sh/helm/v3/cmd/helm/upgrade.go:203
github.com/spf13/cobra.(*Command).execute
github.com/spf13/cobra@v1.6.1/command.go:916
github.com/spf13/cobra.(*Command).ExecuteC
github.com/spf13/cobra@v1.6.1/command.go:1044
github.com/spf13/cobra.(*Command).Execute
github.com/spf13/cobra@v1.6.1/command.go:968
main.main
helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
runtime/proc.go:250
runtime.goexit
runtime/asm_amd64.s:1598
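The failure above comes down to `spec.selector` of an existing Deployment being immutable: if the new chart renders different `matchLabels` than the live object has, the patch is rejected. A minimal Python sketch of that comparison follows. `selector_changes` is a hypothetical helper, the rendered selector is copied from the error message (which appears to print the value the new chart tried to apply), and the live selector is an assumption, since the thread never shows the Deployment created by v1.7.0.

```python
def selector_changes(live: dict, rendered: dict) -> dict:
    """Map each matchLabels key that differs to a (live value, rendered value) pair."""
    keys = sorted(set(live) | set(rendered))
    return {
        key: (live.get(key), rendered.get(key))
        for key in keys
        if live.get(key) != rendered.get(key)
    }


# Selector taken from the error message above.
rendered_selector = {
    "app.kubernetes.io/component": "flyte-binary",
    "app.kubernetes.io/instance": "flyte-backend",
    "app.kubernetes.io/name": "flyte-binary",
}

# HYPOTHETICAL live selector: assumed to lack the component label.
# The real value would come from the running Deployment's spec.selector.matchLabels.
live_selector = {
    "app.kubernetes.io/instance": "flyte-backend",
    "app.kubernetes.io/name": "flyte-binary",
}

changes = selector_changes(live_selector, rendered_selector)
if changes:
    print("spec.selector would change; the patch will be rejected:")
    for key, (old, new) in changes.items():
        print(f"  {key}: {old!r} -> {new!r}")
```

A non-empty result means Kubernetes will refuse the in-place patch, so the Deployment would have to be deleted and recreated for the new selector to take effect.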
[24.801ms] [rows:1] SELECT count(*) FROM pg_indexes WHERE tablename = 'artifacts' AND indexname = 'artifacts_dataset_uuid_idx' AND schemaname = CURRENT_SCHEMA()
{"metrics-prefix":"flyte:","certDir":"/var/run/flyte/certs","localCert":true,"listenPort":9443,"serviceName":"flyte-backend-flyte-binary-webhook","servicePort":443,"secretName":"flyte-backend-flyte-binary-webhook-secret","secretManagerType":"K8s","awsSecretManager":{"sidecarImage":"docker.io/amazon/aws-secrets-manager-secret-sidecar:v0.1.4","resources":{"limits":{"cpu":"200m","memory":"500Mi"},"requests":{"cpu":"200m","memory":"500Mi"}}},"gcpSecretManager":{"sidecarImage":"gcr.io/google.com/cloudsdktool/cloud-sdk:alpine","resources":{"limits":{"cpu":"200m","memory":"500Mi"},"requests":{"cpu":"200m","memory":"500Mi"}}},"vaultSecretManager":{"role":"flyte","kvVersion":"2","annotations":null}}
I0911 12:44:36.832423 7 request.go:601] Waited for 1.352591716s due to client-side throttling, not priority and fairness, request: GET:https://10.100.0.1:443/apis/vpcresources.k8s.aws/v1beta1?timeout=30s
I0911 13:20:52.014500 7 trace.go:205] Trace[279428830]: "DeltaFIFO Pop Process" ID:kube-system/ebs-csi-controller-596bfbdf75-djfnv,Depth:18,Reason:slow event handlers blocking the queue (11-Sep-2023 13:20:51.591) (total time: 414ms):
Trace[279428830]: [414.600447ms] [414.600447ms] END
I0912 10:29:01.064054 7 trace.go:205] Trace[1812686024]: "DeltaFIFO Pop Process" ID:kube-system/ebs-csi-node-qwwfr,Depth:18,Reason:slow event handlers blocking the queue (12-Sep-2023 10:29:00.865) (total time: 198ms):
Trace[1812686024]: [198.389967ms] [198.389967ms] END
I0912 13:07:59.000063 7 trace.go:205] Trace[443568182]: "DeltaFIFO Pop Process" ID:amazon-cloudwatch/cloudwatch-agent-rzz4p,Depth:17,Reason:slow event handlers blocking the queue (12-Sep-2023 13:07:58.736) (total time: 263ms):
Trace[443568182]: [263.796642ms] [263.796642ms] END
I0914 04:57:03.641797 7 trace.go:205] Trace[1871772048]: "DeltaFIFO Pop Process" ID:flytesnacks-development/f87a90168412e4b8eab0-n0-0,Depth:26,Reason:slow event handlers blocking the queue (14-Sep-2023 04:57:03.055) (total time: 243ms):
Trace[1871772048]: [243.657613ms] [243.657613ms] END
I0914 05:20:05.078879 7 trace.go:205] Trace[1514892829]: "DeltaFIFO Pop Process" ID:kube-system/aws-node-27rqr,Depth:26,Reason:slow event handlers blocking the queue (14-Sep-2023 05:20:04.972) (total time: 106ms):
Trace[1514892829]: [106.004169ms] [106.004169ms] END
I0914 07:31:03.122288 7 trace.go:205] Trace[698337872]: "DeltaFIFO Pop Process" ID:amazon-cloudwatch/fluent-bit-sk6nk,Depth:27,Reason:slow event handlers blocking the queue (14-Sep-2023 07:31:02.982) (total time: 139ms):
Trace[698337872]: [139.593986ms] [139.593986ms] END
I0914 07:45:03.559135 7 trace.go:205] Trace[2045504481]: "DeltaFIFO Pop Process" ID:amazon-cloudwatch/fluent-bit-mmsln,Depth:26,Reason:slow event handlers blocking the queue (14-Sep-2023 07:45:03.421) (total time: 137ms):
Trace[2045504481]: [137.224041ms] [137.224041ms] END
E0914 07:56:03.560064 7 workers.go:102] error syncing 'flytesnacks-development/fda86618d05614014af2': Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com "fda86618d05614014af2": the object has been modified; please apply your changes to the latest version and try again
W0920 20:43:41.245050 7 reflector.go:442] pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: got short buffer with n=0, base=4092, cap=20480") has prevented the request from succeeding
I1001 11:25:03.215956 7 trace.go:205] Trace[313363901]: "DeltaFIFO Pop Process" ID:amazon-cloudwatch/fluent-bit-sk6nk,Depth:17,Reason:slow event handlers blocking the queue (01-Oct-2023 11:25:02.058) (total time: 163ms):
Trace[313363901]: [163.660953ms] [163.660953ms] END
I1002 08:43:16.679835 7 request.go:601] Waited for 1.079424313s due to client-side throttling, not priority and fairness, request: GET:https://10.100.0.1:443/apis/coordination.k8s.io/v1?timeout=30s
I1010 17:21:03.096290 7 trace.go:205] Trace[2066149144]: "DeltaFIFO Pop Process" ID:kube-system/aws-node-s7zbh,Depth:37,Reason:slow event handlers blocking the queue (10-Oct-2023 17:21:02.980) (total time: 110ms):
Trace[2066149144]: [110.183772ms] [110.183772ms] END
I1017 13:30:04.865146 7 trace.go:205] Trace[1012800702]: "DeltaFIFO Pop Process" ID:amazon-cloudwatch/cloudwatch-agent-7wfhg,Depth:17,Reason:slow event handlers blocking the queue (17-Oct-2023 13:30:04.677) (total time: 113ms):
Trace[1012800702]: [113.7567ms] [113.7567ms] EN