# flyte-deployment
Brian Tang:
`flyte-binary` logs don’t show anything substantial
how do we get access to the internal logs of the flyte-binary? e.g. the flyte-admin logs?
looks like the `flyte-binary` container is missing AWS env vars like `AWS_ROLE_ARN`, and `volumes` is also missing `aws-iam-token`
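A minimal sketch of how both points above could be checked (commands are not from the thread; the release and namespace names are taken from the helm output further down):

```
# flyteadmin, flytepropeller, etc. all run inside the single flyte-binary container,
# so kubectl logs on the deployment is where the internal logs land
kubectl logs deployment/flyte-backend-flyte-binary -n flyte --tail=200

# Check whether the IRSA env vars (e.g. AWS_ROLE_ARN) and the projected
# aws-iam-token volume were actually injected into the container
kubectl get pod -n flyte -l app.kubernetes.io/name=flyte-binary \
  -o jsonpath='{.items[0].spec.containers[0].env}{"\n"}{.items[0].spec.volumes[*].name}{"\n"}'
```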
David Espejo (he/him):
@Brian Tang what was the original issue?
Brian Tang:
tried to perform a helm upgrade from `1.7.0` to `1.8.1`, but had an error about some label immutability after applying `helm upgrade`, which i think is related to this. after that we did a rollback, but the rollback deployment pods had the dreaded `connection refused` error on port `8088`. couldn’t figure it out, and finally decided to do a complete reinstall of flyte-binary at `1.9`. i could do this because it was our dev cluster - but it still resulted in downtime for a few of our devs
also, @David Espejo (he/him), i think the error logs can be improved significantly. i forgot to annotate the service account after the reinstall, but the flyte-binary logs had nothing in them. the flyte-binary pod would show `Running` without issues, yet it was failing the readiness probe (seen via `kubectl describe po`). port-forwarding to the http service was also failing. i “figured out” how to debug it by sheer luck: `kubectl rollout restart deployment`, then immediately port-forward the service and open the flyte console in the browser - i could see our workflows, but the metadata details page showed a 403 error when hitting s3. after which (around 20-30 secs later) the port-forwarding would fail again. that was when i realised the annotations were missing
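For reference, a quick sketch of how the missing annotation could be caught without the restart-and-race trick (assuming IRSA on EKS; the role ARN below is a placeholder):

```
# Show the annotations on the service account managed by the chart
kubectl get serviceaccount flyte-backend-flyte-binary -n flyte \
  -o jsonpath='{.metadata.annotations}{"\n"}'

# Re-add the IRSA annotation if it is missing (placeholder ARN)
kubectl annotate serviceaccount flyte-backend-flyte-binary -n flyte \
  eks.amazonaws.com/role-arn=arn:aws:iam::<ACCOUNT_ID>:role/<FLYTE_ROLE> --overwrite
```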
and i still have no idea why the rollback failed, which is scary to think about if we ever had to roll back the prod cluster. the rollback’s issue was similar, with the connection refused on 8088 - tbf i can’t quite remember if i checked the service account annotations, but i’m fairly certain i did. also, helm rollback should’ve preserved the service account annotations as well, right?
David Espejo (he/him):
@Brian Tang right, whatever was configured in the previous Helm release should have been preserved, including the annotations which, if I understand correctly, were the main cause of the issue. Right now, rolling back between these versions may need a reinstall instead. Since the Flyte components persist everything in the DB/blob store, a reinstall doesn't lose data or state. Nevertheless, I think this needs deeper exploration. Would you mind creating an issue?
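One way to check what the earlier rollback actually restored would be to compare the revisions Helm kept (a sketch; the revision number here is illustrative):

```
# List revisions, then inspect the values and the rendered ServiceAccount
# of the revision that was rolled back to
helm history flyte-backend -n flyte
helm get values flyte-backend -n flyte --revision 2
helm get manifest flyte-backend -n flyte --revision 2 | grep -B2 -A8 'kind: ServiceAccount'
```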
Brian Tang:
hi @David Espejo (he/him) - we’re upgrading our other cluster now from `v1.7.0` to `v1.9.1` and i can confirm the error from above is recurring:
```
$ helm history flyte-backend -n flyte
REVISION        UPDATED                         STATUS          CHART                   APP VERSION     DESCRIPTION
1               Tue May  9 15:29:51 2023        superseded      flyte-binary-v1.5.0     1.16.0          Install complete
2               Tue Jul 11 12:49:23 2023        deployed        flyte-binary-v1.7.0     1.16.0          Upgrade complete
3               Wed Oct 18 11:55:31 2023        failed          flyte-binary-v1.9.1     1.16.0          Upgrade "flyte-backend" failed: cannot patch "flyte-backend-flyte-binary" with kind Deployment: Deployment.apps "flyte-backend-flyte-binary" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"flyte-binary", "app.kubernetes.io/instance":"flyte-backend", "app.kubernetes.io/name":"flyte-binary"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
```
the command:
```
helm upgrade flyte-backend flyteorg/flyte-binary --namespace flyte -f flyte-prod-values.yaml --version v1.9.1 --debug
```
the output:
```
upgrade.go:144: [debug] preparing upgrade for flyte-backend
upgrade.go:152: [debug] performing update for flyte-backend
upgrade.go:324: [debug] creating upgraded release for flyte-backend
client.go:396: [debug] checking 10 resources for changes
client.go:684: [debug] Patch ServiceAccount "flyte-backend-flyte-binary" in namespace flyte
client.go:417: [debug] Created a new Secret called "flyte-backend-flyte-binary-config-secret" in flyte

client.go:684: [debug] Patch ConfigMap "flyte-backend-flyte-binary-cluster-resource-templates" in namespace flyte
client.go:684: [debug] Patch ConfigMap "flyte-backend-flyte-binary-config" in namespace flyte
client.go:684: [debug] Patch ClusterRole "flyte-backend-flyte-binary-cluster-role" in namespace 
client.go:684: [debug] Patch ClusterRoleBinding "flyte-backend-flyte-binary-cluster-role-binding" in namespace 
client.go:684: [debug] Patch Service "flyte-backend-flyte-binary-grpc" in namespace flyte
client.go:684: [debug] Patch Service "flyte-backend-flyte-binary-http" in namespace flyte
client.go:684: [debug] Patch Service "flyte-backend-flyte-binary-webhook" in namespace flyte
client.go:684: [debug] Patch Deployment "flyte-backend-flyte-binary" in namespace flyte
client.go:428: [debug] error updating the resource "flyte-backend-flyte-binary":
         cannot patch "flyte-backend-flyte-binary" with kind Deployment: Deployment.apps "flyte-backend-flyte-binary" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"flyte-binary", "app.kubernetes.io/instance":"flyte-backend", "app.kubernetes.io/name":"flyte-binary"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
upgrade.go:436: [debug] warning: Upgrade "flyte-backend" failed: cannot patch "flyte-backend-flyte-binary" with kind Deployment: Deployment.apps "flyte-backend-flyte-binary" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"flyte-binary", "app.kubernetes.io/instance":"flyte-backend", "app.kubernetes.io/name":"flyte-binary"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
Error: UPGRADE FAILED: cannot patch "flyte-backend-flyte-binary" with kind Deployment: Deployment.apps "flyte-backend-flyte-binary" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"flyte-binary", "app.kubernetes.io/instance":"flyte-backend", "app.kubernetes.io/name":"flyte-binary"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
helm.go:84: [debug] cannot patch "flyte-backend-flyte-binary" with kind Deployment: Deployment.apps "flyte-backend-flyte-binary" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"flyte-binary", "app.kubernetes.io/instance":"flyte-backend", "app.kubernetes.io/name":"flyte-binary"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
helm.sh/helm/v3/pkg/kube.(*Client).Update
        helm.sh/helm/v3/pkg/kube/client.go:441
helm.sh/helm/v3/pkg/action.(*Upgrade).releasingUpgrade
        helm.sh/helm/v3/pkg/action/upgrade.go:378
runtime.goexit
        runtime/asm_amd64.s:1598
UPGRADE FAILED
main.newUpgradeCmd.func2
        helm.sh/helm/v3/cmd/helm/upgrade.go:203
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/cobra@v1.6.1/command.go:916
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/cobra@v1.6.1/command.go:1044
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/cobra@v1.6.1/command.go:968
main.main
        helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
        runtime/proc.go:250
runtime.goexit
        runtime/asm_amd64.s:1598
```
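Not from this thread, but the usual way around a `spec.selector` immutability error is to delete just the Deployment (its selector cannot be patched in place) and re-run the upgrade so Helm recreates it. A sketch, untested against this chart:

```
# --cascade=orphan removes the Deployment object but leaves the current pods running;
# the orphaned pods may need manual cleanup once the new ReplicaSet is up
kubectl delete deployment flyte-backend-flyte-binary -n flyte --cascade=orphan
helm upgrade flyte-backend flyteorg/flyte-binary --namespace flyte \
  -f flyte-prod-values.yaml --version v1.9.1
```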
and after failing to upgrade, our flyte-binary is now inaccessible - port-forwarding to the flyte http service hangs. the pod doesn’t complain and doesn’t restart, so from its perspective there are no issues. but the logs in the flyte-binary show:
```
[24.801ms] [rows:1] SELECT count(*) FROM pg_indexes WHERE tablename = 'artifacts' AND indexname = 'artifacts_dataset_uuid_idx' AND schemaname = CURRENT_SCHEMA()
{"metrics-prefix":"flyte:","certDir":"/var/run/flyte/certs","localCert":true,"listenPort":9443,"serviceName":"flyte-backend-flyte-binary-webhook","servicePort":443,"secretName":"flyte-backend-flyte-binary-webhook-secret","secretManagerType":"K8s","awsSecretManager":{"sidecarImage":"<http://docker.io/amazon/aws-secrets-manager-secret-sidecar:v0.1.4|docker.io/amazon/aws-secrets-manager-secret-sidecar:v0.1.4>","resources":{"limits":{"cpu":"200m","memory":"500Mi"},"requests":{"cpu":"200m","memory":"500Mi"}}},"gcpSecretManager":{"sidecarImage":"<http://gcr.io/google.com/cloudsdktool/cloud-sdk:alpine|gcr.io/google.com/cloudsdktool/cloud-sdk:alpine>","resources":{"limits":{"cpu":"200m","memory":"500Mi"},"requests":{"cpu":"200m","memory":"500Mi"}}},"vaultSecretManager":{"role":"flyte","kvVersion":"2","annotations":null}}
I0911 12:44:36.832423       7 request.go:601] Waited for 1.352591716s due to client-side throttling, not priority and fairness, request: GET:<https://10.100.0.1:443/apis/vpcresources.k8s.aws/v1beta1?timeout=30s>
I0911 13:20:52.014500       7 trace.go:205] Trace[279428830]: "DeltaFIFO Pop Process" ID:kube-system/ebs-csi-controller-596bfbdf75-djfnv,Depth:18,Reason:slow event handlers blocking the queue (11-Sep-2023 13:20:51.591) (total time: 414ms):
Trace[279428830]: [414.600447ms] [414.600447ms] END
I0912 10:29:01.064054       7 trace.go:205] Trace[1812686024]: "DeltaFIFO Pop Process" ID:kube-system/ebs-csi-node-qwwfr,Depth:18,Reason:slow event handlers blocking the queue (12-Sep-2023 10:29:00.865) (total time: 198ms):
Trace[1812686024]: [198.389967ms] [198.389967ms] END
I0912 13:07:59.000063       7 trace.go:205] Trace[443568182]: "DeltaFIFO Pop Process" ID:amazon-cloudwatch/cloudwatch-agent-rzz4p,Depth:17,Reason:slow event handlers blocking the queue (12-Sep-2023 13:07:58.736) (total time: 263ms):
Trace[443568182]: [263.796642ms] [263.796642ms] END
I0914 04:57:03.641797       7 trace.go:205] Trace[1871772048]: "DeltaFIFO Pop Process" ID:flytesnacks-development/f87a90168412e4b8eab0-n0-0,Depth:26,Reason:slow event handlers blocking the queue (14-Sep-2023 04:57:03.055) (total time: 243ms):
Trace[1871772048]: [243.657613ms] [243.657613ms] END
I0914 05:20:05.078879       7 trace.go:205] Trace[1514892829]: "DeltaFIFO Pop Process" ID:kube-system/aws-node-27rqr,Depth:26,Reason:slow event handlers blocking the queue (14-Sep-2023 05:20:04.972) (total time: 106ms):
Trace[1514892829]: [106.004169ms] [106.004169ms] END
I0914 07:31:03.122288       7 trace.go:205] Trace[698337872]: "DeltaFIFO Pop Process" ID:amazon-cloudwatch/fluent-bit-sk6nk,Depth:27,Reason:slow event handlers blocking the queue (14-Sep-2023 07:31:02.982) (total time: 139ms):
Trace[698337872]: [139.593986ms] [139.593986ms] END
I0914 07:45:03.559135       7 trace.go:205] Trace[2045504481]: "DeltaFIFO Pop Process" ID:amazon-cloudwatch/fluent-bit-mmsln,Depth:26,Reason:slow event handlers blocking the queue (14-Sep-2023 07:45:03.421) (total time: 137ms):
Trace[2045504481]: [137.224041ms] [137.224041ms] END
E0914 07:56:03.560064       7 workers.go:102] error syncing 'flytesnacks-development/fda86618d05614014af2': Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com "fda86618d05614014af2": the object has been modified; please apply your changes to the latest version and try again
W0920 20:43:41.245050       7 reflector.go:442] pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: got short buffer with n=0, base=4092, cap=20480") has prevented the request from succeeding
I1001 11:25:03.215956       7 trace.go:205] Trace[313363901]: "DeltaFIFO Pop Process" ID:amazon-cloudwatch/fluent-bit-sk6nk,Depth:17,Reason:slow event handlers blocking the queue (01-Oct-2023 11:25:02.058) (total time: 163ms):
Trace[313363901]: [163.660953ms] [163.660953ms] END
I1002 08:43:16.679835       7 request.go:601] Waited for 1.079424313s due to client-side throttling, not priority and fairness, request: GET:https://10.100.0.1:443/apis/coordination.k8s.io/v1?timeout=30s
I1010 17:21:03.096290       7 trace.go:205] Trace[2066149144]: "DeltaFIFO Pop Process" ID:kube-system/aws-node-s7zbh,Depth:37,Reason:slow event handlers blocking the queue (10-Oct-2023 17:21:02.980) (total time: 110ms):
Trace[2066149144]: [110.183772ms] [110.183772ms] END
I1017 13:30:04.865146       7 trace.go:205] Trace[1012800702]: "DeltaFIFO Pop Process" ID:amazon-cloudwatch/cloudwatch-agent-7wfhg,Depth:17,Reason:slow event handlers blocking the queue (17-Oct-2023 13:30:04.677) (total time: 113ms):
Trace[1012800702]: [113.7567ms] [113.7567ms] END
```
as before, we’ll uninstall the current release completely and do a fresh `helm install` of v1.9.1 - which is not optimal
to summarise the previous issue and this one:
1. the upgrade from v1.7.0 to v1.9.1 failed (both clusters)
2. rolling back to the previous release failed (we only did this for the first cluster from our earlier conversation; for the second cluster - the one i just described - we went straight for the uninstall route)
3. fresh install of v1.9.1
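For reference, the uninstall-and-reinstall path described above, as a sketch (release name, values file, and version are the ones used earlier in the thread):

```
helm uninstall flyte-backend -n flyte
helm install flyte-backend flyteorg/flyte-binary --namespace flyte \
  -f flyte-prod-values.yaml --version v1.9.1
# remember to re-check the IRSA annotation on the service account afterwards
# (see the earlier sketch) if it is not set through the values file
```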