#flyte-deployment

Harry Souris

05/19/2023, 7:28 PM
Hi, I get these errors on my Mac M1 when following the quickstart tutorial. Thank you for the tool and your time.
"Error syncing pod, skipping" err="failed to \"StartContainer\" for \"minio\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=minio pod=flyte-sandbox-minio-645c8ddf7c-rrj5x_flyte(9d8637fe-a97e-40e7-a5a6-985bc0fbc21b)\"" pod="flyte/flyte-sandbox-minio-645c8ddf7c-rrj5x" podUID=9d8637fe-a97e-40e7-a5a6-985bc0fbc21b
I0519 19:26:18.275183 59 scope.go:110] "RemoveContainer" containerID="38a7520207d8aaaa55ca89553770b83c086e9d2e671580fbbe85f0b34cc8608a"
E0519 19:26:18.276055 59 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"flyte\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=flyte pod=flyte-sandbox-b789778f6-c6lxd_flyte(94b613f8-e323-4cf8-b250-bfcfc1bb1f4a)\"" pod="flyte/flyte-sandbox-b789778f6-c6lxd" podUID=94b613f8-e323-4cf8-b250-bfcfc1bb1f4a

Yee

05/19/2023, 8:41 PM
what was the command you ran to start everything?

Harry Souris

05/19/2023, 8:47 PM
this one: flytectl demo start
I also have no_proxy=localhost,127.0.0.1 set
I see this too:
INFO[0000] [0] Couldn't find a config file []. Relying on env vars and pflags.

Yee

05/19/2023, 9:11 PM
i don’t understand the significance of the no_proxy… where is that set?
you mean as env var?
on host or in the container?
can you
kubectl -n flyte get pod
and
kubectl -n flyte describe <all failing pods>
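A minimal sketch of the check requested above: pick out the pods in the flyte namespace that are not fully ready, then describe each one. The `failing_pods` helper is hypothetical (it is not part of kubectl or flytectl) and assumes the default column layout of `kubectl get pods`.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints the names of pods that are not fully ready.
failing_pods() {
  awk '{
    split($2, ready, "/")   # READY column, e.g. "0/1"
    if ($3 != "Completed" && (ready[1] != ready[2] || $3 != "Running"))
      print $1
  }'
}

# Usage against the sandbox cluster (needs kubectl access):
#   kubectl -n flyte get pods --no-headers | failing_pods |
#     xargs -n 1 kubectl -n flyte describe pod
```

Filtering on both READY and STATUS catches pods that are in CrashLoopBackOff as well as pods that are Running but failing their readiness probe.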

Harry Souris

05/19/2023, 9:28 PM
Well, I don't know how or why, but after deleting everything and pulling a new image I can see the Flyte console.
When executing the example I get an error.

Yee

05/19/2023, 9:28 PM
i see
what’s the error?

Harry Souris

05/19/2023, 9:29 PM
pyflyte run --remote wine_flyte.py training_workflow --hyperparameters '{"C": 0.1}'
Failed with Unknown Exception <class 'requests.exceptions.ConnectionError'> Reason: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Yee

05/19/2023, 9:29 PM
kubectl get pods -n flyte
anything failing?

Harry Souris

05/19/2023, 9:29 PM
NAME                                                  READY   STATUS             RESTARTS      AGE
flyte-sandbox-docker-registry-7744c9999-lbxzc         1/1     Running            0             5m51s
flyte-sandbox-proxy-d95874857-r5lzh                   1/1     Running            0             5m51s
flyte-sandbox-kubernetes-dashboard-6757db879c-6t2rf   1/1     Running            0             5m51s
flyte-sandbox-b789778f6-hw96t                         1/1     Running            0             5m51s
flyte-sandbox-postgresql-0                            1/1     Running            1 (77s ago)   5m51s
flyte-sandbox-minio-645c8ddf7c-cchgz                  0/1     CrashLoopBackOff   5 (23s ago)   5m51s

Yee

05/19/2023, 9:29 PM
i see

Harry Souris

05/19/2023, 9:29 PM
I think it's the resources in my cluster?

Yee

05/19/2023, 9:30 PM
yeah minio shouldn’t be crashing
describe it and logs -p
it probably doesn’t have logs if it’s crashing but it might
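The two steps above, sketched as one hypothetical helper. `kubectl logs -p` is short for `--previous`, which prints the logs of the last terminated instance of the container. The `inspect_pod` name and the `KUBECTL` override are assumptions added for illustration; setting `KUBECTL=echo` previews the commands without touching a cluster.

```shell
# Assumption: KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: show the events and the previous container's logs for one pod.
inspect_pod() {
  pod="$1"
  ns="${2:-flyte}"   # namespace, defaults to the sandbox's "flyte" namespace
  # The Events section at the end of describe usually explains restarts.
  "$KUBECTL" describe pod "$pod" -n "$ns"
  # Logs from the last terminated container instance, if there was one.
  "$KUBECTL" logs "$pod" -n "$ns" --previous
}

# Usage: inspect_pod flyte-sandbox-minio-645c8ddf7c-cchgz
```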

Harry Souris

05/19/2023, 9:33 PM
no logs
I think I will give the cluster more resources and more space, restart it, and let you know.

Yee

05/19/2023, 9:33 PM
k
describe often has more info too, in the events section

Harry Souris

05/19/2023, 9:38 PM
by the way, why do I see this?
INFO[0000] [0] Couldn't find a config file []. Relying on env vars and pflags.

Yee

05/19/2023, 9:38 PM
depends on where you’re seeing it

Harry Souris

05/19/2023, 9:38 PM
even though I have set up the config file correctly

Yee

05/19/2023, 9:38 PM
copy paste the full command/stacktrace

Harry Souris

05/19/2023, 9:39 PM
flytectl demo start
INFO[0000] [0] Couldn't find a config file []. Relying on env vars and pflags.
🧑‍🏭 Bootstrapping a brand new flyte cluster... 🔨 🔧
🐋 Going to use Flyte v1.6.0 release with image cr.flyte.org/flyteorg/flyte-sandbox-bundled:sha-d391691c6db314da7298520e4fc83b2f5fe01eb9
🐋 pulling docker image for release cr.flyte.org/flyteorg/flyte-sandbox-bundled:sha-d391691c6db314da7298520e4fc83b2f5fe01eb9

Yee

05/19/2023, 9:39 PM
oh i think that’s fine actually
shouldn’t matter for that command
but it’s looking for a file at
~/.flyte/config.yaml
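A quick way to confirm whether that file is actually there, as a minimal sketch. The `check_flyte_config` helper is hypothetical, and the only path it assumes is the one named above.

```shell
# Hypothetical helper: report whether a Flyte config file exists at the given path
# (defaults to the location mentioned above).
check_flyte_config() {
  cfg="${1:-$HOME/.flyte/config.yaml}"
  if [ -f "$cfg" ]; then
    echo "found config: $cfg"
  else
    echo "no config at $cfg"
    return 1
  fi
}

# Usage: check_flyte_config            # checks ~/.flyte/config.yaml
#        check_flyte_config /some/other/config.yaml
```

If the file is missing, `flytectl config init` is expected to generate one, but check `flytectl config --help` for the version you have installed.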

Harry Souris

05/19/2023, 9:45 PM
waiting to start ….
now i have a problem with the flyte console
flyte-sandbox-proxy-d95874857-6hqf2                   1/1     Running            0                9m2s
flyte-sandbox-kubernetes-dashboard-6757db879c-txnwr   1/1     Running            0                9m2s
flyte-sandbox-docker-registry-7744c9999-kw497         1/1     Running            0                9m2s
flyte-sandbox-minio-645c8ddf7c-q8492                  1/1     Running            5 (3m44s ago)    9m2s
flyte-sandbox-postgresql-0                            1/1     Running            2 (2m59s ago)    9m2s
flyte-sandbox-b789778f6-fwkvq                         0/1     CrashLoopBackOff   6 (40s ago)      9m2s

Yee

05/19/2023, 9:52 PM
logs?

Harry Souris

05/19/2023, 9:58 PM
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2023-05-19T21:53:48Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2023-05-19T21:53:48Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2023-05-19T21:53:48Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2023-05-19T21:53:49Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2023-05-19T21:53:49Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2023-05-19T21:53:49Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2023-05-19T21:53:50Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2023-05-19T21:53:50Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2023-05-19T21:53:50Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2023-05-19T21:53:51Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2023-05-19T21:53:51Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2023-05-19T21:53:51Z"}

Yee

05/19/2023, 9:58 PM
reload the web page?

Harry Souris

05/19/2023, 9:58 PM
kk
upstream request timeout
in the cluster it looks like this:
E0519 21:58:50.734034 57 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"flyte\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=flyte pod=flyte-sandbox-b789778f6-fwkvq_flyte(ce5aaf96-ef0e-4985-8d9a-7b1a75d33d73)\"" pod="flyte/flyte-sandbox-b789778f6-fwkvq" podUID=ce5aaf96-ef0e-4985-8d9a-7b1a75d33d73
the same message as before

Yee

05/19/2023, 10:00 PM
the pod is still crashing?
can you describe?
and copy paste

Harry Souris

05/19/2023, 10:01 PM
% kubectl describe pods
Name:             py39-cacher
Namespace:        default
Priority:         0
Service Account:  default
Node:             856e6b497fcd/172.17.0.2
Start Time:       Sat, 20 May 2023 00:42:55 +0300
Labels:           <none>
Annotations:      <none>
Status:           Succeeded
IP:               10.42.0.11
IPs:
  IP:  10.42.0.11
Containers:
  flytekit:
    Container ID:  containerd://837ac176431d8b34f990a0ee037272059f55ad30f4361ccb94b86bbc1eaa085c
    Image:         ghcr.io/flyteorg/flytekit:py3.9-latest
    Image ID:      ghcr.io/flyteorg/flytekit@sha256:757c05c2b8cfea93ba9b0952a7cdd1adf6952e58dfab62993aaceee8b8357beb
    Port:          <none>
    Host Port:     <none>
    Command:
      echo
    Args:
      Flyte
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 20 May 2023 00:47:43 +0300
      Finished:     Sat, 20 May 2023 00:47:43 +0300
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mxptd (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-mxptd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  18m   default-scheduler  Successfully assigned default/py39-cacher to 856e6b497fcd
  Normal  Pulling    18m   kubelet            Pulling image "ghcr.io/flyteorg/flytekit:py3.9-latest"
  Normal  Pulled     13m   kubelet            Successfully pulled image "ghcr.io/flyteorg/flytekit:py3.9-latest" in 4m46.943337882s
  Normal  Created    13m   kubelet            Created container flytekit
  Normal  Started    13m   kubelet            Started container flytekit
one image is still in the pulling phase

Yee

05/19/2023, 10:03 PM
kubectl -n flyte describe pod flyte-sandbox-xyzxyz-xyzxyz

Harry Souris

05/19/2023, 10:05 PM
kubectl -n flyte describe pod flyte-sandbox-b789778f6-fwkvq
Name:             flyte-sandbox-b789778f6-fwkvq
Namespace:        flyte
Priority:         0
Service Account:  flyte-sandbox
Node:             856e6b497fcd/172.17.0.2
Start Time:       Sat, 20 May 2023 00:41:50 +0300
Labels:           app.kubernetes.io/instance=flyte-sandbox
                  app.kubernetes.io/name=flyte-sandbox
                  pod-template-hash=b789778f6
Annotations:      checksum/cluster-resource-templates: 6fd9b172465e3089fcc59f738b92b8dc4d8939360c19de8ee65f68b0e7422035
                  checksum/configuration: 7ef5ef618ebb04f965552e2e4814dc053ef5338fee3ada32517e4e4b1695989b
                  checksum/db-password-secret: 669e1cdf4633c6dd40085f78d1bb6b9672d8120ff1f62077a879a4d46db133e2
Status:           Running
IP:               10.42.0.4
IPs:
  IP:           10.42.0.4
Controlled By:  ReplicaSet/flyte-sandbox-b789778f6
Init Containers:
  wait-for-db:
    Container ID:  containerd://2b824dc0bc7c10d47400bd68e59b0c9fd20734d3b37ed811475509f408c0a7c8
    Image:         bitnami/postgresql:sandbox
    Image ID:      sha256:a729f5f0de5fa39ba4d649e7366d499299304145d2456d60a16b0e63395bd61a
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -ec
    Args:
      until pg_isready \
        -h flyte-sandbox-postgresql \
        -p 5432 \
        -U postgres
      do
        echo waiting for database
        sleep 0.1
      done
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 20 May 2023 00:41:54 +0300
      Finished:     Sat, 20 May 2023 00:42:24 +0300
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9js4w (ro)
Containers:
  flyte:
    Container ID:  containerd://8d9c7124a05059352065eec0dc9b9a68144e34d6a05e8747f8de1f54989c0493
    Image:         flyte-binary:sandbox
    Image ID:      sha256:b26c073652ff86b27f03a534177274b17dcb45f5ced24987c11282e5ddd7f110
    Ports:         8088/TCP, 8089/TCP, 9443/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      start
      --config
      /etc/flyte/config.d/*.yaml
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sat, 20 May 2023 01:00:02 +0300
      Finished:     Sat, 20 May 2023 01:00:02 +0300
    Ready:          False
    Restart Count:  9
    Liveness:       http-get http://:http/healthcheck delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:http/healthcheck delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       flyte-sandbox-b789778f6-fwkvq (v1:metadata.name)
      POD_NAMESPACE:  flyte (v1:metadata.namespace)
    Mounts:
      /etc/flyte/cluster-resource-templates from cluster-resource-templates (rw)
      /etc/flyte/config.d from config (rw)
      /var/run/flyte from state (rw)
      /var/run/secrets/flyte/db-pass from db-pass (rw,path="db-pass")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9js4w (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  cluster-resource-templates:
    Type:               Projected (a volume that contains injected data from multiple sources)
    ConfigMapName:      flyte-sandbox-cluster-resource-templates
    ConfigMapOptional:  <nil>
    ConfigMapName:      flyte-sandbox-extra-cluster-resource-templates
    ConfigMapOptional:  <nil>
  config:
    Type:               Projected (a volume that contains injected data from multiple sources)
    ConfigMapName:      flyte-sandbox-config
    ConfigMapOptional:  <nil>
    ConfigMapName:      flyte-sandbox-extra-config
    ConfigMapOptional:  <nil>
  db-pass:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  flyte-sandbox-db-pass
    Optional:    false
  state:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-9js4w:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  22m                   default-scheduler  Successfully assigned flyte/flyte-sandbox-b789778f6-fwkvq to 856e6b497fcd
  Normal   Pulled     22m                   kubelet            Container image "bitnami/postgresql:sandbox" already present on machine
  Normal   Created    22m                   kubelet            Created container wait-for-db
  Normal   Started    22m                   kubelet            Started container wait-for-db
  Warning  Unhealthy  21m (x3 over 21m)     kubelet            Liveness probe failed: Get "http://10.42.0.4:8088/healthcheck": dial tcp 10.42.0.4:8088: connect: connection refused
  Normal   Killing    21m                   kubelet            Container flyte failed liveness probe, will be restarted
  Normal   Pulled     20m (x2 over 21m)     kubelet            Container image "flyte-binary:sandbox" already present on machine
  Normal   Created    20m (x2 over 21m)     kubelet            Created container flyte
  Normal   Started    20m (x2 over 21m)     kubelet            Started container flyte
  Warning  Unhealthy  17m (x46 over 21m)    kubelet            Readiness probe failed: Get "http://10.42.0.4:8088/healthcheck": dial tcp 10.42.0.4:8088: connect: connection refused
  Warning  BackOff    2m23s (x59 over 16m)  kubelet            Back-off restarting failed container

Yee

05/19/2023, 10:16 PM
can you get all the logs for flyte binary container as well?
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2023-05-19T21:53:48Z"}
isn’t quite enough - need the messages from startup
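One way to get at the startup messages is to strip the debug-level lines, which make up all of the output pasted so far. This is only a sketch: the `drop_debug` filter is a hypothetical helper, and it assumes every log line uses the JSON shape shown above.

```shell
# Hypothetical filter: drop debug-level JSON log lines so startup errors stand out.
drop_debug() {
  grep -v '"level":"debug"'
}

# Usage (pod name is whatever `kubectl -n flyte get pods` reports for the flyte binary):
#   kubectl -n flyte logs --previous <flyte pod> | drop_debug | head -n 50
```

Using `--previous` matters here because the current container instance may have just restarted and not yet logged the failure.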

Harry Souris

05/19/2023, 10:22 PM
need to go now unfortunately
@Yee thanks for your help last night. Today I:
1. ran docker system prune -a
2. started docker again
and it worked. Not sure exactly what the problem was last night, perhaps a resource issue.
It worked in the sense that I could access the UI and run a workflow on the demo cluster. The execution of the wine dataset workflow fails though.
I think the problem was with the db:
2023-05-20 19:22:36.738 GMT [266] LOG: checkpoint starting: end-of-recovery immediate wait
2023-05-20 19:22:36.752 GMT [266] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.003 s, sync=0.002 s, total=0.014 s; sync files=2, longest=0.001 s, average=0.001 s; distance=0 kB, estimate=0 kB
2023-05-20 19:22:36.774 GMT [1] LOG: database system is ready to accept connections
2023-05-20 19:27:10.073 GMT [1] LOG: received smart shutdown request
2023-05-20 19:27:12.860 GMT [831] FATAL: the database system is shutting down
2023-05-20 19:27:16.648 GMT [841] FATAL: the database system is shutting down
2023-05-20 19:27:26.640 GMT [851] FATAL: the database system is shutting down
2023-05-20 19:27:36.586 GMT [860] FATAL: the database system is shutting down
FYI @Yee

Yee

05/20/2023, 10:45 PM
why is the database shutting down? can you describe the postgres pod?
something feels wrong outside of flyte. what type of system are you on? how much resources have you given docker?
may need to bump it up a bit.
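To answer the resources question, `docker info` can report the CPUs and memory available to the Docker VM. A minimal sketch; the `bytes_to_gib` helper is hypothetical and is only there to make the byte count readable.

```shell
# Hypothetical helper: convert a byte count to GiB with one decimal place.
bytes_to_gib() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

# Usage (needs a running Docker daemon):
#   docker info --format 'CPUs: {{.NCPU}}'
#   bytes_to_gib "$(docker info --format '{{.MemTotal}}')"
```

On Docker Desktop for Mac these limits are raised under the Resources settings rather than on the command line.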

David Espejo (he/him)

05/23/2023, 3:16 PM
@Harry Souris what's the current status?