I started getting `Error: docker sandbox doesn't have sufficient memory available` | Flyte #flyte-support


shy-accountant-549

03/17/2023, 12:11 AM

I started getting

Error: docker sandbox doesn't have sufficient memory available. Please run docker system prune -a --volumes

when starting the sandbox cluster. But there is enough memory according to docker info below. Any ideas?


...
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 62.5GiB
Name: Mercury
ID: fa8608e9-e110-482e-8c5f-908edce3debb
Docker Root Dir: /var/lib/docker
...

shy-accountant-549

03/17/2023, 12:12 AM

This seems to happen after I pulled a large image (~17 GB). Is the large image causing the issue here?

glamorous-carpet-83516

03/17/2023, 1:40 AM

You may have many images on your disk, so you may have to run that command to remove them.

shy-accountant-549

03/17/2023, 1:41 AM

Is there a limit on image size in the sandbox container registry (localhost:30000)?

shy-accountant-549

03/17/2023, 1:43 AM

Below are the images I have:


$ docker image ls
REPOSITORY          TAG               IMAGE ID       CREATED             SIZE
<none>              <none>            625003646edd   About an hour ago   117MB
flyte-sandbox-gpu   latest            5a33e57a9ff5   5 hours ago         2.09GB
a-pytorch-image     4.2.2             28f38afecd99   27 hours ago        16.1GB
moby/buildkit       master            c5348a51d57d   27 hours ago        168MB
moby/buildkit       buildx-stable-1   477ce8a5e273   10 days ago         168MB

shy-accountant-549

03/17/2023, 2:07 AM

I did a test with the following steps:
1. Run docker system prune -a --volumes
2. Run flytectl demo start
3. Check that the sandbox is up and the console is responding (yes)
4. docker pull an image of ~16 GB
5. When docker had almost finished pulling the image in step 4, the sandbox cluster crashed and the console stopped responding
6. Run flytectl demo teardown
7. Run flytectl demo start, and got

Error: docker sandbox doesn't have sufficient memory available. Please run docker system prune -a --volumes

During the whole process, memory usage stayed below 30% of total memory. I don't know why images saved on disk can interfere with the sandbox cluster and cause a memory issue 🤔

shy-accountant-549

03/17/2023, 2:12 AM

some logs from the sandbox container:


I0317 02:13:25.223129      51 eviction_manager.go:338] "Eviction manager: attempting to reclaim" resourceName="ephemeral-storage"
I0317 02:13:25.223189      51 container_gc.go:85] "Attempting to delete unused containers"
I0317 02:13:25.223264      51 controller.go:611] quota admission added evaluator for: leases.coordination.k8s.io
I0317 02:13:25.224612      51 image_gc_manager.go:327] "Attempting to delete unused images"
I0317 02:13:25.227104      51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:f1845e2b5222cf46f3d823a7f9f317eee412c337dbb068ad8056141f9b97813e" size=135144912
I0317 02:13:25.241593      51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:99376d8f35e0abb6ff9d66b50a7c81df6e6dfdb649becc5df84a691a7b4beca4" size=49672672
I0317 02:13:25.248924      51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:827365c7baf137228e94bcfc6c47938b4ffde26c68c32bf3d3a7762cd04056a5" size=5088600
I0317 02:13:25.256277      51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:63c251b5cbdfce496959e87f9c155db279bb348ac26294624becb56ca9813268" size=80642070
I0317 02:13:25.263486      51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:11e23119f2c697a4d756a33d130370517aa268908f2e8dce5345385ca467099f" size=88538609
I0317 02:13:25.270731      51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:0d153fadf70b612a5215e3a788a0b58ba6fa25e5df4b59698e0feb2174e8a98c" size=24702520
I0317 02:13:25.278309      51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:a729f5f0de5fa39ba4d649e7366d499299304145d2456d60a16b0e63395bd61a" size=284035241
I0317 02:13:25.286097      51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:83c8830c18680f53476a5661a17323d1d8836f2d0a4ac2fbdf441eb48645c799" size=224684680
I0317 02:13:25.293678      51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:07655ddf2eebe5d250f7a72c25f638b27126805d61779741b4e62e69ba080558" size=249227352
I0317 02:13:25.301283      51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:fb9b574e03c344e1619ced3ef0700acb2ab8ef1d39973cabd90b8371a46148be" size=35257594
I0317 02:13:25.308796      51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:f73640fb506199d02192ef1dc99404aeb1afec43a9f7dad5de96c09eda17cd71" size=65673656
I0317 02:13:25.323203      51 eviction_manager.go:349] "Eviction manager: must evict pod(s) to reclaim" resourceName="ephemeral-storage"
E0317 02:13:25.323252      51 eviction_manager.go:360] "Eviction manager: eviction thresholds have been met, but no pods are active to evict"
I0317 02:13:28.154911      51 node_lifecycle_controller.go:1192] Controller detected that some Nodes are Ready. Exiting master disruption mode.
E0317 02:13:33.407314      51 resource_quota_controller.go:413] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
W0317 02:13:33.816754      51 garbagecollector.go:747] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
I0317 02:13:35.333247      51 eviction_manager.go:338] "Eviction manager: attempting to reclaim" resourceName="ephemeral-storage"
I0317 02:13:35.333304      51 container_gc.go:85] "Attempting to delete unused containers"
I0317 02:13:35.334926      51 image_gc_manager.go:327] "Attempting to delete unused images"
I0317 02:13:35.343198      51 eviction_manager.go:349] "Eviction manager: must evict pod(s) to reclaim" resourceName="ephemeral-storage"
E0317 02:13:35.343235      51 eviction_manager.go:360] "Eviction manager: eviction thresholds have been met, but no pods are active to evict"
I0317 02:13:45.352874      51 eviction_manager.go:338] "Eviction manager: attempting to reclaim" resourceName="ephemeral-storage"
I0317 02:13:45.352939      51 container_gc.go:85] "Attempting to delete unused containers"
I0317 02:13:45.354333      51 image_gc_manager.go:327] "Attempting to delete unused images"
I0317 02:13:45.363436      51 eviction_manager.go:349] "Eviction manager: must evict pod(s) to reclaim" resourceName="ephemeral-storage"
E0317 02:13:45.363472      51 eviction_manager.go:360] "Eviction manager: eviction thresholds have been met, but no pods are active to evict"
I0317 02:13:55.372514      51 eviction_manager.go:338] "Eviction manager: attempting to reclaim" resourceName="ephemeral-storage"
I0317 02:13:55.372570      51 container_gc.go:85] "Attempting to delete unused containers"
I0317 02:13:55.374127      51 image_gc_manager.go:327] "Attempting to delete unused images"
I0317 02:13:55.382783      51 eviction_manager.go:349] "Eviction manager: must evict pod(s) to reclaim" resourceName="ephemeral-storage"
E0317 02:13:55.382823      51 eviction_manager.go:360] "Eviction manager: eviction thresholds have been met, but no pods are active to evict"
E0317 02:14:03.442711      51 resource_quota_controller.go:413] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
W0317 02:14:03.841875      51 garbagecollector.go:747] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
W0317 02:14:04.894328      51 handler_proxy.go:105] no RequestInfo found in the context
E0317 02:14:04.894384      51 controller.go:113] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: Error, could not get list of group versions for APIService
I0317 02:14:04.894403      51 controller.go:126] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
W0317 02:14:04.895561      51 handler_proxy.go:105] no RequestInfo found in the context
E0317 02:14:04.895660      51 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
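
The eviction manager lines above name `ephemeral-storage` (disk) as the resource under pressure, not memory. A minimal Python sketch for summarizing such a log is shown here; the function name `summarize_eviction` and the regular expressions are illustrative assumptions based only on the line formats visible in this thread, not part of Flyte or Kubernetes tooling.

```python
import re

# Matches kubelet lines such as:
#   ... "Removing image to free bytes" imageID="sha256:..." size=135144912
_REMOVED = re.compile(
    r'"Removing image to free bytes" imageID="([^"]+)" size=(\d+)'
)
# Matches the eviction manager naming the resource it is trying to reclaim.
_RECLAIM = re.compile(
    r'"Eviction manager: attempting to reclaim" resourceName="([^"]+)"'
)
# Message logged when thresholds are still exceeded after garbage collection.
_NO_PODS = "eviction thresholds have been met, but no pods are active to evict"


def summarize_eviction(log_text):
    """Summarize eviction activity found in kubelet-style log text."""
    removed = _REMOVED.findall(log_text)
    return {
        # Distinct resources the eviction manager tried to reclaim.
        "resources": sorted(set(_RECLAIM.findall(log_text))),
        # Number of image removals and the total bytes they freed.
        "images_removed": len(removed),
        "bytes_freed": sum(int(size) for _, size in removed),
        # True if the node stayed over its threshold after cleanup.
        "still_under_pressure": _NO_PODS in log_text,
    }
```

Calling `summarize_eviction(text)` on the saved container output returns a small dictionary. If `resources` contains `ephemeral-storage` and `still_under_pressure` is true, that suggests disk space rather than memory is the limiting resource.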

quick-salesclerk-18019

03/17/2023, 7:40 AM

You probably need more disk space on your server; check with `df -h` and look at the volume/disk that holds your docker directory

👍 1

quick-salesclerk-18019

03/17/2023, 7:51 AM

(i.e. the error message is misleading - you're running out of disk, not memory)
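
The disk check suggested above can also be scripted. The following is a rough Python sketch using the standard library's `shutil.disk_usage`. The helper names `has_headroom` and `disk_headroom` are hypothetical, and the 10% minimum free fraction is an assumed threshold for illustration; the sandbox's actual eviction threshold may differ.

```python
import shutil


def has_headroom(total_bytes, free_bytes, min_free_fraction=0.10):
    """Return True if the free share of the disk meets the minimum fraction.

    min_free_fraction=0.10 is an assumption, not a value confirmed for the
    Flyte sandbox.
    """
    return (free_bytes / total_bytes) >= min_free_fraction


def disk_headroom(path, min_free_fraction=0.10):
    """Return (free_fraction, ok) for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_fraction = usage.free / usage.total
    return free_fraction, has_headroom(usage.total, usage.free, min_free_fraction)
```

For example, `disk_headroom("/var/lib/docker")` checks the Docker Root Dir reported by `docker info` earlier in the thread. A result whose second element is false means the disk is fuller than the assumed threshold, which would be consistent with the eviction messages in the logs.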

shy-accountant-549

03/17/2023, 3:04 PM

Also, docker system prune didn't remove all the volumes for me, as in this issue
