Nan Qin
03/17/2023, 12:11 AM
I get this error when starting the sandbox cluster:
Error: docker sandbox doesn't have sufficient memory available. Please run docker system prune -a --volumes
But there is enough memory according to docker info below. Any ideas?
...
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 62.5GiB
Name: Mercury
ID: fa8608e9-e110-482e-8c5f-908edce3debb
Docker Root Dir: /var/lib/docker
...
Kevin Su
03/17/2023, 1:40 AM
Nan Qin
03/17/2023, 1:41 AM
✘ nan@Mercury [c] (baby39) 3.9.16 ~/BabyFaceMask-MobileNetV2-Cloak-experiment main ● docker image ls
REPOSITORY          TAG               IMAGE ID       CREATED             SIZE
<none>              <none>            625003646edd   About an hour ago   117MB
flyte-sandbox-gpu   latest            5a33e57a9ff5   5 hours ago         2.09GB
a-pytorch-image     4.2.2             28f38afecd99   27 hours ago        16.1GB
moby/buildkit       master            c5348a51d57d   27 hours ago        168MB
moby/buildkit       buildx-stable-1   477ce8a5e273   10 days ago         168MB
Error: docker sandbox doesn't have sufficient memory available. Please run docker system prune -a --volumes
Throughout the whole process, memory usage stays below 30% of total memory.
I don't know why images saved on disk can interfere with the sandbox cluster and cause the memory issue 🤔
I0317 02:13:25.223129 51 eviction_manager.go:338] "Eviction manager: attempting to reclaim" resourceName="ephemeral-storage"
I0317 02:13:25.223189 51 container_gc.go:85] "Attempting to delete unused containers"
I0317 02:13:25.223264 51 controller.go:611] quota admission added evaluator for: leases.coordination.k8s.io
I0317 02:13:25.224612 51 image_gc_manager.go:327] "Attempting to delete unused images"
I0317 02:13:25.227104 51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:f1845e2b5222cf46f3d823a7f9f317eee412c337dbb068ad8056141f9b97813e" size=135144912
I0317 02:13:25.241593 51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:99376d8f35e0abb6ff9d66b50a7c81df6e6dfdb649becc5df84a691a7b4beca4" size=49672672
I0317 02:13:25.248924 51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:827365c7baf137228e94bcfc6c47938b4ffde26c68c32bf3d3a7762cd04056a5" size=5088600
I0317 02:13:25.256277 51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:63c251b5cbdfce496959e87f9c155db279bb348ac26294624becb56ca9813268" size=80642070
I0317 02:13:25.263486 51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:11e23119f2c697a4d756a33d130370517aa268908f2e8dce5345385ca467099f" size=88538609
I0317 02:13:25.270731 51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:0d153fadf70b612a5215e3a788a0b58ba6fa25e5df4b59698e0feb2174e8a98c" size=24702520
I0317 02:13:25.278309 51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:a729f5f0de5fa39ba4d649e7366d499299304145d2456d60a16b0e63395bd61a" size=284035241
I0317 02:13:25.286097 51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:83c8830c18680f53476a5661a17323d1d8836f2d0a4ac2fbdf441eb48645c799" size=224684680
I0317 02:13:25.293678 51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:07655ddf2eebe5d250f7a72c25f638b27126805d61779741b4e62e69ba080558" size=249227352
I0317 02:13:25.301283 51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:fb9b574e03c344e1619ced3ef0700acb2ab8ef1d39973cabd90b8371a46148be" size=35257594
I0317 02:13:25.308796 51 image_gc_manager.go:387] "Removing image to free bytes" imageID="sha256:f73640fb506199d02192ef1dc99404aeb1afec43a9f7dad5de96c09eda17cd71" size=65673656
I0317 02:13:25.323203 51 eviction_manager.go:349] "Eviction manager: must evict pod(s) to reclaim" resourceName="ephemeral-storage"
E0317 02:13:25.323252 51 eviction_manager.go:360] "Eviction manager: eviction thresholds have been met, but no pods are active to evict"
I0317 02:13:28.154911 51 node_lifecycle_controller.go:1192] Controller detected that some Nodes are Ready. Exiting master disruption mode.
E0317 02:13:33.407314 51 resource_quota_controller.go:413] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
W0317 02:13:33.816754 51 garbagecollector.go:747] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
I0317 02:13:35.333247 51 eviction_manager.go:338] "Eviction manager: attempting to reclaim" resourceName="ephemeral-storage"
I0317 02:13:35.333304 51 container_gc.go:85] "Attempting to delete unused containers"
I0317 02:13:35.334926 51 image_gc_manager.go:327] "Attempting to delete unused images"
I0317 02:13:35.343198 51 eviction_manager.go:349] "Eviction manager: must evict pod(s) to reclaim" resourceName="ephemeral-storage"
E0317 02:13:35.343235 51 eviction_manager.go:360] "Eviction manager: eviction thresholds have been met, but no pods are active to evict"
I0317 02:13:45.352874 51 eviction_manager.go:338] "Eviction manager: attempting to reclaim" resourceName="ephemeral-storage"
I0317 02:13:45.352939 51 container_gc.go:85] "Attempting to delete unused containers"
I0317 02:13:45.354333 51 image_gc_manager.go:327] "Attempting to delete unused images"
I0317 02:13:45.363436 51 eviction_manager.go:349] "Eviction manager: must evict pod(s) to reclaim" resourceName="ephemeral-storage"
E0317 02:13:45.363472 51 eviction_manager.go:360] "Eviction manager: eviction thresholds have been met, but no pods are active to evict"
I0317 02:13:55.372514 51 eviction_manager.go:338] "Eviction manager: attempting to reclaim" resourceName="ephemeral-storage"
I0317 02:13:55.372570 51 container_gc.go:85] "Attempting to delete unused containers"
I0317 02:13:55.374127 51 image_gc_manager.go:327] "Attempting to delete unused images"
I0317 02:13:55.382783 51 eviction_manager.go:349] "Eviction manager: must evict pod(s) to reclaim" resourceName="ephemeral-storage"
E0317 02:13:55.382823 51 eviction_manager.go:360] "Eviction manager: eviction thresholds have been met, but no pods are active to evict"
E0317 02:14:03.442711 51 resource_quota_controller.go:413] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
W0317 02:14:03.841875 51 garbagecollector.go:747] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
W0317 02:14:04.894328 51 handler_proxy.go:105] no RequestInfo found in the context
E0317 02:14:04.894384 51 controller.go:113] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: Error, could not get list of group versions for APIService
I0317 02:14:04.894403 51 controller.go:126] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
W0317 02:14:04.895561 51 handler_proxy.go:105] no RequestInfo found in the context
E0317 02:14:04.895660 51 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
Björn
03/17/2023, 7:40 AM
df -h
and the volume/disk that holds your docker directory
Nan Qin
03/17/2023, 3:04 PM
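The disk check suggested above (df -h on the volume that holds the Docker directory) can also be sketched in Python. The kubelet log lines in this thread report resourceName="ephemeral-storage", which points at disk space rather than RAM. This is only a minimal sketch, assuming the Docker root dir is /var/lib/docker as shown by docker info earlier in the thread. The helper names and the 10% cutoff are hypothetical, chosen for illustration, and are not the sandbox's actual eviction threshold.

```python
import shutil


def free_fraction(free_bytes, total_bytes):
    """Return the fraction of a filesystem that is still free (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def disk_report(path, min_free_fraction=0.10):
    """Summarize free space on the filesystem that holds `path`.

    min_free_fraction is a hypothetical cutoff for illustration only;
    it is not the threshold the sandbox's kubelet actually uses.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    fraction = free_fraction(usage.free, usage.total)
    return {
        "path": path,
        "total_gib": usage.total / 2**30,
        "free_gib": usage.free / 2**30,
        "free_fraction": fraction,
        "low": fraction < min_free_fraction,
    }


if __name__ == "__main__":
    try:
        # Docker Root Dir as reported by `docker info` in this thread.
        report = disk_report("/var/lib/docker")
    except OSError as exc:
        print(f"could not inspect Docker root dir: {exc}")
    else:
        print(
            f"{report['path']}: {report['free_gib']:.1f} GiB free of "
            f"{report['total_gib']:.1f} GiB ({report['free_fraction']:.0%} free)"
        )
        if report["low"]:
            print("free disk space is below the illustrative cutoff")
```

If the report shows little free space, that would be consistent with the ephemeral-storage eviction messages, even though the error text mentions memory.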