acceptable-jackal-25563  08/29/2023, 11:57 AM
freezing-airport-6809

acceptable-jackal-25563  08/29/2023, 2:13 PM
project: flytesnacks
domain: development
defaults:
  cpu: "8"
  gpu: "1"
  memory: 32Gi
  storage: "32Gi"
limits:
  cpu: "8"
  gpu: "1"
  memory: 32Gi
  storage: "32Gi"

acceptable-jackal-25563  08/29/2023, 2:13 PM
freezing-airport-6809

acceptable-jackal-25563  08/29/2023, 2:13 PM
acceptable-jackal-25563  08/29/2023, 2:14 PM
acceptable-jackal-25563  08/29/2023, 2:22 PM
acceptable-jackal-25563  08/29/2023, 2:24 PM
freezing-airport-6809

acceptable-jackal-25563  08/29/2023, 2:57 PM
8/29/2023 2:22:28 PM UTC Unschedulable: 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
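
That scheduler event means no node in the cluster currently advertises an allocatable nvidia.com/gpu resource, which the NVIDIA device plugin is responsible for registering. As a rough illustration (the values below are hypothetical), once the plugin is healthy the node object should report the GPU:

# illustrative excerpt of `kubectl get node <node-name> -o yaml`
# after the NVIDIA device plugin has registered the GPU
status:
  allocatable:
    cpu: "8"
    memory: 32Gi
    nvidia.com/gpu: "1"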

acceptable-jackal-25563  08/29/2023, 3:06 PM

kind-kite-58745  08/29/2023, 3:32 PM
We set up k3d directly and ran into GPU issues, for which this was the solution. Our local sandbox GPU clusters are running well now, so while the sandbox does not officially support GPUs, you can work around it in the same way.
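
For illustration, a minimal sketch of such a k3d setup, assuming k3d v5's SimpleConfig format and a custom CUDA-enabled k3s image built as in the k3d GPU guide (the image name is made up):

# hypothetical k3d cluster config: a custom k3s image with the NVIDIA
# container runtime baked in, and the host GPUs passed to the nodes
apiVersion: k3d.io/v1alpha5
kind: Simple
metadata:
  name: gpu-sandbox
servers: 1
image: my-registry/k3s-cuda:v1.27.4-k3s1  # custom CUDA-enabled k3s image (name is hypothetical)
options:
  runtime:
    gpuRequest: all  # equivalent to `k3d cluster create --gpus=all`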

From Kubernetes' side, to get a pod to schedule you need the pod to carry a toleration for the nvidia.com/gpu taint (if the node is tainted), which you can set in the pod spec as you shared already.
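
As a concrete sketch of that toleration (together with the GPU resource request that makes the scheduler look for a GPU node in the first place), a minimal pod spec could look like the following; the exact taint key, value, and effect depend on how the node was tainted:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test  # hypothetical pod name
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1  # extended resources are requested via limits
  tolerations:
  - key: nvidia.com/gpu  # matches the node's taint, if one is set
    operator: Exists
    effect: NoSchedule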

Then you need to install the NVIDIA device plugin DaemonSet on the cluster. Here we had issues with the DaemonSet's pods being stuck in ContainerCreating status because the nvidia container runtime was not being used, which we fixed according to this discussion.
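
On k3s-based clusters (including k3d), one common way to make the nvidia container runtime actually get used, and unblock device plugin pods stuck in ContainerCreating, is to define a RuntimeClass for the runtime that k3s configures when the NVIDIA container toolkit is present, then have the device plugin (and GPU pods) opt into it via spec.runtimeClassName. A sketch, not necessarily the exact fix referenced above:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia  # containerd runtime handler that k3s sets up for the NVIDIA runtime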

acceptable-jackal-25563  08/29/2023, 3:43 PM

average-finland-92144  08/29/2023, 3:48 PM
This covers installing flyte-binary on a local K3d environment.
Then, you can adjust taints and tolerations as described here to enable task Pods to consume GPUs; a sketch follows below.
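
As a sketch of that last step, assuming the flyte-binary Helm chart's configuration.inline passthrough and the propeller k8s plugin settings described in the Flyte GPU docs (verify the keys against your version), a toleration can be attached automatically to any task pod that requests nvidia.com/gpu:

# hypothetical excerpt of flyte-binary Helm values
configuration:
  inline:
    plugins:
      k8s:
        resource-tolerations:
          nvidia.com/gpu:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule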