# ask-the-community
p
I tried running a task that needs a GPU on my local sandbox cluster, to test it out first. In the console it shows as RUNNING; in kubectl, the pod shows as PENDING (for several hours even). I tried adding a simple print statement at the start, but I don't see it with kubectl logs, so I assume it's some scheduling or setup issue. I wonder what's a good way to debug this sort of issue?
k
There are no GPUs available in the sandbox
Click on the exclamation icon
p
I set up the flytesnacks resources as follows:
project: flytesnacks
domain: development
defaults:
  cpu: "8"
  gpu: "1"
  memory: "32Gi"
  storage: "32Gi"
limits:
  cpu: "8"
  gpu: "1"
  memory: "32Gi"
  storage: "32Gi"
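(For anyone reading later: a minimal sketch of how a task could request a GPU against these defaults with flytekit; the task name and body are placeholders:)

from flytekit import task, Resources

# Placeholder task; resource values mirror the project defaults above so the
# pod requests one GPU. Note that flytekit uses "mem" rather than "memory".
@task(requests=Resources(cpu="8", mem="32Gi", gpu="1"))
def gpu_smoke_test() -> str:
    return "task started on a node with a GPU allocated"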
Which exclamation mark are you referring to?
k
That is OK, but the sandbox is probably running on your local machine with no GPUs, right?
p
image.png
This is what the console looks like
Yes, it's running locally, but I do have a GPU
I tried again and got some logs from Flyte
k
IMG_2900.jpg
See the top right
As you can see, the status is queued
It's unable to find a GPU
Even if you have a local GPU, it needs to be mounted into the sandbox cluster
Some community members have created a GPU-enabled sandbox image
p
Ah, indeed, the exclamation mark is clear:
8/29/2023 2:22:28 PM UTC Unschedulable: 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
Some hints for future explorers:
Set up NVIDIA in k3s: https://github.com/k3s-io/k3s/issues/4391#issuecomment-1233314825
Use V1PodSpec on the task decorator to specify runtime_class_name
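Roughly how that second hint could look with flytekit's PodTemplate (assuming a recent flytekit that supports pod_template; the "nvidia" runtime class name follows the k3s setup linked above and the task is a placeholder):

from flytekit import task, PodTemplate
from kubernetes.client import V1Container, V1PodSpec

# Sketch only: sets the pod's runtimeClassName so k3s runs the container with
# the NVIDIA container runtime. Adjust "nvidia" to whatever RuntimeClass your
# cluster defines.
nvidia_runtime = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[V1Container(name="primary")],
        runtime_class_name="nvidia",
    ),
)

@task(pod_template=nvidia_runtime)
def needs_gpu_runtime() -> None:
    ...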
v
This PR has a helpful, relevant discussion: https://github.com/flyteorg/flyte/pull/3256
My coworker set up a sandbox with GPUs yesterday. He couldn't do it with flytectl, so he used k3d directly and ran into GPU issues, for which this was the solution. Our local sandbox GPU clusters are running well now, so while the sandbox does not officially support GPUs, you can work around it in the same way.
From Kubernetes' side, to get the pod to schedule, it needs a toleration for the nvidia.com/gpu taint (if the node is tainted), which you can set in the pod spec as you shared already. Then you need to install the NVIDIA device plugin DaemonSet on the cluster. Here we had issues with this DaemonSet's pods being stuck in ContainerCreating status because the NVIDIA container runtime was not being used, which we fixed according to this discussion.
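To make the toleration half concrete, a minimal flytekit sketch in the same style as the snippet above (assuming the node carries the usual nvidia.com/gpu taint; the task itself is a placeholder):

from flytekit import task, PodTemplate, Resources
from kubernetes.client import V1Container, V1PodSpec, V1Toleration

# Sketch: tolerate the nvidia.com/gpu taint so the scheduler will place the pod
# on the tainted GPU node. Key/operator/effect here are common defaults; match
# them to however your GPU node is actually tainted.
gpu_toleration = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[V1Container(name="primary")],
        tolerations=[
            V1Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
        ],
    ),
)

@task(pod_template=gpu_toleration, requests=Resources(gpu="1"))
def gpu_task() -> None:
    ...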
p
Awesome, thanks so much for the pointers, Victor. Building the image right now and will report back! :)
d
@Petr Pilař You can try this tutorial to run flyte-binary on a local k3d environment. Then, you can adjust taints and tolerations as described here to enable task Pods to consume GPUs.