# ask-the-community
p
I tried running a task that needs a GPU on my local sandbox cluster, to test it out first. In the console it shows as RUNNING; in kubectl, the pod shows as PENDING (for several hours even). I tried adding a simple print statement at the start, but I don't see it with kubectl logs, so I assume it's some scheduling or setup issue. I wonder what's a good way to debug this sort of issue?
k
There are no GPUs available in the sandbox
Click on the exclamation icon
p
I set up the flytesnacks resources as follows:
project: flytesnacks
domain: development
defaults:
  cpu: "8"
  gpu: "1"
  memory: "32Gi"
  storage: "32Gi"
limits:
  cpu: "8"
  gpu: "1"
  memory: "32Gi"
  storage: "32Gi"
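(For anyone reading later: a minimal sketch of how a task could request a GPU against these defaults with flytekit; the task name and body are placeholders:)

from flytekit import task, Resources

# Placeholder task; resource values mirror the project defaults above so the
# pod requests one GPU. Note that flytekit uses "mem" rather than "memory".
@task(requests=Resources(cpu="8", mem="32Gi", gpu="1"))
def gpu_smoke_test() -> str:
    return "task started on a node with a GPU allocated"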
Which exclamation mark are you referring to?
k
That is OK, but the sandbox is probably running on your local machine with no GPUs, right?
p
image.png
This is what the console looks like
Yes, it's running locally, but I do have a GPU
I tried again and got some logs from Flyte
k
IMG_2900.jpg
See the top right
As you can see, the status is queued
It's unable to find a GPU
Even if you have a local GPU, it needs to be mounted into the sandbox cluster
Some community members have created a GPU-enabled sandbox image
p
Ah, indeed, the exclamation mark is clear:
8/29/2023 2:22:28 PM UTC Unschedulable: 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
Some hints for future explorers:
Set up NVIDIA in k3s: https://github.com/k3s-io/k3s/issues/4391#issuecomment-1233314825
Use V1PodSpec on the task decorator to specify runtime_class_name
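Roughly how that second hint could look with flytekit's PodTemplate (assuming a recent flytekit that supports pod_template; the "nvidia" runtime class name follows the k3s setup linked above and the task is a placeholder):

from flytekit import task, PodTemplate
from kubernetes.client import V1Container, V1PodSpec

# Sketch only: sets the pod's runtimeClassName so k3s runs the container with
# the NVIDIA container runtime. Adjust "nvidia" to whatever RuntimeClass your
# cluster defines.
nvidia_runtime = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[V1Container(name="primary")],
        runtime_class_name="nvidia",
    ),
)

@task(pod_template=nvidia_runtime)
def needs_gpu_runtime() -> None:
    ...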
v
This PR has a helpful, relevant discussion: https://github.com/flyteorg/flyte/pull/3256
My coworker set up a sandbox with GPUs yesterday. He couldn't do it with flytectl, so he used k3d directly and ran into GPU issues, for which this was the solution. Our local sandbox GPU clusters are running well now, so while the sandbox does not officially support GPUs, you can work around it in the same way.
From Kubernetes' side, to get the pod to schedule, it needs a toleration for the nvidia.com/gpu taint (if the node is tainted), which you can set in the pod spec as you shared already. Then you need to install the NVIDIA device plugin DaemonSet on the cluster. Here we had issues with this DaemonSet's pods being stuck in ContainerCreating status because the NVIDIA container runtime was not being used, which we fixed according to this discussion.
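To make the toleration half concrete, a minimal flytekit sketch in the same style as the snippet above (assuming the node carries the usual nvidia.com/gpu taint; the task itself is a placeholder):

from flytekit import task, PodTemplate, Resources
from kubernetes.client import V1Container, V1PodSpec, V1Toleration

# Sketch: tolerate the nvidia.com/gpu taint so the scheduler will place the pod
# on the tainted GPU node. Key/operator/effect here are common defaults; match
# them to however your GPU node is actually tainted.
gpu_toleration = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[V1Container(name="primary")],
        tolerations=[
            V1Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
        ],
    ),
)

@task(pod_template=gpu_toleration, requests=Resources(gpu="1"))
def gpu_task() -> None:
    ...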
p
Awesome, thanks so much for the pointers, Victor. Building the image right now and will report back! :)
d
@Petr Pilař You can try this tutorial to run flyte-binary on a local k3d environment. Then, you can adjust taints and tolerations as described here to enable task Pods to consume GPUs.