Gaurav Kumar

about 2 years ago
Hi, I'm using the Flyte sandbox and trying to run some tasks in Flyte. One of my tasks needs access to the GPU on my host machine. Because the GPU is on the host, it has to be passed through to the Kubernetes (k8s) pod that runs inside the sandbox Docker container. However, the pod never gets scheduled. Its events show:
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  11m    default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) didn't match Pod's node affinity/selector. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
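The "Insufficient nvidia.com/gpu" part of that event boils down to a resource-fit check: the pod requests a resource the node cannot supply. The sketch below is a simplified model, not the real kube-scheduler code. It assumes quantities are plain integers, ignores resources already used by other pods and ignores affinity/selectors, and treats an extended resource the node does not advertise as 0. The function name and the pod request values are hypothetical.

```python
from typing import Dict, List


def insufficient_resources(
    pod_requests: Dict[str, int], node_allocatable: Dict[str, int]
) -> List[str]:
    """Return the resource names the node cannot satisfy (simplified fit check)."""
    missing = []
    for name, requested in pod_requests.items():
        # A resource the node does not advertise at all counts as 0 available.
        available = node_allocatable.get(name, 0)
        if requested > available:
            missing.append(name)
    return missing


# Node values mirror the cpu/pods allocatable in this post; pod requests are hypothetical.
node = {"cpu": 8, "pods": 110}
pod = {"cpu": 1, "nvidia.com/gpu": 1}
print(insufficient_resources(pod, node))  # ['nvidia.com/gpu']
```

With no `nvidia.com/gpu` entry on the node, any pod requesting even one GPU fails this check, which matches the event above.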
I also checked `kubectl describe node 31e2e2ba6f9f` (31e2e2ba6f9f is the flyte-sandbox container in which Flyte runs the sandbox environment). It looks like the node (the Docker container, in this case) itself has no GPU to allocate to the pod: neither Capacity nor Allocatable lists nvidia.com/gpu.
>kubectl describe node 31e2e2ba6f9f
...
Capacity:
  cpu:                8
  ephemeral-storage:  944801904Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65774996Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  919103291491
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65774996Ki
  pods:               110
...
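To script this check, you can read the node's `status.allocatable` map, which is what the Allocatable section shows. A minimal sketch, assuming the JSON produced by `kubectl get node <name> -o json`; the helper names are hypothetical. On a GPU node, the `nvidia.com/gpu` key is normally advertised by the NVIDIA device plugin, so its absence means Kubernetes sees 0 GPUs.

```python
import json
import subprocess
from typing import Any, Dict


def allocatable_gpus(node: Dict[str, Any]) -> int:
    """Return the nvidia.com/gpu count in a node's status.allocatable (0 if absent)."""
    allocatable = node.get("status", {}).get("allocatable", {})
    # Kubernetes serializes quantities as strings, e.g. "1" for one GPU.
    return int(allocatable.get("nvidia.com/gpu", "0"))


def get_node(name: str) -> Dict[str, Any]:
    """Fetch a node object via kubectl (needs kubectl on PATH and a kubeconfig)."""
    out = subprocess.run(
        ["kubectl", "get", "node", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


# Allocatable values copied from the describe output in this post: no GPU key.
sandbox_node = {
    "status": {
        "allocatable": {
            "cpu": "8",
            "ephemeral-storage": "919103291491",
            "hugepages-1Gi": "0",
            "hugepages-2Mi": "0",
            "memory": "65774996Ki",
            "pods": "110",
        }
    }
}
print(allocatable_gpus(sandbox_node))  # 0
```

Against the live sandbox, the equivalent call would be `allocatable_gpus(get_node("31e2e2ba6f9f"))`.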
I think this is happening because the flyte-sandbox Docker container cannot access the GPUs. I logged into the container, and `nvidia-smi`, which has to work for a container to access GPUs, is not even available there:
❯ docker exec -it 31e2e2ba6f9f /bin/sh
/ # nvidia-smi
/bin/sh: nvidia-smi: not found
/ #
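The same check can be scripted so it runs identically on the host and inside a container. A minimal sketch, assuming Python is available in the environment being checked (the sandbox image may not ship it) and treating a working `nvidia-smi` on PATH as a proxy for GPU access. The function names are hypothetical; `--query-gpu=name --format=csv,noheader` are standard `nvidia-smi` options.

```python
import shutil
import subprocess
from typing import List


def nvidia_smi_available() -> bool:
    """True if an nvidia-smi binary is on PATH in the current environment."""
    return shutil.which("nvidia-smi") is not None


def gpu_names() -> List[str]:
    """List GPU names reported by nvidia-smi, or [] if it is missing or fails."""
    if not nvidia_smi_available():
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []
    # One GPU name per non-empty output line.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


print(nvidia_smi_available(), gpu_names())
```

Inside the sandbox container, this would report `False []`, matching the "not found" error above.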
Note that outside of Flyte, I can run the task image in a Docker container on my host machine as shown below. It needs `--gpus all` to be passed for the container to access the GPUs:
docker run --gpus all <task_image_with_gpu_req>
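If you script the host-side test, the same invocation can be built as an argument list. A minimal sketch; the helper and the image name are hypothetical, `--rm` is added only so the test container is cleaned up, and `--gpus` itself requires the NVIDIA Container Toolkit on the host. Appending `nvidia-smi` as the command is a quick way to confirm GPU passthrough.

```python
from typing import List, Optional, Sequence


def docker_run_argv(
    image: str, gpus: Optional[str] = "all", command: Sequence[str] = ()
) -> List[str]:
    """Build a `docker run` argument list, adding `--gpus <value>` when gpus is set."""
    argv = ["docker", "run", "--rm"]
    if gpus is not None:
        # Same as typing `--gpus all` on the command line.
        argv += ["--gpus", gpus]
    argv.append(image)
    argv.extend(command)
    return argv


# Hypothetical image name; pass argv to subprocess.run(argv, check=True) on a GPU host.
argv = docker_run_argv("my-task-image:latest", command=["nvidia-smi"])
print(argv)  # ['docker', 'run', '--rm', '--gpus', 'all', 'my-task-image:latest', 'nvidia-smi']
```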
I think if we can somehow pass `--gpus all` to the `docker run` command that starts the sandbox during `flytectl demo start`, it should work. Please help!