square-agency-59624
06/17/2025, 1:45 PM
torch.cuda.is_available() is false on the final task. Why does this happen?
To test this further, I modified all my tasks to request a GPU. When I do this, torch.cuda.is_available() is true in my final task. This does not make sense to me, as each task is supposed to run in its own container.
jolly-florist-66728
06/17/2025, 1:45 PM
If your final task does not request a GPU, its pod is scheduled on a CPU-only node, so torch.cuda.is_available() will return False. When you set GPU requirements for all tasks, all pods are scheduled on GPU nodes, so the environment is correct for CUDA detection in every task, including the last one. This suggests your image and cluster are set up correctly, but the last task only gets a GPU if its pod is scheduled on a GPU node, which only happens if you request a GPU for that task specifically.
Make sure your final task's decorator includes requests=Resources(gpu="1") (and optionally limits=Resources(gpu="1")), and that your Flyte cluster has available GPU nodes. Also, ensure your image is built with CUDA support. If these are set, only the task(s) with GPU requests will run on GPU nodes, and torch.cuda.is_available() should be True only in those tasks. See the Flyte GPU configuration docs and task resource settings for more details.
Would you like a step-by-step breakdown or example code?
Sources:
- Configuring Access to GPUs
- task resource settings
square-agency-59624
06/17/2025, 2:40 PM
jolly-florist-66728
06/17/2025, 2:40 PM
If torch.cuda.is_available() is still False in the last task, the most likely cause is that the pod for the last task is not being scheduled on a GPU node. This can happen if your cluster does not have available GPU nodes, or if the Flyte/Kubernetes configuration for GPU scheduling is not correct. Each task runs in its own pod, so only the task with requests=Resources(gpu="1") should be scheduled on a GPU node, and only that pod should have access to the GPU and CUDA.
If all tasks request GPU, then all pods are scheduled on GPU nodes, so torch.cuda.is_available() is True everywhere. But if only the last task requests GPU and it still cannot see the GPU, check that your cluster has available GPU nodes, that the node labels and taints are set up correctly, and that your Flyte deployment is configured to schedule GPU tasks properly. See Configuring Access to GPUs for troubleshooting.
Would you like more detail on how to verify your cluster and Flyte configuration?
Sources:
- Configuring Access to GPUs