# flyte-support
I am running the PyTorch Lightning MNIST example (https://docs.flyte.org/en/latest/flytesnacks/examples/kfpytorch_plugin/pytorch_lightning_mnist_autoencoder.html) on my homelab computer. First it complained that I should not use `cuda="12.1.0"`, so I replaced it with `conda_channels=["nvidia"]`. Now the workflow starts executing on my cluster, but it does not seem to use the GPU at all: `nvidia-smi` shows 0% volatile utilization, and I expected the fan to go crazy. Should I not use `Elastic` with only one GPU? How can I check what the job is doing right now?
Cc @broad-monitor-993
Cc @flaky-parrot-42438
You can use `Elastic` with one GPU. Usually the problem is a mismatch between the driver and the library version.
@wonderful-air-12263 what does your `ImageSpec` look like?
It’s important to check that the built image has CUDA drivers matching your PyTorch version: test with `torch.cuda.is_available()`.
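A minimal sketch of such a check that you could run inside the container (or drop into a task) to log GPU visibility; it assumes only that `torch` may or may not be importable in the image:

```python
def describe_cuda() -> str:
    """Return a one-line diagnostic of CUDA visibility inside the container."""
    try:
        import torch  # may be absent if the image was built without it
    except ImportError:
        return "torch is not installed in this image"
    if not torch.cuda.is_available():
        # Typical causes: a CPU-only torch wheel, or a driver/runtime mismatch
        return f"torch {torch.__version__}: CUDA not available"
    return (
        f"torch {torch.__version__}: {torch.cuda.device_count()} GPU(s), "
        f"first is {torch.cuda.get_device_name(0)}"
    )


if __name__ == "__main__":
    print(describe_cuda())
```

If this prints "CUDA not available" inside the pod, the 0% utilization in `nvidia-smi` is expected: the training is silently falling back to CPU.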
You can do `pyflyte build <script>.py <workflow/task_name>` to build the image locally.
also, what does your `Elastic` task config look like?
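For reference, a sketch of the two things being asked about, assuming `flytekit` with the `flytekitplugins-kfpytorch` plugin; the package list and pins are illustrative, not verified:

```python
from flytekit import ImageSpec, Resources, task
from flytekitplugins.kfpytorch import Elastic

# ImageSpec built from conda_channels/conda_packages instead of the old
# cuda="12.1.0" argument; exact packages here are an assumption.
custom_image = ImageSpec(
    packages=["torch", "lightning", "flytekitplugins-kfpytorch"],
    conda_channels=["nvidia"],
    conda_packages=["cuda-runtime"],
)

# Single-node, single-worker Elastic config for a one-GPU homelab box.
@task(
    task_config=Elastic(nnodes=1, nproc_per_node=1),
    container_image=custom_image,
    requests=Resources(gpu="1"),
)
def train() -> None:
    ...
```

With `nnodes=1` and `nproc_per_node=1` this is a valid single-GPU setup, so `Elastic` itself is not the problem; the GPU request and the CUDA-enabled image are what make the device actually visible to the task.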