# ask-the-community
v
Hi Team, we are trying to run our workflow on GPU, but when we run the Flyte task we get the following error:
```
File "/opt/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1024, in _new_process_group_helper
    backend_class = ProcessGroupNCCL(backend_prefix_store, group_rank, group_size, pg_options)
```
Message:
```
ProcessGroupNCCL is only supported with GPUs, no GPUs found!
```
We have also added a check at the start of the task for GPU availability, and from it I can see that torch is able to identify the GPU:
```
INFO:dummy.test_gpu:GPU is available in your machine with following device count : 1
2023-09-13 17:00:31,611 [dummy.test_gpu] [INFO ]  GPU is available in your machine with following device count : 1
GPU available: True (cuda), used: True
```
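For reference, a check like the one logged above might look like this (a minimal sketch; the logger name `dummy.test_gpu` is taken from the log, the rest is an assumption):
```python
import logging
import torch

logger = logging.getLogger("dummy.test_gpu")

# Log whether torch can see a CUDA device before any distributed setup runs.
if torch.cuda.is_available():
    logger.info(
        "GPU is available in your machine with following device count : %s",
        torch.cuda.device_count(),
    )
else:
    logger.warning("No GPU visible to torch in this container")
```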
The EC2 instance type we are using is g4dn.xlarge.
Can anyone help me with this? Is it possible to define the instance type in the `flytekit.task` resources?
k
You can specify the instance type through labels.
You shouldn't use that inverse pattern, though.
But coming soon to a Flyte near you: GPU type selection - cc @jeev
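For illustration, pinning a task to an instance type via node labels could look like this (a hedged sketch: `PodTemplate` is part of flytekit, but the `node.kubernetes.io/instance-type` label and the cluster setup are assumptions):
```python
from flytekit import PodTemplate, Resources, task
from kubernetes.client import V1PodSpec

@task(
    requests=Resources(cpu="100m", mem="1Gi", gpu="1"),
    pod_template=PodTemplate(
        pod_spec=V1PodSpec(
            containers=[],  # Flyte fills in the task container
            # Well-known node label on EKS; schedules the pod onto g4dn.xlarge nodes.
            node_selector={"node.kubernetes.io/instance-type": "g4dn.xlarge"},
        )
    ),
)
def gpu_task() -> None:
    ...
```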
@Vipul Goswami can you share your snippet of the config in the task decorator?
v
```python
@task(
    requests=Resources(cpu="100m", mem="1Gi", gpu="1"),
    limits=Resources(cpu="200m", mem="6Gi", gpu="1"),
)
def test_run(
```
It's a pretty simple config, nothing special that I have added.
k
Ohh then why are you using distributed training?
You don’t need a process group
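In other words, the error usually comes from code that initializes a distributed process group, which a single-GPU run doesn't need (a sketch; the actual training code may differ):
```python
import torch

# Multi-GPU path: this is the call that constructs ProcessGroupNCCL and
# fails when no GPU is visible. Not needed for single-GPU training.
# torch.distributed.init_process_group(backend="nccl")

# Single-GPU path: just move the model and data to the one device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model.to(device)  # `model` is hypothetical here
```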
v
Sorry, what do you mean by distributed training? (I am a newbie to this field)
k
If you still want to set it up, my recommendation is to use the flytekit elastic plugin.
Vipul - no worries - single-GPU vs multi-GPU training is different; from the error it seems your code is trying to set up multi-GPU.
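For reference, a minimal sketch of what that could look like with the `flytekitplugins-kfpytorch` plugin (the task body and worker counts are illustrative):
```python
from flytekit import Resources, task
from flytekitplugins.kfpytorch import Elastic

@task(
    # Runs via torch elastic; with nnodes=1 and nproc_per_node=1 this is
    # effectively single-GPU training, but it can scale up later.
    task_config=Elastic(nnodes=1, nproc_per_node=1),
    requests=Resources(cpu="100m", mem="1Gi", gpu="1"),
    limits=Resources(cpu="200m", mem="6Gi", gpu="1"),
)
def train() -> None:
    ...
```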
v
Ok, but in that case I don't think we need multi-GPU at the moment. So as per my understanding, you are suggesting to use the flytekit elastic plugin to run on only a single GPU?
k
It will scale from one to n
And I agree you shouldn’t need to use more than one
But then you have to modify your code
v
> But then you have to modify your code
That's fine, we can do that. But would it be possible to share how I can define the instance type in the Flyte resources section?
s
You cannot specify the GPU type in the task resources. That's still being worked on: https://github.com/flyteorg/flyte/discussions/3796. If you aren't using distributed training, I'm not sure why your code is hitting that specific code path. Are you running an example from the docs?
v
I am running the code provided by our Data Scientist, but I suspect it is because I am using the number of workers as 4 (though that is only a suspicion).
s
Could you share your configuration?
v
```yaml
metadata:
  name: local
  phase: serve
training:
  accelerator: gpu
  num_workers: 4
  hyperparameters:
    batch_size: 4096
    embedding_dim: 10
    weight_decay: 1e-6
    lr: 0.005
    optim_func: XXX
    loss_func: XXX
    max_epochs: 10
```
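If this config drives a PyTorch Lightning `Trainer`, a workers value passed through as `devices` would explain the NCCL path (this mapping is a guess; in Lightning, DataLoader `num_workers` alone does not trigger distributed training):
```python
import pytorch_lightning as pl

# If the config ends up as Trainer(devices=4), Lightning picks a DDP
# strategy and initializes a NCCL process group - the failing call above.
# trainer = pl.Trainer(accelerator="gpu", devices=4)

# Single GPU: no process group is created.
trainer = pl.Trainer(accelerator="gpu", devices=1)
```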
s
I meant, are you using any plugin?