# ask-the-community
v
Hi Team, we are trying to run our workflow on GPU, but when we run the Flyte task we get the following error:
```
File "/opt/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1024, in _new_process_group_helper
    backend_class = ProcessGroupNCCL(backend_prefix_store, group_rank, group_size, pg_options)
```
Message:
```
ProcessGroupNCCL is only supported with GPUs, no GPUs found!
```
We have also added a check at the start of the task for GPU availability, and from it I can see that torch is able to identify the GPU:
```
INFO:dummy.test_gpu:GPU is available in your machine with following device count : 1
2023-09-13 17:00:31,611 [dummy.test_gpu] [INFO ]  GPU is available in your machine with following device count : 1
GPU available: True (cuda), used: True
```
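For reference, a check like the one logged above might look like this (a minimal sketch; the logger name `dummy.test_gpu` is taken from the log, the rest is an assumption):
```python
import logging
import torch

logger = logging.getLogger("dummy.test_gpu")

# Log whether torch can see a CUDA device before any distributed setup runs.
if torch.cuda.is_available():
    logger.info(
        "GPU is available in your machine with following device count : %s",
        torch.cuda.device_count(),
    )
else:
    logger.warning("No GPU visible to torch in this container")
```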
The EC2 instance type we are using is g4dn.xlarge.
Can anyone help me with this? Is it possible to define the instance type in the `flytekit.task` resources?
k
You can specify the instance type through labels.
You shouldn't use that inverse pattern, though.
But coming soon to a Flyte near you: GPU type selection - cc @jeev
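For illustration, pinning a task to an instance type via node labels could look like this (a hedged sketch: `PodTemplate` is part of flytekit, but the `node.kubernetes.io/instance-type` label and the cluster setup are assumptions):
```python
from flytekit import PodTemplate, Resources, task
from kubernetes.client import V1PodSpec

@task(
    requests=Resources(cpu="100m", mem="1Gi", gpu="1"),
    pod_template=PodTemplate(
        pod_spec=V1PodSpec(
            containers=[],  # Flyte fills in the task container
            # Well-known node label on EKS; schedules the pod onto g4dn.xlarge nodes.
            node_selector={"node.kubernetes.io/instance-type": "g4dn.xlarge"},
        )
    ),
)
def gpu_task() -> None:
    ...
```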
@Vipul Goswami can you share your snippet of the config in the task decorator?
v
```python
@task(
    requests=Resources(cpu="100m", mem="1Gi", gpu="1"),
    limits=Resources(cpu="200m", mem="6Gi", gpu="1"),
)
def test_run(
```
It's a pretty simple config, nothing special that I have added.
k
Ohh then why are you using distributed training?
You don’t need a process group
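In other words, the error usually comes from code that initializes a distributed process group, which a single-GPU run doesn't need (a sketch; the actual training code may differ):
```python
import torch

# Multi-GPU path: this is the call that constructs ProcessGroupNCCL and
# fails when no GPU is visible. Not needed for single-GPU training.
# torch.distributed.init_process_group(backend="nccl")

# Single-GPU path: just move the model and data to the one device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model.to(device)  # `model` is hypothetical here
```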
v
Sorry, what do you mean by distributed training? (I am a newbie to this field)
k
If you still want to set it up, my recommendation is to use the flytekit elastic plugin.
Vipul - no worries - single-GPU vs multi-GPU training is different; from the error it seems your code is trying to set up multi-GPU.
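For reference, a minimal sketch of what that could look like with the `flytekitplugins-kfpytorch` plugin (the task body and worker counts are illustrative):
```python
from flytekit import Resources, task
from flytekitplugins.kfpytorch import Elastic

@task(
    # Runs via torch elastic; with nnodes=1 and nproc_per_node=1 this is
    # effectively single-GPU training, but it can scale up later.
    task_config=Elastic(nnodes=1, nproc_per_node=1),
    requests=Resources(cpu="100m", mem="1Gi", gpu="1"),
    limits=Resources(cpu="200m", mem="6Gi", gpu="1"),
)
def train() -> None:
    ...
```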
v
Ok, but in that case I don't think we need multi-GPU at the moment. So as per my understanding, you are suggesting to use the flytekit elastic plugin to run on only a single GPU?
k
It will scale from one to n
And I agree you shouldn’t need to use more than one
But then you have to modify your code
v
> But then you have to modify your code
That's fine, we can do that. But would it be possible to share how I can define the instance type in the Flyte resources section?
s
You cannot specify the GPU type in the task resources. That's still being worked on: https://github.com/flyteorg/flyte/discussions/3796. If you aren't using distributed training, I'm not sure why your code is hitting that specific code path. Are you running an example from the docs?
v
I am running the code provided by our Data Scientist, but I suspect it is because I am using the number of workers as 4 (though that is only a suspicion).
s
Could you share your configuration?
v
```yaml
metadata:
  name: local
  phase: serve
training:
  accelerator: gpu
  num_workers: 4
  hyperparameters:
    batch_size: 4096
    embedding_dim: 10
    weight_decay: 1e-6
    lr: 0.005
    optim_func: XXX
    loss_func: XXX
    max_epochs: 10
```
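If this config drives a PyTorch Lightning `Trainer`, a workers value passed through as `devices` would explain the NCCL path (this mapping is a guess; in Lightning, DataLoader `num_workers` alone does not trigger distributed training):
```python
import pytorch_lightning as pl

# If the config ends up as Trainer(devices=4), Lightning picks a DDP
# strategy and initializes a NCCL process group - the failing call above.
# trainer = pl.Trainer(accelerator="gpu", devices=4)

# Single GPU: no process group is created.
trainer = pl.Trainer(accelerator="gpu", devices=1)
```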
s
I meant, are you using any plugin?