Vipul Goswami
09/14/2023, 6:46 AM
File "/opt/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1024, in _new_process_group_helper
backend_class = ProcessGroupNCCL(backend_prefix_store, group_rank, group_size, pg_options)
Message:
ProcessGroupNCCL is only supported with GPUs, no GPUs found!
We have also added a check at the start of the task for GPU availability, and from what I can see, torch is able to identify the GPU:
INFO:dummy.test_gpu:GPU is available in your machine with following device count : 1
2023-09-13 17:00:31,611 [dummy.test_gpu] [INFO ] GPU is available in your machine with following device count : 1
GPU available: True (cuda), used: True
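One way to narrow this down is to run the visibility check inside the same process that creates the process group, since the traceback is raised there and not necessarily where the earlier check ran. The following is a minimal sketch, not the plugin's own logic: `gpu_visibility_report` and `choose_backend` are hypothetical helper names, and the list of environment variables is an assumption about what is worth inspecting.

```python
import os


def gpu_visibility_report(env=None):
    """Collect environment variables that commonly affect GPU visibility.

    CUDA_VISIBLE_DEVICES is read by CUDA, NVIDIA_VISIBLE_DEVICES by the
    NVIDIA container runtime, and the rank variables are set by torch's
    elastic launcher. Unset variables are reported as "<unset>".
    """
    env = os.environ if env is None else env
    keys = (
        "CUDA_VISIBLE_DEVICES",
        "NVIDIA_VISIBLE_DEVICES",
        "LOCAL_RANK",
        "RANK",
        "WORLD_SIZE",
    )
    return {key: env.get(key, "<unset>") for key in keys}


def choose_backend(cuda_available, device_count):
    """Return "nccl" only if this process sees at least one GPU, else "gloo"."""
    return "nccl" if cuda_available and device_count > 0 else "gloo"
```

In a worker, the inputs would come from `torch.cuda.is_available()` and `torch.cuda.device_count()`. Logging the report next to the chosen backend shows whether the process that fails sees the same device as the process that printed "GPU is available".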
The EC2 instance type we are using: g4dn.xlarge
Can anyone help me with this?
Is it possible to define the instance type in flytekit.task resources?
Ketan (kumare3)
Vipul Goswami
09/14/2023, 1:24 PM
@task(
    requests=Resources(cpu="100m", mem="1Gi", gpu="1"),
    limits=Resources(cpu="200m", mem="6Gi", gpu="1"),
)
def test_run(
It's a pretty simple config; I have not added anything special.
Ketan (kumare3)
Vipul Goswami
09/14/2023, 1:26 PM
Ketan (kumare3)
Vipul Goswami
09/14/2023, 1:30 PM
flytekit elastic plugin
to use only a single GPU
Ketan (kumare3)
Vipul Goswami
09/14/2023, 1:51 PM
"But then you have to modify your code"
That's fine, we can do that. But would it be possible to share how I can define the instance type in the Flyte resources section?
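For context on the instance-type question: as far as I know, `Resources` in flytekit covers quantities such as `cpu`, `mem`, and `gpu`, and has no instance-type field. On Kubernetes, pinning a pod to an instance type is usually done with a node selector on the well-known `node.kubernetes.io/instance-type` label. The sketch below only builds that pod-spec fragment as a plain dictionary. `gpu_pod_spec` is a hypothetical helper, and the container name `"primary"` is an assumption.

```python
def gpu_pod_spec(instance_type, gpus=1, container_name="primary"):
    """Build a Kubernetes pod-spec fragment pinned to one instance type.

    The node selector uses the well-known instance-type label; whether the
    cluster's nodes carry that label is an assumption to verify. The GPU
    limit uses the NVIDIA device plugin's extended resource name.
    """
    return {
        "nodeSelector": {"node.kubernetes.io/instance-type": instance_type},
        "containers": [
            {
                "name": container_name,
                "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
            }
        ],
    }


# Example: the instance type mentioned earlier in this thread.
spec = gpu_pod_spec("g4dn.xlarge")
```

How such a fragment is attached to a task depends on the flytekit version and the cluster setup. A pod template passed to the task is one likely route, but treat that as something to confirm in the docs for the version you run.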
Samhita Alla
Vipul Goswami
09/15/2023, 9:00 AM
Samhita Alla
Vipul Goswami
09/15/2023, 10:46 AM
metadata:
name: local
phase: serve
training:
accelerator: gpu
num_workers: 4
hyperparameters:
batch_size: 4096
embedding_dim: 10
weight_decay: 1e-6
lr: 0.005
optim_func: XXX
loss_func: XXX
max_epochs: 10
Samhita Alla