# ask-the-community
t
Hi, I am wondering if there is an example/tutorial on multi-node multi-gpu training. I only see the single-node multi-gpu training example.
k
for multi-node multi-gpu you will need to use the MPI plugin
or use the Ray plugin
cc @Niels Bantilan / @Samhita Alla
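For reference, a minimal sketch of what an MPI-plugin task could look like, assuming `flytekitplugins-kfmpi` and its `MPIJob` task config (field names may vary by plugin version), with illustrative resource values:
```python
# A minimal sketch of a Flyte task using the MPI plugin.
# Assumes flytekitplugins-kfmpi is installed and the Kubeflow MPI
# operator is deployed in the cluster; fields may differ by version.
from flytekit import Resources, task
from flytekitplugins.kfmpi import MPIJob


@task(
    task_config=MPIJob(
        num_workers=2,            # two worker pods (e.g., one per node)
        num_launcher_replicas=1,  # a single launcher pod runs mpirun
        slots=2,                  # slots (GPUs) per worker
    ),
    requests=Resources(gpu="2", mem="16Gi"),  # illustrative values
)
def train() -> None:
    # Horovod / torch.distributed training code goes here; the operator
    # handles host discovery, so no IP addresses are hard-coded.
    ...
```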
t
when using the mpi plugin to create an MPIJob, how would you create a job like this:
```
mpirun -np 4 \
    -H 104.171.200.62:2,104.171.200.182:2 \
    -x MASTER_ADDR=104.171.200.62 \
    -x MASTER_PORT=1234 \
    -x PATH \
    -bind-to none -map-by slot \
    -mca pml ob1 -mca btl ^openib \
    python3 main.py --backend=nccl --use_syn --batch_size=8192 --arch=resnet152
```
The IP addresses of the nodes aren't known until they are spun up, correct?
this is taken from here
k
tf-plugin and pytorch-plugin also support multi-node multi-gpu training. Basically, we use the training-operator to manage the pods' (tasks') lifecycle, and it sets an env var `TF_CONFIG` in every pod. `TF_CONFIG` contains each node's IP address and the master's address, so the tasks can communicate with each other during training. Same as Horovod and MPI.
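To make that concrete, here is a sketch of what a task might see at runtime; the JSON layout follows TensorFlow's standard `TF_CONFIG` convention, and the pod hostnames are illustrative:
```python
# Sketch: reading the TF_CONFIG the training operator injects.
# A typical value looks like:
#   {"cluster": {"worker": ["trainjob-worker-0:2222",
#                           "trainjob-worker-1:2222"]},
#    "task": {"type": "worker", "index": 0}}
import json
import os

tf_config = json.loads(os.environ["TF_CONFIG"])
workers = tf_config["cluster"]["worker"]  # addresses of all peer pods
my_index = tf_config["task"]["index"]     # this pod's rank in the job
print(f"rank {my_index} of {len(workers)}: peers={workers}")
```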
k
@Tarmily Wen the mpi job is automatically created
so it should just work - 🤞 - famous last words
but @Tarmily Wen we are here to cheer you on and help. Please try it and let us know if you see any problems
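In that spirit, a rough sketch of the same training run via the PyTorch plugin instead of raw mpirun, assuming `flytekitplugins-kfpytorch` and its `PyTorch` task config (illustrative values; the Kubeflow PyTorch operator injects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE into each pod):
```python
# Sketch: multi-node multi-gpu training with the PyTorch plugin.
# Assumes flytekitplugins-kfpytorch and a deployed PyTorch operator;
# no IP addresses are hard-coded, since the operator sets the
# MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE env vars for every pod.
from flytekit import Resources, task
from flytekitplugins.kfpytorch import PyTorch
import torch.distributed as dist


@task(
    task_config=PyTorch(num_workers=2),       # worker pods beyond the master
    requests=Resources(gpu="2", mem="16Gi"),  # illustrative values
)
def train() -> None:
    # nccl reads the injected env vars to form the process group
    dist.init_process_group(backend="nccl")
    # main.py's training loop (resnet152, batch_size=8192, ...) goes here
    ...
```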