Tarmily Wen
10/27/2022, 4:35 PM
Ketan (kumare3)
Tarmily Wen
10/27/2022, 4:47 PM
mpirun -np 4 \
-H 104.171.200.62:2,104.171.200.182:2 \
-x MASTER_ADDR=104.171.200.62 \
-x MASTER_PORT=1234 \
-x PATH \
-bind-to none -map-by slot \
-mca pml ob1 -mca btl ^openib \
python3 main.py --backend=nccl --use_syn --batch_size=8192 --arch=resnet152
The IP addresses of the nodes aren't known until they are spun up, correct?
Tarmily Wen
10/27/2022, 4:48 PM
Kevin Su
10/27/2022, 5:09 PM
TF_CONFIG in every pod. TF_CONFIG contains each node's IP address and the master's address, so the tasks can communicate with each other during training. The same applies to Horovod and MPI.
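For illustration, here is a minimal sketch of how a task could read that information from inside a pod. It assumes TF_CONFIG follows TensorFlow's usual JSON layout: a `cluster` map from task type to a list of `host:port` strings, plus a `task` entry with `type` and `index`. The helper name `parse_tf_config` and port 2222 are made up for the example; the IPs are the ones from the mpirun command above.

```python
import json
import os


def parse_tf_config(raw=None):
    """Return (cluster, task_type, task_index) parsed from a TF_CONFIG JSON string.

    Falls back to the TF_CONFIG environment variable, then to an empty config.
    """
    raw = raw or os.environ.get("TF_CONFIG") or "{}"
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return cluster, task.get("type"), task.get("index")


# Hypothetical TF_CONFIG value: the ports are assumptions, not taken from the thread.
example = json.dumps({
    "cluster": {"worker": ["104.171.200.62:2222", "104.171.200.182:2222"]},
    "task": {"type": "worker", "index": 1},
})

cluster, task_type, task_index = parse_tf_config(example)
peers = cluster.get("worker", [])
print(f"I am {task_type} {task_index} of {len(peers)}; peers: {peers}")
```

Because the addresses arrive through the environment at pod start, nothing has to be hardcoded the way the -H list is in the mpirun command.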
Ketan (kumare3)