delightful-computer-49028
10/27/2022, 4:35 PM
freezing-airport-6809
delightful-computer-49028
10/27/2022, 4:47 PM
mpirun -np 4 \
-H 104.171.200.62:2,104.171.200.182:2 \
-x MASTER_ADDR=104.171.200.62 \
-x MASTER_PORT=1234 \
-x PATH \
-bind-to none -map-by slot \
-mca pml ob1 -mca btl ^openib \
python3 main.py --backend=nccl --use_syn --batch_size=8192 --arch=resnet152
The IP addresses of the nodes aren't known until they are spun up, correct?
delightful-computer-49028
10/27/2022, 4:48 PM
glamorous-carpet-83516
10/27/2022, 5:09 PM
TF_CONFIG is set in every pod. TF_CONFIG contains each node's IP address and the master's address, so the tasks can communicate with each other during training. The same goes for Horovod and MPI.
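For illustration, here is a rough sketch of how a training script might read TF_CONFIG at startup to find its peers. It assumes the standard TF_CONFIG JSON layout (a "cluster" map of task type to "host:port" lists, plus a "task" entry). The helper names `parse_tf_config` and `mpirun_hosts`, the port 2222, and the `slots_per_host` parameter are hypothetical and not from this thread. Turning the worker list into an `mpirun -H` style host string is only meant to show the addresses are usable once the pods exist, not how any particular operator does it.

```python
import json
import os


def parse_tf_config(raw=None):
    """Return (cluster, task_type, task_index) from a TF_CONFIG JSON string.

    Falls back to the TF_CONFIG environment variable when raw is None.
    """
    if raw is None:
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return cluster, task.get("type"), task.get("index")


def mpirun_hosts(cluster, slots_per_host=2):
    """Build a host list in the form used by mpirun -H from worker addresses.

    slots_per_host is a hypothetical parameter (processes per node).
    """
    # Drop the port from each "host:port" entry.
    hosts = [addr.rsplit(":", 1)[0] for addr in cluster.get("worker", [])]
    return ",".join(f"{host}:{slots_per_host}" for host in hosts)


if __name__ == "__main__":
    # Example value only; in a real pod this would come from the environment.
    example = json.dumps({
        "cluster": {"worker": ["104.171.200.62:2222", "104.171.200.182:2222"]},
        "task": {"type": "worker", "index": 0},
    })
    cluster, task_type, task_index = parse_tf_config(example)
    print(task_type, task_index)
    print(mpirun_hosts(cluster))
```

With the example value above, the second print gives the same host string as the `-H` argument in the mpirun command earlier in the thread.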
freezing-airport-6809