# ask-the-community
t
Hi, I am wondering if there is an example/tutorial on multi-node multi-gpu training. I only see the single-node multi-gpu training example.
k
for multi-node multi-gpu you will need to use the MPI plugin
or use the Ray plugin
cc @Niels Bantilan / @Samhita Alla
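For reference, a minimal sketch of what an MPI-plugin task could look like, assuming `flytekitplugins-kfmpi` and its `MPIJob` task config (field names may vary by plugin version), with illustrative resource values:
```python
# A minimal sketch of a Flyte task using the MPI plugin.
# Assumes flytekitplugins-kfmpi is installed and the Kubeflow MPI
# operator is deployed in the cluster; fields may differ by version.
from flytekit import Resources, task
from flytekitplugins.kfmpi import MPIJob


@task(
    task_config=MPIJob(
        num_workers=2,            # two worker pods (e.g., one per node)
        num_launcher_replicas=1,  # a single launcher pod runs mpirun
        slots=2,                  # slots (GPUs) per worker
    ),
    requests=Resources(gpu="2", mem="16Gi"),  # illustrative values
)
def train() -> None:
    # Horovod / torch.distributed training code goes here; the operator
    # handles host discovery, so no IP addresses are hard-coded.
    ...
```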
t
when using the mpi plugin to create an MPIJob, how would you create a job like this:
```
mpirun -np 4 \
    -H 104.171.200.62:2,104.171.200.182:2 \
    -x MASTER_ADDR=104.171.200.62 \
    -x MASTER_PORT=1234 \
    -x PATH \
    -bind-to none -map-by slot \
    -mca pml ob1 -mca btl ^openib \
    python3 main.py --backend=nccl --use_syn --batch_size=8192 --arch=resnet152
```
The IP addresses of the nodes aren't known until they are spun up, correct?
this is taken from here
k
tf-plugin and pytorch-plugin also support multi-node multi-gpu training. Basically, we use the training-operator to manage the pods' (tasks') lifecycle, and it sets an env var `TF_CONFIG` in every pod. `TF_CONFIG` contains each node's IP address and the master's address, so the tasks can communicate with each other during training. Same as Horovod and MPI.
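To make that concrete, here is a sketch of what a task might see at runtime; the JSON layout follows TensorFlow's standard `TF_CONFIG` convention, and the pod hostnames are illustrative:
```python
# Sketch: reading the TF_CONFIG the training operator injects.
# A typical value looks like:
#   {"cluster": {"worker": ["trainjob-worker-0:2222",
#                           "trainjob-worker-1:2222"]},
#    "task": {"type": "worker", "index": 0}}
import json
import os

tf_config = json.loads(os.environ["TF_CONFIG"])
workers = tf_config["cluster"]["worker"]  # addresses of all peer pods
my_index = tf_config["task"]["index"]     # this pod's rank in the job
print(f"rank {my_index} of {len(workers)}: peers={workers}")
```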
k
@Tarmily Wen the mpi job is automatically created
so it should just work - 🤞 - famous last words
but @Tarmily Wen we are here to cheer you on and help. Please try it and let us know if you see any problems
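In that spirit, a rough sketch of the same training run via the PyTorch plugin instead of raw mpirun, assuming `flytekitplugins-kfpytorch` and its `PyTorch` task config (illustrative values; the Kubeflow PyTorch operator injects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE into each pod):
```python
# Sketch: multi-node multi-gpu training with the PyTorch plugin.
# Assumes flytekitplugins-kfpytorch and a deployed PyTorch operator;
# no IP addresses are hard-coded, since the operator sets the
# MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE env vars for every pod.
from flytekit import Resources, task
from flytekitplugins.kfpytorch import PyTorch
import torch.distributed as dist


@task(
    task_config=PyTorch(num_workers=2),       # worker pods beyond the master
    requests=Resources(gpu="2", mem="16Gi"),  # illustrative values
)
def train() -> None:
    # nccl reads the injected env vars to form the process group
    dist.init_process_group(backend="nccl")
    # main.py's training loop (resnet152, batch_size=8192, ...) goes here
    ...
```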