Distributed training
ESPnet2 provides several modes of data-parallel distributed training.
| | DP/DDP | Single/Multi host | Option |
|---|---|---|---|
| Multi-processing with single host | DistributedDataParallel | Single | --ngpu N-GPU --multiprocessing_distributed true |
| Multi-threading with single host | DataParallel | Single | --ngpu N-GPU --multiprocessing_distributed false |
| Multi-processing with N-HOST jobs with N-GPU for each host (= N-HOST x N-GPU GPUs) | DistributedDataParallel | Multi | --dist_world_size N-HOST --ngpu N-GPU --multiprocessing_distributed true |
| Multi-threading with N-HOST jobs with N-GPU for each host (= N-HOST x N-GPU GPUs) | DistributedDataParallel | Multi | --dist_world_size N-HOST --ngpu N-GPU --multiprocessing_distributed false |
| N-NODE jobs with 1 GPU for each node | DistributedDataParallel | Single/Multi | --dist_world_size N-NODE --ngpu 1 |
Examples
Note: The behavior of batch size in ESPnet2 during multi-GPU training is different from that in ESPnet1. In ESPnet2, the total batch size is not changed regardless of the number of GPUs. Therefore, you need to manually increase the batch size if you increase the number of GPUs. Please refer to this doc for more information.
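For example, you would multiply the single-GPU batch size by the number of GPUs to keep the per-GPU batch roughly constant. A hedged sketch, assuming your configuration uses the --batch_bins option (the values below are hypothetical):
# Single-GPU baseline (hypothetical value)
% python -m espnet2.bin.asr_train --ngpu 1 --batch_bins 4000000
# 4 GPUs: multiply batch_bins by 4 so that each GPU keeps roughly the same workload
% python -m espnet2.bin.asr_train --ngpu 4 --multiprocessing_distributed true --batch_bins 16000000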
Single node with 4 GPUs in distributed mode
% python -m espnet2.bin.asr_train --ngpu 4 --multiprocessing_distributed true
You can disable distributed mode and switch to threading-based data parallel (DataParallel) as follows:
% python -m espnet2.bin.asr_train --ngpu 4 --multiprocessing_distributed false
If you encounter errors in distributed mode, please try single-GPU mode or multi-GPU mode with --multiprocessing_distributed false before reporting the issue.
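For example, a minimal single-GPU fallback for isolating the problem (a sketch; keep your other options unchanged):
% python -m espnet2.bin.asr_train --ngpu 1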
To enable sharded training
We support sharded training provided by fairscale:
% python -m espnet2.bin.asr_train --ngpu 4 --multiprocessing_distributed true --sharded_ddp true
Note that other fairscale features are not supported at the moment.
2 hosts and 2 GPUs for each host with multiprocessing distributed mode
Note that multiprocessing distributed mode assumes the same number of GPUs for each node.
(host1) % python -m espnet2.bin.asr_train \
--multiprocessing_distributed true \
--ngpu 2 \
--dist_rank 0 \
--dist_world_size 2 \
--dist_master_addr host1 \
--dist_master_port <any-free-port>
(host2) % python -m espnet2.bin.asr_train \
--multiprocessing_distributed true \
--ngpu 2 \
--dist_rank 1 \
--dist_world_size 2 \
--dist_master_addr host1 \
--dist_master_port <any-free-port>
RANK and WORLD_SIZE
--dist_rank and --dist_world_size indicate RANK and WORLD_SIZE in terms of MPI; i.e., they indicate the ID of each process and the total number of processes, respectively. They can also be specified by the environment variables ${RANK} and ${WORLD_SIZE}.
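For example, the rank-0 command of the two-host example above could equivalently be launched with environment variables (a sketch; MASTER_ADDR and MASTER_PORT are the environment-variable counterparts described under "About init method" below):
(host1) % RANK=0 WORLD_SIZE=2 MASTER_ADDR=host1 MASTER_PORT=<any-free-port> \
    python -m espnet2.bin.asr_train --ngpu 2 --multiprocessing_distributed true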
About init method
See: https://pytorch.org/docs/stable/distributed.html#tcp-initialization
There are two ways to initialize, and they can be used interchangeably in all the examples.
TCP initialization
# These three are equivalent:
--dist_master_addr <rank0-host> --dist_master_port <any-free-port>
--dist_init_method "tcp://<rank0-host>:<any-free-port>"
export MASTER_ADDR=<rank0-host> MASTER_PORT=<any-free-port>
Shared file system initialization
--dist_init_method "file:///nfs/some/where/filename"
This initialization may fail if a file from a previous run still exists. I recommend using a random file name to avoid reusing it, e.g.:
--dist_init_method "file://$(pwd)/.dist_init_$(openssl rand -base64 12)"
2 hosts which have 2 GPUs and 1 GPU respectively
Here, one process is launched per GPU, so --dist_world_size is the total number of processes, i.e. 3.
(host1) % python -m espnet2.bin.asr_train \
--ngpu 1 \
--multiprocessing_distributed false \
--dist_rank 0 \
--dist_world_size 3 \
--dist_master_addr host1 \
--dist_master_port <any-free-port>
(host1) % python -m espnet2.bin.asr_train \
--ngpu 1 \
--multiprocessing_distributed false \
--dist_rank 1 \
--dist_world_size 3 \
--dist_master_addr host1 \
--dist_master_port <any-free-port>
(host2) % python -m espnet2.bin.asr_train \
--ngpu 1 \
--multiprocessing_distributed false \
--dist_rank 2 \
--dist_world_size 3 \
--dist_master_addr host1 \
--dist_master_port <any-free-port>
2 hosts and 2 GPUs for each node using Slurm with multiprocessing distributed mode
% srun -c2 -N2 --gres gpu:2 \
python -m espnet2.bin.asr_train --ngpu 2 --multiprocessing_distributed true \
--dist_launcher slurm \
--dist_init_method "file://$(pwd)/.dist_init_$(openssl rand -base64 12)"
I recommend shared-file initialization in this case because the hosts are determined only after the job is submitted, so a free port number cannot be chosen in advance.
5 GPUs with 3 nodes using Slurm
(Not tested)
% srun -n5 -N3 --gpus-per-task 1 \
python -m espnet2.bin.asr_train --ngpu 1 --multiprocessing_distributed false \
--dist_launcher slurm \
--dist_init_method "file://$(pwd)/.dist_init_$(openssl rand -base64 12)"
2 hosts and 2 GPUs for each node using MPI with multiprocessing distributed mode
% mpirun -np 2 -host host1,host2 \
python -m espnet2.bin.asr_train --ngpu 2 --multiprocessing_distributed true \
--dist_launcher mpi \
--dist_init_method "file://$(pwd)/.dist_init_$(openssl rand -base64 12)"
espnet2.bin.launch
Coming soon...
Troubleshooting NCCL over Ethernet
NCCL WARN Connect to 192.168.1.51<51890> failed : No route to host
- Reason: Firewall?
- Possible fix: Free/open the required ports.

NCCL INFO Call to connect returned Connection refused, retrying
- Reason: The NIC is found, but the connection is refused?
- Possible fix: Set NCCL_SOCKET_IFNAME=<appropriate_interface>

NCCL WARN Bootstrap : no socket interface found
- Reason: No NIC is found (maybe NCCL_SOCKET_IFNAME is incorrect).
- Possible fix: Set NCCL_SOCKET_IFNAME=<appropriate_interface>

NCCL WARN peer mapping resources exhausted
- Reason: ???
- See: https://devtalk.nvidia.com/default/topic/970010/cuda-programming-and-performance/cuda-peer-resources-error-when-running-on-more-than-8-k80s-aws-p2-16xlarge-/post/4994583/#4994583
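When diagnosing these errors, it can also help to enable NCCL's own logging via the standard NCCL_DEBUG environment variable (an NCCL feature, not ESPnet-specific); a minimal sketch:
# Print NCCL initialization and transport details to stderr
% NCCL_DEBUG=INFO python -m espnet2.bin.asr_train --ngpu 2 --multiprocessing_distributed true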
The rules of NCCL_SOCKET_IFNAME
See: https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/env.html
- The default value is NCCL_SOCKET_IFNAME=^lo,docker.
- Two syntaxes are supported: a white list or a black list.
  - White list, e.g.: NCCL_SOCKET_IFNAME=eth,em
    - It's enough to specify the prefix only; you don't need to set it to eth0.
  - Black list, e.g.: NCCL_SOCKET_IFNAME=^virbr,lo,docker
- If multiple network interfaces are found in your environment, the first one is selected.
  - You can check your environment with ifconfig, for example: https://www.cyberciti.biz/faq/linux-list-network-interfaces-names-command/
  - Note that lo is normally the first, so lo must be filtered out.
My recommended setting for a non-virtual environment:
- White list: NCCL_SOCKET_IFNAME=en,eth,em,bond
- Or, black list: NCCL_SOCKET_IFNAME=^lo,docker,virbr,vmnet,vboxnet,wl,ww,ppp
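For example, you could export the white list before launching training (a sketch; choose the prefixes that match the interfaces on your machines, see the table below):
% export NCCL_SOCKET_IFNAME=en,eth,em,bond
% python -m espnet2.bin.asr_train --ngpu 2 --multiprocessing_distributed true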
The prefix of network interface name | Note |
---|---|
lo | Loopback. |
eth | Ethernet. Classically used. |
em | Ethernet. Dell machine? |
en | Ethernet (Used in recent Linux. e.g CentOS7) |
wlan | Wireless |
wl | Wireless LAN (Used in recent Linux) |
ww | Wireless wan (Used in recent Linux) |
ib | IP over IB |
bond | Bonding of multiple ethernets |
virbr | Virtual bridge |
docker,vmnet,vboxnet | Virtual machine |
ppp | Point to point |