Please first check the ESPnet1 tutorial and the ESPnet2 tutorial.
Multiple GPU tips
Note that if you want to use multiple GPUs, you need to install NCCL before the setup.
Currently, ESPnet1 only supports multi-GPU training within a single node. Distributed training across multiple nodes is only supported in ESPnet2.
We do not support multi-GPU inference. Instead, please split the recognition task into multiple jobs and distribute these jobs across the GPUs.
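As an illustrative sketch (the stage number and the data-splitting option are placeholders; check your recipe's run.sh for the actual decoding stage and options), one decoding run can be launched per GPU on disjoint test subsets:

```
# Hypothetical sketch: one decoding job per GPU on disjoint subsets.
# The stage number and --recog_set are placeholders; recipes differ.
CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 5 --stop-stage 5 --recog_set "test_part1" &
CUDA_VISIBLE_DEVICES=1 ./run.sh --stage 5 --stop-stage 5 --recog_set "test_part2" &
wait
```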
If you cannot get enough speed improvement with multiple GPUs, first check the GPU usage with nvidia-smi. If the GPU-Util percentage is low, the bottleneck likely comes from disk access. You can apply data prefetching with --n-iter-processes 2 in your run.sh to mitigate the problem, as sketched below. Note that this data prefetching consumes a lot of CPU memory, so be careful when you increase the number of processes.
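As a minimal sketch of where this option goes (following common ESPnet1 recipe conventions; your run.sh may differ), the flag is added to the asr_train.py call inside run.sh:

```
# Inside run.sh (ESPnet1): add the prefetching option to the training call.
# --n-iter-processes 2 spawns two CPU worker processes that prefetch data.
${cuda_cmd} --gpu ${ngpu} ${expdir}/train.log \
    asr_train.py \
    --ngpu ${ngpu} \
    --n-iter-processes 2 \
    ...
```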
The behavior of the batch size in ESPnet2 during multi-GPU training differs from that in ESPnet1: in ESPnet2, the total batch size does not change regardless of the number of GPUs, so you need to manually increase the batch size when you increase the number of GPUs. Please refer to this doc for more information.
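For example (illustrative numbers; the key names follow the usual ESPnet2 batch options), if a single-GPU configuration used batch_bins: 4000000, you might double it when training with --ngpu 2:

```
# ESPnet2 training config sketch: scale the batch size with the GPU count.
batch_type: numel
batch_bins: 8000000   # 2 GPUs: double the single-GPU value of 4000000
```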
Start from the middle stage or stop at the specified stage
run.sh has multiple stages, including data preparation, training, etc., so you may want to start from a specified stage, for example if an earlier stage failed. You can start from a specified stage and stop the process at another specified stage as follows:
```
# Start from the 3rd stage and stop at the 5th stage
$ ./run.sh --stage 3 --stop-stage 5
```
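Setting --stage and --stop-stage to the same value reruns a single stage, e.g.:

```
# Rerun only the 4th stage
$ ./run.sh --stage 4 --stop-stage 4
```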
CTC, attention, and hybrid CTC/attention
ESPnet can easily switch the model's training/decoding mode among CTC, attention, and hybrid CTC/attention.
Each mode can be trained by specifying mtlalpha (ESPnet1) or model_conf.ctc_weight (ESPnet2):
```
# ESPnet1
# hybrid CTC/attention (default)
mtlalpha: 0.3

# CTC
mtlalpha: 1.0

# attention
mtlalpha: 0.0
```
```
# ESPnet2
# hybrid CTC/attention (default)
model_conf:
    ctc_weight: 0.3

# CTC
model_conf:
    ctc_weight: 1.0

# attention
model_conf:
    ctc_weight: 0.0
```
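To use such a configuration, point the recipe at the config file. The exact option names vary across recipes, so treat the following as a hedged sketch rather than universal flags:

```
# Typical, recipe-dependent usage:
$ ./run.sh --train_config conf/train.yaml     # many ESPnet1 recipes
$ ./run.sh --asr_config conf/train_asr.yaml   # many ESPnet2 ASR recipes
```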
Decoding for each mode can be done using the following decoding configurations:
```
# ESPnet1
# hybrid CTC/attention (default)
ctc-weight: 0.3
beam-size: 10

# CTC
ctc-weight: 1.0
## for best path decoding
api: v1 # default setting (can be omitted)
## for prefix search decoding w/ beam search
api: v2
beam-size: 10

# attention
ctc-weight: 0.0
beam-size: 10
maxlenratio: 0.8
minlenratio: 0.3
```
```
# ESPnet2
# hybrid CTC/attention (default)
ctc_weight: 0.3
beam_size: 10

# CTC
ctc_weight: 1.0
beam_size: 10

# attention
ctc_weight: 0.0
beam_size: 10
maxlenratio: 0.8
minlenratio: 0.3
```
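Analogously, the decoding configuration is passed to the recipe; again, the option names below are typical but recipe-dependent:

```
# Typical, recipe-dependent usage:
$ ./run.sh --decode_config conf/decode.yaml          # many ESPnet1 recipes
$ ./run.sh --inference_config conf/decode_asr.yaml   # many ESPnet2 ASR recipes
```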
The CTC mode does not compute the validation accuracy, so the optimum model is selected by its loss value or a related criterion, e.g.,
```
# ESPnet2: select the best model by the validation CTC CER in the training config
best_model_criterion:
-   - valid
    - cer_ctc
    - min
```
```
# ESPnet1: decode with the model that achieved the best loss
$ ./run.sh --recog_model model.loss.best
```
The pure attention mode requires setting the maximum and minimum hypothesis lengths (--maxlenratio and --minlenratio) appropriately. In general, if you have more insertion errors, you can decrease the maxlenratio value, while if you have more deletion errors, you can increase the minlenratio value. Note that the optimum values depend on the ratio of the input frame and output label lengths, which changes for each language and each BPE unit.
A negative maxlenratio can be used to set a constant maximum hypothesis length independent of the number of input frames. If maxlenratio is set to -1, the decoding always stops after the first output, which can be used to emulate utterance classification tasks. This is suitable for some spoken language understanding and speaker identification tasks.
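For instance, a decoding configuration along these lines (ESPnet2-style keys; values are illustrative) emulates a single-label classification task:

```
# Illustrative decode config: force a one-token hypothesis.
ctc_weight: 0.0
beam_size: 10
maxlenratio: -1   # negative = constant max length; -1 stops after the first output
```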
For the effectiveness of hybrid CTC/attention during training and recognition, see the hybrid CTC/attention papers. For example, hybrid CTC/attention is not sensitive to the above maximum and minimum hypothesis length heuristics.