About 2 min

Common usages

ESPnet1

ESPnet2

Multiple GPU TIPs

Note that if you want to use multiple GPUs, the installation of nccl is required before setup.
Currently, espnet1 only supports multiple GPU training within a single node. The distributed setup across multiple nodes is only supported in espnet2.
We don't support multiple GPU inference. Instead, please split the recognition task for multiple jobs and distribute these split jobs to multiple GPUs.
If you cannot get enough speed improvement with multiple GPUs, you should first check the GPU usage by nvidia-smi. If the GPU-Util percentage is low, the bottleneck will come from disk access. You can apply data prefetching by --n-iter-processes 2 in your run.sh to mitigate the problem. Note that this data prefetching consumes a lot of CPU memory, so please be careful when you increase the number of processes.
The behavior of batch size in ESPnet2 during multi-GPU training is different from that in ESPnet1. In ESPnet2, the total batch size is not changed regardless of the number of GPUs. Therefore, you need to manually increase the batch size if you increase the number of GPUs. Please refer to this doc for more information.

Start from the middle stage or stop at the specified stage

run.sh has multiple stages, including data preparation, training, etc., so you may likely want to start from the specified stage if some stages failed for some reason, for example.

You can start from the specified stage as follows and stop the process at the specified stage:

# Start from 3rd stage and stop at 5th stage
$ ./run.sh --stage 3 --stop-stage 5

CTC, attention, and hybrid CTC/attention

ESPnet can easily switch the model's training/decoding mode from CTC, attention, and hybrid CTC/attention.

Each mode can be trained by specifying mtlalpha (espnet1) ctc_weight (espnet2):

espnet1

# hybrid CTC/attention (default)
mtlalpha: 0.3

# CTC
mtlalpha: 1.0

# attention
mtlalpha: 0.0

espnet2

# hybrid CTC/attention (default)
model_conf:
    ctc_weight: 0.3

# CTC
model_conf:
    ctc_weight: 1.0

# attention
model_conf:
    ctc_weight: 0.0

Decoding for each mode can be done using the following decoding configurations:

espnet1

# hybrid CTC/attention (default)
ctc-weight: 0.3
beam-size: 10

# CTC
ctc-weight: 1.0
## for best path decoding
api: v1 # default setting (can be omitted)
## for prefix search decoding w/ beam search
api: v2
beam-size: 10

# attention
ctc-weight: 0.0
beam-size: 10
maxlenratio: 0.8
minlenratio: 0.3

espnet2

# hybrid CTC/attention (default)
ctc_weight: 0.3
beam_size: 10

# CTC
ctc_weight: 1.0
beam_size: 10

# attention
ctc_weight: 0.0
beam_size: 10
maxlenratio: 0.8
minlenratio: 0.3

The CTC mode does not compute the validation accuracy, and the optimum model is selected with its loss value, e.g.,
- espnet1
```
best_model_criterion:
-   - valid
    - cer_ctc
    - min
```
- espnet2
```
./run.sh --recog_model model.loss.best
```
The pure attention mode requires setting the maximum and minimum hypothesis length (--maxlenratio and --minlenratio) appropriately. In general, if you have more insertion errors, you can decrease the maxlenratio value, while if you have more deletion errors, you can increase the minlenratio value. Note that the optimum values depend on the ratio of the input frame and output label lengths, which are changed for each language and each BPE unit.
Negative maxlenratio can be used to set the constant maximum hypothesis length independently from the number of input frames. If maxlenratio is set to -1, the decoding will always stop after the first output, which can be used to emulate the utterance classification tasks. This is suitable for some spoken language understanding and speaker identification tasks.
About the effectiveness of hybrid CTC/attention during training and recognition, see [2] and [3]. For example, hybrid CTC/attention is not sensitive to the above maximum and minimum hypothesis heuristics.