Automatic Speech Recognition with Discrete Units
This is a template for the ASR2 recipe in ESPnet2. The difference from ASR1 is that discrete tokens, rather than conventional audio or spectral features, are used as input.
Table of Contents
- Recipe flow
- 1. Data preparation
- 2. Speed perturbation
- 3. Wav format
- 4. Removal of long / short data
- 5. Generate discrete tokens
- 6. Generate dump raw folder
- 7. Input and Output token list generation
- 8. LM statistics collection
- 9. LM training
- 10. LM perplexity
- 11. N-gram LM training
- 12. ASR statistics collection
- 13. ASR training
- 14. ASR inference
- 15. ASR scoring
- 16-17. (Optional) Pack results and upload model
- How to run
- Related works
Recipe flow
The ASR2 recipe consists of 17 stages (the last two are optional).
1. Data preparation
Data preparation stage.
ESPnet format:
It calls local/data.sh to create Kaldi-style data directories in data/ for the training, validation, and evaluation sets. This is the same as in the asr1 task.
2. Speed perturbation
Augment the training data with speed perturbation. data/${train_set}_spXX will be generated, where XX is the speed factor. This step is optional.
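For example, the common three-way perturbation can be requested via the template's --speed_perturb_factors option (a sketch; combine with your other options and check run.sh in your checkout for the exact flag):
$ ./run.sh --speed_perturb_factors "0.9 1.0 1.1"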
3. Wav format
Format the wave files in wav.scp to a single format (wav / flac / kaldi_ark).
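The target format can be chosen with the --audio_format option (assuming the flag name used by the ESPnet recipe templates):
$ ./run.sh --audio_format flac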
4. Removal of long / short data
Remove utterances that match the following conditions:
- Utterances that are too long or too short.
- Utterances whose target text is empty.
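The duration thresholds can be adjusted; for example (option names as in the ESPnet recipe templates, with illustrative values):
$ ./run.sh --min_wav_duration 0.1 --max_wav_duration 30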
5. Generate discrete tokens
The discrete tokens of the input speech signals are generated. For the ASR2 task, the input is discrete tokens (derived from self-supervised learning (SSL) features) and the target is the ASR transcription. After the discrete tokens (usually integers) are obtained, they are converted to CJK characters, which are more convenient for tokenization.
Input / Target / Process of data preparation
Stages:
- Generate SSL features for the train / valid / test sets.
- Train the K-means model on a subset of the training data.
- Generate K-means-based discrete tokens for the train / valid / test sets.
- (Optional) Measure the quality of the discrete tokens if forced alignments are available.
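For example, the SSL model / layer used for feature extraction and the K-means codebook size are typically selected as follows (a sketch based on the egs2/librispeech/asr2 setup, where --kmeans_feature takes the form model_type/layer_index; exact option names may vary across versions):
$ ./run.sh --kmeans_feature "wavlm_large/21" --nclusters 2000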
6. Generate dump raw folder
This stage moves the files necessary for training from the dump/extracted folder to the dump/raw folder.
7. Input and Output token list generation
Token list (BPE / char / etc.) generation for both the input and the target.
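For example, the BPE vocabulary sizes of the input (discrete token) and target (text) sides can be set independently (the --src_nbpe / --tgt_nbpe names follow the asr2 template convention and the values are illustrative; verify them against asr2.sh):
$ ./run.sh --src_nbpe 3000 --tgt_nbpe 5000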
8. LM statistics collection
A neural-network (NN) based language model (LM) is optional for the ASR task. You can skip the NN-LM stages (8-10) by setting --use_lm false. This is the statistics calculation stage: it collects the shape information of the LM texts and calculates statistics for LM training.
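For example, to skip the neural LM entirely:
$ ./run.sh --use_lm false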
9. LM training
NN-based LM model training stage. You can change the training setting via the --lm_config and --lm_args options.
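For example (the config path and epoch count are illustrative):
$ ./run.sh --lm_config conf/train_lm_transformer.yaml --lm_args "--max_epoch 20"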
10. LM perplexity
NN-based LM evaluation stage. Perplexity (PPL) is computed against the trained model.
11. N-gram LM training
N-gram-based LM model training stage.
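A minimal sketch, assuming the --use_ngram switch from the ESPnet ASR templates (verify the exact option name in asr2.sh):
$ ./run.sh --use_ngram true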
12. ASR statistics collection
Statistics calculation stage. It collects the shape information of input and output texts for ASR training.
13. ASR training
ASR model training stage. You can change the training setting via the --asr_config and --asr_args options.
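For example (the config path and epoch count are illustrative):
$ ./run.sh --asr_config conf/train_asr.yaml --asr_args "--max_epoch 50"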
14. ASR inference
ASR inference stage.
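For example, to run only this stage with a specific decoding configuration (the config path is illustrative; --inference_config, --gpu_inference, and --inference_nj follow the ESPnet template conventions):
$ ./run.sh --stage 14 --stop_stage 14 --inference_config conf/decode_asr.yaml --gpu_inference true --inference_nj 4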
15. ASR scoring
ASR scoring stage: error rates (char / word / token) are computed.
16. (Optional) Pack results for upload
Packing stage. It packs the trained model files to prepare for uploading to Hugging Face.
17. (Optional) Upload model
Upload the trained model to Hugging Face for sharing. See the ESPnet documentation for additional information.
How to run
LibriSpeech Training
Here, we show the procedure to run the recipe using egs2/librispeech/asr2.
Move to the recipe directory.
$ cd egs2/librispeech/asr2
Modify the LIBRISPEECH variable in db.sh if you want to change the download directory.
$ vim db.sh
Modify cmd.sh and conf/*.conf if you want to use a job scheduler. See the documentation on using a job scheduling system for details.
$ vim cmd.sh
Run run.sh, which conducts all of the stages explained above.
$ ./run.sh
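You can also run a subset of the stages with --stage and --stop_stage; for example, to run only ASR statistics collection and training (stages 12-13):
$ ./run.sh --stage 12 --stop_stage 13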
Related works
@INPROCEEDINGS{9054224,
  author={Baevski, Alexei and Mohamed, Abdelrahman},
  booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Effectiveness of Self-Supervised Pre-Training for ASR},
  year={2020},
  pages={7694-7698},
  doi={10.1109/ICASSP40776.2020.9054224}}

@article{chang2023exploration,
  title={Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning},
  author={Chang, Xuankai and Yan, Brian and Fujita, Yuya and Maekaku, Takashi and Watanabe, Shinji},
  journal={arXiv preprint arXiv:2305.18108},
  year={2023}
}