Automatic Speech Recognition with Discrete Units
This is a template for the ASR2 recipe in ESPnet2. The difference from ASR1 is that discrete tokens, rather than conventional audio or spectral features, are used as input.
Table of Contents
- Recipe flow
- 1. Data preparation
- 2. Speed perturbation
- 3. Wav format
- 4. Removal of long / short data
- 5. Generate discrete tokens
- 6. Generate dump raw folder
- 7. Input and Output token list generation
- 8. LM statistics collection
- 9. LM training
- 10. LM perplexity
- 11. N-gram LM training
- 12. ASR statistics collection
- 13. ASR training
- 14. ASR inference
- 15. ASR scoring
- 16-17. (Optional) Pack results and upload model
- How to run
- Related works
Recipe flow
The ASR2 recipe consists of 17 stages (the last two are optional).
1. Data preparation
Data preparation stage.
ESPnet format:
It calls local/data.sh to create Kaldi-style data directories in data/ for the training, validation, and evaluation sets. This is the same as in the asr1 task.
2. Speed perturbation
Augment the training data with speed perturbation. data/${train_set}_spXX will be generated, where XX is the speed factor. This step is optional.
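For example, the common three-way perturbation can be requested via the template's --speed_perturb_factors option (a sketch; combine with your other options and check run.sh in your checkout for the exact flag):
$ ./run.sh --speed_perturb_factors "0.9 1.0 1.1"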
3. Wav format
Format the wave files in wav.scp to a single format (wav / flac / kaldi_ark).
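The target format can be chosen with the --audio_format option (assuming the flag name used by the ESPnet recipe templates):
$ ./run.sh --audio_format flac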
4. Removal of long / short data
Remove utterances that match the following conditions:
- Utterances that are too long or too short.
- Utterances whose target text is empty.
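The duration thresholds can be adjusted; for example (option names as in the ESPnet recipe templates, with illustrative values):
$ ./run.sh --min_wav_duration 0.1 --max_wav_duration 30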
5. Generate discrete tokens
The discrete tokens of the input speech signals are generated. For the ASR2 task, the input is discrete tokens (derived from self-supervised learning (SSL) features) and the target is the ASR transcription. After the discrete tokens (usually integers) are obtained, they are converted to CJK characters, which are more convenient for tokenization.
Input / Target / Process of data preparation
Stages:
- Generate SSL features for the train / valid / test sets.
- Train the K-means model on a subset of the training data.
- Generate K-means-based discrete tokens for the train / valid / test sets.
- (Optional) Measure the quality of the discrete tokens if forced alignments are available.
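For example, the SSL model / layer used for feature extraction and the K-means codebook size are typically selected as follows (a sketch based on the egs2/librispeech/asr2 setup, where --kmeans_feature takes the form model_type/layer_index; exact option names may vary across versions):
$ ./run.sh --kmeans_feature "wavlm_large/21" --nclusters 2000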
6. Generate dump raw folder
This stage moves the files necessary for training from the dump/extracted folder to the dump/raw folder.
7. Input and Output token list generation
Token list (BPE / char / etc.) generation for both the input and the target.
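For example, the BPE vocabulary sizes of the input (discrete token) and target (text) sides can be set independently (the --src_nbpe / --tgt_nbpe names follow the asr2 template convention and the values are illustrative; verify them against asr2.sh):
$ ./run.sh --src_nbpe 3000 --tgt_nbpe 5000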
8. LM statistics collection
A neural-network (NN) based language model (LM) is optional for the ASR task. You can skip the NN-LM stages (8-10) by setting --use_lm false. This is the statistics calculation stage: it collects the shape information of the LM texts and calculates statistics for LM training.
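For example, to skip the neural LM entirely:
$ ./run.sh --use_lm false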
9. LM training
NN-based LM model training stage. You can change the training setting via the --lm_config and --lm_args options.
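For example (the config path and epoch count are illustrative):
$ ./run.sh --lm_config conf/train_lm_transformer.yaml --lm_args "--max_epoch 20"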
10. LM perplexity
NN-based LM evaluation stage. Perplexity (PPL) is computed against the trained model.
11. N-gram LM training
N-gram-based LM model training stage.
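A minimal sketch, assuming the --use_ngram switch from the ESPnet ASR templates (verify the exact option name in asr2.sh):
$ ./run.sh --use_ngram true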
12. ASR statistics collection
Statistics calculation stage. It collects the shape information of input and output texts for ASR training.
13. ASR training
ASR model training stage. You can change the training setting via the --asr_config and --asr_args options.
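For example (the config path and epoch count are illustrative):
$ ./run.sh --asr_config conf/train_asr.yaml --asr_args "--max_epoch 50"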
14. ASR inference
ASR inference stage.
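For example, to run only this stage with a specific decoding configuration (the config path is illustrative; --inference_config, --gpu_inference, and --inference_nj follow the ESPnet template conventions):
$ ./run.sh --stage 14 --stop_stage 14 --inference_config conf/decode_asr.yaml --gpu_inference true --inference_nj 4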
15. ASR scoring
ASR scoring stage: error rates (char / word / token) are computed.
16. (Optional) Pack results for upload
Packing stage. It packs the trained model files to prepare for uploading to Hugging Face.
17. (Optional) Upload model
Upload the trained model to Hugging Face for sharing. See the ESPnet documentation for additional information.
How to run
LibriSpeech Training
Here, we show the procedure to run the recipe using egs2/librispeech/asr2.
Move to the recipe directory.
$ cd egs2/librispeech/asr2
Modify the LIBRISPEECH variable in db.sh if you want to change the download directory.
$ vim db.sh
Modify cmd.sh and conf/*.conf if you want to use a job scheduler. See the documentation on using a job scheduling system for details.
$ vim cmd.sh
Run run.sh, which conducts all of the stages explained above.
$ ./run.sh
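You can also run a subset of the stages with --stage and --stop_stage; for example, to run only ASR statistics collection and training (stages 12-13):
$ ./run.sh --stage 12 --stop_stage 13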
Related works
@INPROCEEDINGS{9054224,
  author={Baevski, Alexei and Mohamed, Abdelrahman},
  booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Effectiveness of Self-Supervised Pre-Training for ASR},
  year={2020},
  pages={7694-7698},
  doi={10.1109/ICASSP40776.2020.9054224}}

@article{chang2023exploration,
  title={Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning},
  author={Chang, Xuankai and Yan, Brian and Fujita, Yuya and Maekaku, Takashi and Watanabe, Shinji},
  journal={arXiv preprint arXiv:2305.18108},
  year={2023}
}