Automatic Speech Recognition (Multi-tasking)
This is a template of the ASR1 multi-tasking recipe for ESPnet2. This README provides comprehensive instructions on how to extend the ASR1 recipe for prompt-based multi-task learning.
Table of Contents
- Recipe flow
- 1. Data preparation
- 2. Speed perturbation
- 3. Generate dump folder
- 4. Removal of long / short data
- 5. Input / Output Token list generation
- 6. LM statistics collection
- 7. LM training
- 8. LM perplexity
- 9. N-gram LM training
- 10. ASR statistics collection
- 11. ASR training
- 12. ASR inference
- 13. ASR scoring
- 14-16. (Optional) Pack results for upload
- How to run
- Related works
Recipe flow
The ASR1 recipe consists of 13 stages.
- Data preparation
Data preparation stage.
ESPnet format:
It calls local/data.sh to create Kaldi-style data directories in data/ for the training, validation, and evaluation sets. In addition to the files in the asr1 recipe, it generates an additional file called prompt that specifies the task to be performed for the given utterance.
prompt format (one line per utterance): `uttidA <prompt>`, `uttidB <prompt>`, ...
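For illustration, a prompt file might look like the following; the utterance IDs and the `<|...|>`-style task specifiers are hypothetical placeholders, since the actual prompt strings depend on what local/data.sh produces for each task.

```sh
# Hypothetical prompt file for a mixed classification training set.
# With --use_nlp_prompt, the prompt could instead be a natural language phrase,
# e.g. "classify the emotion of the speaker".
cat > data/train/prompt <<'EOF'
utt0001 <|er|>
utt0002 <|er|>
utt0003 <|ic|>
EOF
```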
- Speed perturbation
Augment training data with speed perturbation. data/${train_set}_spXX will be generated (XX means the speed factor). This step is optional.
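For example, the usual three-way perturbation can be requested with the `--speed_perturb_factors` option of run.sh; the factor values below are just the common choice, not something fixed by this recipe.

```sh
# Apply 0.9x / 1.0x / 1.1x speed perturbation during stage 2.
./run.sh --stage 2 --stop_stage 2 --speed_perturb_factors "0.9 1.0 1.1"
```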
- Generate dump folder
Dumping stage. This stage moves the files necessary for training from the data folder to the dump folder.
- Removal of long / short data
This stage is the same as that in ASR recipes. At this stage, the dump directories for all datasets on which multi-tasking is to be performed are merged by simple concatenation.
- Input / Output Token list generation
Token list (BPE / Char / etc) generation for both input and targets. Additionally, for Whisper tokenization, you have the option to incorporate special tokens into the Whisper vocabulary using the --nlsyms_txt flag. If you are utilizing task specifiers for prompt-based multi-tasking, similar to the original Whisper formulation, it is necessary to include these task specifiers in the Whisper vocabulary.
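As a sketch, the special tokens can be listed one per line in a text file and passed with `--nlsyms_txt`; the file name and the specifier tokens below are hypothetical examples rather than values required by the recipe.

```sh
# Hypothetical task specifiers to add to the Whisper vocabulary.
cat > local/special_tokens.txt <<'EOF'
<|ic|>
<|er|>
<|lid|>
EOF

# Generate the token list (stage 5) with the extra special tokens included.
./run.sh --stage 5 --stop_stage 5 --nlsyms_txt local/special_tokens.txt
```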
- LM statistics collection
Statistics calculation stage. It collects the shape information of the LM texts and calculates statistics for LM training. A neural-network (NN) based language model (LM) is optional for the ASR task; you can skip stages 6-8 by setting --use_lm false.
- LM training
NN-based LM model training stage. You can change the training setting via --lm_config and --lm_args options.
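For instance, assuming a training config at conf/train_lm.yaml (the path is an assumption, not something fixed by the recipe), the LM stage could be launched as follows.

```sh
# Train the NN-based LM (stage 7) with a custom config and an argument override.
./run.sh --stage 7 --stop_stage 7 \
    --lm_config conf/train_lm.yaml \
    --lm_args "--max_epoch 20"
```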
- LM perplexity
NN-based LM evaluation stage. Perplexity (PPL) is computed against the trained model.
- N-gram LM training
N-gram-based LM model training stage.
- ASR statistics collection
Statistics calculation stage. It collects the shape information of input and output texts for ASR training.
Prompt-based multi-tasking
- Instructions:
- To enable prompt-based multi-task learning across multiple tasks in English, ensure that `--use_prompt` is set to true. By default, this setting replaces the task specifier in the Whisper formulation with the one specified in the prompt file. Please refer to stage 5 for instructions on adding task specifiers to the Whisper vocabulary (a command sketch follows this list).
- If you want to perform prompt-based multi-task learning across multiple tasks in multiple languages, additionally set `--use_lang_prompt` to true. This step replaces both the language and task specifiers in the Whisper formulation with those specified in the prompt file and can also introduce a new dataset specifier. Please ensure that the task, dataset, and language specifiers are all included in the Whisper vocabulary for this option to work.
- (Optional) To use natural language phrases for prompt-based multi-tasking, set `--use_nlp_prompt` to true. In this case, you do not need to make any modifications to the Whisper vocabulary.
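A minimal command sketch, assuming the special-token file from stage 5 is called local/special_tokens.txt (a hypothetical name) and that English-only prompting is wanted:

```sh
# Collect ASR statistics (stage 10) with prompt-based multi-tasking enabled.
# For the multilingual variant add --use_lang_prompt true, or use
# --use_nlp_prompt true for natural language phrase prompts.
./run.sh --stage 10 --stop_stage 10 \
    --use_prompt true \
    --nlsyms_txt local/special_tokens.txt
```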
- ASR training
ASR model training stage. You can change the training setting via --asr_config and --asr_args options. You need to follow steps similar to those described in stage 10 to perform prompt-based multi-task learning.
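A hedged example, with conf/train_asr_whisper.yaml standing in for whatever training config the experiment actually uses:

```sh
# Train the ASR model (stage 11) with the same prompt options as stage 10.
./run.sh --stage 11 --stop_stage 11 \
    --asr_config conf/train_asr_whisper.yaml \
    --use_prompt true \
    --nlsyms_txt local/special_tokens.txt
```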
- ASR inference
ASR inference stage.
Prompt-based multi-tasking
- Instructions:
- If you have incorporated any special tokens into the Whisper vocabulary, make sure to specify the file containing these special tokens as `prompt_token_file` in the decoder config (a config sketch follows this list).
- If you are utilizing task, language, and dataset specifiers, specify these specifiers as `lang_prompt_token` in the decoder config.
- If you are employing a natural language phrase as a prompt, specify the phrase as `nlp_prompt_token` in the decoder config.
- To perform language identification and voice activity detection, we follow Whisper's pre-training setup, where the `language id` and `no speech` tags are predicted immediately after the start-of-transcript tag. Hence, for these tasks, set `lid_prompt` to true.
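As an illustrative sketch only, the snippet below writes a decode config that carries the options named above and then runs inference with it; the YAML layout, the token values, and the test set name test_fsc are assumptions to be checked against the real configs in the recipe.

```sh
# Hypothetical decode config combining the prompt-related options; in practice
# only the options relevant to the given task would be set.
cat > conf/decode_asr_example.yaml <<'EOF'
prompt_token_file: local/special_tokens.txt               # file with the added special tokens
lang_prompt_token: "<|en|> <|ic|> <|fsc|>"                # language/task/dataset specifiers (hypothetical)
nlp_prompt_token: "classify the intent of the utterance"  # natural language phrase prompt
lid_prompt: false                                         # set true for LID / VAD style tasks
EOF

./run.sh --stage 12 --stop_stage 12 \
    --inference_config conf/decode_asr_example.yaml \
    --test_sets test_fsc
```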
- ASR scoring
ASR scoring stage: error rates (char / word / token) are computed.
- (Optional) Pack results for upload
Packing stage. It packs the trained model files to prepare for uploading to Hugging Face.
- (Optional) Upload model
Upload the trained model to Hugging Face for sharing. Additional information at Docs.
How to run
SLU multi-task training
Here, we show the procedure to run multi-task learning across 14 speech classification tasks.
1. Create a dump directory using the following recipes: asvspoof, speechcommands, grabo, lt_speech_commands, arabic_sc, fsc, voxforge/lid1, iemocap, accentdb, mustard, mustard_plus_plus, voxceleb1, freesound, and esc50. You can do this by running `./run.sh --stop_stage 4` in each of these recipes. Note: download all the dataset zip files before creating the dump directories; please refer to https://github.com/ga642381/SpeechPrompt-v2/blob/main/docs/dataset.md to download all datasets.
2. Move to the egs2/uslu14/asr1 recipe directory and generate the prompt file by running `python local/create_*_prompt.py`.
3. Concatenate wav.scp, prompt, text, utt2spk, spk2utt, and utt2num_samples from all train and valid dump folders in each of the dump directories, and create two new directories, dump/raw/train_combined and dump/raw/valid, to contain the combined data (see the sketch after this list).
4. Start training with `./run.sh --stage 5 --stop_stage 11`.
5. Run decoding for each of the datasets, i.e., test_<dataset>, with the specified inference config, e.g., conf/decode_asr_<task>.yaml: `./run.sh --stage 12 --stop_stage 12 --inference_config conf/decode_asr_<task>.yaml --test_sets test_<dataset>`. For some tasks, you need to clean the prediction files using `python local/clean_emotion_pred.py`, `python local/check_lid_results.py`, or `python local/check_vad_results.py`.
6. To get accuracy, run `./run.sh --stage 13 --stop_stage 13 --inference_config conf/decode_asr_<task>.yaml --test_sets test_<dataset>`. For tasks where you need to compute f1 or weighted_f1, run `python local/compute_f1.py` or `python local/compute_weighted_f1.py`, as appropriate.
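A minimal sketch of the concatenation in step 3, assuming every source recipe keeps its dumped train and valid sets under dump/raw/{train,valid}; the recipe paths and subfolder names below are placeholders that will differ per setup.

```sh
# Combine the per-recipe dump folders into the multi-task train/valid sets.
mkdir -p dump/raw/train_combined dump/raw/valid

# Placeholder list; add the dump/raw directory of every recipe prepared in step 1.
src_dumps="../../asvspoof/asr1/dump/raw ../../fsc/asr1/dump/raw"

for d in ${src_dumps}; do
  for f in wav.scp prompt text utt2spk spk2utt utt2num_samples; do
    cat "${d}/train/${f}" >> "dump/raw/train_combined/${f}"
    cat "${d}/valid/${f}" >> "dump/raw/valid/${f}"
  done
done
```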
Related works
@misc{arora2023universlu,
title={UniverSLU: Universal Spoken Language Understanding for Diverse Classification and Sequence Generation Tasks with a Single Network},
author={Siddhant Arora and Hayato Futami and Jee-weon Jung and Yifan Peng and Roshan Sharma and Yosuke Kashiwagi and Emiru Tsunoo and Shinji Watanabe},
year={2023},
eprint={2310.02973},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@InProceedings{pmlr-v202-radford23a,
title = {Robust Speech Recognition via Large-Scale Weak Supervision},
author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and Mcleavey, Christine and Sutskever, Ilya},
booktitle = {Proceedings of the 40th International Conference on Machine Learning},
pages = {28492--28518},
year = {2023},
editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
volume = {202},
series = {Proceedings of Machine Learning Research},
month = {23--29 Jul},
publisher = {PMLR},
}