CMU 11751/18781 2021: ESPnet Tutorial
ESPnet is an end-to-end speech processing toolkit, initially focused on end-to-end speech recognition and end-to-end text-to-speech, but now extended to various other speech processing tasks. ESPnet uses PyTorch as its main deep learning engine, and also follows Kaldi-style data processing, feature extraction/formats, and recipes to provide a complete setup for speech recognition and other speech processing experiments.
This tutorial is based on the collection of ESPnet notebook demos https://github.com/espnet/notebook, the ESPnet documentation at https://espnet.github.io/espnet/, and the README.md in https://github.com/espnet/espnet
Author: Shinji Watanabe (@sw005320)
Useful links
- Installation https://espnet.github.io/espnet/installation.html
- Usage https://espnet.github.io/espnet/espnet2_tutorial.html
Run an inference example
ESPnet covers various speech applications and provides pre-trained models for them.
Please check the models listed in espnet_model_zoo.
We can play with demos based on these pre-trained models.
All we need to do is install espnet_model_zoo.
Note that this pip-based installation does not include training and so on; the full installation is explained later. You can also find similar demos in the Hugging Face Hub https://huggingface.co/espnet
# It takes 1 minute.
!pip install -q espnet_model_zoo
Speech recognition demo
Author: Jiatong Shi (@ftshijt)
Model Selection
Please select a model shown in espnet_model_zoo.
The models are stored in Zenodo https://zenodo.org/communities/espnet or the Hugging Face Hub https://huggingface.co/espnet
In this demonstration, we will show English, Japanese, Spanish, Mandarin, and multilingual ASR models.
#@title Choose English ASR model { run: "auto" }
lang = 'en'
fs = 16000 #@param {type:"integer"}
tag = 'Shinji Watanabe/spgispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_unnorm_bpe5000_valid.acc.ave' #@param ["Shinji Watanabe/spgispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_unnorm_bpe5000_valid.acc.ave", "kamo-naoyuki/librispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_bpe5000_scheduler_confwarmup_steps40000_optim_conflr0.0025_sp_valid.acc.ave"] {type:"string"}
#@title Choose Japanese ASR model { run: "auto" }
lang = 'ja'
fs = 16000 #@param {type:"integer"}
tag = 'Shinji Watanabe/laborotv_asr_train_asr_conformer2_latest33_raw_char_sp_valid.acc.ave' #@param ["Shinji Watanabe/laborotv_asr_train_asr_conformer2_latest33_raw_char_sp_valid.acc.ave"] {type:"string"}
#@title Choose Spanish ASR model { run: "auto" }
lang = 'es'
fs = 16000 #@param {type:"integer"}
tag = 'ftshijt/mls_asr_transformer_valid.acc.best' #@param ["ftshijt/mls_asr_transformer_valid.acc.best"] {type:"string"}
#@title Choose Mandarin ASR model { run: "auto" }
lang = 'zh'
fs = 16000 #@param {type:"integer"}
tag = 'Emiru Tsunoo/aishell_asr_train_asr_streaming_transformer_raw_zh_char_sp_valid.acc.ave' #@param ["Emiru Tsunoo/aishell_asr_train_asr_streaming_transformer_raw_zh_char_sp_valid.acc.ave"] {type:"string"}
#@title Choose Multilingual ASR model { run: "auto" }
lang = 'multilingual'
fs = 16000 #@param {type:"integer"}
tag = 'ftshijt/open_li52_asr_train_asr_raw_bpe7000_valid.acc.ave_10best' #@param ["ftshijt/open_li52_asr_train_asr_raw_bpe7000_valid.acc.ave_10best"] {type:"string"}
Model Setup
import time
import torch
import string
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text
d = ModelDownloader()
# It may take a while to download and build the models
speech2text = Speech2Text(
    **d.download_and_unpack(tag),
    device="cuda",
    minlenratio=0.0,
    maxlenratio=0.0,
    ctc_weight=0.3,
    beam_size=10,
    batch_size=0,
    nbest=1
)
def text_normalizer(text):
    text = text.upper()
    return text.translate(str.maketrans('', '', string.punctuation))
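As a quick sanity check (this cell is not in the original notebook), the normalizer uppercases the text and strips punctuation:
# Illustrative usage of text_normalizer; assumes the cell above has been run
print(text_normalizer("Hello, world!"))  # -> HELLO WORLD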
Recognize our pre-recorded examples
!git clone https://github.com/ftshijt/ESPNet_asr_egs.git
import pandas as pd
import soundfile
import librosa.display
from IPython.display import display, Audio
import matplotlib.pyplot as plt
egs = pd.read_csv("ESPNet_asr_egs/egs.csv")
for index, row in egs.iterrows():
    if row["lang"] == lang or lang == "multilingual":
        speech, rate = soundfile.read("ESPNet_asr_egs/" + row["path"])
        assert fs == int(row["sr"])
        nbests = speech2text(speech)
        text, *_ = nbests[0]
        print(f"Input Speech: ESPNet_asr_egs/{row['path']}")
        # let us listen to samples
        display(Audio(speech, rate=rate))
        librosa.display.waveplot(speech, sr=rate)
        plt.show()
        print(f"Reference text: {text_normalizer(row['text'])}")
        print(f"ASR hypothesis: {text_normalizer(text)}")
        print("*" * 50)
Recognize your own live-recordings
- Record your own voice
- Recognize your voice with the ASR system
# from https://gist.github.com/korakot/c21c3476c024ad6d56d5f48b0bca92be
from IPython.display import Javascript
from google.colab import output
from base64 import b64decode
RECORD = """
const sleep = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
const reader = new FileReader()
reader.onloadend = e => resolve(e.srcElement.result)
reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
stream = await navigator.mediaDevices.getUserMedia({ audio: true })
recorder = new MediaRecorder(stream)
chunks = []
recorder.ondataavailable = e => chunks.push(e.data)
recorder.start()
await sleep(time)
recorder.onstop = async ()=>{
blob = new Blob(chunks)
text = await b2text(blob)
resolve(text)
}
recorder.stop()
})
"""
def record(sec, filename='audio.wav'):
    display(Javascript(RECORD))
    s = output.eval_js('record(%d)' % (sec * 1000))
    b = b64decode(s.split(',')[1])
    with open(filename, 'wb+') as f:
        f.write(b)
audio = 'audio.wav'
second = 5
print(f"Speak to your microphone {second} sec...")
record(second, audio)
print("Done!")
import librosa
import librosa.display
speech, rate = librosa.load(audio, sr=16000)
librosa.display.waveplot(speech, sr=rate)
import matplotlib.pyplot as plt
plt.show()
import pysndfile
pysndfile.sndio.write('audio_ds.wav', speech, rate=rate, format='wav', enc='pcm16')
from IPython.display import display, Audio
display(Audio(speech, rate=rate))
nbests = speech2text(speech)
text, *_ = nbests[0]
print(f"ASR hypothesis: {text_normalizer(text)}")
Speech synthesis demo
This notebook provides a demonstration of real-time E2E-TTS using the ESPnet2-TTS and ParallelWaveGAN repositories.
- ESPnet2-TTS: https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1
- ParallelWaveGAN: https://github.com/kan-bayashi/ParallelWaveGAN
Author: Tomoki Hayashi (@kan-bayashi)
Installation
# NOTE: pip shows incompatibility errors due to preinstalled libraries, but you can safely ignore them
# It takes 1 minute
!pip install -q pyopenjtalk==0.1.5 parallel_wavegan==0.5.3
Model Selection
Please select a model: English, Japanese, and Mandarin are supported.
You can try an end-to-end text2wav model or a combination of a text2mel model and a vocoder.
If you use a text2wav model, you do not need a vocoder (it is automatically disabled).
Text2wav models:
- VITS
Text2mel models:
- Tacotron2
- Transformer-TTS
- (Conformer) FastSpeech
- (Conformer) FastSpeech2
Vocoders:
- Parallel WaveGAN
- Multi-band MelGAN
- HiFiGAN
- Style MelGAN.
The terms of use follow those of each corpus. We use the following corpora:
- ljspeech_*: LJSpeech dataset https://keithito.com/LJ-Speech-Dataset/
- jsut_*: JSUT corpus https://sites.google.com/site/shinnosuketakamichi/publication/jsut
- jvs_*: JVS corpus https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus + JSUT corpus https://sites.google.com/site/shinnosuketakamichi/publication/jsut
- tsukuyomi_*: つくよみちゃんコーパス (Tsukuyomi-chan corpus) https://tyc.rei-yumesaki.net/material/corpus/ + JSUT corpus https://sites.google.com/site/shinnosuketakamichi/publication/jsut
- csmsc_*: Chinese Standard Mandarin Speech Corpus https://www.data-baker.com/open_source.html
#@title Choose English model { run: "auto" }
lang = 'English'
tag = 'kan-bayashi/ljspeech_vits' #@param ["kan-bayashi/ljspeech_tacotron2", "kan-bayashi/ljspeech_fastspeech", "kan-bayashi/ljspeech_fastspeech2", "kan-bayashi/ljspeech_conformer_fastspeech2", "kan-bayashi/ljspeech_vits"] {type:"string"}
vocoder_tag = "none" #@param ["none", "parallel_wavegan/ljspeech_parallel_wavegan.v1", "parallel_wavegan/ljspeech_full_band_melgan.v2", "parallel_wavegan/ljspeech_multi_band_melgan.v2", "parallel_wavegan/ljspeech_hifigan.v1", "parallel_wavegan/ljspeech_style_melgan.v1"] {type:"string"}
#@title Choose Japanese model { run: "auto" }
lang = 'Japanese'
tag = 'kan-bayashi/jsut_full_band_vits_prosody' #@param ["kan-bayashi/jsut_tacotron2", "kan-bayashi/jsut_transformer", "kan-bayashi/jsut_fastspeech", "kan-bayashi/jsut_fastspeech2", "kan-bayashi/jsut_conformer_fastspeech2", "kan-bayashi/jsut_conformer_fastspeech2_accent", "kan-bayashi/jsut_conformer_fastspeech2_accent_with_pause", "kan-bayashi/jsut_vits_accent_with_pause", "kan-bayashi/jsut_full_band_vits_accent_with_pause", "kan-bayashi/jsut_tacotron2_prosody", "kan-bayashi/jsut_transformer_prosody", "kan-bayashi/jsut_conformer_fastspeech2_tacotron2_prosody", "kan-bayashi/jsut_vits_prosody", "kan-bayashi/jsut_full_band_vits_prosody", "kan-bayashi/jvs_jvs010_vits_prosody", "kan-bayashi/tsukuyomi_full_band_vits_prosody"] {type:"string"}
vocoder_tag = 'none' #@param ["none", "parallel_wavegan/jsut_parallel_wavegan.v1", "parallel_wavegan/jsut_multi_band_melgan.v2", "parallel_wavegan/jsut_style_melgan.v1", "parallel_wavegan/jsut_hifigan.v1"] {type:"string"}
#@title Choose Mandarin model { run: "auto" }
lang = 'Mandarin'
tag = 'kan-bayashi/csmsc_full_band_vits' #@param ["kan-bayashi/csmsc_tacotron2", "kan-bayashi/csmsc_transformer", "kan-bayashi/csmsc_fastspeech", "kan-bayashi/csmsc_fastspeech2", "kan-bayashi/csmsc_conformer_fastspeech2", "kan-bayashi/csmsc_vits", "kan-bayashi/csmsc_full_band_vits"] {type: "string"}
vocoder_tag = "none" #@param ["none", "parallel_wavegan/csmsc_parallel_wavegan.v1", "parallel_wavegan/csmsc_multi_band_melgan.v2", "parallel_wavegan/csmsc_hifigan.v1", "parallel_wavegan/csmsc_style_melgan.v1"] {type:"string"}
Model Setup
from espnet2.bin.tts_inference import Text2Speech
from espnet2.utils.types import str_or_none
text2speech = Text2Speech.from_pretrained(
    model_tag=str_or_none(tag),
    vocoder_tag=str_or_none(vocoder_tag),
    device="cuda",
    # Only for Tacotron 2 & Transformer
    threshold=0.5,
    # Only for Tacotron 2
    minlenratio=0.0,
    maxlenratio=10.0,
    use_att_constraint=False,
    backward_window=1,
    forward_window=3,
    # Only for FastSpeech & FastSpeech2 & VITS
    speed_control_alpha=1.0,
    # Only for VITS
    noise_scale=0.667,
    noise_scale_dur=0.8,
)
Synthesis
import time
import torch
# decide the input sentence by yourself
print(f"Input your favorite sentence in {lang}.")
x = input()
# synthesis
with torch.no_grad():
    start = time.time()
    wav = text2speech(x)["wav"]
rtf = (time.time() - start) / (len(wav) / text2speech.fs)
print(f"RTF = {rtf:5f}")
# let us listen to generated samples
from IPython.display import display, Audio
display(Audio(wav.view(-1).cpu().numpy(), rate=text2speech.fs))
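If you want to keep the synthesized audio, a minimal sketch (not part of the original demo) is to write the waveform to a WAV file with soundfile, reusing wav and text2speech.fs from the cell above; the file name tts_output.wav is arbitrary.
# Save the generated waveform to a file (sketch; the file name is arbitrary)
import soundfile
soundfile.write("tts_output.wav", wav.view(-1).cpu().numpy(), text2speech.fs)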
Speech enhancement demo
- ESPnet2-SE: https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/enh1
Author: Chenda Li (@LiChenda), Wangyou Zhang (@Emrys365)
Single-Channel Enhancement, the CHiME example
# Download one utterance from real noisy speech of CHiME4
!gdown --id 1SmrN5NFSg6JuQSs2sfy3ehD8OIcqK6wS -O /content/M05_440C0213_PED_REAL.wav
import os
import soundfile
from IPython.display import display, Audio
mixwav_mc, sr = soundfile.read("/content/M05_440C0213_PED_REAL.wav")
# mixwav.shape: num_samples, num_channels
mixwav_sc = mixwav_mc[:,4]
display(Audio(mixwav_mc.T, rate=sr))
Download and load the pretrained Conv-Tasnet
!gdown --id 17DMWdw84wF3fz3t7ia1zssdzhkpVQGZm -O /content/chime_tasnet_singlechannel.zip
!unzip /content/chime_tasnet_singlechannel.zip -d /content/enh_model_sc
# Load the model
# If you encounter error "No module named 'espnet2'", please re-run the 1st Cell. This might be a colab bug.
import sys
import soundfile
from espnet2.bin.enh_inference import SeparateSpeech
separate_speech = {}
# For models downloaded from GoogleDrive, you can use the following script:
enh_model_sc = SeparateSpeech(
    train_config="/content/enh_model_sc/exp/enh_train_enh_conv_tasnet_raw/config.yaml",
    model_file="/content/enh_model_sc/exp/enh_train_enh_conv_tasnet_raw/5epoch.pth",
    # for segment-wise process on long speech
    normalize_segment_scale=False,
    show_progressbar=True,
    ref_channel=4,
    normalize_output_wav=True,
    device="cuda:0",
)
Enhance the single-channel real noisy speech in CHiME4
# play the enhanced single-channel speech
wave = enh_model_sc(mixwav_sc[None, ...], sr)
print("Input real noisy speech", flush=True)
display(Audio(mixwav_sc, rate=sr))
print("Enhanced speech", flush=True)
display(Audio(wave[0].squeeze(), rate=sr))
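To visually compare the noisy input with the enhanced output, you can reuse the waveform plotting from the ASR demo above. This is an optional sketch, not part of the original notebook.
# Optional: plot the noisy input and the enhanced output waveforms
import librosa.display
import matplotlib.pyplot as plt
librosa.display.waveplot(mixwav_sc, sr=sr)
plt.title("Noisy input")
plt.show()
librosa.display.waveplot(wave[0].squeeze(), sr=sr)
plt.title("Enhanced output")
plt.show()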
Speech Separation
Model Selection
Please select a model shown in espnet_model_zoo.
In this demonstration, we will show different speech separation models on wsj0_2mix.
#@title Choose Speech Separation model { run: "auto" }
fs = 8000 #@param {type:"integer"}
tag = "Chenda Li/wsj0_2mix_enh_train_enh_conv_tasnet_raw_valid.si_snr.ave" #@param ["Chenda Li/wsj0_2mix_enh_train_enh_conv_tasnet_raw_valid.si_snr.ave", "Chenda Li/wsj0_2mix_enh_train_enh_rnn_tf_raw_valid.si_snr.ave", "https://zenodo.org/record/4688000/files/enh_train_enh_dprnn_tasnet_raw_valid.si_snr.ave.zip"]
# For models uploaded to Zenodo, you can use the following python script instead:
import sys
import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.enh_inference import SeparateSpeech
d = ModelDownloader()
cfg = d.download_and_unpack(tag)
separate_speech = SeparateSpeech(
    train_config=cfg["train_config"],
    model_file=cfg["model_file"],
    # for segment-wise process on long speech
    segment_size=2.4,
    hop_size=0.8,
    normalize_segment_scale=False,
    show_progressbar=True,
    ref_channel=None,
    normalize_output_wav=True,
    device="cuda:0",
)
Separate the example in wsj0_2mix testing set
!gdown --id 1ZCUkd_Lb7pO2rpPr4FqYdtJBZ7JMiInx -O /content/447c020t_1.2106_422a0112_-1.2106.wav
import os
import soundfile
from IPython.display import display, Audio
mixwav, sr = soundfile.read("/content/447c020t_1.2106_422a0112_-1.2106.wav")
waves_wsj = separate_speech(mixwav[None, ...], fs=sr)
print("Input mixture", flush=True)
display(Audio(mixwav, rate=sr))
print(f"========= Separated speech with model {tag} =========", flush=True)
print("Separated spk1", flush=True)
display(Audio(waves_wsj[0].squeeze(), rate=sr))
print("Separated spk2", flush=True)
display(Audio(waves_wsj[1].squeeze(), rate=sr))
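If you want to save the separated sources for later use, a minimal sketch (not part of the original demo; the file names are arbitrary) is:
# Save each separated source to a WAV file (sketch)
import soundfile
for i, w in enumerate(waves_wsj):
    soundfile.write(f"separated_spk{i + 1}.wav", w.squeeze(), sr)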
Full installation
This is the full installation, which is required to perform data preprocessing, training, inference, scoring, and so on for various experiments.
We provide several installation methods, and we also provide a Docker image.
See https://espnet.github.io/espnet/installation.html#step-2-installation-espnet for more details.
Installation of required tools
See https://espnet.github.io/espnet/installation.html#requirements for more details.
# It takes ~10 seconds
!sudo apt-get install cmake sox libsndfile1-dev
Download espnet
# It takes a few seconds
!git clone --depth 5 https://github.com/espnet/espnet
Setup Python environment based on anaconda
There are several other installation methods, but we highly recommend the anaconda-based one.
# It takes 30 seconds
%cd /content/espnet/tools
!./setup_anaconda.sh anaconda espnet 3.8
Install espnet
This includes the installation of PyTorch and other tools.
We just specify CUDA_VERSION=10.2 for the latest PyTorch (1.9.0)
# It may take ~8 minutes
%cd /content/espnet/tools
!make CUDA_VERSION=10.2
Install other speech processing tools
We install the NIST SCTK toolkit for scoring.
Please manually install other tools if needed.
%cd /content/espnet/tools
!./installers/install_sctk.sh
Check installation
Please check whether torch, torch cuda, and espnet are correctly installed.
If they are successfully installed, the output will include lines like the following:
[x] torch=1.9.0
[x] torch cuda=10.2
:
[x] espnet=0.10.3a3
%cd /content/espnet/tools
!. ./activate_python.sh; python3 check_install.py
Run a recipe example
ESPnet has a number of recipes (73 recipes as of Sep. 16, 2021). Let's first check https://github.com/espnet/espnet/blob/master/egs2/README.md
Please also check the general usage of recipes at https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
CMU AN4 recipe
In this tutorial, we use the CMU an4 recipe. This is a small-scale speech recognition task mainly used for testing.
First, move to the recipe directory
%cd /content/espnet/egs2/an4/asr1
!ls
egs2/an4/asr1/
- conf/ # Configuration files for training, inference, etc.
- scripts/ # Bash utilities of espnet2
- pyscripts/ # Python utilities of espnet2
- steps/ # From Kaldi utilities
- utils/ # From Kaldi utilities
- db.sh # The directory path of each corpus
- path.sh # Setup script for environment variables
- cmd.sh # Configuration for your backend of job scheduler
- run.sh # Entry point
- asr.sh # Invoked by run.sh
ESPnet is designed for various use cases (local machines or cluster machines) based on Kaldi tools. If you use it on cluster machines, please also check https://kaldi-asr.org/doc/queue.html
The main stages can be parallelized across multiple jobs.
!cat run.sh
run.sh calls asr.sh, which completes the entire speech recognition experiment, including data preparation, training, inference, and scoring. It is organized into separate stages (15 stages in total).
Instead of executing the entire experiment with run.sh, the following examples execute the experiment stage by stage to help understand what happens in each stage.
data preparation
Stage 1: Data preparation for training, validation, and evaluation data
Note that --stage <N> specifies the stage to start from and --stop_stage <N> specifies the stage to stop at. We also need to specify the training, validation, and test data.
# 30 seconds
!./asr.sh --stage 1 --stop_stage 1 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test"
After this stage is finished, please check the data directory.
!ls data
In this recipe, we use train_nodev as the training set and train_dev as the validation set (to monitor the training progress by checking the validation score). We also (re)use the test and train_dev sets for the final speech recognition evaluation.
Let's check one of the training data directories:
!ls -1 data/train_nodev/
These are the speech and corresponding text and speaker information based on the Kaldi format. Please also check https://kaldi-asr.org/doc/data_prep.html
spk2utt # Speaker-to-utterance mapping
text # Transcription file
utt2spk # Utterance-to-speaker mapping
wav.scp # Audio file list
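As a rough illustration of the format (the utterance and speaker IDs below are made up, not taken from an4), each line maps an ID to the corresponding information:
wav.scp: utt_id_001 /path/to/utt_id_001.wav
text: utt_id_001 YES
utt2spk: utt_id_001 spk_01
spk2utt: spk_01 utt_id_001 utt_id_002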
Stage 2: Speed perturbation (one of the data augmentation methods)
We do not use speed perturbation in this demo, but you can turn it on by adding the argument --speed_perturb_factors "0.9 1.0 1.1" to the shell script.
!./asr.sh --stage 2 --stop_stage 2 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test"
Stage 3: Format wav.scp: data/ -> dump/raw
We dump the data in a specified format (flac in this case) for efficient use of the data.
Note that --nj <N> specifies the number of CPU jobs. Please set it appropriately considering your CPU resources and disk access.
# 30 seconds
!./asr.sh --stage 3 --stop_stage 3 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test" --nj 4
Stage 4: Remove long/short data: dump/raw/org -> dump/raw
Audio data that are too long or too short are harmful for efficient training, so they are removed from the list.
!./asr.sh --stage 4 --stop_stage 4 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test"
Stage 5: Generate token_list from dump/raw/train_nodev/text using BPE.
This is important for text processing. In this example, we make a dictionary based on English characters, using the sentencepiece toolkit developed by Google.
!./asr.sh --stage 5 --stop_stage 5 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test"
Let's check the content of the dictionary. There are several special symbols, e.g.,
<blank> used for CTC
<unk> unknown symbols that do not appear in the training data
<sos/eos> start- and end-of-sentence symbols
!cat data/token_list/bpe_unigram30/tokens.txt
language modeling (skip in this tutorial)
Stages 6--9: Stages related to language modeling.
We skip the language modeling part in the recipe (stages 6 -- 9) in this tutorial.
End-to-end ASR
Stage 10: ASR collect stats: train_set=dump/raw/train_nodev, valid_set=dump/raw/train_dev
We estimate the mean and variance of the data for normalization. We also collect the input and output lengths for efficient mini-batch creation.
# 18 seconds
!./asr.sh --stage 10 --stop_stage 10 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test" --nj 4
Stage 11: ASR Training: train_set=dump/raw/train_nodev, valid_set=dump/raw/train_dev
Main training loop.
Please also monitor the following files
- log file /content/espnet/egs2/an4/asr1/exp/asr_train_raw_bpe30/train.log
- loss /content/espnet/egs2/an4/asr1/exp/asr_train_raw_bpe30/images/loss.png
- accuracy /content/espnet/egs2/an4/asr1/exp/asr_train_raw_bpe30/images/acc.png
# It would take 20-30 min.
!./asr.sh --stage 11 --stop_stage 11 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test" --ngpu 1
Stage 12: Decoding: training_dir=exp/asr_train_raw_bpe30
Note that we need to set --use_lm false since we skip the language model.
--inference_nj <N> specifies the number of inference jobs.
Let's monitor the log /content/espnet/egs2/an4/asr1/exp/asr_train_raw_bpe30/inference_asr_model_valid.acc.ave/train_dev/logdir/asr_inference.1.log
# It would take ~10 minutes
!./asr.sh --inference_nj 4 --stage 12 --stop_stage 12 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test" --use_lm false
Stage 13: Scoring
You can find word error rate (WER), character error rate (CER), etc. for each test set.
!./asr.sh --stage 13 --stop_stage 13 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test" --use_lm false
You can also check the breakdown of the word error rate in /content/espnet/egs2/an4/asr1/exp/asr_train_raw_bpe30/inference_asr_model_valid.acc.ave/train_dev/score_wer/result.txt
How to change the training configs?
config file based
All training options can be changed by using a config file.
Please check https://espnet.github.io/espnet/espnet2_training_option.html
Let's first check the config files prepared in the an4 recipe:
- LSTM-based E2E ASR /content/espnet/egs2/an4/asr1/conf/train_asr_rnn.yaml
- Transformer based E2E ASR /content/espnet/egs2/an4/asr1/conf/train_asr_transformer.yaml
You can run
RNN
./asr.sh --stage 10 \
--train_set train_nodev \
--valid_set train_dev \
--test_sets "train_dev test" \
--nj 4 \
--inference_nj 4 \
--use_lm false \
--asr_config conf/train_asr_rnn.yaml
Transformer
./asr.sh --stage 10 \
--train_set train_nodev \
--valid_set train_dev \
--test_sets "train_dev test" \
--nj 4 \
--inference_nj 4 \
--use_lm false \
--asr_config conf/train_asr_transformer.yaml
You can also find various configs in espnet/egs2/*/asr1/conf/, including:
- Conformer
espnet/egs2/librispeech/asr1/conf/train_asr_conformer.yaml
- Wav2vec2.0 pre-trained model and fine-tuning
https://github.com/espnet/espnet/blob/master/egs2/librispeech/asr1/conf/tuning/train_asr_conformer7_wav2vec2_960hr_large.yaml
- HuBERT pre-trained model and fine-tuning
https://github.com/espnet/espnet/blob/master/egs2/librispeech/asr1/conf/tuning/train_asr_conformer7_hubert_960hr_large.yaml
command line argument based
You can also customize the training by editing the config file or by passing command line arguments, e.g.,
./run.sh --stage 10 --asr_args "--model_conf ctc_weight=0.3"
./run.sh --stage 10 --asr_args "--optim_conf lr=0.1"
See https://espnet.github.io/espnet/espnet2_tutorial.html#change-the-configuration-for-training
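For reference, the --asr_args overrides above correspond to entries in the training config YAML. A rough, hypothetical sketch of the relevant part (see the actual config files linked above for the real contents):
# Hypothetical YAML snippet; the keys mirror the --asr_args examples above
optim_conf:
    lr: 0.1
model_conf:
    ctc_weight: 0.3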
How to make a new recipe?
- Check https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE