CMU 11751/18781 2021: ESPnet Tutorial

ESPnet is an end-to-end speech processing toolkit. It initially focused on end-to-end speech recognition and end-to-end text-to-speech, and has since been extended to various other speech processing tasks. ESPnet uses PyTorch as its main deep learning engine and follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments.

This tutorial is based on the collection of ESPnet notebook demos and the ESPnet documentation.

Author: Shinji Watanabe (@sw005320)

Run an inference example

  • ESPnet covers various speech applications and their pre-trained models.

  • Please check the models listed in espnet_model_zoo.

  • We can play with a demo based on these pre-trained models.

  • All we need to do is install espnet_model_zoo.

  • Note that this pip-based installation covers inference only; it does not include training and related functionality. The full installation is explained later.

  • You can also find similar demos in HuggingFace Hub

[ ]:
# It takes 1 minute.
!pip install -q espnet_model_zoo

Speech recognition demo

Author: Jiatong Shi (@ftshijt)

Model Selection

[ ]:
#@title Choose English ASR model { run: "auto" }

lang = 'en'
fs = 16000 #@param {type:"integer"}
tag = 'Shinji Watanabe/spgispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_unnorm_bpe5000_valid.acc.ave' #@param ["Shinji Watanabe/spgispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_unnorm_bpe5000_valid.acc.ave", "kamo-naoyuki/librispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_bpe5000_scheduler_confwarmup_steps40000_optim_conflr0.0025_sp_valid.acc.ave"] {type:"string"}
[ ]:
#@title Choose Japanese ASR model { run: "auto" }

lang = 'ja'
fs = 16000 #@param {type:"integer"}
tag = 'Shinji Watanabe/laborotv_asr_train_asr_conformer2_latest33_raw_char_sp_valid.acc.ave' #@param ["Shinji Watanabe/laborotv_asr_train_asr_conformer2_latest33_raw_char_sp_valid.acc.ave"] {type:"string"}
[ ]:
#@title Choose Spanish ASR model { run: "auto" }

lang = 'es'
fs = 16000 #@param {type:"integer"}
tag = 'ftshijt/' #@param ["ftshijt/"] {type:"string"}
[ ]:
#@title Choose Mandarin ASR model { run: "auto" }

lang = 'zh'
fs = 16000 #@param {type:"integer"}
tag = 'Emiru Tsunoo/aishell_asr_train_asr_streaming_transformer_raw_zh_char_sp_valid.acc.ave' #@param ["Emiru Tsunoo/aishell_asr_train_asr_streaming_transformer_raw_zh_char_sp_valid.acc.ave"] {type:"string"}
[ ]:
#@title Choose Multilingual ASR model { run: "auto" }

lang = 'multilingual'
fs = 16000 #@param {type:"integer"}
tag = 'ftshijt/open_li52_asr_train_asr_raw_bpe7000_valid.acc.ave_10best' #@param ["ftshijt/open_li52_asr_train_asr_raw_bpe7000_valid.acc.ave_10best"] {type:"string"}

Model Setup

[ ]:
import time
import torch
import string
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text

d = ModelDownloader()
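# It may take a while to download and build the model
speech2text = Speech2Text(
    # Pass the downloaded config and model files for the selected tag;
    # decoding options (device, beam size, etc.) keep their defaults here
    **d.download_and_unpack(tag),
)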

def text_normalizer(text):
    text = text.upper()
    return text.translate(str.maketrans('', '', string.punctuation))

Recognize our examples of pre-recorded samples

[ ]:
!git clone

import pandas as pd
import soundfile
import librosa.display
from IPython.display import display, Audio
import matplotlib.pyplot as plt

egs = pd.read_csv("ESPNet_asr_egs/egs.csv")
for index, row in egs.iterrows():
  if row["lang"] == lang or lang == "multilingual":
    speech, rate = soundfile.read("ESPNet_asr_egs/" + row["path"])
    assert fs == int(row["sr"])
    nbests = speech2text(speech)

    text, *_ = nbests[0]
    print(f"Input Speech: ESPNet_asr_egs/{row['path']}")
    # let us listen to samples
    display(Audio(speech, rate=rate))
    librosa.display.waveplot(speech, sr=rate)
    print(f"Reference text: {text_normalizer(row['text'])}")
    print(f"ASR hypothesis: {text_normalizer(text)}")
    print("*" * 50)

Recognize your own live-recordings

  1. Record your own voice

  2. Recognize your voice with the ASR system

[ ]:
# from

from IPython.display import Javascript
from google.colab import output
from base64 import b64decode

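RECORD = """
const sleep = time => new Promise(resolve => setTimeout(resolve, time))
// Convert the recorded blob into a base64 data URL
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = e => resolve(e.srcElement.result)
  reader.readAsDataURL(blob)
})
// Record from the microphone for `time` milliseconds
var record = time => new Promise(async resolve => {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  recorder = new MediaRecorder(stream)
  chunks = []
  recorder.ondataavailable = e => chunks.push(e.data)
  recorder.start()
  await sleep(time)
  recorder.onstop = async ()=>{
    blob = new Blob(chunks)
    text = await b2text(blob)
    resolve(text)
  }
  recorder.stop()
})
"""

def record(sec, filename='audio.wav'):
  # Register the JavaScript recorder in the browser, then call it
  display(Javascript(RECORD))
  s = output.eval_js('record(%d)' % (sec * 1000))
  # The result is a data URL; the payload after the comma is base64 audio
  b = b64decode(s.split(',')[1])
  with open(filename, 'wb+') as f:
    f.write(b)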

audio = 'audio.wav'
second = 5
print(f"Speak to your microphone {second} sec...")
record(second, audio)

import librosa
import librosa.display
speech, rate = librosa.load(audio, sr=16000)
librosa.display.waveplot(speech, sr=rate)

import matplotlib.pyplot as plt

import pysndfile
pysndfile.sndio.write('audio_ds.wav', speech, rate=rate, format='wav', enc='pcm16')

from IPython.display import display, Audio
display(Audio(speech, rate=rate))
[ ]:
nbests = speech2text(speech)
text, *_ = nbests[0]

print(f"ASR hypothesis: {text_normalizer(text)}")

Speech synthesis demo

This notebook provides a demonstration of the realtime E2E-TTS using ESPnet2-TTS and ParallelWaveGAN repo.

Author: Tomoki Hayashi (@kan-bayashi)


[ ]:
# NOTE: pip shows incompatibility errors due to preinstalled libraries, but you can ignore them
# It takes 1 minute
!pip install -q pyopenjtalk==0.1.5 parallel_wavegan==0.5.3

Model Selection

Please select a model: English, Japanese, and Mandarin are supported.

You can try either an end-to-end text2wav model or a combination of a text2mel model and a vocoder.
If you use a text2wav model, you do not need a vocoder (it is automatically disabled).

Text2wav models:
- VITS

Text2mel models:
- Tacotron2
- Transformer-TTS
- (Conformer) FastSpeech
- (Conformer) FastSpeech2

Vocoders:
- Parallel WaveGAN
- Multi-band MelGAN
- HiFiGAN
- Style MelGAN

The terms of use follow those of each corpus. We use the following corpora:
- ljspeech_*: LJSpeech dataset
- jsut_*: JSUT corpus
- jvs_*: JVS corpus + JSUT corpus
- tsukuyomi_*: つくよみちゃんコーパス + JSUT corpus
- csmsc_*: Chinese Standard Mandarin Speech Corpus

[ ]:
#@title Choose English model { run: "auto" }
lang = 'English'
tag = 'kan-bayashi/ljspeech_vits' #@param ["kan-bayashi/ljspeech_tacotron2", "kan-bayashi/ljspeech_fastspeech", "kan-bayashi/ljspeech_fastspeech2", "kan-bayashi/ljspeech_conformer_fastspeech2", "kan-bayashi/ljspeech_vits"] {type:"string"}
vocoder_tag = "none" #@param ["none", "parallel_wavegan/ljspeech_parallel_wavegan.v1", "parallel_wavegan/ljspeech_full_band_melgan.v2", "parallel_wavegan/ljspeech_multi_band_melgan.v2", "parallel_wavegan/ljspeech_hifigan.v1", "parallel_wavegan/ljspeech_style_melgan.v1"] {type:"string"}
[ ]:
#@title Choose Japanese model { run: "auto" }
lang = 'Japanese'
tag = 'kan-bayashi/jsut_full_band_vits_prosody' #@param ["kan-bayashi/jsut_tacotron2", "kan-bayashi/jsut_transformer", "kan-bayashi/jsut_fastspeech", "kan-bayashi/jsut_fastspeech2", "kan-bayashi/jsut_conformer_fastspeech2", "kan-bayashi/jsut_conformer_fastspeech2_accent", "kan-bayashi/jsut_conformer_fastspeech2_accent_with_pause", "kan-bayashi/jsut_vits_accent_with_pause", "kan-bayashi/jsut_full_band_vits_accent_with_pause", "kan-bayashi/jsut_tacotron2_prosody", "kan-bayashi/jsut_transformer_prosody", "kan-bayashi/jsut_conformer_fastspeech2_tacotron2_prosody", "kan-bayashi/jsut_vits_prosody", "kan-bayashi/jsut_full_band_vits_prosody", "kan-bayashi/jvs_jvs010_vits_prosody", "kan-bayashi/tsukuyomi_full_band_vits_prosody"] {type:"string"}
vocoder_tag = 'none' #@param ["none", "parallel_wavegan/jsut_parallel_wavegan.v1", "parallel_wavegan/jsut_multi_band_melgan.v2", "parallel_wavegan/jsut_style_melgan.v1", "parallel_wavegan/jsut_hifigan.v1"] {type:"string"}
[ ]:
#@title Choose Mandarin model { run: "auto" }
lang = 'Mandarin'
tag = 'kan-bayashi/csmsc_full_band_vits' #@param ["kan-bayashi/csmsc_tacotron2", "kan-bayashi/csmsc_transformer", "kan-bayashi/csmsc_fastspeech", "kan-bayashi/csmsc_fastspeech2", "kan-bayashi/csmsc_conformer_fastspeech2", "kan-bayashi/csmsc_vits", "kan-bayashi/csmsc_full_band_vits"] {type: "string"}
vocoder_tag = "none" #@param ["none", "parallel_wavegan/csmsc_parallel_wavegan.v1", "parallel_wavegan/csmsc_multi_band_melgan.v2", "parallel_wavegan/csmsc_hifigan.v1", "parallel_wavegan/csmsc_style_melgan.v1"] {type:"string"}

Model Setup

[ ]:
from espnet2.bin.tts_inference import Text2Speech
from espnet2.utils.types import str_or_none

text2speech = Text2Speech.from_pretrained(
    model_tag=str_or_none(tag),
    vocoder_tag=str_or_none(vocoder_tag),
    # Model-specific decoding options (for Tacotron 2, Transformer,
    # FastSpeech, FastSpeech2, and VITS) are left at their defaults here
)


[ ]:
import time
import torch

# decide the input sentence by yourself
print(f"Input your favorite sentence in {lang}.")
x = input()

# synthesis
with torch.no_grad():
    start = time.time()
    wav = text2speech(x)["wav"]
rtf = (time.time() - start) / (len(wav) / text2speech.fs)
print(f"RTF = {rtf:5f}")

# let us listen to generated samples
from IPython.display import display, Audio
display(Audio(wav.view(-1).cpu().numpy(), rate=text2speech.fs))
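
The RTF (real-time factor) printed in the cell above is the processing time divided by the duration of the generated audio, so values below 1 mean synthesis runs faster than playback. A minimal stand-alone sketch of the same arithmetic follows; the function name `real_time_factor` and the example numbers are illustrative assumptions, not part of ESPnet.

```python
def real_time_factor(elapsed_sec: float, num_samples: int, fs: int) -> float:
    """Return processing time divided by the duration of the audio.

    elapsed_sec: wall-clock time spent on synthesis
    num_samples: number of samples in the generated waveform
    fs:          sampling rate of the waveform in Hz
    """
    if num_samples <= 0 or fs <= 0:
        raise ValueError("num_samples and fs must be positive")
    audio_sec = num_samples / fs  # duration of the generated audio
    return elapsed_sec / audio_sec


# Hypothetical numbers: 0.5 s of compute for 44100 samples at 22050 Hz (2 s of audio)
rtf = real_time_factor(0.5, 44100, 22050)
print(f"RTF = {rtf:.3f}")
```

In the notebook cell, `elapsed_sec` corresponds to `time.time() - start`, `num_samples` to `len(wav)`, and `fs` to `text2speech.fs`.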

Speech enhancement demo

Author: Chenda Li (@LiChenda), Wangyou Zhang (@Emrys365)

Single-Channel Enhancement, the CHiME example

[ ]:
# Download one utterance from real noisy speech of CHiME4
!gdown --id 1SmrN5NFSg6JuQSs2sfy3ehD8OIcqK6wS -O /content/M05_440C0213_PED_REAL.wav
import os

import soundfile
from IPython.display import display, Audio
mixwav_mc, sr = soundfile.read("/content/M05_440C0213_PED_REAL.wav")
# mixwav_mc.shape: (num_samples, num_channels)
mixwav_sc = mixwav_mc[:,4]
display(Audio(mixwav_mc.T, rate=sr))

Download and load the pretrained Conv-Tasnet

[ ]:
!gdown --id 17DMWdw84wF3fz3t7ia1zssdzhkpVQGZm -O /content/
!unzip /content/ -d /content/enh_model_sc
[ ]:
# Load the model
# If you encounter error "No module named 'espnet2'", please re-run the 1st Cell. This might be a colab bug.
import sys
import soundfile
from espnet2.bin.enh_inference import SeparateSpeech

separate_speech = {}
# For models downloaded from GoogleDrive, you can use the following script:
enh_model_sc = SeparateSpeech(
  # for segment-wise process on long speech

Enhance the single-channel real noisy speech in CHiME4

[ ]:
# play the enhanced single-channel speech
wave = enh_model_sc(mixwav_sc[None, ...], sr)
print("Input real noisy speech", flush=True)
display(Audio(mixwav_sc, rate=sr))
print("Enhanced speech", flush=True)
display(Audio(wave[0].squeeze(), rate=sr))

Speech Separation

Model Selection

Please select a model listed in espnet_model_zoo.

In this demonstration, we will show different speech separation models on wsj0_2mix.

[ ]:
#@title Choose Speech Separation model { run: "auto" }

fs = 8000 #@param {type:"integer"}
tag = "Chenda Li/wsj0_2mix_enh_train_enh_conv_tasnet_raw_valid.si_snr.ave" #@param ["Chenda Li/wsj0_2mix_enh_train_enh_conv_tasnet_raw_valid.si_snr.ave", "Chenda Li/wsj0_2mix_enh_train_enh_rnn_tf_raw_valid.si_snr.ave", ""]
[ ]:
# For models uploaded to Zenodo, you can use the following python script instead:
import sys
import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.enh_inference import SeparateSpeech

d = ModelDownloader()

cfg = d.download_and_unpack(tag)
separate_speech = SeparateSpeech(
  # for segment-wise process on long speech

Separate the example in wsj0_2mix testing set

[ ]:
!gdown --id 1ZCUkd_Lb7pO2rpPr4FqYdtJBZ7JMiInx -O /content/447c020t_1.2106_422a0112_-1.2106.wav

import os
import soundfile
from IPython.display import display, Audio

mixwav, sr = soundfile.read("/content/447c020t_1.2106_422a0112_-1.2106.wav")
waves_wsj = separate_speech(mixwav[None, ...], fs=sr)

print("Input mixture", flush=True)
display(Audio(mixwav, rate=sr))
print(f"========= Separated speech with model {tag} =========", flush=True)
print("Separated spk1", flush=True)
display(Audio(waves_wsj[0].squeeze(), rate=sr))
print("Separated spk2", flush=True)
display(Audio(waves_wsj[1].squeeze(), rate=sr))

Full installation

Installation of required tools

See the ESPnet installation documentation for more details.

[ ]:
# It takes ~10 seconds
!sudo apt-get install cmake sox libsndfile1-dev

Download espnet

[ ]:
# It takes a few seconds
!git clone --depth 5

Setup Python environment based on anaconda

There are several other installation methods, but we highly recommend the anaconda-based one.

[ ]:
# It takes 30 seconds
%cd /content/espnet/tools
!./ anaconda espnet 3.8

Install espnet

This includes the installation of PyTorch and other tools.

We just specify CUDA_VERSION=10.2 for the latest PyTorch (1.9.0)

[ ]:
# It may take ~8 minutes
%cd /content/espnet/tools
!make CUDA_VERSION=10.2

Install other speech processing tools

We install the NIST SCTK toolkit for scoring.

Please manually install other tools if needed.

[ ]:
%cd /content/espnet/tools

Check installation

Please check whether torch, torch cuda, and espnet are correctly installed. If all three are reported as installed, as in the example below, the setup is fine.

[x] torch=1.9.0
[x] torch cuda=10.2
[x] espnet=0.10.3a3
[ ]:
%cd /content/espnet/tools
!. ./; python3

Run a recipe example

ESPnet has a number of recipes (73 recipes on Sep. 16, 2021). Let’s first check

Please also check the general usage of the recipe in

CMU AN4 recipe

In this tutorial, we use the CMU an4 recipe. This is a small-scale speech recognition task mainly used for testing.

First, move to the recipe directory

[ ]:
%cd /content/espnet/egs2/an4/asr1

egs2/an4/asr1/
  - conf/      # Configuration files for training, inference, etc.
  - scripts/   # Bash utilities of espnet2
  - pyscripts/ # Python utilities of espnet2
  - steps/     # From Kaldi utilities
  - utils/     # From Kaldi utilities
  -            # The directory path of each corpora
  -            # Setup script for environment variables
  -            # Configuration for your backend of job scheduler
  -            # Entry point
  -            # Invoked by

ESPnet is designed for various use cases (local machines or cluster machines) based on Kaldi tools. If you use it in the cluster machines, please also check

The main stages can be parallelized by various jobs.

[ ]:
!cat

The entry-point script calls the main ASR script, which runs the entire speech recognition experiment, including data preparation, training, inference, and scoring. The experiment is organized into separate stages (15 stages in total).

Instead of running the entire experiment at once, the following examples run one stage at a time so that we can see what each stage does.

data preparation

Stage 1: Data preparation for training, validation, and evaluation data

Note that --stage <N> is to start the stage and --stop_stage <N> is to stop the stage. We also need to specify training, validation, and test data.

[ ]:
# 30 seconds
!./ --stage 1 --stop_stage 1 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test"
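
As a rough illustration of how `--stage` and `--stop_stage` bound the range of executed stages, here is a minimal Python sketch. The function name `stages_to_run` is hypothetical, and the default of 15 stages is taken from the stage count mentioned in this tutorial; the actual shell script may gate its stages differently.

```python
from typing import List


def stages_to_run(stage: int, stop_stage: int, total_stages: int = 15) -> List[int]:
    """Return the stage numbers selected by a (stage, stop_stage) pair.

    Both bounds are inclusive, and the range is clamped to 1..total_stages.
    """
    first = max(stage, 1)
    last = min(stop_stage, total_stages)
    return list(range(first, last + 1))


# --stage 1 --stop_stage 1 selects only the data preparation stage
print(stages_to_run(1, 1))
# --stage 10 --stop_stage 13 selects stats collection through scoring
print(stages_to_run(10, 13))
```

Running one stage at a time, as this tutorial does, corresponds to calling the sketch with `stage == stop_stage`.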

After this stage is finished, please check the data directory

[ ]:
!ls data

In this recipe, we use train_nodev as the training set and train_dev as the validation set (used to monitor training progress through the validation score). We also reuse the test and train_dev sets for the final speech recognition evaluation.

Let’s check the training data directory:

[ ]:
!ls -1 data/train_nodev/

These are the speech and corresponding text and speaker information based on the Kaldi format. Please also check

spk2utt # Speaker information
text    # Transcription file
utt2spk # Speaker information
wav.scp # Audio file
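
To make the relationship between these files concrete, the sketch below derives a spk2utt mapping from utt2spk lines. It assumes the usual Kaldi conventions (each utt2spk line is `<utterance-id> <speaker-id>`, and each spk2utt line is `<speaker-id>` followed by that speaker's utterance IDs); the IDs in the example are made up.

```python
from collections import defaultdict


def utt2spk_to_spk2utt(lines):
    """Group utterance IDs by speaker from utt2spk-style lines."""
    spk2utt = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        utt_id, spk_id = line.split(maxsplit=1)
        spk2utt[spk_id].append(utt_id)
    # Sort speakers and utterances so the output is deterministic
    return {spk: sorted(utts) for spk, utts in sorted(spk2utt.items())}


# Hypothetical utt2spk content
example = ["spkA-utt1 spkA", "spkA-utt2 spkA", "spkB-utt1 spkB"]
for spk, utts in utt2spk_to_spk2utt(example).items():
    print(spk, " ".join(utts))
```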

Stage 2: Speed perturbation (one of the data augmentation methods)

We do not use speed perturbation for this demo. But you can turn it on by adding an argument --speed_perturb_factors "0.9 1.0 1.1" to the shell script

[ ]:
!./ --stage 2 --stop_stage 2 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test"
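
For intuition about what a speed perturbation factor does, the sketch below resamples a waveform with linear interpolation so that a factor of 0.9 makes it longer (slower) and 1.1 makes it shorter (faster). This is only an illustration in pure Python; the function name `speed_perturb` is hypothetical, and the recipe's own implementation is not shown in this tutorial and may use a different resampling method.

```python
def speed_perturb(samples, factor):
    """Return a copy of `samples` played `factor` times faster.

    The output has about len(samples) / factor samples, computed by
    linear interpolation between neighboring input samples.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = int(round(len(samples) / factor))
    out = []
    for i in range(n_out):
        pos = i * factor                    # fractional position in the input
        lo = int(pos)                       # left neighbor
        hi = min(lo + 1, len(samples) - 1)  # right neighbor (clamped at the end)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


# A toy "waveform" of 100 samples
wave = [float(n) for n in range(100)]
for factor in (0.9, 1.0, 1.1):
    print(factor, len(speed_perturb(wave, factor)))
```

With the factors "0.9 1.0 1.1", the training set would contain each utterance at three speeds, roughly tripling the amount of training audio.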

Stage 3: Format wav.scp: data/ -> dump/raw

We dump the data in the specified format (flac in this case) so that it can be read efficiently.

Note that --nj <N> means the number of CPU jobs. Please set it appropriately by considering your CPU resources and disk access.

[ ]:
# 30 seconds
!./ --stage 3 --stop_stage 3 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test" --nj 4
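
The idea behind `--nj` is that the list of utterances is divided into N parts, and each part is processed by its own job. A minimal sketch of such a split is shown below; the function name `split_into_jobs` is hypothetical, and the contiguous, near-equal split is an assumption about how the work might be divided rather than the recipe's exact behavior.

```python
def split_into_jobs(items, nj):
    """Split `items` into `nj` contiguous chunks of near-equal size."""
    if nj <= 0:
        raise ValueError("nj must be positive")
    base, extra = divmod(len(items), nj)
    chunks, start = [], 0
    for job in range(nj):
        # The first `extra` jobs receive one additional item
        size = base + (1 if job < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


# Ten hypothetical utterance IDs split across 4 jobs
utts = [f"utt{n}" for n in range(10)]
for job_id, chunk in enumerate(split_into_jobs(utts, 4), start=1):
    print(job_id, chunk)
```

Setting `nj` larger than the number of CPU cores, or larger than the disk can serve concurrently, usually gives no further speed-up.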

Stage 4: Remove long/short data: dump/raw/org -> dump/raw

Utterances that are too long or too short hurt training efficiency, so they are removed from the list.

[ ]:
!./ --stage 4 --stop_stage 4 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test"
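
The filtering in this stage can be pictured as a simple duration check per utterance. The sketch below is illustrative only: the function name `filter_by_duration`, the utterance IDs, and the thresholds are hypothetical, and the recipe's actual length limits are not stated in this tutorial.

```python
def filter_by_duration(utt2dur, min_sec, max_sec):
    """Keep only utterances whose duration lies in [min_sec, max_sec]."""
    return {
        utt: dur
        for utt, dur in utt2dur.items()
        if min_sec <= dur <= max_sec
    }


# Hypothetical durations in seconds
utt2dur = {"utt1": 0.05, "utt2": 2.3, "utt3": 41.0, "utt4": 7.8}
kept = filter_by_duration(utt2dur, min_sec=0.1, max_sec=20.0)
print(sorted(kept))
```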

Stage 5: Generate token_list from dump/raw/train_nodev/text using BPE.

This is important for text processing. In this example, we make a dictionary based on English characters, using the sentencepiece toolkit developed by Google.

[ ]:
!./ --stage 5 --stop_stage 5 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test"

Let’s check the content of the dictionary. There are several special symbols, e.g.,

<blank>   used for CTC
<unk>     unknown symbols that do not appear in the training data
<sos/eos> start and end sentence symbols
[ ]:
!cat data/token_list/bpe_unigram30/tokens.txt
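
To show how the special symbols fit into a token list, here is a small character-level stand-in written in plain Python. It does not use sentencepiece, and the placement of `<blank>` and `<unk>` at the start and `<sos/eos>` at the end is an assumption made for this sketch; check the generated tokens.txt for the real layout.

```python
def build_char_token_list(transcripts):
    """Build a character-level token list with the special symbols added."""
    symbols = sorted({ch for text in transcripts for ch in text})
    # Assumed layout: <blank> and <unk> first, <sos/eos> last
    return ["<blank>", "<unk>"] + symbols + ["<sos/eos>"]


def encode(text, token_list):
    """Map each character to its index; unseen characters map to <unk>."""
    index = {tok: i for i, tok in enumerate(token_list)}
    unk = index["<unk>"]
    return [index.get(ch, unk) for ch in text]


# Hypothetical transcripts
tokens = build_char_token_list(["AB", "BA C"])
print(tokens)
print(encode("AZ", tokens))  # "Z" was never seen, so it becomes <unk>
```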

language modeling (skip in this tutorial)

Stages 6–9: Stages related to language modeling.

We skip the language modeling part in the recipe (stages 6 – 9) in this tutorial.

End-to-end ASR

Stage 10: ASR collect stats: train_set=dump/raw/train_nodev, valid_set=dump/raw/train_dev

We estimate the mean and variance of the features for normalization. We also collect the input and output lengths, which are used to build mini-batches efficiently.

[ ]:
# 18 seconds
!./ --stage 10 --stop_stage 10 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test" --nj 4
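
The statistics collected in this stage can be sketched as follows: accumulate per-dimension sums and squared sums over all frames, derive the mean and variance, and record each utterance's length. The names `collect_stats` and `normalize` and the toy feature values are illustrative assumptions; the recipe's own implementation is not shown here.

```python
import math


def collect_stats(utterances):
    """Compute per-dimension mean/variance and per-utterance lengths.

    utterances: list of utterances, each a list of frames,
                each frame a list of floats of the same dimension.
    """
    dim = len(utterances[0][0])
    total = [0.0] * dim
    total_sq = [0.0] * dim
    n_frames = 0
    lengths = []
    for frames in utterances:
        lengths.append(len(frames))
        for frame in frames:
            n_frames += 1
            for d, value in enumerate(frame):
                total[d] += value
                total_sq[d] += value * value
    mean = [s / n_frames for s in total]
    # Var[x] = E[x^2] - (E[x])^2
    var = [sq / n_frames - m * m for sq, m in zip(total_sq, mean)]
    return mean, var, lengths


def normalize(frame, mean, var, eps=1e-20):
    """Apply mean and variance normalization to one frame."""
    return [(x - m) / math.sqrt(max(v, eps)) for x, m, v in zip(frame, mean, var)]


# Two toy utterances with 2-dimensional features
utts = [[[1.0, 2.0], [3.0, 6.0]], [[5.0, 10.0]]]
mean, var, lengths = collect_stats(utts)
print(mean, var, lengths)
print(normalize([3.0, 6.0], mean, var))
```

Knowing the lengths in advance lets a batch sampler group utterances of similar length, which reduces padding inside each mini-batch.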

Stage 11: ASR Training: train_set=dump/raw/train_nodev, valid_set=dump/raw/train_dev

Main training loop.

Please also monitor the following files:
- log file: /content/espnet/egs2/an4/asr1/exp/asr_train_raw_bpe30/train.log
- loss: /content/espnet/egs2/an4/asr1/exp/asr_train_raw_bpe30/images/loss.png
- accuracy: /content/espnet/egs2/an4/asr1/exp/asr_train_raw_bpe30/images/acc.png

[ ]:
# It would take 20-30 min.
!./ --stage 11 --stop_stage 11 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test" --ngpu 1

Stage 12: Decoding: training_dir=exp/asr_train_raw_bpe30

Note that we need to set --use_lm false since we skip the language model.

--inference_nj <N> specifies the number of inference jobs.

Let’s monitor the log /content/espnet/egs2/an4/asr1/exp/asr_train_raw_bpe30/inference_asr_model_valid.acc.ave/train_dev/logdir/asr_inference.1.log

[ ]:
# It would take ~10 minutes
!./ --inference_nj 4 --stage 12 --stop_stage 12 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test" --use_lm false

Stage 13: Scoring

You can find word error rate (WER), character error rate (CER), etc. for each test set.

[ ]:
!./ --stage 13 --stop_stage 13 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test" --use_lm false

You can also check the breakdown of the word error rate in /content/espnet/egs2/an4/asr1/exp/asr_train_raw_bpe30/inference_asr_model_valid.acc.ave/train_dev/score_wer/result.txt
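
For reference, WER and CER are both edit-distance-based measures: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length (in words or characters). The sketch below computes them in plain Python; it is not the SCTK scorer used by the recipe, and the example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


ref = "YES GO TO STOP"
hyp = "YES GO STOP NOW"
print(f"WER = {error_rate(ref.split(), hyp.split()):.2f}")
print(f"CER = {error_rate(list(ref), list(hyp)):.2f}")
```

Because insertions count as errors, an error rate can exceed 1.0 when the hypothesis is much longer than the reference.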

How to change the training configs?

config file based

All training options are changed by using a config file.

Please check

Let’s first check config files prepared in the an4 recipe

- LSTM-based E2E ASR /content/espnet/egs2/an4/asr1/conf/train_asr_rnn.yaml
- Transformer based E2E ASR /content/espnet/egs2/an4/asr1/conf/train_asr_transformer.yaml

You can run


./ --stage 10 \
   --train_set train_nodev \
   --valid_set train_dev \
   --test_sets "train_dev test" \
   --nj 4 \
   --inference_nj 4 \
   --use_lm false \
   --asr_config conf/train_asr_rnn.yaml


./ --stage 10 \
   --train_set train_nodev \
   --valid_set train_dev \
   --test_sets "train_dev test" \
   --nj 4 \
   --inference_nj 4 \
   --use_lm false \
   --asr_config conf/train_asr_transformer.yaml

You can also find various configs in espnet/egs2/*/asr1/conf/, including:
- Conformer: espnet/egs2/librispeech/asr1/conf/train_asr_conformer.yaml
- Wav2vec2.0 pre-trained model and fine-tuning
- HuBERT pre-trained model and fine-tuning

command line argument based

You can also customize it by editing the file or passing the command line arguments, e.g.,

./ --stage 10 --asr_args "--model_conf ctc_weight=0.3"
./ --stage 10 --asr_args "--optim_conf lr=0.1"
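
As background for the `ctc_weight` option, hybrid CTC/attention training is commonly formulated as a weighted sum of the two losses. The sketch below shows that interpolation with made-up loss values; the function name `hybrid_loss` is hypothetical, and treating `ctc_weight` as the interpolation coefficient is stated here as an assumption about what the option controls.

```python
def hybrid_loss(loss_ctc: float, loss_att: float, ctc_weight: float) -> float:
    """Interpolate the CTC loss and the attention decoder loss.

    ctc_weight = 1.0 uses only CTC; ctc_weight = 0.0 uses only attention.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must be in [0, 1]")
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att


# Made-up loss values for illustration
for w in (0.0, 0.3, 1.0):
    print(w, hybrid_loss(loss_ctc=2.0, loss_att=1.0, ctc_weight=w))
```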


How to make a new recipe?