# Pretrained Model

This notebook demonstrates how to recognize and synthesize speech using pretrained ESPnet models.

See also:

- Tutorial: https://github.com/espnet/espnet/blob/master/doc/tutorial.md
- GitHub: https://github.com/espnet

Author: [Takenori Yoshimura](https://github.com/takenori-y)

Last update: 2019/07/28

## Set up the environment

Let's set up the environment for the demonstration.
It takes around 10 minutes, so please wait until the cell finishes.

In [0]:
# OS setup
!sudo apt-get install bc tree sox
!cat /etc/os-release

# espnet setup
!git clone https://github.com/espnet/espnet
!cd espnet; pip install -e .

# warp ctc setup
!git clone https://github.com/espnet/warp-ctc -b pytorch-1.1
!cd warp-ctc && mkdir build && cd build && cmake .. && make -j
!cd warp-ctc/pytorch_binding && python setup.py install 

# kaldi setup
!cd /content/espnet/tools; git clone https://github.com/kaldi-asr/kaldi
!echo "" > ./espnet/tools/kaldi/tools/extras/check_dependencies.sh
!chmod +x ./espnet/tools/kaldi/tools/extras/check_dependencies.sh
!cd ./espnet/tools/kaldi/tools; make sph2pipe sclite
!rm -rf espnet/tools/kaldi/tools/python
!wget https://18-198329952-gh.circle-artifacts.com/0/home/circleci/repo/ubuntu16-featbin.tar.gz
!tar -xf ./ubuntu16-featbin.tar.gz
!cp featbin/* espnet/tools/kaldi/src/featbin/

# sentencepiece setup
!cd espnet/tools; make sentencepiece.done

# make dummy activate
!mkdir -p espnet/tools/venv/bin
!touch espnet/tools/venv/bin/activate

Reading package lists... Done
Building dependency tree 
Reading state information... Done
The following package was automatically installed and is no longer required:
 libnvidia-common-410
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
 libmagic-mgc libmagic1 libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa
 libsox-fmt-base libsox3
Suggested packages:
 file libsox-fmt-all
The following NEW packages will be installed:
 bc libmagic-mgc libmagic1 libopencore-amrnb0 libopencore-amrwb0
 libsox-fmt-alsa libsox-fmt-base libsox3 sox tree
0 upgraded, 10 newly installed, 0 to remove and 7 not upgraded.
Need to get 887 kB of archives.
After this operation, 7,040 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libopencore-amrnb0 amd64 0.1.3-2.1 [92.0 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libopencore-amrwb0 amd64 0.1.3-2.1 [45.8 kB]
Get:3 http://archive.ubuntu.com/ubu
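
Once the setup cell has finished, a quick sanity check can confirm that the command-line tools installed by `apt-get` are actually on the `PATH`. The following is a minimal sketch; the helper name `missing_tools` is hypothetical and not part of ESPnet.

```python
import shutil


def missing_tools(tools):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# bc, tree and sox are the packages requested by the apt-get line above.
missing = missing_tools(["bc", "tree", "sox"])
print("missing tools:", missing if missing else "none")
```

If the list is not empty, rerunning the OS setup lines is probably the simplest fix.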

## Recognize speech using pretrained models

Let's recognize a 7-minute-long audio recording as an example. Go to a recipe directory and run `recog_wav.sh` there.

Available models are summarized [here](https://github.com/espnet/espnet#asr-demo).

In [0]:
!cd espnet/egs/tedlium2/asr1; bash ../../../utils/recog_wav.sh --models tedlium2.tacotron2.v1

stage 0: Data preparation
stage 1: Feature Generation
steps/make_fbank_pitch.sh --cmd run.pl --nj 1 --write_utt2num_frames true decode/TomWujec_2010U/data decode/TomWujec_2010U/log decode/TomWujec_2010U/fbank
steps/make_fbank_pitch.sh: moving decode/TomWujec_2010U/data/feats.scp to decode/TomWujec_2010U/data/.backup
 Search for the word 'bold' in http://kaldi-asr.org/doc/data_prep.html
 for more information.
utils/validate_data_dir.sh: Successfully validated data-directory decode/TomWujec_2010U/data
steps/make_fbank_pitch.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
steps/make_fbank_pitch.sh: Succeeded creating filterbank and pitch features for data
/content/espnet/egs/tedlium2/asr1/../../../utils/dump.sh --cmd run.pl --nj 1 --do_delta false decode/TomWujec_2010U/data/feats.scp decode/download/tedlium2.tacotron2.v1/data/train_trim_sp/cmvn.ark decode/TomWujec_2010U/log decode/TomWujec_2010U/dump
stage 2: Json Data Preparation
/content/espnet/egs/tedlium2/a

You can follow the progress of the recognition in the decoding log.

In [0]:
!cat espnet/egs/tedlium2/asr1/decode/TomWujec_2010U/log/decode.log

# asr_recog.py --config decode/download/tedlium2.tacotron2.v1/conf/decode_streaming.yaml --ngpu 0 --backend pytorch --debugmode 1 --verbose 1 --recog-json decode/TomWujec_2010U/dump/data.json --result-label decode/TomWujec_2010U/result.json --model decode/download/tedlium2.tacotron2.v1/exp/train_trim_sp_pytorch_train4/results/model.acc.best --rnnlm decode/download/tedlium2.tacotron2.v1/exp/train_rnnlm_pytorch_lm_unigram500/rnnlm.model.best 
# Started at Mon Jul 29 04:10:09 UTC 2019
#
2019-07-29 04:10:10,182 (asr_recog:134) INFO: python path = /env/python
2019-07-29 04:10:10,183 (asr_recog:139) INFO: set random seed = 1
2019-07-29 04:10:10,183 (asr_recog:147) INFO: backend = pytorch
 import imp
2019-07-29 04:10:12,364 (deterministic_utils:24) INFO: torch type check is disabled
2019-07-29 04:10:12,364 (asr_utils:310) INFO: reading a config file from decode/download/tedlium2.tacotron2.v1/exp/train_trim_sp_pytorch_train4/results/model.json
2019-07-29 04:10:12,364 (asr:482) INFO: reading mo
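
The log can get long, so it may help to filter it programmatically. The sketch below assumes that log lines follow the `timestamp (module:line) LEVEL: message` layout visible in the output above; the pattern `LOG_LINE` and the helper `parse_log` are illustrative names, not ESPnet utilities.

```python
import re

# Pattern for lines such as:
# 2019-07-29 04:10:10,183 (asr_recog:147) INFO: backend = pytorch
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\((?P<module>[\w.]+):(?P<lineno>\d+)\) "
    r"(?P<level>[A-Z]+): (?P<message>.*)$"
)


def parse_log(text):
    """Return one dict per line that matches LOG_LINE; other lines are skipped."""
    records = []
    for line in text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            records.append(match.groupdict())
    return records


# Three lines copied from the log shown above.
sample = """\
# Started at Mon Jul 29 04:10:09 UTC 2019
2019-07-29 04:10:10,183 (asr_recog:139) INFO: set random seed = 1
2019-07-29 04:10:10,183 (asr_recog:147) INFO: backend = pytorch
"""
for record in parse_log(sample):
    print(record["module"], record["level"], record["message"])
```

To apply it to the real log, read `espnet/egs/tedlium2/asr1/decode/TomWujec_2010U/log/decode.log` and pass its contents to `parse_log`.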

You can change the E2E model, the language model, the decoding parameters, and so on. For details, see `recog_wav.sh`.

In [0]:
!cat espnet/utils/recog_wav.sh

#!/bin/bash

# Copyright 2019 Nagoya University (Takenori Yoshimura)
# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)

if [ ! -f path.sh ] || [ ! -f cmd.sh ]; then
 echo "Please change directory to e.g., egs/tedlium2/asr1"
 exit 1
fi

. ./path.sh

# general configuration
backend=pytorch
stage=0 # start from 0 if you need to start from data preparation
stop_stage=100
ngpu=0 # number of gpus ("0" uses cpu, otherwise use gpu)
debugmode=1
verbose=1 # verbose option

# feature configuration
do_delta=false
cmvn=

# rnnlm related
use_lang_model=true
lang_model=

# decoding parameter
recog_model=
decode_config=
decode_dir=decode

# download related
models=tedlium2.tacotron2.v1

help_message=$(cat <<EOF

Example:
 rec -c 1 -r 16000 example.wav trim 0 5
 $0 example.wav
EOF
)
. utils/parse_options.sh || exit 1;

# make shellcheck happy
train_cmd=
decode_cmd=

. ./cmd.sh

wav=$1
download_dir=${decode_dir}/download

if [ $# -gt 1 ]; then
 echo $help_message
 exit 1;
fi

set -e
set -u
set -o pi
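
The variables under the configuration comments of the script can be overridden from the command line, because the script sources `utils/parse_options.sh`, which turns `--name value` arguments into shell variables. As a rough sketch, a command line with overrides could be assembled as follows; `build_recog_command` is a hypothetical helper, and the chosen option values are only examples.

```python
import shlex


def build_recog_command(wav, **options):
    """Build a recog_wav.sh command line; each option becomes `--name value`."""
    args = ["bash", "../../../utils/recog_wav.sh"]
    for name, value in options.items():
        args += [f"--{name}", str(value)]
    # The script accepts at most one positional argument: the wav file.
    args.append(wav)
    return " ".join(shlex.quote(arg) for arg in args)


# Example: decode on CPU, without the language model, into the default directory.
print(build_recog_command("example.wav", ngpu=0, decode_dir="decode", use_lang_model="false"))
```

The resulting string is meant to be run from a recipe directory such as `egs/tedlium2/asr1`, since the script exits early when `path.sh` and `cmd.sh` are missing.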

## Synthesize speech using pretrained models

Let's synthesize speech using an E2E model. Go to a recipe directory and run `synth_wav.sh` there.

Available models are summarized [here](https://github.com/espnet/espnet#tts-demo).

In [0]:
!cd espnet/egs/ljspeech/tts1; \
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example.txt; \
bash ../../../utils/synth_wav.sh --models ljspeech.tacotron2.v1 example.txt

--2019-07-29 04:12:48-- https://drive.google.com/uc?export=download&id=1dKzdaDpOkpx7kWZnvrvx2De7eZEdPHZs
Resolving drive.google.com (drive.google.com)... 74.125.20.138, 74.125.20.102, 74.125.20.101, ...
Connecting to drive.google.com (drive.google.com)|74.125.20.138|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-04-30-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/s5uj0qai6u6ooagkqh3ku4vpis2lstie/1564372800000/04214513489132088126/*/1dKzdaDpOkpx7kWZnvrvx2De7eZEdPHZs?e=download [following]
--2019-07-29 04:12:56-- https://doc-04-30-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/s5uj0qai6u6ooagkqh3ku4vpis2lstie/1564372800000/04214513489132088126/*/1dKzdaDpOkpx7kWZnvrvx2De7eZEdPHZs?e=download
Resolving doc-04-30-docs.googleusercontent.com (doc-04-30-docs.googleusercontent.com)... 74.125.28.132, 2607:f8b0:400e:c04::84
Connecting to doc-04-30-docs.googleusercontent.com (doc-04-30-doc

Let's listen to the synthesized speech!

In [0]:
from google.colab import files

files.download('espnet/egs/ljspeech/tts1/decode/example/wav/example.wav')
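
Before downloading, it can be useful to check that the file exists and has a plausible length. The following sketch uses only the standard-library `wave` module and assumes that the output is a PCM WAV file; `wav_info` is a hypothetical helper name.

```python
import wave


def wav_info(path):
    """Return (sampling rate in Hz, duration in seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate


# Path used in the download cell above; adjust it if decode_dir was changed.
path = "espnet/egs/ljspeech/tts1/decode/example/wav/example.wav"
try:
    rate, seconds = wav_info(path)
    print(f"{rate} Hz, {seconds:.2f} s")
except FileNotFoundError:
    print(f"{path} not found; run the synthesis cell first")
```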

You can change the E2E model, the decoding parameters, and so on. For details, see `synth_wav.sh`.

In [0]:
!cat espnet/utils/synth_wav.sh

#!/bin/bash

# Copyright 2019 Nagoya University (Takenori Yoshimura)
# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)

if [ ! -f path.sh ] || [ ! -f cmd.sh ]; then
 echo "Please change directory to e.g., egs/ljspeech/tts1"
 exit 1
fi

. ./path.sh

# general configuration
backend=pytorch
stage=0 # start from 0 if you need to start from data preparation
stop_stage=100
ngpu=0 # number of gpus ("0" uses cpu, otherwise use gpu)
debugmode=1
verbose=1 # verbose option

# feature configuration
fs=22050 # sampling frequency
fmax="" # maximum frequency
fmin="" # minimum frequency
n_mels=80 # number of mel basis
n_fft=1024 # number of fft points
n_shift=256 # number of shift points
win_length="" # window length
cmvn=

# dictionary related
dict=

# embedding related
input_wav=

# decoding related
synth_model=
decode_config=
decode_dir=decode
griffin_lim_iters=1000

# download related
models=ljspeech.transformer.v1

help_message=$(cat <<EOF

Example:
 echo \"This is a demonstration of text to 
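
To get a feel for the feature configuration in the script, the default values of `fs`, `n_fft` and `n_shift` can be converted into time units. This is a small worked example using only the numbers shown above; it treats `n_shift` as the hop between successive frames in samples, as the comment in the script suggests.

```python
fs = 22050     # sampling frequency in Hz
n_fft = 1024   # number of FFT points
n_shift = 256  # hop between successive frames, in samples

shift_ms = 1000 * n_shift / fs    # time step between frames
fft_ms = 1000 * n_fft / fs        # time span covered by one FFT
frames_per_second = fs / n_shift  # number of feature frames per second of audio

print(f"frame shift: {shift_ms:.2f} ms")
print(f"FFT span:    {fft_ms:.2f} ms")
print(f"frames/s:    {frames_per_second:.2f}")
```

With these defaults, each second of audio corresponds to roughly 86 feature frames, each shifted by about 11.6 ms.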

We have web storage for hosting well-trained models. If you would like to share yours, please contact Shinji Watanabe.