Pretrained Model
This example notebook shows how to recognize and synthesize speech using pretrained ESPnet models.
See also:
- Tutorial: https://github.com/espnet/espnet/blob/master/doc/tutorial.md
- Github: https://github.com/espnet
Author: Takenori Yoshimura
Last update: 2019/07/28
Setup environment
Let's set up the environment for the demonstration. It takes around 10 minutes, so please wait for a while.
# OS setup
!sudo apt-get install bc tree sox
!cat /etc/os-release
# espnet setup
!git clone https://github.com/espnet/espnet
!cd espnet; pip install -e .
# warp ctc setup
!git clone https://github.com/espnet/warp-ctc -b pytorch-1.1
!cd warp-ctc && mkdir build && cd build && cmake .. && make -j
!cd warp-ctc/pytorch_binding && python setup.py install
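To confirm that the warp-ctc binding built correctly, you can try importing it from Python. The module name warpctc_pytorch is an assumption based on the pytorch_binding setup above; if the import fails, re-check the build logs.
# verify the warp-ctc build (module name assumed from the pytorch_binding setup above)
import warpctc_pytorch
print(warpctc_pytorch.__file__)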
# kaldi setup
!cd /content/espnet/tools; git clone https://github.com/kaldi-asr/kaldi
!echo "" > ./espnet/tools/kaldi/tools/extras/check_dependencies.sh
!chmod +x ./espnet/tools/kaldi/tools/extras/check_dependencies.sh
!cd ./espnet/tools/kaldi/tools; make sph2pipe sclite
!rm -rf espnet/tools/kaldi/tools/python
!wget https://18-198329952-gh.circle-artifacts.com/0/home/circleci/repo/ubuntu16-featbin.tar.gz
!tar -xf ./ubuntu16-featbin.tar.gz
!cp featbin/* espnet/tools/kaldi/src/featbin/
# sentencepiece setup
!cd espnet/tools; make sentencepiece.done
# make a dummy activate script (espnet recipes source tools/venv/bin/activate)
!mkdir -p espnet/tools/venv/bin
!touch espnet/tools/venv/bin/activate
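Before moving on, a quick sanity check that the key tools are in place can save debugging time later. The paths below assume the default Colab working directory /content.
# quick sanity check of the setup (paths assume the Colab working directory /content)
!sox --version
!ls espnet/tools/kaldi/src/featbin | head
!ls espnet/tools/venv/bin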
Recognize speech using pretrained models
Let's recognize a 7-minute-long speech recording as an example. Go to a recipe directory and run recog_wav.sh there.
Available models are summarized here.
!cd espnet/egs/tedlium2/asr1; bash ../../../utils/recog_wav.sh --models tedlium2.rnn.v1
You can see the progress of the recognition.
!cat espnet/egs/tedlium2/asr1/decode/TomWujec_2010U/log/decode.log
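If you only want the recognized text, you can grep the log. The "prediction" tag below is an assumption about the espnet decode-log format; adjust the pattern to whatever the log above actually shows.
# show only the recognized text (the "prediction" tag is an assumption about the decode-log format)
!grep -i "prediction" espnet/egs/tedlium2/asr1/decode/TomWujec_2010U/log/decode.log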
You can change the E2E model, language model, decoding parameters, etc. For details, see recog_wav.sh.
!cat espnet/utils/recog_wav.sh
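For example, you could try another pretrained model by changing the --models value. The tag tedlium2.transformer.v1 below is an assumption based on the model list inside recog_wav.sh; check the cat output above before running.
# rerun recognition with a different pretrained model (model tag assumed from the list in recog_wav.sh)
!cd espnet/egs/tedlium2/asr1; bash ../../../utils/recog_wav.sh --models tedlium2.transformer.v1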
Synthesize speech using pretrained models
Let's synthesize speech using an E2E model. Go to a recipe directory and run synth_wav.sh there.
Available models are summarized here.
!cd espnet/egs/ljspeech/tts1; \
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example.txt; \
bash ../../../utils/synth_wav.sh --models ljspeech.tacotron2.v1 example.txt
Let's listen to the synthesized speech!
from google.colab import files
files.download('espnet/egs/ljspeech/tts1/decode/example/wav/example.wav')
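Alternatively, the audio can be played inline in the notebook with IPython's Audio display class instead of downloading it:
# play the synthesized wav inline in the notebook
from IPython.display import Audio, display
display(Audio('espnet/egs/ljspeech/tts1/decode/example/wav/example.wav'))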
You can change the E2E model, decoding parameters, etc. For details, see synth_wav.sh.
!cat espnet/utils/synth_wav.sh
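For instance, you can synthesize your own sentence by writing it to a text file and passing it to the script, just as in the example above. The output path is assumed to mirror the input file name, i.e. decode/example2/wav/example2.wav here.
# synthesize a custom sentence with the same pretrained model
!cd espnet/egs/ljspeech/tts1; \
echo "END TO END SPEECH PROCESSING IS FUN." > example2.txt; \
bash ../../../utils/synth_wav.sh --models ljspeech.tacotron2.v1 example2.txt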
We provide web storage for hosting well-trained models. If you would like to share yours, please contact Shinji Watanabe (shinjiw@ieee.org).