Speech Recognition (Recipe)
Author: Shigeki Karita
July 29 2019
ESPnet Hackathon 2019 @Tokyo
Abstract
This tutorial shows a practical ASR example using ESPnet, both as a command line interface and as a library.
See also
- documentation https://espnet.github.io/espnet/
- github https://github.com/espnet
Installation
ESPnet depends on the Kaldi ASR toolkit and Warp-CTC. Installing them will take a few minutes.
# OS setup
!sudo apt-get install bc tree
!cat /etc/os-release
# espnet setup
!git clone https://github.com/espnet/espnet
!cd espnet; pip install -e .
!mkdir -p espnet/tools/venv/bin; touch espnet/tools/venv/bin/activate
# warp ctc setup
!git clone https://github.com/espnet/warp-ctc -b pytorch-1.1
!cd warp-ctc && mkdir build && cd build && cmake .. && make -j4
!cd warp-ctc/pytorch_binding && python setup.py install
# kaldi setup
!cd ./espnet/tools; git clone https://github.com/kaldi-asr/kaldi
!echo "" > ./espnet/tools/kaldi/tools/extras/check_dependencies.sh # ignore check
!chmod +x ./espnet/tools/kaldi/tools/extras/check_dependencies.sh
!cd ./espnet/tools/kaldi/tools; make sph2pipe sclite
!rm -rf espnet/tools/kaldi/tools/python
![ ! -e ubuntu16-featbin.tar.gz ] && wget https://18-198329952-gh.circle-artifacts.com/0/home/circleci/repo/ubuntu16-featbin.tar.gz
!tar -xf ./ubuntu16-featbin.tar.gz
!cp featbin/* espnet/tools/kaldi/src/featbin/
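If the installation succeeded, the Python packages should now be importable. A quick sanity check (this only verifies the imports, not the Kaldi binaries):
# sanity check: the installed packages should import cleanly
import torch
import espnet
print("torch:", torch.__version__)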
ESPnet command line usage (espnet/egs/xxx)
You can use the end-to-end script run.sh to reproduce the systems reported in espnet/egs/*/asr1/RESULTS.md. Typically, run.sh is organized into several stages:
- Data download (if available)
- Kaldi-style data preparation
- Save Python-friendly data (e.g., JSON, HDF5)
- Language model training
- ASR model training
- Decoding and evaluation
!ls espnet/egs
Stage 0 - 2 Data preparation
For example, if you add --stop-stage 2, you can stop the script before neural network training.
!cd espnet/egs/an4/asr1; ./run.sh --ngpu 1 --stop-stage 2
Kaldi-style directory structure
Each recipe placed in egs/xxx/asr1 is organized in the Kaldi way:
- conf/: Kaldi configurations, e.g., speech feature settings
- data/: almost raw data prepared by Kaldi
- exp/: intermediate files through experiments, e.g., log files, model parameters
- fbank/: speech feature binary files, e.g., ark, scp
- dump/: ESPnet meta data for training, e.g., json, hdf5
- local/: corpus-specific data preparation scripts
- steps/, utils/: Kaldi's helper scripts
!tree -L 1 espnet/egs/an4/asr1
TIPS: essential files in data preparation
To create a new recipe, all you need to write is stage 1, which creates two key-value pair files:
- speech: data/xxx/wav.scp
- text: data/xxx/text
raw speech file list
!head espnet/egs/an4/asr1/data/train/wav.scp
raw text list
!head espnet/egs/an4/asr1/data/train/text
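Both files are plain "key value" text, so they are also easy to inspect from Python. A minimal sketch (load_kv is a hypothetical helper, not part of ESPnet; the recipe itself handles these files with Kaldi scripts):
# parse a Kaldi-style key-value file: each line is "<utt-id> <value...>"
def load_kv(path):
    with open(path) as f:
        return dict(line.rstrip("\n").split(" ", 1) for line in f if line.strip())

wavs = load_kv("espnet/egs/an4/asr1/data/train/wav.scp")
texts = load_kv("espnet/egs/an4/asr1/data/train/text")
print(len(wavs), "utterances")
print(next(iter(texts.items())))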
TIPS: explore datasets with data.json
To explore datasets easily, ESPnet stores the metadata dump/xxx/data.json in stage 2.
import json
import matplotlib.pyplot as plt
import kaldiio
# load the speech/text pair at index 10 in data.json
root = "espnet/egs/an4/asr1"
with open(root + "/dump/test/deltafalse/data.json", "r") as f:
    test_json = json.load(f)["utts"]
key, info = list(test_json.items())[10]
# plot the speech feature
fbank = kaldiio.load_mat(info["input"][0]["feat"])
plt.matshow(fbank.T[::-1])
plt.title(key + ": " + info["output"][0]["text"])
# print the key-value pair
key, info
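Because data.json also records tensor shapes, you can summarize a whole set without reading any feature file. A small sketch (it assumes each utterance stores its feature shape under input[0]["shape"], as in the dump above):
# summarize input feature lengths (frames) over the test set
lengths = [utt["input"][0]["shape"][0] for utt in test_json.values()]
print("utterances:", len(lengths))
print("frames: min %d / mean %.1f / max %d" % (min(lengths), sum(lengths) / len(lengths), max(lengths)))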
Stage 3 - 4 NN Training
Let's go to the most interesting part...
!tail espnet/egs/an4/asr1/conf/train_mtlalpha1.0.yaml
!cd espnet/egs/an4/asr1; ./run.sh --ngpu 1 --stage 3 --stop-stage 4 --train-config ./conf/train_mtlalpha1.0.yaml
TIPS: change_yaml.py
You can tweak the YAML config with $(change_yaml.py xxx.yaml -a yyy=zzz), which creates a modified copy of the config and prints its path:
!cd espnet/egs/an4/asr1; source path.sh; \
./run.sh --ngpu 1 --stage 4 --stop-stage 4 \
--train-config $(change_yaml.py ./conf/train_mtlalpha1.0.yaml -a eunits=100)
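To check the tweak before training, you can grep the generated file; change_yaml.py prints the path of the modified copy (an illustrative check, not a required step):
!cd espnet/egs/an4/asr1; source path.sh; \
  grep eunits $(change_yaml.py ./conf/train_mtlalpha1.0.yaml -a eunits=100)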
TIPS: tensorboard
You can easily monitor the effects of the config with TensorBoard.
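For example, in a notebook you can launch it inline (this assumes run.sh wrote event files under tensorboard/ in the recipe directory, the default location in these recipes):
%load_ext tensorboard
%tensorboard --logdir espnet/egs/an4/asr1/tensorboard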
Decoding and evaluation
Decode config (change_yaml.py also works):
!cat espnet/egs/an4/asr1/conf/decode_ctcweight1.0.yaml
Command line usage
!cd espnet/egs/an4/asr1; ./run.sh --stage 5
ASR result as data.json
!head -n20 espnet/egs/an4/asr1/exp/train_nodev_pytorch_train_mtlalpha1.0/decode_test_decode_ctcweight1.0_lm_word100/data.json
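You can also read the decoded data.json back from Python to compare references and hypotheses side by side (this assumes the decoder stores each hypothesis under output[0]["rec_text"], next to the reference text):
import json
# compare the first few references and hypotheses from the decode output
decode_dir = "espnet/egs/an4/asr1/exp/train_nodev_pytorch_train_mtlalpha1.0/decode_test_decode_ctcweight1.0_lm_word100"
with open(decode_dir + "/data.json", "r") as f:
    utts = json.load(f)["utts"]
for key, info in list(utts.items())[:3]:
    out = info["output"][0]
    print(key)
    print("  ref:", out["text"])
    print("  hyp:", out.get("rec_text", "").replace("<eos>", ""))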
Recognize speech from Python
Let's use ESPnet as a library together with the trained model:
!ls espnet/egs/an4/asr1/exp/train_nodev_pytorch_train_mtlalpha1.0/results
recap: load speech from data.json
import json
import matplotlib.pyplot as plt
import kaldiio
# load the speech/text pair at index 10 in data.json
root = "espnet/egs/an4/asr1"
with open(root + "/dump/test/deltafalse/data.json", "r") as f:
    test_json = json.load(f)["utts"]
key, info = list(test_json.items())[10]
# plot the speech feature
fbank = kaldiio.load_mat(info["input"][0]["feat"])
plt.matshow(fbank.T[::-1])
plt.title(key + ": " + info["output"][0]["text"])
load model
import json
import torch
import argparse
from espnet.bin.asr_recog import get_parser
from espnet.nets.pytorch_backend.e2e_asr import E2E
root = "espnet/egs/an4/asr1"
model_dir = root + "/exp/train_nodev_pytorch_train_mtlalpha1.0/results"
# load model
with open(model_dir + "/model.json", "r") as f:
    idim, odim, conf = json.load(f)
model = E2E(idim, odim, argparse.Namespace(**conf))
model.load_state_dict(torch.load(model_dir + "/model.loss.best"))
model.cpu().eval()
# load token dict
with open(root + "/data/lang_1char/train_nodev_units.txt", "r") as f:
    token_list = [entry.split()[0] for entry in f]
token_list.insert(0, '<blank>')
token_list.append('<eos>')
# recognize speech
parser = get_parser()
args = parser.parse_args(["--beam-size", "2", "--ctc-weight", "1.0", "--result-label", "out.json", "--model", ""])
result = model.recognize(fbank, args, token_list)
s = "".join(conf["char_list"][y] for y in result[0]["yseq"]).replace("<eos>", "").replace("<space>", " ").replace("<blank>", "")
print("groundtruth:", info["output"][0]["text"])
print("prediction: ", s)
# play the recognized speech
import os
import kaldiio
from IPython.display import Audio

d = os.getcwd()
try:
    # move into the recipe dir so the relative paths in wav.scp resolve
    os.chdir(root)
    sr, wav = kaldiio.load_scp("data/test/wav.scp")[key]
finally:
    os.chdir(d)
Audio(wav, rate=sr)