Speech Recognition (Library)
This tutorial walks through a practical ASR example, using ESPnet both as a command-line interface and as a Python library.
See also
- run in Colab
- documentation: https://espnet.github.io/espnet/
- GitHub: https://github.com/espnet
Author: Shigeki Karita
Installation
ESPnet depends on the Kaldi ASR toolkit and Warp-CTC. This cell will take a few minutes.
# TODO(karita): put these lines in ./espnet/tools/setup_colab.sh
# OS setup
!sudo apt-get install bc tree
!cat /etc/os-release
# espnet setup
!git clone https://github.com/espnet/espnet
!cd espnet; pip install -e .
!mkdir espnet/tools/venv/bin; touch espnet/tools/venv/bin/activate
# warp ctc setup
!git clone https://github.com/espnet/warp-ctc -b pytorch-1.1
!cd warp-ctc && mkdir build && cd build && cmake .. && make -j4
!cd warp-ctc/pytorch_binding && python setup.py install
# kaldi setup
!cd ./espnet/tools; git clone https://github.com/kaldi-asr/kaldi
!echo "" > ./espnet/tools/kaldi/tools/extras/check_dependencies.sh # ignore check
!chmod +x ./espnet/tools/kaldi/tools/extras/check_dependencies.sh
!cd ./espnet/tools/kaldi/tools; make sph2pipe sclite
!rm -rf espnet/tools/kaldi/tools/python
![ ! -e ubuntu16-featbin.tar.gz ] && wget https://18-198329952-gh.circle-artifacts.com/0/home/circleci/repo/ubuntu16-featbin.tar.gz
!tar -xf ./ubuntu16-featbin.tar.gz
!cp featbin/* espnet/tools/kaldi/src/featbin/
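After installation, a quick sanity check confirms that the Python package is importable (a minimal check; the version string is whatever the cloned commit reports):
!python -c "import espnet; print(espnet.__version__)"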
ESPnet data preparation
You can use the end-to-end script run.sh to reproduce the systems reported in espnet/egs/*/asr1/RESULTS.md. Typically, run.sh is organized into several stages:
- Data download (if available)
- Kaldi-style data preparation
- Dump useful data for training (e.g., JSON, HDF5, etc.)
- Language model training
- ASR model training
- Decoding and evaluation
For example, if you add --stop-stage 2, the script stops before neural network training.
!cd espnet/egs/an4/asr1; ./run.sh --ngpu 1 --stop-stage 2
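Conversely, run.sh also accepts --stage to skip stages that already finished and resume from a later point (a sketch; the stage numbers are assumed to follow the list above):
# resume from language model training onwards, assuming the earlier stages already ran
!cd espnet/egs/an4/asr1; ./run.sh --ngpu 1 --stage 3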
Kaldi-style directories
Each recipe in egs/xxx/asr1 is organized in the Kaldi way. The important directories are:
- conf/: Kaldi configurations, e.g., speech features
- data/: almost raw data prepared by Kaldi
- exp/: intermediate files through experiments, e.g., log files, model parameters
- fbank/: speech feature binary files, e.g., ark, scp
- dump/: ESPnet metadata for training, e.g., JSON, HDF5
- local/: corpus-specific data preparation scripts
- steps/, utils/: Kaldi's helper scripts
!tree -L 1 espnet/egs/an4/asr1
!ls espnet/egs/an4/asr1/data/train
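The files in data/ are plain-text tables keyed by utterance ID; for example, text holds transcriptions and wav.scp points to the audio (a quick peek, assuming the data preparation above succeeded):
!head -3 espnet/egs/an4/asr1/data/train/text
!head -3 espnet/egs/an4/asr1/data/train/wav.scp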
ESPnet as a library
Here we use ESPnet as a library to create a simple Python snippet for speech recognition. ESPnet's training script asr_train.py has four parts:
- Load train/dev dataset
- Create minibatches
- Build neural networks
- Update neural networks by iterating over datasets
Let's implement these procedures from scratch!
Load train/dev dataset (1/4)
First, we check how run.sh organized the JSON files, and load a pair of speech features and its transcription.
import json
import matplotlib.pyplot as plt
import kaldiio
root = "espnet/egs/an4/asr1"
with open(root + "/dump/train_nodev/deltafalse/data.json", "r") as f:
    train_json = json.load(f)["utts"]
with open(root + "/dump/train_dev/deltafalse/data.json", "r") as f:
    dev_json = json.load(f)["utts"]
# the first training data for speech recognition
key, info = next(iter(train_json.items()))
# plot the 80-dim fbank + 3-dim pitch speech feature
fbank = kaldiio.load_mat(info["input"][0]["feat"])
plt.matshow(fbank.T[::-1])
plt.title(key + ": " + info["output"][0]["text"])
# print the key-value pair
key, info
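Besides the feature path and text, each data.json entry records the input/output shapes; these are what we will later use to size the network:
# input shape is [frames, feature-dim]; output shape is [tokens, vocab-size]
print(info["input"][0]["shape"], info["output"][0]["shape"])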
Create minibatches (2/4)
To parallelize neural network training, we create minibatches that contain several sequence pairs by splitting the datasets.
from espnet.utils.training.batchfy import make_batchset
batch_size = 32
trainset = make_batchset(train_json, batch_size)
devset = make_batchset(dev_json, batch_size)
assert len(devset[0]) == batch_size
devset[0][:3]
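make_batchset groups utterances of similar length into the same minibatch, which keeps the zero-padding overhead small. A quick way to verify this from the shape metadata in data.json:
# frame lengths within one batch should be close to each other
lengths = [info["input"][0]["shape"][0] for key, info in trainset[0]]
print("min frames:", min(lengths), "max frames:", max(lengths))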
Build neural networks (3/4)
For simplicity, we use a predefined model: Transformer.
NOTE: You can also use your own model with the command-line tools, e.g., asr_train.py --model-module your_module:YourModel
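For reference, a hypothetical skeleton of such a module (your_module and YourModel are illustrative names; the sketch assumes the class follows the same interface as the built-in E2E used below, i.e., a static add_arguments and an __init__(idim, odim, args)):
# your_module.py -- illustrative sketch only, not a working model
import torch
from espnet.nets.asr_interface import ASRInterface

class YourModel(torch.nn.Module, ASRInterface):
    @staticmethod
    def add_arguments(parser):
        # register model-specific command-line options here
        return parser

    def __init__(self, idim, odim, args):
        torch.nn.Module.__init__(self)
        # build the encoder/decoder here from idim (feature dim) and odim (vocab size)
        ...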
import argparse
from espnet.bin.asr_train import get_parser
from espnet.nets.pytorch_backend.e2e_asr import E2E
parser = get_parser()
parser = E2E.add_arguments(parser)
config = parser.parse_args([
    "--mtlalpha", "0.0",  # weight for cross entropy and CTC loss
    "--outdir", "out", "--dict", ""])  # TODO: allow no arg
idim = info["input"][0]["shape"][1]
odim = info["output"][0]["shape"][1]
setattr(config, "char_list", [])
model = E2E(idim, odim, config)
model
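The result is an ordinary torch.nn.Module, so the usual PyTorch utilities apply, e.g., counting trainable parameters:
# total number of trainable parameters
sum(p.numel() for p in model.parameters() if p.requires_grad)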
Update neural networks by iterating over datasets (4/4)
Finally, we get to the training part.
import numpy
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.nn.utils.clip_grad import clip_grad_norm_
from torch.utils.data import DataLoader
def collate(minibatch):
    # make_batchset already formed the batches; with the DataLoader's default
    # batch_size=1, minibatch[0] is one batch of (key, info) pairs
    fbanks = []
    tokens = []
    for key, info in minibatch[0]:
        fbanks.append(torch.tensor(kaldiio.load_mat(info["input"][0]["feat"])))
        tokens.append(torch.tensor([int(s) for s in info["output"][0]["tokenid"].split()]))
    ilens = torch.tensor([x.shape[0] for x in fbanks])
    return pad_sequence(fbanks, batch_first=True), ilens, pad_sequence(tokens, batch_first=True)
train_loader = DataLoader(trainset, collate_fn=collate, shuffle=True, pin_memory=True)
dev_loader = DataLoader(devset, collate_fn=collate, pin_memory=True)
model.cuda()
optim = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.98))
n_iter = len(trainset)
n_epoch = 10
total_iter = n_iter * n_epoch
train_acc = []
valid_acc = []
for epoch in range(n_epoch):
    # training
    acc = []
    model.train()
    for data in train_loader:
        loss = model(*[d.cuda() for d in data])
        optim.zero_grad()
        loss.backward()
        acc.append(model.acc)
        # clip gradients to stabilize training
        norm = clip_grad_norm_(model.parameters(), 10.0)
        optim.step()
    train_acc.append(numpy.mean(acc))

    # validation (no gradients needed)
    acc = []
    model.eval()
    with torch.no_grad():
        for data in dev_loader:
            model(*[d.cuda() for d in data])
            acc.append(model.acc)
    valid_acc.append(numpy.mean(acc))
    print(f"epoch: {epoch}, train acc: {train_acc[-1]:.3f}, dev acc: {valid_acc[-1]:.3f}")
import matplotlib.pyplot as plt
plt.plot(range(len(train_acc)), train_acc, label="train acc")
plt.plot(range(len(valid_acc)), valid_acc, label="dev acc")
plt.grid()
plt.legend()
torch.save(model.state_dict(), "best.pt")
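The saved weights can later be restored into an E2E model built with the same idim/odim/config:
# reload the checkpoint into an identically constructed model
model.load_state_dict(torch.load("best.pt"))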
Recognize speech
import json
import matplotlib.pyplot as plt
import kaldiio
from espnet.bin.asr_recog import get_parser
# load data
root = "espnet/egs/an4/asr1"
with open(root + "/dump/test/deltafalse/data.json", "r") as f:
    test_json = json.load(f)["utts"]
key, info = list(test_json.items())[10]
# plot the 80-dim fbank + 3-dim pitch speech feature
fbank = kaldiio.load_mat(info["input"][0]["feat"])
plt.matshow(fbank.T[::-1])
plt.title(key + ": " + info["output"][0]["text"])
# load token dict
with open(root + "/data/lang_1char/train_nodev_units.txt", "r") as f:
    token_list = [entry.split()[0] for entry in f]
token_list.insert(0, '<blank>')
token_list.append('<eos>')
# recognize speech
parser = get_parser()
args = parser.parse_args([
    "--beam-size", "1",
    "--ctc-weight", "0",
    "--result-label", "out.json",
    "--model", ""
])
model.cpu()
model.eval()
def to_str(result):
    return "".join(token_list[y] for y in result[0]["yseq"]) \
        .replace("<eos>", "").replace("<space>", " ").replace("<blank>", "")
print("groundtruth:", info["output"][0]["text"])
print("prediction: ", to_str(model.recognize(fbank, args, token_list)))