Speech Recognition (Recipe)
Author: Shigeki Karita
July 29 2019
ESPnet Hackathon 2019 @Tokyo
Abstract
This tutorial shows a practical ASR example using ESPnet, both as a command line interface and as a library.
See also
- documentation https://espnet.github.io/espnet/
- github https://github.com/espnet
Installation
ESPnet depends on the Kaldi ASR toolkit and Warp-CTC. Installing them will take a few minutes.
# OS setup
!sudo apt-get install bc tree
!cat /etc/os-release
# espnet setup
!git clone https://github.com/espnet/espnet
!cd espnet; pip install -e .
!mkdir -p espnet/tools/venv/bin; touch espnet/tools/venv/bin/activate
# warp ctc setup
!git clone https://github.com/espnet/warp-ctc -b pytorch-1.1
!cd warp-ctc && mkdir build && cd build && cmake .. && make -j4
!cd warp-ctc/pytorch_binding && python setup.py install
# kaldi setup
!cd ./espnet/tools; git clone https://github.com/kaldi-asr/kaldi
!echo "" > ./espnet/tools/kaldi/tools/extras/check_dependencies.sh # ignore check
!chmod +x ./espnet/tools/kaldi/tools/extras/check_dependencies.sh
!cd ./espnet/tools/kaldi/tools; make sph2pipe sclite
!rm -rf espnet/tools/kaldi/tools/python
![ ! -e ubuntu16-featbin.tar.gz ] && wget https://18-198329952-gh.circle-artifacts.com/0/home/circleci/repo/ubuntu16-featbin.tar.gz
!tar -xf ./ubuntu16-featbin.tar.gz
!cp featbin/* espnet/tools/kaldi/src/featbin/
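If the installation succeeded, the Python packages should now be importable. A quick sanity check (this only verifies the imports, not the Kaldi binaries):
# sanity check: the installed packages should import cleanly
import torch
import espnet
print("torch:", torch.__version__)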
ESPnet command line usage (espnet/egs/xxx)
You can use the end-to-end script run.sh to reproduce the systems reported in espnet/egs/*/asr1/RESULTS.md. Typically, run.sh is organized into several stages:
- Data download (if available)
- Kaldi-style data preparation
- Save Python-friendly data (e.g., JSON, HDF5)
- Language model training
- ASR model training
- Decoding and evaluation
!ls espnet/egs
Stage 0 - 2 Data preparation
For example, if you add --stop-stage 2, you can stop the script before neural network training.
!cd espnet/egs/an4/asr1; ./run.sh --ngpu 1 --stop-stage 2
Kaldi-style directory structure
Each recipe placed in egs/xxx/asr1 is organized in the Kaldi way:
- conf/: Kaldi configurations, e.g., speech feature settings
- data/: almost raw data prepared by Kaldi
- exp/: intermediate files through experiments, e.g., log files, model parameters
- fbank/: speech feature binary files, e.g., ark, scp
- dump/: ESPnet meta data for training, e.g., json, hdf5
- local/: corpus-specific data preparation scripts
- steps/, utils/: Kaldi's helper scripts
!tree -L 1 espnet/egs/an4/asr1
TIPS: essential files in data preparation
To create a new recipe, all you need to write is stage 1, which creates two key-value pair files:
- speech: data/xxx/wav.scp
- text: data/xxx/text
raw speech file list
!head espnet/egs/an4/asr1/data/train/wav.scp
raw text list
!head espnet/egs/an4/asr1/data/train/text
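Both files are plain "key value" text, so they are also easy to inspect from Python. A minimal sketch (load_kv is a hypothetical helper, not part of ESPnet; the recipe itself handles these files with Kaldi scripts):
# parse a Kaldi-style key-value file: each line is "<utt-id> <value...>"
def load_kv(path):
    with open(path) as f:
        return dict(line.rstrip("\n").split(" ", 1) for line in f if line.strip())

wavs = load_kv("espnet/egs/an4/asr1/data/train/wav.scp")
texts = load_kv("espnet/egs/an4/asr1/data/train/text")
print(len(wavs), "utterances")
print(next(iter(texts.items())))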
TIPS: explore datasets with data.json
To explore datasets easily, ESPnet stores the metadata dump/xxx/data.json in stage 2.
import json
import matplotlib.pyplot as plt
import kaldiio
# load the speech/text pair at index 10 in data.json
root = "espnet/egs/an4/asr1"
with open(root + "/dump/test/deltafalse/data.json", "r") as f:
    test_json = json.load(f)["utts"]
key, info = list(test_json.items())[10]
# plot the speech feature
fbank = kaldiio.load_mat(info["input"][0]["feat"])
plt.matshow(fbank.T[::-1])
plt.title(key + ": " + info["output"][0]["text"])
# print the key-value pair
key, info
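Because data.json also records tensor shapes, you can summarize a whole set without reading any feature file. A small sketch (it assumes each utterance stores its feature shape under input[0]["shape"], as in the dump above):
# summarize input feature lengths (frames) over the test set
lengths = [utt["input"][0]["shape"][0] for utt in test_json.values()]
print("utterances:", len(lengths))
print("frames: min %d / mean %.1f / max %d" % (min(lengths), sum(lengths) / len(lengths), max(lengths)))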
Stage 3 - 4 NN Training
Let's go to the most interesting part...
!tail espnet/egs/an4/asr1/conf/train_mtlalpha1.0.yaml
!cd espnet/egs/an4/asr1; ./run.sh --ngpu 1 --stage 3 --stop-stage 4 --train-config ./conf/train_mtlalpha1.0.yaml
TIPS: change_yaml.py
You can tweak the YAML config with $(change_yaml.py xxx.yaml -a yyy=zzz), which creates a modified copy of the config and prints its path:
!cd espnet/egs/an4/asr1; source path.sh; \
./run.sh --ngpu 1 --stage 4 --stop-stage 4 \
--train-config $(change_yaml.py ./conf/train_mtlalpha1.0.yaml -a eunits=100)
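To check the tweak before training, you can grep the generated file; change_yaml.py prints the path of the modified copy (an illustrative check, not a required step):
!cd espnet/egs/an4/asr1; source path.sh; \
  grep eunits $(change_yaml.py ./conf/train_mtlalpha1.0.yaml -a eunits=100)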
TIPS: tensorboard
You can easily monitor the effects of the config with TensorBoard.
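For example, in a notebook you can launch it inline (this assumes run.sh wrote event files under tensorboard/ in the recipe directory, the default location in these recipes):
%load_ext tensorboard
%tensorboard --logdir espnet/egs/an4/asr1/tensorboard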
Decoding and evaluation
Decode config (change_yaml.py also works):
!cat espnet/egs/an4/asr1/conf/decode_ctcweight1.0.yaml
Command line usage
!cd espnet/egs/an4/asr1; ./run.sh --stage 5
ASR result as data.json
!head -n20 espnet/egs/an4/asr1/exp/train_nodev_pytorch_train_mtlalpha1.0/decode_test_decode_ctcweight1.0_lm_word100/data.json
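You can also read the decoded data.json back from Python to compare references and hypotheses side by side (this assumes the decoder stores each hypothesis under output[0]["rec_text"], next to the reference text):
import json
# compare the first few references and hypotheses from the decode output
decode_dir = "espnet/egs/an4/asr1/exp/train_nodev_pytorch_train_mtlalpha1.0/decode_test_decode_ctcweight1.0_lm_word100"
with open(decode_dir + "/data.json", "r") as f:
    utts = json.load(f)["utts"]
for key, info in list(utts.items())[:3]:
    out = info["output"][0]
    print(key)
    print("  ref:", out["text"])
    print("  hyp:", out.get("rec_text", "").replace("<eos>", ""))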
Recognize speech from Python
Let's use ESPnet as a library together with the trained model:
!ls espnet/egs/an4/asr1/exp/train_nodev_pytorch_train_mtlalpha1.0/results
recap: load speech from data.json
import json
import matplotlib.pyplot as plt
import kaldiio
# load the speech/text pair at index 10 in data.json
root = "espnet/egs/an4/asr1"
with open(root + "/dump/test/deltafalse/data.json", "r") as f:
    test_json = json.load(f)["utts"]
key, info = list(test_json.items())[10]
# plot the speech feature
fbank = kaldiio.load_mat(info["input"][0]["feat"])
plt.matshow(fbank.T[::-1])
plt.title(key + ": " + info["output"][0]["text"])
load model
import json
import torch
import argparse
from espnet.bin.asr_recog import get_parser
from espnet.nets.pytorch_backend.e2e_asr import E2E
root = "espnet/egs/an4/asr1"
model_dir = root + "/exp/train_nodev_pytorch_train_mtlalpha1.0/results"
# load model
with open(model_dir + "/model.json", "r") as f:
    idim, odim, conf = json.load(f)
model = E2E(idim, odim, argparse.Namespace(**conf))
model.load_state_dict(torch.load(model_dir + "/model.loss.best"))
model.cpu().eval()
# load token dict
with open(root + "/data/lang_1char/train_nodev_units.txt", "r") as f:
    token_list = [entry.split()[0] for entry in f]
token_list.insert(0, '<blank>')
token_list.append('<eos>')
# recognize speech
parser = get_parser()
args = parser.parse_args(["--beam-size", "2", "--ctc-weight", "1.0", "--result-label", "out.json", "--model", ""])
result = model.recognize(fbank, args, token_list)
s = "".join(conf["char_list"][y] for y in result[0]["yseq"]).replace("<eos>", "").replace("<space>", " ").replace("<blank>", "")
print("groundtruth:", info["output"][0]["text"])
print("prediction: ", s)
# play the recognized speech
import os
import kaldiio
from IPython.display import Audio

d = os.getcwd()
try:
    # move into the recipe dir so the relative paths in wav.scp resolve
    os.chdir(root)
    sr, wav = kaldiio.load_scp("data/test/wav.scp")[key]
finally:
    os.chdir(d)
Audio(wav, rate=sr)