Text-to-Speech (Recipe)

This example notebook shows how to run the ESPnet TTS recipe using the an4 dataset.
You can get an overview of the TTS recipe through this notebook within an hour!

Setup environment

First, let's set up the environment to run the recipe.
It takes around 10 minutes, so please wait for a while.

# OS setup
!sudo apt-get install bc tree
!cat /etc/os-release

# espnet setup
!git clone https://github.com/espnet/espnet
!cd espnet; pip install -e .

# warp ctc setup
!git clone https://github.com/espnet/warp-ctc -b pytorch-1.1
!cd warp-ctc && mkdir build && cd build && cmake .. && make -j
!cd warp-ctc/pytorch_binding && python setup.py install 

# kaldi setup
!cd /content/espnet/tools; git clone https://github.com/kaldi-asr/kaldi
!echo "" > ./espnet/tools/kaldi/tools/extras/check_dependencies.sh # ignore check
!chmod +x ./espnet/tools/kaldi/tools/extras/check_dependencies.sh
!cd ./espnet/tools/kaldi/tools; make sph2pipe sclite
!rm -rf espnet/tools/kaldi/tools/python
!wget https://18-198329952-gh.circle-artifacts.com/0/home/circleci/repo/ubuntu16-featbin.tar.gz
!tar -xf ./ubuntu16-featbin.tar.gz # take a few minutes
!cp featbin/* espnet/tools/kaldi/src/featbin/

# make dummy activate
!mkdir -p espnet/tools/venv/bin
!touch espnet/tools/venv/bin/activate

Run the recipe

Now we are ready to run the recipe!
We use the simplest recipe, egs/an4/tts1, as an example.

Unfortunately, egs/an4/tts1 is too small to generate reasonable speech.
But you can understand the flow of a TTS recipe through it, since all of the TTS recipes have exactly the same flow.

# Let's go to an4 recipe!
import os
os.chdir("/content/espnet/egs/an4/tts1")

Before running the recipe, let us check the recipe structure.

!tree -L 1

Each recipe has the same structure and files.

run.sh: Main script of the recipe. Running this script performs all of the processing: data download, preparation, feature extraction, training, and decoding.
cmd.sh: Command configuration file that defines how each processing step is run. You can modify this script if you want to run the recipe through a job control system, e.g. Slurm or Torque.
path.sh: Path configuration file. Basically, we do not have to touch it.
conf/: Directory containing configuration files.
local/: Directory containing the recipe-specific scripts, e.g. for data preparation.
steps/ and utils/: Directories containing Kaldi tools.

Main script run.sh consists of several stages:

stage -1: Download data if the data is available online.
stage 0: Prepare data to make Kaldi-style data directories.
stage 1: Extract feature vector, calculate statistics, and perform normalization.
stage 2: Prepare a dictionary and make json files for training.
stage 3: Train the E2E-TTS network.
stage 4: Decode mel-spectrogram using the trained network.
stage 5: Generate a waveform from a generated mel-spectrogram using Griffin-Lim.

Currently, we support the following networks:

Tacotron2: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Transformer: Neural Speech Synthesis with Transformer Network
FastSpeech: FastSpeech: Fast, Robust and Controllable Text to Speech

Let us check each stage step-by-step via --stage and --stop_stage options!
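The two options bound the range of stages that run: a stage is executed only when its number lies between --stage and --stop_stage, inclusive. The following is a minimal Python sketch of that selection logic, using the stage numbers listed above. The helper name stages_to_run is hypothetical and is not part of ESPnet.

```python
# Stage numbers and names as listed in this notebook.
STAGES = {
    -1: "Data download",
    0: "Data preparation",
    1: "Feature extraction",
    2: "Dictionary and json preparation",
    3: "Network training",
    4: "Network decoding",
    5: "Waveform synthesis",
}


def stages_to_run(stage, stop_stage):
    """Return the stage numbers n with stage <= n <= stop_stage (hypothetical helper)."""
    return [n for n in sorted(STAGES) if stage <= n <= stop_stage]


# Equivalent of passing "--stage 1 --stop_stage 2".
for n in stages_to_run(stage=1, stop_stage=2):
    print(f"stage {n}: {STAGES[n]}")
```

Passing the same number to both options, as the following cells do, runs exactly one stage.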

Stage -1: Data download

This stage downloads the dataset if it is available online.

!./run.sh --stage -1 --stop_stage -1

!tree -L 1
!ls downloads/

You can see that the downloads directory has been created, which contains the downloaded an4 dataset.

Stage 0: Data preparation

This stage creates Kaldi-style data directories.

!./run.sh --stage 0 --stop_stage 0

!tree -L 1 data

The data preparation stage creates Kaldi-style data directories.
Here, data/train/ corresponds to the training set, and data/test/ corresponds to the evaluation set.
Each directory contains the same set of files:

!ls data/*

The above four files are all we have to prepare to create new recipes.
Let's check each file.

!head -n 3 data/train/{wav.scp,text,utt2spk,spk2utt}

Each file contains the following information:

wav.scp: List of audio paths. Each line has <utt_id> <wavfile_path or command pipe>. <utt_id> must be unique.
text: List of transcriptions. Each line has <utt_id> <transcription>. In the case of TTS, we assume that <transcription> is cleaned.
utt2spk: Correspondence table between utterances and speakers. Each line has <utt_id> <speaker_id>.
spk2utt: Correspondence table between speakers and utterances. Each line has <speaker_id> <utt_id> ... <utt_id>. This file can be automatically created from utt2spk.

In ESPnet, speaker information is not used for any processing.
Therefore, utt2spk and spk2utt can be dummy files.
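Since spk2utt is just utt2spk grouped by speaker, the conversion is easy to sketch. The Python snippet below only illustrates the two file formats described above. The function name and the ids are made up for illustration, and the sorting is a choice of this sketch.

```python
from collections import defaultdict


def utt2spk_to_spk2utt(utt2spk_lines):
    """Group utterance ids by speaker id and return spk2utt lines."""
    spk2utt = defaultdict(list)
    for line in utt2spk_lines:
        if not line.strip():
            continue  # skip empty lines
        utt_id, spk_id = line.split()
        spk2utt[spk_id].append(utt_id)
    # One line per speaker: "<speaker_id> <utt_id> ... <utt_id>"
    return [f"{spk} {' '.join(sorted(utts))}" for spk, utts in sorted(spk2utt.items())]


# Made-up ids for illustration only.
utt2spk = ["spk1-utt1 spk1", "spk1-utt2 spk1", "spk2-utt1 spk2"]
for line in utt2spk_to_spk2utt(utt2spk):
    print(line)
```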

Stage 1: Feature extraction

This stage performs the following processing:

Mel-spectrogram extraction
Data split into training and validation sets
Statistics (mean and variance) calculation
Normalization

!./run.sh --stage 1 --stop_stage 1 --nj 4

Raw filterbanks are saved in the fbank/ directory in ark/scp format.

!ls fbank

.ark is a binary file, and .scp contains the correspondence between <utt_id> and <path_in_ark>.
Since feature extraction can be performed in parallel on small subsets, raw_fbank is split into raw_fbank_*.{1..N}.{scp,ark}.

!head -n 3 fbank/raw_fbank_train.1.scp
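To make the scp format concrete, here is a small Python sketch that splits one scp line into its parts. It assumes the common "<utt_id> <ark_path>:<byte_offset>" layout; the function name and the sample entry are made up for illustration.

```python
def parse_scp_line(line):
    """Split one scp line into (utt_id, ark_path, byte_offset).

    Assumes the "<utt_id> <ark_path>:<offset>" form. The offset is None
    when the entry has no ":<offset>" suffix.
    """
    utt_id, location = line.strip().split(maxsplit=1)
    path, sep, offset = location.rpartition(":")
    if sep and offset.isdigit():
        return utt_id, path, int(offset)
    return utt_id, location, None


# Made-up entry for illustration only.
print(parse_scp_line("utt1 /path/to/raw_fbank_train.1.ark:5"))
```

The offset lets a reader seek directly to one utterance inside a large ark file instead of scanning it from the start.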

These files can be loaded in Python via kaldiio as follows:

import kaldiio
import matplotlib.pyplot as plt

# load scp file
scp_dict = kaldiio.load_scp("fbank/raw_fbank_train.1.scp")
for key in scp_dict:
    plt.imshow(scp_dict[key].T[::-1])
    plt.title(key)
    plt.colorbar()
    plt.show()
    break
    
# load ark file
ark_generator = kaldiio.load_ark("fbank/raw_fbank_train.1.ark")
for key, array in ark_generator:
    plt.imshow(array.T[::-1])
    plt.title(key)
    plt.colorbar()
    plt.show()
    break

After raw mel-spectrogram extraction, some files are added to data/train/.
feats.scp is the concatenation of the scp files fbank/raw_fbank_train.{1..N}.scp.
utt2num_frames contains the number of feature frames for each <utt_id>.

!ls data/train
!head -n 3 data/train/{feats.scp,utt2num_frames}
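The concatenation that produces feats.scp can be sketched in a few lines of Python. This is only an illustration of the idea, not the script the recipe runs: the function name is hypothetical and the scp entries are made up. The parts are ordered numerically so that part 10 comes after part 9.

```python
import re
import tempfile
from pathlib import Path


def concat_split_scp(fbank_dir, name):
    """Concatenate raw_fbank_<name>.{1..N}.scp in numeric part order."""
    pattern = re.compile(rf"raw_fbank_{re.escape(name)}\.(\d+)\.scp$")
    parts = []
    for path in Path(fbank_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            parts.append((int(match.group(1)), path))
    lines = []
    for _, path in sorted(parts):  # numeric order of the part index
        lines.extend(path.read_text().splitlines())
    return lines


# Build a tiny fake fbank/ directory with made-up entries.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "raw_fbank_train.1.scp").write_text("utt1 a.ark:5\n")
    Path(tmp, "raw_fbank_train.2.scp").write_text("utt2 b.ark:5\n")
    feats_scp = concat_split_scp(tmp, "train")
print(feats_scp)
```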

Then the data/train/ directory is split into two directories:

data/train_nodev/: data directory for training
data/train_dev/: data directory for validation

!ls data
!ls data/train_*

You can find cmvn.ark in data/train_nodev, which is the file of calculated statistics.
This file can also be loaded in Python via kaldiio.

# load cmvn.ark file (note: use load_mat, not load_ark)
cmvn = kaldiio.load_mat("data/train_nodev/cmvn.ark")

# cmvn holds accumulated statistics: row 0 is the per-dimension sum and row 1 is the
# per-dimension sum of squares; the last element of row 0 is the number of frames.
print("cmvn shape = " + str(cmvn.shape))

# calculate mean and variance
mu = cmvn[0, :-1] / cmvn[0, -1]
var = cmvn[1, :-1] / cmvn[0, -1] - mu ** 2  # E[x^2] - (E[x])^2

# show mean and variance
print("mean = " + str(mu))
print("variance = " + str(var))

Normalized features for the training, validation, and evaluation sets are dumped in dump/{train_nodev,train_dev,test}/.
Their ark and scp files can be loaded in the same way as above.

!ls dump/*
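The statistics calculation and normalization steps of this stage can be illustrated with NumPy on synthetic data. The sketch below assumes that the statistics are accumulated as a (2, dim + 1) matrix (sums and frame count in row 0, sums of squares in row 1) and that normalization is mean-variance normalization. The function names and the random features are made up for illustration.

```python
import numpy as np


def accumulate_cmvn_stats(feats):
    """Accumulate statistics as a (2, dim + 1) matrix.

    Row 0: per-dimension sums, with the frame count in the last column.
    Row 1: per-dimension sums of squares (last column unused).
    """
    n_frames, dim = feats.shape
    stats = np.zeros((2, dim + 1))
    stats[0, :dim] = feats.sum(axis=0)
    stats[0, dim] = n_frames
    stats[1, :dim] = (feats ** 2).sum(axis=0)
    return stats


def stats_to_mean_var(stats):
    """Recover per-dimension mean and variance from accumulated statistics."""
    count = stats[0, -1]
    mean = stats[0, :-1] / count
    var = stats[1, :-1] / count - mean ** 2  # E[x^2] - (E[x])^2
    return mean, var


rng = np.random.default_rng(0)
feats = rng.normal(loc=3.0, scale=2.0, size=(1000, 80))  # synthetic stand-in for a mel-spectrogram
mean, var = stats_to_mean_var(accumulate_cmvn_stats(feats))

normalized = (feats - mean) / np.sqrt(var)   # normalization
restored = normalized * np.sqrt(var) + mean  # de-normalization (the inverse operation)
print(normalized.mean(), normalized.std())
```

The de-normalization line is the inverse mapping, which is needed later to bring generated features back to the original scale before waveform synthesis.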

Stage 2: Dictionary and json preparation

This stage creates a dictionary from data/train_nodev/text and makes json files for training.

!./run.sh --stage 2 --stop_stage 2

The dictionary file will be created in data/lang_1char/.
Each line of the dictionary file consists of <token> <token index>.
Here, <token index> starts from 1 because 0 is used as the padding index.

!ls data
!cat data/lang_1char/train_nodev_units.txt
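A character-level dictionary of this kind can be sketched as follows. The snippet is an illustration under stated assumptions, not the recipe's own script: reserving index 1 for an "<unk>" token and writing spaces as "<space>" are assumptions of this sketch, and the transcriptions are made up.

```python
def build_char_dict(transcriptions):
    """Build "<token> <token index>" lines from character-level tokens.

    Index 0 is left unused for padding. Reserving index 1 for "<unk>" and
    writing spaces as "<space>" are assumptions of this sketch.
    """
    tokens = set()
    for text in transcriptions:
        tokens.update("<space>" if ch == " " else ch for ch in text)
    lines = ["<unk> 1"]
    lines += [f"{tok} {idx}" for idx, tok in enumerate(sorted(tokens), start=2)]
    return lines


# Made-up transcriptions for illustration only.
for line in build_char_dict(["GO", "NO GO"]):
    print(line)
```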

Json files will be created for the training, validation, and evaluation sets, and they are saved as dump/{train_nodev,train_dev,test}/data.json.

!ls dump/*/*.json

Each json file contains all of the information in the data directory.

!head -n 27 dump/train_nodev/data.json

"shape": Shape of the input or output sequence. Here, the input shape [63, 80] means that the number of frames is 63 and the dimension of the mel-spectrogram is 80.
"text": Original transcription.
"token": Token sequence of the original transcription.
"tokenid": Token id sequence of the original transcription, converted using the dictionary.

Now ready to start training!
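The relation between the "text", "token", and "tokenid" fields can be sketched in Python. The dictionary below is hypothetical (a real one would be read from data/lang_1char/), and storing the sequences as space-separated strings is an assumption of this sketch rather than a statement about the exact json layout.

```python
def text_to_tokens(text):
    """Split a transcription into character tokens, writing spaces as "<space>"."""
    return ["<space>" if ch == " " else ch for ch in text]


def tokens_to_ids(tokens, token2id, unk="<unk>"):
    """Map tokens to dictionary indices, falling back to the unknown token."""
    return [token2id.get(tok, token2id[unk]) for tok in tokens]


# Hypothetical dictionary for illustration only.
token2id = {"<unk>": 1, "<space>": 2, "G": 3, "N": 4, "O": 5}

text = "NO GO"
tokens = text_to_tokens(text)
entry = {
    "text": text,
    "token": " ".join(tokens),
    "tokenid": " ".join(str(i) for i in tokens_to_ids(tokens, token2id)),
}
print(entry)
```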

Stage 3: Network training

This stage trains the network.
Network training configurations are written in .yaml format files.
Let us check the default configuration conf/train_pytorch_tacotron2.yaml.

!cat conf/train_pytorch_tacotron2.yaml

You can modify this configuration file to change the hyperparameters.
Here, let's change the number of epochs for this demonstration.

# TODO(kan-bayashi): Change here to use change_yaml.py
!cat conf/train_pytorch_tacotron2.yaml | sed -e "s/epochs: 50/epochs: 3/g" > conf/train_pytorch_tacotron2_sample.yaml
!cat conf/train_pytorch_tacotron2_sample.yaml
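The same edit can be done from Python. The sketch below replaces the value of a top-level "key: value" line with a regular expression and, unlike the sed command, raises an error when the key is not found exactly once. The function name and the sample config text are made up for illustration, and this is not the change_yaml.py script mentioned in the TODO.

```python
import re


def set_yaml_scalar(yaml_text, key, value):
    """Replace the value of a top-level "key: value" line and return the new text."""
    pattern = re.compile(rf"^{re.escape(key)}:.*$", flags=re.MULTILINE)
    new_text, n_subs = pattern.subn(f"{key}: {value}", yaml_text)
    if n_subs != 1:
        raise KeyError(f"expected exactly one top-level '{key}' entry, found {n_subs}")
    return new_text


# Made-up config text for illustration; the real config has many more entries.
sample = "other-option: 1\nepochs: 50\n"
print(set_yaml_scalar(sample, "epochs", 3))
```

Failing loudly is useful here because a sed pattern that does not match silently leaves the config unchanged.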

Let's train the network.
You can specify the config file via the --train_config option. Training takes several minutes.

!./run.sh --stage 3 --stop_stage 3 --train_config conf/train_pytorch_tacotron2_sample.yaml --verbose 1

You can see the training log in exp/train_*/train.log.

The models are saved in the exp/train_*/results/ directory.

!ls exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/{results,results/att_ws}

exp/train_*/results/*.png are the figures of the training curves.

from IPython.display import Image, display_png
print("all loss curve")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/results/all_loss.png"))
print("l1 loss curve")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/results/l1_loss.png"))
print("mse loss curve")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/results/mse_loss.png"))
print("bce loss curve")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/results/bce_loss.png"))

exp/train_*/results/att_ws/*.png are the figures of the attention weights in each epoch.

print("Attention weights of initial epoch")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/results/att_ws/fash-cen1-b.ep.1.png"))

exp/train_*/results/model.loss.best contains only the model parameters.
On the other hand, exp/train_*/results/snapshot contains the model parameters, optimizer states, and iterator states.
So you can resume training by specifying the snapshot file with the --resume option.
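When several snapshots exist, the most recent one has to be picked by epoch number rather than by file name order, since "snapshot.ep.10" sorts before "snapshot.ep.2" as a string. A minimal Python sketch, assuming snapshot files are named snapshot.ep.<N> as in this notebook (the helper name is hypothetical):

```python
import re
import tempfile
from pathlib import Path


def latest_snapshot(results_dir):
    """Return the snapshot.ep.<N> path with the largest N, or None if there is none."""
    best = None
    for path in Path(results_dir).glob("snapshot.ep.*"):
        match = re.fullmatch(r"snapshot\.ep\.(\d+)", path.name)
        if match and (best is None or int(match.group(1)) > best[0]):
            best = (int(match.group(1)), path)
    return None if best is None else best[1]


# Demonstrate on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for epoch in (1, 2, 10):
        Path(tmp, f"snapshot.ep.{epoch}").touch()
    found = latest_snapshot(tmp).name
print(found)
```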

# resume training from snapshot.ep.2
!./run.sh --stage 3 --stop_stage 3 --train_config conf/train_pytorch_tacotron2_sample.yaml --resume exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/results/snapshot.ep.2 --verbose 1

!cat exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/train.log

TensorBoard is also supported.
You can see the training log through TensorBoard.

%load_ext tensorboard
%tensorboard --logdir tensorboard/train_nodev_pytorch_train_pytorch_tacotron2_sample/

Stage 4: Network decoding

This stage performs decoding using the trained model to generate a mel-spectrogram from a given text.

!./run.sh --stage 4 --stop_stage 4 --nj 8 --train_config conf/train_pytorch_tacotron2_sample.yaml

Generated features are saved in ark/scp format.

!ls exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/outputs_model.loss.best_decode/*

We can specify the model or snapshot to be used for decoding via the --model option.

!./run.sh --stage 4 --stop_stage 4 --nj 8 --train_config conf/train_pytorch_tacotron2_sample.yaml --model snapshot.ep.2

!ls exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/outputs_snapshot.ep.2_decode/*

Stage 5: Waveform synthesis

Finally, in this stage, we generate waveforms using the Griffin-Lim algorithm.
First, we perform de-normalization to convert the generated mel-spectrogram back to the original scale.
Then we apply the Griffin-Lim algorithm to restore the phase components and apply the inverse STFT to generate waveforms.

!./run.sh --stage 5 --stop_stage 5 --nj 8 --train_config conf/train_pytorch_tacotron2_sample.yaml --griffin_lim_iters 50

Generated wav files are saved in exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/outputs_model.loss.best_decode_denorm/*/wav

!ls exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/outputs_model.loss.best_decode_denorm/*/wav
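The phase restoration step can be illustrated with a NumPy-only sketch of the Griffin-Lim iteration. This is not the implementation used by the recipe: it starts from a linear-frequency magnitude spectrogram (no mel conversion or de-normalization), and the FFT size, hop size, and test signal are arbitrary values chosen for illustration. The n_iter argument plays the role of the iteration count.

```python
import numpy as np


def stft(x, n_fft=256, hop=64):
    """Short-time Fourier transform with a Hann window, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft=256, hop=64):
    """Inverse STFT by windowed overlap-add, normalized by the summed squared window."""
    window = np.hanning(n_fft)
    length = n_fft + hop * (spec.shape[0] - 1)
    x = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i, frame in enumerate(frames):
        x[i * hop:i * hop + n_fft] += frame * window
        norm[i * hop:i * hop + n_fft] += window ** 2
    return x / np.maximum(norm, 1e-8)


def griffin_lim(magnitude, n_iter=50, n_fft=256, hop=64, seed=0):
    """Estimate a waveform whose STFT magnitude approximates the given magnitude."""
    rng = np.random.default_rng(seed)
    angles = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random initial phase
    for _ in range(n_iter):
        signal = istft(magnitude * angles, n_fft, hop)
        rebuilt = stft(signal, n_fft, hop)
        angles = np.exp(1j * np.angle(rebuilt))  # keep the phase, discard the magnitude
    return istft(magnitude * angles, n_fft, hop)


# Synthetic test signal: a 440 Hz tone sampled at 16 kHz.
x = np.sin(2 * np.pi * 440 * np.arange(1536) / 16000)
magnitude = np.abs(stft(x))  # keep magnitudes, throw away the phase
y = griffin_lim(magnitude, n_iter=50)
print(x.shape, y.shape)
```

Each iteration keeps the target magnitude fixed and only updates the phase estimate, so more iterations generally give a more consistent spectrogram at the cost of more computation.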

!tree -L 3

Next steps

Try a pretrained model to generate speech.
Try a large single-speaker dataset recipe, egs/ljspeech/tts1.
Try a large multi-speaker recipe, egs/libritts/tts1.
Make your own recipe using your own dataset.