Text-to-Speech (Recipe)
This is an example notebook showing how to run the ESPnet TTS recipe using the an4 dataset.
You can get an overview of the TTS recipe through this notebook within an hour!
See also:
- Documentation: https://espnet.github.io/espnet
- Github: https://github.com/espnet
Author: Tomoki Hayashi
Last update: 2019/07/25
Setup environment
First, let's set up the environment to run the recipe.
It takes around 10 minutes. Please wait for a while.
# OS setup
!sudo apt-get install bc tree
!cat /etc/os-release
# espnet setup
!git clone https://github.com/espnet/espnet
!cd espnet; pip install -e .
# warp ctc setup
!git clone https://github.com/espnet/warp-ctc -b pytorch-1.1
!cd warp-ctc && mkdir build && cd build && cmake .. && make -j
!cd warp-ctc/pytorch_binding && python setup.py install
# kaldi setup
!cd /content/espnet/tools; git clone https://github.com/kaldi-asr/kaldi
!echo "" > ./espnet/tools/kaldi/tools/extras/check_dependencies.sh # ignore check
!chmod +x ./espnet/tools/kaldi/tools/extras/check_dependencies.sh
!cd ./espnet/tools/kaldi/tools; make sph2pipe sclite
!rm -rf espnet/tools/kaldi/tools/python
!wget https://18-198329952-gh.circle-artifacts.com/0/home/circleci/repo/ubuntu16-featbin.tar.gz
!tar -xf ./ubuntu16-featbin.tar.gz # take a few minutes
!cp featbin/* espnet/tools/kaldi/src/featbin/
# make dummy activate
!mkdir -p espnet/tools/venv/bin
!touch espnet/tools/venv/bin/activate
Run the recipe
Now we are ready to run the recipe!
We use the simplest recipe, egs/an4/tts1, as an example.
Unfortunately, egs/an4/tts1 is too small to generate reasonable speech.
But you can understand the flow of the TTS recipe through it, since all of the TTS recipes have exactly the same flow.
# Let's go to an4 recipe!
import os
os.chdir("/content/espnet/egs/an4/tts1")
Before running the recipe, let us check the recipe structure.
!tree -L 1
Each recipe has the same structure and files.
- run.sh: Main script of the recipe. Once you run this script, all of the processing will be conducted, from data download and preparation to feature extraction, training, and decoding.
- cmd.sh: Command configuration source file defining how each processing step is run. You can modify it if you want to run the scripts through a job control system, e.g. Slurm or Torque.
- path.sh: Path configuration source file. Basically, we do not have to touch it.
- conf/: Directory containing configuration files.
- local/: Directory containing recipe-specific scripts, e.g. data preparation.
- steps/ and utils/: Directories containing Kaldi tools.
Main script run.sh consists of several stages:
- stage -1: Download data if the data is available online.
- stage 0: Prepare data to make a kaldi-style data directory.
- stage 1: Extract feature vectors, calculate statistics, and perform normalization.
- stage 2: Prepare a dictionary and make json files for training.
- stage 3: Train the E2E-TTS network.
- stage 4: Decode mel-spectrogram using the trained network.
- stage 5: Generate a waveform from a generated mel-spectrogram using Griffin-Lim.
Currently, we support the following networks:
- Tacotron2: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
- Transformer: Neural Speech Synthesis with Transformer Network
- FastSpeech: FastSpeech: Fast, Robust and Controllable Text to Speech
Let us check each stage step-by-step via the --stage and --stop_stage options!
Stage -1: Data download
This stage downloads the dataset if it is available online.
!./run.sh --stage -1 --stop_stage -1
!tree -L 1
!ls downloads/
You can see that the downloads directory has been created, which contains the downloaded an4 dataset.
Stage 0: Data preparation
This stage creates kaldi-style data directories.
!./run.sh --stage 0 --stop_stage 0
!tree -L 1 data
Through the data preparation stage, kaldi-style data directories will be created.
Here, data/train/ corresponds to the training set, and data/test to the evaluation set.
Each directory contains the same set of files:
!ls data/*
The above four files are all we have to prepare to create new recipes.
Let's check each file.
!head -n 3 data/train/{wav.scp,text,utt2spk,spk2utt}
Each file contains the following information:
- wav.scp: List of audio paths. Each line has <utt_id> <wavfile_path or command pipe>. <utt_id> must be unique.
- text: List of transcriptions. Each line has <utt_id> <transcription>. In the case of TTS, we assume that <transcription> is cleaned.
- utt2spk: Correspondence table between utterances and speakers. Each line has <utt_id> <speaker_id>.
- spk2utt: Correspondence table between speakers and utterances. Each line has <speaker_id> <utt_id> ... <utt_id>. This file can be automatically created from utt2spk.
In ESPnet, speaker information is not used for any processing.
Therefore, utt2spk and spk2utt can be dummies.
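As a supplementary illustration, here is a minimal Python sketch of how spk2utt can be derived from utt2spk (the recipe itself uses the Kaldi utility utils/utt2spk_to_spk2utt.pl for this; the sketch assumes stage 0 above has been run):
from collections import defaultdict

# group utterance ids by speaker id
spk2utt = defaultdict(list)
with open("data/train/utt2spk") as f:
    for line in f:
        utt_id, spk_id = line.split()
        spk2utt[spk_id].append(utt_id)

# print one line in the spk2utt format: <speaker_id> <utt_id> ... <utt_id>
for spk_id, utt_ids in sorted(spk2utt.items()):
    print(spk_id, " ".join(utt_ids))
    break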
Stage 1: Feature extraction
This stage performs the following processing:
- Mel-spectrogram extraction
- Data split into training and validation set
- Statistics (mean and variance) calculation
- Normalization
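To give a rough idea of what mel-spectrogram extraction does, here is a small self-contained sketch using librosa. This is not the recipe's actual feature extractor; the availability of librosa and the n_fft / hop_length / n_mels values below are assumptions for illustration only.
import numpy as np
import librosa

# use a synthetic 1-second 440 Hz tone so the sketch does not depend on the dataset
sr = 16000
y = 0.1 * np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr).astype(np.float32)
# compute an 80-dimensional log-mel-spectrogram (parameter values are illustrative only)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
logmel = np.log(np.maximum(mel, 1e-10))
print(logmel.shape)  # (n_mels, number_of_frames)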
!./run.sh --stage 1 --stop_stage 1 --nj 4
Raw filterbanks are saved in the fbank/ directory in ark/scp format.
!ls fbank
.ark is a binary file and .scp contains the correspondence between <utt_id> and <path_in_ark>.
Since feature extraction can be performed on small split subsets in parallel, raw_fbank is split into raw_fbank_*.{1..N}.{scp,ark}.
!head -n 3 fbank/raw_fbank_train.1.scp
These files can be loaded in python via kaldiio as follows:
import kaldiio
import matplotlib.pyplot as plt
# load scp file
scp_dict = kaldiio.load_scp("fbank/raw_fbank_train.1.scp")
for key in scp_dict:
plt.imshow(scp_dict[key].T[::-1])
plt.title(key)
plt.colorbar()
plt.show()
break
# load ark file
ark_generator = kaldiio.load_ark("fbank/raw_fbank_train.1.ark")
for key, array in ark_generator:
plt.imshow(array.T[::-1])
plt.title(key)
plt.colorbar()
plt.show()
    break
After raw mel-spectrogram extraction, some files are added in data/train/.
feats.scp is the concatenation of fbank/raw_fbank_train.{1..N}.scp.
utt2num_frames contains the number of feature frames of each <utt_id>.
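As a quick sanity check, the number of rows of each feature matrix should match the value in utt2num_frames; a minimal sketch (assuming stage 1 above has finished):
import kaldiio

# read utt2num_frames: each line is "<utt_id> <number_of_frames>"
with open("data/train/utt2num_frames") as f:
    utt2num_frames = dict(line.split() for line in f)

# compare with the shape of the corresponding feature matrix
feats = kaldiio.load_scp("data/train/feats.scp")
for key in feats:
    print(key, feats[key].shape[0], utt2num_frames[key])
    break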
!ls data/train
!head -n 3 data/train/{feats.scp,utt2num_frames}And data/train/ directory is split into two directory:
- data/train_nodev/: data directory for training
- data/train_dev/: data directory for validation
!ls data
!ls data/train_*
You can find cmvn.ark in data/train_nodev, which is the calculated statistics file.
This file can also be loaded in python via kaldiio.
# load cmvn.ark file (be careful to use load_mat, not load_ark)
cmvn = kaldiio.load_mat("data/train_nodev/cmvn.ark")
# cmvn holds accumulated statistics: row 0 is the per-dimension sum (its last element is the number of frames), row 1 is the sum of squares
print("cmvn shape = " + str(cmvn.shape))
# calculate mean and variance from the accumulated statistics
mu = cmvn[0, :-1] / cmvn[0, -1]
var = cmvn[1, :-1] / cmvn[0, -1] - mu * mu
# show mean and variance
print("mean = " + str(mu))
print("variance = " + str(var))Normalzed features for training, validation and evaluation set are dumped in dump/{train_nodev,train_dev,test}/.
These ark and scp files can be loaded in the same way as above.
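To illustrate what the normalization does, here is a minimal sketch that applies mean-variance normalization to one raw mel-spectrogram using the statistics in cmvn.ark. This is not the recipe's own normalization script; it assumes data/train_nodev/feats.scp was created by the split above.
import numpy as np
import kaldiio

# compute mean and standard deviation from the accumulated statistics
cmvn = kaldiio.load_mat("data/train_nodev/cmvn.ark")
count = cmvn[0, -1]
mu = cmvn[0, :-1] / count
std = np.sqrt(cmvn[1, :-1] / count - mu ** 2)

# normalize one raw feature matrix; the result should have mean close to 0 and std close to 1
raw = kaldiio.load_scp("data/train_nodev/feats.scp")
for key in raw:
    normalized = (raw[key] - mu) / std
    print(key, normalized.mean(axis=0)[:3], normalized.std(axis=0)[:3])
    break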
!ls dump/*
Stage 2: Dictionary and json preparation
This stage creates a dictionary from data/train_nodev/text and makes json files for training.
!./run.sh --stage 2 --stop_stage 2
The dictionary file will be created in data/lang_1char/.
The dictionary file consists of <token> <token index> pairs.
Here, <token index> starts from 1 because 0 is used as the padding index.
!ls data
!cat data/lang_1char/train_nodev_units.txt
Json files will be created for the training / validation / evaluation sets and saved as dump/{train_nodev,train_dev,test}/data.json.
!ls dump/*/*.json
Each json file contains all of the information in the data directory.
!head -n 27 dump/train_nodev/data.json
- "shape": Shape of the input or output sequence. Here, the input shape [63, 80] represents the number of frames = 63 and the dimension of the mel-spectrogram = 80.
- "text": Original transcription.
- "token": Token sequence of original transcription.
- "tokenid" Token id sequence of original transcription, which is converted using the dictionary.
Now we are ready to start training!
Stage 3: Network training
This stage performs training of the network.
Network training configurations are written in .yaml format.
Let us check the default configuration conf/train_pytorch_tacotron2.yaml.
!cat conf/train_pytorch_tacotron2.yaml
You can modify this configuration file to change the hyperparameters.
Here, let's change the number of epochs for this demonstration.
# TODO(kan-bayashi): Change here to use change_yaml.py
!cat conf/train_pytorch_tacotron2.yaml | sed -e "s/epochs: 50/epochs: 3/g" > conf/train_pytorch_tacotron2_sample.yaml
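The same edit can also be done programmatically; here is a minimal alternative sketch using PyYAML (it assumes PyYAML is installed and that the config is a flat key-value yaml file):
import yaml

# load the default config, override the number of epochs, and save it under a new name
with open("conf/train_pytorch_tacotron2.yaml") as f:
    config = yaml.safe_load(f)
config["epochs"] = 3
with open("conf/train_pytorch_tacotron2_sample.yaml", "w") as f:
    yaml.safe_dump(config, f)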
!cat conf/train_pytorch_tacotron2_sample.yaml
Let's train the network.
You can specify the config file via the --train_config option. Training takes several minutes.
!./run.sh --stage 3 --stop_stage 3 --train_config conf/train_pytorch_tacotron2_sample.yaml --verbose 1
You can see the training log in exp/train_*/train.log.
The models are saved in exp/train_*/results/ directory.
!ls exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/{results,results/att_ws}
exp/train_*/results/*.png are the figures of the training curves.
from IPython.display import Image, display_png
print("all loss curve")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/results/all_loss.png"))
print("l1 loss curve")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/results/l1_loss.png"))
print("mse loss curve")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/results/mse_loss.png"))
print("bce loss curve")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/results/bce_loss.png"))exp/train_*/results/att_ws/.png are the figures of attention weights in each epoch.
print("Attention weights of initial epoch")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/results/att_ws/fash-cen1-b.ep.1.png"))exp/train_*/results/model.loss.best contains only the model parameters.
On the other hand, exp/train_*/results/snapshot contains the model parameters, optimizer states, and iterator states.
So you can restart the training by specifying a snapshot file with the --resume option.
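If you are curious what a snapshot stores, you can peek into it; a minimal sketch (it assumes the snapshot file is readable with torch.load and that stage 3 above produced snapshot.ep.2):
import torch

# load the snapshot on CPU and list its top-level entries
snapshot = torch.load(
    "exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/results/snapshot.ep.2",
    map_location="cpu")
print(type(snapshot))
if isinstance(snapshot, dict):
    print(list(snapshot.keys()))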
# resume training from snapshot.ep.2
!./run.sh --stage 3 --stop_stage 3 --train_config conf/train_pytorch_tacotron2_sample.yaml --resume exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/results/snapshot.ep.2 --verbose 1
!cat exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/train.log
Also, we support tensorboard.
You can see the training log through tensorboard.
%load_ext tensorboard
%tensorboard --logdir tensorboard/train_nodev_pytorch_train_pytorch_tacotron2_sample/
Stage 4: Network decoding
This stage performs decoding using the trained model to generate mel-spectrograms from given texts.
!./run.sh --stage 4 --stop_stage 4 --nj 8 --train_config conf/train_pytorch_tacotron2_sample.yaml
Generated features are saved in ark/scp format.
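You can visualize a generated mel-spectrogram in the same way as before; a minimal sketch (the directory layout below is an assumption based on the ls output in the next cell):
import glob
import kaldiio
import matplotlib.pyplot as plt

# pick one scp file from the decode outputs and plot the first generated feature
scp = glob.glob("exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/outputs_model.loss.best_decode/*/*.scp")[0]
generated = kaldiio.load_scp(scp)
for key in generated:
    plt.imshow(generated[key].T[::-1])
    plt.title(key)
    plt.colorbar()
    plt.show()
    break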
!ls exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/outputs_model.loss.best_decode/*
We can specify the model or snapshot to be used for decoding via the --model option.
!./run.sh --stage 4 --stop_stage 4 --nj 8 --train_config conf/train_pytorch_tacotron2_sample.yaml --model snapshot.ep.2
!ls exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/outputs_snapshot.ep.2_decode/*
Stage 5: Waveform synthesis
Finally, in this stage, we generate waveforms using the Griffin-Lim algorithm.
First, we perform de-normalization to convert the generated mel-spectrogram back into the original scale.
Then we apply the Griffin-Lim algorithm to restore the phase components and apply the inverse STFT to generate waveforms.
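The idea of this stage can be sketched with librosa as follows. This is not the recipe's own synthesis script; the output directory pattern, the use of librosa, and the sampling rate / n_fft / hop_length values are assumptions for illustration only.
import glob
import numpy as np
import kaldiio
import librosa

# pick one generated (normalized log-mel) feature from the stage-4 outputs
scp = glob.glob("exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/outputs_model.loss.best_decode/*/*.scp")[0]
generated = kaldiio.load_scp(scp)
key = next(iter(generated))
logmel_norm = generated[key]

# de-normalize with the training statistics and undo the log
cmvn = kaldiio.load_mat("data/train_nodev/cmvn.ark")
mu = cmvn[0, :-1] / cmvn[0, -1]
std = np.sqrt(cmvn[1, :-1] / cmvn[0, -1] - mu ** 2)
mel = np.exp(logmel_norm * std + mu).T  # (n_mels, frames) for librosa

# approximately invert the mel filterbank and run Griffin-Lim to restore the phase
wav = librosa.feature.inverse.mel_to_audio(mel, sr=16000, n_fft=1024, hop_length=256, n_iter=50)
print(key, wav.shape)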
!./run.sh --stage 5 --stop_stage 5 --nj 8 --train_config conf/train_pytorch_tacotron2_sample.yaml --griffin_lim_iters 50
Generated wav files are saved in exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/outputs_model.loss.best_decode_denorm/*/wav.
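You can also listen to one of the generated wav files directly in the notebook; a minimal sketch (it assumes stage 5 above has finished and at least one wav file exists):
import glob
from IPython.display import Audio, display

# pick the first generated wav file and embed an audio player
wav_files = sorted(glob.glob("exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/outputs_model.loss.best_decode_denorm/*/wav/*.wav"))
display(Audio(wav_files[0]))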
!ls exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/outputs_model.loss.best_decode_denorm/*/wav
!tree -L 3
NEXT step
- Try a pretrained model to generate speech.
- Try a large single-speaker dataset recipe egs/ljspeech/tts1.
- Try a large multi-speaker dataset recipe egs/libritts/tts1.
- Make your own recipe using your own dataset.
