Text-to-Speech
This is a template of TTS recipe for ESPnet2.
Table of Contents
- Recipe flow
- How to run
- Supported text frontend
- Supported text cleaner
- Supported Models
- FAQ
- Is the ESPnet1 model compatible with ESPnet2?
- How to change the minibatch size in training?
- How to make a new recipe for my own dataset?
- How to add a new g2p module?
- How to add a new cleaner module?
- How to use a trained model in python?
- How to get pretrained models?
- How to load the pretrained parameters?
- How to finetune the pretrained model?
- How to add a new model?
- How to test my model with an arbitrary given text?
- How to train a vocoder?
- How to train a vocoder with text2mel GTA outputs?
- How to handle the errors in validate_data_dir.sh?
- Why does the model generate meaningless speech at the end?
- Why can't the model be trained well with my own dataset?
- Why do the outputs contain metallic noise when combining a neural vocoder?
- How is the duration for FastSpeech2 generated?
- Why is the output of Tacotron2 or Transformer non-deterministic?
Recipe flow
The TTS recipe consists of the following 10 stages (the last two are optional).
1. Data preparation
Data preparation stage. You have two methods to generate the data:
ESPnet format:
It calls local/data.sh to create Kaldi-style data directories in data/
for training, validation, and evaluation sets.
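For reference, a minimal Kaldi-style data directory for TTS contains at least the following files (a sketch; spk2utt is derived from utt2spk):
data/tr_no_dev/
├── text     # <utterance_id> <transcription>
├── wav.scp  # <utterance_id> <path to wav, or command producing audio>
├── utt2spk  # <utterance_id> <speaker_id>
└── spk2utt  # <speaker_id> <utterance_id> <utterance_id> ...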
(New) MFA Alignments generation
You can generate alignments using the Montreal Forced Aligner tool. Use the script scripts/mfa.sh
to generate the required MFA alignments and train a model that employs these alignments.
Because the script scripts/mfa.sh prepares the data, it is not required to execute local/data.sh beforehand. However, you will need to set some additional flags, such as --split_sets, --samplerate, or --acoustic_model:
./scripts/mfa.sh --split_sets "train_set dev_set test_set" \
--stage 1 \
--stop-stage 2 \
--train true --nj 36 --g2p_model espeak_ng_english_vits
You can find a reference at egs2/ljspeech/tts1/local/run_mfa.sh.
The script scripts/mfa.sh will generate the alignments using the given g2p_model & acoustic_model and store them in the <split_sets>_phn directory. The script downloads a pretrained model (if --train false) or trains the MFA G2P and acoustic models (if --train true), and then generates the alignments.
Then, you can continue the training with the main script:
./run.sh --train-set train_set_phn \
--dev-set dev_set_phn \
--test_sets "dev_set_phn test_set_phn" \
--stage 2 \
--g2p none \
--cleaner none \
--teacher_dumpdir "data"
2. Wav dump / Embedding preparation
Wav dumping stage. This stage reformats wav.scp in the data directories.
Additionally, we support speaker embedding extraction in this stage, as in ESPnet1. If you specify --use_spk_embed true (default: use_spk_embed=false), we extract speaker embeddings. You can select the toolkit to use (kaldi, speechbrain, or espnet) via --spk_embed_tool <option> (default: spk_embed_tool=espnet). If you specify kaldi, we additionally extract MFCC features and VAD decisions; in that case, spk_embed_tag will be set to xvector automatically. Note that this processing requires a compiled Kaldi installation.
Also, speaker ID embedding and language ID embedding preparation will be performed in this stage if you specify the --use_sid true and --use_lid true options. Note that this processing assumes that utt2spk or utt2lang was correctly created in stage 1.
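For example, the following sketch runs the wav dump and embedding extraction stages with ESPnet speaker embeddings, using the flags documented above:
$ ./run.sh --stage 2 --stop-stage 3 --use_spk_embed true --spk_embed_tool espnet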
3. Extract speaker embeddings
Speaker embedding extraction stage (performed when --use_spk_embed true is specified).
4. Removal of long / short data
Processing stage to remove long and short utterances from the training and validation data. You can change the threshold values via --min_wav_duration and --max_wav_duration.
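For example, a sketch that reruns only this filtering stage with illustrative threshold values (in seconds):
$ ./run.sh --stage 4 --stop-stage 4 --min_wav_duration 0.1 --max_wav_duration 20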
5. Token list generation
Token list generation stage. It generates a token list (dictionary) from srctexts. You can change the tokenization type via the --token_type option; token_type=char and token_type=phn are supported. If the --cleaner option is specified, the input text will be cleaned with the specified cleaner. If token_type=phn, the input text will be converted with the G2P module specified by the --g2p option.
See also the Supported text frontend and Supported text cleaner sections below.
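For example, a sketch that regenerates the token list with phoneme tokens (the g2p and cleaner names are among the choices listed later in this document):
$ ./run.sh --stage 5 --stop-stage 5 --token_type phn --cleaner tacotron --g2p g2p_en_no_space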
6. TTS statistics collection
Statistics calculation stage. It collects the shape information of the input and output and calculates statistics for feature normalization (mean and variance over training data).
7. TTS training
TTS model training stage. You can change the training setting via the --train_config and --train_args options.
See also the Supported Models section below.
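For example, a sketch that overrides a training option from the command line (--train_args passes options through to the trainer; --max_epoch is a standard ESPnet2 trainer option, but verify it in your setup):
$ ./run.sh --stage 7 --train_config conf/train.yaml --train_args "--max_epoch 200"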
8. TTS decoding
TTS model decoding stage. You can change the decoding setting via the --inference_config and --inference_args options.
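For example, a sketch that overrides a decoding option from the command line (conf/decode.yaml is an assumed path; --use_teacher_forcing is the flag used in the FastSpeech examples below):
$ ./run.sh --stage 8 --inference_config conf/decode.yaml --inference_args "--use_teacher_forcing true"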
9. (Optional) Pack results for upload
Packing stage. It packs the trained model files in preparation for uploading to Hugging Face.
10. (Optional) Upload model to Hugging Face
Upload the trained model to Hugging Face for sharing. Additional information at Docs.
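For example, you can run only the packing stage via the stage options:
$ ./run.sh --stage 9 --stop-stage 9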
How to run
Here, we show the procedure to run the recipe using egs2/ljspeech/tts1.
Move to the recipe directory.
$ cd egs2/ljspeech/tts1
Modify the LJSPEECH variable in db.sh if you want to change the download directory.
$ vim db.sh
Modify cmd.sh and conf/*.conf if you want to use a job scheduler. See the details in using job scheduling system.
$ vim cmd.sh
Run run.sh, which conducts all of the stages explained above.
$ ./run.sh
By default, we train Tacotron2 (conf/train.yaml) with feats_type=raw + token_type=phn.
Then, you can get the following directories in the recipe directory.
├── data/ # Kaldi-style data directory
│ ├── dev/ # validation set
│ ├── eval1/ # evaluation set
│ └── tr_no_dev/ # training set
├── dump/ # feature dump directory
│ ├── token_list/ # token list (dictionary)
│ └── raw/
│ ├── org/
│ │ ├── tr_no_dev/ # training set before filtering
│ │ └── dev/ # validation set before filtering
│ ├── srctexts # text to create token list
│ ├── eval1/ # evaluation set
│ ├── dev/ # validation set after filtering
│ └── tr_no_dev/ # training set after filtering
└── exp/ # experiment directory
├── tts_stats_raw_phn_tacotron_g2p_en_no_space # statistics
└── tts_train_raw_phn_tacotron_g2p_en_no_space # model
├── att_ws/ # attention plot during training
├── tensorboard/ # tensorboard log
├── images/ # plot of training curves
├── decode_train.loss.ave/ # decoded results
│ ├── dev/ # validation set
│ └── eval1/ # evaluation set
│ ├── att_ws/ # attention plot in decoding
│ ├── probs/ # stop probability plot in decoding
│ ├── norm/ # generated features
│ ├── denorm/ # generated denormalized features
│ ├── wav/ # generated wav via Griffin-Lim
│ ├── log/ # log directory
│ ├── durations # duration of each input tokens
│ ├── feats_type # feature type
│ ├── focus_rates # focus rate
│ └── speech_shape # shape info of generated features
├── config.yaml # config used for the training
├── train.log # training log
├── *epoch.pth # model parameter file
├── checkpoint.pth # model + optimizer + scheduler parameter file
├── latest.pth # symlink to latest model parameter
├── *.ave_5best.pth # model averaged parameters
└── *.best.pth # symlink to the best model parameters
In decoding, we use Griffin-Lim for waveform generation by default (end-to-end text-to-wav models, such as VITS and joint training models, can generate waveforms directly). If you want to combine with neural vocoders, you can use kan-bayashi/ParallelWaveGAN.
# Make sure you have already installed the parallel_wavegan package
$ . ./path.sh && pip install -U parallel_wavegan
# Use parallel_wavegan provided pretrained ljspeech style melgan as a vocoder
$ ./run.sh --stage 8 --inference_args "--vocoder_tag parallel_wavegan/ljspeech_style_melgan.v1" --inference_tag decode_with_ljspeech_style_melgan.v1
# Use the vocoder trained by `parallel_wavegan` repo manually
$ ./run.sh --stage 8 --vocoder_file /path/to/checkpoint-xxxxxxsteps.pkl --inference_tag decode_with_my_vocoder
If you want to generate waveform from dumped features, please check decoding with ESPnet-TTS model's feature.
For the first time, we recommend performing each stage step-by-step via the --stage and --stop-stage options.
$ ./run.sh --stage 1 --stop-stage 1
$ ./run.sh --stage 2 --stop-stage 2
...
$ ./run.sh --stage 8 --stop-stage 8
This may help you understand each stage's processing and directory structure.
FastSpeech training
If you want to train FastSpeech, additional steps with the teacher model are needed. Please make sure you have already finished training the teacher model (Tacotron2 or Transformer-TTS).
First, decode all of the data, including the training, validation, and evaluation sets.
# specify teacher model directory via --tts_exp option
$ ./run.sh --stage 8 \
--tts_exp exp/tts_train_raw_phn_tacotron_g2p_en_no_space \
--test_sets "tr_no_dev dev eval1"
This will generate durations for the training, validation, and evaluation sets in exp/tts_train_raw_phn_tacotron_g2p_en_no_space/decode_train.loss.ave.
Then, you can train FastSpeech by specifying the directory containing durations via the --teacher_dumpdir option.
$ ./run.sh --stage 7 \
--train_config conf/tuning/train_fastspeech.yaml \
--teacher_dumpdir exp/tts_train_raw_phn_tacotron_g2p_en_no_space/decode_train.loss.ave
In the above example, we use the generated mel-spectrogram as the target, which is known as knowledge distillation training. If you want to use the groundtruth mel-spectrogram as the target, you need to use teacher forcing in decoding.
$ ./run.sh --stage 8 \
--tts_exp exp/tts_train_raw_phn_tacotron_g2p_en_no_space \
--inference_args "--use_teacher_forcing true" \
--test_sets "tr_no_dev dev eval1"
You can get the groundtruth-aligned durations in exp/tts_train_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave.
Then, you can train FastSpeech without knowledge distillation.
$ ./run.sh --stage 7 \
--train_config conf/tuning/train_fastspeech.yaml \
--teacher_dumpdir exp/tts_train_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave
FastSpeech2 training
The procedure is almost the same as for FastSpeech, but we MUST use teacher forcing in decoding.
$ ./run.sh --stage 8 \
--tts_exp exp/tts_train_raw_phn_tacotron_g2p_en_no_space \
--inference_args "--use_teacher_forcing true" \
--test_sets "tr_no_dev dev eval1"
To train FastSpeech2, we use additional features (F0 and energy). Therefore, we need to start from stage 6 to calculate the additional statistics.
$ ./run.sh --stage 6 \
--train_config conf/tuning/train_fastspeech2.yaml \
--teacher_dumpdir exp/tts_train_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave \
--tts_stats_dir exp/tts_train_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats \
--write_collected_feats true
where --tts_stats_dir specifies the directory to dump statistics to, and --write_collected_feats dumps features during the statistics calculation. The use of --write_collected_feats is optional, but it helps to accelerate the training.
Multi-speaker model with speaker embedding training
First, you need to run the embedding preparation and extraction stages with --use_spk_embed true to extract speaker embeddings.
$ ./run.sh --stage 3 --stop-stage 4 --use_spk_embed true
You can find the extracted speaker embeddings in dump/${spk_embed_tag}/*/${spk_embed_tag}.{ark,scp}. Then, you can run the training with a config that has spk_embed_dim: 512 in tts_conf.
# e.g.
tts_conf:
spk_embed_dim: 512 # dimension of speaker embedding
spk_embed_integration_type: add # how to integrate speaker embedding
(Optional) Train on speaker-averaged speaker embeddings
Models trained using speaker-averaged speaker embeddings may generalise better to inference tasks where the utterance-specific speaker embedding is unknown, compared to models trained using embeddings derived from individual training utterances. After you perform the above extraction step, if you want to train and evaluate using speaker-averaged speaker embeddings, you can use the following command to replace the utterance-level speaker embeddings with speaker-averaged values. Make sure to set your train_set, dev_set, and test_set variables beforehand:
for dset in "${train_set}" "${dev_set}" "${test_set}"
do
./pyscripts/utils/convert_to_avg_spk_embed.py \
--utt-embed-path dump/${spk_embed_tag}/${dset}/${spk_embed_tag}.scp \
--utt2spk data/${dset}/utt2spk \
--spk-embed-path dump/${spk_embed_tag}/${dset}/spk_${spk_embed_tag}.scp
done
The original ${spk_embed_tag}.scp files are renamed with a .bak suffix (e.g., xvector.scp.bak) in case you wish to revert the changes.
Once you've performed the extraction and, optionally, the speaker-averaged replacement step, please run the training from stage 7.
$ ./run.sh --stage 7 --use_spk_embed true --train_config /path/to/your_xvector_config.yaml
You can find the example config in egs2/vctk/tts1/conf/tuning.
Multi-speaker model with speaker ID embedding training
First, you need to run the preparation and extraction stages with --use_sid true to extract speaker IDs.
$ ./run.sh --stage 3 --stop-stage 4 --use_sid true
You can find the speaker ID file in dump/raw/*/utt2sid. Note that you need to correctly create utt2spk in the data prep stage to generate utt2sid. Then, you can run the training with a config that has spks: #spks in tts_conf.
# e.g.
tts_conf:
spks: 128 # Number of speakers
Please run the training from stage 7.
$ ./run.sh --stage 7 --use_sid true --train_config /path/to/your_multi_spk_config.yaml
Multi-language model with language ID embedding training
First, you need to run the preparation and extraction stages with --use_lid true to extract language IDs.
$ ./run.sh --stage 3 --stop-stage 4 --use_lid true
You can find the language ID file in dump/raw/*/utt2lid. Note that you need to additionally create the utt2lang file in the data prep stage to generate utt2lid. Then, you can run the training with a config that has langs: #langs in tts_conf.
# e.g.
tts_conf:
langs: 4 # Number of languages
Please run the training from stage 7.
$ ./run.sh --stage 7 --use_lid true --train_config /path/to/your_multi_lang_config.yaml
Of course, you can further combine these with speaker embeddings or speaker ID embeddings. If you want to use both sid and lid, the process should look like this:
$ ./run.sh --stage 3 --stop-stage 4 --use_lid true --use_sid true
Make your config.
# e.g.
tts_conf:
langs: 4 # Number of languages
spks: 128 # Number of speakers
Please run the training from stage 7.
$ ./run.sh --stage 7 --use_lid true --use_sid true --train_config /path/to/your_multi_spk_multi_lang_config.yaml
VITS training
First, the VITS configs are hard-coded for 22.05 kHz or 44.1 kHz and use a different feature extraction method. (Note that you can use any feature extraction method, but the default method is linear_spectrogram.) If you want to use them with a 24 kHz or 16 kHz dataset, please be careful about these points.
# Assume that data prep stage (stage 1) is finished
$ ./run.sh --stage 1 --stop-stage 1
# Single speaker 22.05 khz case
$ ./run.sh \
--stage 2 \
--ngpu 4 \
--fs 22050 \
--n_fft 1024 \
--n_shift 256 \
--win_length null \
--dumpdir dump/22k \
--expdir exp/22k \
--tts_task gan_tts \
--feats_extract linear_spectrogram \
--feats_normalize none \
--train_config ./conf/tuning/train_vits.yaml \
--inference_config ./conf/tuning/decode_vits.yaml \
--inference_model latest.pth
# Single speaker 44.1 khz case
$ ./run.sh \
--stage 2 \
--ngpu 4 \
--fs 44100 \
--n_fft 2048 \
--n_shift 512 \
--win_length null \
--dumpdir dump/44k \
--expdir exp/44k \
--tts_task gan_tts \
--feats_extract linear_spectrogram \
--feats_normalize none \
--train_config ./conf/tuning/train_full_band_vits.yaml \
--inference_config ./conf/tuning/decode_vits.yaml \
--inference_model latest.pth
# Multi speaker with SID 22.05 khz case
$ ./run.sh \
--stage 2 \
--use_sid true \
--ngpu 4 \
--fs 22050 \
--n_fft 1024 \
--n_shift 256 \
--win_length null \
--dumpdir dump/22k \
--expdir exp/22k \
--tts_task gan_tts \
--feats_extract linear_spectrogram \
--feats_normalize none \
--train_config ./conf/tuning/train_multi_spk_vits.yaml \
--inference_config ./conf/tuning/decode_vits.yaml \
--inference_model latest.pth
# Multi speaker with SID 44.1 khz case
$ ./run.sh \
--stage 2 \
--use_sid true \
--ngpu 4 \
--fs 44100 \
--n_fft 2048 \
--n_shift 512 \
--win_length null \
--dumpdir dump/44k \
--expdir exp/44k \
--tts_task gan_tts \
--feats_extract linear_spectrogram \
--feats_normalize none \
--train_config ./conf/tuning/train_full_band_multi_spk_vits.yaml \
--inference_config ./conf/tuning/decode_vits.yaml \
--inference_model latest.pth
# Multi speaker with speaker embedding 22.05 khz case (requires a compiled Kaldi if the Kaldi toolkit is used)
$ ./run.sh \
--stage 2 \
--use_spk_embed true \
--ngpu 4 \
--fs 22050 \
--n_fft 1024 \
--n_shift 256 \
--win_length null \
--dumpdir dump/22k \
--expdir exp/22k \
--tts_task gan_tts \
--feats_extract linear_spectrogram \
--feats_normalize none \
--train_config ./conf/tuning/train_xvector_vits.yaml \
--inference_config ./conf/tuning/decode_vits.yaml \
--inference_model latest.pth
The training takes a long time (around several weeks), but a model trained for around 100k steps can already generate reasonable-sounding speech.
You can find the example configs in:
- egs2/ljspeech/tts1/conf/tuning/train_vits.yaml: Single speaker 22.05 kHz config.
- egs2/jsut/tts1/conf/tuning/train_full_band_vits.yaml: Single speaker 44.1 kHz config.
- egs2/vctk/tts1/conf/tuning/train_multi_spk_vits.yaml: Multi speaker with SID 22.05 kHz config.
- egs2/vctk/tts1/conf/tuning/train_full_band_multi_spk_vits.yaml: Multi speaker with SID 44.1 kHz config.
- egs2/libritts/tts1/conf/tuning/train_xvector_vits.yaml: Multi speaker with X-vector 22.05 kHz config.
During VITS and JETS training, you can monitor pseudo MOS values predicted by a MOS prediction model. You can enable this by setting tts_conf.plot_pred_mos: true in the training config. Take a look at egs2/ljspeech/tts1/conf/tuning/train_vits.yaml to see how to set the flag.
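The corresponding config entry looks like this (a sketch following the tts_conf examples above):
# e.g.
tts_conf:
    plot_pred_mos: true # monitor pseudo MOS values during training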
Joint text2wav training
Joint training enables us to train both the text2mel and vocoder models jointly with GAN-based training. Currently, we have tested it only on non-autoregressive text2mel models with the LJSpeech dataset, but the following models and vocoders are supported.
Text2mel
- Tacotron2
- Transformer
- FastSpeech
- FastSpeech2
Vocoder
- ParallelWaveGAN G / D
- (Multi-band) MelGAN G / D
- HiFiGAN G / D
- StyleMelGAN G / D
Here, we show the example procedure to train conformer fastspeech2 + hifigan jointly with two training strategies (training from scratch, and fine-tuning of a pretrained text2mel model and vocoder).
# Make sure you are ready to train fastspeech2 (already prepared durations file with teacher model)
$ ...
# Case 1: Train conformer fastspeech2 + hifigan G + hifigan D from scratch
$ ./run.sh \
--stage 7 \
--tts_task gan_tts \
--train_config ./conf/tuning/train_joint_conformer_fastspeech2_hifigan.yaml
# Case 2: Fine-tuning of pretrained conformer fastspeech2 + hifigan G + hifigan D
# (a) Prepare pretrained models as follows
$ tree -L 2 exp
exp
...
├── ljspeech_hifigan.v1 # pretrained vocoder
│ ├── checkpoint-2500000steps.pkl
│ ├── config.yml
│ └── stats.h5
├── tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space # pretrained text2mel
│ ├── config.yaml
│ ├── images
│ └── train.loss.ave_5best.pth
...
# If you want to use the same files of this example
$ ipython
# Download text2mel model
[ins] In [1]: from espnet_model_zoo.downloader import ModelDownloader
[ins] In [2]: d = ModelDownloader("./downloads")
[ins] In [3]: d.download_and_unpack("kan-bayashi/ljspeech_conformer_fastspeech2")
# Download vocoder
[ins] In [4]: from parallel_wavegan.utils import download_pretrained_model
[ins] In [5]: download_pretrained_model("ljspeech_hifigan.v1", "downloads")
# Move them to exp directory
$ mv downloads/59c43ac0d40b121060bd71dd418f5ece/exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space exp
$ mv downloads/ljspeech_hifigan.v1 exp
# (b) Convert .pkl checkpoint to espnet loadable format
$ ipython
[ins] In [1]: import torch
[ins] In [2]: d = torch.load("./exp/ljspeech_hifigan.v1/checkpoint-2500000steps.pkl")
[ins] In [3]: torch.save(d["model"]["generator"], "generator.pth")
[ins] In [4]: torch.save(d["model"]["discriminator"], "discriminator.pth")
# (c) Prepare configuration
$ vim conf/tuning/finetune_joint_conformer_fastspeech2_hifigan.yaml
# edit text2mel_params / generator_params / discriminator_params to be the same as the pretrained model
# edit init_param part to specify the correct path of the pretrained model
# (d) Run training
$ ./run.sh \
--stage 7 \
--tts_task gan_tts \
--train_config ./conf/tuning/finetune_joint_conformer_fastspeech2_hifigan.yaml
You can find the example configs in:
- egs2/ljspeech/tts1/conf/tuning/train_joint_conformer_fastspeech2_hifigan.yaml: Joint training of conformer fastspeech2 + hifigan.
- egs2/ljspeech/tts1/conf/tuning/finetune_joint_conformer_fastspeech2_hifigan.yaml: Joint fine-tuning of conformer fastspeech2 + hifigan.
Evaluation
We provide the following objective evaluation metrics:
- Mel-cepstral distortion (MCD)
- Log-F0 root mean square error (log-F0 RMSE)
- Character error rate (CER)
- Conditional Fréchet Speech Distance (CFSD)
- Speaker Embedding Cosine Similarity (SECS)
- Discrete speech metrics
MCD and log-F0 RMSE reflect speaker, prosody, and phonetic content similarities, and CER reflects intelligibility. For MCD and log-F0 RMSE, we apply dynamic time warping (DTW) to match the length difference between ground-truth and generated speech. The discrete speech metrics correlate better with human subjective judgements than MCD.
Here we show the example command to calculate objective metrics:
cd egs2/<recipe_name>/tts1
. ./path.sh
# Evaluate MCD
./pyscripts/utils/evaluate_mcd.py \
exp/<model_dir_name>/<decode_dir_name>/eval1/wav/wav.scp \
dump/raw/eval1/wav.scp
# Evaluate log-F0 RMSE
./pyscripts/utils/evaluate_f0.py \
exp/<model_dir_name>/<decode_dir_name>/eval1/wav/wav.scp \
dump/raw/eval1/wav.scp
# If you want to calculate more precisely, limit the F0 range
./pyscripts/utils/evaluate_f0.py \
--f0min xxx \
--f0max yyy \
exp/<model_dir_name>/<decode_dir_name>/eval1/wav/wav.scp \
dump/raw/eval1/wav.scp
# Evaluate with automatic MOS prediction models.
./pyscripts/utils/evaluate_pseudomos.py \
exp/<model_dir_name>/<decode_dir_name>/eval1/wav/wav.scp \
dump/raw/eval1/wav.scp
# Evaluate CER
./scripts/utils/evaluate_asr.sh \
--model_tag <asr_model_tag> \
--nj 1 \
--inference_args "--beam_size 10 --ctc_weight 0.4 --lm_weight 0.0" \
--gt_text "dump/raw/eval1/text" \
exp/<model_dir_name>/<decode_dir_name>/eval1/wav/wav.scp \
exp/<model_dir_name>/<decode_dir_name>/asr_results
# You can also use openai whisper for evaluation
./scripts/utils/evaluate_asr.sh \
--whisper_tag base \
--nj 1 \
--gt_text "dump/raw/eval1/text" \
exp/<model_dir_name>/<decode_dir_name>/eval1/wav/wav.scp \
exp/<model_dir_name>/<decode_dir_name>/asr_results
# Since the ASR model does not use punctuation, it is better to remove punctuation if the text contains it
./scripts/utils/remove_punctuation.pl < dump/raw/eval1/text > dump/raw/eval1/text.no_punc
./scripts/utils/evaluate_asr.sh \
--model_tag <asr_model_tag> \
--nj 1 \
--inference_args "--beam_size 10 --ctc_weight 0.4 --lm_weight 0.0" \
--gt_text "dump/raw/eval1/text.no_punc" \
exp/<model_dir_name>/<decode_dir_name>/eval1/wav/wav.scp \
exp/<model_dir_name>/<decode_dir_name>/asr_results
# Some ASR models assume the existence of silence at the beginning and the end of audio
# Then, you can perform silence padding with sox to get more reasonable ASR results
awk < "exp/<model_dir_name>/<decode_dir_name>/eval1/wav/wav.scp" \
'{print $1 " sox " $2 " -t wav - pad 0.25 0.25 |"}' \
> exp/<model_dir_name>/<decode_dir_name>/eval1/wav/wav_pad.scp
./scripts/utils/evaluate_asr.sh \
--model_tag <asr_model_tag> \
--nj 1 \
--inference_args "--beam_size 10 --ctc_weight 0.4 --lm_weight 0.0" \
--gt_text "dump/raw/eval1/text.no_punc" \
exp/<model_dir_name>/<decode_dir_name>/eval1/wav/wav_pad.scp \
exp/<model_dir_name>/<decode_dir_name>/asr_results
# Evaluate CFSD
./pyscripts/utils/evaluate_cfsd.py \
exp/<model_dir_name>/<decode_dir_name>/eval1/wav/wav.scp \
dump/raw/eval1/wav.scp
# Evaluate SECS
./pyscripts/utils/evaluate_secs.py \
exp/<model_dir_name>/<decode_dir_name>/eval1/wav/wav.scp \
dump/raw/eval1/wav.scp
# Evaluate SpeechBERTScore
./pyscripts/utils/evaluate_speechbertscore.py \
exp/<model_dir_name>/<decode_dir_name>/eval1/wav/wav.scp \
dump/raw/eval1/wav.scp
# Evaluate SpeechBLEU
./pyscripts/utils/evaluate_speechbleu.py \
exp/<model_dir_name>/<decode_dir_name>/eval1/wav/wav.scp \
dump/raw/eval1/wav.scp
While these objective metrics can estimate the quality of synthesized speech, it is still difficult to fully determine human perceptual quality from these values, especially with high-fidelity generated speech. Therefore, we recommend performing a subjective evaluation if you want to check perceptual quality in detail.
You can refer to this page to launch a web-based subjective evaluation system with webMUSHRA.
Supported text frontend
You can change it via the --g2p option in tts.sh.
- none: Just separate by space
  - e.g.: HH AH0 L OW1 <space> W ER1 L D -> [HH, AH0, L, OW1, <space>, W, ER1, L, D]
- g2p_en: Kyubyong/g2p
  - e.g.: Hello World -> [HH, AH0, L, OW1, <space>, W, ER1, L, D]
- g2p_en_no_space: Kyubyong/g2p
  - Same G2P but without the word separator
  - e.g.: Hello World -> [HH, AH0, L, OW1, W, ER1, L, D]
- pyopenjtalk: r9y9/pyopenjtalk
  - e.g.: こ、こんにちは -> [k, o, pau, k, o, N, n, i, ch, i, w, a]
- pyopenjtalk_kana: r9y9/pyopenjtalk
  - Use kana instead of phonemes
  - e.g.: こ、こんにちは -> [コ, 、, コ, ン, ニ, チ, ワ]
- pyopenjtalk_accent: r9y9/pyopenjtalk
  - Add accent labels in addition to phoneme labels
  - Based on Developing a Japanese End-to-End Speech Synthesis Server Considering Accent Phrases
  - e.g.: こ、こんにちは -> [k, 1, 0, o, 1, 0, k, 5, -4, o, 5, -4, N, 5, -3, n, 5, -2, i, 5, -2, ch, 5, -1, i, 5, -1, w, 5, 0, a, 5, 0]
- pyopenjtalk_accent_with_pause: r9y9/pyopenjtalk
  - Add a pause label in addition to phoneme and accent labels
  - Based on Developing a Japanese End-to-End Speech Synthesis Server Considering Accent Phrases
  - e.g.: こ、こんにちは -> [k, 1, 0, o, 1, 0, pau, k, 5, -4, o, 5, -4, N, 5, -3, n, 5, -2, i, 5, -2, ch, 5, -1, i, 5, -1, w, 5, 0, a, 5, 0]
- pyopenjtalk_prosody: r9y9/pyopenjtalk
  - Use special symbols for prosody control
  - Based on Prosodic features control by symbols as input of sequence-to-sequence acoustic modeling for neural TTS
  - e.g.: こ、こんにちは -> [^, k, #, o, _, k, o, [, N, n, i, ch, i, w, a, $]
- pypinyin: mozillazg/python-pinyin
  - e.g.: 卡尔普陪外孙玩滑梯。 -> [ka3, er3, pu3, pei2, wai4, sun1, wan2, hua2, ti1, 。]
- pypinyin_phone: mozillazg/python-pinyin
  - Separate into initial and final parts
  - e.g.: 卡尔普陪外孙玩滑梯。 -> [k, a3, er3, p, u3, p, ei2, wai4, s, un1, uan2, h, ua2, t, i1, 。]
- espeak_ng_arabic: espeak-ng/espeak-ng
  - This result is provided via the wrapper library bootphon/phonemizer
  - e.g.: السلام عليكم -> [ʔ, a, s, s, ˈa, l, aː, m, ʕ, l, ˈiː, k, m]
- espeak_ng_german: espeak-ng/espeak-ng
  - This result is provided via the wrapper library bootphon/phonemizer
  - e.g.: Das hört sich gut an. -> [d, a, s, h, ˈœ, ɾ, t, z, ɪ, ç, ɡ, ˈuː, t, ˈa, n, .]
- espeak_ng_french: espeak-ng/espeak-ng
  - This result is provided via the wrapper library bootphon/phonemizer
  - e.g.: Bonjour le monde. -> [b, ɔ̃, ʒ, ˈu, ʁ, l, ə-, m, ˈɔ̃, d, .]
- espeak_ng_spanish: espeak-ng/espeak-ng
  - This result is provided via the wrapper library bootphon/phonemizer
  - e.g.: Hola Mundo. -> [ˈo, l, a, m, ˈu, n, d, o, .]
- espeak_ng_russian: espeak-ng/espeak-ng
  - This result is provided via the wrapper library bootphon/phonemizer
  - e.g.: Привет мир. -> [p, rʲ, i, vʲ, ˈe, t, mʲ, ˈi, r, .]
- espeak_ng_greek: espeak-ng/espeak-ng
  - This result is provided via the wrapper library bootphon/phonemizer
  - e.g.: Γειά σου Κόσμε. -> [j, ˈa, s, u, k, ˈo, s, m, e, .]
- espeak_ng_finnish: espeak-ng/espeak-ng
  - This result is provided via the wrapper library bootphon/phonemizer
  - e.g.: Hei maailma. -> [h, ˈei, m, ˈaː, ɪ, l, m, a, .]
- espeak_ng_hungarian: espeak-ng/espeak-ng
  - This result is provided via the wrapper library bootphon/phonemizer
  - e.g.: Helló Világ. -> [h, ˈɛ, l, l, oː, v, ˈi, l, aː, ɡ, .]
- espeak_ng_dutch: espeak-ng/espeak-ng
  - This result is provided via the wrapper library bootphon/phonemizer
  - e.g.: Hallo Wereld. -> [h, ˈɑ, l, oː, ʋ, ˈɪː, r, ə, l, t, .]
- espeak_ng_hindi: espeak-ng/espeak-ng
  - This result is provided via the wrapper library bootphon/phonemizer
  - e.g.: नमस्ते दुनिया -> [n, ə, m, ˈʌ, s, t, eː, d, ˈʊ, n, ɪ, j, ˌaː]
- espeak_ng_italian: espeak-ng/espeak-ng
  - This result is provided via the wrapper library bootphon/phonemizer
  - e.g.: Ciao mondo. -> [tʃ, ˈa, o, m, ˈo, n, d, o, .]
- espeak_ng_polish: espeak-ng/espeak-ng
  - This result is provided via the wrapper library bootphon/phonemizer
  - e.g.: Witaj świecie. -> [v, ˈi, t, a, j, ɕ, fʲ, ˈɛ, tɕ, ɛ, .]
- espeak_ng_english_us_vits: espeak-ng/espeak-ng
  - VITS official implementation-like processing (https://github.com/jaywalnut310/vits)
  - This result is provided via the wrapper library bootphon/phonemizer
  - e.g.: Hello World. -> [h, ə, l, ˈ, o, ʊ, <space>, w, ˈ, ɜ, ː, l, d, .]
- g2pk: Kyubyong/g2pK
  - e.g.: 안녕하세요 세계입니다. -> [ᄋ, ᅡ, ᆫ, ᄂ, ᅧ, ᆼ, ᄒ, ᅡ, ᄉ, ᅦ, ᄋ, ᅭ, , ᄉ, ᅦ, ᄀ, ᅨ, ᄋ, ᅵ, ᆷ, ᄂ, ᅵ, ᄃ, ᅡ, .]
- g2pk_no_space: Kyubyong/g2pK
  - Same G2P but without the word separator
  - e.g.: 안녕하세요 세계입니다. -> [ᄋ, ᅡ, ᆫ, ᄂ, ᅧ, ᆼ, ᄒ, ᅡ, ᄉ, ᅦ, ᄋ, ᅭ, ᄉ, ᅦ, ᄀ, ᅨ, ᄋ, ᅵ, ᆷ, ᄂ, ᅵ, ᄃ, ᅡ, .]
- g2pk_explicit_space: Kyubyong/g2pK
  - Same G2P but with an explicit word separator
  - e.g.: 안녕하세요 세계입니다. -> [ᄋ, ᅡ, ᆫ, ᄂ, ᅧ, ᆼ, ᄒ, ᅡ, ᄉ, ᅦ, ᄋ, ᅭ, <space>, ᄉ, ᅦ, ᄀ, ᅨ, ᄋ, ᅵ, ᆷ, ᄂ, ᅵ, ᄃ, ᅡ, .]
- korean_jaso: jdongian/python-jamo
  - e.g.: 나는 학교에 갑니다. -> [ᄂ, ᅡ, ᄂ, ᅳ, ᆫ, <space>, ᄒ, ᅡ, ᆨ, ᄀ, ᅭ, ᄋ, ᅦ, <space>, ᄀ, ᅡ, ᆸ, ᄂ, ᅵ, ᄃ, ᅡ, .]
- korean_jaso_no_space: jdongian/python-jamo
  - e.g.: 나는 학교에 갑니다. -> [ᄂ, ᅡ, ᄂ, ᅳ, ᆫ, ᄒ, ᅡ, ᆨ, ᄀ, ᅭ, ᄋ, ᅦ, ᄀ, ᅡ, ᆸ, ᄂ, ᅵ, ᄃ, ᅡ, .]
You can see the code example here.
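As a quick sketch, you can also try a G2P module directly from the shell via the tokenizer CLI (exact flags may differ across versions; check python -m espnet2.bin.tokenize_text --help):
$ echo "Hello World" | python -m espnet2.bin.tokenize_text \
    --token_type phn --g2p g2p_en --cleaner none \
    --input - --output -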
Supported text cleaner
You can change it via the --cleaner option in tts.sh.
- none: No text cleaner.
- tacotron: keithito/tacotron
  - e.g.: "(Hello-World); & jr. & dr." -> HELLO WORLD, AND JUNIOR AND DOCTOR
- jaconv: kazuhikoarase/jaconv
  - e.g.: ”あらゆる” 現実を 〜 ’すべて’ 自分の ほうへ ねじ曲げたのだ。" -> "あらゆる" 現実を ー 'すべて' 自分の ほうへ ねじ曲げたのだ。
You can see the code example here.
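Similarly, a sketch applying a cleaner from the shell (same CLI as above; verify the flags with --help):
$ echo "(Hello-World); & jr. & dr." | python -m espnet2.bin.tokenize_text \
    --token_type char --cleaner tacotron \
    --input - --output -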
Supported Models
You can train the following models by changing the *.yaml config for the --train_config option in tts.sh.
Single speaker model
- Tacotron 2
- Transformer-TTS
- FastSpeech
- FastSpeech2 (FastPitch)
- Conformer-based FastSpeech / FastSpeech2
- VITS
- JETS
You can find example configs of the above models in egs2/ljspeech/tts1/conf/tuning.
Multi speaker model extension
You can use / combine the following embeddings to build a multi-speaker model:
- X-Vector
- GST
- Speaker ID embedding (One-hot vector -> Continuous embedding)
- Language ID embedding (One-hot vector -> Continuous embedding)
X-Vector is provided by Kaldi and pretrained on the VoxCeleb corpus. You can find example configs of the above models in the corresponding recipe directories (e.g., egs2/vctk/tts1/conf/tuning).
We also support speaker embeddings from other toolkits; please check the following options:
https://github.com/espnet/espnet/blob/df053b8c13c26fe289fc882751801fd781e9d43e/egs2/TEMPLATE/tts1/tts.sh#L69-L71
FAQ
Is the ESPnet1 model compatible with ESPnet2?
No. We cannot use ESPnet1 models in ESPnet2.
How to change the minibatch size in training?
See change mini-batch type. By default, we use batch_type=numel and batch_bins instead of batch_size to enable dynamic batch sizes. See the following config as an example. https://github.com/espnet/espnet/blob/96b2fd08d4fd9276aabd7ad41ec5e02a88b30958/egs2/ljspeech/tts1/conf/tuning/train_tacotron2.yaml#L61-L62
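For reference, the corresponding entries in a training config look like this (the batch_bins value is illustrative; see the linked config for actual values):
batch_type: numel
batch_bins: 5120000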
How to make a new recipe for my own dataset?
See how to make/port new recipe.
How to add a new g2p module?
Add a new module in espnet2/text/phoneme_tokenizer.py and add it to g2p_choices in espnet2/text/phoneme_tokenizer.py.
We have a wrapper module for bootphon/phonemizer; you can find it in espnet2/text/phoneme_tokenizer.py. If the G2P you want is implemented in bootphon/phonemizer, you can easily add it in the same way (note that you still need to update the choices as mentioned above).
Example PRs may help you.
How to add a new cleaner module?
Update espnet2/text/cleaner.py to add a new module. Then, add a new choice in the argument parser of espnet2/bin/tokenize_text.py and espnet2/tasks/tts.py.
How to use a trained model in python?
from espnet2.bin.tts_inference import Text2Speech
# without vocoder
tts = Text2Speech.from_pretrained(model_file="/path/to/model.pth")
wav = tts("Hello, world")["wav"]
# with local vocoder
tts = Text2Speech.from_pretrained(model_file="/path/to/model.pth", vocoder_file="/path/to/vocoder.pkl")
wav = tts("Hello, world")["wav"]
# with pretrained vocoder (using ljspeech style melgan as an example)
tts = Text2Speech.from_pretrained(model_file="/path/to/model.pth", vocoder_tag="parallel_wavegan/ljspeech_style_melgan.v1")
wav = tts("Hello, world")["wav"]
See use a pretrained model for inference.
How to get pretrained models?
Use the ESPnet model zoo. You can find the full list of pretrained models here, or search for pretrained models on Hugging Face.
If you want to use the pretrained models written in egs2/hogehoge/tts1/README.md, go to the Zenodo URL and copy the download URL at the bottom of the page. Then, you can use it as follows:
from espnet2.bin.tts_inference import Text2Speech
# provide copied URL directly
tts = Text2Speech.from_pretrained(
"https://zenodo.org/record/5414980/files/tts_train_vits_raw_phn_jaconv_pyopenjtalk_accent_with_pause_train.total_count.ave.zip?download=1",
)
wav = tts("こんにちは、世界。")["wav"]
How to load the pretrained parameters?
Please use the --init_param option or add it to the training config (*.yaml).
# Usage
--init_param <file_path>:<src_key>:<dst_key>:<exclude_keys>
# Load all parameters
python -m espnet2.bin.tts_train --init_param model.pth
# Load only the parameters starting with "tts.dec"
python -m espnet2.bin.tts_train --init_param model.pth:tts.dec
# Load only the parameters starting with "decoder" and set it to model.tts.dec
python -m espnet2.bin.tts_train --init_param model.pth:decoder:tts.dec
# Set parameters to model.tts.dec
python -m espnet2.bin.tts_train --init_param decoder.pth::tts.dec
# Load all parameters excluding "tts.enc.embed"
python -m espnet2.bin.tts_train --init_param model.pth:::tts.enc.embed
# Load all parameters excluding "tts.enc.embed" and "tts.dec"
python -m espnet2.bin.tts_train --init_param model.pth:::tts.enc.embed,tts.dec
How to finetune the pretrained model?
See the jvs recipe as an example.
How to add a new model?
Under construction.
How to test my model with an arbitrary given text?
See the Google Colab demo notebook.
If you want to try it locally:
from espnet2.bin.tts_inference import Text2Speech
# with local model
tts = Text2Speech.from_pretrained(model_file="/path/to/model.pth")
wav = tts("Hello, world")["wav"]
# with local model and local vocoder
tts = Text2Speech.from_pretrained(model_file="/path/to/model.pth", vocoder_file="/path/to/vocoder.pkl")
wav = tts("Hello, world")["wav"]
# with local model and pretrained vocoder (using ljspeech as an example)
tts = Text2Speech.from_pretrained(model_file="/path/to/model.pth", vocoder_tag="parallel_wavegan/ljspeech_style_melgan.v1")
wav = tts("Hello, world")["wav"]
# with pretrained model and pretrained vocoder (using ljspeech as an example)
tts = Text2Speech.from_pretrained(model_tag="kan-bayashi/ljspeech_conformer_fastspeech2", vocoder_tag="parallel_wavegan/ljspeech_style_melgan.v1")
wav = tts("Hello, world")["wav"]
How to train a vocoder?
Please use kan-bayashi/ParallelWaveGAN, which provides recipes to train various GAN-based vocoders. If a recipe is not prepared, you can quickly start the training with the espnet2 TTS recipe. See Run training using ESPnet2-TTS recipe within 5 minutes.
Or you can try joint training of text2mel & vocoder.
The trained vocoder can be used as follows:
With python
from espnet2.bin.tts_inference import Text2Speech

tts = Text2Speech.from_pretrained(
    model_file="/path/to/model.pth",
    vocoder_file="/path/to/your_trained_vocoder_checkpoint.pkl",
)
wav = tts("Hello, world")["wav"]
With TTS recipe
$ ./run.sh --stage 8 --vocoder_file /path/to/your_trained_vocoder_checkpoint.pkl --inference_tag decode_with_my_vocoder
How to train a vocoder with text2mel GTA outputs?
Sometimes, we want to finetune the vocoder with text2mel groundtruth aligned (GTA) outputs. See Run finetuning using ESPnet2-TTS GTA outputs.
How to handle the errors in validate_data_dir.sh?
utils/validate_data_dir.sh: text contains N lines with non-printable characters which occurs at this line
This is caused by a recent change in Kaldi. We recommend modifying the following part in utils/validate_data_dir.sh to be non_print=true.
https://github.com/kaldi-asr/kaldi/blob/40c71c5ee3ee5dffa1ad2c53b1b089e16d967bb5/egs/wsj/s5/utils/validate_data_dir.sh#L9
utils/validate_text.pl: The line for utterance xxx contains disallowed Unicode whitespaces
utils/validate_text.pl: ERROR: text file 'data/xxx' contains disallowed UTF-8 whitespace character(s)
The use of zenkaku (full-width) whitespace in text is not allowed. Please change it to hankaku (half-width) whitespace or another symbol.
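A minimal sketch to replace zenkaku spaces (U+3000) with hankaku spaces in a data directory's text file (adjust the path to your dataset):
$ sed -i 's/　/ /g' data/<your_set>/text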
Why does the model generate meaningless speech at the end?
This is because the model failed to predict the stop token. There are several ways to solve this issue:
- Use the attention constraint in inference (use_attention_constraint=True in the inference config; only for Tacotron 2).
- Train the model with a large bce_pos_weight (e.g., bce_pos_weight=10.0).
- Use non-autoregressive models (FastSpeech or FastSpeech2).
Why can't the model be trained well with my own dataset?
Most of the problems are caused by insufficient cleaning of the dataset. Please check the following items carefully:
- Check the attention plot during training. The loss value is not so meaningful in TTS.
  - You can check this PR as an example.
- Remove the silence at the beginning and end of the speech.
  - You can use the silence trimming scripts in this example.
- Split the speech if it contains a long silence in the middle.
- Use phonemes instead of characters if G2P is available.
- Clean the text as much as you can (abbreviations, numbers, etc.).
- Add a pause symbol to the text if the speech contains silence.
- If the dataset is small, please consider adaptation from a pretrained model.
- If the dataset is small, please consider using a large reduction factor, which helps the attention learning.
Why do the outputs contain metallic noise when combining a neural vocoder?
This happens especially when the neural vocoder does not use noise as its input (e.g., MelGAN, HiFiGAN), making it less robust to mismatches in the acoustic features. The metallic sound can be reduced by performing vocoder finetuning with text2mel GTA outputs or by joint training / finetuning of text2mel and vocoder.
How is the duration for FastSpeech2 generated?
We use the teacher model's attention weights to calculate the duration, in the same way as FastSpeech. See more info in the FastSpeech paper.
Why is the output of Tacotron2 or Transformer non-deterministic?
This is because we use the prenet in the decoder, which always applies dropout. See more info in the Tacotron2 paper.
If you want to fix the results, you can use the --always_fix_seed option.