espnet-tts-sample

ljspeech.tacotron2.v3

Creator

Abstract

This is a TTS demo using the LJ Speech Dataset [0].

tts1 recipe

The tts1 recipe is based on Tacotron2 [1] (the spectrogram prediction network) without the WaveNet vocoder. Tacotron2 generates a log mel-filterbank from text, which is then converted to a linear spectrogram using the inverse mel-basis. Finally, the phase components are recovered with the Griffin-Lim algorithm.
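The inverse mel-basis step can be sketched roughly as follows. This is a minimal NumPy illustration, not the recipe's actual code: the function names (`mel_filterbank`, `logmel_to_linear`), the HTK-style mel formula, the base-10 logarithm, and the parameter values (22050 Hz, 1024-point FFT, 80 mel bands) are all assumptions made for the example.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumption; other mel variants exist)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filterbank of shape (n_mels, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sr / 2.0, n_bins)
    # n_mels + 2 band edges, equally spaced on the mel scale
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    basis = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        basis[i] = np.maximum(0.0, np.minimum(rising, falling))
    return basis

def logmel_to_linear(logmel, basis, eps=1e-10):
    """Map log-mel frames (T, n_mels) to linear magnitude frames (T, n_bins)."""
    mel = 10.0 ** logmel               # undo the log (assumes log10)
    inv_basis = np.linalg.pinv(basis)  # pseudo-inverse, shape (n_bins, n_mels)
    linear = mel @ inv_basis.T
    # the pseudo-inverse can produce negative values, so clamp to a small floor
    return np.maximum(linear, eps)

# Toy input: random magnitudes projected onto the mel basis, then log-compressed
rng = np.random.default_rng(0)
basis = mel_filterbank(sr=22050, n_fft=1024, n_mels=80)
logmel = np.log10(np.maximum(rng.random((5, 513)) @ basis.T, 1e-10))
linear = logmel_to_linear(logmel, basis)
print(linear.shape)  # (5, 513)
```

Because the mel basis has far fewer rows than frequency bins, the pseudo-inverse gives only an approximate linear spectrogram; fine spectral detail lost in the mel projection is not recovered.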

(2019/06/16) We also support TTS-Transformer [3].
(2019/06/17) We also support Feed-forward Transformer [4].

tts2 recipe

The tts2 recipe is based on Tacotron2’s spectrogram prediction network [1] and Tacotron’s CBHG module [2]. Instead of the inverse mel-basis, the CBHG module converts the log mel-filterbank to a linear spectrogram. The phase components are recovered in the same way as in tts1.
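The Griffin-Lim phase recovery used by both recipes can be sketched as below. This is a simplified NumPy illustration under stated assumptions, not the code the recipes run: the helpers `stft`, `istft`, and `griffin_lim` are hypothetical, and the Hann window, FFT size, hop size, and iteration count are arbitrary choices for the example.

```python
import numpy as np

def stft(x, n_fft, hop):
    """Hann-windowed STFT; returns complex frames of shape (T, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[t * hop:t * hop + n_fft] * window for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft, hop):
    """Weighted overlap-add inverse of stft() above."""
    window = np.hanning(n_fft)
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    x = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for t in range(n_frames):
        start = t * hop
        x[start:start + n_fft] += frames[t] * window
        norm[start:start + n_fft] += window ** 2
    # the floor avoids division by zero where the window is ~0 (signal edges)
    return x / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_fft, hop, n_iter=32, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    # start from a random phase estimate
    angles = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        signal = istft(magnitude * angles, n_fft, hop)
        rebuilt = stft(signal, n_fft, hop)
        # keep only the phase of the re-analysed signal; the magnitude stays fixed
        angles = rebuilt / np.maximum(np.abs(rebuilt), 1e-8)
    return istft(magnitude * angles, n_fft, hop)

# Toy example: discard the phase of a 440 Hz tone, then re-estimate it
n_fft, hop = 256, 64
tone = np.sin(2 * np.pi * 440 * np.arange(n_fft + hop * 20) / 8000)
magnitude = np.abs(stft(tone, n_fft, hop))
waveform = griffin_lim(magnitude, n_fft, hop)
print(waveform.shape)  # (1536,)
```

Each iteration resynthesizes a waveform from the current phase estimate, re-analyses it, and keeps only the new phase. More iterations generally give a more consistent result at higher cost.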

Model

v.0.4.0: tacotron2.v3

Environments

Model files

Audio samples

  1. ground_truth: Recorded speech
  2. tacotron2.v3_GL: Synthesized speech (Feature generation: tacotron2.v3, Waveform synthesis: Griffin-Lim algorithm)
  3. tacotron2.v3_WNV: Synthesized speech (Feature generation: tacotron2.v3, Waveform synthesis: WaveNet vocoder)

* Recommended browser for the audio player: Google Chrome

Sample1

LJ050-0029 “THAT IS REFLECTED IN DEFINITE AND COMPREHENSIVE OPERATING PROCEDURES.”

ground_truth tacotron2.v3_GL tacotron2.v3_WNV
Attention weight Probability

Sample2

LJ050-0030 “THE COMMISSION ALSO RECOMMENDS”

ground_truth tacotron2.v3_GL tacotron2.v3_WNV
Attention weight Probability

Sample3

LJ050-0031 “THAT THE SECRET SERVICE CONSCIOUSLY SET ABOUT THE TASK OF INCULCATING AND MAINTAINING THE HIGHEST STANDARD OF EXCELLENCE AND ESPRIT, FOR ALL OF ITS PERSONNEL.”

ground_truth tacotron2.v3_GL tacotron2.v3_WNV
Attention weight Probability

Sample4

LJ050-0032 “THIS INVOLVES TIGHT AND UNSWERVING DISCIPLINE AS WELL AS THE PROMOTION OF AN OUTSTANDING DEGREE OF DEDICATION AND LOYALTY TO DUTY.”

ground_truth tacotron2.v3_GL tacotron2.v3_WNV
Attention weight Probability

Sample5

LJ050-0033 “THE COMMISSION EMPHASIZES THAT IT FINDS NO CAUSAL CONNECTION BETWEEN THE ASSASSINATION”

ground_truth tacotron2.v3_GL tacotron2.v3_WNV
Attention weight Probability

Other samples

https://drive.google.com/open?id=18JgsOCWiP_JkhONasTplnHS7yaF_konr

Synthesize speech from arbitrary text

  1. Go to Google Colab (created by GitHub)
  2. Run “0. Installation”
  3. Run “3. Demonstration of the use of pretrained models”

* Recommended browser for Google Colab: Google Chrome

Please change the TTS model option passed to the synthesis script as follows:
Before: !../../../utils/synth_wav.sh --models ljspeech.fastspeech.v1 example.txt
After: !../../../utils/synth_wav.sh --models ljspeech.tacotron2.v3 example.txt

References