ESPnet2-TTS: Extending the Edge of TTS Research

https://arxiv.org/abs/2110.07840

Submitted to ICASSP2022.

Authors

  • Tomoki Hayashi (Human Dataware Lab. Co., Ltd. / Nagoya University)
  • Ryuichi Yamamoto (LINE Corp.)
  • Takenori Yoshimura (Nagoya Institute of Technology)
  • Peter Wu (Carnegie Mellon University)
  • Jiatong Shi (Carnegie Mellon University)
  • Takaaki Saeki (The University of Tokyo)
  • Yooncheol Ju (AIRS Company, Hyundai Motor Group)
  • Yusuke Yasuda (Nagoya University)
  • Shinnosuke Takamichi (The University of Tokyo)
  • Shinji Watanabe (Carnegie Mellon University)

Abstract

This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit. ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features, including: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling, which simplify the training pipeline and further enhance TTS performance. The unified design of our recipes enables users to quickly reproduce state-of-the-art E2E-TTS results. We also provide many pre-trained models in a unified Python interface for inference, offering a quick means for users to generate baseline samples and build demos. Experimental evaluations with English and Japanese corpora demonstrate that our provided models synthesize utterances comparable to ground-truth ones, achieving state-of-the-art TTS performance. The toolkit is available online at https://github.com/espnet/espnet.

Demo

You can try real-time demos online:

Example of LJSpeech (English single speaker)

  • Transformer-TTS (icassp2020): Transformer-TTS + Mixture of density WaveNet vocoder with noise shaping. Our previous best model.
  • CFS2: Conformer-FastSpeech2 + HiFiGAN. Each model was separately trained.
  • CFS2 (ft): Same as the above model, but HiFiGAN was fine-tuned with ground-truth aligned mel spectrograms.
  • CFS2 (joint-ft): Same as the above model, but both models were jointly fine-tuned.
  • CFS2 (joint-tr): Same as the above model, but both models were jointly trained from scratch.
  • VITS: End-to-end text-to-waveform model, VITS.

All models used g2p_en as the G2P function.
You can find the detailed configurations in egs2/ljspeech/tts1/conf/tuning.

LJ050-0030: The Commission also recommends

Groundtruth Transformer-TTS (icassp2020)
CFS2 CFS2 (ft)
CFS2 (joint-ft) CFS2 (joint-tr)
VITS

LJ050-0040: and reports from other agencies which independently evaluate their information for potential sources of danger.

Groundtruth Transformer-TTS (icassp2020)
CFS2 CFS2 (ft)
CFS2 (joint-ft) CFS2 (joint-tr)
VITS

LJ050-0050: As a result of these studies, the planning document submitted by the Secretary of the Treasury to the Bureau of the Budget on August thirty-one,

Groundtruth Transformer-TTS (icassp2020)
CFS2 CFS2 (ft)
CFS2 (joint-ft) CFS2 (joint-tr)
VITS

Analysis of wrong G2P results

  • CFS2 (g2p_en): Conformer-FastSpeech2 + HiFiGAN with g2p_en as the G2P function.
  • CFS2 (espeak_ng): Conformer-FastSpeech2 + HiFiGAN with espeak_ng as the G2P function.
  • VITS (g2p_en): VITS with g2p_en as the G2P function.
  • VITS (espeak_ng): VITS with espeak_ng as the G2P function.

LJ050-0069: the Secret Service had received from the FBI some nine thousand reports on members of the Communist Party.

g2p_en

DH AH0 <space> S IY1 K R AH0 T <space> S ER1 V AH0 S <space> HH AE1 D <space> R AH0 S IY1 V D <space> F R AH1 M <space> DH AH0 <space> B AY1 <space> S AH1 M <space> N AY1 N <space> TH AW1 Z AH0 N D <space> R IH0 P AO1 R T S <space> AA1 N <space> M EH1 M B ER0 Z <space> AH1 V <space> DH AH0 <space> K AA1 M Y AH0 N AH0 S T <space> P AA1 R T IY0 <space> .

espeak_ng

ð ə <space> s ˈ i ː k ɹ ᵻ t <space> s ˈ ɜ ː v ɪ s <space> h æ d <space> ɹ ᵻ s ˈ i ː v d <space> f ɹ ʌ m ð ɪ <space> ˌ ɛ f b ˌ i ː ˈ a ɪ <space> s ˌ ʌ m <space> n ˈ a ɪ n <space> θ ˈ a ʊ z ə n d <space> ɹ ᵻ p ˈ o ː ɹ t s <space> ˌ ɔ n <space> m ˈ ɛ m b ɚ z <space> ʌ v ð ə <space> k ˈ ɑ ː m j u ː n ˌ ɪ s t <space> p ˈ ɑ ː ɹ ɾ i .
CFS2 (g2p_en) CFS2 (espeak_ng)
VITS (g2p_en) VITS (espeak_ng)

LJ050-0070: The FBI now transmits information on all defectors, a category which would, of course, have included Oswald.

g2p_en

DH AH0 <space> B AY1 <space> N AW1 <space> T R AE0 N Z M IH1 T S <space> IH2 N F ER0 M EY1 SH AH0 N <space> AA1 N <space> AO1 L <space> D IH0 F EH1 K T ER0 Z <space> , <space> AH0 <space> K AE1 T AH0 G AO2 R IY0 <space> W IH1 CH <space> W UH1 D <space> , <space> AH1 V <space> K AO1 R S <space> , <space> HH AE1 V <space> IH0 N K L UW1 D AH0 D <space> AO1 Z W AO0 L D <space> .

espeak_ng

ð ɪ <space> ˌ ɛ f b ˌ i ː ˈ a ɪ <space> n ˈ a ʊ <space> t ɹ æ n s m ˈ ɪ t s <space> ˌ ɪ n f ɚ m ˈ e ɪ ʃ ə n <space> ˌ ɔ n <space> ˈ ɔ ː l <space> d ᵻ f ˈ ɛ k t ɚ z , <space> ɐ <space> k ˈ æ ɾ ɪ ɡ ɚ ɹ i <space> w ˌ ɪ t ʃ <space> w ˈ ʊ d , <space> ʌ v <space> k ˈ o ː ɹ s , <space> h æ v <space> ɪ ŋ k l ˈ u ː d ᵻ d <space> ˈ ɑ ː s w ə l d .
CFS2 (g2p_en) CFS2 (espeak_ng)
VITS (g2p_en) VITS (espeak_ng)
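
In both utterances above, g2p_en renders "FBI" as B AY1, while espeak_ng spells out the letters. The following is a minimal sketch of how such cases could be flagged automatically. It is not part of ESPnet; the function names and the rule "a spelled-out acronym needs at least one phoneme per letter" are assumptions made for illustration.

```python
import re

PUNCT = {",", ".", "!", "?", ";", ":"}


def group_phonemes(phoneme_str):
    """Split a '<space>'-delimited phoneme string into per-word phoneme lists."""
    groups = [g.split() for g in phoneme_str.split("<space>")]
    # Drop empty groups and groups that contain only punctuation tokens.
    return [g for g in groups if g and not all(t in PUNCT for t in g)]


def suspicious_acronyms(text, phoneme_str):
    """Return (word, phonemes) pairs for all-caps words with too few phonemes.

    Heuristic (assumption): every English letter name has at least one
    phoneme, so a spelled-out acronym needs >= len(word) phonemes.
    """
    words = re.findall(r"[A-Za-z']+", text)
    groups = group_phonemes(phoneme_str)
    if len(words) != len(groups):
        raise ValueError("could not align words with phoneme groups")
    return [
        (w, g)
        for w, g in zip(words, groups)
        if len(w) >= 2 and w.isupper() and len(g) < len(w)
    ]


# Text and g2p_en output of LJ050-0070, copied from this page.
TEXT = (
    "The FBI now transmits information on all defectors, "
    "a category which would, of course, have included Oswald."
)
G2P_EN = (
    "DH AH0 <space> B AY1 <space> N AW1 <space> T R AE0 N Z M IH1 T S <space> "
    "IH2 N F ER0 M EY1 SH AH0 N <space> AA1 N <space> AO1 L <space> "
    "D IH0 F EH1 K T ER0 Z <space> , <space> AH0 <space> "
    "K AE1 T AH0 G AO2 R IY0 <space> W IH1 CH <space> W UH1 D <space> , <space> "
    "AH1 V <space> K AO1 R S <space> , <space> HH AE1 V <space> "
    "IH0 N K L UW1 D AH0 D <space> AO1 Z W AO0 L D <space> ."
)

flagged = suspicious_acronyms(TEXT, G2P_EN)
print(flagged)  # [('FBI', ['B', 'AY1'])]
```

The check relies on a one-to-one alignment between words and phoneme groups, which holds for the g2p_en outputs shown here. It would not work as-is on the espeak_ng outputs, where some word pairs (for example "from the") are merged into a single group.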

Example of VCTK (English multi-speaker)

  • Groundtruth: Groundtruth speech.
  • SID-VITS: VITS with one-hot speaker ID (SID) embeddings. Since this model cannot deal with unknown speakers, we trained it with all of the speakers.
  • X-VITS (avg): VITS with pre-trained X-vectors instead of one-hot speaker ID embeddings. This model was trained with all speakers except for the evaluation ones in the unseen speaker condition. For inference, we used X-vectors averaged over all the utterances of the target speaker except for the evaluation utterances.
  • X-VITS (random): The same as the above model except it used an X-vector extracted from a single utterance of the target speaker. This utterance was randomly selected from all utterances of the speaker excluding the evaluation utterances.
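
The difference between the two X-VITS inference conditions can be sketched as follows. This is an illustrative example, not ESPnet code: the utterance IDs, the 3-dimensional toy vectors, and the helper names are hypothetical stand-ins for real X-vectors extracted by a pre-trained speaker model.

```python
import random
from statistics import fmean


def reference_pool(utt2xvector, eval_utts):
    """Collect the speaker's X-vectors, excluding the evaluation utterances."""
    return [vec for utt, vec in sorted(utt2xvector.items()) if utt not in eval_utts]


def average_xvector(xvectors):
    """'avg' condition: element-wise mean over all reference X-vectors."""
    return [fmean(dim) for dim in zip(*xvectors)]


def random_xvector(xvectors, rng):
    """'random' condition: the X-vector of one randomly chosen utterance."""
    return rng.choice(xvectors)


# Toy data (hypothetical): four utterances of one speaker, one held out.
utt2xvector = {
    "utt1": [1.0, 0.0, 2.0],
    "utt2": [3.0, 2.0, 2.0],
    "utt3": [5.0, 4.0, 8.0],
    "utt4": [100.0, 100.0, 100.0],  # evaluation utterance, must not be used
}
eval_utts = {"utt4"}

pool = reference_pool(utt2xvector, eval_utts)
avg_vec = average_xvector(pool)
rnd_vec = random_xvector(pool, random.Random(0))

print(avg_vec)  # [3.0, 2.0, 4.0]
```

Averaging uses every available reference utterance, whereas the random condition shows what the model produces when only a single utterance of the target speaker is available.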

All models used espeak_ng as the G2P function.
You can find the detailed configurations in egs2/vctk/tts1/conf/tuning.

Seen-speaker condition

p241_364: We have to be sure that the taxation system can work.

GT SID-VITS
X-VITS (avg) X-VITS (random)

p245_350: Despite his senior position, he did not know in advance.

GT SID-VITS
X-VITS (avg) X-VITS (random)

p265_343: It’s not pretty, but it’s effective.

GT SID-VITS
X-VITS (avg) X-VITS (random)

p333_416: Their courage, and their honesty, should be respected.

GT SID-VITS
X-VITS (avg) X-VITS (random)

Unseen-speaker condition

p227_393: You have to rely on each other.

GT SID-VITS
X-VITS (avg) X-VITS (random)

p228_362: It took about an hour for the gas to clear.

GT SID-VITS
X-VITS (avg) X-VITS (random)

p300_391: Form and structure are his music.

GT SID-VITS
X-VITS (avg) X-VITS (random)

p304_415: Parliament was evenly divided on the issue.

GT SID-VITS
X-VITS (avg) X-VITS (random)

Example of JSUT (Japanese single speaker)

  • Groundtruth (22k): Groundtruth with 22.05 kHz sampling rate.
  • Groundtruth (44k): Groundtruth with 44.1 kHz sampling rate.
  • Tacotron2: Tacotron2 + HiFiGAN. Each model was separately trained.
  • Transformer-TTS: Transformer-TTS + HiFiGAN. Each model was separately trained.
  • CFS2: Conformer-FastSpeech2 + HiFiGAN. Each model was separately trained.
  • CFS2 (ft): Same as the above model, but HiFiGAN was fine-tuned with ground-truth aligned mel spectrograms.
  • VITS: VITS trained with 22.05 kHz sampling rate.
  • FB-VITS: Full-band VITS trained with 44.1 kHz sampling rate.

All models used pyopenjtalk_prosody as the G2P function.
You can find the detailed configurations in egs2/jsut/tts1/conf/tuning.

BASIC5000_0001: 水をマレーシアから買わなくてはならないのです。

GT (22k) GT (44k)
Tacotron2 Transformer-TTS
CFS2 CFS2 (ft)
VITS FB-VITS

BASIC5000_0005: 血圧は、健康のパロメーターとして重要である。

GT (22k) GT (44k)
Tacotron2 Transformer-TTS
CFS2 CFS2 (ft)
VITS FB-VITS

BASIC5000_0009: 無罪の人々は、もちろん放免された。

GT (22k) GT (44k)
Tacotron2 Transformer-TTS
CFS2 CFS2 (ft)
VITS FB-VITS

Example of JVS (Japanese single speaker adaptation)

  • Groundtruth: Groundtruth speech.
  • VITS: VITS adapted with 100 utterances. The base model was trained on JSUT with 22.05 kHz.

All models used pyopenjtalk_prosody as the G2P function.
You can find the detailed configurations in egs2/jvs/tts1/conf/tuning.

BASIC5000_0408: 私もパーティーに来るべきだ、と、彼はつけ加えた。

GT (jvs001) VITS (jvs001)

BASIC5000_0261: 紙をとじるのに、ホチキスはとても便利だ。

GT (jvs010) VITS (jvs010)

BASIC5000_0238: 西欧諸国は、この問題に対する日本の姿勢を、激しく非難しています。

GT (jvs054) VITS (jvs054)

BASIC5000_0012: 溺れかかっていた乗客は、すべて救助された。

GT (jvs092) VITS (jvs092)

Contact