ESPnet2-TTS: Extending the Edge of TTS Research
https://arxiv.org/abs/2110.07840
Submitted to ICASSP2022.
Authors
- Tomoki Hayashi (Human Dataware Lab. Co., Ltd. / Nagoya University)
- Ryuichi Yamamoto (LINE Corp.)
- Takenori Yoshimura (Nagoya Institute of Technology)
- Peter Wu (Carnegie Mellon University)
- Jiatong Shi (Carnegie Mellon University)
- Takaaki Saeki (The University of Tokyo)
- Yooncheol Ju (AIRS Company, Hyundai Motor Group)
- Yusuke Yasuda (Nagoya University)
- Shinnosuke Takamichi (The University of Tokyo)
- Shinji Watanabe (Carnegie Mellon University)
Abstract
This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit. ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features, including: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling, which simplify the training pipeline and further enhance TTS performance. The unified design of our recipes enables users to quickly reproduce state-of-the-art E2E-TTS results. We also provide many pre-trained models in a unified Python interface for inference, offering a quick means for users to generate baseline samples and build demos. Experimental evaluations with English and Japanese corpora demonstrate that our provided models synthesize utterances comparable to ground-truth ones, achieving state-of-the-art TTS performance. The toolkit is available online at https://github.com/espnet/espnet.
Demo
You can try real-time demos online:
Example of LJSpeech (English single speaker)
- Transformer-TTS (icassp2020): Transformer-TTS + Mixture of density WaveNet vocoder with noise shaping. Our previous best model.
- CFS2: Conformer-FastSpeech2 + HiFiGAN. Each model was separately trained.
- CFS2 (ft): Same as the above model, but HiFi-GAN was fine-tuned with ground-truth aligned mel spectrograms.
- CFS2 (joint-ft): Same as the above model, but both models were jointly fine-tuned.
- CFS2 (joint-tr): Same as the above model, but both models were jointly trained from scratch.
- VITS: End-to-end text-to-waveform model, VITS.
All models used g2p_en as the G2P function.
You can find the detailed configurations in egs2/ljspeech/tts1/conf/tuning.
LJ050-0030: The Commission also recommends
Groundtruth | Transformer-TTS (icassp2020) |
CFS2 | CFS2 (ft) |
CFS2 (joint-ft) | CFS2 (joint-tr) |
VITS | |
LJ050-0040: and reports from other agencies which independently evaluate their information for potential sources of danger.
Groundtruth | Transformer-TTS (icassp2020) |
CFS2 | CFS2 (ft) |
CFS2 (joint-ft) | CFS2 (joint-tr) |
VITS | |
LJ050-0050: As a result of these studies, the planning document submitted by the Secretary of the Treasury to the Bureau of the Budget on August thirty-one,
Groundtruth | Transformer-TTS (icassp2020) |
CFS2 | CFS2 (ft) |
CFS2 (joint-ft) | CFS2 (joint-tr) |
VITS | |
Analysis of wrong G2P results
- CFS2 (g2p_en): Conformer-FastSpeech2 + HiFiGAN with g2p_en as the G2P function.
- CFS2 (espeak_ng): Conformer-FastSpeech2 + HiFiGAN with espeak_ng as the G2P function.
- VITS (g2p_en): VITS with g2p_en as the G2P function.
- VITS (espeak_ng): VITS with espeak_ng as the G2P function.
LJ050-0069: the Secret Service had received from the FBI some nine thousand reports on members of the Communist Party.
g2p_en
DH AH0 <space> S IY1 K R AH0 T <space> S ER1 V AH0 S <space> HH AE1 D <space> R AH0 S IY1 V D <space> F R AH1 M <space> DH AH0 <space> B AY1 <space> S AH1 M <space> N AY1 N <space> TH AW1 Z AH0 N D <space> R IH0 P AO1 R T S <space> AA1 N <space> M EH1 M B ER0 Z <space> AH1 V <space> DH AH0 <space> K AA1 M Y AH0 N AH0 S T <space> P AA1 R T IY0 <space> .
espeak_ng
ð ə <space> s ˈ i ː k ɹ ᵻ t <space> s ˈ ɜ ː v ɪ s <space> h æ d <space> ɹ ᵻ s ˈ i ː v d <space> f ɹ ʌ m ð ɪ <space> ˌ ɛ f b ˌ i ː ˈ a ɪ <space> s ˌ ʌ m <space> n ˈ a ɪ n <space> θ ˈ a ʊ z ə n d <space> ɹ ᵻ p ˈ o ː ɹ t s <space> ˌ ɔ n <space> m ˈ ɛ m b ɚ z <space> ʌ v ð ə <space> k ˈ ɑ ː m j u ː n ˌ ɪ s t <space> p ˈ ɑ ː ɹ ɾ i .
CFS2 (g2p_en) | CFS2 (espeak_ng) |
VITS (g2p_en) | VITS (espeak_ng) |
LJ050-0070: The FBI now transmits information on all defectors, a category which would, of course, have included Oswald.
g2p_en
DH AH0 <space> B AY1 <space> N AW1 <space> T R AE0 N Z M IH1 T S <space> IH2 N F ER0 M EY1 SH AH0 N <space> AA1 N <space> AO1 L <space> D IH0 F EH1 K T ER0 Z <space> , <space> AH0 <space> K AE1 T AH0 G AO2 R IY0 <space> W IH1 CH <space> W UH1 D <space> , <space> AH1 V <space> K AO1 R S <space> , <space> HH AE1 V <space> IH0 N K L UW1 D AH0 D <space> AO1 Z W AO0 L D <space> .
espeak_ng
ð ɪ <space> ˌ ɛ f b ˌ i ː ˈ a ɪ <space> n ˈ a ʊ <space> t ɹ æ n s m ˈ ɪ t s <space> ˌ ɪ n f ɚ m ˈ e ɪ ʃ ə n <space> ˌ ɔ n <space> ˈ ɔ ː l <space> d ᵻ f ˈ ɛ k t ɚ z , <space> ɐ <space> k ˈ æ ɾ ɪ ɡ ɚ ɹ i <space> w ˌ ɪ t ʃ <space> w ˈ ʊ d , <space> ʌ v <space> k ˈ o ː ɹ s , <space> h æ v <space> ɪ ŋ k l ˈ u ː d ᵻ d <space> ˈ ɑ ː s w ə l d .
CFS2 (g2p_en) | CFS2 (espeak_ng) |
VITS (g2p_en) | VITS (espeak_ng) |
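The phoneme sequences above use `<space>` as a word-boundary token. The following is a minimal sketch, assuming the sequence is available as a whitespace-separated string; the helper name `split_words` is hypothetical and not part of ESPnet. It groups the tokens into words so that a mispronounced word can be located by position, for example "FBI", which g2p_en renders as `B AY1` instead of spelling out the letters.

```python
from __future__ import annotations


def split_words(sequence: str, boundary: str = "<space>") -> list[list[str]]:
    """Group whitespace-separated phoneme tokens into words at each boundary token."""
    words: list[list[str]] = []
    current: list[str] = []
    for token in sequence.split():
        if token == boundary:
            # A boundary closes the current word (empty words are skipped).
            if current:
                words.append(current)
            current = []
        else:
            current.append(token)
    if current:
        words.append(current)
    return words


# Excerpt of the g2p_en output for LJ050-0069, covering "from the FBI some".
excerpt = "F R AH1 M <space> DH AH0 <space> B AY1 <space> S AH1 M"
words = split_words(excerpt)
print(words[2])  # ['B', 'AY1']
```

Running the same helper on the espeak_ng output would need care, because espeak_ng sometimes merges adjacent words into one token group (for example `f ɹ ʌ m ð ɪ`), so word indices do not always line up between the two G2P functions.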
Example of VCTK (English multi-speaker)
- Groundtruth: Groundtruth speech.
- SID-VITS: VITS with one-hot speaker ID (SID) embeddings. Since this model cannot deal with unknown speakers, we trained it with all of the speakers.
- X-VITS (avg): VITS with pre-trained X-vectors instead of one-hot speaker ID embeddings. This model was trained with all speakers except for the evaluation ones in the unseen speaker condition. For inference, we used X-vectors averaged over all the utterances of the target speaker except for the evaluation utterances.
- X-VITS (random): The same as the above model, except that it used an X-vector extracted from a single utterance of the target speaker. The utterance was randomly selected from all utterances of the speaker, excluding the evaluation utterances.
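The two X-vector selection strategies described above can be sketched as follows. This is an illustration under stated assumptions, not the ESPnet implementation: the function name `select_xvector`, the dictionary layout, and the toy two-dimensional vectors and utterance IDs are all hypothetical, and real X-vectors have many more dimensions.

```python
from __future__ import annotations

import random


def select_xvector(
    xvectors: dict[str, list[float]],
    eval_ids: set[str],
    mode: str = "avg",
    seed: int = 0,
) -> list[float]:
    """Pick a speaker embedding using only non-evaluation utterances.

    mode="avg" averages all candidate vectors; mode="random" picks one utterance.
    """
    # Sort by utterance ID so the random choice is reproducible for a given seed.
    candidates = [vec for utt_id, vec in sorted(xvectors.items()) if utt_id not in eval_ids]
    if not candidates:
        raise ValueError("no non-evaluation utterances available")
    if mode == "avg":
        dim = len(candidates[0])
        return [sum(vec[i] for vec in candidates) / len(candidates) for i in range(dim)]
    if mode == "random":
        return random.Random(seed).choice(candidates)
    raise ValueError(f"unknown mode: {mode}")


# Toy data (hypothetical values): one speaker, one utterance held out for evaluation.
toy = {"utt_a": [1.0, 0.0], "utt_b": [3.0, 2.0], "utt_eval": [9.0, 9.0]}
avg_vec = select_xvector(toy, eval_ids={"utt_eval"}, mode="avg")
rnd_vec = select_xvector(toy, eval_ids={"utt_eval"}, mode="random")
print(avg_vec)  # [2.0, 1.0]
```

The averaged vector smooths out utterance-level variation, while the single-utterance vector reflects the case where only one reference recording of the target speaker is available.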
All models used espeak_ng as the G2P function.
You can find the detailed configurations in egs2/vctk/tts1/conf/tuning.
Seen-speaker condition
p241_364: We have to be sure that the taxation system can work.
GT | SID-VITS |
X-VITS (avg) | X-VITS (random) |
p245_350: Despite his senior position, he did not know in advance.
GT | SID-VITS |
X-VITS (avg) | X-VITS (random) |
p265_343: It’s not pretty, but it’s effective.
GT | SID-VITS |
X-VITS (avg) | X-VITS (random) |
p333_416: Their courage, and their honesty, should be respected.
GT | SID-VITS |
X-VITS (avg) | X-VITS (random) |
Unseen-speaker condition
p227_393: You have to rely on each other.
GT | SID-VITS |
X-VITS (avg) | X-VITS (random) |
p228_362: It took about an hour for the gas to clear.
GT | SID-VITS |
X-VITS (avg) | X-VITS (random) |
p300_391: Form and structure are his music.
GT | SID-VITS |
X-VITS (avg) | X-VITS (random) |
p304_415: Parliament was evenly divided on the issue.
GT | SID-VITS |
X-VITS (avg) | X-VITS (random) |
Example of JSUT (Japanese single speaker)
- Groundtruth (22k): Groundtruth at a 22.05 kHz sampling rate.
- Groundtruth (44k): Groundtruth at a 44.1 kHz sampling rate.
- Tacotron2: Tacotron2 + HiFiGAN. Each model was separately trained.
- Transformer-TTS: Transformer-TTS + HiFiGAN. Each model was separately trained.
- CFS2: Conformer-FastSpeech2 + HiFiGAN. Each model was separately trained.
- CFS2 (ft): Same as the above model, but HiFi-GAN was fine-tuned with ground-truth aligned mel spectrograms.
- VITS: VITS trained at a 22.05 kHz sampling rate.
- FB-VITS: Full-band VITS trained at a 44.1 kHz sampling rate.
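To make the difference between the two sampling rates concrete, the short sketch below computes the Nyquist limit (half the sampling rate), which bounds the highest frequency a sampled signal can represent. This is a standard signal-processing fact rather than something stated on this page, and the helper name `nyquist_khz` is hypothetical.

```python
def nyquist_khz(sampling_rate_hz: float) -> float:
    """Return the highest representable frequency (Nyquist limit) in kHz."""
    return sampling_rate_hz / 2 / 1000


# Sampling rates of the two VITS variants listed above.
for name, fs in [("VITS", 22050), ("FB-VITS", 44100)]:
    print(f"{name}: fs = {fs} Hz, band up to {nyquist_khz(fs):.3f} kHz")
```

At 44.1 kHz the representable band reaches about 22 kHz, which covers the commonly cited range of human hearing, whereas 22.05 kHz audio is limited to about 11 kHz.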
All models used pyopenjtalk_prosody as the G2P function.
You can find the detailed configurations in egs2/jsut/tts1/conf/tuning.
BASIC5000_0001: 水をマレーシアから買わなくてはならないのです。
GT (22k) | GT (44k) |
Tacotron2 | Transformer-TTS |
CFS2 | CFS2 (ft) |
VITS | FB-VITS |
BASIC5000_0005: 血圧は、健康のパロメーターとして重要である。
GT (22k) | GT (44k) |
Tacotron2 | Transformer-TTS |
CFS2 | CFS2 (ft) |
VITS | FB-VITS |
BASIC5000_0009: 無罪の人々は、もちろん放免された。
GT (22k) | GT (44k) |
Tacotron2 | Transformer-TTS |
CFS2 | CFS2 (ft) |
VITS | FB-VITS |
Example of JVS (Japanese single speaker adaptation)
- Groundtruth: Groundtruth speech.
- VITS: VITS adapted with 100 utterances. The base model was trained on JSUT at 22.05 kHz.
All models used pyopenjtalk_prosody as the G2P function.
You can find the detailed configurations in egs2/jvs/tts1/conf/tuning.
BASIC5000_0408: 私もパーティーに来るべきだ、と、彼はつけ加えた。
GT (jvs001) | VITS (jvs001) |
BASIC5000_0261: 紙をとじるのに、ホチキスはとても便利だ。
GT (jvs010) | VITS (jvs010) |
BASIC5000_0238: 西欧諸国は、この問題に対する日本の姿勢を、激しく非難しています。
GT (jvs054) | VITS (jvs054) |
BASIC5000_0012: 溺れかかっていた乗客は、すべて救助された。
GT (jvs092) | VITS (jvs092) |
Contact
- Tomoki Hayashi (hayashi.tomoki [at] g.sp.m.is.nagoya-u.ac.jp)