CMU 11492/11692 Spring 2023: Text to Speech
In this demonstration, we will show you how to run text-to-speech (TTS) systems in ESPnet.
Main references:
Author:
- Siddhant Arora (siddhana@andrew.cmu.edu)
The notebook is adapted from the ESPnet TTS demo Colab.
❗Important Notes❗
- We are using Colab to show the demo. However, Colab has some constraints on the total GPU runtime. If you use too much GPU time, you may not be able to use GPU for some time.
- There are multiple in-class checkpoints ✅ throughout this tutorial. Your participation points are based on these tasks. Please try your best to follow all the steps! If you encounter issues, please notify the TAs as soon as possible so that we can make an adjustment for you.
- Please submit PDF files of your completed notebooks to Gradescope. You can print the notebook using
File -> Print
in the menu bar.
Installation
# NOTE: pip may show incompatibility errors due to preinstalled libraries; you can safely ignore them
!pip install typeguard==2.13.3
!git clone --depth 5 -b spoken_dialog_demo https://github.com/siddhu001/espnet.git
!cd espnet && pip install .
!pip install parallel_wavegan==0.5.4
!pip install pyopenjtalk==0.2
!pip install pypinyin==0.44.0
!pip install gdown==4.4.0
!pip install espnet_model_zoo
Single speaker TTS model demo
TTS Model
You can try either an end-to-end text2wav model or a combination of a text2mel model and a vocoder.
If you use a text2wav model, you do not need a vocoder (it is automatically disabled). Both setups are sketched in the code right after the vocoder list below.
Text2wav models:
- VITS
Text2mel models:
- Tacotron2
- Transformer-TTS
- (Conformer) FastSpeech
- (Conformer) FastSpeech2
Vocoders:
- Griffin Lim
- Parallel WaveGAN
- Multi-band MelGAN
- HiFiGAN
- Style MelGAN
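To make the two setups concrete, here is a minimal sketch using Text2Speech.from_pretrained. The tags are example entries from the dropdowns used later in this notebook, and downloading them requires internet access; the rest of this demo instead loads models from unpacked local files.
# Minimal sketch of the two pipeline styles (tags are examples from this notebook's dropdowns).
from espnet2.bin.tts_inference import Text2Speech

# (a) End-to-end text2wav model (e.g. VITS): no separate vocoder is needed.
tts_e2e = Text2Speech.from_pretrained(model_tag="kan-bayashi/ljspeech_vits")

# (b) Text2mel model (e.g. Tacotron 2) + neural vocoder (e.g. Parallel WaveGAN).
tts_two_stage = Text2Speech.from_pretrained(
    model_tag="kan-bayashi/ljspeech_tacotron2",
    vocoder_tag="parallel_wavegan/ljspeech_parallel_wavegan.v1",
)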
In this demo, we will only experiment with the English TTS model, but ESPnet-TTS supports multiple languages like Japanese and Mandarin.
The terms of use follow those of each corpus. ESPnet-TTS uses the following corpora:
- ljspeech_*: LJSpeech dataset (https://keithito.com/LJ-Speech-Dataset/)
- jsut_*: JSUT corpus (https://sites.google.com/site/shinnosuketakamichi/publication/jsut)
- jvs_*: JVS corpus + JSUT corpus (https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus, https://sites.google.com/site/shinnosuketakamichi/publication/jsut)
- tsukuyomi_*: Tsukuyomi-chan corpus + JSUT corpus (https://tyc.rei-yumesaki.net/material/corpus/, https://sites.google.com/site/shinnosuketakamichi/publication/jsut)
- csmsc_*: Chinese Standard Mandarin Speech Corpus (https://www.data-baker.com/open_source.html)
#@title Download English model { run: "auto" }
lang = 'English'
tag = "kan-bayashi/ljspeech_vits" #@param ["kan-bayashi/ljspeech_tacotron2", "kan-bayashi/ljspeech_fastspeech", "kan-bayashi/ljspeech_vits"]
vocoder_tag = "none" #@param ["none", "parallel_wavegan/ljspeech_parallel_wavegan.v1"]
!gdown --id "1PjT9FX13d7Mv6loCs-wv5R_v3QrmLixf&confirm=t" -O /content/tts_model.zip
!unzip /content/tts_model.zip -d /content/tts_model
Model Setup
from espnet2.bin.tts_inference import Text2Speech
from espnet2.utils.types import str_or_none
text2speech = Text2Speech.from_pretrained(
    train_config="/content/tts_model/exp/tts_train_vits_raw_phn_tacotron_g2p_en_no_space/config.yaml",
    model_file="/content/tts_model/exp/tts_train_vits_raw_phn_tacotron_g2p_en_no_space/train.total_count.ave_10best.pth",
    vocoder_tag=str_or_none(vocoder_tag),
    device="cuda",
    # Only for Tacotron 2 & Transformer
    threshold=0.5,
    # Only for Tacotron 2
    minlenratio=0.0,
    maxlenratio=10.0,
    use_att_constraint=False,
    backward_window=1,
    forward_window=3,
    # Only for FastSpeech & FastSpeech2 & VITS
    speed_control_alpha=1.0,
    # Only for VITS
    noise_scale=0.333,
    noise_scale_dur=0.333,
)
Synthesis (✅ Checkpoint 1 (2 points))
Run inference with the pretrained single-speaker TTS model. Please experiment with running the TTS model on different utterances. Provide some examples of failure cases, and plot the spectrogram and waveform of the utterances for both successful and failure cases; a plotting sketch follows the synthesis cell below. (1 point)
Please also discuss possible explanations for these failure cases. (1 point)
import time
import torch
# decide the input sentence by yourself
print(f"Input your favorite sentence in {lang}.")
x = input()
# synthesis
with torch.no_grad():
    start = time.time()
    wav = text2speech(x)["wav"]
rtf = (time.time() - start) / (len(wav) / text2speech.fs)
print(f"RTF = {rtf:5f}")
# let us listen to generated samples
from IPython.display import display, Audio
display(Audio(wav.view(-1).cpu().numpy(), rate=text2speech.fs))
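For the plotting part of Checkpoint 1, the sketch below plots the waveform and a log-magnitude spectrogram of the synthesized audio. It assumes matplotlib and librosa are available (they are preinstalled in Colab); the STFT settings (n_fft=1024, hop_length=256) are just reasonable defaults, not values taken from the model.
# Plotting sketch for Checkpoint 1 (matplotlib/librosa assumed; STFT settings are generic defaults).
import matplotlib.pyplot as plt
import numpy as np
import librosa
import librosa.display

wav_np = wav.view(-1).cpu().numpy()
fs = text2speech.fs

fig, axes = plt.subplots(2, 1, figsize=(12, 6))
# waveform
axes[0].plot(np.arange(len(wav_np)) / fs, wav_np)
axes[0].set_xlabel("Time (s)")
axes[0].set_ylabel("Amplitude")
axes[0].set_title("Waveform")
# log-magnitude spectrogram
D = librosa.amplitude_to_db(np.abs(librosa.stft(wav_np, n_fft=1024, hop_length=256)), ref=np.max)
img = librosa.display.specshow(D, sr=fs, hop_length=256, x_axis="time", y_axis="linear", ax=axes[1])
axes[1].set_title("Log-magnitude spectrogram")
fig.colorbar(img, ax=axes[1], format="%+2.0f dB")
plt.tight_layout()
plt.show()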
TTS Model selection
Question 2 (✅ Checkpoint 2 (1 point))
Please experiment with running different TTS models, such as Tacotron 2 or FastSpeech. Please also experiment with both the Griffin-Lim and Parallel WaveGAN vocoders. Please discuss which is better and why; a comparison sketch follows the synthesis cell below.
#@title Download English model { run: "auto" }
lang = 'English'
tag = "kan-bayashi/ljspeech_tacotron2" #@param ["kan-bayashi/ljspeech_tacotron2", "kan-bayashi/ljspeech_fastspeech", "kan-bayashi/ljspeech_vits"]
vocoder_tag = "none" #@param ["none", "parallel_wavegan/ljspeech_parallel_wavegan.v1"]
# when vocoder_tag is none, Griffin Lim algorithm is used
!gdown --id "1PXsSaulipN31HnQ8YWwsi9Ndb3B2My-J&confirm=t" -O /content/tts_tacotron_model.zip
!unzip /content/tts_tacotron_model.zip -d /content/tts_tacotron_model
# For the FastSpeech model, run the commented lines below
#!gdown --id "13Jek_NbI8Qai42v4GKYxx3-jXOun5m2-&confirm=t" -O /content/tts_fastspeech_model.zip
#!unzip /content/tts_fastspeech_model.zip -d /content/tts_fastspeech_model
from espnet2.bin.tts_inference import Text2Speech
from espnet2.utils.types import str_or_none
!ln -sf /content/tts_tacotron_model/exp .
text2speech = Text2Speech.from_pretrained(
    # model_tag=str_or_none(tag),
    train_config="/content/tts_tacotron_model/exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/config.yaml",
    model_file="/content/tts_tacotron_model/exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/199epoch.pth",
    vocoder_tag=str_or_none(vocoder_tag),
    device="cuda",
    # Only for Tacotron 2 & Transformer
    threshold=0.5,
    # Only for Tacotron 2
    minlenratio=0.0,
    maxlenratio=10.0,
    use_att_constraint=False,
    backward_window=1,
    forward_window=3,
    # Only for FastSpeech & FastSpeech2 & VITS
    speed_control_alpha=1.0,
    # Only for VITS
    noise_scale=0.333,
    noise_scale_dur=0.333,
)
# For fastspeech model run the commented lines below
# from espnet2.bin.tts_inference import Text2Speech
# from espnet2.utils.types import str_or_none
# !ln -sf /content/tts_fastspeech_model/exp .
# text2speech = Text2Speech.from_pretrained(
# # model_tag=str_or_none(tag),
# train_config="/content/tts_fastspeech_model/exp/tts_train_fastspeech_raw_phn_tacotron_g2p_en_no_space/config.yaml",
# model_file="/content/tts_fastspeech_model/exp/tts_train_fastspeech_raw_phn_tacotron_g2p_en_no_space/1000epoch.pth",
# vocoder_tag=str_or_none(vocoder_tag),
# device="cuda",
# # Only for Tacotron 2 & Transformer
# threshold=0.5,
# # Only for Tacotron 2
# minlenratio=0.0,
# maxlenratio=10.0,
# use_att_constraint=False,
# backward_window=1,
# forward_window=3,
# # Only for FastSpeech & FastSpeech2 & VITS
# speed_control_alpha=1.0,
# # Only for VITS
# noise_scale=0.333,
# noise_scale_dur=0.333,
# )
import time
import torch
# decide the input sentence by yourself
print(f"Input your favorite sentence in {lang}.")
x = input()
# synthesis
with torch.no_grad():
    start = time.time()
    wav = text2speech(x)["wav"]
rtf = (time.time() - start) / (len(wav) / text2speech.fs)
print(f"RTF = {rtf:5f}")
# let us listen to generated samples
from IPython.display import display, Audio
display(Audio(wav.view(-1).cpu().numpy(), rate=text2speech.fs))
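For Checkpoint 2, one way to compare Griffin-Lim (vocoder_tag = "none") and Parallel WaveGAN on exactly the same sentence is to build two Text2Speech instances that differ only in vocoder_tag. This is a sketch under the assumption that the Tacotron 2 model unpacked above is used and that the Parallel WaveGAN vocoder can be downloaded from the internet.
# Sketch: same Tacotron 2 model, two vocoders (Griffin-Lim vs. Parallel WaveGAN).
import torch
from IPython.display import display, Audio
from espnet2.bin.tts_inference import Text2Speech
from espnet2.utils.types import str_or_none

sentence = "The quick brown fox jumps over the lazy dog."  # any sentence you like
outputs = {}
for voc in ["none", "parallel_wavegan/ljspeech_parallel_wavegan.v1"]:
    tts = Text2Speech.from_pretrained(
        train_config="/content/tts_tacotron_model/exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/config.yaml",
        model_file="/content/tts_tacotron_model/exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/199epoch.pth",
        vocoder_tag=str_or_none(voc),
        device="cuda",
    )
    with torch.no_grad():
        outputs[voc] = tts(sentence)["wav"]

# listen to both results and compare quality and artifacts
for voc, w in outputs.items():
    print("Griffin-Lim" if voc == "none" else voc)
    display(Audio(w.view(-1).cpu().numpy(), rate=tts.fs))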
Multi-speaker Model Demo
Model Selection
Currently, we provide only English multi-speaker pretrained models.
The terms of use follow those of each corpus. We use the following corpora:
- libritts_*: LibriTTS corpus (http://www.openslr.org/60)
- vctk_*: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (http://www.udialogue.org/download/cstr-vctk-corpus.html)
#@title English multi-speaker pretrained model { run: "auto" }
lang = 'English'
tag = 'kan-bayashi/vctk_full_band_multi_spk_vits' #@param ["kan-bayashi/vctk_gst_tacotron2", "kan-bayashi/vctk_gst_transformer", "kan-bayashi/vctk_xvector_tacotron2", "kan-bayashi/vctk_xvector_transformer", "kan-bayashi/vctk_xvector_conformer_fastspeech2", "kan-bayashi/vctk_gst+xvector_tacotron2", "kan-bayashi/vctk_gst+xvector_transformer", "kan-bayashi/vctk_gst+xvector_conformer_fastspeech2", "kan-bayashi/vctk_multi_spk_vits", "kan-bayashi/vctk_full_band_multi_spk_vits", "kan-bayashi/libritts_xvector_transformer", "kan-bayashi/libritts_xvector_conformer_fastspeech2", "kan-bayashi/libritts_gst+xvector_transformer", "kan-bayashi/libritts_gst+xvector_conformer_fastspeech2", "kan-bayashi/libritts_xvector_vits"] {type:"string"}
vocoder_tag = "none" #@param ["none", "parallel_wavegan/vctk_parallel_wavegan.v1.long", "parallel_wavegan/vctk_multi_band_melgan.v2", "parallel_wavegan/vctk_style_melgan.v1", "parallel_wavegan/vctk_hifigan.v1", "parallel_wavegan/libritts_parallel_wavegan.v1.long", "parallel_wavegan/libritts_multi_band_melgan.v2", "parallel_wavegan/libritts_hifigan.v1", "parallel_wavegan/libritts_style_melgan.v1"] {type:"string"}
!gdown --id "1fzyyjLvrT_jldw4lfOD1P8FK2MGoIZO_&confirm=t" -O /content/tts_multi-speaker_model.zip
!unzip /content/tts_multi-speaker_model.zip -d /content/tts_multi-speaker_model
Model Setup
from espnet2.bin.tts_inference import Text2Speech
from espnet2.utils.types import str_or_none
text2speech = Text2Speech.from_pretrained(
    train_config="/content/tts_multi-speaker_model/exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/config.yaml",
    model_file="/content/tts_multi-speaker_model/exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/train.total_count.ave_10best.pth",
    vocoder_tag=str_or_none(vocoder_tag),
    device="cuda",
    # Only for Tacotron 2 & Transformer
    threshold=0.5,
    # Only for Tacotron 2
    minlenratio=0.0,
    maxlenratio=10.0,
    use_att_constraint=False,
    backward_window=1,
    forward_window=3,
    # Only for FastSpeech & FastSpeech2 & VITS
    speed_control_alpha=1.0,
    # Only for VITS
    noise_scale=0.333,
    noise_scale_dur=0.333,
)
Speaker selection
For a multi-speaker model, we need to provide an X-vector and/or a reference speech to determine the speaker characteristics.
For the X-vector, you can select a speaker from the dumped x-vectors.
For the reference speech, you can use any speech, but please make sure its sampling rate matches that of the model; a resampling sketch is given below.
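If you load your own reference speech for the GST-based models, the sampling rate matters. A minimal sketch, assuming librosa (preinstalled in Colab) and a placeholder path /content/my_reference.wav, which you could use in place of the random-noise placeholder inside the "Reference speech selection" branch of the next cell:
# Sketch: load your own reference utterance resampled to the model's sampling rate.
# "/content/my_reference.wav" is a placeholder path; librosa is assumed to be available.
import librosa
import torch

ref_np, _ = librosa.load("/content/my_reference.wav", sr=text2speech.fs)  # librosa resamples to sr
speech = torch.from_numpy(ref_np).float()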
import glob
import os
import numpy as np
import torch
import kaldiio
# Get model directory path
from espnet_model_zoo.downloader import ModelDownloader
d = ModelDownloader()
# model_dir = os.path.dirname(d.download_and_unpack(tag)["train_config"])
# X-vector selection
spembs = None
if text2speech.use_spembs:
    xvector_ark = [p for p in glob.glob(f"/content/tts_multi-speaker_model/dump/**/spk_xvector.ark", recursive=True) if "tr" in p][0]
    xvectors = {k: v for k, v in kaldiio.load_ark(xvector_ark)}
    spks = list(xvectors.keys())
    # randomly select speaker
    random_spk_idx = np.random.randint(0, len(spks))
    spk = spks[random_spk_idx]
    spembs = xvectors[spk]
    print(f"selected spk: {spk}")
# Speaker ID selection
sids = None
if text2speech.use_sids:
    spk2sid = glob.glob(f"/content/tts_multi-speaker_model/dump/**/spk2sid", recursive=True)[0]
    with open(spk2sid) as f:
        lines = [line.strip() for line in f.readlines()]
    sid2spk = {int(line.split()[1]): line.split()[0] for line in lines}
    # randomly select speaker
    sids = np.array(np.random.randint(1, len(sid2spk)))
    spk = sid2spk[int(sids)]
    print(f"selected spk: {spk}")
# Reference speech selection for GST
speech = None
if text2speech.use_speech:
    # you can change here to load your own reference speech
    # e.g.
    # import soundfile as sf
    # speech, fs = sf.read("/path/to/reference.wav")
    # speech = torch.from_numpy(speech).float()
    speech = torch.randn(50000,) * 0.01
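The cell above picks a random speaker. For Checkpoint 3 it can be handier to pick specific speakers by name by inverting the spk2sid mapping. A minimal sketch, assuming the model uses speaker IDs (use_sids is True) and that the example name "p225" exists in the mapping:
# Sketch: choose a specific speaker by name instead of a random one.
# "p225" is only an example VCTK speaker name; check sid2spk for the names actually available.
spk2id = {name: sid for sid, name in sid2spk.items()}
print(sorted(spk2id)[:10])  # peek at a few available speaker names
sids = np.array(spk2id["p225"])
print(f"selected spk: p225 (sid={int(sids)})")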
Synthesis (✅ Checkpoint 3 (2 points))
Run inference with the pretrained multi-speaker TTS model for more than one speaker ID. Plot the spectrogram and waveform of the synthesized speech for each of these speaker IDs; a looping sketch follows the synthesis cell below.
import time
import torch
# decide the input sentence by yourself
print(f"Input your favorite sentence in {lang}.")
x = input()
# synthesis
with torch.no_grad():
    start = time.time()
    wav = text2speech(x, speech=speech, spembs=spembs, sids=sids)["wav"]
rtf = (time.time() - start) / (len(wav) / text2speech.fs)
print(f"RTF = {rtf:5f}")
# let us listen to generated samples
from IPython.display import display, Audio
display(Audio(wav.view(-1).cpu().numpy(), rate=text2speech.fs))
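For Checkpoint 3, one option is to loop over a few speaker IDs and synthesize the same sentence for each, plotting the waveform and spectrogram as in Checkpoint 1. A minimal sketch, assuming the full-band multi-speaker VITS model loaded above (which uses speaker IDs) and the sid2spk mapping from the speaker-selection cell; the STFT settings are generic defaults.
# Sketch: synthesize one sentence for several speaker IDs and plot each result
# (assumes a sids-based model and sid2spk from the speaker-selection cell).
import numpy as np
import torch
import matplotlib.pyplot as plt
import librosa
import librosa.display
from IPython.display import display, Audio

sentence = "Deep learning has changed speech synthesis."  # any sentence you like
chosen_sids = list(sid2spk.keys())[:3]  # first three speaker IDs; pick any you like

for sid in chosen_sids:
    with torch.no_grad():
        w = text2speech(sentence, sids=np.array(sid))["wav"]
    w_np = w.view(-1).cpu().numpy()
    print(f"speaker {sid2spk[sid]} (sid={sid})")
    display(Audio(w_np, rate=text2speech.fs))
    fig, axes = plt.subplots(2, 1, figsize=(10, 5))
    axes[0].plot(np.arange(len(w_np)) / text2speech.fs, w_np)
    axes[0].set_title(f"Waveform ({sid2spk[sid]})")
    D = librosa.amplitude_to_db(np.abs(librosa.stft(w_np, n_fft=1024, hop_length=256)), ref=np.max)
    librosa.display.specshow(D, sr=text2speech.fs, hop_length=256, x_axis="time", y_axis="linear", ax=axes[1])
    axes[1].set_title(f"Spectrogram ({sid2spk[sid]})")
    plt.tight_layout()
    plt.show()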