{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "espnet2_tts_demo", "provenance": [], "collapsed_sections": [], "toc_visible": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "SMSw_r1uRm4a" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "id": "MuhqhYSToxl7" }, "source": [ "# ESPnet2-TTS realtime demonstration\n", "\n", "This notebook provides a demonstration of the realtime E2E-TTS using ESPnet2-TTS and ParallelWaveGAN repo.\n", "\n", "- ESPnet2-TTS: https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1\n", "- ParallelWaveGAN: https://github.com/kan-bayashi/ParallelWaveGAN\n", "\n", "Author: Tomoki Hayashi ([@kan-bayashi](https://github.com/kan-bayashi))" ] }, { "cell_type": "markdown", "metadata": { "id": "9e_i_gdgAFNJ" }, "source": [ "## Installation" ] }, { "cell_type": "code", "metadata": { "id": "fjJ5zkyaoy29" }, "source": [ "# NOTE: pip shows imcompatible errors due to preinstalled libraries but you do not need to care\n", "!pip install -q espnet==202308 pypinyin==0.44.0 parallel_wavegan==0.5.4 gdown==4.4.0 espnet_model_zoo\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "BYLn3bL-qQjN" }, "source": [ "## Single speaker model demo\n" ] }, { "cell_type": "markdown", "metadata": { "id": "as4iFXid0m4f" }, "source": [ "### Model Selection\n", "\n", "Please select model: English, Japanese, and Mandarin are supported.\n", "\n", "You can try end-to-end text2wav model & combination of text2mel and vocoder. \n", "If you use text2wav model, you do not need to use vocoder (automatically disabled).\n", "\n", "**Text2wav models**:\n", "- VITS\n", "\n", "**Text2mel models**:\n", "- Tacotron2\n", "- Transformer-TTS\n", "- (Conformer) FastSpeech\n", "- (Conformer) FastSpeech2\n", "\n", "**Vocoders**:\n", "- Parallel WaveGAN\n", "- Multi-band MelGAN\n", "- HiFiGAN\n", "- Style MelGAN.\n", "\n", "\n", "> The terms of use follow that of each corpus. 
We use the following corpora:\n", "- `ljspeech_*`: LJSpeech dataset \n", " - https://keithito.com/LJ-Speech-Dataset/\n", "- `jsut_*`: JSUT corpus\n", " - https://sites.google.com/site/shinnosuketakamichi/publication/jsut\n", "- `jvs_*`: JVS corpus + JSUT corpus\n", " - https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus\n", " - https://sites.google.com/site/shinnosuketakamichi/publication/jsut\n", "- `tsukuyomi_*`: つくよみちゃんコーパス + JSUT corpus\n", " - https://tyc.rei-yumesaki.net/material/corpus/\n", " - https://sites.google.com/site/shinnosuketakamichi/publication/jsut\n", "- `csmsc_*`: Chinese Standard Mandarin Speech Corpus\n", " - https://www.data-baker.com/open_source.html \n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "J-Bvca5mE7bT" }, "source": [ "#@title Choose English model { run: \"auto\" }\n", "lang = 'English'\n", "tag = 'kan-bayashi/ljspeech_vits' #@param [\"kan-bayashi/ljspeech_tacotron2\", \"kan-bayashi/ljspeech_fastspeech\", \"kan-bayashi/ljspeech_fastspeech2\", \"kan-bayashi/ljspeech_conformer_fastspeech2\", \"kan-bayashi/ljspeech_joint_finetune_conformer_fastspeech2_hifigan\", \"kan-bayashi/ljspeech_joint_train_conformer_fastspeech2_hifigan\", \"kan-bayashi/ljspeech_vits\"] {type:\"string\"}\n", "vocoder_tag = \"none\" #@param [\"none\", \"parallel_wavegan/ljspeech_parallel_wavegan.v1\", \"parallel_wavegan/ljspeech_full_band_melgan.v2\", \"parallel_wavegan/ljspeech_multi_band_melgan.v2\", \"parallel_wavegan/ljspeech_hifigan.v1\", \"parallel_wavegan/ljspeech_style_melgan.v1\"] {type:\"string\"}" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "2BczOBfvE7bU" }, "source": [ "#@title Choose Japanese model { run: \"auto\" }\n", "lang = 'Japanese'\n", "tag = 'kan-bayashi/jsut_full_band_vits_prosody' #@param [\"kan-bayashi/jsut_tacotron2\", \"kan-bayashi/jsut_transformer\", \"kan-bayashi/jsut_fastspeech\", \"kan-bayashi/jsut_fastspeech2\", \"kan-bayashi/jsut_conformer_fastspeech2\", \"kan-bayashi/jsut_conformer_fastspeech2_accent\", \"kan-bayashi/jsut_conformer_fastspeech2_accent_with_pause\", \"kan-bayashi/jsut_vits_accent_with_pause\", \"kan-bayashi/jsut_full_band_vits_accent_with_pause\", \"kan-bayashi/jsut_tacotron2_prosody\", \"kan-bayashi/jsut_transformer_prosody\", \"kan-bayashi/jsut_conformer_fastspeech2_tacotron2_prosody\", \"kan-bayashi/jsut_vits_prosody\", \"kan-bayashi/jsut_full_band_vits_prosody\", \"kan-bayashi/jvs_jvs010_vits_prosody\", \"kan-bayashi/tsukuyomi_full_band_vits_prosody\"] {type:\"string\"}\n", "vocoder_tag = 'none' #@param [\"none\", \"parallel_wavegan/jsut_parallel_wavegan.v1\", \"parallel_wavegan/jsut_multi_band_melgan.v2\", \"parallel_wavegan/jsut_style_melgan.v1\", \"parallel_wavegan/jsut_hifigan.v1\"] {type:\"string\"}" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "k0jWteQ1E7bU" }, "source": [ "#@title Choose Mandarin model { run: \"auto\" }\n", "lang = 'Mandarin'\n", "tag = 'kan-bayashi/csmsc_full_band_vits' #@param [\"kan-bayashi/csmsc_tacotron2\", \"kan-bayashi/csmsc_transformer\", \"kan-bayashi/csmsc_fastspeech\", \"kan-bayashi/csmsc_fastspeech2\", \"kan-bayashi/csmsc_conformer_fastspeech2\", \"kan-bayashi/csmsc_vits\", \"kan-bayashi/csmsc_full_band_vits\"] {type: \"string\"}\n", "vocoder_tag = \"none\" #@param [\"none\", \"parallel_wavegan/csmsc_parallel_wavegan.v1\", \"parallel_wavegan/csmsc_multi_band_melgan.v2\", \"parallel_wavegan/csmsc_hifigan.v1\", \"parallel_wavegan/csmsc_style_melgan.v1\"] {type:\"string\"}" ], 
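"execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick, optional sanity check, the next cell simply echoes the `lang`, `tag`, and `vocoder_tag` values set by whichever form cell you ran last, so you can confirm the selection before loading the model." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Optional sanity check: echo the selections made in the form cells above.\n", "# If vocoder_tag is \"none\", no external ParallelWaveGAN vocoder will be loaded.\n", "print(f\"language: {lang}\")\n", "print(f\"model tag: {tag}\")\n", "print(f\"vocoder tag: {vocoder_tag}\")" ],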
"execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "g9S-SFPe0z0w" }, "source": [ "### Model Setup" ] }, { "cell_type": "code", "metadata": { "id": "z64fD2UgjJ6Q" }, "source": [ "from espnet2.bin.tts_inference import Text2Speech\n", "from espnet2.utils.types import str_or_none\n", "\n", "text2speech = Text2Speech.from_pretrained(\n", " model_tag=str_or_none(tag),\n", " vocoder_tag=str_or_none(vocoder_tag),\n", " device=\"cuda\",\n", " # Only for Tacotron 2 & Transformer\n", " threshold=0.5,\n", " # Only for Tacotron 2\n", " minlenratio=0.0,\n", " maxlenratio=10.0,\n", " use_att_constraint=False,\n", " backward_window=1,\n", " forward_window=3,\n", " # Only for FastSpeech & FastSpeech2 & VITS\n", " speed_control_alpha=1.0,\n", " # Only for VITS\n", " noise_scale=0.333,\n", " noise_scale_dur=0.333,\n", ")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "FMaT0Zev021a" }, "source": [ "### Synthesis" ] }, { "cell_type": "code", "metadata": { "id": "vrRM57hhgtHy" }, "source": [ "import time\n", "import torch\n", "\n", "# decide the input sentence by yourself\n", "print(f\"Input your favorite sentence in {lang}.\")\n", "x = input()\n", "\n", "# synthesis\n", "with torch.no_grad():\n", " start = time.time()\n", " wav = text2speech(x)[\"wav\"]\n", "rtf = (time.time() - start) / (len(wav) / text2speech.fs)\n", "print(f\"RTF = {rtf:5f}\")\n", "\n", "# let us listen to generated samples\n", "from IPython.display import display, Audio\n", "display(Audio(wav.view(-1).cpu().numpy(), rate=text2speech.fs))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "3TTAygALqY6T" }, "source": [ "## Multi-speaker Model Demo" ] }, { "cell_type": "markdown", "metadata": { "id": "rSEZYh22n4gn" }, "source": [ "### Model Selection\n", "\n", "Now we provide only English multi-speaker pretrained model.\n", "\n", "> The terms of use follow that of each corpus. 
We use the following corpora:\n", "- `libritts_*`: LibriTTS corpus\n", " - http://www.openslr.org/60\n", "- `vctk_*`: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit\n", " - http://www.udialogue.org/download/cstr-vctk-corpus.html\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "NSKLmDN9E7bW" }, "source": [ "#@title English multi-speaker pretrained model { run: \"auto\" }\n", "lang = 'English'\n", "tag = 'kan-bayashi/vctk_full_band_multi_spk_vits' #@param [\"kan-bayashi/vctk_gst_tacotron2\", \"kan-bayashi/vctk_gst_transformer\", \"kan-bayashi/vctk_xvector_tacotron2\", \"kan-bayashi/vctk_xvector_transformer\", \"kan-bayashi/vctk_xvector_conformer_fastspeech2\", \"kan-bayashi/vctk_gst+xvector_tacotron2\", \"kan-bayashi/vctk_gst+xvector_transformer\", \"kan-bayashi/vctk_gst+xvector_conformer_fastspeech2\", \"kan-bayashi/vctk_multi_spk_vits\", \"kan-bayashi/vctk_full_band_multi_spk_vits\", \"kan-bayashi/libritts_xvector_transformer\", \"kan-bayashi/libritts_xvector_conformer_fastspeech2\", \"kan-bayashi/libritts_gst+xvector_transformer\", \"kan-bayashi/libritts_gst+xvector_conformer_fastspeech2\", \"kan-bayashi/libritts_xvector_vits\"] {type:\"string\"}\n", "vocoder_tag = \"none\" #@param [\"none\", \"parallel_wavegan/vctk_parallel_wavegan.v1.long\", \"parallel_wavegan/vctk_multi_band_melgan.v2\", \"parallel_wavegan/vctk_style_melgan.v1\", \"parallel_wavegan/vctk_hifigan.v1\", \"parallel_wavegan/libritts_parallel_wavegan.v1.long\", \"parallel_wavegan/libritts_multi_band_melgan.v2\", \"parallel_wavegan/libritts_hifigan.v1\", \"parallel_wavegan/libritts_style_melgan.v1\"] {type:\"string\"}" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "GcshmgYpoVzh" }, "source": [ "### Model Setup" ] }, { "cell_type": "code", "metadata": { "id": "kfJFD4QroNhJ" }, "source": [ "from espnet2.bin.tts_inference import Text2Speech\n", "from espnet2.utils.types import str_or_none\n", "\n", "text2speech = Text2Speech.from_pretrained(\n", "    model_tag=str_or_none(tag),\n", "    vocoder_tag=str_or_none(vocoder_tag),\n", "    device=\"cuda\",\n", "    # Only for Tacotron 2 & Transformer\n", "    threshold=0.5,\n", "    # Only for Tacotron 2\n", "    minlenratio=0.0,\n", "    maxlenratio=10.0,\n", "    use_att_constraint=False,\n", "    backward_window=1,\n", "    forward_window=3,\n", "    # Only for FastSpeech & FastSpeech2 & VITS\n", "    speed_control_alpha=1.0,\n", "    # Only for VITS\n", "    noise_scale=0.333,\n", "    noise_scale_dur=0.333,\n", ")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "gdaMNwrtuZhY" }, "source": [ "### Speaker selection\n", "\n", "For a multi-speaker model, we need to provide an X-vector, a speaker ID, and/or a reference speech to determine the speaker characteristics. \n", "For the X-vector, you can select a speaker from the dumped x-vectors. \n", "For the reference speech, you can use any speech, but please make sure its sampling rate matches that of the model."
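, "\n", "\n", "If you want to use your own reference clip, the next cell sketches a small helper for loading a wav file and resampling it to the model's sampling rate. This is purely optional and only an illustration; it assumes `soundfile` and `librosa` are available in this environment, and the path you pass is a placeholder to replace with your own file." ] }, { "cell_type": "code", "metadata": {}, "source": [ "import librosa\n", "import soundfile as sf\n", "import torch\n", "\n", "\n", "# Optional helper (illustrative only): load a reference wav and resample it to\n", "# the model's sampling rate. You could call it in the next cell in place of the\n", "# random placeholder, e.g. speech = load_reference_speech(\"/path/to/reference.wav\").\n", "def load_reference_speech(path):\n", "    speech, fs = sf.read(path)\n", "    if speech.ndim > 1:\n", "        speech = speech.mean(axis=1)  # downmix multi-channel audio to mono\n", "    if fs != text2speech.fs:\n", "        speech = librosa.resample(speech, orig_sr=fs, target_sr=text2speech.fs)\n", "    return torch.from_numpy(speech).float()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell then selects the speaker characteristics (X-vector, speaker ID, and/or reference speech) according to what the loaded model expects."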
] }, { "cell_type": "code", "metadata": { "id": "ZzoAd1rgObcP" }, "source": [ "import glob\n", "import os\n", "import numpy as np\n", "import kaldiio\n", "\n", "# Get model directory path\n", "from espnet_model_zoo.downloader import ModelDownloader\n", "d = ModelDownloader()\n", "model_dir = os.path.dirname(d.download_and_unpack(tag)[\"train_config\"])\n", "\n", "# X-vector selection\n", "spembs = None\n", "if text2speech.use_spembs:\n", " xvector_ark = [p for p in glob.glob(f\"{model_dir}/../../dump/**/spk_xvector.ark\", recursive=True) if \"tr\" in p][0]\n", " xvectors = {k: v for k, v in kaldiio.load_ark(xvector_ark)}\n", " spks = list(xvectors.keys())\n", "\n", " # randomly select speaker\n", " random_spk_idx = np.random.randint(0, len(spks))\n", " spk = spks[random_spk_idx]\n", " spembs = xvectors[spk]\n", " print(f\"selected spk: {spk}\")\n", "\n", "# Speaker ID selection\n", "sids = None\n", "if text2speech.use_sids:\n", " spk2sid = glob.glob(f\"{model_dir}/../../dump/**/spk2sid\", recursive=True)[0]\n", " with open(spk2sid) as f:\n", " lines = [line.strip() for line in f.readlines()]\n", " sid2spk = {int(line.split()[1]): line.split()[0] for line in lines}\n", " \n", " # randomly select speaker\n", " sids = np.array(np.random.randint(1, len(sid2spk)))\n", " spk = sid2spk[int(sids)]\n", " print(f\"selected spk: {spk}\")\n", "\n", "# Reference speech selection for GST\n", "speech = None\n", "if text2speech.use_speech:\n", " # you can change here to load your own reference speech\n", " # e.g.\n", " # import soundfile as sf\n", " # speech, fs = sf.read(\"/path/to/reference.wav\")\n", " # speech = torch.from_numpy(speech).float()\n", " speech = torch.randn(50000,) * 0.01" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "W6G-1YW9ocYV" }, "source": [ "### Synthesis" ] }, { "cell_type": "code", "metadata": { "id": "o87zK1NLobne" }, "source": [ "import time\n", "import torch\n", "\n", "# decide the input sentence by yourself\n", "print(f\"Input your favorite sentence in {lang}.\")\n", "x = input()\n", "\n", "# synthesis\n", "with torch.no_grad():\n", " start = time.time()\n", " wav = text2speech(x, speech=speech, spembs=spembs, sids=sids)[\"wav\"]\n", "rtf = (time.time() - start) / (len(wav) / text2speech.fs)\n", "print(f\"RTF = {rtf:5f}\")\n", "\n", "# let us listen to generated samples\n", "from IPython.display import display, Audio\n", "display(Audio(wav.view(-1).cpu().numpy(), rate=text2speech.fs))" ], "execution_count": null, "outputs": [] } ] }