{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "espnet2_tutorial_2021_CMU_11751_18781.ipynb", "provenance": [], "collapsed_sections": [ "gGg1N9jufpf2", "bdMq932Em7oF", "7mWwpjxSqy4Q", "5L2M_Un-seKS", "qTJUKl90kw7l", "7I0a87Est2bQ", "QM-KLOYKuM7G", "_Y5IlGWz7sXd", "rYzNLITz7wyG", "tvNphJbcCQUA" ] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "kNbxe3gyfbPi" }, "source": [ "# CMU 11751/18781 2021: ESPnet Tutorial \n", "\n", "ESPnet is an end-to-end speech processing toolkit, initially focused on end-to-end speech recognition and end-to-end text-to-speech, but now extended to various other speech processing. ESPnet uses PyTorch as a main deep learning engine, and also follows Kaldi style data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments.\n", "\n", "This tutorial is based on the collection of espnet notebook demos https://github.com/espnet/notebook, espnet documentations in https://espnet.github.io/espnet/, and README.md in https://github.com/espnet/espnet\n", "\n", "Author: Shinji Watanabe ([@sw005320](https://github.com/sw005320))\n" ] }, { "cell_type": "markdown", "metadata": { "id": "gGg1N9jufpf2" }, "source": [ "## Useful links\n", "\n", "- Installation https://espnet.github.io/espnet/installation.html\n", "- Usage https://espnet.github.io/espnet/espnet2_tutorial.html\n" ] }, { "cell_type": "markdown", "metadata": { "id": "xuIdPF2akLpM" }, "source": [ "# Run an inference example\n", "\n", "- ESPnet covers various speech applications and their pre-trained models. \n", "\n", "- Please check a model shown in [espnet_model_zoo](https://github.com/espnet/espnet_model_zoo/blob/master/espnet_model_zoo/table.csv)\n", "\n", "- We can play with a demo based on these pre-trained models.\n", "\n", "- What we only need is to install `espnet_model_zoo`\n", "\n", "- Note that this `pip` based installation does not include training and so on. 
The full installation is explained later.\n", "\n", "- You can also find similar demos in the HuggingFace Hub https://huggingface.co/espnet\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "q0P87aVxnvx8" }, "source": [ "# It takes 1 minute.\n", "!pip install -q espnet_model_zoo" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "tbulk8kRkeYi" }, "source": [ "## Speech recognition demo\n", "\n", "Author: Jiatong Shi ([@ftshijt](https://github.com/ftshijt))\n", "\n", "### Model Selection\n", "\n", "- Please select the model shown in [espnet_model_zoo](https://github.com/espnet/espnet_model_zoo/blob/master/espnet_model_zoo/table.csv).\n", "\n", "- They are stored in Zenodo https://zenodo.org/communities/espnet or the HuggingFace Hub https://huggingface.co/espnet\n", "\n", "\n", "- In this demonstration, we will show English, Japanese, Spanish, Mandarin, and multilingual ASR models." ] }, { "cell_type": "code", "metadata": { "id": "1EOnS853kjb9" }, "source": [ "#@title Choose English ASR model { run: \"auto\" }\n", "\n", "lang = 'en'\n", "fs = 16000 #@param {type:\"integer\"}\n", "tag = 'Shinji Watanabe/spgispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_unnorm_bpe5000_valid.acc.ave' #@param [\"Shinji Watanabe/spgispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_unnorm_bpe5000_valid.acc.ave\", \"kamo-naoyuki/librispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_bpe5000_scheduler_confwarmup_steps40000_optim_conflr0.0025_sp_valid.acc.ave\"] {type:\"string\"}" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "ubyCqCz5mdUb" }, "source": [ "#@title Choose Japanese ASR model { run: \"auto\" }\n", "\n", "lang = 'ja'\n", "fs = 16000 #@param {type:\"integer\"}\n", "tag = 'Shinji Watanabe/laborotv_asr_train_asr_conformer2_latest33_raw_char_sp_valid.acc.ave' #@param [\"Shinji Watanabe/laborotv_asr_train_asr_conformer2_latest33_raw_char_sp_valid.acc.ave\"] {type:\"string\"}" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "WX6n2I_zmeWp" }, "source": [ "#@title Choose Spanish ASR model { run: \"auto\" }\n", "\n", "lang = 'es'\n", "fs = 16000 #@param {type:\"integer\"}\n", "tag = 'ftshijt/mls_asr_transformer_valid.acc.best' #@param [\"ftshijt/mls_asr_transformer_valid.acc.best\"] {type:\"string\"}" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Pf5cHlCjmiG_" }, "source": [ "#@title Choose Mandarin ASR model { run: \"auto\" }\n", "\n", "lang = 'zh'\n", "fs = 16000 #@param {type:\"integer\"}\n", "tag = 'Emiru Tsunoo/aishell_asr_train_asr_streaming_transformer_raw_zh_char_sp_valid.acc.ave' #@param [\"Emiru Tsunoo/aishell_asr_train_asr_streaming_transformer_raw_zh_char_sp_valid.acc.ave\"] {type:\"string\"}" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "5vrwJsF0mkDf" }, "source": [ "#@title Choose Multilingual ASR model { run: \"auto\" }\n", "\n", "lang = 'multilingual'\n", "fs = 16000 #@param {type:\"integer\"}\n", "tag = 'ftshijt/open_li52_asr_train_asr_raw_bpe7000_valid.acc.ave_10best' #@param [\"ftshijt/open_li52_asr_train_asr_raw_bpe7000_valid.acc.ave_10best\"] {type:\"string\"}" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "bdMq932Em7oF" }, "source": [ "### Model Setup" ] }, { "cell_type": "code", "metadata": { "id": "OxcQ0PNSnEZU" }, "source": [ "import time\n", "import torch\n", "import string\n", 
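"\n", "# ModelDownloader fetches the pre-trained model specified by `tag` from the ESPnet model zoo,\n", "# and Speech2Text wraps the downloaded model for end-to-end ASR inference\n",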
"from espnet_model_zoo.downloader import ModelDownloader\n", "from espnet2.bin.asr_inference import Speech2Text\n", "\n", "\n", "d = ModelDownloader()\n", "# It may takes a while to download and build models\n", "speech2text = Speech2Text(\n", " **d.download_and_unpack(tag),\n", " device=\"cuda\",\n", " minlenratio=0.0,\n", " maxlenratio=0.0,\n", " ctc_weight=0.3,\n", " beam_size=10,\n", " batch_size=0,\n", " nbest=1\n", ")\n", "\n", "def text_normalizer(text):\n", " text = text.upper()\n", " return text.translate(str.maketrans('', '', string.punctuation))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "wdIdVkjcnMKX" }, "source": [ "### Recognize our examples of pre-recorded samples" ] }, { "cell_type": "code", "metadata": { "id": "HHlgfuv9nPrW" }, "source": [ "!git clone https://github.com/ftshijt/ESPNet_asr_egs.git\n", "\n", "import pandas as pd\n", "import soundfile\n", "import librosa.display\n", "from IPython.display import display, Audio\n", "import matplotlib.pyplot as plt\n", "\n", "\n", "egs = pd.read_csv(\"ESPNet_asr_egs/egs.csv\")\n", "for index, row in egs.iterrows():\n", " if row[\"lang\"] == lang or lang == \"multilingual\":\n", " speech, rate = soundfile.read(\"ESPNet_asr_egs/\" + row[\"path\"])\n", " assert fs == int(row[\"sr\"])\n", " nbests = speech2text(speech)\n", "\n", " text, *_ = nbests[0]\n", " print(f\"Input Speech: ESPNet_asr_egs/{row['path']}\")\n", " # let us listen to samples\n", " display(Audio(speech, rate=rate))\n", " librosa.display.waveplot(speech, sr=rate)\n", " plt.show()\n", " print(f\"Reference text: {text_normalizer(row['text'])}\")\n", " print(f\"ASR hypothesis: {text_normalizer(text)}\")\n", " print(\"*\" * 50)\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "cYRenwS6nX9h" }, "source": [ "### Recognize your own live-recordings\n", "\n", "\n", "\n", "1. Record your own voice\n", "2. 
Recognize your voice with the ASR system\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "09RTZFIMnc3T" }, "source": [ "# from https://gist.github.com/korakot/c21c3476c024ad6d56d5f48b0bca92be\n", "\n", "from IPython.display import Javascript\n", "from google.colab import output\n", "from base64 import b64decode\n", "\n", "RECORD = \"\"\"\n", "const sleep = time => new Promise(resolve => setTimeout(resolve, time))\n", "const b2text = blob => new Promise(resolve => {\n", " const reader = new FileReader()\n", " reader.onloadend = e => resolve(e.srcElement.result)\n", " reader.readAsDataURL(blob)\n", "})\n", "var record = time => new Promise(async resolve => {\n", " stream = await navigator.mediaDevices.getUserMedia({ audio: true })\n", " recorder = new MediaRecorder(stream)\n", " chunks = []\n", " recorder.ondataavailable = e => chunks.push(e.data)\n", " recorder.start()\n", " await sleep(time)\n", " recorder.onstop = async ()=>{\n", " blob = new Blob(chunks)\n", " text = await b2text(blob)\n", " resolve(text)\n", " }\n", " recorder.stop()\n", "})\n", "\"\"\"\n", "\n", "def record(sec, filename='audio.wav'):\n", " display(Javascript(RECORD))\n", " s = output.eval_js('record(%d)' % (sec * 1000))\n", " b = b64decode(s.split(',')[1])\n", " with open(filename, 'wb+') as f:\n", " f.write(b)\n", "\n", "audio = 'audio.wav'\n", "second = 5\n", "print(f\"Speak to your microphone {second} sec...\")\n", "record(second, audio)\n", "print(\"Done!\")\n", "\n", "\n", "import librosa\n", "import librosa.display\n", "speech, rate = librosa.load(audio, sr=16000)\n", "librosa.display.waveplot(speech, sr=rate)\n", "\n", "import matplotlib.pyplot as plt\n", "plt.show()\n", "\n", "import pysndfile\n", "pysndfile.sndio.write('audio_ds.wav', speech, rate=rate, format='wav', enc='pcm16')\n", "\n", "from IPython.display import display, Audio\n", "display(Audio(speech, rate=rate))" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "uOWIMmE-nf2q" }, "source": [ "nbests = speech2text(speech)\n", "text, *_ = nbests[0]\n", "\n", "print(f\"ASR hypothesis: {text_normalizer(text)}\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "OddMx6mskjyI" }, "source": [ "## Speech synthesis demo\n", "\n", "This notebook provides a demonstration of the realtime E2E-TTS using ESPnet2-TTS and ParallelWaveGAN repo.\n", "\n", "- ESPnet2-TTS: https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1\n", "- ParallelWaveGAN: https://github.com/kan-bayashi/ParallelWaveGAN\n", "\n", "Author: Tomoki Hayashi ([@kan-bayashi](https://github.com/kan-bayashi))" ] }, { "cell_type": "markdown", "metadata": { "id": "sInrM9qvpt4T" }, "source": [ "### Installation" ] }, { "cell_type": "code", "metadata": { "id": "PXXbWM5Gko_U" }, "source": [ "# NOTE: pip shows imcompatible errors due to preinstalled libraries but you do not need to care\n", "# It takes 1 minute\n", "!pip install -q pyopenjtalk==0.1.5 parallel_wavegan==0.5.3 " ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "7mWwpjxSqy4Q" }, "source": [ "### Model Selection\n", "\n", "Please select model: English, Japanese, and Mandarin are supported.\n", "\n", "You can try end-to-end text2wav model & combination of text2mel and vocoder. 
\n", "If you use text2wav model, you do not need to use vocoder (automatically disabled).\n", "\n", "**Text2wav models**:\n", "- VITS\n", "\n", "**Text2mel models**:\n", "- Tacotron2\n", "- Transformer-TTS\n", "- (Conformer) FastSpeech\n", "- (Conformer) FastSpeech2\n", "\n", "**Vocoders**:\n", "- Parallel WaveGAN\n", "- Multi-band MelGAN\n", "- HiFiGAN\n", "- Style MelGAN.\n", "\n", "\n", "> The terms of use follow that of each corpus. We use the following corpora:\n", "- `ljspeech_*`: LJSpeech dataset \n", " - https://keithito.com/LJ-Speech-Dataset/\n", "- `jsut_*`: JSUT corpus\n", " - https://sites.google.com/site/shinnosuketakamichi/publication/jsut\n", "- `jvs_*`: JVS corpus + JSUT corpus\n", " - https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus\n", " - https://sites.google.com/site/shinnosuketakamichi/publication/jsut\n", "- `tsukuyomi_*`: つくよみちゃんコーパス + JSUT corpus\n", " - https://tyc.rei-yumesaki.net/material/corpus/\n", " - https://sites.google.com/site/shinnosuketakamichi/publication/jsut\n", "- `csmsc_*`: Chinese Standard Mandarin Speech Corpus\n", " - https://www.data-baker.com/open_source.html \n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "rwKv_KIprDtX" }, "source": [ "#@title Choose English model { run: \"auto\" }\n", "lang = 'English'\n", "tag = 'kan-bayashi/ljspeech_vits' #@param [\"kan-bayashi/ljspeech_tacotron2\", \"kan-bayashi/ljspeech_fastspeech\", \"kan-bayashi/ljspeech_fastspeech2\", \"kan-bayashi/ljspeech_conformer_fastspeech2\", \"kan-bayashi/ljspeech_vits\"] {type:\"string\"}\n", "vocoder_tag = \"none\" #@param [\"none\", \"parallel_wavegan/ljspeech_parallel_wavegan.v1\", \"parallel_wavegan/ljspeech_full_band_melgan.v2\", \"parallel_wavegan/ljspeech_multi_band_melgan.v2\", \"parallel_wavegan/ljspeech_hifigan.v1\", \"parallel_wavegan/ljspeech_style_melgan.v1\"] {type:\"string\"}" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "9NFmifKZrJQy" }, "source": [ "#@title Choose Japanese model { run: \"auto\" }\n", "lang = 'Japanese'\n", "tag = 'kan-bayashi/jsut_full_band_vits_prosody' #@param [\"kan-bayashi/jsut_tacotron2\", \"kan-bayashi/jsut_transformer\", \"kan-bayashi/jsut_fastspeech\", \"kan-bayashi/jsut_fastspeech2\", \"kan-bayashi/jsut_conformer_fastspeech2\", \"kan-bayashi/jsut_conformer_fastspeech2_accent\", \"kan-bayashi/jsut_conformer_fastspeech2_accent_with_pause\", \"kan-bayashi/jsut_vits_accent_with_pause\", \"kan-bayashi/jsut_full_band_vits_accent_with_pause\", \"kan-bayashi/jsut_tacotron2_prosody\", \"kan-bayashi/jsut_transformer_prosody\", \"kan-bayashi/jsut_conformer_fastspeech2_tacotron2_prosody\", \"kan-bayashi/jsut_vits_prosody\", \"kan-bayashi/jsut_full_band_vits_prosody\", \"kan-bayashi/jvs_jvs010_vits_prosody\", \"kan-bayashi/tsukuyomi_full_band_vits_prosody\"] {type:\"string\"}\n", "vocoder_tag = 'none' #@param [\"none\", \"parallel_wavegan/jsut_parallel_wavegan.v1\", \"parallel_wavegan/jsut_multi_band_melgan.v2\", \"parallel_wavegan/jsut_style_melgan.v1\", \"parallel_wavegan/jsut_hifigan.v1\"] {type:\"string\"}" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "1o174YeGrNM8" }, "source": [ "#@title Choose Mandarin model { run: \"auto\" }\n", "lang = 'Mandarin'\n", "tag = 'kan-bayashi/csmsc_full_band_vits' #@param [\"kan-bayashi/csmsc_tacotron2\", \"kan-bayashi/csmsc_transformer\", \"kan-bayashi/csmsc_fastspeech\", \"kan-bayashi/csmsc_fastspeech2\", \"kan-bayashi/csmsc_conformer_fastspeech2\", \"kan-bayashi/csmsc_vits\", 
\"kan-bayashi/csmsc_full_band_vits\"] {type: \"string\"}\n", "vocoder_tag = \"none\" #@param [\"none\", \"parallel_wavegan/csmsc_parallel_wavegan.v1\", \"parallel_wavegan/csmsc_multi_band_melgan.v2\", \"parallel_wavegan/csmsc_hifigan.v1\", \"parallel_wavegan/csmsc_style_melgan.v1\"] {type:\"string\"}" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Q4oIJ9tcrRE_" }, "source": [ "### Model Setup" ] }, { "cell_type": "code", "metadata": { "id": "w0bV4WkdrRzu" }, "source": [ "from espnet2.bin.tts_inference import Text2Speech\n", "from espnet2.utils.types import str_or_none\n", "\n", "text2speech = Text2Speech.from_pretrained(\n", " model_tag=str_or_none(tag),\n", " vocoder_tag=str_or_none(vocoder_tag),\n", " device=\"cuda\",\n", " # Only for Tacotron 2 & Transformer\n", " threshold=0.5,\n", " # Only for Tacotron 2\n", " minlenratio=0.0,\n", " maxlenratio=10.0,\n", " use_att_constraint=False,\n", " backward_window=1,\n", " forward_window=3,\n", " # Only for FastSpeech & FastSpeech2 & VITS\n", " speed_control_alpha=1.0,\n", " # Only for VITS\n", " noise_scale=0.667,\n", " noise_scale_dur=0.8,\n", ")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "lW89JIQdrUTl" }, "source": [ "### Synthesis" ] }, { "cell_type": "code", "metadata": { "id": "dz7RSJunraqd" }, "source": [ "import time\n", "import torch\n", "\n", "# decide the input sentence by yourself\n", "print(f\"Input your favorite sentence in {lang}.\")\n", "x = input()\n", "\n", "# synthesis\n", "with torch.no_grad():\n", " start = time.time()\n", " wav = text2speech(x)[\"wav\"]\n", "rtf = (time.time() - start) / (len(wav) / text2speech.fs)\n", "print(f\"RTF = {rtf:5f}\")\n", "\n", "# let us listen to generated samples\n", "from IPython.display import display, Audio\n", "display(Audio(wav.view(-1).cpu().numpy(), rate=text2speech.fs))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "WXmXR3LMkn4V" }, "source": [ "## Speech enhancement demo\n", "\n", "- ESPnet2-SE: https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/enh1\n", "\n", "Author: Chenda Li ([@LiChenda](https://github.com/LiChenda)), Wangyou Zhang ([@Emrys365](https://github.com/Emrys365))\n" ] }, { "cell_type": "markdown", "metadata": { "id": "vmf2MSENsF-V" }, "source": [ "### Single-Channel Enhancement, the CHiME example\n" ] }, { "cell_type": "code", "metadata": { "id": "CffTLkNLkwnT" }, "source": [ "# Download one utterance from real noisy speech of CHiME4\n", "!gdown --id 1SmrN5NFSg6JuQSs2sfy3ehD8OIcqK6wS -O /content/M05_440C0213_PED_REAL.wav\n", "import os\n", "\n", "import soundfile\n", "from IPython.display import display, Audio\n", "mixwav_mc, sr = soundfile.read(\"/content/M05_440C0213_PED_REAL.wav\")\n", "# mixwav.shape: num_samples, num_channels\n", "mixwav_sc = mixwav_mc[:,4]\n", "display(Audio(mixwav_mc.T, rate=sr))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "80aufSw8sQuJ" }, "source": [ "#### Download and load the pretrained Conv-Tasnet\n" ] }, { "cell_type": "code", "metadata": { "id": "NDCM54U_sUdk" }, "source": [ "!gdown --id 17DMWdw84wF3fz3t7ia1zssdzhkpVQGZm -O /content/chime_tasnet_singlechannel.zip\n", "!unzip /content/chime_tasnet_singlechannel.zip -d /content/enh_model_sc" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "O9KM2dyXsa4n" }, "source": [ "# Load the model\n", "# If you encounter error \"No module named 'espnet2'\", please 
re-run the 1st Cell. This might be a colab bug.\n", "import sys\n", "import soundfile\n", "from espnet2.bin.enh_inference import SeparateSpeech\n", "\n", "\n", "separate_speech = {}\n", "# For models downloaded from GoogleDrive, you can use the following script:\n", "enh_model_sc = SeparateSpeech(\n", " train_config=\"/content/enh_model_sc/exp/enh_train_enh_conv_tasnet_raw/config.yaml\",\n", " model_file=\"/content/enh_model_sc/exp/enh_train_enh_conv_tasnet_raw/5epoch.pth\",\n", " # for segment-wise process on long speech\n", " normalize_segment_scale=False,\n", " show_progressbar=True,\n", " ref_channel=4,\n", " normalize_output_wav=True,\n", " device=\"cuda:0\",\n", ")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "5L2M_Un-seKS" }, "source": [ "#### Enhance the single-channel real noisy speech in CHiME4\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "Nmrvb9ejthHz" }, "source": [ "# play the enhanced single-channel speech\n", "wave = enh_model_sc(mixwav_sc[None, ...], sr)\n", "print(\"Input real noisy speech\", flush=True)\n", "display(Audio(mixwav_sc, rate=sr))\n", "print(\"Enhanced speech\", flush=True)\n", "display(Audio(wave[0].squeeze(), rate=sr))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "qTJUKl90kw7l" }, "source": [ "### Speech Separation" ] }, { "cell_type": "markdown", "metadata": { "id": "7I0a87Est2bQ" }, "source": [ "\n", "#### Model Selection\n", "\n", "Please select model shown in [espnet_model_zoo](https://github.com/espnet/espnet_model_zoo/blob/master/espnet_model_zoo/table.csv)\n", "\n", "In this demonstration, we will show different speech separation models on wsj0_2mix.\n" ] }, { "cell_type": "code", "metadata": { "id": "1aGbjmpkkyqp" }, "source": [ "#@title Choose Speech Separation model { run: \"auto\" }\n", "\n", "fs = 8000 #@param {type:\"integer\"}\n", "tag = \"Chenda Li/wsj0_2mix_enh_train_enh_conv_tasnet_raw_valid.si_snr.ave\" #@param [\"Chenda Li/wsj0_2mix_enh_train_enh_conv_tasnet_raw_valid.si_snr.ave\", \"Chenda Li/wsj0_2mix_enh_train_enh_rnn_tf_raw_valid.si_snr.ave\", \"https://zenodo.org/record/4688000/files/enh_train_enh_dprnn_tasnet_raw_valid.si_snr.ave.zip\"]" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "LQ-QUynlt7bR" }, "source": [ "# For models uploaded to Zenodo, you can use the following python script instead:\n", "import sys\n", "import soundfile\n", "from espnet_model_zoo.downloader import ModelDownloader\n", "from espnet2.bin.enh_inference import SeparateSpeech\n", "\n", "d = ModelDownloader()\n", "\n", "cfg = d.download_and_unpack(tag)\n", "separate_speech = SeparateSpeech(\n", " train_config=cfg[\"train_config\"],\n", " model_file=cfg[\"model_file\"],\n", " # for segment-wise process on long speech\n", " segment_size=2.4,\n", " hop_size=0.8,\n", " normalize_segment_scale=False,\n", " show_progressbar=True,\n", " ref_channel=None,\n", " normalize_output_wav=True,\n", " device=\"cuda:0\",\n", ")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "QM-KLOYKuM7G" }, "source": [ "#### Separate the example in wsj0_2mix testing set" ] }, { "cell_type": "code", "metadata": { "id": "w-OosNQjuPE3" }, "source": [ "!gdown --id 1ZCUkd_Lb7pO2rpPr4FqYdtJBZ7JMiInx -O /content/447c020t_1.2106_422a0112_-1.2106.wav\n", "\n", "import os\n", "import soundfile\n", "from IPython.display import display, Audio\n", "\n", "mixwav, sr = 
soundfile.read(\"/content/447c020t_1.2106_422a0112_-1.2106.wav\")\n", "waves_wsj = separate_speech(mixwav[None, ...], fs=sr)\n", "\n", "print(\"Input mixture\", flush=True)\n", "display(Audio(mixwav, rate=sr))\n", "print(f\"========= Separated speech with model {tag} =========\", flush=True)\n", "print(\"Separated spk1\", flush=True)\n", "display(Audio(waves_wsj[0].squeeze(), rate=sr))\n", "print(\"Separated spk2\", flush=True)\n", "display(Audio(waves_wsj[1].squeeze(), rate=sr))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "S0LaeC3RzECk" }, "source": [ "# Full installation\n", "\n", "- This is a full installation method to perform data preprocess, training, inference, scoring, and so on. for various experiments.\n", "\n", "- We prepare various ways of installations. We also prepare a [docker image](https://github.com/espnet/espnet/blob/master/docker/README.md) as well.\n", "\n", "- See https://espnet.github.io/espnet/installation.html#step-2-installation-espnet for more details.\n", "\n", "**Installation of required tools**\n", "\n", "See https://espnet.github.io/espnet/installation.html#requirements for more details.\n" ] }, { "cell_type": "code", "metadata": { "id": "35x4ge-JylTM" }, "source": [ "# It takes ~10 seconds\n", "!sudo apt-get install cmake sox libsndfile1-dev" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "n1G9slDo0AuF" }, "source": [ "**Download espnet**\n", "\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "St7lke2P0GUP" }, "source": [ "# It takes a few seconds\n", "!git clone --depth 5 https://github.com/espnet/espnet" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "lZGnBSEaz1Zt" }, "source": [ "**Setup Python environment based on anaconda**\n", "\n", "There are several other installation methods, but **we highly recommend the anaconda-based one**.\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "4F80yqAIz86B" }, "source": [ "# It takes 30 seconds\n", "%cd /content/espnet/tools\n", "!./setup_anaconda.sh anaconda espnet 3.8" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Wd-_lSQv1ML4" }, "source": [ "**Install espnet**\n", "\n", "This includes the installation of PyTorch and other tools.\n", "\n", "We just specify CUDA_VERSION=10.2 for the latest PyTorch (1.9.0)\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "i_E98PPZ1PbB" }, "source": [ "# It may take ~8 minutes\n", "%cd /content/espnet/tools\n", "!make CUDA_VERSION=10.2" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "mLwMHc7J3gxW" }, "source": [ "**Install other speech processing tools**\n", "\n", "We install NIST SCTK toolkit for scoring\n", "\n", "Please manually install other tools if needed.\n" ] }, { "cell_type": "code", "metadata": { "id": "U8e2gMdp3ll3" }, "source": [ "%cd /content/espnet/tools\n", "!./installers/install_sctk.sh" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "YEMCSyUi4Nw0" }, "source": [ "**Check installation**\n", "\n", "Please check whether torch, torch cuda, and espnet are correctly installed.\n", "\n", "If torch, torch cuda, and espnet are successfully installed, it would be no problem.\n", "\n", "```\n", "[x] torch=1.9.0\n", "[x] torch cuda=10.2\n", ":\n", "[x] espnet=0.10.3a3\n", "```" ] }, { "cell_type": "code", "metadata": { "id": "nHclDIXA4SjH" }, "source": [ "%cd /content/espnet/tools\n", 
"!. ./activate_python.sh; python3 check_install.py" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "41alrKGO4d3v" }, "source": [ "# Run a recipe example\n", "\n", "ESPnet has a number of recipes (73 recipes on Sep. 16, 2021).\n", "Let's first check https://github.com/espnet/espnet/blob/master/egs2/README.md\n", "\n", "Please also check the general usage of the recipe in https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "H6uBU3Mahsaj" }, "source": [ "**CMU AN4 recipe**\n", "\n", "In this tutorial, we use the CMU an4 recipe.\n", "This is a small-scale speech recognition task mainly used for testing.\n", "\n", "First, move to the recipe directory" ] }, { "cell_type": "code", "metadata": { "id": "GO2hG6CZ4er5" }, "source": [ "%cd /content/espnet/egs2/an4/asr1\n", "!ls" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "nxy5AxZtwBAp" }, "source": [ "```\n", "egs2/an4/asr1/\n", " - conf/ # Configuration files for training, inference, etc.\n", " - scripts/ # Bash utilities of espnet2\n", " - pyscripts/ # Python utilities of espnet2\n", " - steps/ # From Kaldi utilities\n", " - utils/ # From Kaldi utilities\n", " - db.sh # The directory path of each corpora\n", " - path.sh # Setup script for environment variables\n", " - cmd.sh # Configuration for your backend of job scheduler\n", " - run.sh # Entry point\n", " - asr.sh # Invoked by run.sh\n", " ```" ] }, { "cell_type": "markdown", "metadata": { "id": "v9h_fs_9wb0L" }, "source": [ "ESPnet is designed for various use cases (local machines or cluster machines) based on Kaldi tools. If you use it in the cluster machines, please also check https://kaldi-asr.org/doc/queue.html\n", "\n", "The main stages can be parallelized by various jobs." ] }, { "cell_type": "code", "metadata": { "id": "7JZqgQEywL16" }, "source": [ "!cat run.sh" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Msp-8cLBg0Zs" }, "source": [ "`run.sh` can call `asr.sh`, which completes the entire speech recognition experiments, including data preparation, training, inference, and scoring. They are based on separate stages (totally 15 stages).\n", "\n", "Instead of executing the entire experiments by `run.sh`, the following example executes the experiment for each stage to understand the process in each stage." ] }, { "cell_type": "markdown", "metadata": { "id": "GJUcVDYB40A-" }, "source": [ "## data preparation\n", "\n", "**Stage 1: Data preparation for training, validation, and evaluation data**\n", "\n", "Note that `--stage ` is to start the stage and `--stop_stage ` is to stop the stage.\n", "We also need to specify training, validation, and test data. 
" ] }, { "cell_type": "code", "metadata": { "id": "gLDwMc4G4x1C" }, "source": [ "# 30 seconds\n", "!./asr.sh --stage 1 --stop_stage 1 --train_set train_nodev --valid_set train_dev --test_sets \"train_dev test\"" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "9ORceFPOkLQJ" }, "source": [ "After this stage is finished, please check the `data` directory" ] }, { "cell_type": "code", "metadata": { "id": "iLY4zuPFiAWK" }, "source": [ "!ls data" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "S85-3X82kWbm" }, "source": [ "In this recipe, we use `train_nodev` as a training set, `train_dev` as a validation set (monitor the training progress by checking the validation score). We also use (reuse) `test` and `train_dev` sets for the final speech recognition evaluation.\n", "\n", "Let's check one of the training data directory:\n" ] }, { "cell_type": "code", "metadata": { "id": "OyAbGjDElKFA" }, "source": [ "!ls -1 data/train_nodev/" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Mob1Pd_ylPyb" }, "source": [ "These are the speech and corresponding text and speaker information based on the Kaldi format. Please also check https://kaldi-asr.org/doc/data_prep.html\n", "```\n", "spk2utt # Speaker information\n", "text # Transcription file\n", "utt2spk # Speaker information\n", "wav.scp # Audio file\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "id": "kVom1NvZ6Mnx" }, "source": [ "**Stage 2: Speed perturbation** (one of the data augmentation methods)\n", "\n", "We do not use speed perturbation for this demo. But you can turn it on by adding an argument `--speed_perturb_factors \"0.9 1.0 1.1\"` to the shell script" ] }, { "cell_type": "code", "metadata": { "id": "hoYaonp96M04" }, "source": [ "!./asr.sh --stage 2 --stop_stage 2 --train_set train_nodev --valid_set train_dev --test_sets \"train_dev test\"" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "9EAZ0E_s6Zdy" }, "source": [ "**Stage 3: Format wav.scp: data/ -> dump/raw**\n", "\n", "We dump the data with specified format (flac in this case) for the efficient use of the data.\n", "\n", "Note that `--nj ` means the number of CPU jobs. Please set it appropriately by considering your CPU resources and disk access." ] }, { "cell_type": "code", "metadata": { "id": "1OiHotER6cwZ" }, "source": [ "# 30 seconds\n", "!./asr.sh --stage 3 --stop_stage 3 --train_set train_nodev --valid_set train_dev --test_sets \"train_dev test\" --nj 4" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "_P3gdQMR7H88" }, "source": [ "**Stage 4: Remove long/short data: dump/raw/org -> dump/raw**\n", "\n", "There are too long and too short audio data, which are harmful for our efficient training. Those data are removed from the list." ] }, { "cell_type": "code", "metadata": { "id": "_ociO4Nx7Ia9" }, "source": [ "!./asr.sh --stage 4 --stop_stage 4 --train_set train_nodev --valid_set train_dev --test_sets \"train_dev test\"" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "jtrtBRPe7Ygz" }, "source": [ "**Stage 5: Generate token_list from dump/raw/train_nodev/text using BPE.**\n", "\n", "This is important for text processing. We make a dictionary based on the English character in this example.\n", "We use a `sentencepiece` toolkit developed by Google." 
] }, { "cell_type": "code", "metadata": { "id": "6sTbbss77Y26" }, "source": [ "!./asr.sh --stage 5 --stop_stage 5 --train_set train_nodev --valid_set train_dev --test_sets \"train_dev test\"" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "0k1pSh1wyUJc" }, "source": [ "Let's check the content of the dictionary. There are several special symbols, e.g.,\n", "\n", "```\n", " used for CTC\n", " unknown symbols do not appear in the training data\n", " start and end sentence symbols\n", "```\n" ] }, { "cell_type": "code", "metadata": { "id": "YBEVQ9eOdeGP" }, "source": [ "!cat data/token_list/bpe_unigram30/tokens.txt" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "_Y5IlGWz7sXd" }, "source": [ "## language modeling (skip in this tutorial)\n", "\n", "**Stages 6--9: Stages related to language modeling.**\n", "\n", "We skip the language modeling part in the recipe (stages 6 -- 9) in this tutorial." ] }, { "cell_type": "markdown", "metadata": { "id": "OASA_sOQ71M6" }, "source": [ "## End-to-end ASR\n", "\n", "**Stage 10: ASR collect stats**: train_set=dump/raw/train_nodev, valid_set=dump/raw/train_dev\n", "\n", "We estimate the mean and variance of the data to normalize the data. We also collect the information of input and output lengths for the efficient mini batch creation.\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "QMtMM9hu7s1a" }, "source": [ "# 18 seconds\n", "!./asr.sh --stage 10 --stop_stage 10 --train_set train_nodev --valid_set train_dev --test_sets \"train_dev test\" --nj 4" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "S7Bse9Sd8Vu3" }, "source": [ "**Stage 11: ASR Training:** train_set=dump/raw/train_nodev, valid_set=dump/raw/train_dev\n", "\n", "Main training loop. \n", "\n", "Please also monitor the following files\n", "- log file /content/espnet/egs2/an4/asr1/exp/asr_train_raw_bpe30/train.log \n", "- loss /content/espnet/egs2/an4/asr1/exp/asr_train_raw_bpe30/images/loss.png\n", "- accuracy /content/espnet/egs2/an4/asr1/exp/asr_train_raw_bpe30/images/acc.png" ] }, { "cell_type": "code", "metadata": { "id": "LqoH1IcW8WA9" }, "source": [ "# It would take 20-30 min.\n", "!./asr.sh --stage 11 --stop_stage 11 --train_set train_nodev --valid_set train_dev --test_sets \"train_dev test\" --ngpu 1" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "1a-NSbgoPrvp" }, "source": [ "**Stage 12**: Decoding: training_dir=exp/asr_train_raw_bpe30\n", "\n", "Note that we need to make `--use_lm false` since we skip the language model.\n", "\n", "`inference_nj ` specifies the number of inference jobs \n", "\n", "Let's monitor the log /content/espnet/egs2/an4/asr1/exp/asr_train_raw_bpe30/inference_asr_model_valid.acc.ave/train_dev/logdir/asr_inference.1.log" ] }, { "cell_type": "code", "metadata": { "id": "Upma_ZWmPrdw" }, "source": [ "# It would take ~10 minutes\n", "!./asr.sh --inference_nj 4 --stage 12 --stop_stage 12 --train_set train_nodev --valid_set train_dev --test_sets \"train_dev test\" --use_lm false" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "R44cm0UfTFOD" }, "source": [ "**Stage 13: Scoring**\n", "\n", "You can find word error rate (WER), character error rate (CER), etc. for each test set." 
] }, { "cell_type": "code", "metadata": { "id": "cyHIldPNTFt1" }, "source": [ "!./asr.sh --stage 13 --stop_stage 13 --train_set train_nodev --valid_set train_dev --test_sets \"train_dev test\" --use_lm false" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "jnkqczeZ9O2n" }, "source": [ "You can also check the break down of the word error rate in /content/espnet/egs2/an4/asr1/exp/asr_train_raw_bpe30/inference_asr_model_valid.acc.ave/train_dev/score_wer/result.txt" ] }, { "cell_type": "markdown", "metadata": { "id": "7iWoWZTIzDQh" }, "source": [ "## How to change the training configs?\n", "\n", "### config file based\n", "All training options are changed by using a config file.\n", "\n", "Pleae check https://espnet.github.io/espnet/espnet2_training_option.html\n", "\n", "Let's first check config files prepared in the `an4` recipe\n", "\n", "```\n", "- LSTM-based E2E ASR /content/espnet/egs2/an4/asr1/conf/train_asr_rnn.yaml\n", "- Transformer based E2E ASR /content/espnet/egs2/an4/asr1/conf/train_asr_transformer.yaml\n", "```\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "iDGGJGBs8Go1" }, "source": [ "You can run\n", "\n", "**RNN**\n", "```\n", "./asr.sh --stage 10 \\\n", " --train_set train_nodev \\ \n", " --valid_set train_dev \\\n", " --test_sets \"train_dev test\" \\\n", " --nj 4 \\\n", " --inference_nj 4 \\\n", " --use_lm false \\\n", " ----asr_config conf/train_asr_rnn.yaml \n", "```\n", "\n", "**Transformer**\n", "```\n", "./asr.sh --stage 10 \\\n", " --train_set train_nodev \\ \n", " --valid_set train_dev \\\n", " --test_sets \"train_dev test\" \\\n", " --nj 4 \\\n", " --inference_nj 4 \\\n", " --use_lm false \\\n", " ----asr_config conf/train_asr_transformer.yaml\n", "```\n" ] }, { "cell_type": "markdown", "metadata": { "id": "wcMGLh1-HYwr" }, "source": [ "You can also find various configs in `espnet/egs2/*/asr1/conf/`, including \n", "- Conformer `espnet/egs2/librispeech/asr1/conf/train_asr_confformer.yaml`\n", "- Wav2vec2.0 pre-trained model and fine-tuning `https://github.com/espnet/espnet/blob/master/egs2/librispeech/asr1/conf/tuning/train_asr_conformer7_wav2vec2_960hr_large.yaml`\n", "- HuBERT pre-trained model and fine-tuning `https://github.com/espnet/espnet/blob/master/egs2/librispeech/asr1/conf/tuning/train_asr_conformer7_hubert_960hr_large.yaml`" ] }, { "cell_type": "markdown", "metadata": { "id": "rYzNLITz7wyG" }, "source": [ "### command line argument based\n", "\n", "You can also customize it by editing the file or passing the command line arguments, e.g., \n", "\n", "```\n", "./run.sh --stage 10 --asr_args \"--model_conf ctc_weight=0.3\"\n", "```\n", "```\n", "./run.sh --stage 10 --asr_args \"--optim_conf lr=0.1\"\n", "```\n", "\n", "See https://espnet.github.io/espnet/espnet2_tutorial.html#change-the-configuration-for-training" ] }, { "cell_type": "markdown", "metadata": { "id": "tvNphJbcCQUA" }, "source": [ "## How to make a new recipe?\n", "\n", "- Check https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE" ] } ] }