{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "kNbxe3gyfbPi" }, "source": [ "# CMU 11492/11692 Spring 2023: Data preparation\n", "\n", "In this demonstration, we will show you the procedure to prepare the data for speech processing (ASR as an example).\n", "\n", "Main references:\n", "- [ESPnet repository](https://github.com/espnet/espnet)\n", "- [ESPnet documentation](https://espnet.github.io/espnet/)\n", "- [ESPnet tutorial in Speech Recognition and Understanding (Fall 2021)](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tutorial_2021_CMU_11751_18781.ipynb)\n", "- [Recitation in Multilingual NLP (Spring 2022)](https://colab.research.google.com/drive/1tY6PxF_M5Nx5n488x0DrpujJOyqW-ATi?usp=sharing)\n", "- [ESPnet tutorail in Speech Recognition and Understanding (Fall 2022)](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_recipe_tutorial_CMU_11751_18781_Fall2022.ipynb)\n", "\n", "Author: \n", "- Jiatong Shi (jiatongs@andrew.cmu.edu)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "wl9JFMNJ5iYu" }, "source": [ "## Objectives\n", "After this demonstration, you are expected to know:\n", "- Understand the Kaldi(ESPnet) data format\n" ] }, { "cell_type": "markdown", "metadata": { "id": "gGg1N9jufpf2" }, "source": [ "## Useful links\n", "\n", "- Installation https://espnet.github.io/espnet/installation.html\n", "- Kaldi Data format https://kaldi-asr.org/doc/data_prep.html\n", "- ESPnet data format https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE#about-kaldi-style-data-directory\n" ] }, { "cell_type": "markdown", "metadata": { "id": "n1G9slDo0AuF" }, "source": [ "## Download ESPnet\n", "\n", "We use `git clone` to download the source code of ESPnet and then go to a specific commit." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "St7lke2P0GUP" }, "outputs": [], "source": [ "# It takes a few seconds\n", "!git clone --depth 5 https://github.com/espnet/espnet\n", "\n", "# We use a specific commit just for reproducibility.\n", "%cd /content/espnet\n", "!git checkout 3970558fbbe38d7b7e9922b08a9aa249390d4fb7" ] }, { "cell_type": "markdown", "metadata": { "id": "lZGnBSEaz1Zt" }, "source": [ "## Setup Python environment based on anaconda\n", "\n", "There are several other installation methods, but **we highly recommend the anaconda-based one**. In this demonstration, we will only need to have the python environment (no need to install the full espnet). But installation of ESPnet main codebase will be necessary for for training/inference/scoring.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4F80yqAIz86B" }, "outputs": [], "source": [ "# It takes 30 seconds\n", "%cd /content/espnet/tools\n", "!./setup_anaconda.sh anaconda espnet 3.9\n", "\n", "!./installers/install_sph2pipe.sh" ] }, { "cell_type": "markdown", "metadata": { "id": "8HXoF3p3qiob" }, "source": [ "We will also install some essential python libraries (these will be auto-matically downloaded during espnet installation. However, today, we won't go through that part, so we need to mannually install the packages." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EC3ft77jqfVw" }, "outputs": [], "source": [ "!pip install kaldiio soundfile tqdm librosa matplotlib IPython" ] }, { "cell_type": "markdown", "metadata": { "id": "YiS6fVykh7UY" }, "source": [ "We will also need Kaldi for some essential scripts." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zeWlcS06h-5G" }, "outputs": [], "source": [ "!git clone https://github.com/kaldi-asr/kaldi.git" ] }, { "cell_type": "markdown", "metadata": { "id": "41alrKGO4d3v" }, "source": [ "# Data preparation in ESPnet\n", "\n", "ESPnet has a number of recipes (146 recipes on Jan. 23, 2023). One of the most important steps for those recipes is the preparation of the data. Constructing in different scenarios, spoken corpora need to be converted into a unified format. In ESPnet, we follow and adapt the Kaldi data format for various tasks.\n", "\n", "In this demonstration, we will focus on a specific recipe `an4` as an example.\n", "\n", "\n", "Other materials:\n", "- Kaldi format documentation can be found in https://kaldi-asr.org/doc/data_prep.html\n", "- ESPnet data format is in https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE#about-kaldi-style-data-directory\n", "- Please refer to https://github.com/espnet/espnet/blob/master/egs2/README.md for a complete list of recipes.\n", "- Please also check the general usage of the recipe in https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "GJUcVDYB40A-" }, "source": [ "## Data preparation for AN4\n", "\n", "All the data preparation in ESPnet2 happens in `egs2/recipe_name/task/local/data.sh` where the task can be either `asr1`, `enh1`, `tts1`, etc.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "H6uBU3Mahsaj" }, "source": [ "**CMU AN4 recipe**\n", "\n", "In this demonstration, we will use the CMU `an4` recipe.\n", "This is a small-scale speech recognition task mainly used for testing.\n", "\n", "First, let's go to the recipe directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "background_save": true }, "id": "GO2hG6CZ4er5" }, "outputs": [], "source": [ "%cd /content/espnet/egs2/an4/asr1\n", "!ls" ] }, { "cell_type": "markdown", "metadata": { "id": "nxy5AxZtwBAp" }, "source": [ "```\n", "egs2/an4/asr1/\n", " - conf/ # Configuration files for training, inference, etc.\n", " - scripts/ # Bash utilities of espnet2\n", " - pyscripts/ # Python utilities of espnet2\n", " - steps/ # From Kaldi utilities\n", " - utils/ # From Kaldi utilities\n", " - local/ # Some local scripts for specific recipes (Data Preparation usually in `local/data.sh`)\n", " - db.sh # The directory path of each corpora\n", " - path.sh # Setup script for environment variables\n", " - cmd.sh # Configuration for your backend of job scheduler\n", " - run.sh # Entry point\n", " - asr.sh # Invoked by run.sh\n", " ```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gLDwMc4G4x1C" }, "outputs": [], "source": [ "# a few seconds\n", "!./local/data.sh" ] }, { "cell_type": "markdown", "metadata": { "id": "9ORceFPOkLQJ" }, "source": [ "The orginal data usually in various format. AN4 has a quite straightforward format. You may dig into the folder `an4` to see the raw format. After this preparation is finished, all the information will be in the `data` directory:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iLY4zuPFiAWK" }, "outputs": [], "source": [ "!ls data" ] }, { "cell_type": "markdown", "metadata": { "id": "S85-3X82kWbm" }, "source": [ "In this recipe, we use `train_nodev` as a training set, `train_dev` as a validation set (monitor the training progress by checking the validation score). 
, { "cell_type": "markdown", "metadata": { "id": "tuct2z9pig3Y" }, "source": [ "The `wav.scp` is the most important file, as it holds the speech data. Each line of `wav.scp` generally has two components: `WAV_ID` and `SPEECH_AUDIO`. The `WAV_ID` is an identifier for the recording, while the `SPEECH_AUDIO` holds the speech audio data.\n", "\n", "The audio data can be in various formats, such as `wav`, `flac`, `sph`, etc. We can also use a pipe to normalize audio files on the fly (e.g., with `sox`, `ffmpeg`, or `sph2pipe`). The following from AN4 is an example using `sph2pipe`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "jsuYBCvFirWM" }, "outputs": [], "source": [ "!head -n 10 data/train_nodev/wav.scp" ] },
{ "cell_type": "markdown", "metadata": { "id": "Cid5sWSsi2p8" }, "source": [ "The `text` file holds the transcription of the speech. Similar to `wav.scp`, each line of `text` has a `UTT_ID` and a `TRANSCRIPTION`. Note that the `UTT_ID` in `text` and the `WAV_ID` in `wav.scp` are not necessarily the same, but in simple cases (e.g., `AN4`) we regard them as the same. The example in `AN4` is as follows:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "vLTi2dlulLc4" }, "outputs": [], "source": [ "!head -n 10 data/train_nodev/text" ] },
{ "cell_type": "markdown", "metadata": { "id": "bKHpTxHzlZcB" }, "source": [ "The `spk2utt` and `utt2spk` files are mappings between utterances and speakers. This information is widely used in conventional hidden Markov model (HMM)-based ASR systems, but is not that popular in end-to-end ASR systems nowadays. However, the files are still very important for tasks such as speaker diarization and multi-speaker text-to-speech. The examples from AN4 are as follows:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "colab": { "background_save": true }, "id": "jhlhvX4Xl-0Y" }, "outputs": [], "source": [ "!head -n 10 data/train_nodev/spk2utt\n", "!echo \"--------------------------\"\n", "!head -n 10 data/train_nodev/utt2spk" ] }
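, { "cell_type": "markdown", "metadata": {}, "source": [ "The two files carry the same information in opposite directions, so one can always be derived from the other (Kaldi ships `utils/utt2spk_to_spk2utt.pl` for this). A minimal Python sketch of the conversion, just for illustration:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch of deriving spk2utt from utt2spk\n", "# (Kaldi's utils/utt2spk_to_spk2utt.pl serves the same purpose).\n", "from collections import defaultdict\n", "\n", "spk2utt = defaultdict(list)\n", "with open(\"data/train_nodev/utt2spk\") as f:\n", "    for line in f:\n", "        utt_id, spk_id = line.split()\n", "        spk2utt[spk_id].append(utt_id)\n", "\n", "for spk_id, utt_ids in list(spk2utt.items())[:3]:\n", "    print(spk_id, \" \".join(utt_ids))" ] }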
, { "cell_type": "markdown", "metadata": { "id": "lLUSQzg9X4zG" }, "source": [ "## How to read a file in a pipe\n", "\n", "We can use the `kaldiio` package to read audio files from `wav.scp`, including entries written as pipes. The example is as follows:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "colab": { "background_save": true }, "id": "ym7Mviy-YKSJ" }, "outputs": [], "source": [ "import soundfile\n", "import kaldiio\n", "import matplotlib.pyplot as plt\n", "from io import BytesIO\n", "from tqdm import tqdm\n", "import librosa.display\n", "import numpy as np\n", "import IPython.display as ipd\n", "import os\n", "\n", "# Make sph2pipe visible to the pipe commands in wav.scp\n", "os.environ['PATH'] = os.environ['PATH'] + \":/content/espnet/tools/sph2pipe\"\n", "\n", "wavscp = open(\"data/test/wav.scp\", \"r\")\n", "\n", "num_wav = 5\n", "count = 1\n", "for line in tqdm(wavscp):\n", "    utt_id, wavpath = line.strip().split(None, 1)\n", "    # open_like_kaldi handles both plain paths and pipe commands\n", "    with kaldiio.open_like_kaldi(wavpath, \"rb\") as f:\n", "        with BytesIO(f.read()) as g:\n", "            wave, rate = soundfile.read(g, dtype=np.float32)\n", "            print(\"audio: {}\".format(utt_id))\n", "            librosa.display.waveshow(wave, sr=rate)\n", "            plt.show()\n", "\n", "            ipd.display(ipd.Audio(wave, rate=rate))  # playback from a NumPy array\n", "    if count == num_wav:\n", "        break\n", "    count += 1" ] },
{ "cell_type": "markdown", "metadata": { "id": "q0iJF3B0muas" }, "source": [ "## Data preparation for TOTONAC\n" ] },
{ "cell_type": "markdown", "metadata": { "id": "dh1Tzx8umbZb" }, "source": [ "**CMU TOTONAC recipe**\n", "\n", "In the second part of the demonstration, we will use the CMU `totonac` recipe.\n", "This is a small-scale ASR recipe for Totonac, an endangered language spoken in central Mexico. We will mostly follow the same procedure as in the AN4 showcase. To start, the recipe directory of `totonac` is almost the same as that of `an4`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "colab": { "background_save": true }, "id": "Kb6txR9cnQGT" }, "outputs": [], "source": [ "%cd /content/espnet/egs2/totonac/asr1\n", "!ls" ] },
{ "cell_type": "markdown", "metadata": { "id": "XKEaLnhm9dhu" }, "source": [ "Then we execute `./local/data.sh` for the data preparation, the same as for `an4`. The downloading takes longer (around 2-3 mins) for `totonac`, as the speech is recorded at a higher sampling rate and in a conversational manner, which yields long sessions rather than single utterances." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "colab": { "background_save": true }, "id": "BK3-oPVm9d8y" }, "outputs": [], "source": [ "!. ../../../tools/activate_python.sh && pip install soundfile # we need soundfile for necessary processing\n", "\n", "!./local/data.sh" ] },
{ "cell_type": "markdown", "metadata": { "id": "9wmwtGkjASej" }, "source": [ "Let's first check the original data format of `totonac`. To facilitate the linguists working on the language, we use the ELAN format, which is a special XML format. For preparation, we need to parse this format into the Kaldi format described above. For more details, please check https://github.com/espnet/espnet/blob/master/egs2/totonac/asr1/local/data_prep.py" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "colab": { "background_save": true }, "id": "2WiPvGPwAV77" }, "outputs": [], "source": [ "!ls -l downloads/Conversaciones/Botany/Transcripciones/ELAN-para-traducir | head -n 5\n", "!echo \"-----------------------------------------------\"\n", "!cat downloads/Conversaciones/Botany/Transcripciones/ELAN-para-traducir/Zongo_Botan_ESP400-SLC388_Convolvulaceae-Cuscuta-sp_2019-09-25-c_ed-2020-12-30.eaf" ] }
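, { "cell_type": "markdown", "metadata": {}, "source": [ "An `.eaf` file stores time-aligned annotations: a `TIME_ORDER` block maps time-slot IDs to times in milliseconds, and each `TIER` contains annotations that reference two time slots and hold the transcription text. Below is a minimal sketch (not the recipe's actual `local/data_prep.py`) of pulling out `(start, end, text)` triples from the file listed above using only the standard library; the tier structure of this particular corpus may differ in detail." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch of parsing an ELAN (.eaf) file with the standard library.\n", "# This is NOT the recipe's local/data_prep.py; it only illustrates how\n", "# time-aligned annotations map to Kaldi-style segments/text entries.\n", "import xml.etree.ElementTree as ET\n", "\n", "eaf_path = (\"downloads/Conversaciones/Botany/Transcripciones/ELAN-para-traducir/\"\n", "            \"Zongo_Botan_ESP400-SLC388_Convolvulaceae-Cuscuta-sp_2019-09-25-c_ed-2020-12-30.eaf\")\n", "root = ET.parse(eaf_path).getroot()\n", "\n", "# TIME_ORDER maps time-slot IDs to times in milliseconds\n", "time_value = {\n", "    slot.get(\"TIME_SLOT_ID\"): int(slot.get(\"TIME_VALUE\"))\n", "    for slot in root.iter(\"TIME_SLOT\") if slot.get(\"TIME_VALUE\") is not None\n", "}\n", "\n", "# Each aligned annotation references a start/end time slot and holds the text\n", "for ann in list(root.iter(\"ALIGNABLE_ANNOTATION\"))[:5]:\n", "    start = time_value[ann.get(\"TIME_SLOT_REF1\")] / 1000.0\n", "    end = time_value[ann.get(\"TIME_SLOT_REF2\")] / 1000.0\n", "    text = (ann.findtext(\"ANNOTATION_VALUE\") or \"\").strip()\n", "    print(f\"{start:7.2f}s - {end:7.2f}s: {text}\")" ] }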
, { "cell_type": "markdown", "metadata": { "id": "PD-0YSyCAAwK" }, "source": [ "Similar to `AN4`, we have three sets for the `totonac` experiments: train, test, and dev. However, within each set, we also have a `segments` file apart from the files mentioned above.\n", "\n", "Each line of `segments` has four fields: `UTT_ID`, `WAV_ID`, start time, and end time (in seconds). Note that when a `segments` file is present, the `WAV_ID` in `wav.scp` and the `UTT_ID` in `text`, `utt2spk`, and `spk2utt` are not the same anymore. The `segments` file is what keeps the relationship between `WAV_ID` and `UTT_ID`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "colab": { "background_save": true }, "id": "IUDP1AFC-i50" }, "outputs": [], "source": [ "!ls -l data\n", "!echo \"--------------------------\"\n", "!ls -l data/train\n", "!echo \"------------- wav.scp file -------------\"\n", "!head -n 10 data/train/wav.scp\n", "!echo \"------------- Segment file -------------\"\n", "!head -n 10 data/train/segments\n", "\n" ] },
{ "cell_type": "markdown", "metadata": { "id": "6XLRicblDZ9O" }, "source": [ "# Questions\n", "\n", "**Q1: The format itself is very general, but it cannot fit all the tasks in speech processing. Could you list three tasks for which the current format is not sufficient?**\n", "\n", "*Your Answers here*\n", "\n", "**Q2: For the three tasks you listed above, can you think of some modifications or additions to the format to make it work for those tasks?**\n", "\n", "*Your Answers here*\n", "\n", "**Q3: Briefly discuss the difference in `wav.scp` between `an4` and `totonac`.**\n", "\n", "*Your Answers here*\n", "\n", "(Note that for this assignment, you do not need to submit anything.)" ] } ], "metadata": { "colab": { "provenance": [] }, "gpuClass": "standard", "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }