CMU 11492/11692 Spring 2023: Data preparation

In this demonstration, we will show you how to prepare data for speech processing, using ASR as an example.

Main references:
- ESPnet repository
- ESPnet documentation
- ESPnet tutorial in Speech Recognition and Understanding (Fall 2021)
- Recitation in Multilingual NLP (Spring 2022)
- ESPnet tutorial in Speech Recognition and Understanding (Fall 2022)

Author: - Jiatong Shi (


After this demonstration, you are expected to:
- Understand the Kaldi (ESPnet) data format

Download ESPnet

We use git clone to download the source code of ESPnet and then go to a specific commit.

[ ]:
# It takes a few seconds
!git clone --depth 5

Setup Python environment based on anaconda

There are several other installation methods, but we highly recommend the anaconda-based one. In this demonstration, we only need the Python environment (there is no need to install the full ESPnet). However, installing the main ESPnet codebase is necessary for training, inference, and scoring.

[ ]:
# It takes 30 seconds
%cd /content/espnet/tools
!./ anaconda espnet 3.9

!pip install typeguard==2.13.0

We will also install some essential Python libraries. These are downloaded automatically during ESPnet installation, but since we do not go through that part today, we need to install the packages manually.

[ ]:
!pip install kaldiio soundfile tqdm librosa matplotlib IPython

We will also need Kaldi for some essential scripts.

[ ]:
!git clone

Data preparation in ESPnet

ESPnet has a number of recipes (146 recipes on Jan. 23, 2023). One of the most important steps in those recipes is data preparation. Because spoken corpora are constructed in different scenarios, they need to be converted into a unified format. In ESPnet, we follow and adapt the Kaldi data format for various tasks.

In this demonstration, we will focus on a specific recipe an4 as an example.

Other materials: - Kaldi format documentation can be found in - ESPnet data format is in - Please refer to for a complete list of recipes. - Please also check the general usage of the recipe in

Data preparation for AN4

All the data preparation in ESPnet2 happens in egs2/recipe_name/task/local/, where the task can be asr1, enh1, tts1, etc.

CMU AN4 recipe

In this demonstration, we will use the CMU an4 recipe. This is a small-scale speech recognition task mainly used for testing.

First, let’s go to the recipe directory.

[ ]:
%cd /content/espnet/egs2/an4/asr1
 - conf/      # Configuration files for training, inference, etc.
 - scripts/   # Bash utilities of espnet2
 - pyscripts/ # Python utilities of espnet2
 - steps/     # From Kaldi utilities
 - utils/     # From Kaldi utilities
 - local/     # Some local scripts for specific recipes (Data Preparation usually in `local/`)
 -      # The directory path of each corpora
 -    # Setup script for environment variables
 -     # Configuration for your backend of job scheduler
 -     # Entry point
 -     # Invoked by
[ ]:
# a few seconds

The original data usually comes in various formats. AN4 has a quite straightforward format. You may dig into the folder an4 to see the raw format. After this preparation is finished, all the information will be in the data directory:

[ ]:
!ls data

In this recipe, we use train_nodev as the training set and train_dev as the validation set (to monitor training progress by checking the validation score). We also use the test and train_dev sets for the final speech recognition evaluation.

Let’s check one of the training data directories:

[ ]:
!ls -1 data/train_nodev/

In short, the four files are:

spk2utt # Speaker information
text    # Transcription file
utt2spk # Speaker information
wav.scp # Audio file
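
To make the layout concrete, here is a minimal sketch that writes these four files for a toy corpus. The function name `write_kaldi_dir`, the utterance IDs, the speaker IDs, and the audio paths are all hypothetical and chosen only for illustration; this is not the script used by the an4 recipe.

```python
import os
import tempfile
from collections import defaultdict


def write_kaldi_dir(utterances, data_dir):
    """Write wav.scp, text, utt2spk, and spk2utt for a list of utterance dicts.

    Each dict needs the keys "utt_id", "spk_id", "wav", and "text".
    """
    os.makedirs(data_dir, exist_ok=True)
    # Kaldi-style tools expect every file to be sorted by its first field.
    utterances = sorted(utterances, key=lambda u: u["utt_id"])

    spk2utt = defaultdict(list)
    lines = {"wav.scp": [], "text": [], "utt2spk": []}
    for u in utterances:
        lines["wav.scp"].append(f"{u['utt_id']} {u['wav']}")
        lines["text"].append(f"{u['utt_id']} {u['text']}")
        lines["utt2spk"].append(f"{u['utt_id']} {u['spk_id']}")
        spk2utt[u["spk_id"]].append(u["utt_id"])
    # One line per speaker, followed by all of that speaker's utterances.
    lines["spk2utt"] = [
        f"{spk} {' '.join(utts)}" for spk, utts in sorted(spk2utt.items())
    ]

    for name, content in lines.items():
        with open(os.path.join(data_dir, name), "w", encoding="utf-8") as f:
            f.write("\n".join(content) + "\n")


# Hypothetical toy corpus: IDs and paths are made up for illustration.
toy = [
    {"utt_id": "spk2-utt1", "spk_id": "spk2", "wav": "/path/to/c.wav", "text": "YES"},
    {"utt_id": "spk1-utt2", "spk_id": "spk1", "wav": "/path/to/b.wav", "text": "ONE TWO"},
    {"utt_id": "spk1-utt1", "spk_id": "spk1", "wav": "/path/to/a.wav", "text": "HELLO"},
]
toy_dir = os.path.join(tempfile.mkdtemp(), "train")
write_kaldi_dir(toy, toy_dir)
print(sorted(os.listdir(toy_dir)))
```

Prefixing each utterance ID with its speaker ID, as in the toy data, is a common convention because it keeps the utterance-sorted and speaker-sorted orders consistent.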

The wav.scp file is the most important one, as it holds the speech data. Each line generally has two components: WAV_ID and SPEECH_AUDIO. The WAV_ID is an identifier for the utterance, while the SPEECH_AUDIO holds the speech audio data.

The audio data can be in various audio formats, such as wav, flac, sph, etc. We can also use a pipe to convert audio files on the fly with a command-line tool (e.g., sox, ffmpeg, sph2pipe). The following example from an4 uses sph2pipe.

[ ]:
!head -n 10 data/train_nodev/wav.scp
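
The two kinds of wav.scp entries (a plain file path, or a command ending with `|`) can be told apart with a few lines of Python. The sketch below is only an illustration: the helper name `parse_wav_scp_line` and the example paths are hypothetical, and it relies on the Kaldi convention that a trailing `|` marks a command whose standard output is the audio stream.

```python
def parse_wav_scp_line(line):
    """Split one wav.scp line into (wav_id, audio, is_pipe)."""
    # Split only once: the audio field may itself contain spaces.
    wav_id, audio = line.strip().split(None, 1)
    # A trailing "|" means the entry is a command, not a path on disk.
    is_pipe = audio.endswith("|")
    return wav_id, audio, is_pipe


# Hypothetical entries; the paths do not refer to real files.
examples = [
    "utt_a /path/to/utt_a.wav",
    "utt_b sph2pipe -f wav -p -c 1 /path/to/utt_b.sph |",
]
parsed = [parse_wav_scp_line(line) for line in examples]
for wav_id, audio, is_pipe in parsed:
    kind = "pipe command" if is_pipe else "file path"
    print(f"{wav_id}: {kind}")
```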

The text file holds the transcription of the speech. Similar to wav.scp, each line of text contains a UTT_ID and a TRANSCRIPTION. Note that the UTT_ID in text and the WAV_ID in wav.scp are not necessarily the same. But for a simple case (e.g., AN4), we regard them as the same. An example from AN4 is shown below:

[ ]:
!head -n 10 data/train_nodev/text
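
As a rough sketch of how the text file can be consumed, the snippet below loads it into a dictionary and checks it against a set of wav.scp IDs. The helper name `load_text` and the sample lines are hypothetical, and the check assumes the simple case where UTT_ID and WAV_ID coincide.

```python
def load_text(lines):
    """Map UTT_ID -> transcription for lines in the Kaldi `text` format."""
    transcripts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split only on the first whitespace run: the transcription itself
        # contains spaces, and it may be empty.
        fields = line.split(None, 1)
        transcripts[fields[0]] = fields[1] if len(fields) > 1 else ""
    return transcripts


# Hypothetical sample lines and IDs for illustration.
text_lines = ["utt_a HELLO WORLD", "utt_b ", "utt_c ONE"]
wav_ids = {"utt_a", "utt_b", "utt_c", "utt_d"}

transcripts = load_text(text_lines)
# IDs that have audio but no transcription line.
missing = sorted(wav_ids - set(transcripts))
print(transcripts)
print("missing transcriptions:", missing)
```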

The spk2utt and utt2spk files are mappings between utterances and speakers. This information is widely used in conventional hidden Markov model (HMM)-based ASR systems, but it is less popular in end-to-end ASR systems nowadays. However, it is still very important for tasks such as speaker diarization and multi-speaker text-to-speech. The AN4 examples are as follows:

[ ]:
!head -n 10 data/train_nodev/spk2utt
!echo "--------------------------"
!head -n 10 data/train_nodev/utt2spk
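
Because the two files carry the same information in opposite directions, one can be derived from the other. Kaldi's utils directory ships `utt2spk_to_spk2utt.pl` for this purpose; the Python function below is only an illustrative sketch of the same idea, with a hypothetical name and made-up IDs.

```python
from collections import defaultdict


def utt2spk_to_spk2utt(utt2spk_lines):
    """Invert "UTT_ID SPK_ID" lines into "SPK_ID UTT_ID1 UTT_ID2 ..." lines."""
    spk2utt = defaultdict(list)
    for line in utt2spk_lines:
        utt_id, spk_id = line.split()
        spk2utt[spk_id].append(utt_id)
    # Sort speakers and their utterances so the output order is deterministic.
    return [
        f"{spk} {' '.join(sorted(utts))}" for spk, utts in sorted(spk2utt.items())
    ]


# Hypothetical utt2spk content.
utt2spk_lines = [
    "spkB-utt1 spkB",
    "spkA-utt2 spkA",
    "spkA-utt1 spkA",
]
spk2utt_lines = utt2spk_to_spk2utt(utt2spk_lines)
print("\n".join(spk2utt_lines))
```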

How to read file in pipe

We can use the kaldiio package to read audio files from wav.scp. The example is as follows:

[ ]:
import soundfile
import kaldiio
import matplotlib.pyplot as plt
from io import BytesIO
from tqdm import tqdm
import librosa.display
import numpy as np
import IPython.display as ipd
import os

os.environ['PATH'] = os.environ['PATH'] + ":/content/espnet/tools/sph2pipe"

wavscp = open("data/test/wav.scp", "r")

num_wav = 5
count = 1
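for line in tqdm(wavscp):
  # Each line is "<utt_id> <path or pipe command>"
  utt_id, wavpath = line.strip().split(None, 1)
  # open_like_kaldi runs the command when the entry ends with "|"
  with kaldiio.open_like_kaldi(wavpath, "rb") as f:
    with BytesIO(f.read()) as g:
      wave, rate = soundfile.read(g, dtype=np.float32)
      print("audio: {}".format(utt_id))
      librosa.display.waveshow(wave, sr=rate)
      plt.show()

      ipd.display(ipd.Audio(wave, rate=rate)) # load a NumPy array
      # Stop after showing the first num_wav utterances
      if count == num_wav:
        break
      count += 1
wavscp.close()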

Data preparation for TOTONAC


In the second part of the demonstration, we will use the CMU totonac recipe. This is a small-scale ASR recipe for Totonac, an endangered language spoken in central Mexico. We will mostly follow the same procedure as in the AN4 showcase. To start with, the recipe directory of totonac is almost the same as that of an4.

[ ]:
%cd /content/espnet/egs2/totonac/asr1

Then we execute ./local/ for the data preparation, which is the same as an4. The downloading takes longer (around 2-3 minutes) for totonac because the speech has a higher sampling rate and was recorded in a conversational manner, which results in long sessions rather than single utterances.

[ ]:
!. ../../../tools/ && pip install soundfile # we need soundfile for necessary processing


Let’s first check the original data format of totonac. To facilitate the work of linguists on the language, the corpus uses the ELAN format, which is a special XML format. For preparation, we need to parse it into the same Kaldi format as described above. For more details, please check

[ ]:
!ls -l downloads/Conversaciones/Botany/Transcripciones/ELAN-para-traducir | head -n 5
!echo "-----------------------------------------------"
!cat downloads/Conversaciones/Botany/Transcripciones/ELAN-para-traducir/Zongo_Botan_ESP400-SLC388_Convolvulaceae-Cuscuta-sp_2019-09-25-c_ed-2020-12-30.eaf
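
To give an idea of what parsing ELAN involves, the sketch below extracts time-aligned annotations from a tiny hand-written document that follows the usual .eaf layout (time slots plus tiers of alignable annotations). The sample is not taken from the Totonac corpus, the function name `parse_eaf` is hypothetical, and the code assumes that every time slot carries a `TIME_VALUE` in milliseconds; the recipe's own parsing script may handle more cases.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the ELAN layout (illustration only).
EAF_SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="4200"/>
  </TIME_ORDER>
  <TIER TIER_ID="speaker_a">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first phrase</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second phrase</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def parse_eaf(xml_text):
    """Return a list of (tier_id, start_sec, end_sec, text) tuples."""
    root = ET.fromstring(xml_text)
    # Map each time slot ID to its time in milliseconds.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
    }
    segments = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots[ann.get("TIME_SLOT_REF1")] / 1000.0
            end = slots[ann.get("TIME_SLOT_REF2")] / 1000.0
            value = ann.findtext("ANNOTATION_VALUE", default="").strip()
            segments.append((tier.get("TIER_ID"), start, end, value))
    return segments


eaf_segments = parse_eaf(EAF_SAMPLE)
for tier_id, start, end, value in eaf_segments:
    print(f"{tier_id} {start:.2f} {end:.2f} {value}")
```

Each tuple already contains what the Kaldi files need: a speaker-like tier ID, start and end times, and a transcription.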

Similar to AN4, we have three sets for the totonac experiments: train, test, and dev. However, within each set, we also have a segments file in addition to the files mentioned above.

Each line of segments has four fields: UTT_ID, WAV_ID, “start time”, and “end time” (both in seconds). Note that when a segments file is present, the WAV_ID in wav.scp and the UTT_ID in text, utt2spk, and spk2utt are no longer the same. The segments file is what keeps the relationship between WAV_ID and UTT_ID.

[ ]:
!ls -l data
!echo  "--------------------------"
!ls -l data/train
!echo  "------------- wav.scp file -------------"
!head -n 10 data/train/wav.scp
!echo  "------------- Segment file -------------"
!head -n 10 data/train/segments
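
The role of the segments file can be sketched as follows: each utterance is cut out of a longer recording using its start and end times. The function names, the IDs, and the 16 kHz rate below are hypothetical, and a list of zeros stands in for real audio samples (the same slicing works on a NumPy array).

```python
def parse_segments(lines):
    """Map UTT_ID -> (WAV_ID, start_sec, end_sec) from segments lines."""
    segs = {}
    for line in lines:
        utt_id, wav_id, start, end = line.split()
        segs[utt_id] = (wav_id, float(start), float(end))
    return segs


def cut_segment(wave, rate, start, end):
    """Slice samples between start and end (seconds) out of a recording."""
    return wave[int(round(start * rate)) : int(round(end * rate))]


rate = 16000  # assumed sampling rate for this toy example
# Stand-in for a 10-second recording loaded through wav.scp.
recordings = {"rec1": [0.0] * (10 * rate)}

segments = parse_segments(
    [
        "rec1-0001 rec1 0.50 2.00",
        "rec1-0002 rec1 2.25 4.00",
    ]
)
utt_audio = {
    utt_id: cut_segment(recordings[wav_id], rate, start, end)
    for utt_id, (wav_id, start, end) in segments.items()
}
for utt_id, samples in utt_audio.items():
    print(utt_id, len(samples) / rate, "seconds")
```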


Q1: The format itself is very general, but it does not fit all tasks in speech processing. Could you list three tasks for which the current format is not sufficient?

Your Answers here

Q2: For the three tasks you listed above, can you think of some modifications or additions to the format that would make it work for those tasks as well?

Your Answers here

Q3: Briefly discuss the differences between the wav.scp files of an4 and totonac.

Your Answers here

(Note that for this assignment, you do not need to submit anything.)