CMU 11492/11692 Spring 2023: Data preparation
In this demonstration, we will show you the procedure to prepare the data for speech processing (ASR as an example).
Main references:
- ESPnet repository
- ESPnet documentation
- ESPnet tutorial in Speech Recognition and Understanding (Fall 2021)
- Recitation in Multilingual NLP (Spring 2022)
- ESPnet tutorial in Speech Recognition and Understanding (Fall 2022)
Author:
- Jiatong Shi (jiatongs@andrew.cmu.edu)
Objectives
After this demonstration, you are expected to:
- Understand the Kaldi (ESPnet) data format
Useful links
- Installation https://espnet.github.io/espnet/installation.html
- Kaldi Data format https://kaldi-asr.org/doc/data_prep.html
- ESPnet data format https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE#about-kaldi-style-data-directory
Download ESPnet
We use git clone to download the source code of ESPnet (a shallow clone that keeps only the latest few commits).
# It takes a few seconds
!git clone --depth 5 https://github.com/espnet/espnet
Setup Python environment based on anaconda
There are several other installation methods, but we highly recommend the anaconda-based one. In this demonstration, we will only need the Python environment (no need to install the full ESPnet). However, installing the main ESPnet codebase will be necessary for training/inference/scoring.
# It takes 30 seconds
%cd /content/espnet/tools
!./setup_anaconda.sh anaconda espnet 3.9
!./installers/install_sph2pipe.sh
!pip install typeguard==2.13.0
We will also install some essential Python libraries (these are automatically downloaded during ESPnet installation; since we won't go through that part today, we need to install the packages manually).
!pip install kaldiio soundfile tqdm librosa matplotlib IPython
We will also need Kaldi for some essential scripts.
!git clone https://github.com/kaldi-asr/kaldi.git
Data preparation in ESPnet
ESPnet has a number of recipes (146 recipes as of Jan. 23, 2023). One of the most important steps in those recipes is data preparation: since spoken corpora are constructed in different scenarios, they need to be converted into a unified format. In ESPnet, we follow and adapt the Kaldi data format for various tasks.
In this demonstration, we will focus on a specific recipe, an4, as an example.
Other materials:
- Kaldi format documentation can be found in https://kaldi-asr.org/doc/data_prep.html
- ESPnet data format is in https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE#about-kaldi-style-data-directory
- Please refer to https://github.com/espnet/espnet/blob/master/egs2/README.md for a complete list of recipes.
- Please also check the general usage of the recipe in https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
Data preparation for AN4
All the data preparation in ESPnet2 happens in egs2/recipe_name/task/local/data.sh, where the task can be asr1, enh1, tts1, etc.
CMU AN4 recipe
In this demonstration, we will use the CMU an4 recipe. This is a small-scale speech recognition task mainly used for testing.
First, let's go to the recipe directory.
%cd /content/espnet/egs2/an4/asr1
!ls
egs2/an4/asr1/
- conf/ # Configuration files for training, inference, etc.
- scripts/ # Bash utilities of espnet2
- pyscripts/ # Python utilities of espnet2
- steps/ # From Kaldi utilities
- utils/ # From Kaldi utilities
- local/ # Some local scripts for specific recipes (Data Preparation usually in `local/data.sh`)
- db.sh # The directory path of each corpus
- path.sh # Setup script for environment variables
- cmd.sh # Configuration for your backend of job scheduler
- run.sh # Entry point
- asr.sh # Invoked by run.sh
# a few seconds
!./local/data.sh
The original data usually comes in various formats. AN4 has a quite straightforward one. You may dig into the folder an4 to see the raw format. After this preparation is finished, all the information will be in the data directory:
!ls data
In this recipe, we use train_nodev as the training set and train_dev as the validation set (to monitor training progress by checking the validation score). We also use the test and train_dev sets for the final speech recognition evaluation.
Let's check one of the training data directories:
!ls -1 data/train_nodev/
In short, the four files are:
spk2utt # Speaker-to-utterance mapping
text # Transcription file
utt2spk # Utterance-to-speaker mapping
wav.scp # Audio file list
The wav.scp is the most important file, as it holds the speech data. Each line of wav.scp generally has two components: WAV_ID and SPEECH_AUDIO. The WAV_ID is an identifier for the utterance, while SPEECH_AUDIO holds the speech audio data. The audio data can be in various formats, such as wav, flac, sph, etc. We can also use a pipe to normalize audio files on the fly (e.g., with sox, ffmpeg, or sph2pipe). The following example from an4 uses sph2pipe; a small parsing sketch follows the dump below.
!head -n 10 data/train_nodev/wav.scp
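To make the two-column structure concrete, here is a minimal parsing sketch of our own (not part of the recipe); it assumes you are still inside egs2/an4/asr1 after running ./local/data.sh. Entries whose value ends with "|" are command pipes rather than plain file paths.
# A minimal sketch (not part of the recipe): parse wav.scp into a dict
wav_table = {}
with open("data/train_nodev/wav.scp") as f:
    for line in f:
        wav_id, audio = line.strip().split(None, 1)
        wav_table[wav_id] = audio

# Entries ending with "|" are command pipes (e.g., "sph2pipe ... |")
for wav_id, audio in list(wav_table.items())[:3]:
    kind = "pipe command" if audio.endswith("|") else "file path"
    print(wav_id, "->", kind)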
The text file holds the transcription of the speech. Similar to wav.scp, each line of text has a UTT_ID and a TRANSCRIPTION. Note that the UTT_ID in text and the WAV_ID in wav.scp are not necessarily the same, but for a simple case (e.g., AN4) we regard them as the same. The example in AN4 is as follows, with a quick sanity check after it:
!head -n 10 data/train_nodev/text
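As that sanity check, the sketch below (again ours, not recipe code) verifies that the UTT_IDs in text match the WAV_IDs in wav.scp, which should hold for a simple corpus like AN4 that has no segments file.
# Sketch: for AN4, the IDs in text and wav.scp should be identical
def read_two_column(path):
    table = {}
    with open(path) as f:
        for line in f:
            key, value = line.strip().split(None, 1)
            table[key] = value
    return table

text_table = read_two_column("data/train_nodev/text")
wav_table = read_two_column("data/train_nodev/wav.scp")
print("IDs identical:", set(text_table) == set(wav_table))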
The spk2utt and utt2spk files are mappings between utterances and speakers. This information is widely used in conventional hidden Markov model (HMM)-based ASR systems, but is not that popular in end-to-end ASR systems nowadays. However, it is still very important for tasks such as speaker diarization and multi-speaker text-to-speech. The examples from AN4 are as follows (a conversion sketch comes after them):
!head -n 10 data/train_nodev/spk2utt
!echo "--------------------------"
!head -n 10 data/train_nodev/utt2spk
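The two files carry the same mapping in opposite directions: utt2spk has one utterance per line, while spk2utt groups all utterances of a speaker on one line. Kaldi ships utils/utt2spk_to_spk2utt.pl for the conversion; a rough Python equivalent of ours looks like this:
# Sketch: rebuild spk2utt from utt2spk (what utils/utt2spk_to_spk2utt.pl does)
from collections import defaultdict

spk2utt = defaultdict(list)
with open("data/train_nodev/utt2spk") as f:
    for line in f:
        utt_id, spk_id = line.strip().split()
        spk2utt[spk_id].append(utt_id)

for spk_id, utt_ids in list(spk2utt.items())[:3]:
    print(spk_id, " ".join(utt_ids))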
How to read files in a pipe
We can use the kaldiio package to read audio files from wav.scp. An example is as follows:
import soundfile
import kaldiio
import matplotlib.pyplot as plt
from io import BytesIO
from tqdm import tqdm
import librosa.display
import numpy as np
import IPython.display as ipd
import os

# Some wav.scp entries invoke sph2pipe in a pipe, so it must be on PATH
os.environ['PATH'] = os.environ['PATH'] + ":/content/espnet/tools/sph2pipe"

wavscp = open("data/test/wav.scp", "r")

num_wav = 5
count = 1
for line in tqdm(wavscp):
    # Each line is "WAV_ID <audio path or command pipe>"
    utt_id, wavpath = line.strip().split(None, 1)
    with kaldiio.open_like_kaldi(wavpath, "rb") as f:
        with BytesIO(f.read()) as g:
            wave, rate = soundfile.read(g, dtype=np.float32)
    print("audio: {}".format(utt_id))
    librosa.display.waveshow(wave, sr=rate)
    plt.show()
    ipd.display(ipd.Audio(wave, rate=rate))  # load a NumPy array
    if count == num_wav:
        break
    count += 1
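As a side note, kaldiio.open_like_kaldi is what makes the pipe-style entries work: if the wav.scp value ends with "|", it runs the command in a subprocess and returns its standard output as a file object; otherwise it simply opens the path. That is also why the sph2pipe directory is appended to PATH above.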
Data preparation for TOTONAC
CMU TOTONAC recipe
In the second part of the demonstration, we will use the CMU totonac recipe. This is a small-scale ASR recipe for Totonac, an endangered language spoken in central Mexico. We will mostly follow the same procedure as in the AN4 showcase. To start with, the recipe directory of totonac is almost the same as that of an4.
%cd /content/espnet/egs2/totonac/asr1
!ls
Then we execute ./local/data.sh for the data preparation, the same as for an4. The download takes longer (around 2-3 minutes) for totonac, as the speech has a higher sampling rate and is recorded in a conversational manner, with long sessions rather than single utterances.
!. ../../../tools/activate_python.sh && pip install soundfile # we need soundfile for necessary processing
!./local/data.sh
Let's first check the original data format of totonac. To facilitate the linguists working on the language, we use the ELAN format, which is a special XML format. For preparation, we need to parse this format into the Kaldi format mentioned above; a simplified parsing sketch follows the raw file dump below. For more details, please check https://github.com/espnet/espnet/blob/master/egs2/totonac/asr1/local/data_prep.py
!ls -l downloads/Conversaciones/Botany/Transcripciones/ELAN-para-traducir | head -n 5
!echo "-----------------------------------------------"
!cat downloads/Conversaciones/Botany/Transcripciones/ELAN-para-traducir/Zongo_Botan_ESP400-SLC388_Convolvulaceae-Cuscuta-sp_2019-09-25-c_ed-2020-12-30.eaf
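To give a flavor of what local/data_prep.py has to do, here is a rough, heavily simplified sketch (not the actual recipe code) that pulls time-aligned annotations out of an ELAN .eaf file with the standard library; the file name is a placeholder.
# Rough sketch (not the recipe's data_prep.py): read an ELAN .eaf file
import xml.etree.ElementTree as ET

tree = ET.parse("example.eaf")  # placeholder file name
root = tree.getroot()

# TIME_ORDER maps time-slot IDs to millisecond offsets
time_slots = {
    ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
    for ts in root.find("TIME_ORDER")
    if ts.get("TIME_VALUE") is not None
}

# Each TIER holds alignable annotations with start/end time-slot references
for tier in root.iter("TIER"):
    for ann in tier.iter("ALIGNABLE_ANNOTATION"):
        start = time_slots[ann.get("TIME_SLOT_REF1")] / 1000.0
        end = time_slots[ann.get("TIME_SLOT_REF2")] / 1000.0
        value = ann.findtext("ANNOTATION_VALUE", default="")
        print(tier.get("TIER_ID"), start, end, value)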
Similar to AN4, we will have three sets for the totonac experiments: train, test, and dev. However, within each set, we also have a segments file apart from the files mentioned above.
Each line of segments has four fields: UTT_ID, WAV_ID, start time, and end time. Note that when a segments file is present, the WAV_ID in wav.scp and the UTT_ID in text, utt2spk, and spk2utt are not the same anymore; segments is the file that keeps the relationship between WAV_ID and UTT_ID. A sketch that puts segments and wav.scp together follows the file listings below.
!ls -l data
!echo "--------------------------"
!ls -l data/train
!echo "------------- wav.scp file -------------"
!head -n 10 data/train/wav.scp
!echo "------------- Segment file -------------"
!head -n 10 data/train/segments
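Putting the pieces together, the sketch below (ours, not recipe code) uses segments plus wav.scp to cut one utterance out of a long recording; it reads through kaldiio so that both plain paths and pipe entries work.
# Sketch: use segments + wav.scp to extract a single utterance
import numpy as np
import soundfile
import kaldiio
from io import BytesIO

# segments: UTT_ID WAV_ID start-time end-time (in seconds)
segments = []
with open("data/train/segments") as f:
    for line in f:
        utt_id, wav_id, start, end = line.strip().split()
        segments.append((utt_id, wav_id, float(start), float(end)))

wav_table = {}
with open("data/train/wav.scp") as f:
    for line in f:
        wav_id, audio = line.strip().split(None, 1)
        wav_table[wav_id] = audio

utt_id, wav_id, start, end = segments[0]
with kaldiio.open_like_kaldi(wav_table[wav_id], "rb") as f:
    with BytesIO(f.read()) as g:
        wave, rate = soundfile.read(g, dtype=np.float32)
# Slice the long recording down to this utterance
segment = wave[int(start * rate):int(end * rate)]
print(utt_id, segment.shape, rate)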
Questions
Q1: The format itself is very general, but it cannot fit all the tasks in speech processing. Could you list three tasks for which the current format is not sufficient?
Your Answers here
Q2: For the three tasks you listed above, can you think of some modifications or additions to the format to make it work for those tasks as well?
Your Answers here
Q3: Briefly discuss the differences in wav.scp between an4 and totonac.
Your Answers here
(Note that for this assignment, you do not need to submit anything.)