Converting audio file formats using

The conversion utility converts the audio format of the files listed in wav.scp, and the accompanying shell script wraps it. Typically, in stage 3 of the template recipe, it is used to convert the audio files of your original corpus to the audio format that you actually want to feed to the DNN model. Both tools generate a new wav.scp from an input wav.scp; they differ in that the shell script can process the files in parallel.

wav.scp -> [] -> wav.scp

wav.scp -> [] -> wav.scp

Note that the utility dumps audio files as linear PCM with sint16, regardless of the input audio format.

Quick usage

First, prepare a text file named wav.scp:

ID_a /some_where/a.wav
ID_b /some_where2/b.wav

ID_a and ID_b are IDs that you can choose arbitrarily to identify the audio files. Note that we don’t assume any particular directory structure for the audio files.
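To make the file layout concrete, here is a minimal sketch of reading such a file into a dictionary. The function name read_wav_scp is hypothetical and is not part of ESPnet; the sketch only assumes that each line holds an ID, whitespace, and then the rest of the line.

```python
from typing import Dict


def read_wav_scp(text: str) -> Dict[str, str]:
    """Map each ID to the rest of its line (hypothetical helper)."""
    table: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split only once so that the value may itself contain spaces.
        utt_id, value = line.split(maxsplit=1)
        if utt_id in table:
            raise ValueError(f"duplicate ID: {utt_id}")
        table[utt_id] = value
    return table


example = """ID_a /some_where/a.wav
ID_b /some_where2/b.wav
"""
entries = read_wav_scp(example)
print(entries["ID_a"])
```

Splitting only once keeps the remainder of the line intact, which matters when the value contains spaces.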

# Please change directory before using our shell scripts
cd egs2/some_corpus/some_task

nj=10  # Number of parallel jobs
audio_format=flac  # The audio codec of output files
fs=16k  # The sampling frequency of output files
ref_channels=0  # If the input data has multiple channels and you want to use only one of them, specify it by its 0-based channel number
./scripts/audio/ --nj "${nj}" --cmd "${cmd}" --audio_format "${audio_format}" --fs "${fs}" --ref_channels "${ref_channels}" somewhere/wav.scp output_dir

# Then, you can find output_dir/wav.scp


Why is audio file formatting necessary?

The audio data in a corpus obtained from its source website can be distributed in various audio file formats, which differ in the audio codec (wav with linear PCM, flac, mp3, DSD, u-law, a-law, etc.), the sampling frequency (48 kHz, 44.1 kHz, 16 kHz, 8 kHz, etc.), the bit depth (uint8, sint16, sint32, float20, float32, etc.), the number of channels (monaural, stereo, or more than 2 channels), and the byte order (little endian or big endian).

When you develop a new recipe with a corpus that our recipes do not yet cover, you can of course try to use the audio data as they are, without any formatting. In a typical case, however, the configuration of our DNN model may assume a specific audio format, especially regarding the sampling frequency and the data precision. If you want to be conservative with your new recipe, we recommend converting the data to the audio format of the original recipe. For example, 16 kHz sint16 audio is typically used in our ASR recipes.
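Before deciding whether conversion is needed, it can help to look at the header of a file. The following is a minimal sketch using only the Python standard library; describe_wav and matches_target are hypothetical helper names, the 16 kHz / 16-bit target is taken from the example above, and the wave module only reads wav files with linear PCM.

```python
import os
import tempfile
import wave


def describe_wav(path: str) -> dict:
    """Return the header fields that matter for format conversion."""
    with wave.open(path, "rb") as f:
        return {
            "channels": f.getnchannels(),
            "sampling_rate": f.getframerate(),
            "bits_per_sample": 8 * f.getsampwidth(),
            "num_frames": f.getnframes(),
        }


def matches_target(info: dict, fs: int = 16000, bits: int = 16) -> bool:
    """Check the sampling rate and bit depth against the target format."""
    return info["sampling_rate"] == fs and info["bits_per_sample"] == bits


# Write a short silent file so that the sketch runs without any corpus.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)      # 2 bytes = 16 bits per sample
    f.setframerate(16000)
    f.writeframes(b"\x00\x00" * 160)

info = describe_wav(demo_path)
print(info, matches_target(info))
```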

The audio file formats supported in ESPnet2

ESPnet uses the Python soundfile package for data loading, so the supported audio codecs depend on libsndfile.

You can check the supported audio codecs of soundfile with the following command:

import soundfile
print(soundfile.available_formats())

Note that the wav.scp of Kaldi originally requires wav files of the pcm_s16le type, whereas the wav.scp of ESPnet2 can handle all audio formats supported by soundfile. For example, flac files can be listed in wav.scp for both the input and the output of the conversion.

Depending on the situation, you may choose one of the following codecs:

Codec                                 Compression  Maximum channels  Maximum sampling frequency  Note
wav (Microsoft wav with linear pcm)   No           1024              -
flac                                  Lossless     8                 192 kHz
mp3                                   Lossy        2                 48 kHz                      The patent of MP3 has expired
ogg (Vorbis)                          Lossy        255               192 kHz                     Segmentation fault happens

By default, we select flac because it can convert linear PCM files with a compression rate of ~55 % without data loss. flac helps reduce the I/O load, especially when training with a large corpus. To change to another format, use the --audio_format option:

cd egs2/some_corpus/some_task
./ --audio_format mp3

Note that if the audio files in your corpus are distributed with a lossy audio codec, such as MP3, it’s better to keep the file format, which avoids duplicating the full corpus in an uncompressed format. If the input audio format is exactly the same as the output format, avoid generating the output files and reuse the input files.

Use cases

Case1: Extract segments from long recordings

Create wav.scp and a segments file. Each line of segments has the format <utterance_id> <wav_id> <start_time> <end_time>, with the times given in seconds.


wav.scp:

record_a a.wav

segments:

segment_a record_a 0.98 11.56
segment_b record_a 12.34 15.43

Then, you can extract the segments with:

./scripts/audio/ --segments segments wav.scp output_dir
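As an illustration of what a segments line encodes, the sketch below parses the lines and converts the times to sample indices. The names Segment, read_segments, and to_sample_range are hypothetical, and rounding to the nearest sample is an assumption; the actual script may compute the boundaries differently.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    utt_id: str
    wav_id: str
    start: float  # seconds
    end: float    # seconds


def read_segments(text: str) -> List[Segment]:
    """Parse lines of '<utterance_id> <wav_id> <start_time> <end_time>'."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        utt_id, wav_id, start, end = line.split()
        segments.append(Segment(utt_id, wav_id, float(start), float(end)))
    return segments


def to_sample_range(seg: Segment, fs: int) -> Tuple[int, int]:
    """Convert start/end times in seconds to sample indices at rate fs."""
    return round(seg.start * fs), round(seg.end * fs)


segs = read_segments(
    "segment_a record_a 0.98 11.56\n"
    "segment_b record_a 12.34 15.43\n"
)
for seg in segs:
    print(seg.utt_id, to_sample_range(seg, 16000))
```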

Case2: Extract audio data from video files / Use formats not supported by soundfile

ffmpeg is required. Create wav.scp as follows:

ID_a ffmpeg -i "ID_a.mp4" -f wav -af pan="1c|c0=c0" -acodec pcm_s16le - |
ID_b ffmpeg -i "ID_b.mp4" -f wav -af pan="1c|c0=c0" -acodec pcm_s16le - |
  • Note: -af pan applies the pan filter.

    • <num>c| specifies the number of output channels, <num>.

    • |c<out-channel>=c<in-channel> assigns the <in-channel>th channel of the input stream to the <out-channel>th channel of the output stream.

  • Caution: -map_channel option is deprecated and will be removed.
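In the entries above, the trailing | marks the value as a command whose standard output is the audio data, rather than a file path. The following sketch shows one way to tell the two kinds of entry apart; pipe_command is a hypothetical helper, and splitting with shlex is an assumption (the actual script may instead pass the string to a shell).

```python
import shlex
from typing import List, Optional


def pipe_command(value: str) -> Optional[List[str]]:
    """Return the argument list if value ends with '|', else None."""
    value = value.strip()
    if not value.endswith("|"):
        return None  # a plain file path
    # Drop the trailing '|' and split like a POSIX shell would.
    return shlex.split(value[:-1])


args = pipe_command(
    'ffmpeg -i "ID_a.mp4" -f wav -af pan="1c|c0=c0" -acodec pcm_s16le - |'
)
print(args)
print(pipe_command("/some_where/a.wav"))
```

Only the final | is treated as the pipe marker, so the | characters inside the quoted pan filter are left untouched.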

Case3: Convert NIST Sphere files to wav

sph2pipe is required. Create wav.scp as follows:

ID_a sph2pipe -f wav -p -c 1 ID_a.sph |
ID_b sph2pipe -f wav -p -c 1 ID_b.sph |

Case4: Use the mechanism for multi-channel input

To generate a multi-channel audio file from monaural audio files, create the following wav.scp:

ID_a a1.wav a2.wav

and run the following command:

./scripts/audio/ --multi_columns_input true wav.scp output_dir
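Conceptually, this step interleaves the samples of the monaural files into one multi-channel file. The sketch below illustrates the idea with the Python standard library only; merge_mono_wavs is a hypothetical function, it assumes 16-bit mono wav inputs of equal length and sampling rate, and it is not how the script itself is implemented.

```python
import array
import os
import tempfile
import wave
from typing import List


def merge_mono_wavs(paths: List[str], out_path: str) -> None:
    """Interleave 16-bit mono wav files into one multi-channel wav file."""
    channels = []
    rate = None
    for path in paths:
        with wave.open(path, "rb") as f:
            if f.getnchannels() != 1 or f.getsampwidth() != 2:
                raise ValueError(f"{path}: expected 16-bit mono")
            if rate is None:
                rate = f.getframerate()
            elif f.getframerate() != rate:
                raise ValueError(f"{path}: sampling rate mismatch")
            samples = array.array("h")
            samples.frombytes(f.readframes(f.getnframes()))
            channels.append(samples)
    if len({len(c) for c in channels}) != 1:
        raise ValueError("all inputs must have the same length")
    n = len(channels)
    interleaved = array.array("h", [0]) * (n * len(channels[0]))
    for ch, samples in enumerate(channels):
        interleaved[ch::n] = samples  # channel ch takes every n-th slot
    with wave.open(out_path, "wb") as f:
        f.setnchannels(n)
        f.setsampwidth(2)
        f.setframerate(rate)
        f.writeframes(interleaved.tobytes())


# Create two tiny mono files so that the sketch runs on its own.
tmp = tempfile.mkdtemp()
for name, value in (("a1.wav", 100), ("a2.wav", -100)):
    with wave.open(os.path.join(tmp, name), "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(16000)
        f.writeframes(array.array("h", [value] * 4).tobytes())

out_path = os.path.join(tmp, "ID_a.wav")
merge_mono_wavs(
    [os.path.join(tmp, "a1.wav"), os.path.join(tmp, "a2.wav")], out_path
)
```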

Conversely, to generate monaural audio files from multi-channel audio files, run:

./scripts/audio/ --multi_columns_output true wav.scp output_dir

Then you get a wav.scp like the following:

ID_a output_dir/ID_a-CH0.wav output_dir/ID_a-CH1.wav