Converting audio file formats using format_wav_scp.py
format_wav_scp.py is a utility that converts the audio format of the files listed in wav.scp, and format_wav_scp.sh is a shell script wrapping format_wav_scp.py. Typically, in stage 3 of the template recipe, format_wav_scp.sh is used to convert the audio files of your original corpus to the audio format that you actually want to feed to the DNN model.
Both format_wav_scp.py and format_wav_scp.sh generate a new wav.scp from an input wav.scp, but format_wav_scp.sh differs in that it supports parallel processing.
wav.scp -> [format_wav_scp.py] -> wav.scp
wav.scp -> [format_wav_scp.sh] -> wav.scp
Note that format_wav_scp.py dumps audio files as linear PCM with sint16 samples regardless of the input audio format.
Quick usage
First, prepare a text file named wav.scp:
ID_a /some_where/a.wav
ID_b /some_where2/b.wav
...
ID_a and ID_b are IDs that you can choose freely to identify the audio files. Note that we don't assume any particular directory structure for the audio files.
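As a rough illustration of the file layout described above, the following sketch reads wav.scp content into a dictionary. The function name `parse_wav_scp` is hypothetical and is not part of ESPnet; the sketch only assumes that each line holds an ID followed by whitespace and then the rest of the entry.

```python
def parse_wav_scp(text):
    """Parse wav.scp content into a dict mapping ID -> rest of the line."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first run of whitespace only, because the rest of
        # the line may itself contain spaces.
        utt_id, value = line.split(maxsplit=1)
        if utt_id in entries:
            raise ValueError(f"duplicated ID: {utt_id}")
        entries[utt_id] = value
    return entries


example = """\
ID_a /some_where/a.wav
ID_b /some_where2/b.wav
"""
wav_scp = parse_wav_scp(example)
print(wav_scp["ID_a"])
```

Splitting only once keeps entries whose right-hand side contains spaces intact, which matters for the pipe-style entries shown in the use cases later in this document.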
# Please change directory before using our shell scripts
cd egs2/some_corpus/some_task
cmd=utils/run.pl
nj=10 # Number of parallel jobs
audio_format=flac # The audio codec of output files
fs=16k # The sampling frequency of output files
ref_channels=0 # If the input data has multiple channels and you want to use only a single channel of the file (please specify the channel with a 0-based number)
./scripts/audio/format_wav_scp.sh --nj "${nj}" --cmd "${cmd}" --audio_format "${audio_format}" --fs "${fs}" --ref_channels "${ref_channels}" somewhere/wav.scp output_dir
# Then, you can find output_dir/wav.scp
See also:
- About wav.scp: https://github.com/espnet/data_example
- About cmd: Using job scheduling system
Why is audio file formatting necessary?
The audio data in a corpus obtained from its source website can be distributed in various audio file formats, which differ in the audio codec (wav of linear PCM, flac, mp3, DSD, u-law, a-law, etc.), the sampling frequency (48 kHz, 44.1 kHz, 16 kHz, 8 kHz, etc.), the bit depth (uint8, sint16, sint32, float20, float32, etc.), the number of channels (monaural, stereo, or more than 2ch), and the byte order (little endian or big endian).
When you develop a new recipe with a corpus that is not yet covered by our recipes, you can of course try to use the audio data as they are, without any formatting. However, the configuration of our DNN models typically assumes a specific audio format, especially regarding the sampling frequency and the data precision. If you want to be conservative with your new recipe, we recommend converting the audio to the format used by the original recipe. For example, 16 kHz and sint16 audio is typically used in our ASR recipes.
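Before deciding whether conversion is needed, it can help to check the properties listed above for a file in your corpus. The sketch below uses only Python's standard `wave` module, so it handles linear PCM wav files only; `describe_wav` is a hypothetical helper name, and the demo file is generated on the fly so that the example is self-contained.

```python
import os
import struct
import tempfile
import wave


def describe_wav(path):
    """Return (channels, bit depth, sampling frequency in Hz) of a PCM wav file."""
    with wave.open(path, "rb") as f:
        return f.getnchannels(), f.getsampwidth() * 8, f.getframerate()


# Write one second of silence as 16 kHz, 16-bit, monaural PCM, then inspect it.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.wav")
    with wave.open(demo_path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes = 16 bits per sample
        f.setframerate(16000)
        f.writeframes(struct.pack("<h", 0) * 16000)
    demo_info = describe_wav(demo_path)

print(demo_info)  # (1, 16, 16000)
```

For compressed or non-wav formats, a library such as soundfile would be needed instead of the `wave` module.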
The audio file formats supported in ESPnet2
ESPnet uses the Python soundfile package for data loading, so the supported audio codecs depend on libsndfile.
You can check the audio codecs supported by soundfile with the following commands:
import soundfile
print(soundfile.available_formats())
Note that the wav.scp of Kaldi originally requires the audio format to be wav with the pcm_s16le type, but the wav.scp of ESPnet2 can handle all audio formats supported by soundfile. For example, you can use the flac format in wav.scp for the input and output of format_wav_scp.py.
Depending on the situation, you may choose one of the following codecs:
Codec | Compression | Maximum channels | Maximum sampling frequency | Note |
---|---|---|---|---|
wav (Microsoft wav with linear pcm) | No | 1024 | - | |
flac | Lossless | 8 | 192 kHz | |
mp3 | Lossy | 2 | 48 kHz | The patent of MP3 has expired |
ogg (Vorbis) | Lossy | 255 | 192 kHz | Segmentation fault happens |
By default, we select flac because flac can compress linear pcm files at a compression rate of ~55 % without data loss. flac helps to reduce the IO load, especially when training with a large corpus. If you would like to change it to another format, please use the --audio_format option of run.sh.
cd egs2/some_corpus/some_task
./run.sh --audio_format mp3
Note that if the audio files in your corpus are distributed in a lossy audio codec, such as MP3, it's better to keep the file format to avoid duplicating the full corpus in an uncompressed format. If the input audio format is exactly the same as the output format, format_wav_scp.py skips the generation of the output files and reuses the input files.
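To get a feeling for the storage involved, the arithmetic can be sketched as follows. This is only a back-of-the-envelope estimate: it reads the "~55 %" figure quoted above as "the flac output is about 55 % of the linear PCM size", which is an assumption, and the real ratio depends on the audio content. The helper name `pcm_size_bytes` and the 1000-hour corpus size are illustrative.

```python
def pcm_size_bytes(hours, fs_hz=16000, channels=1, bytes_per_sample=2):
    """Size of uncompressed linear PCM audio, ignoring file headers."""
    return int(hours * 3600 * fs_hz * channels * bytes_per_sample)


FLAC_RATIO = 0.55  # assumption: flac output is ~55 % of the PCM size

pcm = pcm_size_bytes(hours=1000)  # 16 kHz, sint16, monaural
flac = pcm * FLAC_RATIO
print(f"PCM:  {pcm / 1e9:.1f} GB")
print(f"flac: {flac / 1e9:.1f} GB")
```

Under these assumptions, 1000 hours of 16 kHz sint16 monaural audio takes about 115 GB as linear PCM.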
Use cases
Case1: Extract segments from a long recording
Create wav.scp and segments. Each line of segments has the format <utterance_id> <wav_id> <start_time> <end_time>, with the times given in seconds.
wav.scp:
record_a a.wav
...
segments:
segment_a record_a 0.98 11.56
segment_b record_a 12.34 15.43
...
Then, you can extract the segments with:
./scripts/audio/format_wav_scp.sh --segments segments wav.scp output_dir
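The segments format above can be read with a few lines of Python. The sketch below is illustrative only and is not the code used by format_wav_scp.py: `parse_segments` and `to_sample_range` are hypothetical names, the 16 kHz sampling frequency is an assumed value, and rounding to the nearest sample is one possible choice.

```python
def parse_segments(text):
    """Parse segments lines into (utterance_id, wav_id, start_sec, end_sec)."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        utt_id, wav_id, start, end = line.split()
        segments.append((utt_id, wav_id, float(start), float(end)))
    return segments


def to_sample_range(start_sec, end_sec, fs_hz):
    """Convert a time span in seconds into (start, end) sample indices."""
    return int(round(start_sec * fs_hz)), int(round(end_sec * fs_hz))


fs = 16000  # assumed sampling frequency of record_a
parsed = parse_segments("segment_a record_a 0.98 11.56\n")
for utt_id, wav_id, start, end in parsed:
    print(utt_id, wav_id, to_sample_range(start, end, fs))
```

Each resulting sample range selects the part of the recording named by <wav_id> that belongs to the utterance.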
Case2: Extract audio data from video files / Use a format not supported by soundfile
ffmpeg is required. Create wav.scp as follows:
ID_a ffmpeg -i "ID_a.mp4" -f wav -af pan="1c|c0=c0" -acodec pcm_s16le - |
ID_b ffmpeg -i "ID_b.mp4" -f wav -af pan="1c|c0=c0" -acodec pcm_s16le - |
...
- Note: -af pan is the pan filter. <num>c| specifies the number <num> of output channels, and |c<out-channel>=c<in-channel> assigns the <in-channel>th channel of the input stream to the <out-channel>th channel of the output stream.
- Caution: The -map_channel option is deprecated and will be removed.
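When there are many video files, writing such a wav.scp by hand becomes tedious. The following sketch generates the lines from a list of paths, reusing the ffmpeg command shown above unchanged. The helper name `make_wav_scp_lines` and the choice of the file stem as the ID are assumptions for illustration.

```python
from pathlib import Path

# Same command as in the wav.scp example above; {path} is filled in per file.
FFMPEG_TEMPLATE = 'ffmpeg -i "{path}" -f wav -af pan="1c|c0=c0" -acodec pcm_s16le - |'


def make_wav_scp_lines(video_paths):
    """Build one wav.scp line per video, using the file stem as the ID."""
    lines = []
    for path in sorted(video_paths):
        utt_id = Path(path).stem  # "ID_a.mp4" -> "ID_a"
        lines.append(f"{utt_id} {FFMPEG_TEMPLATE.format(path=path)}")
    return lines


scp_lines = make_wav_scp_lines(["ID_b.mp4", "ID_a.mp4"])
for line in scp_lines:
    print(line)
```

This sketch does not escape special characters, so paths containing double quotes or other shell metacharacters would need extra care.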
Case3: Convert NIST Sphere files to wav
sph2pipe is required. Create wav.scp as follows:
ID_a sph2pipe -f wav -p -c 1 ID_a.sph |
ID_b sph2pipe -f wav -p -c 1 ID_b.sph |
...
Case4: Handling multi-channel audio
If you want to generate a multi-channel audio file from monaural audio files, create the following wav.scp:
ID_a a1.wav a2.wav
...
and run the following command:
./scripts/audio/format_wav_scp.sh --multi_columns_input true wav.scp output_dir
Conversely, if you want to generate monaural audio files from multi-channel audio files, run:
./scripts/audio/format_wav_scp.sh --multi_columns_output true wav.scp output_dir
Then, you get a wav.scp like the following:
ID_a output_dir/ID_a-CH0.wav output_dir/ID_a-CH1.wav
...
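The multi-column layout can be sketched as follows: one ID followed by one path per channel. The file naming `<ID>-CH<n>.wav` is only inferred from the example output above and may not match the script's actual behavior in every configuration; both function names are hypothetical.

```python
def parse_multi_column_line(line):
    """Split a multi-column wav.scp line into (ID, [one path per channel])."""
    utt_id, *paths = line.split()
    return utt_id, paths


def channel_output_line(utt_id, output_dir, num_channels, ext="wav"):
    """Build a multi-column line with one output file per channel (0-based)."""
    paths = [f"{output_dir}/{utt_id}-CH{ch}.{ext}" for ch in range(num_channels)]
    return " ".join([utt_id] + paths)


out_line = channel_output_line("ID_a", "output_dir", 2)
print(out_line)
print(parse_multi_column_line(out_line))
```

Note that splitting on whitespace in this way only works when the paths themselves contain no spaces.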