# Converting audio file formats using format_wav_scp.py The [format_wav_scp.py](https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/asr1/pyscripts/audio/format_wav_scp.py) is an utility to convert the audio format of the files specified `wav.scp` and the [format_wav_scp.sh](https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/asr1/scripts/audio/format_wav_scp.sh) is a shell script wrapping `format_wav_scp.py`. In the typical case, in the stage3 of the [template recipe](https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE), `format_wav_scp.sh` is used to convert the audio file format of your original corpus to the audio format which you actually want to feed to the DNN model. `format_wav_scp.py` and `format_wav_scp.sh` has same function of generation `wav.scp` from `wav.scp`, but `format_wav_scp.sh` is different in that it has the capability of parallel processing. ``` wav.scp -> [format_wav_scp.py] -> wav.scp wav.scp -> [format_wav_scp.sh] -> wav.scp ``` Note that `format_wav_scp.py` dumps audio files with linear PCM with `sint16` regardless the input audio format. ## Quick usage At the first, you need to prepare a text file named as `wav.scp`: ``` ID_a /some_where/a.wav ID_b /some_where2/b.wav ... ``` `ID_a`and `ID_b` are the IDs which you can name arbitrarily to specify audio files. Note that **we don't assume any directory stuctures for the audio files**. ```sh # Please change directory before using our shell scripts cd egs2/some_corpus/some_task cmd=utils/run.pl nj=10 # Number of parallel jobs audio_format=flac # The audio codec of output files fs=16k # The sampling frequency of output files ref_channels=0 # If the input data has multiple channels and you want to use only a single channel in the file (please spicify the channel with 0-based number) ./scripts/audio/format_wav_scp.sh --nj "${nj}" --cmd "${cmd}" --audio_format "${audio_format}" --fs "${fs}" --ref_channels "${ref_channels}" somewhere/wav.scp output_dir # Then, you can find output_dir/wav.scp ``` See also: - About `wav.scp`: https://github.com/espnet/data_example - About `cmd`: [Using job scheduling system](parallelization.md) ## Why is audio file formatting necessary? The audio data included in the corpus obtained from the source website are distributed in various audio file formats, i.e., the audio codec (`wav` of `linear PCM`, `flac`, `mp3`, `DSD`, `u-law`, `a-law`or etc.), the sampling frequency (`48khz`, `44.1khz`, `16khz`, `8khz`, or etc.), the bit depth (`uint8`, `sint16`, `sint32`, `float20`, `float32` or etc.), the number of channels (`monaural`, `stereo`, or more than 2ch), the byter order(`little endian` or `big endian`). When you try to develop a new recipe with a corpus that is not yet prepared in our recipes, of course, you can also try to use the audio data as they are without any formatting. However, in a typical case, the configuration of our DNN model may assume the specific audio format, especially regarding the sampling frequency and the data precision. If you are conservative with your new recipe, we recommend converting them to the original recipe's audio format. For example, `16khz` and `sint16` audio is typically used in our ASR recipes. ## The audio file formats supported in ESPnet2 ESPnet adopts [python soundifile](https://github.com/bastibe/python-soundfile) for data loading, and, thus the supported audio codecs depend on [libsndfile](http://www.mega-nerd.com/libsndfile/). You can check the supported audio codecs of `soundfile` with the following command: ```python import soundfile print(soundfile.available_formats()) ``` Note that the `wav.scp` of Kaldi originally requires that the audio format is wav with pcm_s16le type, but **`wav.scp` of ESPnet2 can handle all audio formats supported by soundfile**. e.g. You can use `flac` format in `wav.scp` for the input/output of `format_wav_scp.py`. Depending on the situation, you may choose one of the following codecs: | Codec | Compression | Maximum channnels | Maximum sampling frequency|Note| | ---- | ---- | ---- | ---- | ---- | | wav (Microsoft wav with linear pcm) | No | 1024 | - | | | flac | Lossless | 8 | 192khz || | mp3 | Lossy | 2 | 48khz | The patent of MP3 has expired | | ogg (Vorbis) | Lossy | 255 | 192khz | Segmentation fault happens | By default, we select `flac` because `flac` can convert linear pcm files with compression rate of ~55 % without data loss. `flac` is helpful to reduce the IO load, especially, when training with a large amount of corpus. If you would like to change it to the other format, please use `--audio_format` option for `run.sh`. ```sh cd egs2/some_corpus/some_task ./run.sh --audio_format mp3 ``` Note that if the audio files in your corpus are disributed with lossy audio codec, such as `MP3`, it's better to keep the file format to avoid the duplication of the full corpus with the uncompressed format.  **If the input audio format type is exactly same as the output format, `format_wav_scp.py` avoid the gengeration of the output files and reuse the input files**. ## Use case ### Case1: Extract segmentations with long recoding Create `wav.scp` and `segments` with the format of `The format is ` (second unit). `wav.scp`: ``` record_a a.wav ... ``` `segments`: ``` segment_a record_a 0.98 11.56 segment_a record_a 12.34 15.43 ... ``` Then, you can extract the segments with: ```sh ./scripts/audio/format_wav_scp.sh --segments segments wav.scp output_dir ``` ### Case2: Extract audio data from video codec / Use non supported format by soundfile `ffmpeg` is required. Create `wav.scp` as following: ``` ID_a ffmpeg -i "ID_a.mp4" -f wav -af pan="1c|c0=c0" -acodec pcm_s16le - | ID_b ffmpeg -i "ID_b.mp4" -f wav -af pan="1c|c0=c0" -acodec pcm_s16le - | ... ``` - Note: `-af pan` is [pan filter](https://ffmpeg.org/ffmpeg-filters.html#pan-1). - `c|` specifies `` of output channels - `|c=c` assigns ``th channel of input stream into ``th channel of output stream - Caution: `-map_channel` option is deprecated and will be removed. ### Case3: Convert NIST Sphere files to wav `sph2pipe` is required. Create `wav.scp` as following: ``` ID_a sph2pipe -f wav -p -c 1 ID_a.sph | ID_b sph2pipe -f wav -p -c 1 ID_b.sph | ... ``` ### Case4: Using a mechanism for multi channels inputs If you are going to generate multi channels audio file from monaural audio files, create the following wav.scp: ``` ID_a a1.wav a2.wav ... ``` and run the following commands: ```sh ./scripts/audio/format_wav_scp.sh --multi_columns_input true wav.scp output_dir ``` Conversely, if you and going to monaural audio files from multi channels audio files ```sh ./scripts/audio/format_wav_scp.sh --multi_columns_output true wav.scp output_dir ``` Then, you can get `wav.scp` like the following file: ``` ID_a output_dir/IDa-CH0.wav output_dir/ID_a-CH1.wav ...