Speaker Representation


This is a template of the Spk1 recipe for ESPnet2. It follows d-vector style training and inference for speaker verification: a DNN is trained as a closed-set speaker classifier, and after training the classification head is removed. The last hidden layer (or sometimes another layer) then serves as the speaker representation (i.e., speaker embedding) for diverse open-set speakers.

Recipe flow
How to run
- VoxCeleb training
Related works

Recipe flow

The Spk1 recipe consists of 4 stages.

Data preparation

Data preparation stage.

ESPnet format:

It calls local/data.sh to create Kaldi-style data directories in data/ for the training, validation, and evaluation sets. This is the same as in asr1 tasks.
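To make "Kaldi-style data directory" concrete, here is a minimal sketch that builds a toy one by hand. It assumes the common Kaldi layout of wav.scp, utt2spk, and spk2utt; the utterance IDs, speaker IDs, audio paths, and the `demo_train` directory name are all hypothetical, and the exact set of files that local/data.sh writes for this recipe is not confirmed here.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical scratch location; a real recipe writes to data/<set_name>/ instead.
demo_dir="$(mktemp -d)/data/demo_train"
mkdir -p "${demo_dir}"

# wav.scp: one "<utterance-id> <audio-path>" pair per line (paths are made up).
cat > "${demo_dir}/wav.scp" <<'EOF'
spkA-utt1 /path/to/spkA/utt1.wav
spkA-utt2 /path/to/spkA/utt2.wav
spkB-utt1 /path/to/spkB/utt1.wav
EOF

# utt2spk: one "<utterance-id> <speaker-id>" pair per line.
cat > "${demo_dir}/utt2spk" <<'EOF'
spkA-utt1 spkA
spkA-utt2 spkA
spkB-utt1 spkB
EOF

# spk2utt: "<speaker-id> <utt-id> <utt-id> ...", derived by grouping utt2spk by speaker.
awk '{
  if ($2 in utts) { utts[$2] = utts[$2] " " $1 } else { utts[$2] = $1 }
}
END { for (spk in utts) { print spk, utts[spk] } }' "${demo_dir}/utt2spk" \
  | LC_ALL=C sort > "${demo_dir}/spk2utt"

cat "${demo_dir}/spk2utt"
```

The resulting spk2utt groups the two spkA utterances on one line and the single spkB utterance on another. Prefixing each utterance ID with its speaker ID, as done here, is a common Kaldi convention that keeps the files consistently sorted.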

How to run

VoxCeleb Training

Here, we show the procedure to run the recipe using egs2/voxceleb/spk1.

Move to the recipe directory.

$ cd egs2/voxceleb/spk1

Modify the VOXCELEB1 and VOXCELEB2 variables in db.sh if you want to change the download directory.

$ vim db.sh
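As an alternative to editing by hand, the two variables can be set with a scripted edit, which may be convenient on shared or automated setups. The sketch below works on a stand-in file in a scratch directory rather than the real db.sh; the one-assignment-per-line layout, the initial values, and the `/mnt/corpora/...` target paths are assumptions for illustration only.

```shell
#!/usr/bin/env bash
set -eu

# Stand-in for db.sh in a scratch directory (assumed layout: NAME=value per line).
work_dir="$(mktemp -d)"
cat > "${work_dir}/db.sh" <<'EOF'
VOXCELEB1=downloads/voxceleb1
VOXCELEB2=downloads/voxceleb2
EOF

# Hypothetical target directories; replace with real storage paths.
new_vox1=/mnt/corpora/voxceleb1
new_vox2=/mnt/corpora/voxceleb2

# Rewrite both assignments; going through a temp file avoids non-portable `sed -i`.
sed -e "s|^VOXCELEB1=.*|VOXCELEB1=${new_vox1}|" \
    -e "s|^VOXCELEB2=.*|VOXCELEB2=${new_vox2}|" \
    "${work_dir}/db.sh" > "${work_dir}/db.sh.new"
mv "${work_dir}/db.sh.new" "${work_dir}/db.sh"

# Source the edited file to confirm the values a recipe script would see.
. "${work_dir}/db.sh"
echo "VOXCELEB1=${VOXCELEB1}"
echo "VOXCELEB2=${VOXCELEB2}"
```

If the real db.sh sets these variables differently (for example inside host-specific conditionals), the `sed` patterns would need adjusting, so check the file before applying a scripted edit to it.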

Modify cmd.sh and conf/*.conf if you want to use a job scheduler. For details, see the documentation on using a job scheduling system.

$ vim cmd.sh

Run run.sh, which conducts all of the stages explained above.

$ ./run.sh

Related works

@INPROCEEDINGS{jung2022pushing,
  title={Pushing the limits of raw waveform speaker recognition},
  author={Jung, Jee-weon and Kim, You Jin and Heo, Hee-Soo and Lee, Bong-Jin and Kwon, Youngki and Chung, Joon Son},
  year={2022},
  booktitle={Proc. INTERSPEECH}
}
