CMU 11492/11692 Spring 2023: Speaker Recognition
In this demonstration, we will show you the procedure to conduct speaker recognition with the ASR functions of ESPnet.
Author:
- Jiatong Shi (jiatongs@andrew.cmu.edu)
Objectives
After this demonstration, you are expected to understand the main procedure of using ESPnet ASR functions for speaker recognition.
❗Important Notes❗
- We are using Colab to show the demo. However, Colab has some constraints on the total GPU runtime. If you use too much GPU time, you may not be able to use GPU for some time.
- There are multiple in-class checkpoints ✅ throughout this tutorial. Your participation points are based on these tasks. Please try your best to follow all the steps! If you encounter issues, please notify the TAs as soon as possible so that we can make an adjustment for you.
- Please submit PDF files of your completed notebooks to Gradescope. You can print the notebook using
File -> Print
in the menu bar.
ESPnet installation
We follow the same ESPnet installation procedure as in the previous tutorials (it takes around 15 minutes).
!git clone --depth 5 -b 2023spring_speaker_recognition https://github.com/espnet/espnet
%cd /content/espnet/tools
!./setup_anaconda.sh anaconda espnet 3.9
# It may take around 12 minutes
%cd /content/espnet/tools
!make TH_VERSION=1.12.1 CUDA_VERSION=11.6
!. ./activate_python.sh && installers/install_speechbrain.sh
!. ./activate_python.sh && installers/install_rawnet.sh
!. ./activate_python.sh && pip install ipykernel
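Before moving on, it can be useful to sanity-check the installation from within the installed environment. The cell below is only a minimal check (it assumes the standard espnet and torch packages expose __version__); adjust it if your environment layout differs.
!. ./activate_python.sh && python3 -c "import espnet, torch; print(espnet.__version__, torch.__version__)"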
Speaker Recognition
Speaker recognition is a typical utterance-level classification task: we map an utterance into a pre-defined category. Recall that ASR performs a sequence-to-sequence task, so we can easily reuse ASR by treating the target as a length-1 sequence (i.e., the class label). Following this idea, we can start implementing the speaker recognition system! Note that, following the definition in the lecture, today we will focus on speaker identification (closed-set classification) rather than speaker verification.
Dataset
mini_librispeech is a tiny subset of the librispeech dataset intended for development usage. Because of its free license and the cleanness of the data, librispeech has been one of the most widely used corpora in the speech community. For more details, please refer to its original paper. In this demonstration, we will use the train set of mini_librispeech to train and test a simple speaker recognition model.
First of all, let's get into the directory to check the structure.
%cd /content/espnet/egs2/mini_librispeech/sid1
!ls -l
Data Preparation
Similar to the previous tutorials, we will use the Kaldi-style format for data preparation. The difference in this recipe is that we need to predict the speaker ID instead of the transcription. Therefore, a straightforward approach is to simply fill text with the speaker labels from utt2spk (a minimal sketch of this conversion is shown after the format example below).
So the final files after preparation should be:
wav.scp text utt2spk spk2utt
Compared with the ASR recipes, the format of text is changed into:
utt_id1 spk_id0
utt_id2 spk_id0
utt_id3 spk_id1
where spk_id0 and spk_id1 refer to speaker IDs.
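As a reference for what this preparation amounts to, here is a minimal Python sketch (not part of the recipe; the paths are hypothetical) that rewrites a Kaldi-style text file so that each utterance's "transcript" is simply its speaker ID taken from utt2spk:
# Minimal sketch (not part of the recipe): build a speaker-ID "text" file
# from a standard Kaldi-style utt2spk file with lines "utt_id spk_id".
def make_speaker_text(utt2spk_path, text_out_path):
    with open(utt2spk_path, encoding="utf-8") as fin, \
         open(text_out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            utt_id, spk_id = line.strip().split(maxsplit=1)
            # The speaker ID becomes the length-1 target sequence for the "ASR" model.
            fout.write(f"{utt_id} {spk_id}\n")

# Hypothetical usage:
# make_speaker_text("data/train/utt2spk", "data/train/text")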
!./run.sh --stage 1 --stop_stage 1
Data Preprocessing
For data preprocessing, we follow a similar procedure to the previous tutorials/assignments.
!./run.sh --stage 2 --stop_stage 5
Question1 (✅ Checkpoint 1 (1 point))
In previous tutorials, we usually used characters as our modeling units. Here, however, we use a speaker ID, which is a sequence of characters representing one speaker. So, in our preprocessing, which tokenizer (e.g., char, bpe, phn, word) is actually used to achieve speaker prediction? Please also indicate your reason(s).
To help you understand more, please check the documentation at https://espnet.github.io/espnet/search.html?q=tokenizer&check_keywords=yes&area=default
(For question-based checkpoint: please directly answer it in the text box)
[ANSWER HERE]
Use Pre-trained speaker representation
One feature of ESPnet is the ability to adopt pre-trained speaker representations from other toolkits (including TDNN-based speaker embedding extraction from SpeechBrain and RawNet-based speaker embeddings from RawNet). We can efficiently extract the speaker embeddings with our supported scripts.
These speaker embeddings can be used for text-to-speech purposes to handle multi-speaker synthesis. In this demonstration, we directly use the extraction models for speaker recognition.
!cat ./local/extract_xvector.sh
!./local/extract_xvector.sh
After calculating the x-vectors, we can also analyze the embeddings with the t-SNE algorithm. The t-SNE image is located in the extracted x-vector folder.
from IPython.display import Image, display
display(Image('dump/extracted/train/tsne.png'))
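For reference, a t-SNE projection like the one above can be reproduced from any matrix of embeddings with scikit-learn. The sketch below uses placeholder arrays (random embeddings and labels), since it does not assume the exact dump format of the extracted x-vectors:
# Minimal t-SNE sketch with scikit-learn; `embeddings` and `speakers` are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.randn(200, 192)       # stand-in for (n_utts, emb_dim) x-vectors
speakers = np.repeat(np.arange(10), 20)      # stand-in speaker labels

points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=speakers, cmap="tab10", s=8)
plt.title("t-SNE of speaker embeddings")
plt.savefig("tsne_sketch.png")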
Extract speaker embedding from SpeechBrain
Similarly, we can also extract speaker embeddings from SpeechBrain.
!cat ./local/extract_xvector_speechbrain.sh
!./local/extract_xvector_speechbrain.sh
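For reference, SpeechBrain also exposes its pretrained speaker encoders directly in Python. The sketch below is illustrative only: it uses the public spkrec-ecapa-voxceleb model and a hypothetical wav path, which are not necessarily what the script above uses.
# Minimal sketch: extract one speaker embedding with a pretrained SpeechBrain encoder.
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",      # public pretrained speaker encoder
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)
signal, sr = torchaudio.load("example.wav")          # hypothetical 16 kHz wav path
embedding = classifier.encode_batch(signal)          # shape: (batch, 1, emb_dim)
print(embedding.squeeze().shape)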
Similar to the RawNet-based embeddings above, we can visualize the embeddings from SpeechBrain with a t-SNE plot.
from IPython.display import Image, display
display(Image('dump/extracted_speechbrain/train/tsne.png'))
Training for speaker recognition
First, let's use the x-vectors extracted from the TDNN-based SpeechBrain model to conduct speaker recognition.
!cat ./run_xvector_speechbrain.sh
!./run_xvector_speechbrain.sh
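To build intuition for what "speaker recognition on top of fixed embeddings" amounts to, a simple classifier over the extracted vectors is already a reasonable baseline. The sketch below uses scikit-learn with placeholder arrays rather than the recipe's actual dump format, so it illustrates the idea rather than reproducing the run script above.
# Minimal sketch: closed-set speaker identification from fixed embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.random.randn(400, 192)        # placeholder speaker embeddings (n_utts, emb_dim)
y = np.repeat(np.arange(10), 40)     # placeholder speaker IDs

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("identification accuracy:", clf.score(X_te, y_te))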
Question2 (✅ Checkpoint 2 (0.5 point))
We still use the ASR scoring scheme for our evaluation because it is already sufficient. Please briefly discuss which metric can be used for evaluation of the accuracy/error rate of speaker recognition results.
(For question-based checkpoint: please directly answer it in the text box)
[ANSWER HERE]
Then, let's use the RawNet-based x-vectors to conduct speaker recognition.
!cat ./run_xvector.sh
!./run_xvector.sh
Question3 (✅ Checkpoint 3 (0.5 point))
Clearly, we find some differences in the numbers between the TDNN-based speaker embeddings and the RawNet-based speaker embeddings. Could you briefly explain some possible reasons why we get such different results?
(For question-based checkpoint: please directly answer it in the text box)
[ANSWER HERE]
We can also use the ESPnet ASR model directly for speaker recognition purposes by predicting the speaker ID as the target.
!./run.sh --stage 10
Question4 (✅ Checkpoint 4 (0.5 point))
We can get reasonable performance with the ASR model. However, we can easily see that the training is much more time-consuming than training on top of speaker embeddings. Could you please explain why we have such differences?
(For question-based checkpoint: please directly answer it in the text box)
[ANSWER HERE]