CMU 11751/18781 Fall 2022: ESPnet Tutorial2 (New task)

ESPnet is a widely used end-to-end speech processing toolkit that supports various speech processing tasks. It uses PyTorch as its main deep learning engine and follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments.

Main references:
- ESPnet repository
- ESPnet documentation
- ESPnet tutorial in Speech Recognition and Understanding (Fall 2021)
- Recitation in Multilingual NLP (Spring 2022)
- ESPnet tutorial1 in Speech Recognition and Understanding (Fall 2022)

Author: Jiatong Shi

We would like to thank You (Neil) Zhang for kindly helping with the hands-on tutorial and sharing his knowledge of the task.

❗Important Notes❗

  • We are using Colab to show the demo. However, Colab has some constraints on the total GPU runtime. If you use too much GPU time, you may not be able to use GPU for some time.

  • There are multiple in-class checkpoints ✅ throughout this tutorial. There will also be some after-class exercises 📗 after the tutorial. Your participation points are based on these tasks. Please try your best to follow all the steps! If you encounter issues, please notify the TAs as soon as possible so that we can make an adjustment for you.

  • Please submit PDF files of your completed notebooks to Gradescope. You can print the notebook using File -> Print in the menu bar.

  • This tutorial covers some advanced usage of ESPnet and extends the first tutorial.


After this tutorial, you are expected to know:
- How to add a new task in ESPnet2
- How to add new models in ESPnet2
- How to create a new recipe (and template) for a new task from scratch

Function to print date and time

We first define a function to print the current date and time, which will be used in multiple places below.

[ ]:
def print_date_and_time():
  from datetime import datetime
  import pytz

  now = datetime.now(pytz.timezone("America/New_York"))
  print("=" * 60)
  print(f' Current date and time: {now.strftime("%m/%d/%Y %H:%M:%S")}')
  print("=" * 60)

# example output
print_date_and_time()

Install ESPnet (almost the same procedure as in your first tutorial)

Download ESPnet

We use git clone to download the source code of ESPnet and then go to a specific commit.

Important: In other versions of ESPnet, you may encounter errors related to incompatible package versions (numba). Please use the same commit to avoid such issues.

Note that we are using another branch, 2022fall_new_task_tutorial, instead of “master”. You can also use your own fork to proceed with the following sections if you want to use GitHub to save your code.

[ ]:
# It takes a few seconds
!git clone --depth 5 -b 2022fall_new_task_tutorial

# We use a specific commit just for reproducibility.
%cd /content/espnet
!git checkout 9cff98a78ceaa4d85843be0a50b369ec826b27f6

Setup Python environment based on anaconda + Install ESPnet

[ ]:
# It takes 30 seconds
%cd /content/espnet/tools
!./ anaconda espnet 3.9

# It may take 12 minutes
%cd /content/espnet/tools
!make TH_VERSION=1.12.1 CUDA_VERSION=11.6

What we provide you and what you need to proceed

We have provided you with most of the files needed for the ASVSpoof recipe, so you do not need to add any additional files. However, note that some of the files are incomplete and need to be completed by you before you can proceed. For a quick overview of the whole layout of the new task, please refer to…2022fall_new_task_tutorial

As explained in the warm-up, there are two core components for a new task in ESPnet: a task library and the corresponding recipe setup. In the rest of this section, we briefly show the overall layout of adding the ASVSpoof task to ESPnet. The listed files are close to the minimum required to add a new task in ESPnet.

Task library for ASVSpoof

The following files are added to ESPnet for ASVSpoof (files in "" are ones that need modification):

- espnet2
  - bin
    - # Major entry point for asvspoof
    - "" (Checkpoint 4) # Inference scripts for asvspoof
  - asvspoof
    - decoder
      - # abstract class for decoder in ASVSpoof
      - "" (Checkpoint 3) # simple linear decoder for ASVSpoof
    - loss
      - # abstract class for loss in ASVSpoof
      - # naive binary class loss for ASVSpoof
      - "" (Bonus)
    - "" (Bonus)
  - tasks
    - "" (Checkpoint 2)

To understand this better, we recommend checking the layout of other tasks (e.g., ASR, TTS, ST) to see how the codebase works.

Recipe for ASVSpoof

The following files are added to ESPnet for the ASVSpoof recipe (files in "" are ones that need modification):

- egs2
    - asvspoof1
      - "" (Checkpoint 1)
      - others
  - espnet_tutorial
    - asvspoof1
      - conf
      - "" (Checkpoint 1)
      - local
        - "" (Bonus)
        - "" (Bonus)
      - "" (Checkpoint 5)
      - scripts
      - pyscripts
      - utils
      - steps

Note that, because of the symlink, the two files marked Checkpoint 1 above are essentially the same file.

ASVSpoof data preparation

As discussed in the warm-up session, ASVSpoof is a binary classification task. Because the task layout is a bit different from the ASR task we touched on in the first tutorial, we need a different format for the data. To keep things simple, we still use exactly the same files as in the first tutorial:

wav.scp text utt2spk spk2utt

However, we change the format of text to:

utt_id1 0
utt_id2 1
utt_id3 0

where 0 represents real speech and 1 stands for fake speech.
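As a small illustration of this label format, the sketch below writes and re-reads a `text` file in plain Python. The helper names `write_labels` and `read_labels` and the temporary path are hypothetical and are not part of ESPnet; the recipe's own data preparation script may produce the file differently.

```python
import os
import tempfile


def write_labels(path, labels):
    """Write one 'utt_id label' pair per line, sorted by utterance ID."""
    with open(path, "w", encoding="utf-8") as f:
        for utt_id in sorted(labels):
            f.write(f"{utt_id} {labels[utt_id]}\n")


def read_labels(path):
    """Read 'utt_id label' lines into a dict; labels must be 0 (real) or 1 (fake)."""
    labels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            utt_id, label = line.split(maxsplit=1)
            if label not in ("0", "1"):
                raise ValueError(f"unexpected label {label!r} for {utt_id}")
            labels[utt_id] = int(label)
    return labels


# Toy data mirroring the example above (hypothetical utterance IDs).
example = {"utt_id1": 0, "utt_id2": 1, "utt_id3": 0}
with tempfile.TemporaryDirectory() as tmpdir:
    text_path = os.path.join(tmpdir, "text")
    write_labels(text_path, example)
    loaded = read_labels(text_path)
print(loaded)
```

Validating the label while reading catches an ASR-style transcript that was left in `text` by mistake.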

Download dataset

We first download the data from Google Drive. Note that the data is a subset of the ASVSpoof2019 Challenge.

[ ]:
# a few seconds
%cd /content/espnet/egs2/espnet_tutorial/asvspoof1/
!gdown 1HRdjjmGXBTXOqOq9iijuXPCA4y_46OzP

Prepare data (Stage1 & Stage2)

This time, we keep the task template as simple as possible. Data preparation has only two stages: basic data preparation and wave formatting.

[ ]:
# It may take around 6 minutes
!./ --stage 1 --stop_stage 2 --train_set train --valid_set dev --test_sets "eval"

ASVSpoof collect stats (✅ Checkpoint 1 (1 point))

Similar to the previous tutorial, we collect the statistics for the data.

In the process, the data will be passed into an iterable loader. However, remember that the text file is no longer in the same format as in the ASR recipe. Therefore, we need to use another data loader to load the corresponding information.

Fortunately, we have a wide range of data loaders to choose from, which are listed here. Please choose the correct file format and replace the [REPLACE_ME] token in

After the replacement, you should be able to run the following block.

[ ]:
# It takes less than 2 minutes
!./ --stage 3 --stop_stage 3 --train_set train --valid_set dev --test_sets "dev eval" --asvspoof_config conf/checkpoint1_dummy.yaml

# NOTE: Checkpoint 1

ASVSpoof Model

In this section, we will define the ASVSpoof model and use the model to conduct the training of ASVSpoof task. For easier understanding, we first use an encoder to convert speech features into hidden representations and then use a decoder to conduct the classification.

Encoder (✅ Checkpoint 2 (1 point))

First, we focus on the encoder. The choice of speech encoder has a long history of discussion in our community. Because speech is sequential, people first investigated recurrent neural networks. More recently, the focus has shifted to the conformer block, which is an extension of the transformer block. In the previous setting, we used a transformer block to collect stats. Now we want to switch to the conformer.

Code reusability is one of the major benefits of using ESPnet as a toolkit for speech tasks. As ESPnet already supports the conformer block in ASR, it is easy to import it into this new task.

In ESPnet, adding a module that already exists can take as little as two lines of code. Please add the lines to /content/espnet/espnet2/tasks/ (we have marked TODOs in the script for your convenience).
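To see why two lines can be enough, the sketch below shows the general name-to-class registry pattern: one line makes the class available, and one line adds it to a lookup table that the config file selects from. This is a self-contained illustration, not ESPnet's actual code; `Registry`, the placeholder encoder classes, and the config keys are all hypothetical.

```python
class Registry:
    """Hypothetical name-to-class lookup driven by a config value."""

    def __init__(self, name, classes, default):
        if default not in classes:
            raise ValueError(f"default {default!r} is not a registered {name}")
        self.name = name
        self.classes = dict(classes)
        self.default = default

    def get_class(self, key=None):
        key = self.default if key is None else key
        if key not in self.classes:
            raise ValueError(
                f"{self.name} must be one of {sorted(self.classes)}, got {key!r}"
            )
        return self.classes[key]


class TransformerEncoder:
    """Placeholder standing in for an encoder that is already registered."""

    def __init__(self, num_blocks=1):
        self.num_blocks = num_blocks


class ConformerEncoder(TransformerEncoder):
    """Placeholder for the existing encoder we want to expose (line 1: import it)."""


encoder_choices = Registry(
    "encoder",
    classes={
        "transformer": TransformerEncoder,
        "conformer": ConformerEncoder,  # line 2: register it under a name
    },
    default="transformer",
)

# A config (hypothetical keys) then selects the encoder by name.
config = {"encoder": "conformer", "encoder_conf": {"num_blocks": 2}}
encoder_class = encoder_choices.get_class(config["encoder"])
encoder = encoder_class(**config["encoder_conf"])
print(type(encoder).__name__, encoder.num_blocks)
```

With this pattern, switching encoders is a config change only; the training code never needs to name a concrete class.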

[ ]:
# It takes less than 2 minutes
!./ --stage 3 --stop_stage 3 --train_set train --valid_set dev --test_sets "dev eval" --asvspoof_config conf/checkpoint2.yaml

# NOTE: Checkpoint 2

Decoder (✅ Checkpoint 3 (1 point))

In this stage, we will finally start the training. As in the previous tutorial, we can use TensorBoard to monitor the process.

[ ]:
# Load the TensorBoard notebook extension
%reload_ext tensorboard

# Launch tensorboard before training
%tensorboard --logdir /content/espnet/egs2/espnet_tutorial/asvspoof1/exp

Now that the encoder is finished, we also need a decoder to make the prediction. As the encoder generates hidden representations, we want a simple decoder that applies mean-pooling to all the hidden representations along the time axis. A linear layer should then map the pooled representation to the binary classification output. Please fill in the missing part in /content/espnet/espnet2/asvspoof/decoder/ to finally start the training. If you are not familiar with PyTorch, please refer to the related resources for details.

Related resources that could be helpful for this checkpoint: - - -
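The arithmetic behind such a decoder can be sketched without any framework. The plain-Python example below averages the valid frames of one utterance, applies a single-output linear layer, and squashes the result with a sigmoid. It only illustrates the computation: the function names and toy numbers are hypothetical, the output is assumed to be the probability of label 1 (fake), and the real decoder would be a PyTorch module operating on batched tensors.

```python
import math


def masked_mean_pool(hidden, length):
    """Average the first `length` frames of a (time, dim) list of lists."""
    if length <= 0:
        raise ValueError("length must be positive")
    frames = hidden[:length]  # drop padded frames beyond the true length
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / length for d in range(dim)]


def linear(vector, weight, bias):
    """Single-output linear layer: dot(weight, vector) + bias."""
    return sum(w * x for w, x in zip(weight, vector)) + bias


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Four frames of dimension 2; the last frame is padding and must be ignored.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
pooled = masked_mean_pool(hidden, length=3)
logit = linear(pooled, weight=[0.5, -0.25], bias=0.1)
prob_fake = sigmoid(logit)
print(pooled, logit, prob_fake)
```

Pooling only over the true length matters in batched training, where shorter utterances are padded and the padding would otherwise bias the average.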

[ ]:

# Training takes around 2 minutes
!./ --stage 4 --stop_stage 4 --train_set train --valid_set dev --test_sets "dev eval" --asvspoof_config conf/checkpoint2.yaml --inference_config conf/decode_asvspoof.yaml

# NOTE: Checkpoint 3

Model Inference

(✅ Checkpoint 4 (1 point))

Now that training is finished, we want to run ASVSpoof inference on the test set. To do that, we first have to finish the inference code. For our task specifically, we need the log-probability of the prediction to compute the equal error rate (EER). Therefore, the output should be a single float for each utterance.

Please fill the missing parts with TODOs in /content/espnet/espnet2/bin/
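As a rough illustration of "one float per utterance", the sketch below turns a raw decoder output (a logit) into a log-probability with a numerically stable log-sigmoid. It assumes the model emits a single logit per utterance; the function name, the toy logits, and which class the probability refers to are assumptions, not the contents of the inference script.

```python
import math


def log_sigmoid(logit):
    """Numerically stable log(sigmoid(logit)) for a single float."""
    if logit >= 0:
        return -math.log1p(math.exp(-logit))
    return logit - math.log1p(math.exp(logit))


# Hypothetical per-utterance logits produced by the model.
logits = {"utt_id1": -2.0, "utt_id2": 3.0}
scores = {utt_id: log_sigmoid(value) for utt_id, value in logits.items()}
for utt_id in sorted(scores):
    print(utt_id, scores[utt_id])
```

Branching on the sign of the logit avoids overflow in `exp` for inputs of large magnitude, which a direct `log(1 / (1 + exp(-x)))` would not survive.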

[ ]:
!./ --stage 5 --stop_stage 5 --train_set train --valid_set dev --test_sets "eval" --asvspoof_config conf/checkpoint2.yaml --inference_nj 1 --gpu_inference true

# NOTE: Checkpoint 4


(✅ Checkpoint 5 (1 point))

We have prepared the scoring script for you. We can get the EER by running the following code block.
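For intuition about what the scoring stage reports, here is a minimal plain-Python sketch of the EER: it sweeps a threshold over the scores and returns the error rate at the point where the false positive rate and the false negative rate are closest. It assumes that a higher score means "more likely label 1"; `compute_eer` and the toy scores are hypothetical, and the provided scoring script may compute the value differently (for example, with interpolation).

```python
def compute_eer(scores, labels):
    """Approximate equal error rate for binary labels (1 = positive class)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    best_gap, eer = None, None
    # Candidate thresholds: every observed score, plus +inf (reject everything).
    for threshold in sorted(set(scores)) + [float("inf")]:
        # Negatives scored at or above the threshold are false positives.
        fpr = sum(s >= threshold for s in negatives) / len(negatives)
        # Positives scored below the threshold are false negatives.
        fnr = sum(s < threshold for s in positives) / len(positives)
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer


# Toy example: one positive (0.4) is scored below one negative (0.6).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_eer = compute_eer(toy_scores, toy_labels)
print(f"EER: {toy_eer:.2%}")
```

Because the EER depends only on the ranking of the scores, log-probabilities and probabilities give the same result.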

[ ]:
!./ --stage 6 --stop_stage 6 --train_set train --valid_set dev --test_sets "eval" --asvspoof_config conf/checkpoint2.yaml
!chmod +x scripts/utils/
# NOTE: Checkpoint 5

📗 Exercise 1 (1 point bonus)

In the data you just downloaded, we have some extra data for training (/content/espnet/egs2/espnet_tutorial/asvspoof1/espnet_asvspoof_tutorial/extend_train). Please try to combine it with the training set and then conduct experiments on the augmented set. You are also encouraged to change the model configuration. If you achieve a better equal error rate (EER) than in the previous experiments, you can get a bonus point.
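One possible way to combine two sets is sketched below. It assumes that both sets have already been converted into Kaldi-style directories containing `wav.scp`, `text`, and `utt2spk` (the raw layout of the extra data is not described here, so it may need its own preparation first). The helpers `read_table` and `combine_dirs` and the toy directories are hypothetical.

```python
import os
import tempfile


def read_table(path):
    """Read a 'key value...' file into a dict keyed by the first column."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                key, value = line.split(maxsplit=1)
                table[key] = value
    return table


def combine_dirs(src_dirs, dst_dir, names=("wav.scp", "text", "utt2spk")):
    """Merge per-utterance files, then rebuild spk2utt from the merged utt2spk."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in names:
        merged = {}
        for src in src_dirs:
            for key, value in read_table(os.path.join(src, name)).items():
                if key in merged:
                    raise ValueError(f"duplicate utterance ID {key} in {name}")
                merged[key] = value
        with open(os.path.join(dst_dir, name), "w", encoding="utf-8") as f:
            for key in sorted(merged):  # keep files sorted by utterance ID
                f.write(f"{key} {merged[key]}\n")

    spk2utt = {}
    utt2spk = read_table(os.path.join(dst_dir, "utt2spk"))
    for utt_id in sorted(utt2spk):
        spk2utt.setdefault(utt2spk[utt_id], []).append(utt_id)
    with open(os.path.join(dst_dir, "spk2utt"), "w", encoding="utf-8") as f:
        for spk in sorted(spk2utt):
            f.write(f"{spk} {' '.join(spk2utt[spk])}\n")


# Toy demonstration with two one-utterance directories.
with tempfile.TemporaryDirectory() as root:
    train, extra, out = (
        os.path.join(root, d) for d in ("train", "extend_train", "train_combined")
    )
    for d, utt_id, label in ((train, "utt_a", "0"), (extra, "utt_b", "1")):
        os.makedirs(d)
        rows = (("wav.scp", f"{utt_id}.wav"), ("text", label), ("utt2spk", "spk1"))
        for name, value in rows:
            with open(os.path.join(d, name), "w", encoding="utf-8") as f:
                f.write(f"{utt_id} {value}\n")
    combine_dirs([train, extra], out)
    combined_text = read_table(os.path.join(out, "text"))
    combined_spk2utt = read_table(os.path.join(out, "spk2utt"))
print(combined_text, combined_spk2utt)
```

Failing on duplicate utterance IDs is deliberate: silently overwriting an entry would leave `wav.scp` and `text` out of sync.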

[ ]:

# NOTE: Exercise 1

📗 Exercise 2 (1 point bonus)

One main issue in speech anti-spoofing research is generalization to unseen attacks, i.e., synthesis methods not seen when training the anti-spoofing models. In fact, the test set in our scenario is exactly such a case. A recently proposed one-class learning method compacts the natural speech representations and separates them from fake speech by a certain margin in the embedding space.

We have implemented the AM-softmax method in /content/espnet/espnet2/asvspoof/loss/ and also prepared a template at /content/espnet/espnet2/asvspoof/loss/ for your implementation. You can follow the TODOs to implement the method (note that the inference/train_config should change accordingly).

If you successfully implement OC-softmax and get a similar or better EER, you can get a bonus point.
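To make the one-class idea concrete, the sketch below computes a one-class softmax style loss in plain Python: real speech (label 0) is pulled toward a single learned direction until its cosine similarity exceeds a high margin, while fake speech (label 1) is pushed below a lower margin. This follows the commonly cited formulation as an assumption; the function names, the margin values `m_real=0.9` and `m_fake=0.2`, and the scale `alpha=20.0` are illustrative, and the template in the repository may be organized differently.

```python
import math


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def oc_softmax_loss(embeddings, labels, center, m_real=0.9, m_fake=0.2, alpha=20.0):
    """Mean one-class loss; label 0 = real speech, label 1 = fake speech."""
    w = l2_normalize(center)
    total = 0.0
    for embedding, label in zip(embeddings, labels):
        cos = sum(a * b for a, b in zip(w, l2_normalize(embedding)))
        if label == 0:
            # Real speech: penalized while its cosine is below m_real.
            total += softplus(alpha * (m_real - cos))
        else:
            # Fake speech: penalized while its cosine is above m_fake.
            total += softplus(alpha * (cos - m_fake))
    return total / len(labels)


center = [1.0, 0.0]
embeddings = [[2.0, 0.0], [-1.0, 0.0]]  # cosine +1 and -1 with the center
good_loss = oc_softmax_loss(embeddings, [0, 1], center)  # real near, fake far
bad_loss = oc_softmax_loss(embeddings, [1, 0], center)   # labels swapped
print(good_loss, bad_loss)
```

At inference time, the cosine similarity to the center can serve directly as the per-utterance score, which is why the inference configuration has to change together with the loss.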

[ ]:

# NOTE: Exercise 2