Use transfer learning for ASR in ESPnet2

Author : Dan Berrebbi (dberrebb@andrew.cmu.edu)

Date : April 11th, 2022

Abstract

In this tutorial, we introduce several options for using pre-trained models/parameters for Automatic Speech Recognition (ASR) in ESPnet2. The available options are:

use a local model that you (or a colleague) have already trained,
use a trained model from the ESPnet repository on HuggingFace.

Note that the pre-trained parameters are loaded at the ASR training step, i.e. at stage 11 of the ESPnet2 ASR recipes.

Why use such (pre-)trained models?

Several projects may involve making use of previously trained models; this is the reason why we developed the ESPnet repository on HuggingFace, for instance. Example use cases are listed below (non-exhaustive):

target a low-resource language: a model trained from scratch may perform badly if trained with only a few hours of data,
study the robustness of a model to shifts (domain shift, language shift, ...),
make use of massively trained multilingual models.
...

ESPnet installation (about 10 minutes in total)

Please use the GPU environment provided by Google Colab to run this notebook.

!git clone --depth 5 https://github.com/espnet/espnet

# It takes 30 seconds
%cd /content/espnet/tools
!./setup_anaconda.sh anaconda espnet 3.9

# It may take ~8 minutes
%cd /content/espnet/tools
!make CUDA_VERSION=10.2

mini_an4 recipe as a transfer learning example

In this example, we use the mini_an4 data, which has only 4 utterances for training. This is of course too small to train an ASR model, but it makes it possible to run all the described transfer learning setups in a Colab environment. Once you have run and understood these setups and instructions, you can apply them to any other ESPnet2 recipe or to a new recipe that you build. First, move to the recipe directory:

%cd /content/espnet/egs2/mini_an4/asr1

Add a configuration file

As the mini_an4 recipe does not contain any configuration file for the ASR model, we add one here.

config = {'accum_grad': 1,
 'batch_size': 1,
 'batch_type': 'folded',
 'best_model_criterion': [['valid', 'acc', 'max']],
 'decoder': 'transformer',
 'decoder_conf': {'dropout_rate': 0.1,
  'input_layer': 'embed',
  'linear_units': 2048,
  'num_blocks': 6},
 'encoder': 'transformer',
 'encoder_conf': {'attention_dropout_rate': 0.0,
  'attention_heads': 4,
  'dropout_rate': 0.1,
  'input_layer': 'conv2d',
  'linear_units': 2048,
  'num_blocks': 12,
  'output_size': 256},
 'grad_clip': 5,
 'init': 'xavier_uniform',
 'keep_nbest_models': 1,
 'max_epoch': 5,
 'model_conf': {'ctc_weight': 0.3,
  'length_normalized_loss': False,
  'lsm_weight': 0.1},
 'optim': 'adam',
 'optim_conf': {'lr': 1.0},
 'patience': 0,
 'scheduler': 'noamlr',
 'scheduler_conf': {'warmup_steps': 1000}}

import yaml
with open("conf/train_asr.yaml","w") as f:
  yaml.dump(config, f)

Data preparation (stage 1 - stage 5)

!./asr.sh --stage 1 --stop_stage 5 --train-set "train_nodev" --valid-set "train_dev" --test_sets "test"

Stage 10: ASR collect stats:

# takes about 10 seconds
!./asr.sh --stage 10 --stop_stage 10 --train-set "train_nodev" --valid-set "train_dev" --test_sets "test" --asr_config "conf/train_asr.yaml"

Stage 11: ASR training (from scratch)

We train our model for only 5 epochs, just to have a pre-trained model.

# takes about 1-2 minutes
!./asr.sh --stage 11 --stop_stage 11 --train-set "train_nodev" --valid-set "train_dev" --test_sets "test" --asr_config "conf/train_asr.yaml" --asr_tag "pre_trained_model"

Stage 11.2 : ASR training over a pre-trained model

We now train a new model on top of the previously trained model. (Since we use the same training data here, this is not very useful in itself, but this toy example can be reproduced with any model.)

Step 1 : make sure your ASR model file is in the proper ESPnet format (it should be if the model was trained with ESPnet). It just needs to be a PyTorch model file with a ".pth" extension (or ".pt", or another extension).

Step 2 : add the parameter --pretrained_model path/to/your/pretrained/model/file.pth to run.sh.

Step 3 : step 2 initializes your new model with the parameters of the pre-trained model, so your new model starts training from a strong initialization. However, if some parts of your new model have different parameter sizes (e.g. the last projection layer may have been modified), loading will fail with an error because of the size mismatch. To prevent this, you can add the parameter --ignore_init_mismatch true to run.sh.

Step 4 (Optional) : if you only want to use some specific parts of the pre-trained model, or exclude specific parts, you can specify this in the --pretrained_model argument by passing the component names with the following syntax : --pretrained_model <file_path>:<src_key>:<dst_key>:<exclude_Keys>. src_key designates the parameters you want to keep from the pre-trained model. dst_key designates the parameters of the new model that you want to initialize with the src_key parameters. exclude_Keys designates the parameters of the pre-trained model that you do not want to use. You can leave the src_key and dst_key fields empty and just fill exclude_Keys with the parameters that you want to drop. For instance, if you want to re-use the encoder parameters but not the decoder ones, the syntax is --pretrained_model <file_path>:::decoder. You can see the expected argument format in more detail here.
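
To make the colon-separated syntax of step 4 more concrete, here is a minimal sketch of how such an argument could be interpreted. It is an illustration only, not ESPnet's actual loader: the functions parse_init_spec and select_params are hypothetical, plain dictionaries stand in for PyTorch state dicts, and we assume that several excluded components would be separated by commas.

```python
def parse_init_spec(spec):
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_Keys>' into its fields.

    Trailing fields may be omitted; empty fields become None
    (or an empty list for the excluded components).
    Note: this simple split does not support file paths containing ':'.
    """
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':'-separated fields in {spec!r}")
    parts += [""] * (4 - len(parts))
    path, src_key, dst_key, excludes = parts
    return {
        "path": path,
        "src_key": src_key or None,
        "dst_key": dst_key or None,
        # Assumption: multiple excluded components are comma-separated.
        "excludes": [key for key in excludes.split(",") if key],
    }


def _under(name, prefix):
    """True if parameter `name` is `prefix` itself or lives below it."""
    return name == prefix or name.startswith(prefix + ".")


def select_params(state, src_key=None, dst_key=None, excludes=()):
    """Pick the pre-trained parameters to copy, renamed for the new model."""
    selected = {}
    for name, value in state.items():
        # Drop every parameter that belongs to an excluded component.
        if any(_under(name, ex) for ex in excludes):
            continue
        if src_key is not None:
            # Keep only the parameters below src_key and strip that prefix.
            if not _under(name, src_key):
                continue
            name = name[len(src_key):].lstrip(".")
        if dst_key is not None:
            # Re-attach the parameters below dst_key in the new model.
            name = f"{dst_key}.{name}" if name else dst_key
        selected[name] = value
    return selected


# Toy "state dict": re-use everything except the decoder.
toy_state = {
    "encoder.embed.weight": 1,
    "decoder.embed.weight": 2,
    "ctc.ctc_lo.weight": 3,
}
spec = parse_init_spec("model.pth:::decoder")
kept = select_params(toy_state, spec["src_key"], spec["dst_key"], spec["excludes"])
print(sorted(kept))  # ['ctc.ctc_lo.weight', 'encoder.embed.weight']
```

With the spec "model.pth:encoder:encoder" the same sketch would keep only the encoder parameters and load them into the encoder of the new model.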

# takes about 1-2 minutes
# the model trained above with --asr_tag "pre_trained_model" is stored in exp/asr_pre_trained_model
!./asr.sh --stage 11 --stop_stage 11 --train-set "train_nodev" --valid-set "train_dev" \
--test_sets "test" --asr_config "conf/train_asr.yaml" --asr_tag "transfer_learning_with_pre_trained_model" \
 --pretrained_model "/content/espnet/egs2/mini_an4/asr1/exp/asr_pre_trained_model/valid.acc.ave.pth"
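
If you are unsure where stage 11 wrote the checkpoint, a quick check is to list the ".pth" files under exp/ before passing one of them to --pretrained_model. The sketch below assumes only that checkpoints end in ".pth" and live somewhere under the given directory; the helper name list_checkpoints is hypothetical.

```python
from pathlib import Path


def list_checkpoints(exp_root="exp", suffix=".pth"):
    """Return all checkpoint files found under exp_root, sorted by path."""
    root = Path(exp_root)
    if not root.is_dir():
        # Nothing has been trained yet (or we are in the wrong directory).
        return []
    return sorted(p for p in root.rglob(f"*{suffix}") if p.is_file())


# Run from the recipe directory (egs2/mini_an4/asr1 in this tutorial).
for checkpoint in list_checkpoints("exp"):
    print(checkpoint)
```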

Stage 11.3 : ASR training over a HuggingFace pre-trained model

We train our new model on top of a pre-trained model from HuggingFace. Any model can be used; here we take a model trained on Bengali as an example. It can be found at https://huggingface.co/espnet/bn_openslr53.

Use a trained model from ESPnet repository on HuggingFace.

The ESPnet repository on HuggingFace contains more than 200 pre-trained models, for a wide variety of languages and datasets, and we are actively expanding this repository with new models every week! This enables any user to perform transfer learning with a wide variety of models without having to re-train them. In order to use our pre-trained models, the first step is to download the ".pth" model file from the HuggingFace page. There are several easy ways to do it: manually downloading the file (e.g. wget https://huggingface.co/espnet/bn_openslr53/resolve/main/exp/asr_train_asr_raw_bpe1000/41epoch.pth), cloning the repository (git clone https://huggingface.co/espnet/bn_openslr53), or downloading it through an ESPnet recipe (described on the models' pages on HuggingFace):

git checkout fa1b865352475b744c37f70440de1cc6b257ba70
pip install -e .
cd egs2/bn_openslr53/asr1
./run.sh --skip_data_prep false --skip_train true --download_model espnet/bn_openslr53

Then, once you have the ".pth" model file, you can follow steps 1 to 4 from the previous section in order to use this pre-trained model.

!wget https://huggingface.co/espnet/bn_openslr53/resolve/main/exp/asr_train_asr_raw_bpe1000/41epoch.pth
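
Note that the direct-download URL uses /resolve/ rather than /blob/: the blob address serves the HTML page describing the file, not the checkpoint itself. As a small illustration, the sketch below builds such a URL from a repository name and a file path. The helper hf_resolve_url is hypothetical and simply mirrors the URL pattern of the wget command above; it does not contact the network.

```python
def hf_resolve_url(repo_id, filename, revision="main"):
    """Build a direct-download URL for a file hosted on the HuggingFace Hub."""
    return (
        f"https://huggingface.co/{repo_id}/resolve/{revision}/"
        f"{filename.lstrip('/')}"
    )


url = hf_resolve_url(
    "espnet/bn_openslr53",
    "exp/asr_train_asr_raw_bpe1000/41epoch.pth",
)
print(url)
```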

The next command will raise an error because of the size mismatch of some parameters, as mentioned before (step 3).

# will fail in about 5 seconds
!./asr.sh --stage 11 --stop_stage 11 --train-set "train_nodev" --valid-set "train_dev" \
--test_sets "test" --asr_config "conf/train_asr.yaml" --asr_tag "transfer_learning_with_pre_trained_model"\
 --pretrained_model "/content/espnet/egs2/mini_an4/asr1/41epoch.pth"

To solve this issue, as mentioned, we can use the --ignore_init_mismatch "true" parameter.

# takes about 1-2 minutes
!./asr.sh --stage 11 --stop_stage 11 --train-set "train_nodev" --valid-set "train_dev" \
--test_sets "test" --asr_config "conf/train_asr.yaml" --asr_tag "transfer_learning_with_pre_trained_model_from_HF"\
 --pretrained_model "/content/espnet/egs2/mini_an4/asr1/41epoch.pth" --ignore_init_mismatch "true"

Additional note about the --ignore_init_mismatch true option : this option is very convenient because in many transfer learning use cases, you will want to use a model trained on a language X (e.g. X=English) for another language Y. Language Y may have a vocabulary (set of tokens) different from that of language X. For instance, if you target Y=Totonac, a Mexican low-resource language, your model may be stronger if you use a different set of BPEs/tokens than the one used to train the English model. In that situation, the last layer (projection to the vocabulary space) of your ASR model needs to be initialized from scratch and may have a different shape than that of the English model. For that reason, you should use the --ignore_init_mismatch true option. It also handles the case where languages X and Y use different scripts.
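
The idea behind this option can be sketched in a few lines: parameters whose shapes differ between the pre-trained model and the new model are skipped, and everything else is copied. This is a simplified illustration, not ESPnet's implementation. The function drop_mismatched is hypothetical, shapes are plain tuples so that the example runs without PyTorch, and the layer names and sizes are made up for the example.

```python
def drop_mismatched(pretrained_shapes, new_shapes):
    """Split pre-trained parameter names into loadable and skipped ones.

    A parameter is loadable when the new model has a parameter with the
    same name and the same shape. Otherwise it is skipped, and the new
    model keeps its fresh initialization for that parameter.
    """
    loadable, skipped = [], []
    for name, shape in pretrained_shapes.items():
        if new_shapes.get(name) == shape:
            loadable.append(name)
        else:
            skipped.append(name)
    return loadable, skipped


# Toy example: the output layer projects to 1000 tokens in the pre-trained
# model but to 30 tokens in the new one (illustrative sizes only).
pretrained = {
    "encoder.layer0.weight": (256, 256),
    "ctc.ctc_lo.weight": (1000, 256),
}
new_model = {
    "encoder.layer0.weight": (256, 256),
    "ctc.ctc_lo.weight": (30, 256),
}
loadable, skipped = drop_mismatched(pretrained, new_model)
print(loadable)  # ['encoder.layer0.weight']
print(skipped)   # ['ctc.ctc_lo.weight']
```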