CMU 11751/18781 Fall 2022: ESPnet Tutorial

ESPnet is a widely used end-to-end speech processing toolkit that supports a variety of speech processing tasks. It uses PyTorch as its main deep learning engine and follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments.

Main references:

  • ESPnet repository

  • ESPnet documentation

  • ESPnet tutorial in Speech Recognition and Understanding (Fall 2021)

  • Recitation in Multilingual NLP (Spring 2022)

Author: Siddhant Arora (siddhana@andrew.cmu.edu)

This notebook was modified from the material made by Yifan Peng (yifanpen@andrew.cmu.edu).

❗Important Notes❗

  • We are using Colab to show the demo. However, Colab has some constraints on the total GPU runtime. If you use too much GPU, you may fail to connect to a GPU backend for some time.

  • There are multiple in-class checkpoints ✅ throughout this tutorial. There will also be some after-class exercises 📗 after the tutorial. Your participation points are based on these tasks. Please try your best to follow all the steps! If you encounter issues, please notify the TAs as soon as possible so that we can make an adjustment for you.

  • Please submit PDF files of your completed notebooks to Gradescope. You can print the notebook using File -> Print in the menu bar.

  • This tutorial covers the basics of ESPnet, which will be the foundation of the next tutorial on Wednesday.

Objectives

After this tutorial, you are expected to know:

  • How to run existing recipes (data prep, training, inference and scoring) in ESPnet2

  • How to change the training and decoding configurations

  • How to create a new recipe from scratch

  • Where to find resources if you encounter an issue

Install ESPnet

Function to print date and time

We first define a function to print the current date and time, which will be used in multiple places below.

[ ]:
def print_date_and_time():
  from datetime import datetime
  import pytz

  now = datetime.now(pytz.timezone("America/New_York"))
  print("=" * 60)
  print(f' Current date and time: {now.strftime("%m/%d/%Y %H:%M:%S")}')
  print("=" * 60)

# example output
print_date_and_time()

Check GPU type

Let’s check the GPU type of this allocated environment.

[ ]:
!nvidia-smi

Download ESPnet

We use git clone to download the source code of ESPnet and then go to a specific commit.

Important: In other versions of ESPnet, you may encounter errors related to incompatible package versions (numba). Please use the same commit to avoid such issues.

[ ]:
# It takes a few seconds
!git clone --depth 5 https://github.com/espnet/espnet

Setup Python environment based on anaconda

There are several other installation methods, but we highly recommend the anaconda-based one.

[ ]:
# It takes 30 seconds
%cd /content/espnet/tools
!./setup_anaconda.sh anaconda espnet 3.9

Install ESPnet (same procedure as your first tutorial)

This step installs PyTorch and other required tools.

We specify CUDA_VERSION=11.6 for PyTorch 1.12.1. We also support many other versions. Please check https://github.com/espnet/espnet/blob/master/tools/installers/install_torch.sh for the detailed version list.

[ ]:
# It may take 12 minutes
%cd /content/espnet/tools
!make TH_VERSION=1.12.1 CUDA_VERSION=11.6

If other listed packages are necessary, install any of them using

. ./activate_python.sh && ./installers/install_xxx.sh

We show two examples, although they are not used in this demo.

[ ]:
# s3prl and fairseq are necessary if you want to use self-supervised pre-trained models
# It takes 50s
%cd /content/espnet/tools

!. ./activate_python.sh && ./installers/install_s3prl.sh
!. ./activate_python.sh && ./installers/install_fairseq.sh    # install fairseq to use the Wav2Vec2 / HuBERT model series

Run an existing recipe

ESPnet has a number of recipes (130 recipes on Sep. 11, 2022). Please refer to https://github.com/espnet/espnet/blob/master/egs2/README.md for a complete list.

Please also check the general usage of the recipe in https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2

CMU AN4 recipe

In this tutorial, we will use the CMU an4 recipe. This is a small-scale speech recognition task mainly used for testing.

First, let’s go to the recipe directory.

[ ]:
%cd /content/espnet/egs2/an4/asr1
!ls
egs2/an4/asr1/
 - conf/      # Configuration files for training, inference, etc.
 - scripts/   # Bash utilities of espnet2
 - pyscripts/ # Python utilities of espnet2
 - steps/     # From Kaldi utilities
 - utils/     # From Kaldi utilities
 - db.sh      # The directory path of each corpus
 - path.sh    # Setup script for environment variables
 - cmd.sh     # Configuration for your backend of job scheduler
 - run.sh     # Entry point
 - asr.sh     # Invoked by run.sh

[SSL] Get the ``dump_ssl_feature.sh`` script and the ``training config`` ready.

  • GitHub: https://github.com/simpleoier/ESPnet_SSL_ASR_tutorial_misc.git

[ ]:
!rm -rf ESPnet_SSL_ASR_tutorial_misc
!git clone https://github.com/simpleoier/ESPnet_SSL_ASR_tutorial_misc.git
!cp ESPnet_SSL_ASR_tutorial_misc/dump_ssl_feature.sh ./local
!cp ESPnet_SSL_ASR_tutorial_misc/dump_feats.py ./local
!cp ESPnet_SSL_ASR_tutorial_misc/feats_loaders.py ./local
!chmod +x local/dump_ssl_feature.sh
!cp ESPnet_SSL_ASR_tutorial_misc/train_asr_demo_branchformer.yaml ./conf

ESPnet is designed for various use cases (local machines or cluster machines) based on Kaldi tools. If you use it on a cluster, please also check https://kaldi-asr.org/doc/queue.html

The main stages can be parallelized by various jobs.

[ ]:
!cat run.sh
!ls conf
!ls local

run.sh calls asr.sh, which completes the entire speech recognition experiment, including data preparation, training, inference, and scoring. These steps are separated into multiple stages (16 in total).

Instead of executing the entire pipeline by run.sh, let’s run it stage-by-stage to understand the process in each stage.

Data preparation

Stage 1: Data preparation: download raw data, split the entire set into train/dev/test, and prepare them in the Kaldi format

Note that --stage <N> is to start from this stage and --stop_stage <N> is to stop after this stage. We also need to specify the train, dev and test sets.

[ ]:
# a few seconds
!./asr.sh --stage 1 --stop_stage 1 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test"
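
The way --stage and --stop_stage select a range can be sketched in a few lines of Python. This is only an illustration of the inclusive range described above, not code from asr.sh; the function name `stages_to_run` is hypothetical, and the default of 16 stages is taken from the count quoted earlier.

```python
from typing import List


def stages_to_run(stage: int, stop_stage: int, last_stage: int = 16) -> List[int]:
    """Return the stage numbers executed for a given --stage / --stop_stage pair."""
    # Both bounds are inclusive: a stage runs if stage <= N <= stop_stage.
    return [n for n in range(1, last_stage + 1) if stage <= n <= stop_stage]


print(stages_to_run(1, 1))    # data preparation only
print(stages_to_run(10, 13))  # a multi-stage run
```

Passing the same number to both options therefore runs exactly one stage, which is how the cells in this section are written.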

After this stage is finished, please check the newly created data directory:

[ ]:
!ls data

In this recipe, we use train_nodev as a training set, train_dev as a validation set (monitor the training progress by checking the validation score). We also use test and train_dev sets for the final speech recognition evaluation.

Let’s check one of the training data directories:

[ ]:
!ls -1 data/train_nodev/

These are the speech and corresponding text and speaker information in the Kaldi format. To understand their meanings, please check https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE#about-kaldi-style-data-directory.

Please also check the official documentation of Kaldi: https://kaldi-asr.org/doc/data_prep.html

spk2utt # Speaker information
text    # Transcription file
utt2spk # Speaker information
wav.scp # Audio file
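
The two speaker files carry the same information in opposite directions. As a minimal sketch, assuming the usual "<utterance_id> <speaker_id>" layout of utt2spk, the following Python derives a spk2utt mapping from it. The utterance and speaker ids are made up for illustration, and the helper name is hypothetical.

```python
from collections import defaultdict

# Hypothetical utt2spk content: one "<utterance_id> <speaker_id>" pair per line.
utt2spk_text = """\
spkA-utt1 spkA
spkA-utt2 spkA
spkB-utt1 spkB
"""


def utt2spk_to_spk2utt(text):
    """Group utterance ids by speaker, sorted for a deterministic result."""
    spk2utt = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        utt, spk = line.split()
        spk2utt[spk].append(utt)
    return {spk: sorted(utts) for spk, utts in sorted(spk2utt.items())}


for spk, utts in utt2spk_to_spk2utt(utt2spk_text).items():
    print(spk, " ".join(utts))
```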

Stage 2: Speed perturbation (one of the data augmentation methods)

We do not use speed perturbation for this demo. But you can turn it on by adding an argument --speed_perturb_factors "0.9 1.0 1.1" to the shell script.

Note that we perform speed perturbation and save the augmented data to disk before training. Another approach is to perform data augmentation during training, such as SpecAug.

[ ]:
!./asr.sh --stage 2 --stop_stage 2 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test"
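
To see what a speed factor does to an utterance, here is a toy sketch in plain Python. It resamples a list of samples by linear interpolation, which is an assumption made for illustration only; the recipe uses its own audio tooling for this step. The function name `speed_perturb` is hypothetical.

```python
def speed_perturb(samples, factor):
    """Resample so the signal plays `factor` times faster (linear interpolation)."""
    new_len = max(1, round(len(samples) / factor))
    out = []
    for i in range(new_len):
        pos = i * factor                       # position in the original signal
        lo = min(int(pos), len(samples) - 1)   # left neighbour, clipped to range
        hi = min(lo + 1, len(samples) - 1)     # right neighbour, clipped to range
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


wave = [float(i) for i in range(100)]  # toy "waveform": a simple ramp
for factor in (0.9, 1.0, 1.1):
    print(factor, len(speed_perturb(wave, factor)))
```

A factor below 1.0 lengthens the utterance and a factor above 1.0 shortens it, so the three factors together roughly triple the amount of training audio.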

Stage 3: Format wav.scp: data/ -> dump/raw

We dump the data in the specified format (flac in this case) so that it can be read efficiently.

# ====== Recreating "wav.scp" ======
# Kaldi-wav.scp, which can describe the file path with unix-pipe, like "cat /some/path |",
# shouldn't be used in training process.
# "format_wav_scp.sh" dumps such pipe-style-wav to real audio file
# and it can also change the audio-format and sampling rate.
# If nothing is need, then format_wav_scp.sh does nothing:
# i.e. the input file format and rate is same as the output.

Note that --nj <N> means the number of CPU jobs. Please set it appropriately by considering your CPU resources and disk access.

[ ]:
# 25 seconds
!./asr.sh --stage 3 --stop_stage 3 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test" --nj 4
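
Parallel jobs work on disjoint slices of the utterance list. The sketch below shows one way to split a list into N nearly equal contiguous chunks; it is an illustration of the idea behind --nj, not the splitting code used by the recipe, and `split_scp` is a hypothetical name.

```python
def split_scp(keys, nj):
    """Split a list of utterance ids into nj nearly equal contiguous chunks."""
    base, extra = divmod(len(keys), nj)
    chunks, start = [], 0
    for j in range(nj):
        size = base + (1 if j < extra else 0)  # the first `extra` jobs get one more
        chunks.append(keys[start:start + size])
        start += size
    return chunks


utts = [f"utt{i:02d}" for i in range(10)]
for j, chunk in enumerate(split_scp(utts, 4), start=1):
    print(f"job {j}: {len(chunk)} utterances")
```

With more jobs than CPU cores or with a slow disk, the jobs compete for the same resources, which is why the note above asks you to choose N with care.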

⭕ [SSL] Stage 3.5: Extract SSL features

We dump the SSL features of the data in the specified format (Kaldi matrix in this case) so that they can be read efficiently.

  • First, we need to prepare the pretrained SSL models. In this colab, we use HuBERT models. We have three choices:

    1. HuBERT through the FairSeq API. Model choices can be found in the fairseq/hubert pretrained models list. Example usage:

           mkdir -p downloads/hubert_pretrained_models
           wget https://dl.fbaipublicfiles.com/hubert/hubert_large_ll60k.pt -O ./downloads/hubert_pretrained_models/hubert_large_ll60k.pt

       Append the following arguments:

           --feature_type hubert --hubert_type fairseq --hubert_url "https://dl.fbaipublicfiles.com/hubert/hubert_large_ll60k.pt" --hubert_dir_path "./downloads/hubert_pretrained_models" --layer 23

    2. HuBERT from ESPnet. Example usage:

           # Download model
           ./asr.sh --skip_data_prep true --skip_train true --skip_eval true --skip_upload true --download_model simpleoier/simpleoier_librispeech_hubert_iter1_train_ssl_torchaudiohubert_base_960h_pretrain_it1_raw --train_set train_nodev --valid_set train_dev --test_sets "train_dev test"

       Append the following arguments:

           --feature_type hubert --hubert_type espnet --hubert_dir_path "/content/espnet/tools/anaconda/envs/espnet/lib/python3.9/site-packages/espnet_model_zoo/models--simpleoier--simpleoier_librispeech_hubert_iter1_train_ssl_torchaudiohubert_base_960h_pretrain_it1_raw/snapshots/4256c702685249202f333348a87c13143985b90b/exp/hubert_iter1_train_ssl_torchaudiohubert_base_960h_pretrain_it1_raw/valid.loss.ave.pth" --layer 12

    3. HuBERT through the S3PRL API. S3PRL also supports many other SSL models. Model choices can be found in s3prl_upstream_names. Append the following arguments:

           --feature_type s3prl --s3prl_upstream_name hubert_large_ll60k --layer 24

  • Second, we extract the hubert features and copy the feats.scp into data dirs.

    # ====== Creating "feats.scp" ======
    # Kaldi-feats.scp, which describe the file path (ark file) and offset,
    

    Note that --nj <N> means the number of CPU / GPU jobs. Please set it appropriately by considering your CPU resources and disk access. local/dump_ssl_feature.sh is the entry script.

    📗 Check the shape of dumped feature [1.0 pt]

    We will finally read the dumped feature and print the shape information to check if it is successful. The expected output is

    fkai-an311-b (155, 1024)
    
[ ]:
# 5 min
# 'dump_ssl_feature.sh' reads wave files from a common dir, so we symbolically link dump/raw/test into dump/raw/org
!ln -s /content/espnet/egs2/an4/asr1/dump/raw/test /content/espnet/egs2/an4/asr1/dump/raw/org
!rm -rf ssl_feats/

# Fairseq HuBERT large example
# !mkdir -p downloads/hubert_pretrained_models
# !wget https://dl.fbaipublicfiles.com/hubert/hubert_large_ll60k.pt -O ./downloads/hubert_pretrained_models/hubert_large_ll60k.pt
# !local/dump_ssl_feature.sh --feat_dir ssl_feats --datadir dump/raw/org --train_set train_nodev --dev_set train_dev --test_sets "test" --use_gpu true --nj 1 --feature_type hubert --hubert_type fairseq --hubert_url "https://dl.fbaipublicfiles.com/hubert/hubert_large_ll60k.pt" --hubert_dir_path "./downloads/hubert_pretrained_models" --layer 23

# S3PRL WavLM large example (currently active)
!local/dump_ssl_feature.sh --feat_dir ssl_feats --datadir dump/raw/org --train_set train_nodev --dev_set train_dev --test_sets "test" --use_gpu true --nj 1 --feature_type s3prl --s3prl_upstream_name wavlm_large --layer 24
# S3PRL HuBERT large example (uncomment to use HuBERT instead)
#!local/dump_ssl_feature.sh --feat_dir ssl_feats --datadir dump/raw/org --train_set train_nodev --dev_set train_dev --test_sets "test" --use_gpu true --nj 1 --feature_type s3prl --s3prl_upstream_name hubert_large_ll60k --layer 24

# copy the feats.scp to data/*
!cp ssl_feats/s3prl/train_nodev/feats.scp data/train_nodev
!cp ssl_feats/s3prl/train_dev/feats.scp data/train_dev
!cp ssl_feats/s3prl/test/feats.scp data/test

# Print the shape of dumped features.
!/content/espnet/tools/anaconda/envs/espnet/bin/python3 -c "import kaldiio; reader=kaldiio.ReadHelper('scp:data/train_nodev/feats.scp'); key, array = next(reader.generator); print(key, array.shape)"
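
Each line of feats.scp pairs an utterance id with an ark file path and a byte offset into that file. As a minimal sketch of that layout, assuming the common "utt_id path:offset" form, the following Python splits one entry into its parts. The path and offset in the example are made up, and `parse_feats_scp_line` is a hypothetical helper, not part of kaldiio.

```python
def parse_feats_scp_line(line):
    """Split 'utt_id path/to/file.ark:offset' into (utt_id, ark_path, offset)."""
    utt_id, location = line.strip().split(maxsplit=1)
    # Split on the last colon so that colons inside the path are preserved.
    ark_path, offset = location.rsplit(":", 1)
    return utt_id, ark_path, int(offset)


# Hypothetical entry; the path and offset are for illustration only.
example = "fkai-an311-b ssl_feats/example/feats.1.ark:13"
print(parse_feats_scp_line(example))
```

Because only the pointer is stored in feats.scp, copying the file into data/* as done above is cheap; the feature matrices themselves stay in the ark files.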

⭕ [SSL] Stage 3: Format feats.scp: data/ -> dump/extracted

Because we want to use the extracted features instead of raw waveforms, we need to run stage 3 again. This time it only constructs a new dump/extracted folder using a few lightweight commands.

👀 From now on, --feats_type "extracted" will be added.

[ ]:
# 25 seconds
!./asr.sh --stage 3 --stop_stage 3 --train_set train_nodev --valid_set train_dev --test_sets "train_dev test" --feats_type "extracted" --nj 4

Stage 4: Remove long/short data: dump/extracted/org -> dump/extracted

Utterances that are too long or too short hurt training efficiency, so they are removed from the training data. For inference and scoring, we still use the full data, which is important for a fair comparison.

[ ]:
!./asr.sh --stage 4 --stop_stage 4 --feats_type "extracted" --train_set train_nodev --valid_set train_dev --test_sets "train_dev test"
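
The filtering idea in this stage can be sketched as a simple length check. The thresholds and lengths below are hypothetical values chosen for illustration, not the defaults of asr.sh, and treating both bounds as inclusive is also an assumption.

```python
def filter_by_length(utt2len, min_len, max_len):
    """Keep only utterances whose length lies within [min_len, max_len]."""
    return {utt: n for utt, n in utt2len.items() if min_len <= n <= max_len}


# Hypothetical utterance lengths (for example, in frames).
lengths = {"utt_short": 3, "utt_ok": 155, "utt_long": 9000}
kept = filter_by_length(lengths, min_len=10, max_len=2000)
print(sorted(kept))
```

Only the training and validation sets are filtered this way; the test sets keep every utterance so that scores remain comparable.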

Stage 5: Generate token_list from dump/extracted/train_nodev/text using BPE.

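This is important for text processing. The stage builds a BPE vocabulary with the sentencepiece toolkit developed by Google. The vocabulary here is very small (the experiment directory used later ends in bpe30), so the units stay close to individual English characters.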

[ ]:
!./asr.sh --stage 5 --stop_stage 5 --feats_type "extracted" --train_set train_nodev --valid_set train_dev --test_sets "train_dev test"
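
The stage itself calls sentencepiece. As a dependency-free illustration of what a token list contains, the sketch below builds a character-level vocabulary from a few transcripts. The special symbols and their positions are an assumption about the usual ESPnet layout, the transcripts are made up, and `build_char_token_list` is a hypothetical helper.

```python
from collections import Counter


def build_char_token_list(transcripts):
    """Build a character-level token list, most frequent symbols first."""
    counts = Counter()
    for text in transcripts:
        for ch in text:
            counts["<space>" if ch == " " else ch] += 1
    # Ties are broken alphabetically so the result is deterministic.
    tokens = sorted(counts, key=lambda t: (-counts[t], t))
    # Assumed layout: blank and unknown symbols first, sos/eos last.
    return ["<blank>", "<unk>"] + tokens + ["<sos/eos>"]


print(build_char_token_list(["YES", "NO", "GO"]))
```

A BPE model with a small vocabulary produces a similar list, except that some entries are short character sequences instead of single characters.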

Language modeling (skipped in this tutorial)

Stages 6–9: Stages related to language modeling.

We skip the language modeling part in the recipe (stages 6 – 9) in this tutorial.

How to change the configs?

Let’s revisit the configs, since this is probably the most important part to improve the performance.

All training options are changed in the config file.

Please check https://espnet.github.io/espnet/espnet2_training_option.html

Let’s first check config files prepared in the an4 recipe

  • LSTM-based E2E ASR /content/espnet/egs2/an4/asr1/conf/train_asr_rnn.yaml

  • Transformer based E2E ASR /content/espnet/egs2/an4/asr1/conf/train_asr_transformer.yaml

You can run

RNN

./asr.sh --stage 10 \
   --feats_type "extracted" \
   --train_set train_nodev \
   --valid_set train_dev \
   --test_sets "train_dev test" \
   --nj 4 \
   --inference_nj 4 \
   --use_lm false \
   --asr_config conf/train_asr_rnn.yaml

Transformer

./asr.sh --stage 10 \
   --feats_type "extracted" \
   --train_set train_nodev \
   --valid_set train_dev \
   --test_sets "train_dev test" \
   --nj 4 \
   --inference_nj 4 \
   --use_lm false \
   --asr_config conf/train_asr_transformer.yaml

You can also find various configs in other recipes under espnet/egs2/*/asr1/conf/, including:

  • Conformer: egs2/librispeech/asr1/conf/tuning/train_asr_conformer10_hop_length160.yaml

  • Branchformer: egs2/librispeech/asr1/conf/tuning/train_asr_branchformer_hop_length160_e18_linear3072.yaml

You can also customize it by passing the command line arguments, e.g.,

./run.sh --stage 10 --asr_args "--model_conf ctc_weight=0.3"
./run.sh --stage 10 --asr_args "--optim_conf lr=0.1"

This approach has the highest priority: arguments passed on the command line override those defined in the config file. This is convenient if you only want to change a few arguments.
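
The priority rule can be pictured as a dictionary merge in which command-line values win. The sketch below is an illustration only and does not reproduce ESPnet's argument parser; `apply_overrides` is a hypothetical helper, and it assumes every overridden value is numeric.

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of config where (section, 'key=value') pairs replace file values."""
    merged = copy.deepcopy(config)  # leave the file config untouched
    for section, assignment in overrides:
        key, value = assignment.split("=", 1)
        merged.setdefault(section, {})[key] = float(value)  # numeric values assumed
    return merged


# Values taken from the examples in this section.
file_config = {"model_conf": {"ctc_weight": 0.3}, "optim_conf": {"lr": 0.0002}}
cli = [("optim_conf", "lr=0.1")]  # mirrors --asr_args "--optim_conf lr=0.1"
print(apply_overrides(file_config, cli))
```

Keys that are not mentioned on the command line keep the value from the config file, so a single override leaves the rest of the experiment unchanged.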

Please refer to https://espnet.github.io/espnet/espnet2_tutorial.html#change-the-configuration-for-training for more details.

📗 Exercise 1

Run training, inference and scoring on AN4 using a new config. Here is an example config using Branchformer (Peng et al., ICML 2022). Because it consumes the extracted SSL features, note three settings:

  1. Frontend is set to null.

  2. A preencoder is added to reduce input dimension.

  3. In the encoder, the subsampling is reduced to 2 (input_layer is conv2d2)

Feature normalization can be set in one of three ways:

  1. Global mean normalization

    • Compute the statistics (mean / var) on the full training set. This is done in stage 10. Both mean and var are used.

    • This is the default in asr.sh, set by the argument --feats_normalize global_mvn.

  2. Utterance mean normalization

    • Compute the statistics (mean / var) on each single utterance. By default, ESPnet only normalizes the mean.

    • This can be specified to asr.sh with --feats_normalize utt_mvn. Any value other than global_mvn has the same effect.

  3. No normalization

    • The features are left unchanged.

    • This can be specified by --feats_normalize null --asr_args "--normalize null"
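
The difference between the first two options can be seen on toy data. The sketch below uses one-dimensional features for brevity, whereas real features are normalized per dimension; the function names are hypothetical and this is not ESPnet's implementation.

```python
def global_mvn(utterances):
    """Normalize every frame with mean/std computed over all utterances."""
    frames = [x for utt in utterances for x in utt]
    mean = sum(frames) / len(frames)
    var = sum((x - mean) ** 2 for x in frames) / len(frames)
    std = var ** 0.5
    return [[(x - mean) / std for x in utt] for utt in utterances]


def utt_mean_norm(utterances):
    """Subtract each utterance's own mean; the variance is left untouched."""
    out = []
    for utt in utterances:
        mean = sum(utt) / len(utt)
        out.append([x - mean for x in utt])
    return out


feats = [[1.0, 3.0], [5.0, 7.0]]  # toy features for two utterances
print(global_mvn(feats))
print(utt_mean_norm(feats))
```

Global statistics preserve level differences between utterances, while per-utterance normalization removes them, which can matter when recording conditions vary.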

Similarly, we create a config file named train_asr_demo_branchformer.yaml and start training.

batch_type: numel
batch_bins: 4000000
accum_grad: 1    # gradient accumulation steps
max_epoch: 40
patience: 10
init: xavier_uniform
best_model_criterion:  # criterion to save best models
-   - valid
    - acc
    - max
keep_nbest_models: 10  # save nbest models and average these checkpoints
use_amp: true    # whether to use automatic mixed precision
num_att_plot: 0  # do not save attention plots to save time in the demo
num_workers: 2   # number of workers in dataloader

frontend: null  # Since extracted features are used, frontend is not used.

preencoder: linear
preencoder_conf:
    input_size: 1024
    output_size: 128

encoder: branchformer
encoder_conf:
    output_size: 256
    use_attn: true
    attention_heads: 4
    attention_layer_type: rel_selfattn
    pos_enc_layer_type: rel_pos
    rel_pos_type: latest
    use_cgmlp: true
    cgmlp_linear_units: 1024
    cgmlp_conv_kernel: 31
    use_linear_after_conv: false
    gate_activation: identity
    merge_method: concat
    cgmlp_weight: 0.5               # used only if merge_method is "fixed_ave"
    attn_branch_drop_rate: 0.0      # used only if merge_method is "learned_ave"
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: conv2d2
    stochastic_depth_rate: 0.0

decoder: transformer
decoder_conf:
    attention_heads: 4
    linear_units: 1024
    num_blocks: 3
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.1
    src_attention_dropout_rate: 0.1

model_conf:
    ctc_weight: 0.3  # joint CTC/attention training
    lsm_weight: 0.1  # label smoothing weight
    length_normalized_loss: false

optim: adam
optim_conf:
    lr: 0.0002
scheduler: warmuplr  # linearly increase, then decay with the inverse square root of the step
scheduler_conf:
    warmup_steps: 200
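
To make the scheduler settings concrete, here is a small sketch of a Noam-style warm-up schedule using the lr and warmup_steps values from the config above. The formula reflects my understanding of the warmuplr scheduler and should be treated as an assumption; `warmup_lr` is a hypothetical function name.

```python
def warmup_lr(step, base_lr=0.0002, warmup_steps=200):
    """Linear warm-up to base_lr, then inverse-square-root decay (assumed formula)."""
    return base_lr * warmup_steps ** 0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1, 100, 200, 800):
    print(step, f"{warmup_lr(step):.6f}")
```

The learning rate peaks at base_lr when step equals warmup_steps, so a small warmup_steps such as 200 suits a short demo run on a small dataset.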

My result is shown below:

## exp/asr_train_asr_demo_branchformer_extracted_bpe30
### WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|130|773|95.9|2.6|1.6|0.0|4.1|16.9|
|decode_asr_asr_model_valid.acc.ave/train_dev|100|591|92.0|5.9|2.0|0.2|8.1|28.0|

### CER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|130|2565|98.1|0.1|1.8|0.1|2.0|16.9|
|decode_asr_asr_model_valid.acc.ave/train_dev|100|1915|95.5|0.7|3.8|0.2|4.7|28.0|

### TER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|130|2695|98.1|0.1|1.7|0.1|1.9|16.9|
|decode_asr_asr_model_valid.acc.ave/train_dev|100|2015|95.7|0.7|3.6|0.1|4.5|28.0|
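
In these tables, Err is the sum of the Sub, Del and Ins columns as a percentage of the reference units (small mismatches come from rounding each column separately). As a minimal sketch of how such counts arise, the following Python aligns a reference and a hypothesis with edit distance. The sentences are made up, and this is not the scoring tool used by the recipe; `word_errors` is a hypothetical helper.

```python
def word_errors(ref, hyp):
    """Return (substitutions, deletions, insertions, reference length)."""
    r, h = ref.split(), hyp.split()
    # cost[i][j] = (total errors, sub, del, ins) for r[:i] against h[:j]
    cost = [[None] * (len(h) + 1) for _ in range(len(r) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(r) + 1):
        cost[i][0] = (i, 0, i, 0)  # delete every reference word
    for j in range(1, len(h) + 1):
        cost[0][j] = (j, 0, 0, j)  # insert every hypothesis word
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            t, s, d, n = cost[i - 1][j - 1]
            diag = (t, s, d, n) if r[i - 1] == h[j - 1] else (t + 1, s + 1, d, n)
            t, s, d, n = cost[i - 1][j]
            delete = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insert = (t + 1, s, d, n + 1)
            cost[i][j] = min(diag, delete, insert)  # fewest total errors wins
    _, sub, dele, ins = cost[-1][-1]
    return sub, dele, ins, len(r)


sub, dele, ins, n_ref = word_errors("ONE TWO THREE FOUR", "ONE THREE FIVE")
print(f"Sub={sub} Del={dele} Ins={ins} Err={100 * (sub + dele + ins) / n_ref:.1f}%")
```

The same procedure applied to characters or tokens instead of words gives the CER and TER tables.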
[ ]:
# ~10 min
# Run multiple stages
!rm -r exp/asr_train_asr_demo_branchformer_extracted_bpe30
!./asr.sh --stage 10 --stop_stage 13 --feats_type "extracted" --feats_normalize utt_mvn --train_set train_nodev --valid_set train_dev --test_sets "train_dev test" --nj 4 --ngpu 1 --use_lm false --gpu_inference true --inference_nj 1 --asr_config conf/train_asr_demo_branchformer.yaml --inference_config conf/decode_asr.yaml
[ ]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

# Launch tensorboard before training
%tensorboard --logdir /content/espnet/egs2/an4/asr1/exp/asr_train_asr_demo_branchformer_extracted_bpe30/tensorboard
[ ]:
# NOTE: Exercise 1 Result 1 (HuBERT)
!scripts/utils/show_asr_result.sh exp
from IPython.display import Image, display
display(Image('exp/asr_train_asr_demo_branchformer_extracted_bpe30/images/acc.png', width=400))

print_date_and_time()
[ ]:
# NOTE: Exercise 1 Result 2 (WavLM)
!scripts/utils/show_asr_result.sh exp
from IPython.display import Image, display
display(Image('exp/asr_train_asr_demo_branchformer_extracted_bpe30/images/acc.png', width=400))

print_date_and_time()
[ ]:
# NOTE: Exercise 1 Result 3 (WavLM utt_mvn)
!scripts/utils/show_asr_result.sh exp
from IPython.display import Image, display
display(Image('exp/asr_train_asr_demo_branchformer_extracted_bpe30/images/acc.png', width=400))

print_date_and_time()

📗 Questions

  1. What is the difference between HuBERT and WavLM? [1 pt]

WavLM is a newer model which uses masked speech denoising to create an embedding applicable to multiple downstream tasks, not just ASR.
  2. Get the ASR performance of one more SSL feature, WavLM, and show the results. [1 pt]

Hint: change the s3prl_upstream_name to wavlm_large at stage 3.5 and run the following stages.

# RESULTS
## Environments
- date: `Sat Feb 25 03:26:54 UTC 2023`
- python version: `3.9.16 (main, Jan 11 2023, 16:05:54)  [GCC 11.2.0]`
- espnet version: `espnet 202301`
- pytorch version: `pytorch 1.12.1`
- Git hash: `15a6dc1501b65211725a4fb514fcf5dd24f7ae95`
  - Commit date: `Thu Feb 23 22:04:23 2023 -0500`

## exp/asr_train_asr_demo_branchformer_extracted_bpe30
### WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|130|773|63.5|13.6|22.9|2.2|38.7|79.2|
|decode_asr_asr_model_valid.acc.ave/train_dev|100|591|59.6|18.1|22.3|2.4|42.8|82.0|

### CER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|130|2565|80.6|2.6|16.8|1.4|20.8|79.2|
|decode_asr_asr_model_valid.acc.ave/train_dev|100|1915|78.0|4.6|17.4|0.8|22.8|82.0|

### TER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|130|2695|81.6|2.4|16.0|1.3|19.8|79.2|
|decode_asr_asr_model_valid.acc.ave/train_dev|100|2015|79.1|4.4|16.6|0.7|21.7|82.0|

============================================================
 Current date and time: 02/24/2023 22:26:55
============================================================
  3. Compare the performance of HuBERT, WavLM and MFCC features. Which is better, and by how much? Explain in one sentence why you think it is better. [1 pt]

It seems like HuBERT performed slightly better than WavLM, probably because HuBERT is more specifically focused on ASR.
  4. Explore the normalization options mentioned in Stage 10 for either the HuBERT or the WavLM feature. Report the performance. [1 pt]

    Hint: you may change the number of epochs to get better performance.

# RESULTS
## Environments
- date: `Sat Feb 25 04:31:27 UTC 2023`
- python version: `3.9.16 (main, Jan 11 2023, 16:05:54)  [GCC 11.2.0]`
- espnet version: `espnet 202301`
- pytorch version: `pytorch 1.12.1`
- Git hash: `15a6dc1501b65211725a4fb514fcf5dd24f7ae95`
  - Commit date: `Thu Feb 23 22:04:23 2023 -0500`

## exp/asr_train_asr_demo_branchformer_extracted_bpe30
### WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|130|773|63.5|13.6|22.9|2.2|38.7|79.2|
|decode_asr_asr_model_valid.acc.ave/train_dev|100|591|59.6|18.1|22.3|2.4|42.8|82.0|

### CER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|130|2565|80.6|2.6|16.8|1.4|20.8|79.2|
|decode_asr_asr_model_valid.acc.ave/train_dev|100|1915|78.0|4.6|17.4|0.8|22.8|82.0|

### TER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|130|2695|81.6|2.4|16.0|1.3|19.8|79.2|
|decode_asr_asr_model_valid.acc.ave/train_dev|100|2015|79.1|4.4|16.6|0.7|21.7|82.0|

============================================================
 Current date and time: 02/24/2023 23:31:28
============================================================

Contribute to ESPnet

Please follow https://github.com/espnet/espnet/blob/master/CONTRIBUTING.md to upload your pre-trained model to Hugging Face and make a pull request in the ESPnet repository.