This is a template of the cls1 recipe for ESPnet2.
Recipe flow
CLS recipe consists of 10 stages.
1. Database-dependent data preparation
Data preparation stage. It calls local/
to create Kaldi-style data directories for the training, validation, and evaluation sets.
2. Wav dump preparation
This recipe supports --feats_type raw
option. This means we will run a wav dumping stage which reformats wav.scp
in data directories. This process standardizes all data to common sampling rate and data format.
3. Filtering
Filtering stage. This stage removes overly long and overly short utterances from the training and validation sets. You can change the threshold values via --min_wav_duration and --max_wav_duration.
Utterances with empty text are also removed. If an audio sample has no label in the multi-label setting, use the <blank> symbol. TODO(shikhar): This feature will be supported in a later PR.
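The duration-based filtering described above can be pictured with a minimal Python sketch. The function name `filter_by_duration`, the in-memory dictionary of durations, the inclusive bounds, and the threshold values are all assumptions made for illustration; the recipe performs this step inside its own scripts, and its defaults may differ.

```python
def filter_by_duration(durations, min_wav_duration, max_wav_duration):
    """Keep utterances whose duration in seconds lies within [min, max]."""
    return {
        utt_id: dur
        for utt_id, dur in durations.items()
        if min_wav_duration <= dur <= max_wav_duration
    }


# Hypothetical durations in seconds, keyed by utterance id.
durations = {"uttA": 0.05, "uttB": 9.8, "uttC": 45.0}
kept = filter_by_duration(durations, min_wav_duration=0.1, max_wav_duration=30.0)
print(sorted(kept))
```

In this toy input, uttA is too short and uttC is too long, so only uttB survives.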
4. Token list generation
Token list generation stage. It generates a token list (dictionary) from the text file. Only the --token_type=word option is supported, which means that each unique space-separated word in the text file becomes a class/label for classification. Please note that this process is case-sensitive.
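As a rough illustration of word-level token list generation, the sketch below collects every unique space-separated label from Kaldi-style text lines. The function name `build_token_list` and the sorted output order are assumptions; the token list produced by the recipe may be ordered differently and may contain additional special symbols.

```python
def build_token_list(text_lines):
    """Collect unique space-separated labels from Kaldi-style text lines."""
    tokens = set()
    for line in text_lines:
        parts = line.split()
        # parts[0] is the utterance id; the remaining fields are class labels.
        tokens.update(parts[1:])
    return sorted(tokens)


# Hypothetical text file content: "<uttid> <label> [<label> ...]".
lines = ["uttA Music", "uttB Speech music", "uttC Speech"]
token_list = build_token_list(lines)
print(token_list)
```

Because matching is case-sensitive, `Music` and `music` end up as two distinct classes in this example.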
NOTE: Data preparation ends at stage 4. You can skip data preparation (stages 1 to 4) via --skip_data_prep.
5. CLS statistics collection
Statistics calculation stage. It collects the shape information of the input and output and calculates statistics for feature normalization (mean and variance over training and validation sets).
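The normalization statistics can be understood as accumulated sums over all utterances. The sketch below is a simplified, scalar version written only for illustration: the function name `collect_stats` and the flat list-of-samples input are assumptions, and the actual stage also records shape information and works on the recipe's real features.

```python
def collect_stats(utterances):
    """Accumulate count, sum, and sum of squares, then derive mean and variance."""
    count = 0
    total = 0.0
    total_sq = 0.0
    for samples in utterances:
        count += len(samples)
        total += sum(samples)
        total_sq += sum(x * x for x in samples)
    mean = total / count
    # Population variance: E[x^2] - (E[x])^2.
    variance = total_sq / count - mean * mean
    return mean, variance


# Two hypothetical utterances with two samples each.
mean, variance = collect_stats([[1.0, 2.0], [3.0, 4.0]])
print(mean, variance)
```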
6. CLS training
Classification model training stage. You can change the training settings via --train_config and --cls_args.
The training process ends at stage 6. You can skip training (stages 5 to 6) via --skip_train.
7. CLS inference
Classification model decoding stage. This stage outputs two files: text and score.
Example text file
as20k-eval-0 Music
For multi-class classification each row will have exactly one class. For multi-label classification each row can have any number of labels (zero or more). The above example is a multi-label output text file.
Example score file
as20k-eval-0 0.5590277314186096 0.451458394527435 ...
as20k-eval-1 0.00023992260685190558 0.00012396479723975062 ...
Each row, for both multi-class and multi-label classification models, contains probabilities for all tokens (in the same order as they appear in the token_list).
We use a threshold of 0.5 for multi-label classification and argmax for multi-class classification. You can choose to output probabilities only for the predicted class/labels in the score file with output_all_probabilities=false.
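The two decision rules can be sketched as follows. The function names, the three-entry token list, and the choice of `>=` at the threshold boundary are assumptions for illustration; the recipe's inference code may handle the boundary case differently.

```python
def predict_multi_class(probs, token_list):
    """Multi-class: pick the single token with the highest probability (argmax)."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return [token_list[best]]


def predict_multi_label(probs, token_list, threshold=0.5):
    """Multi-label: keep every token whose probability reaches the threshold."""
    return [tok for tok, p in zip(token_list, probs) if p >= threshold]


# Hypothetical token order; probabilities follow the order of the token list.
token_list = ["Music", "Speech", "Dog"]
probs = [0.559, 0.451, 0.0002]
print(predict_multi_class(probs, token_list))
print(predict_multi_label(probs, token_list))
```

With these numbers both rules select `Music`. A multi-label row can also come out empty when no probability reaches the threshold.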
8. Scoring
Evaluation stage. It produces mAP and accuracy metrics.
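For readers unfamiliar with the two metrics, here is a plain-Python sketch of accuracy and mean average precision (mAP). It is not the recipe's scoring implementation: the function names and the toy inputs are assumptions, and the actual scoring script may compute average precision with a different interpolation or tie-handling rule.

```python
def accuracy(predicted, reference):
    """Fraction of utterances whose predicted class equals the reference class."""
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


def average_precision(scores, labels):
    """AP for one class: mean precision at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_rows, label_rows):
    """mAP: average of the per-class AP values (one column per class)."""
    num_classes = len(score_rows[0])
    aps = [
        average_precision(
            [row[c] for row in score_rows], [row[c] for row in label_rows]
        )
        for c in range(num_classes)
    ]
    return sum(aps) / num_classes


# Hypothetical scores and 0/1 reference labels: 3 utterances, 2 classes.
score_rows = [[0.9, 0.2], [0.6, 0.7], [0.1, 0.8]]
label_rows = [[1, 0], [0, 1], [1, 1]]
print(mean_average_precision(score_rows, label_rows))
print(accuracy(["Music", "Speech"], ["Music", "Dog"]))
```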
9. Model packing
Packing stage. It packs the trained model files. Set skip_upload
to False
10. Model upload
Upload stage. It uploads the trained model files. Provide hf_repo
and set skip_upload
to False
How to run
TODO(shikhar): Change this to a recipe which downloads data (perhaps beans) later.
Here, we show the procedure to run the recipe using egs2/as20k/cls1
Move to the recipe directory.
$ cd egs2/as20k/cls1
variable in
to specify location where you have the AudioSet dataset.
$ vim
and conf/*.conf
if you want to use a job scheduler. See the details in using job scheduling system.
$ vim
, which conducts all of the stages explained above.
$ ./
For the first time, we recommend performing each stage step-by-step via --stage
and --stop_stage
$ ./ --stage 1 --stop_stage 1
$ ./ --stage 2 --stop_stage 2
$ ./ --stage 7 --stop_stage 7
This might help you understand each stage's processing and directory structure.
Here we show the example command to calculate classification metrics:
cd egs2/<recipe_name>/cls1
. ./
python3 pyscripts/utils/ \
-gtxt data/text \
-ptxt exp/cls_<split>/text \
-pscore exp/cls_<split>/score \
-tok data/token_list
About data directory
The training, development, and evaluation set directories all have the same structure. See also about Kaldi data structure.
Directory structure
data/
├── train/        # Training set directory
│   ├── text      # The transcription
│   ├── wav.scp   # Wave file path
│   ├── utt2spk   # A file mapping utterance-id to speaker-id
│   ├── spk2utt   # A file mapping speaker-id to utterance-id
│
├── dev/
│   ...
├── eval/
│   ...
└── token_list    # token list file
...
format
uttidA <class_a>
uttidB <class_b1> <class_b2>
...
Note that for multi-class classification each uttid should be associated with exactly one class. For multi-label classification, each uttid should have at least one label. (TODO) We will support the case with no label in the future with the
format
uttidA /path/to/uttidA.wav
uttidB /path/to/uttidB.wav
...
format
uttidA speakerA
uttidB speakerB
uttidC speakerA
uttidD speakerB
...
format
speakerA uttidA uttidC ...
speakerB uttidB uttidD ...
...
Note that the spk2utt file can be generated from utt2spk, and utt2spk can be generated from spk2utt, so it's enough to create either one of them.
utils/ data/train/utt2spk > data/train/spk2utt
utils/ data/train/spk2utt > data/train/utt2spk
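To make the relationship between the two files concrete, the following sketch performs the utt2spk to spk2utt conversion in Python. It only illustrates the mapping and is not a replacement for the Kaldi utility scripts; the function name `utt2spk_to_spk2utt` and the sorted speaker order are assumptions.

```python
from collections import defaultdict


def utt2spk_to_spk2utt(utt2spk_lines):
    """Group utterance ids by speaker id, one output line per speaker."""
    spk2utt = defaultdict(list)
    for line in utt2spk_lines:
        utt_id, spk_id = line.split()
        spk2utt[spk_id].append(utt_id)
    return [f"{spk} {' '.join(utts)}" for spk, utts in sorted(spk2utt.items())]


# Same toy entries as in the format examples above.
utt2spk = [
    "uttidA speakerA",
    "uttidB speakerB",
    "uttidC speakerA",
    "uttidD speakerB",
]
for row in utt2spk_to_spk2utt(utt2spk):
    print(row)
```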
If your corpus doesn't include speaker information, either use the utterance id as the speaker id or give the same speaker id to all utterances, so that the directory format is satisfied (the cls1 recipe does not currently use speaker information).
uttidA uttidA
uttidB uttidB
...

uttidA dummy
uttidB dummy
...
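Both dummy-speaker conventions above can be generated from the utterance ids in wav.scp. The helper below is a hypothetical sketch: the name `make_utt2spk` and its `speaker_id` parameter are not part of the recipe.

```python
def make_utt2spk(wav_scp_lines, speaker_id=None):
    """Build utt2spk rows; use the utterance id as speaker unless one is given."""
    rows = []
    for line in wav_scp_lines:
        # The first whitespace-separated field of a wav.scp line is the utterance id.
        utt_id = line.split(maxsplit=1)[0]
        speaker = speaker_id if speaker_id is not None else utt_id
        rows.append(f"{utt_id} {speaker}")
    return rows


wav_scp = ["uttidA /path/to/uttidA.wav", "uttidB /path/to/uttidB.wav"]
print(make_utt2spk(wav_scp))                      # speaker id equals utterance id
print(make_utt2spk(wav_scp, speaker_id="dummy"))  # one shared speaker id
```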
Once you complete creating the data directory, it's good to check it by utils/
utils/ --no-feats data/train
utils/ --no-feats data/dev
utils/ --no-feats data/test
Problems you might encounter
Below are some common errors to watch out for:
1. Torcheval not found
- Run
pip install torcheval
Supported Models
TODO(shikhar): Add details about BEATs once it is trained.