# Language Identification

This is a template of the `lid1` recipe for ESPnet2. It follows a classification-based training/inference pipeline for spoken language identification. The model is trained as a closed-set classifier over a predefined set of language labels. Optionally, language embeddings can be extracted and used for downstream analysis, e.g., t-SNE visualization.

## Recipe flow

The `lid1` recipe consists of 10 stages.

### 1. Data preparation

Prepares Kaldi-style data directories using `local/data.sh`. Expected files include:

- `wav.scp`: path to raw audio
- `utt2lang`: utterance-to-language mapping
- `lang2utt`: language-to-utterance mapping (for sampling)
- `segments` (optional): used to extract segments from long recordings

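The entries below sketch the expected layout of these files; the utterance, recording, and language IDs are made up for illustration, and `lang2utt` is assumed to follow the usual Kaldi `spk2utt`-style one-to-many layout:

```sh
# data/train/wav.scp: <utt_id> <audio path>
# (with a segments file, the key is a recording ID instead)
utt0001_eng /path/to/audio/utt0001.wav

# data/train/utt2lang: <utt_id> <lang_id>
utt0001_eng eng

# data/train/lang2utt: <lang_id> <utt_id> [<utt_id> ...]
eng utt0001_eng utt0002_eng

# data/train/segments: <utt_id> <recording_id> <start> <end> (seconds)
utt0001_eng rec0001 0.00 4.50
```
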
### 2. Speed perturbation (Optional)

Applies offline speed perturbation to the training set using multiple speed factors, e.g., `0.9 1.0 1.1`.

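If this template follows the other ESPnet2 recipes, the factors can be passed from the command line; the option name below is an assumption carried over from the shared templates rather than confirmed for `lid1`:

```sh
# Assumed option name, following other ESPnet2 recipe templates.
./run.sh --speed_perturb_factors "0.9 1.0 1.1"
```
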
### 3. Wav format

Formats the audio to a consistent format (`wav`, `flac`, or Kaldi-ark) and copies the necessary metadata to the working directory. This stage is required for both the training and evaluation sets.

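The target format is typically selectable from the command line; `--audio_format` is likewise an assumption from the shared ESPnet2 templates:

```sh
# Assumed option name; flac trades decoding time for smaller storage.
./run.sh --audio_format flac
```
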
### 4. Statistics collection

Collects input feature shape statistics and language information needed for batching and model configuration.

### 5. LID training

Trains the language identification model using the configuration provided via `--lid_config` and optional arguments in `--lid_args`. The model is trained to predict the correct language ID for each utterance.

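As a sketch, training alone can be re-run with an alternative configuration. `--lid_config` and `--lid_args` are named above; `--stage`/`--stop_stage` are the standard ESPnet2 stage controls; the trainer argument is only illustrative:

```sh
# Run only the training stage (stage 5 in the flow above).
./run.sh --stage 5 --stop_stage 5 \
    --lid_config conf/mms_ecapa_bs3min_baseline.yaml \
    --lid_args "--max_epoch 30"
```
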
### 6. Inference and embedding extraction

Performs inference on the evaluation sets. This stage supports both:

- LID prediction (a predicted `utt2lang`)
- Language embedding extraction (utterance-level or averaged per language)

Intermediate outputs can optionally be saved.

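To regenerate predictions and embeddings without retraining, this stage can be run on its own (assuming the stage numbering above):

```sh
# Re-run inference and embedding extraction only.
./run.sh --stage 6 --stop_stage 6
```
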
### 7. Score calculation

Computes standard classification metrics by comparing model predictions with the reference `utt2lang`: accuracy (the fraction of correctly labeled utterances) and macro accuracy (the average of per-language accuracies, which weights every language equally regardless of how many utterances it has).

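A minimal sketch of the two metrics, assuming hypothetical reference and predicted files with one `<utt_id> <lang_id>` pair per line:

```sh
# ref_utt2lang / hyp_utt2lang are hypothetical file names.
sort ref_utt2lang > ref.sorted
sort hyp_utt2lang > hyp.sorted
join ref.sorted hyp.sorted | awk '
  { total++; n[$2]++                        # utterances per reference language
    if ($2 == $3) { correct++; c[$2]++ } }  # correct predictions per language
  END {
    printf "Accuracy:       %.4f\n", correct / total
    for (l in n) { sum += c[l] / n[l]; langs++ }
    printf "Macro Accuracy: %.4f\n", sum / langs }'
```
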
### 8. t-SNE visualization

Visualizes the per-language embeddings using t-SNE.

### 9-10. (Optional) Pack and upload results

Packs the trained model and metadata into a zip file and optionally uploads it to the Hugging Face Hub for sharing and reproducibility.

## How to run

### Example: VoxLingua107 training

Move to the recipe directory:

```sh
cd egs2/voxlingua107/lid1
```

Edit the following files:

```sh
vim db.sh   # set the path to the VoxLingua107 dataset
vim cmd.sh  # job scheduling command if using a cluster
vim conf/mms_ecapa_bs3min_baseline.yaml  # model and training configuration (the default)
```

Then run the full pipeline:

```sh
./run.sh
```

This will go through all the stages from data preparation to scoring.

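Individual stages can also be run in isolation, which is convenient when resuming after an interruption; `--stage` and `--stop_stage` are the standard ESPnet2 stage controls:

```sh
# Resume from statistics collection and stop after scoring,
# following the stage numbering in the recipe flow above.
./run.sh --stage 4 --stop_stage 7
```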