Language Identification
This is the template for the lid1 recipe in ESPnet2. It follows a classification-based training/inference pipeline for spoken language identification: the model is trained as a closed-set classifier over a predefined set of language labels. Optionally, language embeddings can be extracted and used for downstream analysis, e.g., t-SNE visualization.
Recipe flow
The lid1 recipe consists of 10 stages.
- Data preparation
Prepares Kaldi-style data directories using local/data.sh.
Expected files include:
- `wav.scp`: path to raw audio
- `utt2lang`: utterance-to-language mapping
- `lang2utt`: language-to-utterance mapping (for sampling)
- `segments` (optional): used to extract segments from long recordings
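For concreteness, a minimal sketch of these files; the utterance IDs, paths, and language codes below are hypothetical:

```
# wav.scp
utt0001 /path/to/audio/utt0001.wav
utt0002 /path/to/audio/utt0002.wav

# utt2lang
utt0001 eng
utt0002 fra

# lang2utt (one line per language, listing its utterances)
eng utt0001
fra utt0002

# segments (segment-id recording-id start-time end-time, in seconds)
utt0001-seg1 utt0001 0.00 4.50
```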
- Speed perturbation (Optional)
Applies offline speed perturbation to the training set using multiple speed factors, e.g., 0.9 1.0 1.1.
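If lid1 follows the other ESPnet2 templates, the factors are passed as an option to run.sh; a hedged sketch, assuming the `--speed_perturb_factors` option from those templates:

```sh
# 3-way speed perturbation of the training set
# (--speed_perturb_factors is assumed from other ESPnet2 recipes)
./run.sh --speed_perturb_factors "0.9 1.0 1.1"
```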
- Wav format
Formats the audio to a consistent format (wav, flac, or Kaldi-ark) and copies necessary metadata to the working directory. Required for both training and evaluation sets.
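The target format is likewise expected to be selectable from run.sh; for example, assuming the `--audio_format` option used by other ESPnet2 templates:

```sh
# Store formatted audio as flac instead of wav
# (--audio_format is an assumption carried over from other recipes)
./run.sh --audio_format flac
```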
- Statistics collection
Collects input feature shape statistics and language information needed for batching and model configuration.
- LID training
Trains the language identification model using the configuration provided via --lid_config and optional arguments in --lid_args. The model is trained to predict the correct language ID for each utterance.
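For example, to run only this stage with an explicit configuration (the stage number follows the list above; the `--max_epoch` override inside `--lid_args` is illustrative):

```sh
./run.sh --stage 5 --stop_stage 5 \
    --lid_config conf/mms_ecapa_bs3min_baseline.yaml \
    --lid_args "--max_epoch 30"
```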
- Inference and embedding extraction
Performs inference on evaluation sets. This stage supports both:
- LID prediction (a predicted utt2lang)
- Language embedding extraction (utterance-level or averaged per language)
- Optionally saving intermediate outputs
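Assuming the stage numbering above, inference alone can be rerun with the standard ESPnet2 stage options:

```sh
# Decode the evaluation sets and extract language embeddings
./run.sh --stage 6 --stop_stage 6
```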
- Score calculation
Computes standard classification metrics (Accuracy, Macro Accuracy) by comparing model predictions with reference utt2lang.
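The two metrics weight utterances differently: plain accuracy is computed over all test utterances, whereas macro accuracy (understood here as the average of per-language accuracies) counts every language equally regardless of how many test utterances it has:

$$\text{Macro Accuracy} = \frac{1}{L} \sum_{l=1}^{L} \frac{\#\text{correct}_l}{\#\text{utterances}_l}$$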
- t-SNE visualization
Visualizes the per-language embeddings using t-SNE.
- Pack and upload results (Optional)
Packs the trained model and metadata into a zip file and optionally uploads it to Hugging Face Hub for sharing and reproducibility.
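These two stages can be run on their own once training has finished; the exact Hugging Face upload options are defined in run.sh, so only the generic stage selection is shown here:

```sh
# Pack the trained model, then (optionally) upload it;
# check run.sh for the upload-related flags before running.
./run.sh --stage 9 --stop_stage 10
```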
How to run
Example: VoxLingua107 training
Move to the recipe directory:
```sh
cd egs2/voxlingua107/lid1
```

Edit the following files:

```sh
vim db.sh                                # set path to VoxLingua107 dataset
vim cmd.sh                               # job scheduling command if using a cluster
vim conf/mms_ecapa_bs3min_baseline.yaml  # model and training configuration (default training configuration)
```

Then run the full pipeline:

```sh
./run.sh
```

This will go through all the stages from data preparation to scoring.
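Individual stages can also be selected with the standard ESPnet2 options; for example, to restart from training onward through scoring on a single GPU (`--stage`, `--stop_stage`, and `--ngpu` are common to ESPnet2 recipe scripts):

```sh
./run.sh --stage 5 --stop_stage 7 --ngpu 1
```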
