espnet2.train.lid_trainer.LIDTrainer
class espnet2.train.lid_trainer.LIDTrainer
Bases: Trainer
Trainer designed for language identification (LID), adapted from spk_trainer.py.
classmethod extract_embed_lid(model: Module, iterator: Iterable[Dict[str, Tensor]], reporter: SubReporter, options: TrainerOptions, distributed_option: DistributedOption, output_dir: str, custom_bs: int, idx2lang: Dict[int, str], extract_embd: bool = False, save_every: int = 1000, resume: bool = True, lang_to_embds_dic: Dict[str, List[ndarray]] = None, save_embd_per_utt: bool = False, max_num_utt_per_lang: int | None = None, lang_counter_dic: Dict[str, int] | None = None) → None
Extract LIDs and language embeddings for each utterance in the dataset.
By default, this method performs language identification (LID) for each utterance. If extract_embd=True, it also extracts normalized language embeddings.
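A minimal invocation sketch is shown below. The parameter names follow the signature above; the argument objects (model, iterator, reporter, options, distributed option) are assumed to be constructed by the surrounding ESPnet inference script, and the concrete values (output directory, batch size, language map) are hypothetical.

```python
# Hypothetical invocation sketch: all argument objects are assumed to be
# built by the usual ESPnet inference setup (e.g., in bin/lid_inference.py).
LIDTrainer.extract_embed_lid(
    model=model,                       # trained LID model (torch.nn.Module)
    iterator=test_iterator,            # yields {name: tensor} batches
    reporter=sub_reporter,             # espnet2 SubReporter
    options=trainer_options,           # TrainerOptions
    distributed_option=distributed_option,
    output_dir="exp/lid_inference",    # hypothetical output path
    custom_bs=8,
    idx2lang={0: "eng", 1: "deu"},     # hypothetical index-to-language map
    extract_embd=True,                 # also collect normalized embeddings
    lang_to_embds_dic={},              # filled in place, one list per language
)
```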
Saved results:
- lang_id_dic: {utt_id: predicted_lang}, mapping from utterance ID to predicted language ID.
- lang_embd_dic (optional): {utt_id: lang_embd}, temporary in-memory storage of per-utterance language embeddings. Saved to disk every save_every utterances if save_embd_per_utt=True.
- lang_to_embds_dic (optional): {lang: [embd_utt1, embd_utt2, …]}, mapping from language ID to a list of embeddings from all utterances predicted or labeled with that language. This is not written to disk by this function, but is used downstream (e.g., in bin/lid_inference.py) for computing language-level average embeddings or generating t-SNE visualizations (see the sketch after this list).
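As an illustration of the downstream use of lang_to_embds_dic, the sketch below averages the per-utterance embeddings into one embedding per language and projects the averages with t-SNE. It assumes scikit-learn is available; the variable names are hypothetical and this is not the actual code in bin/lid_inference.py.

```python
import numpy as np
from sklearn.manifold import TSNE

# lang_to_embds_dic is the {lang: [embd_utt1, embd_utt2, ...]} mapping
# filled by extract_embed_lid when extract_embd=True.
lang_avg_embds = {
    lang: np.mean(np.stack(embds), axis=0)
    for lang, embds in lang_to_embds_dic.items()
    if embds  # skip languages with no collected utterances
}

# Project the per-language averages to 2-D for a t-SNE plot.
# perplexity must stay below the number of points (languages).
langs = sorted(lang_avg_embds)
embd_matrix = np.stack([lang_avg_embds[lang] for lang in langs])
points = TSNE(n_components=2, perplexity=min(30, len(langs) - 1)).fit_transform(embd_matrix)
```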
Notes:
- All extracted embeddings are L2-normalized.
- The function supports distributed inference using torch.distributed.
- Supports resume functionality by skipping already processed utterances based on existing output files.
- Limits the number of utterances per language if max_num_utt_per_lang is specified.
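The notes above translate roughly into the following per-utterance bookkeeping. This is a simplified sketch of the L2 normalization and the per-language cap, under the assumption of a hypothetical helper; it is not the function's actual code.

```python
import numpy as np

def keep_embedding(lang, embd, lang_to_embds_dic, lang_counter_dic,
                   max_num_utt_per_lang=None):
    """Hypothetical helper: L2-normalize one utterance embedding and
    store it, honoring the optional per-language utterance cap."""
    count = lang_counter_dic.get(lang, 0)
    if max_num_utt_per_lang is not None and count >= max_num_utt_per_lang:
        return  # this language already has enough utterances
    # All extracted embeddings are L2-normalized (unit Euclidean norm).
    embd = embd / np.linalg.norm(embd)
    lang_to_embds_dic.setdefault(lang, []).append(embd)
    lang_counter_dic[lang] = count + 1
```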