espnet2.train.lid_trainer.LIDTrainer
class espnet2.train.lid_trainer.LIDTrainer
Bases: Trainer
Trainer designed for language identification (LID), adapted from spk_trainer.py.
classmethod extract_embed_lid(model: Module, iterator: Iterable[Dict[str, Tensor]], reporter: SubReporter, options: TrainerOptions, distributed_option: DistributedOption, output_dir: str, custom_bs: int, idx2lang: Dict[int, str], extract_embd: bool = False, save_every: int = 1000, resume: bool = True, lang_to_embds_dic: Dict[str, List[ndarray]] = None, save_embd_per_utt: bool = False, max_num_utt_per_lang: int | None = None, lang_counter_dic: Dict[str, int] | None = None) → None
Extract LIDs and language embeddings for each utterance in the dataset.
By default, this method performs language identification (LID) for each utterance. If extract_embd=True, it also extracts normalized language embeddings.
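A minimal invocation sketch is shown below. The parameter names follow the signature above; the argument objects (model, iterator, reporter, options, distributed option) are assumed to be constructed by the surrounding ESPnet inference script, and the concrete values (output directory, batch size, language map) are hypothetical.

```python
# Hypothetical invocation sketch: all argument objects are assumed to be
# built by the usual ESPnet inference setup (e.g., in bin/lid_inference.py).
LIDTrainer.extract_embed_lid(
    model=model,                       # trained LID model (torch.nn.Module)
    iterator=test_iterator,            # yields {name: tensor} batches
    reporter=sub_reporter,             # espnet2 SubReporter
    options=trainer_options,           # TrainerOptions
    distributed_option=distributed_option,
    output_dir="exp/lid_inference",    # hypothetical output path
    custom_bs=8,
    idx2lang={0: "eng", 1: "deu"},     # hypothetical index-to-language map
    extract_embd=True,                 # also collect normalized embeddings
    lang_to_embds_dic={},              # filled in place, one list per language
)
```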
Saved results:
- lang_id_dic: {utt_id: predicted_lang}, mapping from utterance ID to predicted language ID.
- lang_embd_dic (optional): {utt_id: lang_embd}, temporary in-memory storage of per-utterance language embeddings. Saved to disk every save_every utterances if save_embd_per_utt=True.
- lang_to_embds_dic (optional): {lang: [embd_utt1, embd_utt2, …]}, mapping from language ID to a list of embeddings from all utterances predicted or labeled with that language. This is not written to disk by this function, but is used downstream (e.g., in bin/lid_inference.py) for computing language-level average embeddings or generating t-SNE visualizations (see the sketch after this list).
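As an illustration of the downstream use of lang_to_embds_dic, the sketch below averages the per-utterance embeddings into one embedding per language and projects the averages with t-SNE. It assumes scikit-learn is available; the variable names are hypothetical and this is not the actual code in bin/lid_inference.py.

```python
import numpy as np
from sklearn.manifold import TSNE

# lang_to_embds_dic is the {lang: [embd_utt1, embd_utt2, ...]} mapping
# filled by extract_embed_lid when extract_embd=True.
lang_avg_embds = {
    lang: np.mean(np.stack(embds), axis=0)
    for lang, embds in lang_to_embds_dic.items()
    if embds  # skip languages with no collected utterances
}

# Project the per-language averages to 2-D for a t-SNE plot.
# perplexity must stay below the number of points (languages).
langs = sorted(lang_avg_embds)
embd_matrix = np.stack([lang_avg_embds[lang] for lang in langs])
points = TSNE(n_components=2, perplexity=min(30, len(langs) - 1)).fit_transform(embd_matrix)
```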
Notes:
- All extracted embeddings are L2-normalized.
- The function supports distributed inference using torch.distributed.
- Supports resume functionality by skipping already processed utterances based on existing output files.
- Limits the number of utterances per language if max_num_utt_per_lang is specified.
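The notes above translate roughly into the following per-utterance bookkeeping. This is a simplified sketch of the L2 normalization and the per-language cap, under the assumption of a hypothetical helper; it is not the function's actual code.

```python
import numpy as np

def keep_embedding(lang, embd, lang_to_embds_dic, lang_counter_dic,
                   max_num_utt_per_lang=None):
    """Hypothetical helper: L2-normalize one utterance embedding and
    store it, honoring the optional per-language utterance cap."""
    count = lang_counter_dic.get(lang, 0)
    if max_num_utt_per_lang is not None and count >= max_num_utt_per_lang:
        return  # this language already has enough utterances
    # All extracted embeddings are L2-normalized (unit Euclidean norm).
    embd = embd / np.linalg.norm(embd)
    lang_to_embds_dic.setdefault(lang, []).append(embd)
    lang_counter_dic[lang] = count + 1
```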