espnet2.spk package

espnet2.spk.espnet_model
class espnet2.spk.espnet_model.ESPnetSpeakerModel(frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], encoder: Optional[espnet2.asr.encoder.abs_encoder.AbsEncoder], pooling: Optional[espnet2.spk.pooling.abs_pooling.AbsPooling], projector: Optional[espnet2.spk.projector.abs_projector.AbsProjector], loss: Optional[espnet2.spk.loss.abs_loss.AbsLoss])

Bases: espnet2.train.abs_espnet_model.AbsESPnetModel
Speaker embedding extraction model. This is the core model for diverse speaker-related tasks (e.g., verification, open-set identification, diarization).

The model architecture mainly comprises an 'encoder', a 'pooling' layer, and a 'projector'. In the speaker recognition field, this combination is usually called a 'speaker encoder' (or speaker embedding extractor). We split it into three components for flexibility in future extensions:

- 'encoder': extracts frame-level speaker embeddings.
- 'pooling': aggregates them into a single utterance-level embedding.
- 'projector' (optional): additional processing (e.g., one fully-connected layer) to derive the final speaker embedding.

In the future, 'pooling' and/or 'projector' may be integrated into a 'decoder', depending on extensions for joint use with other tasks (e.g., ASR, SE, target speaker extraction).
collect_feats(speech: torch.Tensor, speech_lengths: torch.Tensor, spk_labels: torch.Tensor = None, **kwargs) → Dict[str, torch.Tensor]
extract_feats(speech: torch.Tensor, speech_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
forward(speech: torch.Tensor, spk_labels: torch.Tensor, task_tokens: torch.Tensor = None, extract_embd: bool = False, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor]

Feed-forward through encoder layers and aggregate into an utterance-level feature.
- Parameters:
speech – (Batch, samples)
speech_lengths – (Batch,)
extract_embd – flag; when True, the classification head is skipped and the speaker embedding is returned
spk_labels – (Batch,) speaker labels (one-hot) used in the training phase
task_tokens – (Batch,) tokens used in the case of token-based (task) training
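A hedged usage sketch of forward(): `model` stands for an already-configured ESPnetSpeakerModel (assembled elsewhere, e.g., by the spk task), and the label format and return convention are assumptions inferred from the signature above.

```python
import torch

# `model` is an ESPnetSpeakerModel assembled elsewhere (assumption).
speech = torch.randn(4, 16000)            # (Batch, samples): 1 s of 16 kHz audio
speech_lengths = torch.full((4,), 16000)  # (Batch,), passed through **kwargs
spk_labels = torch.randint(0, 10, (4,))   # (Batch,) speaker labels (format assumed)

# Training-style call: the return annotation suggests (loss, stats, weight).
loss, stats, weight = model(speech, spk_labels, speech_lengths=speech_lengths)

# Inference-style call: extract_embd=True skips the classification head,
# so the utterance-level speaker embedding comes back instead of a loss.
embd = model(speech, None, extract_embd=True)
```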
espnet2.spk.__init__

espnet2.spk.encoder.ecapa_tdnn_encoder

ECAPA-TDNN Encoder
class espnet2.spk.encoder.ecapa_tdnn_encoder.EcapaTdnnEncoder(input_size: int, block: str = 'EcapaBlock', model_scale: int = 8, ndim: int = 1024, output_size: int = 1536, **kwargs)

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder
ECAPA-TDNN encoder. Extracts frame-level ECAPA-TDNN embeddings from mel-filterbank energies or MFCC features.

Paper: B. Desplanques et al., "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification," in Proc. INTERSPEECH, 2020.
- Parameters:
input_size – input feature dimension.
block – type of encoder block class to use.
model_scale – scale value of the Res2Net architecture.
ndim – dimensionality of the hidden representation.
output_size – output embedding dimension.
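A minimal instantiation sketch. The 80-dim filterbank input and the (batch, frames, feat_dim) layout are illustrative assumptions, not confirmed by this page.

```python
import torch
from espnet2.spk.encoder.ecapa_tdnn_encoder import EcapaTdnnEncoder

encoder = EcapaTdnnEncoder(input_size=80)  # e.g., 80-dim mel filterbanks (assumption)
feats = torch.randn(2, 200, 80)            # (batch, frames, feat_dim), layout assumed
frame_embds = encoder(feats)               # frame-level embeddings, output_size channels
```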
forward(x: torch.Tensor)

Defines the computation performed at every call. Should be overridden by all subclasses.

Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them. (The same note applies to every forward() documented below.)

espnet2.spk.encoder.__init__
espnet2.spk.encoder.rawnet3_encoder
RawNet3 Encoder
class espnet2.spk.encoder.rawnet3_encoder.RawNet3Encoder(input_size: int, block: str = 'Bottle2neck', model_scale: int = 8, ndim: int = 1024, output_size: int = 1536, **kwargs)

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder
RawNet3 encoder. Extracts frame-level RawNet embeddings from raw waveforms.

Paper: J. Jung et al., "Pushing the limits of raw waveform speaker recognition," in Proc. INTERSPEECH, 2022.
- Parameters:
input_size – input feature dimension.
block – type of encoder block class to use.
model_scale – scale value of the Res2Net architecture.
ndim – dimensionality of the hidden representation.
output_size – output embedding dimension.
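A hedged sketch: RawNet3 operates on waveform-level input, but the input_size value and tensor layout below are illustrative assumptions.

```python
import torch
from espnet2.spk.encoder.rawnet3_encoder import RawNet3Encoder

encoder = RawNet3Encoder(input_size=1)  # illustrative; match the preceding frontend's output
wav = torch.randn(2, 16000)             # (batch, samples) raw waveform, layout assumed
frame_embds = encoder(wav)              # frame-level embeddings, output_size channels
```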
forward(x: torch.Tensor)

Defines the computation performed at every call; should be overridden by all subclasses (see the Note under EcapaTdnnEncoder.forward()).

espnet2.spk.projector.abs_projector
class espnet2.spk.projector.abs_projector.AbsProjector(*args, **kwargs)

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.
abstract forward(utt_embd: torch.Tensor) → torch.Tensor

Defines the computation performed at every call; should be overridden by all subclasses (see the Note under EcapaTdnnEncoder.forward()).
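Since AbsProjector only fixes the forward() contract above, a custom projector can be as small as one fully-connected layer. The class below is an illustrative sketch, not part of the library.

```python
import torch
from espnet2.spk.projector.abs_projector import AbsProjector

class LinearProjector(AbsProjector):
    """Illustrative projector: a single fully-connected layer."""

    def __init__(self, input_size: int, output_size: int):
        super().__init__()
        self.fc = torch.nn.Linear(input_size, output_size)

    def forward(self, utt_embd: torch.Tensor) -> torch.Tensor:
        return self.fc(utt_embd)
```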
espnet2.spk.projector.__init__
espnet2.spk.projector.rawnet3_projector

class espnet2.spk.projector.rawnet3_projector.RawNet3Projector(input_size, output_size)

Bases: espnet2.spk.projector.abs_projector.AbsProjector
forward(x)

Defines the computation performed at every call; should be overridden by all subclasses (see the Note under EcapaTdnnEncoder.forward()).

espnet2.spk.layers.rawnet_block
class espnet2.spk.layers.rawnet_block.AFMS(nb_dim: int)

Bases: torch.nn.modules.module.Module
Alpha-Feature Map Scaling, added to the output of each residual block [1, 2].

References:
[1] RawNet2: https://www.isca-speech.org/archive/Interspeech_2020/pdfs/1011.pdf
[2] AFMS: https://www.koreascience.or.kr/article/JAKO202029757857763.page
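A conceptual sketch of alpha-feature map scaling as described in the references (the exact layer choices in espnet2 may differ): a sigmoid gate computed from time-pooled channel statistics rescales each channel after a learnable per-channel offset.

```python
import torch
import torch.nn.functional as F
from torch import nn

class AFMSSketch(nn.Module):
    def __init__(self, nb_dim: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(nb_dim, 1))  # learnable per-channel offset
        self.fc = nn.Linear(nb_dim, nb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, nb_dim, time), layout assumed
        s = F.adaptive_avg_pool1d(x, 1).squeeze(-1)  # time-pooled channel statistics
        s = torch.sigmoid(self.fc(s)).unsqueeze(-1)  # per-channel gate in (0, 1)
        return (x + self.alpha) * s                  # offset, then rescale
```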
forward(x)

Defines the computation performed at every call; should be overridden by all subclasses (see the Note under EcapaTdnnEncoder.forward()).
class espnet2.spk.layers.rawnet_block.Bottle2neck(inplanes, planes, kernel_size=None, dilation=None, scale=4, pool=False)

Bases: torch.nn.modules.module.Module
forward(x)

Defines the computation performed at every call; should be overridden by all subclasses (see the Note under EcapaTdnnEncoder.forward()).

espnet2.spk.layers.ecapa_block
class espnet2.spk.layers.ecapa_block.EcapaBlock(inplanes, planes, kernel_size=None, dilation=None, scale=8)

Bases: torch.nn.modules.module.Module
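Both EcapaBlock and Bottle2neck are Res2Net-style blocks governed by `scale`. The sketch below illustrates only the hierarchical channel split at the heart of Res2Net; it is a conceptual reimplementation, not the library code.

```python
import torch
from torch import nn

class Res2Sketch(nn.Module):
    def __init__(self, channels: int, scale: int = 8, kernel_size: int = 3):
        super().__init__()
        assert channels % scale == 0
        self.width = channels // scale
        # one conv per group except the first, which passes through untouched
        self.convs = nn.ModuleList(
            nn.Conv1d(self.width, self.width, kernel_size, padding=kernel_size // 2)
            for _ in range(scale - 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = torch.split(x, self.width, dim=1)  # split channels into `scale` groups
        out, prev = [chunks[0]], None
        for i, conv in enumerate(self.convs):
            # each group receives the previous group's output as a residual input
            inp = chunks[i + 1] if prev is None else chunks[i + 1] + prev
            prev = conv(inp)
            out.append(prev)
        return torch.cat(out, dim=1)
```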
forward(x)

Defines the computation performed at every call; should be overridden by all subclasses (see the Note under EcapaTdnnEncoder.forward()).
class espnet2.spk.layers.ecapa_block.SEModule(channels: int, bottleneck: int = 128)

Bases: torch.nn.modules.module.Module
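SEModule is a squeeze-and-excitation layer. A conceptual sketch, assuming (batch, channels, time) inputs and illustrative layer choices:

```python
import torch
from torch import nn

class SESketch(nn.Module):
    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),             # squeeze: global average over time
            nn.Conv1d(channels, bottleneck, 1),  # excitation: bottleneck projection
            nn.ReLU(),
            nn.Conv1d(bottleneck, channels, 1),
            nn.Sigmoid(),                        # per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.se(x)                    # rescale channels by their gates
```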
forward(input)

Defines the computation performed at every call; should be overridden by all subclasses (see the Note under EcapaTdnnEncoder.forward()).

espnet2.spk.layers.__init__
espnet2.spk.loss.abs_loss

class espnet2.spk.loss.abs_loss.AbsLoss(nout: int, **kwargs)

Bases: torch.nn.modules.module.Module
abstract forward(x: torch.Tensor, label=None) → torch.Tensor

Defines the computation performed at every call; should be overridden by all subclasses (see the Note under EcapaTdnnEncoder.forward()).

espnet2.spk.loss.aamsoftmax_subcenter_intertopk
class espnet2.spk.loss.aamsoftmax_subcenter_intertopk.ArcMarginProduct_intertopk_subcenter(nout, nclasses, scale=32.0, margin=0.2, easy_margin=False, K=3, mp=0.06, k_top=5, do_lm=False)

Bases: espnet2.spk.loss.abs_loss.AbsLoss
Implementation of large-margin arc distance with inter-topk and sub-center.

References:
- "Multi-Query Multi-Head Attention Pooling and Inter-TopK Penalty for Speaker Verification," https://arxiv.org/pdf/2110.05042.pdf
- "Sub-center ArcFace: Boosting Face Recognition by Large-Scale Noisy Web Faces," https://ibug.doc.ic.ac.uk/media/uploads/documents/eccv_1445.pdf
- Parameters:
in_features – size of each input sample
out_features – size of each output sample
scale – norm of the input feature
margin – margin applied inside the cosine, i.e., cos(theta + margin)
K – number of sub-centers
k_top – number of hard samples
mp – margin penalty for hard samples
do_lm – whether to do large-margin finetuning
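A hedged usage sketch; the embedding size, speaker count, and integer label format are assumptions.

```python
import torch
from espnet2.spk.loss.aamsoftmax_subcenter_intertopk import (
    ArcMarginProduct_intertopk_subcenter,
)

loss_fn = ArcMarginProduct_intertopk_subcenter(nout=192, nclasses=1000)
embd = torch.randn(8, 192)            # utterance-level speaker embeddings
label = torch.randint(0, 1000, (8,))  # integer speaker labels (format assumed)
loss = loss_fn(embd, label)           # returned tensor assumed to be the loss
```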
forward(input, label)

Defines the computation performed at every call; should be overridden by all subclasses (see the Note under EcapaTdnnEncoder.forward()).

espnet2.spk.loss.__init__
espnet2.spk.loss.aamsoftmax

class espnet2.spk.loss.aamsoftmax.AAMSoftmax(nout, nclasses, margin=0.3, scale=15, easy_margin=False, **kwargs)

Bases: espnet2.spk.loss.abs_loss.AbsLoss
Additive angular margin softmax.
Paper: Deng, Jiankang, et al. “Arcface: Additive angular margin loss for deep face recognition.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
- Parameters:
nout – dimensionality of speaker embedding
nclasses – number of speakers in the training set
margin – margin value of AAMSoftmax
scale – scale value of AAMSoftmax
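To make the margin mechanics concrete, here is a conceptual reimplementation of the additive angular margin logit (not the library code): the ground-truth class's cosine is penalized by the margin before scaling, which tightens the decision boundary.

```python
import torch
import torch.nn.functional as F

def aam_logits(embd, weight, label, margin=0.3, scale=15.0):
    # weight: learnable (nclasses, nout) class centers (hypothetical parameter)
    cos = F.linear(F.normalize(embd), F.normalize(weight))       # (B, nclasses)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(label, cos.size(1)).bool()
    cos_m = torch.where(target, torch.cos(theta + margin), cos)  # margin at the true class
    return scale * cos_m                                         # feed to cross-entropy
```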
forward(x, label=None)

Defines the computation performed at every call; should be overridden by all subclasses (see the Note under EcapaTdnnEncoder.forward()).

espnet2.spk.pooling.__init__
espnet2.spk.pooling.chn_attn_stat_pooling

class espnet2.spk.pooling.chn_attn_stat_pooling.ChnAttnStatPooling(input_size: int = 1536)

Bases: espnet2.spk.pooling.abs_pooling.AbsPooling
Aggregates frame-level features into a single utterance-level feature. Proposed in B. Desplanques et al., "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification."
- Parameters:
input_size – dimensionality of the input frame-level embeddings, determined by the encoder hyperparameter. The output dimensionality of this pooling layer is double the input_size (attentive mean and standard deviation are concatenated).
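A hedged usage sketch; the (batch, feat, frames) layout is an assumption.

```python
import torch
from espnet2.spk.pooling.chn_attn_stat_pooling import ChnAttnStatPooling

pooling = ChnAttnStatPooling(input_size=1536)
frame_embds = torch.randn(2, 1536, 200)  # (batch, feat, frames), layout assumed
utt_embd = pooling(frame_embds)          # (batch, 3072): attentive mean and std concatenated
```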
forward(x, task_tokens: torch.Tensor = None)

Defines the computation performed at every call; should be overridden by all subclasses (see the Note under EcapaTdnnEncoder.forward()).

espnet2.spk.pooling.abs_pooling
class espnet2.spk.pooling.abs_pooling.AbsPooling(*args, **kwargs)

Bases: torch.nn.modules.module.Module, abc.ABC

abstract forward(x: torch.Tensor) → torch.Tensor