espnet2.spk package

espnet2.spk.espnet_model

class espnet2.spk.espnet_model.ESPnetSpeakerModel(frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], encoder: Optional[espnet2.asr.encoder.abs_encoder.AbsEncoder], pooling: Optional[espnet2.spk.pooling.abs_pooling.AbsPooling], projector: Optional[espnet2.spk.projector.abs_projector.AbsProjector], loss: Optional[espnet2.spk.loss.abs_loss.AbsLoss])[source]

Bases: espnet2.train.abs_espnet_model.AbsESPnetModel

Speaker embedding extraction model.

Core model for diverse speaker-related tasks (e.g., verification, open-set identification, and diarization).

The model architecture comprises mainly ‘encoder’, ‘pooling’, and ‘projector’. In the speaker recognition field, this combination is usually referred to as a ‘speaker encoder’ (or speaker embedding extractor). We split it into three components for flexibility in future extensions:

  • ‘encoder’: extracts frame-level speaker embeddings.

  • ‘pooling’: aggregates frame-level embeddings into a single utterance-level embedding.

  • ‘projector’ (optional): applies additional processing (e.g., one fully-connected layer) to derive the final speaker embedding.

Possibly, in the future, ‘pooling’ and/or ‘projector’ can be integrated as a ‘decoder’, depending on the extension for joint usage of different tasks (e.g., ASR, SE, target speaker extraction).

aggregate(frame_level_feats: torch.Tensor) → torch.Tensor[source]
collect_feats(speech: torch.Tensor, speech_lengths: torch.Tensor, spk_labels: torch.Tensor = None, **kwargs) → Dict[str, torch.Tensor][source]
encode_frame(feats: torch.Tensor) → torch.Tensor[source]
extract_feats(speech: torch.Tensor, speech_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]
forward(speech: torch.Tensor, spk_labels: Optional[torch.Tensor] = None, task_tokens: Optional[torch.Tensor] = None, extract_embd: bool = False, **kwargs) → Union[Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor], torch.Tensor][source]

Feed-forward through encoder layers and aggregate into an utterance-level feature.

Parameters:
  • speech – (Batch, samples)

  • speech_lengths – (Batch,)

  • extract_embd – when True, skip the classification head and return the speaker embedding directly

  • spk_labels – (Batch,) speaker labels used in the training phase (one-hot)

  • task_tokens – (Batch,) tokens used in case of token-based training

project_spk_embd(utt_level_feat: torch.Tensor) → torch.Tensor[source]
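
A minimal call sketch, assuming `model` is an already-assembled ESPnetSpeakerModel (e.g., built by the espnet2 speaker task from a training config; construction is omitted here), with speech_lengths passed per the parameter list above:

    >>> import torch
    >>> # model: an assembled ESPnetSpeakerModel (assumed; construction omitted)
    >>> speech = torch.randn(4, 32000)            # (Batch, samples), e.g., 2 s at 16 kHz
    >>> speech_lengths = torch.full((4,), 32000)
    >>> # Inference: skip the classification head and get speaker embeddings.
    >>> spk_embd = model(speech, speech_lengths=speech_lengths, extract_embd=True)
    >>> # Training-style call (spk_labels required) returns (loss, stats, weight):
    >>> # loss, stats, weight = model(speech, speech_lengths=speech_lengths, spk_labels=spk_labels)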

espnet2.spk.__init__

espnet2.spk.pooling.chn_attn_stat_pooling

class espnet2.spk.pooling.chn_attn_stat_pooling.ChnAttnStatPooling(input_size: int = 1536)[source]

Bases: espnet2.spk.pooling.abs_pooling.AbsPooling

Aggregates frame-level features into a single utterance-level feature.

Proposed in B. Desplanques et al., “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” in Proc. INTERSPEECH, 2020.

Parameters:

input_size – dimensionality of the input frame-level embeddings. Determined by the encoder hyperparameter. For this pooling layer, the output dimensionality will be double the input_size.

forward(x, task_tokens: torch.Tensor = None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size()[source]
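
Example (a minimal sketch, assuming channel-first frame-level input of shape (batch, input_size, frames), as is common for ECAPA-style pooling):

    >>> import torch
    >>> from espnet2.spk.pooling.chn_attn_stat_pooling import ChnAttnStatPooling
    >>> pooling = ChnAttnStatPooling(input_size=1536)
    >>> x = torch.randn(4, 1536, 100)  # assumed (batch, input_size, frames)
    >>> utt_embd = pooling(x)          # attention-weighted mean and std, concatenated
    >>> pooling.output_size()          # double the input_size
    3072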

espnet2.spk.pooling.stat_pooling

class espnet2.spk.pooling.stat_pooling.StatsPooling(input_size: int = 1536)[source]

Bases: espnet2.spk.pooling.abs_pooling.AbsPooling

Aggregates frame-level features into a single utterance-level feature.

Proposed in D. Snyder et al., “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. IEEE ICASSP, 2018.

Parameters:

input_size – dimensionality of the input frame-level embeddings. Determined by the encoder hyperparameter. For this pooling layer, the output dimensionality will be double the input_size.

forward(x, task_tokens: torch.Tensor = None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size()[source]
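
Example (a minimal sketch under the same assumed channel-first layout; mean and standard deviation over time are concatenated, doubling the feature dimension):

    >>> import torch
    >>> from espnet2.spk.pooling.stat_pooling import StatsPooling
    >>> pooling = StatsPooling(input_size=1536)
    >>> x = torch.randn(4, 1536, 100)  # assumed (batch, input_size, frames)
    >>> utt_embd = pooling(x)          # mean and std over the time axis
    >>> pooling.output_size()
    3072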

espnet2.spk.pooling.mean_pooling

class espnet2.spk.pooling.mean_pooling.MeanPooling(input_size: int = 1536)[source]

Bases: espnet2.spk.pooling.abs_pooling.AbsPooling

Averages frame-level features into a single utterance-level feature.

Parameters:

input_size – dimensionality of the input frame-level embeddings. Determined by the encoder hyperparameter.

forward(x, task_tokens: torch.Tensor = None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size()[source]
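
Example (a minimal sketch under the same assumed layout; averaging leaves the feature dimension unchanged):

    >>> import torch
    >>> from espnet2.spk.pooling.mean_pooling import MeanPooling
    >>> pooling = MeanPooling(input_size=1536)
    >>> x = torch.randn(4, 1536, 100)  # assumed (batch, input_size, frames)
    >>> utt_embd = pooling(x)          # averaged over the time axis
    >>> pooling.output_size()          # equal to input_size
    1536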

espnet2.spk.pooling.abs_pooling

class espnet2.spk.pooling.abs_pooling.AbsPooling(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract output_size() → int[source]

espnet2.spk.pooling.__init__

espnet2.spk.layers.__init__

espnet2.spk.layers.rawnet_block

class espnet2.spk.layers.rawnet_block.AFMS(nb_dim: int)[source]

Bases: torch.nn.modules.module.Module

Alpha-feature map scaling (AFMS), applied to the output of each residual block [1, 2].

Reference: [1] RawNet2: https://www.isca-speech.org/archive/Interspeech_2020/pdfs/1011.pdf [2] AFMS: https://www.koreascience.or.kr/article/JAKO202029757857763.page

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.spk.layers.rawnet_block.Bottle2neck(inplanes, planes, kernel_size=None, dilation=None, scale=4, pool=False)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.spk.layers.ecapa_block

class espnet2.spk.layers.ecapa_block.EcapaBlock(inplanes, planes, kernel_size=None, dilation=None, scale=8)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.spk.layers.ecapa_block.SEModule(channels: int, bottleneck: int = 128)[source]

Bases: torch.nn.modules.module.Module

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.spk.encoder.xvector_encoder

class espnet2.spk.encoder.xvector_encoder.XvectorEncoder(input_size: int, ndim: int = 512, output_size: int = 1500, kernel_sizes: List = [5, 3, 3, 1, 1], paddings: List = [2, 1, 1, 0, 0], dilations: List = [1, 2, 3, 1, 1], **kwargs)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

X-vector encoder. Extracts frame-level x-vector embeddings from features.

Paper: D. Snyder et al., “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. IEEE ICASSP, 2018.

Parameters:
  • input_size – input feature dimension.

  • ndim – dimensionality of the hidden representation.

  • output_size – output embedding dimension.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size() → int[source]
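
Example (a minimal sketch; 80-dim mel-filterbank features and the (batch, frames, input_size) layout are assumptions to be checked against the configured frontend):

    >>> import torch
    >>> from espnet2.spk.encoder.xvector_encoder import XvectorEncoder
    >>> encoder = XvectorEncoder(input_size=80)  # e.g., 80-dim filterbank features
    >>> feats = torch.randn(4, 200, 80)          # assumed (batch, frames, input_size)
    >>> frame_embds = encoder(feats)             # frame-level x-vector features
    >>> encoder.output_size()
    1500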

espnet2.spk.encoder.rawnet3_encoder

RawNet3 Encoder

class espnet2.spk.encoder.rawnet3_encoder.RawNet3Encoder(input_size: int, block: str = 'Bottle2neck', model_scale: int = 8, ndim: int = 1024, output_size: int = 1536, **kwargs)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

RawNet3 encoder. Extracts frame-level RawNet embeddings from raw waveform.

Paper: J. Jung et al., “Pushing the limits of raw waveform speaker recognition”, in Proc. INTERSPEECH, 2022.

Parameters:
  • input_size – input feature dimension.

  • block – type of encoder block class to use.

  • model_scale – scale value of the Res2Net architecture.

  • ndim – dimensionality of the hidden representation.

  • output_size – output embedding dimension.

forward(x: torch.Tensor)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size() → int[source]
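
Example (a shape sketch only; in practice the input comes from the configured raw-waveform frontend, and input_size must match that frontend's output dimension, here a hypothetical 256):

    >>> import torch
    >>> from espnet2.spk.encoder.rawnet3_encoder import RawNet3Encoder
    >>> encoder = RawNet3Encoder(input_size=256)  # hypothetical frontend dimension
    >>> x = torch.randn(4, 300, 256)              # assumed (batch, frames, input_size)
    >>> frame_embds = encoder(x)
    >>> encoder.output_size()
    1536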

espnet2.spk.encoder.identity_encoder

Identity Encoder

class espnet2.spk.encoder.identity_encoder.IdentityEncoder(input_size: int)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

Identity encoder. Does nothing; simply passes the frontend feature to the pooling.

Expected to be used for cases when frontend already has a good representation (e.g., SSL features).

Parameters:

input_size – input feature dimension.

forward(x: torch.Tensor)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size() → int[source]
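
Example (a minimal sketch with a hypothetical 768-dim SSL frontend feature; the encoder is a pass-through):

    >>> import torch
    >>> from espnet2.spk.encoder.identity_encoder import IdentityEncoder
    >>> encoder = IdentityEncoder(input_size=768)  # e.g., SSL feature dimension
    >>> feats = torch.randn(4, 200, 768)
    >>> out = encoder(feats)                       # unchanged
    >>> dim = encoder.output_size()                # expected to mirror input_size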

espnet2.spk.encoder.ecapa_tdnn_encoder

ECAPA-TDNN Encoder

class espnet2.spk.encoder.ecapa_tdnn_encoder.EcapaTdnnEncoder(input_size: int, block: str = 'EcapaBlock', model_scale: int = 8, ndim: int = 1024, output_size: int = 1536, **kwargs)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

ECAPA-TDNN encoder. Extracts frame-level ECAPA-TDNN embeddings from mel-filterbank energy or MFCC features.

Paper: B. Desplanques et al., “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” in Proc. INTERSPEECH, 2020.

Parameters:
  • input_size – input feature dimension.

  • block – type of encoder block class to use.

  • model_scale – scale value of the Res2Net architecture.

  • ndim – dimensionality of the hidden representation.

  • output_size – output embedding dimension.

forward(x: torch.Tensor)[source]

Calculate forward propagation.

Parameters:

x (torch.Tensor) – Input tensor (#batch, L, input_size).

Returns:

Output tensor (#batch, L, output_size).

Return type:

torch.Tensor

output_size() → int[source]
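
Example (following the shapes documented in forward above; the 80-dim mel-filterbank input is an assumption):

    >>> import torch
    >>> from espnet2.spk.encoder.ecapa_tdnn_encoder import EcapaTdnnEncoder
    >>> encoder = EcapaTdnnEncoder(input_size=80)
    >>> feats = torch.randn(4, 200, 80)  # (#batch, L, input_size)
    >>> frame_embds = encoder(feats)     # (#batch, L, output_size)
    >>> encoder.output_size()
    1536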

espnet2.spk.encoder.__init__

espnet2.spk.encoder.conformer_encoder

Conformer encoder definition.

class espnet2.spk.encoder.conformer_encoder.MfaConformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d2', normalize_before: bool = True, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'rel_pos', selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, stochastic_depth_rate: Union[float, List[float]] = 0.0, layer_drop_rate: float = 0.0, max_pos_emb_len: int = 5000, padding_idx: Optional[int] = None)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

Conformer encoder module for MFA-Conformer.

Paper: Y. Zhang et al., “Mfa-conformer: Multi-scale feature aggregation conformer for automatic speaker verification,” in Proc. INTERSPEECH, 2022.

Parameters:
  • input_size (int) – Input dimension.

  • output_size (int) – Dimension of attention.

  • attention_heads (int) – The number of heads of multi head attention.

  • linear_units (int) – The number of units of position-wise feed forward.

  • num_blocks (int) – The number of encoder blocks.

  • dropout_rate (float) – Dropout rate.

  • attention_dropout_rate (float) – Dropout rate in attention.

  • positional_dropout_rate (float) – Dropout rate after adding positional encoding.

  • input_layer (Union[str, torch.nn.Module]) – Input layer type.

  • normalize_before (bool) – Whether to use layer_norm before the first block.

  • positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.

  • positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.

  • rel_pos_type (str) – Whether to use the latest relative positional encoding or the legacy one. The legacy relative positional encoding will be deprecated in the future. More details can be found in https://github.com/espnet/espnet/pull/2816.

  • pos_enc_layer_type (str) – Encoder positional encoding layer type.

  • selfattention_layer_type (str) – Encoder attention layer type.

  • activation_type (str) – Encoder activation function type.

  • macaron_style (bool) – Whether to use macaron style for positionwise layer.

  • use_cnn_module (bool) – Whether to use convolution module.

  • zero_triu (bool) – Whether to zero the upper triangular part of the attention matrix.

  • cnn_module_kernel (int) – Kernel size of convolution module.

  • padding_idx (int) – Padding idx for input_layer=embed.

forward(x: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Calculate forward propagation.

Parameters:

x (torch.Tensor) – Input tensor (#batch, L, input_size).

Returns:

Output tensor (#batch, L, output_size).

Return type:

torch.Tensor

output_size() → int[source]
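
Example (a minimal sketch following the documented input shape; the 80-dim features are an assumption, and the default input_layer='conv2d2' subsamples the time axis):

    >>> import torch
    >>> from espnet2.spk.encoder.conformer_encoder import MfaConformerEncoder
    >>> encoder = MfaConformerEncoder(input_size=80, output_size=256, num_blocks=6)
    >>> feats = torch.randn(4, 200, 80)  # (#batch, L, input_size)
    >>> out = encoder(feats)             # frame-level features (time axis subsampled)
    >>> dim = encoder.output_size()      # dimension of the returned frame-level features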

espnet2.spk.encoder.ska_tdnn_encoder

class espnet2.spk.encoder.ska_tdnn_encoder.Bottle2neck(inplanes, planes, kernel_size=None, kernel_sizes=[5, 7], dilation=None, scale=8, group=1)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.spk.encoder.ska_tdnn_encoder.ResBlock(inplanes: int, planes: int, stride: int = 1, reduction: int = 8, skfwse_freq: int = 40, skcwse_channel: int = 128)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.spk.encoder.ska_tdnn_encoder.SEModule(channels, bottleneck=128)[source]

Bases: torch.nn.modules.module.Module

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.spk.encoder.ska_tdnn_encoder.SKAttentionModule(channel=128, reduction=4, L=16, num_kernels=2)[source]

Bases: torch.nn.modules.module.Module

forward(x, convs)[source]

Forward function.

Input: [B, C, T] Split: [K, B, C, T] Fuse: [B, C, T] Attention weight: [B, C, 1] Output: [B, C, T]

class espnet2.spk.encoder.ska_tdnn_encoder.SkaTdnnEncoder(input_size: int, block: str = 'Bottle2neck', ndim: int = 1024, model_scale: int = 8, skablock: str = 'ResBlock', ska_dim: int = 128, output_size: int = 1536, **kwargs)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

SKA-TDNN encoder. Extracts frame-level SKA-TDNN embeddings from features.

Paper: S. Mun, J. Jung et al., “Frequency and Multi-Scale Selective Kernel Attention for Speaker Verification,” in Proc. IEEE SLT, 2022.

Parameters:
  • input_size – input feature dimension.

  • block – type of encoder block class to use.

  • model_scale – scale value of the Res2Net architecture.

  • ndim – dimensionality of the hidden representation.

  • output_size – output embedding dimension.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size() → int[source]
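
Example (a minimal sketch; the 80-dim filterbank input and (batch, frames, input_size) layout are assumptions):

    >>> import torch
    >>> from espnet2.spk.encoder.ska_tdnn_encoder import SkaTdnnEncoder
    >>> encoder = SkaTdnnEncoder(input_size=80)  # assumed filterbank dimension
    >>> feats = torch.randn(4, 200, 80)          # assumed (batch, frames, input_size)
    >>> frame_embds = encoder(feats)
    >>> encoder.output_size()
    1536
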
class espnet2.spk.encoder.ska_tdnn_encoder.cwSKAttention(freq=40, channel=128, kernels=[3, 5], receptive=[3, 5], dilations=[1, 1], reduction=8, groups=1, L=16)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Forward function.

Input: [B, C, F, T] Split: [K, B, C, F, T] Fuse: [B, C, F, T] Attention weight: [K, B, C, 1, 1] Output: [B, C, F, T]

class espnet2.spk.encoder.ska_tdnn_encoder.fwSKAttention(freq=40, channel=128, kernels=[3, 5], receptive=[3, 5], dilations=[1, 1], reduction=8, groups=1, L=16)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Forward function.

Input: [B, C, F, T] Split: [K, B, C, F, T] Fuse: [B, C, F, T] Attention weight: [K, B, 1, F, 1] Output: [B, C, F, T]

espnet2.spk.projector.abs_projector

class espnet2.spk.projector.abs_projector.AbsProjector(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(utt_embd: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract output_size() → int[source]

espnet2.spk.projector.rawnet3_projector

class espnet2.spk.projector.rawnet3_projector.RawNet3Projector(input_size, output_size=192)[source]

Bases: espnet2.spk.projector.abs_projector.AbsProjector

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size()[source]

espnet2.spk.projector.xvector_projector

class espnet2.spk.projector.xvector_projector.XvectorProjector(input_size, output_size)[source]

Bases: espnet2.spk.projector.abs_projector.AbsProjector

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size()[source]

espnet2.spk.projector.__init__

espnet2.spk.projector.ska_tdnn_projector

class espnet2.spk.projector.ska_tdnn_projector.SkaTdnnProjector(input_size, output_size)[source]

Bases: espnet2.spk.projector.abs_projector.AbsProjector

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size()[source]

espnet2.spk.loss.abs_loss

class espnet2.spk.loss.abs_loss.AbsLoss(nout: int, **kwargs)[source]

Bases: torch.nn.modules.module.Module

abstract forward(x: torch.Tensor, label=None) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.spk.loss.__init__

espnet2.spk.loss.aamsoftmax

class espnet2.spk.loss.aamsoftmax.AAMSoftmax(nout, nclasses, margin=0.3, scale=15, easy_margin=False, **kwargs)[source]

Bases: espnet2.spk.loss.abs_loss.AbsLoss

Additive angular margin softmax.

Paper: Deng, Jiankang, et al. “ArcFace: Additive Angular Margin Loss for Deep Face Recognition.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.

Parameters:
  • nout – dimensionality of speaker embedding

  • nclasses – number of speakers in the training set

  • margin – margin value of AAMSoftmax

  • scale – scale value of AAMSoftmax

forward(x, label=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
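
Example (a minimal sketch; integer class indices are an assumed label format, and the returned value is assumed to be the scalar training loss):

    >>> import torch
    >>> from espnet2.spk.loss.aamsoftmax import AAMSoftmax
    >>> loss_fn = AAMSoftmax(nout=192, nclasses=1000, margin=0.3, scale=15)
    >>> embd = torch.randn(4, 192)            # utterance-level speaker embeddings
    >>> label = torch.randint(0, 1000, (4,))  # assumed integer class indices
    >>> loss = loss_fn(embd, label)           # assumed scalar training loss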

espnet2.spk.loss.aamsoftmax_subcenter_intertopk

class espnet2.spk.loss.aamsoftmax_subcenter_intertopk.ArcMarginProduct_intertopk_subcenter(nout, nclasses, scale=32.0, margin=0.2, easy_margin=False, K=3, mp=0.06, k_top=5, do_lm=False)[source]

Bases: espnet2.spk.loss.abs_loss.AbsLoss

Implementation of large-margin arc distance with inter-topk and sub-center:

Reference:

Multi-Query Multi-Head Attention Pooling and Inter-TopK Penalty for Speaker Verification. https://arxiv.org/pdf/2110.05042.pdf

Sub-center ArcFace: Boosting Face Recognition by Large-Scale Noisy Web Faces. https://ibug.doc.ic.ac.uk/media/uploads/documents/eccv_1445.pdf

Parameters:
  • nout – size of each input sample (speaker embedding dimension)

  • nclasses – number of classes (speakers) in the output

  • scale – norm of input feature

  • margin – margin, applied as cos(theta + margin)

  • K – number of sub-centers

  • k_top – number of hard samples

  • mp – margin penalty of hard samples

  • do_lm – whether to do large-margin fine-tuning

forward(input, label)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

update(margin=0.2)[source]
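
Example (a minimal sketch under the same assumptions as AAMSoftmax above; update() is shown as a margin-scheduling hook, e.g., for large-margin fine-tuning):

    >>> import torch
    >>> from espnet2.spk.loss.aamsoftmax_subcenter_intertopk import (
    ...     ArcMarginProduct_intertopk_subcenter,
    ... )
    >>> loss_fn = ArcMarginProduct_intertopk_subcenter(
    ...     nout=192, nclasses=1000, scale=32.0, margin=0.2, K=3, k_top=5
    ... )
    >>> embd = torch.randn(4, 192)
    >>> label = torch.randint(0, 1000, (4,))  # assumed integer class indices
    >>> loss = loss_fn(embd, label)
    >>> loss_fn.update(margin=0.2)            # adjust the margin during training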