espnet2.spk.pooling.stat_pooling.StatsPooling
class espnet2.spk.pooling.stat_pooling.StatsPooling(input_size: int = 1536)
Bases: AbsPooling
Aggregates frame-level features into a single utterance-level feature by concatenating their mean and standard deviation over time.
Reference: X-Vectors: Robust DNN Embeddings for Speaker Recognition https://www.danielpovey.com/files/2018_icassp_xvectors.pdf
- Parameters: input_size – Dimension of the input frame-level embeddings.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
forward(x: Tensor, feat_lengths: Tensor | None = None) → Tensor
Forward pass of statistics pooling.
- Parameters:
- x – Input feature tensor of shape (batch_size, feature_dim, seq_len)
- feat_lengths – Optional tensor of shape (batch_size,) containing the valid length of each sequence before padding
- Returns: Utterance-level embeddings of shape (batch_size, 2 × feature_dim)
- Return type: Tensor
output_size()
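The shapes documented for forward can be illustrated with a small standalone sketch of statistics pooling in plain PyTorch. This is not the ESPnet implementation: the function name stats_pooling_sketch is hypothetical, and both the masking of padded frames via feat_lengths and the use of the unbiased (n - 1) standard deviation are assumptions made for illustration.

```python
import torch


def stats_pooling_sketch(x, feat_lengths=None):
    """Concatenate the per-channel mean and standard deviation over time.

    x: (batch_size, feature_dim, seq_len)
    feat_lengths: optional (batch_size,) valid lengths before padding
    returns: (batch_size, 2 * feature_dim)
    """
    if feat_lengths is None:
        mu = x.mean(dim=-1)
        # Assumption: unbiased (n - 1) estimator, which is torch.std's default.
        sigma = x.std(dim=-1)
        return torch.cat((mu, sigma), dim=1)

    seq_len = x.size(-1)
    steps = torch.arange(seq_len, device=x.device)
    # mask[b, 0, t] is 1.0 for valid frames and 0.0 for padded frames.
    mask = (steps[None, :] < feat_lengths[:, None]).unsqueeze(1).to(x.dtype)
    n = mask.sum(dim=-1)  # (batch_size, 1) count of valid frames

    # Mean over valid frames only.
    mu = (x * mask).sum(dim=-1) / n
    # Squared deviations, with padded frames zeroed out before summing.
    sq_dev = ((x - mu.unsqueeze(-1)) ** 2) * mask
    # Clamp the denominator so a length-1 sequence does not divide by zero.
    var = sq_dev.sum(dim=-1) / (n - 1).clamp(min=1)
    return torch.cat((mu, var.sqrt()), dim=1)


# Two utterances with 4 feature channels, padded to 6 frames.
x = torch.randn(2, 4, 6)
lengths = torch.tensor([6, 3])
pooled = stats_pooling_sketch(x, lengths)
print(pooled.shape)
```

With feature_dim equal to 4, the pooled tensor has 8 columns per utterance, matching the documented (batch_size, 2 × feature_dim) return shape. Whether the actual class masks padded frames in this way should be checked against the ESPnet source.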