espnet2.spk.pooling.stat_pooling.StatsPooling


class espnet2.spk.pooling.stat_pooling.StatsPooling(input_size: int = 1536)

Aggregates frame-level features into a single utterance-level feature.

Reference: X-Vectors: Robust DNN Embeddings for Speaker Recognition https://www.danielpovey.com/files/2018_icassp_xvectors.pdf

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, feat_lengths: Tensor | None = None) → Tensor

Forward pass of statistics pooling.

Parameters:
- x – Input feature tensor of shape (batch_size, feature_dim, seq_len)
- feat_lengths – Optional tensor of shape (batch_size,) containing the valid length of each sequence before padding
Returns: Utterance-level embeddings of shape (batch_size, 2 × feature_dim)
Return type: Tensor
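
The shape contract above can be illustrated with a minimal, dependency-free sketch. It is not the library's implementation: `stats_pool` is a hypothetical helper that works on nested Python lists instead of tensors, it assumes the output is the per-channel mean followed by the per-channel standard deviation (as in the x-vector paper), and it assumes a population standard deviation computed over the valid frames only. The actual module may use a different estimator or masking strategy.

```python
import math


def stats_pool(x, feat_lengths=None):
    """Hypothetical helper: pool (batch, feature_dim, seq_len) nested lists
    into (batch, 2 * feature_dim) by concatenating mean and std per channel."""
    pooled = []
    for b, utterance in enumerate(x):
        means, stds = [], []
        for channel in utterance:
            # Use only the valid (unpadded) frames when lengths are given.
            n = len(channel) if feat_lengths is None else feat_lengths[b]
            frames = channel[:n]
            mean = sum(frames) / n
            # Assumption: population variance (divide by n, not n - 1).
            var = sum((v - mean) ** 2 for v in frames) / n
            means.append(mean)
            stds.append(math.sqrt(var))
        # Assumed layout: all means first, then all standard deviations.
        pooled.append(means + stds)
    return pooled


# Batch of 2 utterances, 2 feature channels, 4 frames each.
# The second utterance has 2 valid frames followed by zero padding.
x = [
    [[1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0]],
    [[1.0, 3.0, 0.0, 0.0], [5.0, 5.0, 0.0, 0.0]],
]
out = stats_pool(x, feat_lengths=[4, 2])
print(len(out), len(out[0]))  # batch_size, 2 * feature_dim
```

Passing `feat_lengths` keeps the zero padding of the shorter utterance from pulling its mean toward zero and inflating its standard deviation.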

output_size()