espnet2.spk.pooling.chn_attn_stat_pooling.ChnAttnStatPooling


class espnet2.spk.pooling.chn_attn_stat_pooling.ChnAttnStatPooling(input_size: int = 1536)

Aggregates frame-level features into a single utterance-level feature.

Reference: ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification https://arxiv.org/pdf/2005.07143

Parameters: input_size – Dimension of the input frame-level embeddings. The output dimensionality is 2 × input_size, because the pooled mean and standard deviation are concatenated.

Initializes internal Module state, shared by both nn.Module and ScriptModule.
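
The 2 × input_size output dimensionality can be illustrated with a minimal NumPy sketch of attention-weighted statistics pooling, following the weighted mean and standard deviation described in the ECAPA-TDNN reference. This is not the ESPnet implementation: the function name `weighted_stat_pooling`, the `eps` clamp value, and the uniform weights (standing in for the learned channel-dependent attention) are assumptions made for illustration.

```python
import numpy as np


def weighted_stat_pooling(x, weights, eps=1e-4):
    """Hypothetical helper: pool (batch, feat, time) features over time.

    `weights` has the same shape as `x` and is assumed to sum to 1 along
    the time axis for every (batch, channel) pair.
    """
    # Weighted mean per channel: sum_t w_t * x_t
    mean = np.sum(weights * x, axis=2)
    # Weighted variance per channel: sum_t w_t * x_t^2 - mean^2
    var = np.sum(weights * x ** 2, axis=2) - mean ** 2
    # Clamp before the square root to avoid tiny negative values (assumed eps)
    std = np.sqrt(np.clip(var, eps, None))
    # Concatenate along the feature axis: (batch, 2 * feat)
    return np.concatenate([mean, std], axis=1)


rng = np.random.default_rng(0)
batch, feat, time = 2, 4, 10
x = rng.standard_normal((batch, feat, time))

# Uniform weights as a placeholder for learned attention weights
weights = np.full((batch, feat, time), 1.0 / time)

pooled = weighted_stat_pooling(x, weights)
print(pooled.shape)  # (2, 8)
```

With uniform weights the first half of each pooled vector reduces to the plain per-channel mean over time; a learned attention would instead emphasize different frames for each channel.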

forward(x: Tensor, feat_lengths: Tensor | None = None) → Tensor

Forward pass of channel-attentive statistical pooling.

Parameters:
- x – Input feature tensor of shape (batch_size, feature_dim, seq_len), where feature_dim equals input_size
- feat_lengths – Optional tensor of shape (batch_size,) containing the valid length of each sequence before padding
Returns: Utterance-level embeddings of shape (batch_size, 2 × feature_dim)
Return type: Tensor
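
One common way to honor per-sequence valid lengths such as feat_lengths is to mask padded frames before the softmax over time, so that they receive zero attention weight. The NumPy sketch below shows that idea only; whether ESPnet handles feat_lengths this way is an assumption, and `masked_softmax_over_time` is a hypothetical helper, not part of the library. It also assumes every length is at least 1.

```python
import numpy as np


def masked_softmax_over_time(scores, feat_lengths):
    """Hypothetical helper: softmax over the time axis, ignoring padding.

    scores: (batch, feat, time) unnormalized attention scores.
    feat_lengths: (batch,) number of valid frames per sequence (each >= 1).
    """
    _, _, time = scores.shape
    # mask[b, t] is True for valid frames, False for padded frames
    mask = np.arange(time)[None, :] < np.asarray(feat_lengths)[:, None]
    # Padded positions get -inf so that exp() maps them to exactly 0
    masked = np.where(mask[:, None, :], scores, -np.inf)
    # Subtract the per-row maximum for numerical stability
    masked = masked - masked.max(axis=2, keepdims=True)
    exp = np.exp(masked)
    return exp / exp.sum(axis=2, keepdims=True)


# Two sequences padded to 5 frames; the second has only 2 valid frames
scores = np.zeros((2, 3, 5))
weights = masked_softmax_over_time(scores, [5, 2])
print(weights[1, 0])  # weight 0.5 on each valid frame, 0 on padded frames
```

Weights produced this way still sum to 1 over time for every channel, so the weighted mean and standard deviation are computed over valid frames only.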

output_size()