espnet2.asr.frontend.cnn.CNNFrontend

class espnet2.asr.frontend.cnn.CNNFrontend(norm_mode: str, conv_mode: str, bias: bool, shapes: List[Tuple[int, int, int]] = [(512, 10, 5), (512, 3, 2), (512, 3, 2), (512, 3, 2), (512, 3, 2), (512, 2, 2), (512, 2, 2)], fs: int | str = 16000, normalize_audio: bool = False, normalize_output: bool = False, layer_norm_cls: Literal['transposed', 'dim1'] = 'transposed')

Bases: AbsFrontend

Convolutional feature extractor.

Typically used in SSL models. Uses raw waveforms as input.
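As a rough illustration of what the default `shapes` imply, the sketch below computes the overall hop size and receptive field of the convolutional stack. It assumes each tuple is `(output_channels, kernel_size, stride)` and that the layers are applied in order without padding; neither assumption is stated on this page. The helper names `total_stride` and `receptive_field` are hypothetical and are not part of ESPnet.

```python
from math import prod

# Default `shapes` copied from the class signature.
# Assumption: each tuple is (output_channels, kernel_size, stride).
DEFAULT_SHAPES = [
    (512, 10, 5), (512, 3, 2), (512, 3, 2), (512, 3, 2),
    (512, 3, 2), (512, 2, 2), (512, 2, 2),
]


def total_stride(shapes):
    """Number of input samples between consecutive output frames."""
    return prod(stride for _, _, stride in shapes)


def receptive_field(shapes):
    """Number of input samples that contribute to one output frame."""
    field, jump = 1, 1
    for _, kernel, stride in shapes:
        field += (kernel - 1) * jump  # each layer widens the field
        jump *= stride                # spacing between adjacent outputs
    return field


fs = 16000  # default sampling rate from the signature
hop = total_stride(DEFAULT_SHAPES)
window = receptive_field(DEFAULT_SHAPES)
hop_ms = 1000 * hop / fs
window_ms = 1000 * window / fs
print(hop, window, hop_ms, window_ms)
```

Under these assumptions, the default configuration yields one frame every 320 samples (20 ms at 16 kHz), with each frame covering 400 samples (25 ms).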

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor, length: Tensor | None) → Tuple[Tensor, Tensor | None]

Run the CNNFrontend forward pass: extract convolutional features from a batch of raw waveforms.

Parameters:
- x (Tensor) – Input Tensor representing a batch of audio, shape: [batch, time].
- length (Tensor or None, optional) – Valid length of each input sample, shape: [batch].
Returns:
- Tensor – The resulting feature, shape: [batch, frame, feature].
- Tensor or None – Valid length of each output sample, shape: [batch].
Return type: Tuple[Tensor, Tensor | None]
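The relationship between the input and output valid lengths can be sketched in plain Python. This is only an approximation under stated assumptions: each `shapes` tuple is taken to be `(output_channels, kernel_size, stride)`, and every layer is treated as an unpadded 1-D convolution. The functions `conv_output_length` and `frontend_output_lengths` are hypothetical helpers, not ESPnet APIs, and the actual implementation may compute lengths differently.

```python
# Default `shapes` copied from the class signature.
# Assumption: each tuple is (output_channels, kernel_size, stride).
SHAPES = [
    (512, 10, 5), (512, 3, 2), (512, 3, 2), (512, 3, 2),
    (512, 3, 2), (512, 2, 2), (512, 2, 2),
]


def conv_output_length(length, kernel, stride):
    """Output length of an unpadded 1-D convolution (dilation 1)."""
    return (length - kernel) // stride + 1


def frontend_output_lengths(lengths, shapes):
    """Map per-sample input lengths (in samples) to frame counts."""
    result = []
    for n in lengths:
        for _, kernel, stride in shapes:
            n = conv_output_length(n, kernel, stride)
        result.append(n)
    return result


# One second and half a second of 16 kHz audio.
frames = frontend_output_lengths([16000, 8000], SHAPES)
print(frames)
```

With these assumptions, a 16000-sample input maps to 49 frames and an 8000-sample input to 24 frames, so the `frame` dimension of the returned feature would be 49 for this batch.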

output_size() → int