espnet2.asr.preencoder.sinc.LightweightSincConvs
class espnet2.asr.preencoder.sinc.LightweightSincConvs(fs: int | str | float = 16000, in_channels: int = 1, out_channels: int = 256, activation_type: str = 'leakyrelu', dropout_type: str = 'dropout', windowing_type: str = 'hamming', scale_type: str = 'mel')
Bases: AbsPreEncoder
Lightweight Sinc Convolutions.
Instead of using precomputed features, end-to-end speech recognition can also be done directly from raw audio using sinc convolutions, as described in “Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions” by Kürzinger et al. https://arxiv.org/abs/2010.07597
To use sinc convolutions in your model instead of the default f-bank frontend, set this module as your pre-encoder with preencoder: sinc and combine it with the sliding-window frontend by setting frontend: sliding_window in your yaml configuration file. The resulting processing flow is:
Frontend (SlidingWindow) -> SpecAug -> Normalization -> Pre-encoder (LightweightSincConvs) -> Encoder -> Decoder
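A minimal yaml excerpt implementing this flow might look as follows (an illustrative sketch: only the frontend and preencoder keys are taken from the description above; the remaining keys and values are assumptions):

```yaml
frontend: sliding_window
frontend_conf:
    win_length: 400     # matches the D_in=400 expected by the pre-encoder
    hop_length: 160     # assumed hop size
preencoder: sinc
preencoder_conf:
    fs: 16000
    out_channels: 256
```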
Note that this method also performs data augmentation in the time domain (as opposed to the spectral domain, where the default frontend operates). Use plot_sinc_filters.py to visualize the learned sinc filters.
Initialize the module.
- Parameters:
- fs – Sample rate.
- in_channels – Number of input channels.
- out_channels – Number of output channels (for each input channel).
- activation_type – Choice of activation function.
- dropout_type – Choice of dropout function.
- windowing_type – Choice of windowing function.
- scale_type – Choice of filter-bank initialization scale.
espnet_initialization_fn()
Initialize sinc filters with filterbank values.
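The idea behind this initialization can be sketched as follows: each filter is a band-pass built as the difference of two sinc low-pass filters, multiplied by a window; in the actual module the band edges come from a filterbank on the chosen scale (mel by default) and the window type is configurable. This standalone sketch uses hypothetical band edges and a Hamming window, and is not the ESPnet implementation itself:

```python
import math

def sinc_bandpass(f_low, f_high, fs=16000, kernel_size=101):
    """Hamming-windowed band-pass filter from two sinc low-passes.

    Illustrative only: f_low/f_high would come from a mel-scale
    filterbank in the actual initialization; values here are assumptions.
    """
    half = (kernel_size - 1) // 2
    taps = []
    for n in range(-half, half + 1):
        # Difference of two ideal low-pass (sinc) filters yields a band-pass.
        if n == 0:
            bp = 2.0 * (f_high - f_low) / fs
        else:
            x = 2.0 * math.pi * n
            bp = (math.sin(x * f_high / fs) - math.sin(x * f_low / fs)) / (math.pi * n)
        # Hamming window tapers the truncated sinc to reduce spectral leakage.
        w = 0.54 - 0.46 * math.cos(2.0 * math.pi * (n + half) / (kernel_size - 1))
        taps.append(bp * w)
    return taps

filt = sinc_bandpass(300.0, 3000.0)
```

Because only the two cutoff frequencies are free parameters per filter, such filters are far cheaper to learn than unconstrained convolution kernels.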
forward(input: Tensor, input_lengths: Tensor) → Tuple[Tensor, Tensor]
Apply Lightweight Sinc Convolutions.
The input shall be formatted as (B, T, C_in, D_in) with B as batch size, T as time dimension, C_in as channels, and D_in as feature dimension.
The output will then be (B, T, C_out*D_out) with C_out and D_out as output dimensions.
The current module structure only supports D_in=400, which yields D_out=1. For the multichannel case, C_out equals the out_channels given at initialization multiplied by C_in.
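The shape contract above can be sketched with plain bookkeeping (the concrete values below are assumed example inputs, not module defaults):

```python
# Shape bookkeeping for forward(), using assumed example values.
B, T, C_in, D_in = 8, 100, 2, 400  # batch, time, input channels, features
out_channels = 256                  # constructor argument
D_out = 1                           # D_in=400 collapses to D_out=1
C_out = out_channels * C_in         # multichannel case described above
output_shape = (B, T, C_out * D_out)
print(output_shape)  # (8, 100, 512)
```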
gen_lsc_block(in_channels: int, out_channels: int, depthwise_kernel_size: int = 9, depthwise_stride: int = 1, depthwise_groups=None, pointwise_groups=0, dropout_probability: float = 0.15, avgpool=False)
Generate a convolutional block for Lightweight Sinc convolutions.
Each block consists of either a depthwise or a depthwise-separable convolution, together with dropout, a (batch-)normalization layer, and an optional average-pooling layer.
- Parameters:
- in_channels – Number of input channels.
- out_channels – Number of output channels.
- depthwise_kernel_size – Kernel size of the depthwise convolution.
- depthwise_stride – Stride of the depthwise convolution.
- depthwise_groups – Number of groups of the depthwise convolution.
- pointwise_groups – Number of groups of the pointwise convolution.
- dropout_probability – Dropout probability in the block.
- avgpool – If True, an AvgPool layer is inserted.
- Returns: Neural network building block.
- Return type: torch.nn.Sequential
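A simplified stand-in for such a block, assuming a depthwise-separable variant with a leaky-ReLU activation (the exact layer ordering and defaults in ESPnet may differ), could look like:

```python
import torch
from torch import nn

def lsc_block_sketch(in_channels, out_channels, depthwise_kernel_size=9,
                     depthwise_stride=1, dropout_probability=0.15,
                     avgpool=False):
    """Sketch of a lightweight sinc-convolution building block.

    Hypothetical helper, not gen_lsc_block itself: a depthwise conv
    (groups=in_channels) followed by a pointwise 1x1 conv, activation,
    batch norm, dropout, and an optional average-pooling layer.
    """
    layers = [
        nn.Conv1d(in_channels, in_channels, depthwise_kernel_size,
                  stride=depthwise_stride, groups=in_channels),
        nn.Conv1d(in_channels, out_channels, 1),  # pointwise mixing
        nn.LeakyReLU(),
        nn.BatchNorm1d(out_channels),
        nn.Dropout(dropout_probability),
    ]
    if avgpool:
        layers.append(nn.AvgPool1d(2))
    return nn.Sequential(*layers)
```

Setting groups=in_channels on the first convolution makes it depthwise (one kernel per channel); the 1x1 pointwise convolution then mixes channels, which is what makes the pair cheaper than a single full convolution.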
output_size() → int
Get the output size (in_channels * out_channels).