espnet2.asr_transducer.encoder.blocks.conv1d.Conv1d
class espnet2.asr_transducer.encoder.blocks.conv1d.Conv1d(input_size: int, output_size: int, kernel_size: int | Tuple, stride: int | Tuple = 1, dilation: int | Tuple = 1, groups: int | Tuple = 1, bias: bool = True, batch_norm: bool = False, relu: bool = True, causal: bool = False, dropout_rate: float = 0.0)
Bases: Module
Conv1d module definition.
- Parameters:
- input_size – Input dimension.
- output_size – Output dimension.
- kernel_size – Size of the convolving kernel.
- stride – Stride of the convolution.
- dilation – Spacing between the kernel points.
- groups – Number of blocked connections from input channels to output channels.
- bias – Whether to add a learnable bias to the output.
- batch_norm – Whether to use batch normalization after convolution.
- relu – Whether to use a ReLU activation after convolution.
- causal – Whether to use causal convolution (set to True if streaming).
- dropout_rate – Dropout rate.
Construct a Conv1d object.
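A minimal construction sketch, assuming espnet2 and torch are installed (all sizes below are illustrative, not prescribed):

```python
from espnet2.asr_transducer.encoder.blocks.conv1d import Conv1d

# Non-causal block: project 80-dim features to 256 dims and
# subsample the time axis with stride=2.
conv = Conv1d(
    input_size=80,
    output_size=256,
    kernel_size=3,
    stride=2,
    batch_norm=True,
    relu=True,
    dropout_rate=0.1,
)

# Causal variant for streaming: the convolution sees no future frames.
causal_conv = Conv1d(80, 256, kernel_size=3, causal=True)
```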
chunk_forward(x: Tensor, pos_enc: Tensor, mask: Tensor, left_context: int = 0) → Tuple[Tensor, Tensor]
Encode a chunk of the input sequence.
- Parameters:
- x – Conv1d input sequences. (B, T, D_in)
- pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_in)
- mask – Source mask. (B, T)
- left_context – Number of previous frames the attention module can see in the current chunk (not used here).
- Returns:
- x – Conv1d output sequences. (B, T, D_out)
- pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_out)
- Return type: Tuple[Tensor, Tensor]
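A single-chunk streaming sketch with illustrative shapes; the cache must be initialized via reset_streaming_cache (described below) before the first chunk:

```python
import torch

from espnet2.asr_transducer.encoder.blocks.conv1d import Conv1d

conv = Conv1d(input_size=80, output_size=80, kernel_size=3, causal=True)
conv.eval()

# Initialize the streaming cache before the first chunk.
conv.reset_streaming_cache(left_context=0, device=torch.device("cpu"))

chunk = torch.randn(1, 8, 80)              # (B, T, D_in)
pos_enc = torch.randn(1, 2 * (8 - 1), 80)  # (B, 2 * (T - 1), D_in)
mask = torch.ones(1, 8, dtype=torch.bool)  # (B, T)

with torch.no_grad():
    out, pos_enc_out = conv.chunk_forward(chunk, pos_enc, mask)
print(out.shape)  # (B, T, D_out)
```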
create_new_mask(mask: Tensor) → Tensor
Create new mask for output sequences.
- Parameters: mask – Mask of input sequences. (B, T)
- Returns: Mask of output sequences. (B, sub(T))
- Return type: Tensor
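For instance (a sketch with illustrative sizes), a stride-2 block shortens the mask to the subsampled length sub(T):

```python
import torch

from espnet2.asr_transducer.encoder.blocks.conv1d import Conv1d

conv = Conv1d(input_size=80, output_size=80, kernel_size=3, stride=2)

mask = torch.ones(2, 16, dtype=torch.bool)  # (B, T)
new_mask = conv.create_new_mask(mask)
print(new_mask.shape)  # (B, sub(T)), roughly T / stride frames
```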
create_new_pos_enc(pos_enc: Tensor) → Tensor
Create new positional embedding vector.
- Parameters: pos_enc – Input sequences positional embedding. (B, 2 * (T - 1), D_in)
- Returns: Output sequences positional embedding. (B, 2 * (sub(T) - 1), D_in)
- Return type: Tensor
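A shape-level sketch (illustrative sizes); only the time axis is subsampled, the feature dimension stays D_in:

```python
import torch

from espnet2.asr_transducer.encoder.blocks.conv1d import Conv1d

conv = Conv1d(input_size=80, output_size=80, kernel_size=3, stride=2)

pos_enc = torch.randn(2, 2 * (16 - 1), 80)  # (B, 2 * (T - 1), D_in)
new_pos_enc = conv.create_new_pos_enc(pos_enc)
print(new_pos_enc.shape)  # (B, 2 * (sub(T) - 1), D_in)
```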
forward(x: Tensor, pos_enc: Tensor, mask: Tensor | None = None, chunk_mask: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor]
Encode input sequences.
- Parameters:
- x – Conv1d input sequences. (B, T, D_in)
- pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_in)
- mask – Source mask. (B, T)
- chunk_mask – Chunk mask. (T_2, T_2)
- Returns:
- x – Conv1d output sequences. (B, sub(T), D_out)
- mask – Source mask. (B, T) or (B, sub(T))
- pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_att) or (B, 2 * (sub(T) - 1), D_out)
- Return type: Tuple[Tensor, Tensor, Tensor]
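An offline (non-streaming) sketch tying the shapes together, with illustrative sizes:

```python
import torch

from espnet2.asr_transducer.encoder.blocks.conv1d import Conv1d

conv = Conv1d(input_size=80, output_size=256, kernel_size=3, stride=2)
conv.eval()

x = torch.randn(4, 16, 80)                  # (B, T, D_in)
pos_enc = torch.randn(4, 2 * (16 - 1), 80)  # (B, 2 * (T - 1), D_in)
mask = torch.ones(4, 16, dtype=torch.bool)  # (B, T)

with torch.no_grad():
    x_out, mask_out, pos_enc_out = conv(x, pos_enc, mask=mask)
print(x_out.shape)        # (B, sub(T), D_out)
print(mask_out.shape)     # (B, sub(T))
print(pos_enc_out.shape)  # positional embedding subsampled along time
```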
reset_streaming_cache(left_context: int, device: device) → None
Initialize/Reset Conv1d cache for streaming.
- Parameters:
- left_context – Number of previous frames the attention module can see in the current chunk (not used here).
- device – Device to use for cache tensor.
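The cache is typically reset once per utterance so that chunks from different utterances never mix, e.g. (a sketch with illustrative sizes):

```python
import torch

from espnet2.asr_transducer.encoder.blocks.conv1d import Conv1d

conv = Conv1d(input_size=80, output_size=80, kernel_size=3, causal=True)
conv.eval()
device = torch.device("cpu")

for _utt in range(2):  # two hypothetical utterances
    conv.reset_streaming_cache(left_context=0, device=device)
    for _ in range(3):  # stream three chunks per utterance
        chunk = torch.randn(1, 8, 80)
        pos_enc = torch.randn(1, 2 * (8 - 1), 80)
        mask = torch.ones(1, 8, dtype=torch.bool)
        with torch.no_grad():
            out, pos_enc_out = conv.chunk_forward(chunk, pos_enc, mask)
```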