espnet2.asr_transducer.encoder.blocks.conv1d.Conv1d
class espnet2.asr_transducer.encoder.blocks.conv1d.Conv1d(input_size: int, output_size: int, kernel_size: int | Tuple, stride: int | Tuple = 1, dilation: int | Tuple = 1, groups: int | Tuple = 1, bias: bool = True, batch_norm: bool = False, relu: bool = True, causal: bool = False, dropout_rate: float = 0.0)
Bases: Module
Conv1d module definition.
- Parameters:
- input_size – Input dimension.
- output_size – Output dimension.
- kernel_size – Size of the convolving kernel.
- stride – Stride of the convolution.
- dilation – Spacing between the kernel points.
- groups – Number of blocked connections from input channels to output channels.
- bias – Whether to add a learnable bias to the output.
- batch_norm – Whether to use batch normalization after convolution.
- relu – Whether to use a ReLU activation after convolution.
- causal – Whether to use causal convolution (set to True if streaming).
- dropout_rate – Dropout rate.
Construct a Conv1d object.
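A minimal construction sketch, assuming espnet2 and torch are installed (all sizes below are illustrative, not prescribed):

```python
from espnet2.asr_transducer.encoder.blocks.conv1d import Conv1d

# Non-causal block: project 80-dim features to 256 dims and
# subsample the time axis with stride=2.
conv = Conv1d(
    input_size=80,
    output_size=256,
    kernel_size=3,
    stride=2,
    batch_norm=True,
    relu=True,
    dropout_rate=0.1,
)

# Causal variant for streaming: the convolution sees no future frames.
causal_conv = Conv1d(80, 256, kernel_size=3, causal=True)
```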
chunk_forward(x: Tensor, pos_enc: Tensor, mask: Tensor, left_context: int = 0) → Tuple[Tensor, Tensor]
Encode a chunk of the input sequence.
- Parameters:
- x – Conv1d input sequences. (B, T, D_in)
- pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_in)
- mask – Source mask. (B, T)
- left_context – Number of previous frames the attention module can see in the current chunk (not used here).
- Returns:
- x – Conv1d output sequences. (B, T, D_out)
- pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_out)
- Return type: Tuple[Tensor, Tensor]
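A single-chunk streaming sketch with illustrative shapes; the cache must be initialized via reset_streaming_cache (described below) before the first chunk:

```python
import torch

from espnet2.asr_transducer.encoder.blocks.conv1d import Conv1d

conv = Conv1d(input_size=80, output_size=80, kernel_size=3, causal=True)
conv.eval()

# Initialize the streaming cache before the first chunk.
conv.reset_streaming_cache(left_context=0, device=torch.device("cpu"))

chunk = torch.randn(1, 8, 80)              # (B, T, D_in)
pos_enc = torch.randn(1, 2 * (8 - 1), 80)  # (B, 2 * (T - 1), D_in)
mask = torch.ones(1, 8, dtype=torch.bool)  # (B, T)

with torch.no_grad():
    out, pos_enc_out = conv.chunk_forward(chunk, pos_enc, mask)
print(out.shape)  # (B, T, D_out)
```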
create_new_mask(mask: Tensor) → Tensor
Create new mask for output sequences.
- Parameters: mask – Mask of input sequences. (B, T)
- Returns: Mask of output sequences. (B, sub(T))
- Return type: Tensor
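For instance (a sketch with illustrative sizes), a stride-2 block shortens the mask to the subsampled length sub(T):

```python
import torch

from espnet2.asr_transducer.encoder.blocks.conv1d import Conv1d

conv = Conv1d(input_size=80, output_size=80, kernel_size=3, stride=2)

mask = torch.ones(2, 16, dtype=torch.bool)  # (B, T)
new_mask = conv.create_new_mask(mask)
print(new_mask.shape)  # (B, sub(T)), roughly T / stride frames
```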
create_new_pos_enc(pos_enc: Tensor) → Tensor
Create new positional embedding vector.
- Parameters: pos_enc – Input sequences positional embedding. (B, 2 * (T - 1), D_in)
- Returns: Output sequences positional embedding. (B, 2 * (sub(T) - 1), D_in)
- Return type: Tensor
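A shape-level sketch (illustrative sizes); only the time axis is subsampled, the feature dimension stays D_in:

```python
import torch

from espnet2.asr_transducer.encoder.blocks.conv1d import Conv1d

conv = Conv1d(input_size=80, output_size=80, kernel_size=3, stride=2)

pos_enc = torch.randn(2, 2 * (16 - 1), 80)  # (B, 2 * (T - 1), D_in)
new_pos_enc = conv.create_new_pos_enc(pos_enc)
print(new_pos_enc.shape)  # (B, 2 * (sub(T) - 1), D_in)
```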
forward(x: Tensor, pos_enc: Tensor, mask: Tensor | None = None, chunk_mask: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor]
Encode input sequences.
- Parameters:
- x – Conv1d input sequences. (B, T, D_in)
- pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_in)
- mask – Source mask. (B, T)
- chunk_mask – Chunk mask. (T_2, T_2)
- Returns:
- x – Conv1d output sequences. (B, sub(T), D_out)
- mask – Source mask. (B, T) or (B, sub(T))
- pos_enc – Positional embedding sequences. (B, 2 * (T - 1), D_att) or (B, 2 * (sub(T) - 1), D_out)
- Return type: Tuple[Tensor, Tensor, Tensor]
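An offline (non-streaming) sketch tying the shapes together, with illustrative sizes:

```python
import torch

from espnet2.asr_transducer.encoder.blocks.conv1d import Conv1d

conv = Conv1d(input_size=80, output_size=256, kernel_size=3, stride=2)
conv.eval()

x = torch.randn(4, 16, 80)                  # (B, T, D_in)
pos_enc = torch.randn(4, 2 * (16 - 1), 80)  # (B, 2 * (T - 1), D_in)
mask = torch.ones(4, 16, dtype=torch.bool)  # (B, T)

with torch.no_grad():
    x_out, mask_out, pos_enc_out = conv(x, pos_enc, mask=mask)
print(x_out.shape)        # (B, sub(T), D_out)
print(mask_out.shape)     # (B, sub(T))
print(pos_enc_out.shape)  # positional embedding subsampled along time
```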
reset_streaming_cache(left_context: int, device: device) → None
Initialize/Reset Conv1d cache for streaming.
- Parameters:
- left_context – Number of previous frames the attention module can see in the current chunk (not used here).
- device – Device to use for cache tensor.
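The cache is typically reset once per utterance so that chunks from different utterances never mix, e.g. (a sketch with illustrative sizes):

```python
import torch

from espnet2.asr_transducer.encoder.blocks.conv1d import Conv1d

conv = Conv1d(input_size=80, output_size=80, kernel_size=3, causal=True)
conv.eval()
device = torch.device("cpu")

for _utt in range(2):  # two hypothetical utterances
    conv.reset_streaming_cache(left_context=0, device=device)
    for _ in range(3):  # stream three chunks per utterance
        chunk = torch.randn(1, 8, 80)
        pos_enc = torch.randn(1, 2 * (8 - 1), 80)
        mask = torch.ones(1, 8, dtype=torch.bool)
        with torch.no_grad():
            out, pos_enc_out = conv.chunk_forward(chunk, pos_enc, mask)
```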