espnet2.asr.encoder.hubert_encoder.TorchAudioHuBERTPretrainEncoder
class espnet2.asr.encoder.hubert_encoder.TorchAudioHuBERTPretrainEncoder(input_size: int | None = None, extractor_mode: str = 'group_norm', extractor_conv_layer_config: List[List[int]] | None = [[512, 10, 5], [512, 3, 2], [512, 3, 2], [512, 3, 2], [512, 3, 2], [512, 2, 2], [512, 2, 2]], extractor_conv_bias: bool = False, encoder_embed_dim: int = 768, encoder_projection_dropout: float = 0.1, encoder_pos_conv_kernel: int = 128, encoder_pos_conv_groups: int = 16, encoder_num_layers: int = 12, encoder_num_heads: int = 12, encoder_attention_dropout: float = 0.1, encoder_ff_interm_features: int = 3072, encoder_ff_interm_dropout: float = 0.0, encoder_dropout: float = 0.1, encoder_layer_norm_first: bool = False, encoder_layer_drop: float = 0.05, mask_prob: float = 0.8, mask_selection: str = 'static', mask_other: float = 0.0, mask_length: int = 10, no_mask_overlap: bool = False, mask_min_space: int = 1, mask_channel_prob: float = 0.0, mask_channel_selection: str = 'static', mask_channel_other: float = 0.0, mask_channel_length: int = 10, no_mask_channel_overlap: bool = False, mask_channel_min_space: int = 1, skip_masked: bool = False, skip_nomask: bool = False, num_classes: int = 100, final_dim: int = 256, feature_grad_mult: float | None = 0.1, finetuning: bool = False, freeze_encoder_updates: int = 0)
Bases: AbsEncoder
TorchAudio HuBERT encoder module.
- Parameters:
- extractor_mode – Operation mode of feature extractor. Valid values are “group_norm” or “layer_norm”.
- extractor_conv_layer_config – Configuration of convolution layers in feature extractor. List of convolution configuration, i.e. [[output_channel, kernel_size, stride], …]
- extractor_conv_bias – Whether to include bias term to each convolution operation.
- encoder_embed_dim – The dimension of embedding in encoder.
- encoder_projection_dropout – The dropout probability applied after the input feature is projected to “encoder_embed_dim”.
- encoder_pos_conv_kernel – Kernel size of convolutional positional embeddings.
- encoder_pos_conv_groups – Number of groups of convolutional positional embeddings.
- encoder_num_layers – Number of self attention layers in transformer block.
- encoder_num_heads – Number of heads in self attention layers.
- encoder_attention_dropout – Dropout probability applied after softmax in self-attention layer.
- encoder_ff_interm_features – Dimension of hidden features in feed forward layer.
- encoder_ff_interm_dropout – Dropout probability applied in feedforward layer.
- encoder_dropout – Dropout probability applied at the end of feed forward layer.
- encoder_layer_norm_first – Control the order of layer norm in transformer layer and each encoder layer. If True, in transformer layer, layer norm is applied before features are fed to encoder layers.
- encoder_layer_drop – Probability to drop each encoder layer during training.
- mask_prob – Probability for each token to be chosen as start of the span to be masked.
- mask_selection – How to choose the mask length. Options: [static, uniform, normal, poisson].
- mask_other – Secondary mask argument (used for more complex distributions).
- mask_length – The lengths of the mask.
- no_mask_overlap – Whether to prevent masked spans from overlapping. If False, spans are allowed to overlap.
- mask_min_space – Minimum space between spans (if no overlap is enabled).
- mask_channel_prob – The probability of replacing a feature with 0.
- mask_channel_selection – How to choose the mask length for channel masking. Options: [static, uniform, normal, poisson].
- mask_channel_other – Secondary mask argument for channel masking (used for more complex distributions).
- mask_channel_length – The length of each mask span for channel masking.
- no_mask_channel_overlap – Whether to allow channel masks to overlap.
- mask_channel_min_space – Minimum space between spans for channel masking (if no overlap is enabled).
- skip_masked – If True, skip computing losses over masked frames.
- skip_nomask – If True, skip computing losses over unmasked frames.
- num_classes – The number of classes in the labels.
- final_dim – Project final representations and targets to final_dim.
- feature_grad_mult – The factor to scale the convolutional feature extraction layer gradients by. The scale factor will not affect the forward pass.
- finetuning – Whether to fine-tune the model on ASR or other downstream tasks.
- freeze_encoder_updates – The number of steps to freeze the encoder parameters in ASR finetuning.
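The default extractor_conv_layer_config determines how many frames the feature extractor produces per second of audio. The following is a minimal sketch of that arithmetic, assuming unpadded 1-D convolutions with dilation 1. The helper names `num_output_frames` and `total_stride` are hypothetical and are not part of ESPnet or torchaudio.

```python
# Default extractor_conv_layer_config from the class signature.
# Each entry is [output_channel, kernel_size, stride].
DEFAULT_CONV_LAYERS = [
    [512, 10, 5],
    [512, 3, 2],
    [512, 3, 2],
    [512, 3, 2],
    [512, 3, 2],
    [512, 2, 2],
    [512, 2, 2],
]


def num_output_frames(num_samples, conv_layers=DEFAULT_CONV_LAYERS):
    """Frame count after the conv stack (assumes no padding, dilation 1)."""
    length = num_samples
    for _channels, kernel, stride in conv_layers:
        if length < kernel:
            return 0  # input too short for this layer
        length = (length - kernel) // stride + 1
    return length


def total_stride(conv_layers=DEFAULT_CONV_LAYERS):
    """Product of all strides, i.e. the overall downsampling factor."""
    product = 1
    for _channels, _kernel, stride in conv_layers:
        product *= stride
    return product


if __name__ == "__main__":
    print(total_stride())            # overall downsampling factor
    print(num_output_frames(16000))  # frames for 16000 input samples
```

Under these assumptions the default stack downsamples by a factor of 320, so 16000 input samples yield 49 frames.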
Hubert specific Args: Please refer to: https://pytorch.org/audio/stable/generated/torchaudio.models.hubert_pretrain_model.html#torchaudio.models.hubert_pretrain_model
Initializes internal Module state, shared by both nn.Module and ScriptModule.
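To illustrate how mask_prob and mask_length interact under the "static" selection mode, here is a simplified, pure-Python sketch of span masking over time frames. It assumes the common wav2vec 2.0-style convention in which the number of spans is roughly mask_prob * num_frames / mask_length, and it allows spans to overlap. The function `sample_static_span_mask` is hypothetical; it is not the torchaudio implementation and ignores mask_other, no_mask_overlap, and mask_min_space.

```python
import random


def sample_static_span_mask(num_frames, mask_prob=0.8, mask_length=10, seed=0):
    """Return a list of booleans, True where a frame is masked.

    Assumption: every span has the fixed length `mask_length` ("static"),
    and the span count is about mask_prob * num_frames / mask_length.
    """
    rng = random.Random(seed)
    mask = [False] * num_frames
    if num_frames < mask_length:
        return mask  # sequence too short to place a full span

    # Probabilistic rounding of the expected number of spans.
    num_spans = int(mask_prob * num_frames / mask_length + rng.random())
    max_start = num_frames - mask_length
    num_spans = min(num_spans, max_start + 1)

    # Distinct start positions; the spans themselves may still overlap.
    starts = rng.sample(range(max_start + 1), num_spans)
    for start in starts:
        for t in range(start, start + mask_length):
            mask[t] = True
    return mask


if __name__ == "__main__":
    mask = sample_static_span_mask(100)
    print(sum(mask), "of", len(mask), "frames masked")
```

Because spans may overlap, the masked fraction is usually below mask_prob.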
forward(xs_pad: Tensor, ilens: Tensor, ys_pad: Tensor | None = None, ys_pad_length: Tensor | None = None, prev_states: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor | None]
Forward Hubert Pretrain Encoder.
- Parameters:
- xs_pad – input tensor (B, L, D)
- ilens – input length (B)
- prev_states – Not to be used now.
- Returns: position embedded tensor and mask
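The xs_pad and ilens arguments follow the usual padded-batch convention: variable-length inputs are padded to a common length, and the true lengths are passed alongside. A minimal sketch using plain Python lists, with the feature dimension D omitted for simplicity; `pad_batch` is a hypothetical helper, not an ESPnet function.

```python
def pad_batch(waveforms, pad_value=0.0):
    """Pad variable-length sequences to a common length.

    Returns (xs_pad, ilens): xs_pad has shape (B, L_max) as nested lists,
    and ilens holds the original length of each sequence.
    """
    ilens = [len(w) for w in waveforms]
    max_len = max(ilens)
    xs_pad = [list(w) + [pad_value] * (max_len - len(w)) for w in waveforms]
    return xs_pad, ilens


if __name__ == "__main__":
    xs_pad, ilens = pad_batch([[0.1, 0.2, 0.3], [0.4]])
    print(xs_pad, ilens)
```

In practice both values would be converted to tensors before calling forward.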
output_size() → int
reload_pretrained_parameters()