espnet2.gan_svs.vits.phoneme_predictor.PhonemePredictor
class espnet2.gan_svs.vits.phoneme_predictor.PhonemePredictor(vocabs: int, hidden_channels: int = 192, attention_dim: int = 192, attention_heads: int = 2, linear_units: int = 768, blocks: int = 2, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 3, positional_encoding_layer_type: str = 'rel_pos', self_attention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', normalize_before: bool = True, use_macaron_style: bool = False, use_conformer_conv: bool = False, conformer_kernel_size: int = 7, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0)
Bases: Module
Phoneme Predictor module in VISinger.
Initialize PhonemePredictor module.
- Parameters:
- vocabs (int) – The vocabulary size.
- hidden_channels (int) – The number of hidden channels.
- attention_dim (int) – The attention dimension.
- attention_heads (int) – The number of attention heads.
- linear_units (int) – The number of linear units.
- blocks (int) – The number of encoder blocks.
- positionwise_layer_type (str) – The type of position-wise layer.
- positionwise_conv_kernel_size (int) – The size of position-wise convolution kernel.
- positional_encoding_layer_type (str) – The type of positional encoding layer.
- self_attention_layer_type (str) – The type of self-attention layer.
- activation_type (str) – The type of activation function.
- normalize_before (bool) – Whether to apply normalization before the position-wise layer or not.
- use_macaron_style (bool) – Whether to use macaron style or not.
- use_conformer_conv (bool) – Whether to use Conformer convolution or not.
- conformer_kernel_size (int) – The kernel size of the Conformer convolution.
- dropout_rate (float) – The dropout rate.
- positional_dropout_rate (float) – The dropout rate for positional encoding.
- attention_dropout_rate (float) – The dropout rate for attention.
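As a rough illustration of how the constructor arguments fit together, the sketch below collects the defaults from the signature above into a dictionary and merges user overrides into it. The helper names `make_kwargs` and `build_predictor` are hypothetical and not part of ESPnet. The divisibility check assumes the usual multi-head attention constraint (attention_dim split evenly across attention_heads), which this page does not state explicitly.

```python
# Defaults copied from the class signature; `vocabs` has no default.
DEFAULTS = {
    "hidden_channels": 192,
    "attention_dim": 192,
    "attention_heads": 2,
    "linear_units": 768,
    "blocks": 2,
    "positionwise_layer_type": "conv1d",
    "positionwise_conv_kernel_size": 3,
    "positional_encoding_layer_type": "rel_pos",
    "self_attention_layer_type": "rel_selfattn",
    "activation_type": "swish",
    "normalize_before": True,
    "use_macaron_style": False,
    "use_conformer_conv": False,
    "conformer_kernel_size": 7,
    "dropout_rate": 0.1,
    "positional_dropout_rate": 0.0,
    "attention_dropout_rate": 0.0,
}


def make_kwargs(vocabs: int, **overrides) -> dict:
    """Hypothetical helper: merge overrides into the documented defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise TypeError(f"unknown arguments: {sorted(unknown)}")
    kwargs = {"vocabs": vocabs, **DEFAULTS, **overrides}
    # Assumption: attention_dim must split evenly across the heads.
    if kwargs["attention_dim"] % kwargs["attention_heads"] != 0:
        raise ValueError("attention_dim must be divisible by attention_heads")
    return kwargs


def build_predictor(vocabs: int, **overrides):
    """Hypothetical helper: build the module (requires ESPnet to be installed)."""
    # Imported lazily so the configuration helper works without ESPnet.
    from espnet2.gan_svs.vits.phoneme_predictor import PhonemePredictor

    return PhonemePredictor(**make_kwargs(vocabs, **overrides))
```

For example, `build_predictor(40, blocks=4)` would request a predictor over 40 vocabulary entries with four encoder blocks and every other argument left at its default.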
forward(x, x_mask)
Perform forward propagation.
- Parameters:
- x (Tensor) – The input tensor of shape (B, dim, length).
- x_mask (Tensor) – The mask tensor for the input tensor of shape (B, length).
- Returns: The predicted phoneme tensor of shape (length, B, vocab_size), where vocab_size corresponds to vocabs.
- Return type: Tensor
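The shapes documented for forward can be traced with a small stand-in. The sketch below does not call PhonemePredictor itself: a 1x1 `Conv1d` replaces the real encoder purely to reproduce the (B, dim, length) to (length, B, vocab_size) layout. The time-first output layout matches what `torch.nn.functional.ctc_loss` expects for `log_probs`. That the real output consists of log-probabilities meant for a CTC loss, and that the blank index is 0, are assumptions not stated on this page.

```python
import torch
import torch.nn.functional as F

B, dim, length, vocab_size = 2, 192, 50, 40

x = torch.randn(B, dim, length)   # input, shape (B, dim, length)
x_mask = torch.ones(B, length)    # mask, shape (B, length) as documented
x_mask[1, 30:] = 0                # second item has only 30 valid frames

# Stand-in for the predictor (NOT the real module): project dim -> vocab_size.
stand_in = torch.nn.Conv1d(dim, vocab_size, kernel_size=1)
logits = stand_in(x * x_mask.unsqueeze(1))               # (B, vocab_size, length)
log_probs = logits.permute(2, 0, 1).log_softmax(dim=-1)  # (length, B, vocab_size)

# Assumed downstream use: CTC loss against padded phoneme targets.
input_lengths = x_mask.sum(dim=1).long()          # valid frames per item
targets = torch.randint(1, vocab_size, (B, 10))   # index 0 reserved for blank
target_lengths = torch.tensor([10, 8])
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
```

The per-item frame counts derived from the mask serve as `input_lengths`, so padded frames of the shorter sequence do not contribute to the loss.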