espnet2.gan_svs.vits.text_encoder.TextEncoder
class espnet2.gan_svs.vits.text_encoder.TextEncoder(vocabs: int, attention_dim: int = 192, attention_heads: int = 2, linear_units: int = 768, blocks: int = 6, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 3, positional_encoding_layer_type: str = 'rel_pos', self_attention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', normalize_before: bool = True, use_macaron_style: bool = False, use_conformer_conv: bool = False, conformer_kernel_size: int = 7, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0, use_slur=True)
Bases: Module
Text encoder module in VISinger.
This is a module of the text encoder described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.
Instead of the relative positional Transformer, we use the Conformer architecture as the encoder module, which contains additional convolution layers.
Initialize TextEncoder module.
- Parameters:
- vocabs (int) – Vocabulary size.
- attention_dim (int) – Attention dimension.
- attention_heads (int) – Number of attention heads.
- linear_units (int) – Number of linear units of positionwise layers.
- blocks (int) – Number of encoder blocks.
- positionwise_layer_type (str) – Positionwise layer type.
- positionwise_conv_kernel_size (int) – Positionwise layer’s kernel size.
- positional_encoding_layer_type (str) – Positional encoding layer type.
- self_attention_layer_type (str) – Self-attention layer type.
- activation_type (str) – Activation function type.
- normalize_before (bool) – Whether to apply LayerNorm before attention.
- use_macaron_style (bool) – Whether to use macaron style components.
- use_conformer_conv (bool) – Whether to use conformer conv layers.
- conformer_kernel_size (int) – Conformer’s conv kernel size.
- dropout_rate (float) – Dropout rate.
- positional_dropout_rate (float) – Dropout rate for positional encoding.
- attention_dropout_rate (float) – Dropout rate for attention.
- use_slur (bool) – Whether to use slur embedding.
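A minimal instantiation sketch, assuming ESPnet is installed; the vocabulary size and the non-default options shown below are illustrative placeholders, not values from any official recipe.

```python
import torch

from espnet2.gan_svs.vits.text_encoder import TextEncoder

# Minimal sketch: build the encoder with mostly default settings.
# vocabs=100 is a placeholder vocabulary size for illustration only.
encoder = TextEncoder(
    vocabs=100,
    attention_dim=192,
    attention_heads=2,
    blocks=6,
    use_conformer_conv=True,   # enable the conformer conv module
    conformer_kernel_size=7,
    use_slur=True,             # expect a slur tensor in forward()
)
```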
forward(phone: Tensor, phone_lengths: Tensor, midi_id: Tensor, dur: Tensor, slur: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor, Tensor]
Calculate forward propagation.
Parameters:
- phone (Tensor) – Input index tensor (B, T_text).
- phone_lengths (Tensor) – Length tensor (B,).
- midi_id (Tensor) – Input midi tensor (B, T_text).
- dur (Tensor) – Input duration tensor (B, T_text).
- slur (Tensor, optional) – Input slur tensor (B, T_text). Used when use_slur=True.
Returns:
- Tensor: Encoded hidden representation (B, attention_dim, T_text).
- Tensor: Mask tensor for padded parts (B, 1, T_text).
- Tensor: Encoded hidden representation for duration (B, attention_dim, T_text).
- Tensor: Encoded hidden representation for pitch (B, attention_dim, T_text).

Return type: Tuple[Tensor, Tensor, Tensor, Tensor]
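The sketch below illustrates a forward pass with dummy inputs. The value ranges and dtypes (e.g., the MIDI index range and whether durations are integer counts or float values) are assumptions and may differ in the installed ESPnet version; the output variable names are arbitrary and simply follow the order of the returned tuple.

```python
import torch

from espnet2.gan_svs.vits.text_encoder import TextEncoder

# Rough sketch with dummy inputs; ranges and dtypes below are assumptions.
encoder = TextEncoder(vocabs=100, use_slur=True)

B, T_text = 2, 16
phone = torch.randint(0, 100, (B, T_text))          # phoneme indices (B, T_text)
phone_lengths = torch.tensor([T_text, T_text - 4])  # valid lengths (B,)
midi_id = torch.randint(0, 128, (B, T_text))        # assumed MIDI note indices (B, T_text)
dur = torch.rand(B, T_text)                         # assumed duration values (B, T_text)
slur = torch.randint(0, 2, (B, T_text))             # slur flags (B, T_text)

x, x_mask, dur_hidden, pitch_hidden = encoder(phone, phone_lengths, midi_id, dur, slur)
print(x.shape)       # (B, attention_dim, T_text)
print(x_mask.shape)  # (B, 1, T_text)
```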