espnet2.tts.transformer.transformer.Transformer
class espnet2.tts.transformer.transformer.Transformer(idim: int, odim: int, embed_dim: int = 512, eprenet_conv_layers: int = 3, eprenet_conv_chans: int = 256, eprenet_conv_filts: int = 5, dprenet_layers: int = 2, dprenet_units: int = 256, elayers: int = 6, eunits: int = 1024, adim: int = 512, aheads: int = 4, dlayers: int = 6, dunits: int = 1024, postnet_layers: int = 5, postnet_chans: int = 256, postnet_filts: int = 5, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 1, use_scaled_pos_enc: bool = True, use_batch_norm: bool = True, encoder_normalize_before: bool = True, decoder_normalize_before: bool = True, encoder_concat_after: bool = False, decoder_concat_after: bool = False, reduction_factor: int = 1, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, spk_embed_integration_type: str = 'add', use_gst: bool = False, gst_tokens: int = 10, gst_heads: int = 4, gst_conv_layers: int = 6, gst_conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), gst_conv_kernel_size: int = 3, gst_conv_stride: int = 2, gst_gru_layers: int = 1, gst_gru_units: int = 128, transformer_enc_dropout_rate: float = 0.1, transformer_enc_positional_dropout_rate: float = 0.1, transformer_enc_attn_dropout_rate: float = 0.1, transformer_dec_dropout_rate: float = 0.1, transformer_dec_positional_dropout_rate: float = 0.1, transformer_dec_attn_dropout_rate: float = 0.1, transformer_enc_dec_attn_dropout_rate: float = 0.1, eprenet_dropout_rate: float = 0.5, dprenet_dropout_rate: float = 0.5, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_masking: bool = False, use_weighted_masking: bool = False, bce_pos_weight: float = 5.0, loss_type: str = 'L1', use_guided_attn_loss: bool = True, num_heads_applied_guided_attn: int = 2, num_layers_applied_guided_attn: int = 2, modules_applied_guided_attn: Sequence[str] = 'encoder-decoder', guided_attn_loss_sigma: float = 0.4, guided_attn_loss_lambda: float = 1.0)
Bases: AbsTTS
Transformer-TTS module.
This is a module of the text-to-speech Transformer described in Neural Speech Synthesis with Transformer Network, which converts a sequence of tokens into a sequence of Mel-filterbanks.
Initialize Transformer module.
- Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- embed_dim (int) – Dimension of character embedding.
- eprenet_conv_layers (int) – Number of encoder prenet convolution layers.
- eprenet_conv_chans (int) – Number of encoder prenet convolution channels.
- eprenet_conv_filts (int) – Filter size of encoder prenet convolution.
- dprenet_layers (int) – Number of decoder prenet layers.
- dprenet_units (int) – Number of decoder prenet hidden units.
- elayers (int) – Number of encoder layers.
- eunits (int) – Number of encoder hidden units.
- adim (int) – Number of attention transformation dimensions.
- aheads (int) – Number of heads for multi head attention.
- dlayers (int) – Number of decoder layers.
- dunits (int) – Number of decoder hidden units.
- postnet_layers (int) – Number of postnet layers.
- postnet_chans (int) – Number of postnet channels.
- postnet_filts (int) – Filter size of postnet.
- use_scaled_pos_enc (bool) – Whether to use trainable scaled pos encoding.
- use_batch_norm (bool) – Whether to use batch normalization in encoder prenet.
- encoder_normalize_before (bool) – Whether to apply layernorm layer before encoder block.
- decoder_normalize_before (bool) – Whether to apply layernorm layer before decoder block.
- encoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in encoder.
- decoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in decoder.
- positionwise_layer_type (str) – Position-wise operation type.
- positionwise_conv_kernel_size (int) – Kernel size in position wise conv 1d.
- reduction_factor (int) – Reduction factor.
- spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.
- langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
- spk_embed_integration_type (str) – How to integrate speaker embedding.
- use_gst (bool) – Whether to use global style token.
- gst_tokens (int) – Number of GST embeddings.
- gst_heads (int) – Number of heads in GST multihead attention.
- gst_conv_layers (int) – Number of conv layers in GST.
- gst_conv_chans_list (Sequence[int]) – List of the numbers of channels of conv layers in GST.
- gst_conv_kernel_size (int) – Kernel size of conv layers in GST.
- gst_conv_stride (int) – Stride size of conv layers in GST.
- gst_gru_layers (int) – Number of GRU layers in GST.
- gst_gru_units (int) – Number of GRU units in GST.
- transformer_enc_dropout_rate (float) – Dropout rate in encoder except attention and positional encoding.
- transformer_enc_positional_dropout_rate (float) – Dropout rate after encoder positional encoding.
- transformer_enc_attn_dropout_rate (float) – Dropout rate in encoder self-attention module.
- transformer_dec_dropout_rate (float) – Dropout rate in decoder except attention & positional encoding.
- transformer_dec_positional_dropout_rate (float) – Dropout rate after decoder positional encoding.
- transformer_dec_attn_dropout_rate (float) – Dropout rate in decoder self-attention module.
- transformer_enc_dec_attn_dropout_rate (float) – Dropout rate in source attention module.
- init_type (str) – How to initialize transformer parameters.
- init_enc_alpha (float) – Initial value of alpha in scaled pos encoding of the encoder.
- init_dec_alpha (float) – Initial value of alpha in scaled pos encoding of the decoder.
- eprenet_dropout_rate (float) – Dropout rate in encoder prenet.
- dprenet_dropout_rate (float) – Dropout rate in decoder prenet.
- postnet_dropout_rate (float) – Dropout rate in postnet.
- use_masking (bool) – Whether to apply masking for padded part in loss calculation.
- use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
- bce_pos_weight (float) – Positive sample weight in BCE calculation (only for use_masking=True).
- loss_type (str) – How to calculate loss.
- use_guided_attn_loss (bool) – Whether to use guided attention loss.
- num_heads_applied_guided_attn (int) – Number of heads in each layer to apply guided attention loss.
- num_layers_applied_guided_attn (int) – Number of layers to apply guided attention loss.
- modules_applied_guided_attn (Sequence[str]) – List of module names to apply guided attention loss.
- guided_attn_loss_sigma (float) – Sigma in guided attention loss.
- guided_attn_loss_lambda (float) – Lambda in guided attention loss.
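A minimal construction sketch (not taken from the ESPnet docs): it assumes hypothetical sizes of 50 token types for idim and 80 Mel bins for odim, with all other hyperparameters left at their defaults.

```python
import torch

from espnet2.tts.transformer.transformer import Transformer

# Hypothetical sizes: a 50-symbol token vocabulary and 80-dim Mel-filterbanks.
model = Transformer(idim=50, odim=80)
```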
forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, joint_training: bool = False) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Calculate forward propagation.
- Parameters:
- text (LongTensor) – Batch of padded character ids (B, Tmax).
- text_lengths (LongTensor) – Batch of lengths of each input (B,).
- feats (Tensor) – Batch of padded target features (B, Lmax, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
- lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- joint_training (bool) – Whether to perform joint training with vocoder.
- Returns: Tuple of the loss scalar value, a dict of statistics to be monitored, and a weight value (or the model outputs when joint_training=True).
- Return type: Tuple[Tensor, Dict[str, Tensor], Tensor]
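A hedged sketch of a single training step, assuming the model instance from the construction example above and a hypothetical dummy batch (2 utterances, up to 10 tokens and 100 frames):

```python
import torch

# Dummy batch; token id 0 is reserved for padding, so ids start at 1.
text = torch.randint(1, 50, (2, 10), dtype=torch.long)
text_lengths = torch.tensor([10, 8], dtype=torch.long)
feats = torch.randn(2, 100, 80)  # (B, Lmax, odim)
feats_lengths = torch.tensor([100, 90], dtype=torch.long)

loss, stats, weight = model(text, text_lengths, feats, feats_lengths)
loss.backward()  # an optimizer step would follow in a real training loop
```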
inference(text: Tensor, feats: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_teacher_forcing: bool = False) → Dict[str, Tensor]
Generate the sequence of features given the sequence of characters.
- Parameters:
- text (LongTensor) – Input sequence of characters (T_text,).
- feats (Optional[Tensor]) – Feature sequence to extract style embedding (T_feats', idim).
- spembs (Optional[Tensor]) – Speaker embedding (spk_embed_dim,).
- sids (Optional[Tensor]) – Speaker ID (1,).
- lids (Optional[Tensor]) – Language ID (1,).
- threshold (float) – Stop-token probability threshold in inference.
- minlenratio (float) – Minimum length ratio in inference.
- maxlenratio (float) – Maximum length ratio in inference.
- use_teacher_forcing (bool) – Whether to use teacher forcing.
- Returns: Output dict including the following items:
  - feat_gen (Tensor): Output sequence of features (T_feats, odim).
  - prob (Tensor): Output sequence of stop probabilities (T_feats,).
  - att_w (Tensor): Source attention weights (#layers, #heads, T_feats, T_text).
- Return type: Dict[str, Tensor]
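A hedged inference sketch, again assuming the model from the examples above; with untrained weights the outputs are meaningless, but the call pattern and the returned keys are as documented:

```python
import torch

model.eval()
with torch.no_grad():
    output = model.inference(
        text=torch.randint(1, 50, (10,), dtype=torch.long),
        threshold=0.5,     # stop when stop probability exceeds this value
        maxlenratio=10.0,  # cap generation length relative to input length
    )

feat_gen = output["feat_gen"]  # (T_feats, odim)
prob = output["prob"]          # (T_feats,)
att_w = output["att_w"]        # (#layers, #heads, T_feats, T_text)
```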