espnet.nets.pytorch_backend.e2e_tts_transformer.Transformer
class espnet.nets.pytorch_backend.e2e_tts_transformer.Transformer(idim, odim, args=None)
Bases: TTSInterface, Module
Text-to-Speech Transformer module.
This is a module of the text-to-speech Transformer described in Neural Speech Synthesis with Transformer Network, which converts a sequence of characters or phonemes into a sequence of Mel-filterbanks.
Initialize TTS-Transformer module.
- Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- args (Namespace, optional) –
- embed_dim (int): Dimension of character embedding.
- eprenet_conv_layers (int): Number of encoder prenet convolution layers.
- eprenet_conv_chans (int): Number of encoder prenet convolution channels.
- eprenet_conv_filts (int): Filter size of encoder prenet convolution.
- dprenet_layers (int): Number of decoder prenet layers.
- dprenet_units (int): Number of decoder prenet hidden units.
- elayers (int): Number of encoder layers.
- eunits (int): Number of encoder hidden units.
- adim (int): Number of attention transformation dimensions.
- aheads (int): Number of heads for multi head attention.
- dlayers (int): Number of decoder layers.
- dunits (int): Number of decoder hidden units.
- postnet_layers (int): Number of postnet layers.
- postnet_chans (int): Number of postnet channels.
- postnet_filts (int): Filter size of postnet.
- use_scaled_pos_enc (bool): Whether to use trainable scaled positional encoding.
- use_batch_norm (bool): Whether to use batch normalization in encoder prenet.
- encoder_normalize_before (bool): Whether to perform layer normalization before encoder block.
- decoder_normalize_before (bool): Whether to perform layer normalization before decoder block.
- encoder_concat_after (bool): Whether to concatenate attention layer’s input and output in encoder.
- decoder_concat_after (bool): Whether to concatenate attention layer’s input and output in decoder.
- reduction_factor (int): Reduction factor.
- spk_embed_dim (int): Number of speaker embedding dimensions.
- spk_embed_integration_type (str): How to integrate speaker embedding.
- transformer_init (str): How to initialize transformer parameters.
- transformer_lr (float): Initial value of learning rate.
- transformer_warmup_steps (int): Optimizer warmup steps.
- transformer_enc_dropout_rate (float): Dropout rate in encoder except attention & positional encoding.
- transformer_enc_positional_dropout_rate (float): Dropout rate after encoder positional encoding.
- transformer_enc_attn_dropout_rate (float): Dropout rate in encoder self-attention module.
- transformer_dec_dropout_rate (float): Dropout rate in decoder except attention & positional encoding.
- transformer_dec_positional_dropout_rate (float): Dropout rate after decoder positional encoding.
- transformer_dec_attn_dropout_rate (float): Dropout rate in decoder self-attention module.
- transformer_enc_dec_attn_dropout_rate (float): Dropout rate in encoder-decoder attention module.
- eprenet_dropout_rate (float): Dropout rate in encoder prenet.
- dprenet_dropout_rate (float): Dropout rate in decoder prenet.
- postnet_dropout_rate (float): Dropout rate in postnet.
- use_masking (bool): Whether to apply masking for padded part in loss calculation.
- use_weighted_masking (bool): Whether to apply weighted masking in loss calculation.
- bce_pos_weight (float): Positive sample weight in bce calculation (only for use_masking=true).
- loss_type (str): How to calculate loss.
- use_guided_attn_loss (bool): Whether to use guided attention loss.
- num_heads_applied_guided_attn (int): Number of heads in each layer to apply guided attention loss.
- num_layers_applied_guided_attn (int): Number of layers to apply guided attention loss.
- modules_applied_guided_attn (list): List of module names to apply guided attention loss.
- guided_attn_loss_sigma (float): Sigma in guided attention loss.
- guided_attn_loss_lambda (float): Lambda in guided attention loss.
static add_arguments(parser)
Add model-specific arguments to the parser.
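A minimal construction sketch, assuming the ESPnet1 pattern of building a default args Namespace from add_arguments; the idim/odim values are placeholders, and spk_embed_dim is set explicitly because it is normally provided by the training script rather than this parser.

```python
import argparse

from espnet.nets.pytorch_backend.e2e_tts_transformer import Transformer

# Collect the model's default hyperparameters into an argparse Namespace.
parser = argparse.ArgumentParser()
Transformer.add_arguments(parser)
args = parser.parse_args([])  # use defaults; pass CLI-style strings to override

# Normally added by the training script; None disables speaker embeddings (assumption).
args.spk_embed_dim = None

# Placeholder dimensions: 40 input symbols, 80-dim Mel-filterbank outputs.
model = Transformer(idim=40, odim=80, args=args)
```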
property attention_plot_class
Return plot class for attention weight plot.
property base_plot_keys
Return base key names to plot during training.
Keys should match what chainer.reporter reports. If you add the key loss, the reporter will report main/loss and validation/main/loss values, and loss.png will be created as a figure visualizing main/loss and validation/main/loss values.
- Returns: List of strings which are base keys to plot during training.
- Return type: list
calculate_all_attentions(xs, ilens, ys, olens, spembs=None, skip_output=False, keep_tensor=False, *args, **kwargs)
Calculate all of the attention weights.
- Parameters:
- xs (Tensor) – Batch of padded character ids (B, Tmax).
- ilens (LongTensor) – Batch of lengths of each input sequence (B,).
- ys (Tensor) – Batch of padded target features (B, Lmax, odim).
- olens (LongTensor) – Batch of the lengths of each target (B,).
- spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
- skip_output (bool, optional) – Whether to skip calculating the final output.
- keep_tensor (bool, optional) – Whether to keep the original tensor.
- Returns: Dict of attention weights and outputs.
- Return type: dict
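A short sketch of inspecting the returned dictionary, assuming the model built in the construction sketch above; the batch shapes are dummy values that follow the parameter descriptions.

```python
import torch

B, Tmax, Lmax, odim = 2, 10, 30, 80
xs = torch.randint(1, 40, (B, Tmax))   # padded character ids (B, Tmax)
ilens = torch.tensor([10, 8])          # input lengths (B,)
ys = torch.randn(B, Lmax, odim)        # padded target features (B, Lmax, odim)
olens = torch.tensor([30, 25])         # target lengths (B,)

att_ws_dict = model.calculate_all_attentions(xs, ilens, ys, olens)
for name, ws in att_ws_dict.items():
    # One entry per attention module (and outputs unless skip_output=True).
    print(name, type(ws))
```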
forward(xs, ilens, ys, labels, olens, spembs=None, *args, **kwargs)
Calculate forward propagation.
- Parameters:
- xs (Tensor) – Batch of padded character ids (B, Tmax).
- ilens (LongTensor) – Batch of lengths of each input sequence (B,).
- ys (Tensor) – Batch of padded target features (B, Lmax, odim).
- olens (LongTensor) – Batch of the lengths of each target (B,).
- spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
- Returns: Loss value.
- Return type: Tensor
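A minimal training-style call with dummy tensors, assuming the model from the construction sketch above; the shapes follow the parameter descriptions, and the stop labels are set to 1.0 from each utterance's final frame onward, mirroring the usual padding convention (an assumption about the expected label format).

```python
import torch

B, Tmax, Lmax, odim = 2, 10, 30, 80
xs = torch.randint(1, 40, (B, Tmax))   # padded character ids (B, Tmax)
ilens = torch.tensor([10, 8])          # input lengths (B,), sorted in decreasing order
ys = torch.randn(B, Lmax, odim)        # padded target features (B, Lmax, odim)
olens = torch.tensor([30, 25])         # target lengths (B,)

labels = torch.zeros(B, Lmax)          # stop-token targets
for i, olen in enumerate(olens):
    labels[i, olen - 1:] = 1.0         # mark last frame (and padding) as "stop"

loss = model(xs, ilens, ys, labels, olens)
loss.backward()
```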
inference(x, inference_args, spemb=None, *args, **kwargs)
Generate the sequence of features given the sequences of characters.
- Parameters:
- x (Tensor) – Input sequence of characters (T,).
- inference_args (Namespace) –
- threshold (float): Threshold in inference.
- minlenratio (float): Minimum length ratio in inference.
- maxlenratio (float): Maximum length ratio in inference.
- spemb (Tensor, optional) – Speaker embedding vector (spk_embed_dim).
- Returns:
- Tensor: Output sequence of features (L, odim).
- Tensor: Output sequence of stop probabilities (L,).
- Tensor: Encoder-decoder (source) attention weights (#layers, #heads, L, T).
- Return type: Tensor
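A hedged synthesis sketch for a single utterance, assuming the model from the construction sketch above; the decoding options are passed via a Namespace and the values shown are illustrative, not recommended settings, and some ESPnet versions may read additional optional fields.

```python
import torch
from argparse import Namespace

inference_args = Namespace(threshold=0.5, minlenratio=0.0, maxlenratio=10.0)
x = torch.randint(1, 40, (12,))  # input character id sequence (T,)

model.eval()
with torch.no_grad():
    outs, probs, att_ws = model.inference(x, inference_args)

print(outs.shape)  # (L, odim) generated feature sequence
```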