espnet.nets.pytorch_backend.e2e_tts_tacotron2.Tacotron2
class espnet.nets.pytorch_backend.e2e_tts_tacotron2.Tacotron2(idim, odim, args=None)
Bases: TTSInterface, Module
Tacotron2 module for end-to-end text-to-speech (E2E-TTS).
This is a module of the spectrogram prediction network in Tacotron2 described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, which converts a sequence of characters into a sequence of Mel filterbanks.
Initialize Tacotron2 module.
- Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- args (Namespace , optional) –
- embed_dim (int): Dimension of character embedding.
- elayers (int): The number of encoder blstm layers.
- eunits (int): The number of encoder blstm units.
- econv_layers (int): The number of encoder conv layers.
- econv_filts (int): The filter size of encoder conv layers.
- econv_chans (int): The number of encoder conv filter channels.
- dlayers (int): The number of decoder lstm layers.
- dunits (int): The number of decoder lstm units.
- prenet_layers (int): The number of prenet layers.
- prenet_units (int): The number of prenet units.
- postnet_layers (int): The number of postnet layers.
- postnet_filts (int): The filter size of postnet layers.
- postnet_chans (int): The number of postnet filter channels.
- output_activation (str): The name of the activation function for outputs.
- adim (int): The dimension of the MLP in attention.
- aconv_chans (int): The number of attention conv filter channels.
- aconv_filts (int): The filter size of attention conv layers.
- cumulate_att_w (bool): Whether to cumulate previous attention weight.
- use_batch_norm (bool): Whether to use batch normalization.
- use_concate (bool): Whether to concatenate encoder embedding with decoder lstm outputs.
- dropout_rate (float): Dropout rate.
- zoneout_rate (float): Zoneout rate.
- reduction_factor (int): Reduction factor.
- spk_embed_dim (int): Number of speaker embedding dimensions.
- spc_dim (int): Number of spectrogram embedding dimensions (only for use_cbhg=True).
- use_cbhg (bool): Whether to use CBHG module.
- cbhg_conv_bank_layers (int): The number of convolutional banks in CBHG.
- cbhg_conv_bank_chans (int): The number of channels of convolutional banks in CBHG.
- cbhg_proj_filts (int): The filter size of the projection layer in CBHG.
- cbhg_proj_chans (int): The number of channels of the projection layer in CBHG.
- cbhg_highway_layers (int): The number of layers of the highway network in CBHG.
- cbhg_highway_units (int): The number of units of the highway network in CBHG.
- cbhg_gru_units (int): The number of units of GRU in CBHG.
- use_masking (bool): Whether to apply masking for the padded part in loss calculation.
- use_weighted_masking (bool): Whether to apply weighted masking in loss calculation.
- bce_pos_weight (float): Weight of positive samples of the stop token (only for use_masking=True).
- use_guided_attn_loss (bool): Whether to use guided attention loss.
- guided_attn_loss_sigma (float): Sigma in guided attention loss.
- guided_attn_loss_lambda (float): Lambda in guided attention loss.
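In practice these fields arrive on an argparse Namespace, typically populated via add_arguments. A minimal sketch of hand-building such a namespace; the values are illustrative, not the ESPnet defaults, and the final construction line is commented out because it requires espnet and torch:

```python
import argparse

# Illustrative hyperparameters for a small model (NOT the ESPnet defaults).
args = argparse.Namespace(
    embed_dim=512,
    elayers=1, eunits=512,
    econv_layers=3, econv_filts=5, econv_chans=512,
    dlayers=2, dunits=1024,
    prenet_layers=2, prenet_units=256,
    postnet_layers=5, postnet_filts=5, postnet_chans=512,
    output_activation=None,
    adim=128, aconv_chans=32, aconv_filts=15,
    cumulate_att_w=True,
    use_batch_norm=True, use_concate=True,
    dropout_rate=0.5, zoneout_rate=0.1,
    reduction_factor=1,
    spk_embed_dim=None,
    use_cbhg=False,
    use_masking=True, use_weighted_masking=False,
    bce_pos_weight=5.0,
    use_guided_attn_loss=True,
    guided_attn_loss_sigma=0.4, guided_attn_loss_lambda=1.0,
)
# model = Tacotron2(idim=40, odim=80, args=args)  # requires espnet
```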
static add_arguments(parser)
Add model-specific arguments to the parser.
property base_plot_keys
Return base key names to plot during training.
Keys should match what chainer.reporter reports. If you add the key loss, the reporter will report main/loss and validation/main/loss values. Also, loss.png will be created as a figure visualizing main/loss and validation/main/loss values.
- Returns: List of strings which are base keys to plot during training.
- Return type: list
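The mapping from base keys to reported names described above can be sketched in plain Python (the function name is hypothetical, for illustration only):

```python
# Sketch: expand a base plot key such as "loss" into the names the
# chainer reporter logs, per the naming convention described above.
def expand_plot_keys(base_keys):
    reported = []
    for key in base_keys:
        reported.append(f"main/{key}")
        reported.append(f"validation/main/{key}")
    return reported

print(expand_plot_keys(["loss"]))  # → ['main/loss', 'validation/main/loss']
```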
calculate_all_attentions(xs, ilens, ys, spembs=None, keep_tensor=False, *args, **kwargs)
Calculate all of the attention weights.
- Parameters:
- xs (Tensor) – Batch of padded character ids (B, Tmax).
- ilens (LongTensor) – Batch of lengths of each input sequence (B,).
- ys (Tensor) – Batch of padded target features (B, Lmax, odim).
- olens (LongTensor) – Batch of the lengths of each target (B,).
- spembs (Tensor , optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
- keep_tensor (bool , optional) – Whether to keep original tensor.
- Returns: Batch of attention weights (B, Lmax, Tmax).
- Return type: Union[ndarray, Tensor]
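The cumulate_att_w option above controls whether each step's attention weights are summed into a running total that conditions location-sensitive attention. A stdlib-only sketch with plain lists standing in for tensors (the helper name is hypothetical):

```python
# Sketch of cumulative attention weights (cumulate_att_w=True): at each
# decoder step the new weight vector over the T input positions is added
# to the running sum.
def cumulate(att_weights):
    """att_weights: list of per-step weight vectors, each of length T."""
    cum = [0.0] * len(att_weights[0])
    history = []
    for w in att_weights:
        cum = [c + wi for c, wi in zip(cum, w)]
        history.append(list(cum))
    return history

steps = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
print(cumulate(steps))  # → [[0.5, 0.25, 0.25], [0.75, 0.75, 0.5]]
```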
forward(xs, ilens, ys, labels, olens, spembs=None, extras=None, *args, **kwargs)
Calculate forward propagation.
- Parameters:
- xs (Tensor) – Batch of padded character ids (B, Tmax).
- ilens (LongTensor) – Batch of lengths of each input sequence (B,).
- ys (Tensor) – Batch of padded target features (B, Lmax, odim).
- olens (LongTensor) – Batch of the lengths of each target (B,).
- spembs (Tensor , optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
- extras (Tensor , optional) – Batch of groundtruth spectrograms (B, Lmax, spc_dim).
- Returns: Loss value.
- Return type: Tensor
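With use_masking=True, frames beyond each target's true length in olens are excluded from the loss average. A stdlib-only sketch of that masking idea, with nested lists standing in for (B, Lmax) tensors (the helper name is hypothetical):

```python
# Sketch of length masking in the loss (use_masking=True): only frames
# inside each utterance's true length olens[b] contribute to the mean;
# padded frames are ignored.
def masked_mean(per_frame_loss, olens):
    total, count = 0.0, 0
    for losses, olen in zip(per_frame_loss, olens):
        for t in range(olen):  # frames t >= olen are padding
            total += losses[t]
            count += 1
    return total / count

batch_losses = [[1.0, 2.0, 9.9], [3.0, 9.9, 9.9]]  # 9.9 marks padded frames
olens = [2, 1]
print(masked_mean(batch_losses, olens))  # → 2.0
```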
inference(x, inference_args, spemb=None, *args, **kwargs)
Generate the sequence of features given the sequences of characters.
- Parameters:
- x (Tensor) – Input sequence of characters (T,).
- inference_args (Namespace) –
- threshold (float): Threshold in inference.
- minlenratio (float): Minimum length ratio in inference.
- maxlenratio (float): Maximum length ratio in inference.
- spemb (Tensor , optional) – Speaker embedding vector (spk_embed_dim,).
- Returns: Tuple of three tensors: output sequence of features (L, odim), output sequence of stop probabilities (L,), and attention weights (L, T).
- Return type: Tuple[Tensor, Tensor, Tensor]
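The threshold, minlenratio, and maxlenratio arguments above govern when autoregressive decoding stops. A stdlib-only sketch of that stopping rule (the function name is hypothetical; the real method also generates the features themselves):

```python
# Sketch of the inference stopping rule: decoding ends once the stop
# probability exceeds `threshold`, but never before minlen and never
# after maxlen, both derived from the input length T.
def decode_length(stop_probs, T, threshold=0.5, minlenratio=0.0, maxlenratio=10.0):
    minlen = int(T * minlenratio)
    maxlen = int(T * maxlenratio)
    for step, p in enumerate(stop_probs, start=1):
        if step >= minlen and p >= threshold:
            return step          # stop token fired
        if step >= maxlen:
            return step          # hard cap on output length
    return len(stop_probs)

# With T=2 and minlenratio=1.0, decoding cannot stop before step 2,
# then stops at step 3 when the stop probability first exceeds 0.5.
print(decode_length([0.1, 0.2, 0.9, 0.95], T=2, threshold=0.5, minlenratio=1.0))  # → 3
```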