espnet.nets.pytorch_backend.e2e_tts_tacotron2.Tacotron2
class espnet.nets.pytorch_backend.e2e_tts_tacotron2.Tacotron2(idim, odim, args=None)
Bases: TTSInterface, Module
Tacotron2 module for end-to-end text-to-speech (E2E-TTS).
This is a module of the spectrogram prediction network in Tacotron2 described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, which converts a sequence of characters into a sequence of Mel filterbanks.
Initialize Tacotron2 module.
- Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- args (Namespace , optional) –
- embed_dim (int): Dimension of character embedding.
- elayers (int): The number of encoder blstm layers.
- eunits (int): The number of encoder blstm units.
- econv_layers (int): The number of encoder conv layers.
- econv_filts (int): The filter size of encoder conv layers.
- econv_chans (int): The number of encoder conv filter channels.
- dlayers (int): The number of decoder lstm layers.
- dunits (int): The number of decoder lstm units.
- prenet_layers (int): The number of prenet layers.
- prenet_units (int): The number of prenet units.
- postnet_layers (int): The number of postnet layers.
- postnet_filts (int): The filter size of postnet layers.
- postnet_chans (int): The number of postnet filter channels.
- output_activation (str): The name of the activation function for outputs.
- adim (int): The dimension of the MLP in attention.
- aconv_chans (int): The number of attention conv filter channels.
- aconv_filts (int): The filter size of attention conv layers.
- cumulate_att_w (bool): Whether to cumulate previous attention weight.
- use_batch_norm (bool): Whether to use batch normalization.
- use_concate (bool): Whether to concatenate encoder embedding with decoder lstm outputs.
- dropout_rate (float): Dropout rate.
- zoneout_rate (float): Zoneout rate.
- reduction_factor (int): Reduction factor.
- spk_embed_dim (int): Number of speaker embedding dimensions.
- spc_dim (int): Number of spectrogram embedding dimensions (only for use_cbhg=True).
- use_cbhg (bool): Whether to use CBHG module.
- cbhg_conv_bank_layers (int): The number of convolutional banks in CBHG.
- cbhg_conv_bank_chans (int): The number of channels of convolutional banks in CBHG.
- cbhg_proj_filts (int): The filter size of the projection layer in CBHG.
- cbhg_proj_chans (int): The number of channels of the projection layer in CBHG.
- cbhg_highway_layers (int): The number of layers of the highway network in CBHG.
- cbhg_highway_units (int): The number of units of the highway network in CBHG.
- cbhg_gru_units (int): The number of units of GRU in CBHG.
- use_masking (bool): Whether to apply masking for the padded part in loss calculation.
- use_weighted_masking (bool): Whether to apply weighted masking in loss calculation.
- bce_pos_weight (float): Weight of positive samples of the stop token (only for use_masking=True).
- use_guided_attn_loss (bool): Whether to use guided attention loss.
- guided_attn_loss_sigma (float): Sigma in guided attention loss.
- guided_attn_loss_lambda (float): Lambda in guided attention loss.
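In practice these fields arrive on an argparse Namespace, typically populated via add_arguments. A minimal sketch of hand-building such a namespace; the values are illustrative, not the ESPnet defaults, and the final construction line is commented out because it requires espnet and torch:

```python
import argparse

# Illustrative hyperparameters for a small model (NOT the ESPnet defaults).
args = argparse.Namespace(
    embed_dim=512,
    elayers=1, eunits=512,
    econv_layers=3, econv_filts=5, econv_chans=512,
    dlayers=2, dunits=1024,
    prenet_layers=2, prenet_units=256,
    postnet_layers=5, postnet_filts=5, postnet_chans=512,
    output_activation=None,
    adim=128, aconv_chans=32, aconv_filts=15,
    cumulate_att_w=True,
    use_batch_norm=True, use_concate=True,
    dropout_rate=0.5, zoneout_rate=0.1,
    reduction_factor=1,
    spk_embed_dim=None,
    use_cbhg=False,
    use_masking=True, use_weighted_masking=False,
    bce_pos_weight=5.0,
    use_guided_attn_loss=True,
    guided_attn_loss_sigma=0.4, guided_attn_loss_lambda=1.0,
)
# model = Tacotron2(idim=40, odim=80, args=args)  # requires espnet
```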
static add_arguments(parser)
Add model-specific arguments to the parser.
property base_plot_keys
Return base key names to plot during training.
Keys should match what chainer.reporter reports. If you add the key loss, the reporter will report main/loss and validation/main/loss values. Also, loss.png will be created as a figure visualizing main/loss and validation/main/loss values.
- Returns: List of strings which are base keys to plot during training.
- Return type: list
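The mapping from base keys to reported names described above can be sketched in plain Python (the function name is hypothetical, for illustration only):

```python
# Sketch: expand a base plot key such as "loss" into the names the
# chainer reporter logs, per the naming convention described above.
def expand_plot_keys(base_keys):
    reported = []
    for key in base_keys:
        reported.append(f"main/{key}")
        reported.append(f"validation/main/{key}")
    return reported

print(expand_plot_keys(["loss"]))  # → ['main/loss', 'validation/main/loss']
```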
calculate_all_attentions(xs, ilens, ys, spembs=None, keep_tensor=False, *args, **kwargs)
Calculate all of the attention weights.
- Parameters:
- xs (Tensor) – Batch of padded character ids (B, Tmax).
- ilens (LongTensor) – Batch of lengths of each input sequence (B,).
- ys (Tensor) – Batch of padded target features (B, Lmax, odim).
- olens (LongTensor) – Batch of the lengths of each target (B,).
- spembs (Tensor , optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
- keep_tensor (bool , optional) – Whether to keep original tensor.
- Returns: Batch of attention weights (B, Lmax, Tmax).
- Return type: Union[ndarray, Tensor]
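The cumulate_att_w option above controls whether each step's attention weights are summed into a running total that conditions location-sensitive attention. A stdlib-only sketch with plain lists standing in for tensors (the helper name is hypothetical):

```python
# Sketch of cumulative attention weights (cumulate_att_w=True): at each
# decoder step the new weight vector over the T input positions is added
# to the running sum.
def cumulate(att_weights):
    """att_weights: list of per-step weight vectors, each of length T."""
    cum = [0.0] * len(att_weights[0])
    history = []
    for w in att_weights:
        cum = [c + wi for c, wi in zip(cum, w)]
        history.append(list(cum))
    return history

steps = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
print(cumulate(steps))  # → [[0.5, 0.25, 0.25], [0.75, 0.75, 0.5]]
```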
forward(xs, ilens, ys, labels, olens, spembs=None, extras=None, *args, **kwargs)
Calculate forward propagation.
- Parameters:
- xs (Tensor) – Batch of padded character ids (B, Tmax).
- ilens (LongTensor) – Batch of lengths of each input sequence (B,).
- ys (Tensor) – Batch of padded target features (B, Lmax, odim).
- olens (LongTensor) – Batch of the lengths of each target (B,).
- spembs (Tensor , optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
- extras (Tensor , optional) – Batch of groundtruth spectrograms (B, Lmax, spc_dim).
- Returns: Loss value.
- Return type: Tensor
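With use_masking=True, frames beyond each target's true length in olens are excluded from the loss average. A stdlib-only sketch of that masking idea, with nested lists standing in for (B, Lmax) tensors (the helper name is hypothetical):

```python
# Sketch of length masking in the loss (use_masking=True): only frames
# inside each utterance's true length olens[b] contribute to the mean;
# padded frames are ignored.
def masked_mean(per_frame_loss, olens):
    total, count = 0.0, 0
    for losses, olen in zip(per_frame_loss, olens):
        for t in range(olen):  # frames t >= olen are padding
            total += losses[t]
            count += 1
    return total / count

batch_losses = [[1.0, 2.0, 9.9], [3.0, 9.9, 9.9]]  # 9.9 marks padded frames
olens = [2, 1]
print(masked_mean(batch_losses, olens))  # → 2.0
```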
inference(x, inference_args, spemb=None, *args, **kwargs)
Generate the sequence of features given the sequences of characters.
- Parameters:
- x (Tensor) – Input sequence of characters (T,).
- inference_args (Namespace) –
- threshold (float): Threshold in inference.
- minlenratio (float): Minimum length ratio in inference.
- maxlenratio (float): Maximum length ratio in inference.
- spemb (Tensor , optional) – Speaker embedding vector (spk_embed_dim,).
- Returns: Tuple of three tensors: output sequence of features (L, odim), output sequence of stop probabilities (L,), and attention weights (L, T).
- Return type: Tuple[Tensor, Tensor, Tensor]
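The threshold, minlenratio, and maxlenratio arguments above govern when autoregressive decoding stops. A stdlib-only sketch of that stopping rule (the function name is hypothetical; the real method also generates the features themselves):

```python
# Sketch of the inference stopping rule: decoding ends once the stop
# probability exceeds `threshold`, but never before minlen and never
# after maxlen, both derived from the input length T.
def decode_length(stop_probs, T, threshold=0.5, minlenratio=0.0, maxlenratio=10.0):
    minlen = int(T * minlenratio)
    maxlen = int(T * maxlenratio)
    for step, p in enumerate(stop_probs, start=1):
        if step >= minlen and p >= threshold:
            return step          # stop token fired
        if step >= maxlen:
            return step          # hard cap on output length
    return len(stop_probs)

# With T=2 and minlenratio=1.0, decoding cannot stop before step 2,
# then stops at step 3 when the stop probability first exceeds 0.5.
print(decode_length([0.1, 0.2, 0.9, 0.95], T=2, threshold=0.5, minlenratio=1.0))  # → 3
```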