espnet2.legacy.nets.pytorch_backend.tacotron2.decoder.Decoder
class espnet2.legacy.nets.pytorch_backend.tacotron2.decoder.Decoder(idim, odim, att, dlayers=2, dunits=1024, prenet_layers=2, prenet_units=256, postnet_layers=5, postnet_chans=512, postnet_filts=5, output_activation_fn=None, cumulate_att_w=True, use_batch_norm=True, use_concate=True, dropout_rate=0.5, zoneout_rate=0.1, reduction_factor=1)
Bases: Module
Decoder module of Spectrogram prediction network.
This is the decoder module of the spectrogram prediction network in Tacotron2, which is described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. The decoder generates a sequence of features from a sequence of hidden states.
Initialize Tacotron2 decoder module.
- Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- att (torch.nn.Module) – Instance of attention class.
- dlayers (int , optional) – The number of decoder LSTM layers.
- dunits (int , optional) – The number of decoder LSTM units.
- prenet_layers (int , optional) – The number of prenet layers.
- prenet_units (int , optional) – The number of prenet units.
- postnet_layers (int , optional) – The number of postnet layers.
- postnet_filts (int , optional) – Filter size of the postnet convolutions.
- postnet_chans (int , optional) – The number of postnet filter channels.
- output_activation_fn (torch.nn.Module , optional) – Activation function for outputs.
- cumulate_att_w (bool , optional) – Whether to cumulate previous attention weight.
- use_batch_norm (bool , optional) – Whether to use batch normalization.
- use_concate (bool , optional) – Whether to concatenate encoder embedding with decoder LSTM outputs.
- dropout_rate (float , optional) – Dropout rate.
- zoneout_rate (float , optional) – Zoneout rate.
- reduction_factor (int , optional) – Reduction factor.
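A minimal construction sketch, assuming the legacy location-sensitive attention class AttLoc is importable under the matching espnet2.legacy path; the import path and all hyperparameter values below are illustrative assumptions, not prescribed settings.

```python
import torch

# Assumed legacy import paths; adjust to your installation if they differ.
from espnet2.legacy.nets.pytorch_backend.rnn.attentions import AttLoc
from espnet2.legacy.nets.pytorch_backend.tacotron2.decoder import Decoder

idim, odim = 512, 80  # example encoder hidden size and number of mel bins

# AttLoc(eprojs, dunits, att_dim, aconv_chans, aconv_filts); dunits must
# match the decoder's dunits (1024 by default) so the attention can consume
# the decoder state.
att = AttLoc(idim, 1024, 128, 32, 15)

decoder = Decoder(idim=idim, odim=odim, att=att)
```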
calculate_all_attentions(hs, hlens, ys)
Calculate all of the attention weights.
- Parameters:
- hs (Tensor) – Batch of the sequences of padded hidden states (B, Tmax, idim).
- hlens (LongTensor) – Batch of lengths of each input sequence (B,).
- ys (Tensor) – Batch of the sequences of padded target features (B, Lmax, odim).
- Returns: Batch of attention weights (B, Lmax, Tmax).
- Return type: numpy.ndarray
NOTE: This computation is performed in a teacher-forcing manner.
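A hedged usage sketch, continuing from the construction example above; the batch shapes are illustrative and follow the parameter descriptions of this method.

```python
hs = torch.randn(2, 100, idim)   # padded encoder states (B, Tmax, idim)
hlens = torch.tensor([100, 80])  # valid length of each input sequence (B,)
ys = torch.randn(2, 300, odim)   # padded target features (B, Lmax, odim)

# Teacher-forced attention weights, e.g. for plotting alignment diagnostics.
att_ws = decoder.calculate_all_attentions(hs, hlens, ys)
print(att_ws.shape)  # (B, Lmax, Tmax), per the Return type above
```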
forward(hs, hlens, ys)
Calculate forward propagation.
- Parameters:
- hs (Tensor) – Batch of the sequences of padded hidden states (B, Tmax, idim).
- hlens (LongTensor) – Batch of lengths of each input sequence (B,).
- ys (Tensor) – Batch of the sequences of padded target features (B, Lmax, odim).
- Returns:
  - Tensor: Batch of output tensors after postnet (B, Lmax, odim).
  - Tensor: Batch of output tensors before postnet (B, Lmax, odim).
  - Tensor: Batch of logits of stop prediction (B, Lmax).
  - Tensor: Batch of attention weights (B, Lmax, Tmax).
- Return type: Tuple[Tensor, Tensor, Tensor, Tensor]
NOTE: This computation is performed in a teacher-forcing manner.
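A hedged teacher-forcing training sketch, continuing from the construction example above; the unpacking order follows the Returns list, and the loss wiring is purely illustrative (the actual Tacotron2 loss also includes a stop-token term).

```python
import torch.nn.functional as F

hs = torch.randn(2, 100, idim)   # padded encoder states (B, Tmax, idim)
hlens = torch.tensor([100, 80])  # valid length of each input sequence (B,)
ys = torch.randn(2, 300, odim)   # padded target features (B, Lmax, odim)

after_outs, before_outs, logits, att_ws = decoder(hs, hlens, ys)

# Illustrative feature loss on both pre- and post-postnet outputs.
loss = F.l1_loss(after_outs, ys) + F.l1_loss(before_outs, ys)
loss.backward()
```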
inference(h, threshold=0.5, minlenratio=0.0, maxlenratio=10.0, use_att_constraint=False, backward_window=None, forward_window=None)
Generate the sequence of features given the sequences of characters.
- Parameters:
- h (Tensor) – Input sequence of encoder hidden states (T, C).
- threshold (float , optional) – Threshold to stop generation.
- minlenratio (float , optional) – Minimum length ratio. If set to 1.0 and the length of the input is 10, the minimum length of the outputs will be 10 * 1 = 10.
- maxlenratio (float , optional) – Maximum length ratio. If set to 10 and the length of the input is 10, the maximum length of the outputs will be 10 * 10 = 100.
- use_att_constraint (bool) – Whether to apply attention constraint introduced in Deep Voice 3.
- backward_window (int) – Backward window size in attention constraint.
- forward_window (int) – Forward window size in attention constraint.
- Returns:
  - Tensor: Output sequence of features (L, odim).
  - Tensor: Output sequence of stop probabilities (L,).
  - Tensor: Attention weights (L, T).
- Return type: Tuple[Tensor, Tensor, Tensor]
NOTE: This computation is performed in an auto-regressive manner.
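A hedged synthesis sketch for a single utterance, continuing from the construction example above; eval() switches off training-time stochastic behavior such as zoneout (in Tacotron2, prenet dropout is typically kept active even at inference).

```python
decoder.eval()
h = torch.randn(100, idim)  # encoder states for one utterance (T, C)

with torch.no_grad():
    outs, probs, att_ws = decoder.inference(
        h, threshold=0.5, minlenratio=0.0, maxlenratio=10.0
    )

# outs: (L, odim) features, probs: (L,) stop probabilities, att_ws: (L, T)
```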
