espnet2.svs.singing_tacotron.decoder.Decoder
class espnet2.svs.singing_tacotron.decoder.Decoder(idim, odim, att, dlayers=2, dunits=1024, prenet_layers=2, prenet_units=256, postnet_layers=5, postnet_chans=512, postnet_filts=5, output_activation_fn=None, cumulate_att_w=True, use_batch_norm=True, use_concate=True, dropout_rate=0.5, zoneout_rate=0.1, reduction_factor=1)
Bases: Module
Decoder module of Spectrogram prediction network.
This is the decoder module of the spectrogram prediction network in Singing Tacotron, which is described in https://arxiv.org/pdf/2202.07907v1.pdf. The decoder generates the sequence of features from the sequence of hidden states.
Initialize Singing Tacotron decoder module.
- Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- att (torch.nn.Module) – Instance of attention class.
- dlayers (int, optional) – The number of decoder LSTM layers.
- dunits (int, optional) – The number of decoder LSTM units.
- prenet_layers (int, optional) – The number of prenet layers.
- prenet_units (int, optional) – The number of prenet units.
- postnet_layers (int, optional) – The number of postnet layers.
- postnet_filts (int, optional) – The postnet filter size.
- postnet_chans (int, optional) – The number of postnet filter channels.
- output_activation_fn (torch.nn.Module, optional) – Activation function for outputs.
- cumulate_att_w (bool, optional) – Whether to cumulate previous attention weights.
- use_batch_norm (bool, optional) – Whether to use batch normalization.
- use_concate (bool, optional) – Whether to concatenate the encoder embedding with the decoder LSTM outputs.
- dropout_rate (float, optional) – Dropout rate.
- zoneout_rate (float, optional) – Zoneout rate.
- reduction_factor (int, optional) – Reduction factor.
forward(hs, hlens, trans_token, ys)
Calculate forward propagation.
- Parameters:
- hs (Tensor) – Batch of the sequences of padded hidden states (B, Tmax, idim).
- hlens (LongTensor) – Batch of lengths of each input batch (B,).
- trans_token (Tensor) – Global transition token for duration (B, Tmax, 1).
- ys (Tensor) – Batch of the sequences of padded target features (B, Lmax, odim).
- Returns:
- Tensor: Batch of output tensors after postnet (B, Lmax, odim).
- Tensor: Batch of output tensors before postnet (B, Lmax, odim).
- Tensor: Batch of logits of stop prediction (B, Lmax).
- Tensor: Batch of attention weights (B, Lmax, Tmax).
- Return type: tuple of Tensor
NOTE
This computation is performed in a teacher-forcing manner.
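The teacher-forcing note can be illustrated with a small sketch. It assumes the usual convention for Tacotron-style decoders, where the input at step t is the ground-truth frame from step t-1 and the first step receives an all-zero frame. The helper `teacher_forcing_inputs` is hypothetical and is not part of ESPnet; plain Python lists stand in for tensors.

```python
from typing import List

Frame = List[float]


def teacher_forcing_inputs(ys: List[Frame]) -> List[Frame]:
    """Return the decoder inputs for one utterance under teacher forcing.

    Hypothetical helper (not an ESPnet function): step t receives the
    ground-truth frame t-1, and step 0 receives an all-zero frame, which
    is an assumed convention.
    """
    if not ys:
        return []
    odim = len(ys[0])
    zero_frame = [0.0] * odim
    # Shift the targets right by one frame; the last target is never an input.
    return [zero_frame] + [list(frame) for frame in ys[:-1]]


targets = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # one utterance, (L=3, odim=2)
inputs = teacher_forcing_inputs(targets)
print(inputs)
```

Because every input comes from the ground truth rather than from the decoder's own previous prediction, all steps can be prepared up front, which is why `forward` needs `ys` while `inference` does not.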
inference(h, trans_token, threshold=0.5, minlenratio=0.0, maxlenratio=30.0, use_att_constraint=False, use_dynamic_filter=True, backward_window=1, forward_window=3)
Generate the sequence of features given the sequence of encoder hidden states.
- Parameters:
- h (Tensor) – Input sequence of encoder hidden states (T, C).
- trans_token (Tensor) – Global transition token for duration.
- threshold (float, optional) – Threshold to stop generation.
- minlenratio (float, optional) – Minimum length ratio. If set to 1.0 and the length of input is 10, the minimum length of outputs will be 10 * 1 = 10.
- maxlenratio (float, optional) – Maximum length ratio. If set to 10 and the length of input is 10, the maximum length of outputs will be 10 * 10 = 100.
- use_att_constraint (bool) – Whether to apply the attention constraint introduced in Deep Voice 3.
- use_dynamic_filter (bool) – Whether to apply the dynamic filter introduced in Singing Tacotron.
- backward_window (int) – Backward window size in attention constraint.
- forward_window (int) – Forward window size in attention constraint.
- Returns:
- Tensor: Output sequence of features (L, odim).
- Tensor: Output sequence of stop probabilities (L,).
- Tensor: Attention weights (L, T).
- Return type: tuple of Tensor
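To make `backward_window` and `forward_window` concrete, here is a minimal sketch of a windowed attention constraint. It assumes the window keeps input positions from `last_attended_idx - backward_window` up to, but not including, `last_attended_idx + forward_window`, and masks everything else before the softmax. The function names are illustrative, and the exact boundary handling in the library may differ.

```python
import math
from typing import List


def apply_attention_constraint(
    scores: List[float],
    last_attended_idx: int,
    backward_window: int = 1,
    forward_window: int = 3,
) -> List[float]:
    """Mask attention scores outside a window around the last attended index.

    Sketch only: positions outside [last - backward_window,
    last + forward_window) are set to -inf, so a following softmax
    gives them zero weight.
    """
    lo = max(last_attended_idx - backward_window, 0)
    hi = min(last_attended_idx + forward_window, len(scores))
    return [s if lo <= i < hi else -math.inf for i, s in enumerate(scores)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax; masked (-inf) entries become 0.0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


raw_scores = [0.0] * 8  # uniform scores over T = 8 input positions
masked = apply_attention_constraint(raw_scores, last_attended_idx=4)
weights = softmax(masked)
print(weights)
```

With the default windows and the previous attention peak at index 4, only positions 3 to 6 can receive weight, which keeps the alignment from jumping far backward or forward between output frames.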
NOTE
This computation is performed in an auto-regressive manner.
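The interaction of `threshold`, `minlenratio`, and `maxlenratio` can be sketched as a simple stopping rule. This is a simplified illustration based only on the parameter descriptions above, not the library's actual loop: `generated_length` and `stop_prob_at` are hypothetical names, and one output frame per step is assumed (that is, `reduction_factor=1`).

```python
from typing import Callable


def generated_length(
    input_length: int,
    stop_prob_at: Callable[[int], float],
    threshold: float = 0.5,
    minlenratio: float = 0.0,
    maxlenratio: float = 30.0,
) -> int:
    """Return how many frames a toy auto-regressive loop would generate.

    `stop_prob_at(step)` stands in for the decoder's stop probability
    at a given step (a hypothetical callback, not an ESPnet API).
    """
    minlen = int(input_length * minlenratio)
    maxlen = int(input_length * maxlenratio)
    step = 0
    while True:
        step += 1
        stop_prob = stop_prob_at(step)
        if step >= maxlen:
            break  # hard upper bound on the output length
        if stop_prob >= threshold and step >= minlen:
            break  # stop prediction fired and the minimum length is met
    return step


# Input length 10, as in the parameter descriptions above.
always_stop = generated_length(10, lambda s: 0.9, minlenratio=1.0, maxlenratio=10)
never_stop = generated_length(10, lambda s: 0.0, minlenratio=1.0, maxlenratio=10)
stop_at_25 = generated_length(
    10, lambda s: 0.9 if s >= 25 else 0.1, minlenratio=1.0, maxlenratio=10
)
print(always_stop, never_stop, stop_at_25)
```

In this sketch, a stop prediction that fires immediately is held back until the minimum length of 10 frames, while a stop prediction that never fires is cut off at the maximum length of 100 frames.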