espnet2.svs.singing_tacotron.decoder.Decoder

class espnet2.svs.singing_tacotron.decoder.Decoder(idim, odim, att, dlayers=2, dunits=1024, prenet_layers=2, prenet_units=256, postnet_layers=5, postnet_chans=512, postnet_filts=5, output_activation_fn=None, cumulate_att_w=True, use_batch_norm=True, use_concate=True, dropout_rate=0.5, zoneout_rate=0.1, reduction_factor=1)

Bases: Module

Decoder module of Spectrogram prediction network.

This is the decoder of the spectrogram prediction network in Singing Tacotron, which is described in "Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis" (https://arxiv.org/pdf/2202.07907v1.pdf). The decoder generates the sequence of features from the sequence of hidden states.

Initialize Singing Tacotron decoder module.

Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- att (torch.nn.Module) – Instance of attention class.
- dlayers (int, optional) – The number of decoder LSTM layers.
- dunits (int, optional) – The number of decoder LSTM units.
- prenet_layers (int, optional) – The number of prenet layers.
- prenet_units (int, optional) – The number of prenet units.
- postnet_layers (int, optional) – The number of postnet layers.
- postnet_filts (int, optional) – The filter size of the postnet.
- postnet_chans (int, optional) – The number of postnet filter channels.
- output_activation_fn (torch.nn.Module, optional) – Activation function for outputs.
- cumulate_att_w (bool, optional) – Whether to accumulate previous attention weights.
- use_batch_norm (bool, optional) – Whether to use batch normalization.
- use_concate (bool, optional) – Whether to concatenate the encoder embedding with the decoder LSTM outputs.
- dropout_rate (float, optional) – Dropout rate.
- zoneout_rate (float, optional) – Zoneout rate.
- reduction_factor (int, optional) – Reduction factor.
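
As a rough illustration of what `reduction_factor` means in Tacotron-style decoders: each decoder step typically emits `reduction_factor` frames at once, so fewer recurrent steps are needed for the same number of output frames. The sketch below shows only that arithmetic. The helper `decoder_steps` is hypothetical and not part of ESPnet, and dropping leftover frames that do not fill a whole step is an assumption, not a confirmed detail of this class.

```python
def decoder_steps(num_frames: int, reduction_factor: int = 1) -> int:
    """Number of decoder steps needed to cover num_frames (hypothetical helper).

    Assumes each step emits `reduction_factor` frames and that trailing
    frames which do not fill a whole step are dropped.
    """
    if reduction_factor < 1:
        raise ValueError("reduction_factor must be >= 1")
    return num_frames // reduction_factor


# With 100 target frames: 100 steps at r=1, 50 steps at r=2.
steps_r1 = decoder_steps(100, 1)
steps_r2 = decoder_steps(100, 2)
```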

forward(hs, hlens, trans_token, ys)

Calculate forward propagation.

Parameters:
- hs (Tensor) – Batch of the sequences of padded hidden states (B, Tmax, idim).
- hlens (LongTensor) – Batch of lengths of each input batch (B,).
- trans_token (Tensor) – Global transition token for duration (B, Tmax, 1).
- ys (Tensor) – Batch of the sequences of padded target features (B, Lmax, odim).
Returns:
- Tensor: Batch of output tensors after postnet (B, Lmax, odim).
- Tensor: Batch of output tensors before postnet (B, Lmax, odim).
- Tensor: Batch of logits of stop prediction (B, Lmax).
- Tensor: Batch of attention weights (B, Lmax, Tmax).
Return type: Tuple[Tensor, Tensor, Tensor, Tensor]

NOTE

This computation is performed in a teacher-forcing manner.
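
To make the teacher-forcing note concrete, here is a minimal pure-Python sketch of the idea: at every step the decoder is fed the previous ground-truth frame rather than its own previous prediction. The names `teacher_forced_decode` and `toy_step` are hypothetical, the all-zero initial frame is an assumption, and the real module operates on batched tensors with a prenet, attention, and LSTM layers that are omitted here.

```python
from typing import Callable, List

Frame = List[float]


def teacher_forced_decode(
    targets: List[Frame],
    step_fn: Callable[[Frame], Frame],
) -> List[Frame]:
    """Run a toy decoder where each step is fed the previous ground-truth frame."""
    odim = len(targets[0])
    prev = [0.0] * odim  # all-zero initial frame for the first step (assumption)
    outputs = []
    for target in targets:
        outputs.append(step_fn(prev))
        prev = target  # teacher forcing: feed the target, not the prediction
    return outputs


# Toy step function that records what it was fed.
seen = []


def toy_step(prev: Frame) -> Frame:
    seen.append(prev)
    return [x + 1.0 for x in prev]


ys = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outs = teacher_forced_decode(ys, toy_step)
# seen is [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]: the inputs never depend on outs.
```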

inference(h, trans_token, threshold=0.5, minlenratio=0.0, maxlenratio=30.0, use_att_constraint=False, use_dynamic_filter=True, backward_window=1, forward_window=3)

Generate the sequence of features from the encoder hidden states of a single input sequence.

Parameters:
- h (Tensor) – Input sequence of encoder hidden states (T, C).
- trans_token (Tensor) – Global transition token for duration.
- threshold (float, optional) – Threshold to stop generation.
- minlenratio (float, optional) – Minimum length ratio. If set to 1.0 and the length of input is 10, the minimum length of outputs will be 10 * 1.0 = 10.
- maxlenratio (float, optional) – Maximum length ratio. If set to 10 and the length of input is 10, the maximum length of outputs will be 10 * 10 = 100.
- use_att_constraint (bool) – Whether to apply attention constraint introduced in Deep Voice 3.
- use_dynamic_filter (bool) – Whether to apply the dynamic filter introduced in Singing Tacotron.
- backward_window (int) – Backward window size in attention constraint.
- forward_window (int) – Forward window size in attention constraint.
Returns:
- Tensor: Output sequence of features (L, odim).
- Tensor: Output sequence of stop probabilities (L,).
- Tensor: Attention weights (L, T).
Return type: Tuple[Tensor, Tensor, Tensor]
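
The `backward_window` and `forward_window` parameters describe a window around the most recently attended input position. One plausible way to picture the constraint is the sketch below, which masks attention energies outside that window before normalization. The helper `apply_attention_window` is hypothetical, it works on a plain list rather than a tensor, and the exact boundary convention (inclusive lower bound, exclusive upper bound) is an assumption.

```python
import math
from typing import List


def apply_attention_window(
    energies: List[float],
    last_attended_idx: int,
    backward_window: int = 1,
    forward_window: int = 3,
) -> List[float]:
    """Mask attention energies outside a window around the last attended index."""
    lo = last_attended_idx - backward_window  # first index that stays unmasked
    hi = last_attended_idx + forward_window   # first index masked on the right
    masked = list(energies)
    for i in range(len(masked)):
        if i < lo or i >= hi:
            masked[i] = -math.inf  # becomes zero weight after a softmax
    return masked


energies = [0.1, 0.5, 0.2, 0.9, 0.3, 0.7, 0.4, 0.8]
masked = apply_attention_window(energies, last_attended_idx=3)
# Positions 2..5 keep their energies; all other positions become -inf.
```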

NOTE

This computation is performed in an auto-regressive manner.
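
The interaction of `threshold`, `minlenratio`, and `maxlenratio` can be sketched as a simple stopping rule for an auto-regressive loop. This is an illustration under stated assumptions, not the module's actual code: `generate_until_stop` is a hypothetical helper, stop probabilities are supplied as a plain list instead of being predicted step by step, and a reduction factor of 1 is assumed so that one step equals one output frame.

```python
from typing import Iterable


def generate_until_stop(
    stop_probs: Iterable[float],
    input_length: int,
    threshold: float = 0.5,
    minlenratio: float = 0.0,
    maxlenratio: float = 30.0,
) -> int:
    """Return how many steps a toy auto-regressive decoder would run."""
    minlen = int(input_length * minlenratio)
    maxlen = int(input_length * maxlenratio)
    steps = 0
    for prob in stop_probs:
        steps += 1
        if steps >= maxlen:
            break  # hard upper bound on the output length
        if prob >= threshold and steps >= minlen:
            break  # stop token fired and the minimum length is satisfied
    return steps


probs = [0.1, 0.6, 0.2, 0.9, 0.1]
n_default = generate_until_stop(probs, input_length=2)               # 2
n_min = generate_until_stop(probs, input_length=2, minlenratio=1.5)  # 4
n_max = generate_until_stop(probs, input_length=2, maxlenratio=0.5)  # 1
```

With the defaults, generation stops at the first step whose stop probability reaches the threshold; raising `minlenratio` delays that stop, and lowering `maxlenratio` forces an earlier one.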