espnet2.svs.singing_tacotron.encoder.Encoder

class espnet2.svs.singing_tacotron.encoder.Encoder(idim, input_layer='embed', embed_dim=512, elayers=1, eunits=512, econv_layers=3, econv_chans=512, econv_filts=5, use_batch_norm=True, use_residual=False, dropout_rate=0.5, padding_idx=0)

Bases: Module

Encoder module of Spectrogram prediction network.

This is the encoder module of the spectrogram prediction network in Singing-Tacotron, which is described in "Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis" (https://arxiv.org/abs/2202.07907). The encoder converts either a sequence of characters or acoustic features into a sequence of hidden states.

Initialize Singing Tacotron encoder module.

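Parameters:
- idim (int) – Dimension of the inputs.
- input_layer (str) – Input layer type. Default: 'embed'.
- embed_dim (int, optional) – Dimension of the character embedding. Default: 512.
- elayers (int, optional) – Number of encoder BLSTM layers. Default: 1.
- eunits (int, optional) – Number of encoder BLSTM units. Default: 512.
- econv_layers (int, optional) – Number of encoder convolution layers. Default: 3.
- econv_filts (int, optional) – Filter size of the encoder convolutions. Default: 5.
- econv_chans (int, optional) – Number of channels of the encoder convolutions. Default: 512.
- use_batch_norm (bool, optional) – Whether to use batch normalization. Default: True.
- use_residual (bool, optional) – Whether to use residual connections. Default: False.
- dropout_rate (float, optional) – Dropout rate. Default: 0.5.
- padding_idx (int, optional) – Index treated as padding in the input ids. Default: 0.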
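
The output of forward is documented with the same time length Tmax as its input, which suggests the convolution stack preserves sequence length. The following is a minimal sketch of that arithmetic, not the library's code. It uses the standard 1-D convolution output-length formula and assumes stride 1 and "same" padding of (econv_filts - 1) // 2; that padding choice is an assumption, not something this page states. The helper names `conv1d_out_len` and `stacked_len` are hypothetical.

```python
def conv1d_out_len(length: int, kernel: int, padding: int,
                   stride: int = 1, dilation: int = 1) -> int:
    """Output length of a 1-D convolution (standard formula)."""
    return (length + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def stacked_len(length: int, kernel: int = 5, layers: int = 3) -> int:
    """Length after `layers` stacked convolutions with 'same' padding.

    Assumption: padding = (kernel - 1) // 2 and stride = 1, which keeps
    the length unchanged for odd kernel sizes.
    """
    padding = (kernel - 1) // 2
    for _ in range(layers):
        length = conv1d_out_len(length, kernel, padding)
    return length


# Defaults from the signature: econv_filts=5, econv_layers=3.
same_padded = stacked_len(100)
# For contrast: a single unpadded layer shortens the sequence.
unpadded = conv1d_out_len(100, kernel=5, padding=0)
```

Under these assumptions a 100-step input stays 100 steps long after three layers, whereas a single unpadded layer with the same filter size would drop 4 steps.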

forward(xs, ilens=None)

Calculate forward propagation.

Parameters:
- xs (Tensor) – Batch of the padded sequence. Either character ids (B, Tmax) or acoustic feature (B, Tmax, idim * encoder_reduction_factor). Padded value should be 0.
- ilens (LongTensor, optional) – Lengths of each input sequence in the batch (B,). Default: None.
Returns: Batch of the sequences of encoder states (B, Tmax, eunits), and batch of the lengths of each sequence (B,).
Return type: Tensor, LongTensor
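
As an illustration of the input layout described above, here is a minimal pure-Python sketch that turns variable-length id sequences into a zero-padded batch (B, Tmax) with a matching list of lengths (B,). The helper `pad_batch` is hypothetical and not part of ESPnet. Sorting the batch longest-first is an assumption made here because packed recurrent inputs commonly expect it; this page does not say whether the encoder requires it. The example ids are nonzero because 0 is the padding value.

```python
from typing import List, Sequence, Tuple


def pad_batch(seqs: Sequence[Sequence[int]],
              pad_id: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Right-pad id sequences to a common length and return their lengths."""
    if not seqs:
        return [], []
    # Assumption: order the batch longest-first.
    ordered = sorted(seqs, key=len, reverse=True)
    ilens = [len(s) for s in ordered]
    tmax = ilens[0]
    # Fill the tail of each shorter sequence with the padding value.
    xs = [list(s) + [pad_id] * (tmax - len(s)) for s in ordered]
    return xs, ilens


xs, ilens = pad_batch([[5, 2], [7, 3, 9, 1], [4]])
```

In actual use the two lists would be converted to tensors (for example a LongTensor of shape (B, Tmax) and a LongTensor of shape (B,)) before being passed as xs and ilens.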

inference(x, ilens)

Inference.

Parameters: x (Tensor) – The sequence of character ids (T,) or acoustic features (T, idim * encoder_reduction_factor).
Returns: The sequence of encoder states (T, eunits).
Return type: Tensor
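
The shapes above differ from forward only by the missing batch dimension. The sketch below shows one way an unbatched call could be expressed in terms of a batched one: add a batch axis of size 1, run the batched function, and take the first item. This is an illustration of the shape relationship under that assumption, not a statement of how the library implements inference. Both `inference_via_forward` and `toy_forward` are hypothetical stand-ins built on plain lists.

```python
from typing import Callable, List, Sequence, Tuple

States = List[List[float]]  # (T, D) for one sequence


def toy_forward(xs: List[List[int]],
                ilens: List[int]) -> Tuple[List[States], List[int]]:
    """Hypothetical stand-in for a batched encoder: (B, T) -> (B, T, 2)."""
    hs = [[[float(i), float(-i)] for i in seq] for seq in xs]
    return hs, list(ilens)


def inference_via_forward(forward: Callable, x: Sequence[int]) -> States:
    """Run a batched `forward` on a single unbatched sequence."""
    xs = [list(x)]      # (T,) -> (1, T)
    ilens = [len(x)]    # (1,)
    hs, _ = forward(xs, ilens)
    return hs[0]        # (1, T, D) -> (T, D)


states = inference_via_forward(toy_forward, [3, 1, 4])
```

The result has one state vector per input step, matching the documented (T, eunits) layout with a toy width of 2 in place of eunits.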