espnet2.svs package

espnet2.svs.abs_svs

Singing-voice-synthesis abstract class.

class espnet2.svs.abs_svs.AbsSVS(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, abc.ABC

SVS abstract class.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Calculate outputs and return the loss tensor.

abstract inference(text: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]

Return predicted output as a dict.

property require_raw_singing

Return whether or not raw_singing is required.

property require_vocoder

Return whether or not vocoder is required.

espnet2.svs.espnet_model

Singing-voice-synthesis ESPnet model.

class espnet2.svs.espnet_model.ESPnetSVSModel(text_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], feats_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], score_feats_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], label_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], pitch_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], ying_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], duration_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], energy_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], pitch_normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], energy_normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], svs: espnet2.svs.abs_svs.AbsSVS)[source]

Bases: espnet2.train.abs_espnet_model.AbsESPnetModel

ESPnet model for singing voice synthesis task.

Initialize ESPnetSVSModel module.

collect_feats(text: torch.Tensor, text_lengths: torch.Tensor, singing: torch.Tensor, singing_lengths: torch.Tensor, label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, phn_cnt: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, duration_phn: Optional[torch.Tensor] = None, duration_phn_lengths: Optional[torch.Tensor] = None, duration_ruled_phn: Optional[torch.Tensor] = None, duration_ruled_phn_lengths: Optional[torch.Tensor] = None, duration_syb: Optional[torch.Tensor] = None, duration_syb_lengths: Optional[torch.Tensor] = None, slur: Optional[torch.Tensor] = None, slur_lengths: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, energy_lengths: Optional[torch.Tensor] = None, ying: Optional[torch.Tensor] = None, ying_lengths: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, **kwargs) → Dict[str, torch.Tensor][source]

Calculate features and return them as a dict.

Parameters:
  • text (Tensor) – Text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • singing (Tensor) – Singing waveform tensor (B, T_wav).

  • singing_lengths (Tensor) – Singing length tensor (B,).

  • label (Optional[Tensor]) – Label tensor (B, T_label).

  • label_lengths (Optional[Tensor]) – Label length tensor (B,).

  • phn_cnt (Optional[Tensor]) – Number of phones in each syllable (B, T_syb)

  • midi (Optional[Tensor]) – Midi tensor (B, T_label).

  • midi_lengths (Optional[Tensor]) – Midi length tensor (B,).

  • Note: each duration* tensor below is a duration in time_shift.

  • duration_phn (Optional[Tensor]) – duration tensor (B, T_label).

  • duration_phn_lengths (Optional[Tensor]) – duration length tensor (B,).

  • duration_ruled_phn (Optional[Tensor]) – duration tensor (B, T_phone).

  • duration_ruled_phn_lengths (Optional[Tensor]) – duration length tensor (B,).

  • duration_syb (Optional[Tensor]) – duration tensor (B, T_syb).

  • duration_syb_lengths (Optional[Tensor]) – duration length tensor (B,).

  • slur (Optional[Tensor]) – slur tensor (B, T_slur).

  • slur_lengths (Optional[Tensor]) – slur length tensor (B,).

  • pitch (Optional[Tensor]) – Pitch tensor (B, T_wav). - f0 sequence

  • pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).

  • energy (Optional[Tensor]) – Energy tensor.

  • energy_lengths (Optional[Tensor]) – Energy length tensor (B,).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).

  • sids (Optional[Tensor]) – Speaker ID tensor (B, 1).

  • lids (Optional[Tensor]) – Language ID tensor (B, 1).

Returns:

Dict of features.

Return type:

Dict[str, Tensor]
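The (B, T_*) tensors and their (B,) length companions listed above follow the usual padded-batch convention. The following is a minimal sketch of that convention, using plain Python lists as stand-ins for tensors; the helper name `pad_batch` and the padding value 0 are assumptions for illustration and are not part of ESPnet.

```python
from typing import List, Sequence, Tuple


def pad_batch(
    seqs: Sequence[Sequence[int]], pad_value: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Pad variable-length sequences to a common length.

    Returns the padded batch with shape (B, T_max) and the
    original lengths with shape (B,).
    """
    lengths = [len(s) for s in seqs]
    t_max = max(lengths) if lengths else 0
    # Right-pad every sequence up to the longest one in the batch.
    padded = [list(s) + [pad_value] * (t_max - len(s)) for s in seqs]
    return padded, lengths


# Two hypothetical token-id sequences of different lengths.
text, text_lengths = pad_batch([[5, 9, 2], [7, 1]])
```

Here `text` plays the role of the (B, T_text) argument and `text_lengths` the role of the (B,) argument.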

forward(text: torch.Tensor, text_lengths: torch.Tensor, singing: torch.Tensor, singing_lengths: torch.Tensor, feats: Optional[torch.Tensor] = None, feats_lengths: Optional[torch.Tensor] = None, label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, phn_cnt: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, duration_phn: Optional[torch.Tensor] = None, duration_phn_lengths: Optional[torch.Tensor] = None, duration_ruled_phn: Optional[torch.Tensor] = None, duration_ruled_phn_lengths: Optional[torch.Tensor] = None, duration_syb: Optional[torch.Tensor] = None, duration_syb_lengths: Optional[torch.Tensor] = None, slur: Optional[torch.Tensor] = None, slur_lengths: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, energy_lengths: Optional[torch.Tensor] = None, ying: Optional[torch.Tensor] = None, ying_lengths: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, flag_IsValid=False, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Calculate outputs and return the loss tensor.

Parameters:
  • text (Tensor) – Text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • singing (Tensor) – Singing waveform tensor (B, T_wav).

  • singing_lengths (Tensor) – Singing length tensor (B,).

  • label (Optional[Tensor]) – Label tensor (B, T_label).

  • label_lengths (Optional[Tensor]) – Label length tensor (B,).

  • phn_cnt (Optional[Tensor]) – Number of phones in each syllable (B, T_syb)

  • midi (Optional[Tensor]) – Midi tensor (B, T_label).

  • midi_lengths (Optional[Tensor]) – Midi length tensor (B,).

  • duration_phn (Optional[Tensor]) – duration tensor (B, T_label).

  • duration_phn_lengths (Optional[Tensor]) – duration length tensor (B,).

  • duration_ruled_phn (Optional[Tensor]) – duration tensor (B, T_phone).

  • duration_ruled_phn_lengths (Optional[Tensor]) – duration length tensor (B,).

  • duration_syb (Optional[Tensor]) – duration tensor (B, T_syllable).

  • duration_syb_lengths (Optional[Tensor]) – duration length tensor (B,).

  • slur (Optional[Tensor]) – slur tensor (B, T_slur).

  • slur_lengths (Optional[Tensor]) – slur length tensor (B,).

  • pitch (Optional[Tensor]) – Pitch tensor (B, T_wav). - f0 sequence

  • pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).

  • energy (Optional[Tensor]) – Energy tensor.

  • energy_lengths (Optional[Tensor]) – Energy length tensor (B,).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).

  • sids (Optional[Tensor]) – Speaker ID tensor (B, 1).

  • lids (Optional[Tensor]) – Language ID tensor (B, 1).

  • kwargs – “utt_id” is among the input.

Returns:

  • Tensor – Loss scalar tensor.

  • Dict[str, float] – Statistics to be monitored.

  • Tensor – Weight tensor to summarize losses.

Return type:

Tuple[Tensor, Dict[str, float], Tensor]
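One plausible reading of the "weight tensor to summarize losses" is that a training loop averages per-batch losses weighted by this value (for example, the batch size). The sketch below illustrates that arithmetic only; it is an assumption about usage, not ESPnet trainer code, and `summarize_losses` is a hypothetical helper.

```python
from typing import Iterable, Tuple


def summarize_losses(batches: Iterable[Tuple[float, float]]) -> float:
    """Weighted mean of per-batch losses; each item is (loss, weight)."""
    total, total_weight = 0.0, 0.0
    for loss, weight in batches:
        total += loss * weight
        total_weight += weight
    if total_weight == 0:
        raise ValueError("total weight must be positive")
    return total / total_weight


# A batch of 8 utterances with loss 0.5 and a batch of 2 with loss 0.2.
avg = summarize_losses([(0.5, 8), (0.2, 2)])
```

With these made-up numbers the larger batch dominates the average, which is the point of carrying a weight alongside the loss.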

inference(text: torch.Tensor, singing: Optional[torch.Tensor] = None, label: Optional[torch.Tensor] = None, phn_cnt: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, duration_phn: Optional[torch.Tensor] = None, duration_ruled_phn: Optional[torch.Tensor] = None, duration_syb: Optional[torch.Tensor] = None, slur: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, **decode_config) → Dict[str, torch.Tensor][source]

Calculate features and return them as a dict.

Parameters:
  • text (Tensor) – Text index tensor (T_text).

  • singing (Tensor) – Singing waveform tensor (T_wav).

  • label (Optional[Tensor]) – Label tensor (T_label).

  • phn_cnt (Optional[Tensor]) – Number of phones in each syllable (T_syb)

  • midi (Optional[Tensor]) – Midi tensor (T_label).

  • duration_phn (Optional[Tensor]) – duration tensor (T_label).

  • duration_ruled_phn (Optional[Tensor]) – duration tensor (T_phone).

  • duration_syb (Optional[Tensor]) – duration tensor (T_phone).

  • slur (Optional[Tensor]) – slur tensor (T_phone).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (D,).

  • sids (Optional[Tensor]) – Speaker ID tensor (1,).

  • lids (Optional[Tensor]) – Language ID tensor (1,).

  • pitch (Optional[Tensor]) – Pitch tensor (T_wav).

  • energy (Optional[Tensor]) – Energy tensor.

Returns:

Dict of outputs.

Return type:

Dict[str, Tensor]

espnet2.svs.__init__

espnet2.svs.singing_tacotron.encoder

Singing Tacotron encoder related modules.

class espnet2.svs.singing_tacotron.encoder.Duration_Encoder(idim, embed_dim=512, dropout_rate=0.5, padding_idx=0)[source]

Bases: torch.nn.modules.module.Module

Duration_Encoder module of Spectrogram prediction network.

This is the duration encoder module of the spectrogram prediction network in Singing-Tacotron. It converts the sequence of durations and tempo features into a transition token.

Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis:

https://arxiv.org/abs/2202.07907

Initialize Singing-Tacotron encoder module.

Parameters:
  • idim (int) –

  • embed_dim (int, optional) –

  • dropout_rate (float, optional) –

forward(xs)[source]

Calculate forward propagation.

Parameters:

xs (Tensor) – Batch of the duration sequence (B, Tmax, feature_len).

Returns:

  • Tensor – Batch of the sequences of transition token (B, Tmax, 1).

  • LongTensor – Batch of lengths of each sequence (B,).

Return type:

Tuple[Tensor, LongTensor]

inference(x)[source]

Inference.

class espnet2.svs.singing_tacotron.encoder.Encoder(idim, input_layer='embed', embed_dim=512, elayers=1, eunits=512, econv_layers=3, econv_chans=512, econv_filts=5, use_batch_norm=True, use_residual=False, dropout_rate=0.5, padding_idx=0)[source]

Bases: torch.nn.modules.module.Module

Encoder module of Spectrogram prediction network.

This is the encoder module of the spectrogram prediction network in Singing Tacotron, which is described in “Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis”. It converts either a sequence of characters or acoustic features into a sequence of hidden states.

Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis:

https://arxiv.org/abs/2202.07907

Initialize Singing Tacotron encoder module.

Parameters:
  • idim (int) –

  • input_layer (str) – Input layer type.

  • embed_dim (int, optional) –

  • elayers (int, optional) –

  • eunits (int, optional) –

  • econv_layers (int, optional) –

  • econv_filts (int, optional) –

  • econv_chans (int, optional) –

  • use_batch_norm (bool, optional) –

  • use_residual (bool, optional) –

  • dropout_rate (float, optional) –

forward(xs, ilens=None)[source]

Calculate forward propagation.

Parameters:
  • xs (Tensor) – Batch of the padded sequence. Either character ids (B, Tmax) or acoustic feature (B, Tmax, idim * encoder_reduction_factor). Padded value should be 0.

  • ilens (LongTensor) – Batch of lengths of each input batch (B,).

Returns:

  • Tensor – Batch of the sequences of encoder states (B, Tmax, eunits).

  • LongTensor – Batch of lengths of each sequence (B,).

Return type:

Tuple[Tensor, LongTensor]

inference(x, ilens)[source]

Inference.

Parameters:

x (Tensor) – The sequence of character ids (T,) or acoustic feature (T, idim * encoder_reduction_factor).

Returns:

The sequences of encoder states (T, eunits).

Return type:

Tensor

espnet2.svs.singing_tacotron.encoder.encoder_init(m)[source]

Initialize encoder parameters.

espnet2.svs.singing_tacotron.__init__

espnet2.svs.singing_tacotron.decoder

Singing Tacotron decoder related modules.

class espnet2.svs.singing_tacotron.decoder.Decoder(idim, odim, att, dlayers=2, dunits=1024, prenet_layers=2, prenet_units=256, postnet_layers=5, postnet_chans=512, postnet_filts=5, output_activation_fn=None, cumulate_att_w=True, use_batch_norm=True, use_concate=True, dropout_rate=0.5, zoneout_rate=0.1, reduction_factor=1)[source]

Bases: torch.nn.modules.module.Module

Decoder module of Spectrogram prediction network.

This is the decoder module of the spectrogram prediction network in Singing Tacotron, which is described in https://arxiv.org/pdf/2202.07907v1.pdf. The decoder generates the sequence of features from the sequence of the hidden states.

Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis:

https://arxiv.org/pdf/2202.07907v1.pdf

Initialize Singing Tacotron decoder module.

Parameters:
  • idim (int) – Dimension of the inputs.

  • odim (int) – Dimension of the outputs.

  • att (torch.nn.Module) – Instance of attention class.

  • dlayers (int, optional) – The number of decoder lstm layers.

  • dunits (int, optional) – The number of decoder lstm units.

  • prenet_layers (int, optional) – The number of prenet layers.

  • prenet_units (int, optional) – The number of prenet units.

  • postnet_layers (int, optional) – The number of postnet layers.

  • postnet_filts (int, optional) – The filter size of the postnet.

  • postnet_chans (int, optional) – The number of postnet filter channels.

  • output_activation_fn (torch.nn.Module, optional) – Activation function for outputs.

  • cumulate_att_w (bool, optional) – Whether to cumulate previous attention weight.

  • use_batch_norm (bool, optional) – Whether to use batch normalization.

  • use_concate (bool, optional) – Whether to concatenate encoder embedding with decoder lstm outputs.

  • dropout_rate (float, optional) – Dropout rate.

  • zoneout_rate (float, optional) – Zoneout rate.

  • reduction_factor (int, optional) – Reduction factor.
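With a reduction factor r greater than 1, a Tacotron-style decoder emits r frames per decoding step, so target lengths are commonly truncated to a multiple of r before teacher forcing. The sketch below shows only that arithmetic, under this assumption; both helper names are hypothetical and the exact handling inside this Decoder may differ.

```python
from typing import List, Sequence


def trim_to_reduction_factor(olens: Sequence[int], reduction_factor: int) -> List[int]:
    """Truncate each target length to a multiple of the reduction factor."""
    if reduction_factor < 1:
        raise ValueError("reduction_factor must be >= 1")
    return [olen - olen % reduction_factor for olen in olens]


def decoder_steps(olen: int, reduction_factor: int) -> int:
    """Number of decoder steps needed when each step emits r frames."""
    return olen // reduction_factor


trimmed = trim_to_reduction_factor([10, 7, 3], reduction_factor=3)
steps = [decoder_steps(olen, 3) for olen in trimmed]
```

With the default `reduction_factor=1` the lengths are left unchanged and one frame is produced per step.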

forward(hs, hlens, trans_token, ys)[source]

Calculate forward propagation.

Parameters:
  • hs (Tensor) – Batch of the sequences of padded hidden states (B, Tmax, idim).

  • hlens (LongTensor) – Batch of lengths of each input batch (B,).

  • trans_token (Tensor) – Global transition token for duration (B, Tmax, 1).

  • ys (Tensor) – Batch of the sequences of padded target features (B, Lmax, odim).

Returns:

  • Tensor – Batch of output tensors after postnet (B, Lmax, odim).

  • Tensor – Batch of output tensors before postnet (B, Lmax, odim).

  • Tensor – Batch of logits of stop prediction (B, Lmax).

  • Tensor – Batch of attention weights (B, Lmax, Tmax).

Return type:

Tuple[Tensor, Tensor, Tensor, Tensor]

Note

This computation is performed in teacher-forcing manner.

inference(h, trans_token, threshold=0.5, minlenratio=0.0, maxlenratio=30.0, use_att_constraint=False, use_dynamic_filter=True, backward_window=1, forward_window=3)[source]

Generate the sequence of features given the sequences of characters.

Parameters:
  • h (Tensor) – Input sequence of encoder hidden states (T, C).

  • trans_token (Tensor) – Global transition token for duration.

  • threshold (float, optional) – Threshold to stop generation.

  • minlenratio (float, optional) – Minimum length ratio. If set to 1.0 and the length of input is 10, the minimum length of outputs will be 10 * 1 = 10.

  • maxlenratio (float, optional) – Maximum length ratio. If set to 10 and the length of input is 10, the maximum length of outputs will be 10 * 10 = 100.

  • use_att_constraint (bool) – Whether to apply attention constraint introduced in Deep Voice 3.

  • use_dynamic_filter (bool) – Whether to apply dynamic filter introduced in Singing Tacotron.

  • backward_window (int) – Backward window size in attention constraint.

  • forward_window (int) – Forward window size in attention constraint.

Returns:

  • Tensor – Output sequence of features (L, odim).

  • Tensor – Output sequence of stop probabilities (L,).

  • Tensor – Attention weights (L, T).

Return type:

Tuple[Tensor, Tensor, Tensor]

Note

This computation is performed in auto-regressive manner.
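The interaction between threshold and the two length ratios can be sketched as a simple stopping rule. This is a simplified illustration that assumes one output frame per step and takes a precomputed list of stop probabilities. The function name `decode_length` is hypothetical, and the real decoder computes each probability inside its auto-regressive loop.

```python
from typing import Iterable


def decode_length(
    stop_probs: Iterable[float],
    input_len: int,
    threshold: float = 0.5,
    minlenratio: float = 0.0,
    maxlenratio: float = 30.0,
) -> int:
    """Return how many decoder steps run before generation stops."""
    maxlen = int(input_len * maxlenratio)
    minlen = int(input_len * minlenratio)
    steps = 0
    for prob in stop_probs:
        steps += 1
        # Hard cap: never generate more than maxlen steps.
        if steps >= maxlen:
            break
        # Stop once the stop probability crosses the threshold,
        # but only after at least minlen steps have been generated.
        if prob >= threshold and steps >= minlen:
            break
    return steps


probs = [0.1, 0.6, 0.9]
early = decode_length(probs, input_len=6)
delayed = decode_length(probs, input_len=6, minlenratio=0.5)
capped = decode_length([0.1] * 10, input_len=4, maxlenratio=0.5)
```

In this sketch, raising minlenratio delays the stop even when the threshold is already exceeded, while maxlenratio bounds the output when the stop probability never crosses it.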

espnet2.svs.singing_tacotron.decoder.decoder_init(m)[source]

Initialize decoder parameters.

espnet2.svs.singing_tacotron.singing_tacotron

Singing Tacotron related modules for ESPnet2.

class espnet2.svs.singing_tacotron.singing_tacotron.singing_tacotron(idim: int, odim: int, midi_dim: int = 129, duration_dim: int = 500, embed_dim: int = 512, elayers: int = 1, eunits: int = 512, econv_layers: int = 3, econv_chans: int = 512, econv_filts: int = 5, atype: str = 'GDCA', adim: int = 512, aconv_chans: int = 32, aconv_filts: int = 15, cumulate_att_w: bool = True, dlayers: int = 2, dunits: int = 1024, prenet_layers: int = 2, prenet_units: int = 256, postnet_layers: int = 5, postnet_chans: int = 512, postnet_filts: int = 5, output_activation: Optional[str] = None, use_batch_norm: bool = True, use_concate: bool = True, use_residual: bool = False, reduction_factor: int = 1, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'concat', use_gst: bool = False, gst_tokens: int = 10, gst_heads: int = 4, gst_conv_layers: int = 6, gst_conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), gst_conv_kernel_size: int = 3, gst_conv_stride: int = 2, gst_gru_layers: int = 1, gst_gru_units: int = 128, dropout_rate: float = 0.5, zoneout_rate: float = 0.1, use_masking: bool = True, use_weighted_masking: bool = False, bce_pos_weight: float = 5.0, loss_type: str = 'L1', use_guided_attn_loss: bool = True, guided_attn_loss_sigma: float = 0.4, guided_attn_loss_lambda: float = 1.0)[source]

Bases: espnet2.svs.abs_svs.AbsSVS

Singing Tacotron module for end-to-end singing-voice-synthesis.

This is a module of the spectrogram prediction network in Singing Tacotron, described in “Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis”, which learns accurate alignment information automatically.

Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis:

https://arxiv.org/pdf/2202.07907v1.pdf

Initialize Singing Tacotron module.

Parameters:
  • idim (int) – Dimension of the label inputs.

  • odim (int) – Dimension of the outputs.

  • embed_dim (int) – Dimension of the token embedding.

  • elayers (int) – Number of encoder blstm layers.

  • eunits (int) – Number of encoder blstm units.

  • econv_layers (int) – Number of encoder conv layers.

  • econv_filts (int) – Filter size of encoder conv layers.

  • econv_chans (int) – Number of encoder conv filter channels.

  • dlayers (int) – Number of decoder lstm layers.

  • dunits (int) – Number of decoder lstm units.

  • prenet_layers (int) – Number of prenet layers.

  • prenet_units (int) – Number of prenet units.

  • postnet_layers (int) – Number of postnet layers.

  • postnet_filts (int) – Filter size of postnet layers.

  • postnet_chans (int) – Number of postnet filter channels.

  • output_activation (str) – Name of activation function for outputs.

  • adim (int) – Number of dimension of mlp in attention.

  • aconv_chans (int) – Number of attention conv filter channels.

  • aconv_filts (int) – Filter size of attention conv layers.

  • cumulate_att_w (bool) – Whether to cumulate previous attention weight.

  • use_batch_norm (bool) – Whether to use batch normalization.

  • use_concate (bool) – Whether to concat enc outputs w/ dec lstm outputs.

  • reduction_factor (int) – Reduction factor.

  • spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.

  • langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.

  • spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.

  • spk_embed_integration_type (str) – How to integrate speaker embedding.

  • use_gst (bool) – Whether to use global style token.

  • gst_tokens (int) – Number of GST embeddings.

  • gst_heads (int) – Number of heads in GST multihead attention.

  • gst_conv_layers (int) – Number of conv layers in GST.

  • gst_conv_chans_list (Sequence[int]) – List of the number of channels of conv layers in GST.

  • gst_conv_kernel_size (int) – Kernel size of conv layers in GST.

  • gst_conv_stride (int) – Stride size of conv layers in GST.

  • gst_gru_layers (int) – Number of GRU layers in GST.

  • gst_gru_units (int) – Number of GRU units in GST.

  • dropout_rate (float) – Dropout rate.

  • zoneout_rate (float) – Zoneout rate.

  • use_masking (bool) – Whether to mask padded part in loss calculation.

  • use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.

  • bce_pos_weight (float) – Weight of positive sample of stop token (only for use_masking=True).

  • loss_type (str) – Loss function type (“L1”, “L2”, or “L1+L2”).

  • use_guided_attn_loss (bool) – Whether to use guided attention loss.

  • guided_attn_loss_sigma (float) – Sigma in guided attention loss.

  • guided_attn_loss_lambda (float) – Lambda in guided attention loss.
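The roles of guided_attn_loss_sigma and guided_attn_loss_lambda can be illustrated with the standard guided attention formulation, where the penalty for attending from output step n to input step t is 1 - exp(-((n/N - t/T)^2) / (2 * sigma^2)). The sketch below is written with plain lists under the assumption that this module follows that formulation. The function names are hypothetical and masking of padded positions is omitted.

```python
import math
from typing import List, Sequence


def guided_attention_weights(ilen: int, olen: int, sigma: float = 0.4) -> List[List[float]]:
    """Penalty matrix of shape (olen, ilen); near-diagonal entries are close to 0."""
    return [
        [
            1.0 - math.exp(-((n / olen - t / ilen) ** 2) / (2 * sigma ** 2))
            for t in range(ilen)
        ]
        for n in range(olen)
    ]


def guided_attention_loss(
    att_w: Sequence[Sequence[float]], sigma: float = 0.4, lam: float = 1.0
) -> float:
    """Mean of penalty * attention weight, scaled by lambda."""
    olen, ilen = len(att_w), len(att_w[0])
    penalty = guided_attention_weights(ilen, olen, sigma)
    total = sum(
        penalty[n][t] * att_w[n][t] for n in range(olen) for t in range(ilen)
    )
    return lam * total / (olen * ilen)


diagonal = [[1.0, 0.0], [0.0, 1.0]]
anti_diagonal = [[0.0, 1.0], [1.0, 0.0]]
loss_diag = guided_attention_loss(diagonal)
loss_anti = guided_attention_loss(anti_diagonal)
```

A smaller sigma narrows the band around the diagonal that goes unpenalized, and lambda scales the contribution of this term to the total loss.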

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, label: Optional[Dict[str, torch.Tensor]] = None, label_lengths: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, melody_lengths: Optional[Dict[str, torch.Tensor]] = None, duration: Optional[Dict[str, torch.Tensor]] = None, duration_lengths: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, slur: torch.LongTensor = None, slur_lengths: torch.Tensor = None, ying: torch.Tensor = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, joint_training: bool = False, flag_IsValid=False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • text (LongTensor) – Batch of padded character ids (B, T_text).

  • text_lengths (LongTensor) – Batch of lengths of each input batch (B,).

  • feats (Tensor) – Batch of padded target features (B, T_feats, odim).

  • feats_lengths (LongTensor) – Batch of the lengths of each target (B,).

  • label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).

  • label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).

  • melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).

  • melody_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B, ).

  • pitch (FloatTensor) – Batch of padded f0 (B, Tmax).

  • pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B, ).

  • duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, Tmax).

  • duration_lengths (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of the lengths of padded duration (B, ).

  • slur (LongTensor) – Batch of padded slur (B, Tmax).

  • slur_lengths (LongTensor) – Batch of the lengths of padded slur (B, ).

  • spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).

  • sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).

  • lids (Optional[Tensor]) – Batch of language IDs (B, 1).

  • joint_training (bool) – Whether to perform joint training with vocoder.

Returns:

  • Tensor – Loss scalar value.

  • Dict – Statistics to be monitored.

  • Tensor – Weight value if not joint training else model outputs.

Return type:

Tuple[Tensor, Dict, Tensor]
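Several arguments of this method are dicts keyed by annotation source rather than single tensors. The sketch below assembles such inputs with plain lists standing in for LongTensors and checks the keys against those listed in the parameter descriptions. All values are made up for illustration, and `check_keys` is a hypothetical helper, not an ESPnet function.

```python
from typing import Dict, List

# Keys taken from the parameter descriptions above.
ALLOWED_KEYS = {
    "label": {"lab", "score"},
    "melody": {"lab", "score"},
    "duration": {"lab", "score_phn", "score_syb"},
}


def check_keys(name: str, value: Dict[str, List[List[int]]]) -> None:
    """Raise KeyError if the dict contains a key the docs do not list."""
    unknown = set(value) - ALLOWED_KEYS[name]
    if unknown:
        raise KeyError(f"unexpected keys for {name}: {sorted(unknown)}")


# One utterance (B = 1), padded with 0 up to Tmax = 4.
label = {"lab": [[3, 7, 7, 0]], "score": [[3, 7, 7, 0]]}
melody = {"lab": [[60, 62, 62, 0]], "score": [[60, 62, 62, 0]]}
duration = {
    "lab": [[4, 5, 6, 0]],
    "score_phn": [[4, 5, 6, 0]],
    "score_syb": [[4, 11, 11, 0]],
}

for name, value in (("label", label), ("melody", melody), ("duration", duration)):
    check_keys(name, value)
```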

inference(text: torch.Tensor, feats: Optional[torch.Tensor] = None, label: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, duration: Optional[Dict[str, torch.Tensor]] = None, slur: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 30.0, use_att_constraint: bool = False, use_dynamic_filter: bool = False, backward_window: int = 1, forward_window: int = 3, use_teacher_forcing: bool = False) → Dict[str, torch.Tensor][source]

Generate the sequence of features given the sequences of characters.

Parameters:
  • text (LongTensor) – Input sequence of characters (T_text,).

  • feats (Optional[Tensor]) – Feature sequence to extract style (N, idim).

  • label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).

  • melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).

  • pitch (FloatTensor) – Batch of padded f0 (Tmax).

  • duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (Tmax).

  • slur (LongTensor) – Batch of padded slur (B, Tmax).

  • spembs (Optional[Tensor]) – Speaker embedding (spk_embed_dim,).

  • sids (Optional[Tensor]) – Speaker ID (1,).

  • lids (Optional[Tensor]) – Language ID (1,).

  • threshold (float) – Threshold in inference.

  • minlenratio (float) – Minimum length ratio in inference.

  • maxlenratio (float) – Maximum length ratio in inference.

  • use_att_constraint (bool) – Whether to apply attention constraint.

  • use_dynamic_filter (bool) – Whether to apply dynamic filter.

  • backward_window (int) – Backward window in attention constraint or dynamic filter.

  • forward_window (int) – Forward window in attention constraint or dynamic filter.

  • use_teacher_forcing (bool) – Whether to use teacher forcing.

Returns:

Output dict including the following items:
  • feat_gen (Tensor): Output sequence of features (T_feats, odim).

  • prob (Tensor): Output sequence of stop probabilities (T_feats,).

  • att_w (Tensor): Attention weights (T_feats, T).

Return type:

Dict[str, Tensor]
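The backward_window and forward_window arguments can be pictured as a mask around the most recently attended input position: attention energies outside the window are suppressed before normalization. The sketch below assumes a half-open window [last - backward_window, last + forward_window) and a valid `last_attended_idx`. The function name is hypothetical and the exact boundaries used by the library may differ.

```python
import math
from typing import List, Sequence


def constrain_attention(
    energies: Sequence[float],
    last_attended_idx: int,
    backward_window: int = 1,
    forward_window: int = 3,
) -> List[float]:
    """Mask energies outside the window, then apply a softmax."""
    lo = max(last_attended_idx - backward_window, 0)
    hi = min(last_attended_idx + forward_window, len(energies))
    # Positions outside [lo, hi) get -inf so their softmax weight is 0.
    masked = [e if lo <= i < hi else -math.inf for i, e in enumerate(energies)]
    peak = max(masked)
    exps = [math.exp(e - peak) for e in masked]
    total = sum(exps)
    return [x / total for x in exps]


# Uniform energies over 6 input positions, last attended position 2.
weights = constrain_attention([0.0] * 6, last_attended_idx=2)
```

With the default windows, only positions 1 through 4 keep non-zero weight in this example, which keeps the alignment from jumping far backward or forward between steps.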

espnet2.svs.feats_extract.score_feats_extract

class espnet2.svs.feats_extract.score_feats_extract.FrameScoreFeats(fs: Union[int, str] = 22050, n_fft: int = 1024, win_length: int = 512, hop_length: int = 128, window: str = 'hann', center: bool = True)[source]

Bases: espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract

extra_repr()[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, duration: Optional[torch.Tensor] = None, duration_lengths: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

FrameScoreFeats forward function.

Parameters:
  • label – (Batch, Nsamples)

  • label_lengths – (Batch)

  • midi – (Batch, Nsamples)

  • midi_lengths – (Batch)

  • duration – (Batch, Nsamples)

  • duration_lengths – (Batch)

Returns:

output: (Batch, Frames)

get_parameters() → Dict[str, Any][source]

label_aggregate(input: torch.Tensor, input_lengths: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

label_aggregate function.

Parameters:
  • input – (Batch, Nsamples, Label_dim)

  • input_lengths – (Batch)

Returns:

output: (Batch, Frames, Label_dim)
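Going from (Batch, Nsamples, ...) to (Batch, Frames, ...) implies a sliding window controlled by win_length, hop_length and center. The sketch below shows the usual frame-count arithmetic and one plausible aggregation rule, a majority vote per window. The majority vote is a stand-in chosen for illustration, not necessarily what label_aggregate does, and both function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def num_frames(
    n_samples: int, win_length: int = 512, hop_length: int = 128, center: bool = True
) -> int:
    """Frame count of a sliding window, padding win_length // 2 on each side if center."""
    if center:
        n_samples += 2 * (win_length // 2)
    if n_samples < win_length:
        return 0
    return (n_samples - win_length) // hop_length + 1


def aggregate_labels(
    labels: Sequence[int], win_length: int = 4, hop_length: int = 2
) -> List[int]:
    """Pick the most common label inside each window (no padding)."""
    frames = []
    for start in range(0, len(labels) - win_length + 1, hop_length):
        window = labels[start:start + win_length]
        frames.append(Counter(window).most_common(1)[0][0])
    return frames


n = num_frames(1280)
frame_labels = aggregate_labels([1, 1, 1, 1, 1, 2, 2, 2, 2])
```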

output_size() → int[source]

espnet2.svs.feats_extract.score_feats_extract.ListsToTensor(xs)[source]

class espnet2.svs.feats_extract.score_feats_extract.SyllableScoreFeats(fs: Union[int, str] = 22050, n_fft: int = 1024, win_length: int = 512, hop_length: int = 128, window: str = 'hann', center: bool = True)[source]

Bases: espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract

extra_repr()[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, duration: Optional[torch.Tensor] = None, duration_lengths: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

SyllableScoreFeats forward function.

Parameters:
  • label – (Batch, Nsamples)

  • label_lengths – (Batch)

  • midi – (Batch, Nsamples)

  • midi_lengths – (Batch)

  • duration – (Batch, Nsamples)

  • duration_lengths – (Batch)

Returns:

output: (Batch, Frames)

get_parameters() → Dict[str, Any][source]

get_segments(label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, duration: Optional[torch.Tensor] = None, duration_lengths: Optional[torch.Tensor] = None)[source]

output_size() → int[source]

espnet2.svs.feats_extract.score_feats_extract.expand_to_frame(expand_len, len_size, label, midi, duration)[source]

espnet2.svs.feats_extract.__init__

espnet2.svs.xiaoice.loss

XiaoiceSing2 related loss module for ESPnet2.

class espnet2.svs.xiaoice.loss.XiaoiceSing2Loss(use_masking: bool = True, use_weighted_masking: bool = False)[source]

Bases: torch.nn.modules.module.Module

Loss function module for XiaoiceSing2.

Initialize feed-forward Transformer loss module.

Parameters:
  • use_masking (bool) – Whether to apply masking for padded part in loss calculation.

  • use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.

forward(after_outs: torch.Tensor, before_outs: torch.Tensor, d_outs: torch.Tensor, p_outs: torch.Tensor, v_outs: torch.Tensor, ys: torch.Tensor, ds: torch.Tensor, ps: torch.Tensor, vs: torch.Tensor, ilens: torch.Tensor, olens: torch.Tensor, loss_type: str = 'L1') → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • after_outs (Tensor) – Batch of outputs after postnets (B, T_feats, odim).

  • before_outs (Tensor) – Batch of outputs before postnets (B, T_feats, odim).

  • d_outs (LongTensor) – Batch of outputs of duration predictor (B, T_text).

  • p_outs (Tensor) – Batch of outputs of log_f0 (B, T_text, 1).

  • v_outs (Tensor) – Batch of outputs of VUV (B, T_text, 1).

  • ys (Tensor) – Batch of target features (B, T_feats, odim).

  • ds (LongTensor) – Batch of durations (B, T_text).

  • ps (Tensor) – Batch of target log_f0 (B, T_text, 1).

  • vs (Tensor) – Batch of target VUV (B, T_text, 1).

  • ilens (LongTensor) – Batch of the lengths of each input (B,).

  • olens (LongTensor) – Batch of the lengths of each target (B,).

  • loss_type (str) – Mel loss type (“L1” (MAE), “L2” (MSE) or “L1+L2”)

Returns:

  • Tensor – Mel loss value.

  • Tensor – Duration predictor loss value.

  • Tensor – Pitch predictor loss value.

  • Tensor – VUV predictor loss value.

Return type:

Tuple[Tensor, Tensor, Tensor, Tensor]
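The effect of masking padded frames and of the loss_type switch on the mel loss can be sketched with nested lists in place of tensors. This is a simplified illustration under the assumption that masking means "ignore frames at or beyond olens". The function name `masked_mel_loss` is hypothetical and the duration, pitch and VUV terms are omitted.

```python
from typing import Sequence


def masked_mel_loss(
    outs: Sequence[Sequence[Sequence[float]]],
    targets: Sequence[Sequence[Sequence[float]]],
    olens: Sequence[int],
    loss_type: str = "L1",
) -> float:
    """Mean error over non-padded frames; inputs are (B, T_feats, odim) nested lists."""
    abs_sum, sq_sum, count = 0.0, 0.0, 0
    for out_seq, tgt_seq, olen in zip(outs, targets, olens):
        # Only the first olen frames of each utterance are real; the rest is padding.
        for out_frame, tgt_frame in zip(out_seq[:olen], tgt_seq[:olen]):
            for o, y in zip(out_frame, tgt_frame):
                abs_sum += abs(o - y)
                sq_sum += (o - y) ** 2
                count += 1
    if count == 0:
        raise ValueError("no valid frames to average over")
    l1, l2 = abs_sum / count, sq_sum / count
    if loss_type == "L1":
        return l1
    if loss_type == "L2":
        return l2
    if loss_type == "L1+L2":
        return l1 + l2
    raise ValueError(f"unknown loss_type: {loss_type}")


# One utterance with two valid frames and one padded frame (odim = 1).
outs = [[[1.0], [2.0], [9.0]]]
targets = [[[0.0], [0.0], [0.0]]]
l1 = masked_mel_loss(outs, targets, olens=[2], loss_type="L1")
l2 = masked_mel_loss(outs, targets, olens=[2], loss_type="L2")
both = masked_mel_loss(outs, targets, olens=[2], loss_type="L1+L2")
```

The padded third frame, despite its large error, does not influence any of the three values.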

espnet2.svs.xiaoice.__init__

espnet2.svs.xiaoice.XiaoiceSing

XiaoiceSing related modules.

class espnet2.svs.xiaoice.XiaoiceSing.XiaoiceSing(idim: int, odim: int, midi_dim: int = 129, duration_dim: int = 500, adim: int = 384, aheads: int = 4, elayers: int = 6, eunits: int = 1536, dlayers: int = 6, dunits: int = 1536, postnet_layers: int = 5, postnet_chans: int = 512, postnet_filts: int = 5, postnet_dropout_rate: float = 0.5, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 1, use_scaled_pos_enc: bool = True, use_batch_norm: bool = True, encoder_normalize_before: bool = True, decoder_normalize_before: bool = True, encoder_concat_after: bool = False, decoder_concat_after: bool = False, duration_predictor_layers: int = 2, duration_predictor_chans: int = 384, duration_predictor_kernel_size: int = 3, duration_predictor_dropout_rate: float = 0.1, reduction_factor: int = 1, encoder_type: str = 'transformer', decoder_type: str = 'transformer', transformer_enc_dropout_rate: float = 0.1, transformer_enc_positional_dropout_rate: float = 0.1, transformer_enc_attn_dropout_rate: float = 0.1, transformer_dec_dropout_rate: float = 0.1, transformer_dec_positional_dropout_rate: float = 0.1, transformer_dec_attn_dropout_rate: float = 0.1, conformer_rel_pos_type: str = 'legacy', conformer_pos_enc_layer_type: str = 'rel_pos', conformer_self_attn_layer_type: str = 'rel_selfattn', conformer_activation_type: str = 'swish', use_macaron_style_in_conformer: bool = True, use_cnn_in_conformer: bool = True, zero_triu: bool = False, conformer_enc_kernel_size: int = 7, conformer_dec_kernel_size: int = 31, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'add', init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_masking: bool = False, use_weighted_masking: bool = False, loss_function: str = 'XiaoiceSing2', loss_type: str = 'L1', lambda_mel: float = 1, lambda_dur: float = 0.1, lambda_pitch: float = 0.01, lambda_vuv: float = 0.01)[source]

Bases: espnet2.svs.abs_svs.AbsSVS

XiaoiceSing module for Singing Voice Synthesis.

This is a module of XiaoiceSing, a high-quality singing voice synthesis system which employs an integrated network for spectrum, F0 and duration modeling. It follows the main architecture of FastSpeech while proposing some singing-specific designs:

  1. Add features from the musical score (e.g. note pitch and length).

  2. Add a residual connection in F0 prediction to attenuate off-key issues.

  3. Accumulate the durations of all the phonemes in a musical note to calculate the syllable duration loss for rhythm enhancement (syllable loss).
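The syllable loss in the third design point can be illustrated with a minimal sketch. It is not the ESPnet implementation: the helper names `syllable_durations` and `syllable_l1_loss`, the phoneme-to-note index list, and the use of a mean absolute error are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
from typing import List


def syllable_durations(phn_durations: List[int], phn_to_syllable: List[int]) -> List[int]:
    """Sum the phoneme durations (in frames) that belong to the same note."""
    n_syllables = max(phn_to_syllable) + 1 if phn_to_syllable else 0
    totals = [0] * n_syllables
    for dur, syb in zip(phn_durations, phn_to_syllable):
        totals[syb] += dur
    return totals


def syllable_l1_loss(pred: List[int], target: List[int], phn_to_syllable: List[int]) -> float:
    """Mean absolute error between accumulated predicted and target durations."""
    pred_syb = syllable_durations(pred, phn_to_syllable)
    tgt_syb = syllable_durations(target, phn_to_syllable)
    return sum(abs(p - t) for p, t in zip(pred_syb, tgt_syb)) / len(tgt_syb)


# Four phonemes: the first two belong to note 0, the last two to note 1.
mapping = [0, 0, 1, 1]
target = [3, 7, 4, 6]  # per-note totals: [10, 10]
pred = [4, 6, 5, 8]    # per-note totals: [10, 13]

print(syllable_durations(pred, mapping))        # [10, 13]
print(syllable_l1_loss(pred, target, mapping))  # 1.5
```

In this example the first note has phoneme-level errors that cancel out, so only the second note contributes to the syllable-level term.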

Initialize XiaoiceSing module.

Parameters:
  • idim (int) – Dimension of the label inputs.

  • odim (int) – Dimension of the outputs.

  • midi_dim (int) – Dimension of the midi inputs.

  • duration_dim (int) – Dimension of the duration inputs.

  • elayers (int) – Number of encoder layers.

  • eunits (int) – Number of encoder hidden units.

  • dlayers (int) – Number of decoder layers.

  • dunits (int) – Number of decoder hidden units.

  • postnet_layers (int) – Number of postnet layers.

  • postnet_chans (int) – Number of postnet channels.

  • postnet_filts (int) – Kernel size of postnet.

  • postnet_dropout_rate (float) – Dropout rate in postnet.

  • use_scaled_pos_enc (bool) – Whether to use trainable scaled pos encoding.

  • use_batch_norm (bool) – Whether to use batch normalization in encoder prenet.

  • encoder_normalize_before (bool) – Whether to apply layernorm layer before encoder block.

  • decoder_normalize_before (bool) – Whether to apply layernorm layer before decoder block.

  • encoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in encoder.

  • decoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in decoder.

  • duration_predictor_layers (int) – Number of duration predictor layers.

  • duration_predictor_chans (int) – Number of duration predictor channels.

  • duration_predictor_kernel_size (int) – Kernel size of duration predictor.

  • duration_predictor_dropout_rate (float) – Dropout rate in duration predictor.

  • reduction_factor (int) – Reduction factor.

  • encoder_type (str) – Encoder type (“transformer” or “conformer”).

  • decoder_type (str) – Decoder type (“transformer” or “conformer”).

  • transformer_enc_dropout_rate (float) – Dropout rate in encoder except attention and positional encoding.

  • transformer_enc_positional_dropout_rate (float) – Dropout rate after encoder positional encoding.

  • transformer_enc_attn_dropout_rate (float) – Dropout rate in encoder self-attention module.

  • transformer_dec_dropout_rate (float) – Dropout rate in decoder except attention & positional encoding.

  • transformer_dec_positional_dropout_rate (float) – Dropout rate after decoder positional encoding.

  • transformer_dec_attn_dropout_rate (float) – Dropout rate in decoder self-attention module.

  • spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.

  • langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.

  • spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.

  • spk_embed_integration_type – How to integrate speaker embedding.

  • init_type (str) – How to initialize transformer parameters.

  • init_enc_alpha (float) – Initial value of alpha in scaled pos encoding of the encoder.

  • init_dec_alpha (float) – Initial value of alpha in scaled pos encoding of the decoder.

  • use_masking (bool) – Whether to apply masking for padded part in loss calculation.

  • use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.

  • loss_function (str) – Loss functions (“FastSpeech1” or “XiaoiceSing2”)

  • loss_type (str) – Mel loss type (“L1” (MAE), “L2” (MSE) or “L1+L2”)

  • lambda_mel (float) – Loss scaling coefficient for Mel loss.

  • lambda_dur (float) – Loss scaling coefficient for duration loss.

  • lambda_pitch (float) – Loss scaling coefficient for pitch loss.

  • lambda_vuv (float) – Loss scaling coefficient for VUV loss.
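As a rough illustration of how the four lambda coefficients could interact, the sketch below assumes that each coefficient simply scales its loss term before the terms are summed. The function `total_loss` is hypothetical, the input loss values are made up, and the actual combination may differ depending on `loss_function`.

```python
def total_loss(
    mel_loss: float,
    dur_loss: float,
    pitch_loss: float,
    vuv_loss: float,
    lambda_mel: float = 1.0,
    lambda_dur: float = 0.1,
    lambda_pitch: float = 0.01,
    lambda_vuv: float = 0.01,
) -> float:
    """Weighted sum of the individual loss terms (assumed combination rule)."""
    return (
        lambda_mel * mel_loss
        + lambda_dur * dur_loss
        + lambda_pitch * pitch_loss
        + lambda_vuv * vuv_loss
    )


# Made-up loss values, combined with the default coefficients from the signature.
loss = total_loss(mel_loss=0.5, dur_loss=2.0, pitch_loss=10.0, vuv_loss=1.0)
print(round(loss, 4))
```

With the default coefficients the Mel term dominates, while the pitch and VUV terms act as small auxiliary signals.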

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, label: Optional[Dict[str, torch.Tensor]] = None, label_lengths: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, melody_lengths: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, duration_lengths: Optional[Dict[str, torch.Tensor]] = None, slur: torch.LongTensor = None, slur_lengths: torch.Tensor = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, joint_training: bool = False, flag_IsValid=False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • text (LongTensor) – Batch of padded character ids (B, T_text).

  • text_lengths (LongTensor) – Batch of lengths of each input (B,).

  • feats (Tensor) – Batch of padded target features (B, T_feats, odim).

  • feats_lengths (LongTensor) – Batch of the lengths of each target (B,).

  • label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).

  • label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).

  • melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).

  • melody_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B, ).

  • pitch (FloatTensor) – Batch of padded f0 (B, Tmax).

  • pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B, ).

  • duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, Tmax).

  • duration_lengths (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of the lengths of padded duration (B, ).

  • slur (LongTensor) – Batch of padded slur (B, Tmax).

  • slur_lengths (LongTensor) – Batch of the lengths of padded slur (B, ).

  • spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).

  • sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).

  • lids (Optional[Tensor]) – Batch of language IDs (B, 1).

  • joint_training (bool) – Whether to perform joint training with vocoder.

Returns:

  • Tensor: Loss scalar value.

  • Dict: Statistics to be monitored.

  • Tensor: Weight value if not joint training else model outputs.

Return type:

Tuple[Tensor, Dict[str, Tensor], Tensor]
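The shape conventions used above, a padded batch (B, Tmax) together with a lengths vector (B,), can be sketched as follows. The helper `pad_batch` is hypothetical and uses nested Python lists instead of tensors; the padding value 0 is an assumption.

```python
from typing import List, Tuple


def pad_batch(seqs: List[List[int]], pad_value: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Right-pad variable-length sequences to a common length Tmax."""
    lengths = [len(s) for s in seqs]
    t_max = max(lengths)
    padded = [s + [pad_value] * (t_max - len(s)) for s in seqs]
    return padded, lengths


# Two label-id sequences of different lengths.
padded, lengths = pad_batch([[5, 2, 9], [7]])
print(padded)   # [[5, 2, 9], [7, 0, 0]]  -> shape (B=2, Tmax=3)
print(lengths)  # [3, 1]                  -> shape (B=2,)
```

The lengths vector is what allows the padded positions to be ignored later, for example when masking the loss.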

inference(text: torch.Tensor, feats: Optional[torch.Tensor] = None, label: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, slur: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, use_teacher_forcing: torch.Tensor = False, joint_training: bool = False) → Dict[str, torch.Tensor][source]

Generate the sequence of features given the sequences of characters.

Parameters:
  • text (LongTensor) – Input sequence of characters (T_text,).

  • feats (Optional[Tensor]) – Feature sequence to extract style (N, idim).

  • durations (Optional[LongTensor]) – Groundtruth of duration (T_text + 1,).

  • label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).

  • melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).

  • pitch (FloatTensor) – Batch of padded f0 (B, Tmax).

  • duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (Tmax).

  • slur (LongTensor) – Batch of padded slur (B, Tmax).

  • spembs (Optional[Tensor]) – Speaker embedding (spk_embed_dim,).

  • sids (Optional[Tensor]) – Speaker ID (1,).

  • lids (Optional[Tensor]) – Language ID (1,).

  • alpha (float) – Alpha to control the speed.

Returns:

Output dict including the following items:
  • feat_gen (Tensor): Output sequence of features (T_feats, odim).

  • duration (Tensor): Duration sequence (T_text + 1,).

Return type:

Dict[str, Tensor]
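The relationship between the token-level duration sequence and the T_feats output frames can be sketched with a FastSpeech-style length regulator. This is an illustrative assumption, not the module's code: `length_regulate` is a hypothetical helper, and the way `alpha` scales the durations before rounding is assumed.

```python
from typing import List, Sequence


def length_regulate(
    hidden: Sequence[Sequence[float]],
    durations: Sequence[int],
    alpha: float = 1.0,
) -> List[List[float]]:
    """Repeat each token-level vector for its (scaled) number of frames."""
    if len(hidden) != len(durations):
        raise ValueError("hidden and durations must have the same length")
    frames: List[List[float]] = []
    for vec, dur in zip(hidden, durations):
        n_frames = max(int(round(dur * alpha)), 0)
        frames.extend(list(vec) for _ in range(n_frames))
    return frames


hidden = [[0.1], [0.2], [0.3]]  # three tokens, one feature each
durations = [2, 1, 3]           # frames per token

frames = length_regulate(hidden, durations)
print(len(frames))  # 6, i.e. sum(durations)

slow = length_regulate(hidden, durations, alpha=2.0)
print(len(slow))    # 12, every token lasts twice as long
```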

espnet2.svs.naive_rnn.__init__

espnet2.svs.naive_rnn.naive_rnn

Naive-SVS related modules.

class espnet2.svs.naive_rnn.naive_rnn.NaiveRNN(idim: int, odim: int, midi_dim: int = 129, embed_dim: int = 512, eprenet_conv_layers: int = 3, eprenet_conv_chans: int = 256, eprenet_conv_filts: int = 5, elayers: int = 3, eunits: int = 1024, ebidirectional: bool = True, midi_embed_integration_type: str = 'add', dlayers: int = 3, dunits: int = 1024, dbidirectional: bool = True, postnet_layers: int = 5, postnet_chans: int = 256, postnet_filts: int = 5, use_batch_norm: bool = True, reduction_factor: int = 1, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'add', eprenet_dropout_rate: float = 0.5, edropout_rate: float = 0.1, ddropout_rate: float = 0.1, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', use_masking: bool = False, use_weighted_masking: bool = False, loss_type: str = 'L1')[source]

Bases: espnet2.svs.abs_svs.AbsSVS

NaiveRNN-SVS module.

This is an implementation of a naive RNN for singing voice synthesis. The features from the music score are processed directly over the time domain to predict the singing voice features.

Initialize NaiveRNN module.

Parameters:
  • idim (int) – Dimension of the label inputs.

  • odim (int) – Dimension of the outputs.

  • midi_dim (int) – Dimension of the midi inputs.

  • embed_dim (int) – Dimension of the token embedding.

  • eprenet_conv_layers (int) – Number of prenet conv layers.

  • eprenet_conv_filts (int) – Filter size of prenet conv layers.

  • eprenet_conv_chans (int) – Number of prenet conv filter channels.

  • elayers (int) – Number of encoder layers.

  • eunits (int) – Number of encoder hidden units.

  • ebidirectional (bool) – Whether the encoder is bidirectional.

  • midi_embed_integration_type (str) – How to integrate midi information (“add” or “cat”).

  • dlayers (int) – Number of decoder lstm layers.

  • dunits (int) – Number of decoder lstm units.

  • dbidirectional (bool) – Whether the decoder is bidirectional.

  • postnet_layers (int) – Number of postnet layers.

  • postnet_filts (int) – Filter size of postnet.

  • postnet_chans (int) – Number of postnet filter channels.

  • use_batch_norm (bool) – Whether to use batch normalization.

  • reduction_factor (int) – Reduction factor.

  • spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.

  • langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.

  • spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.

  • spk_embed_integration_type (str) – How to integrate speaker embedding.

  • eprenet_dropout_rate (float) – Prenet dropout rate.

  • edropout_rate (float) – Encoder dropout rate.

  • ddropout_rate (float) – Decoder dropout rate.

  • postnet_dropout_rate (float) – Postnet dropout rate.

  • init_type (str) – How to initialize the model parameters.

  • use_masking (bool) – Whether to mask padded part in loss calculation.

  • use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.

  • loss_type (str) – Loss function type (“L1”, “L2”, or “L1+L2”).
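The difference between the “add” and “cat” integration types can be sketched on a single pair of embedding vectors. This is only an illustration of the two modes: the function `integrate` is hypothetical, and the real module may apply further layers (for example a projection) after combining the embeddings.

```python
from typing import List


def integrate(label_emb: List[float], midi_emb: List[float], mode: str = "add") -> List[float]:
    """Combine a label embedding and a midi embedding for one time step."""
    if mode == "add":
        if len(label_emb) != len(midi_emb):
            raise ValueError("'add' requires embeddings of equal dimension")
        return [a + b for a, b in zip(label_emb, midi_emb)]
    if mode == "cat":
        return list(label_emb) + list(midi_emb)
    raise ValueError(f"unknown integration type: {mode}")


label_emb = [1.0, 2.0]
midi_emb = [0.5, 0.5]

print(integrate(label_emb, midi_emb, "add"))  # [1.5, 2.5]           -> dimension stays 2
print(integrate(label_emb, midi_emb, "cat"))  # [1.0, 2.0, 0.5, 0.5] -> dimension becomes 4
```

“add” keeps the feature dimension unchanged, whereas “cat” grows it, so downstream layers must be sized accordingly.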

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, label: Optional[Dict[str, torch.Tensor]] = None, label_lengths: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, melody_lengths: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, duration_lengths: Optional[Dict[str, torch.Tensor]] = None, slur: torch.LongTensor = None, slur_lengths: torch.Tensor = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, flag_IsValid=False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • text (LongTensor) – Batch of padded character ids (B, Tmax).

  • text_lengths (LongTensor) – Batch of lengths of each input batch (B,).

  • feats (Tensor) – Batch of padded target features (B, Lmax, odim).

  • feats_lengths (LongTensor) – Batch of the lengths of each target (B,).

  • label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).

  • label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).

  • melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).

  • melody_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B, ).

  • pitch (FloatTensor) – Batch of padded f0 (B, Tmax).

  • pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B, ).

  • duration (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded duration (B, Tmax).

  • duration_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded duration (B, ).

  • slur (LongTensor) – Batch of padded slur (B, Tmax).

  • slur_lengths (LongTensor) – Batch of the lengths of padded slur (B, ).

  • spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).

  • sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).

  • lids (Optional[Tensor]) – Batch of language IDs (B, 1).

GS Fix:

Arguments from the forward function vs. **batch from espnet_model.py: label == durations | phone sequence, melody -> pitch sequence

Returns:

  • Tensor: Loss scalar value.

  • Dict: Statistics to be monitored.

  • Tensor: Weight value if not joint training else model outputs.

Return type:

Tuple[Tensor, Dict[str, Tensor], Tensor]

inference(text: torch.Tensor, feats: Optional[torch.Tensor] = None, label: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, slur: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, use_teacher_forcing: torch.Tensor = False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Generate the sequence of features from the input sequences.

Parameters:
  • text (LongTensor) – Batch of padded character ids (Tmax).

  • feats (Tensor) – Batch of padded target features (Lmax, odim).

  • label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).

  • melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).

  • pitch (FloatTensor) – Batch of padded f0 (Tmax).

  • slur (LongTensor) – Batch of padded slur (B, Tmax).

  • duration (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded duration (Tmax).

  • spembs (Optional[Tensor]) – Batch of speaker embeddings (spk_embed_dim).

  • sids (Optional[Tensor]) – Batch of speaker IDs (1).

  • lids (Optional[Tensor]) – Batch of language IDs (1).

Returns:

Output dict including the following items:
  • feat_gen (Tensor): Output sequence of features (T_feats, odim).

Return type:

Dict[str, Tensor]

class espnet2.svs.naive_rnn.naive_rnn.NaiveRNNLoss(use_masking=True, use_weighted_masking=False)[source]

Bases: torch.nn.modules.module.Module

Loss function module for Tacotron2.

Initialize Tacotron2 loss module.

Parameters:
  • use_masking (bool) – Whether to apply masking for padded part in loss calculation.

  • use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.

forward(after_outs, before_outs, ys, olens)[source]

Calculate forward propagation.

Parameters:
  • after_outs (Tensor) – Batch of outputs after postnets (B, Lmax, odim).

  • before_outs (Tensor) – Batch of outputs before postnets (B, Lmax, odim).

  • ys (Tensor) – Batch of padded target features (B, Lmax, odim).

  • olens (LongTensor) – Batch of the lengths of each target (B,).

Returns:

  • Tensor: L1 loss value.

  • Tensor: Mean square error loss value.

Return type:

Tensor
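A minimal sketch of how padded frames can be excluded when `use_masking` is enabled is given below. It assumes that the L1 and MSE values are each the sum of the before-postnet and after-postnet errors, averaged over the non-padded elements; the function `masked_l1_mse` is hypothetical and uses nested lists shaped (B, Lmax, odim) instead of tensors.

```python
from typing import List, Sequence, Tuple

Frames = List[List[List[float]]]  # nested lists shaped (B, Lmax, odim)


def masked_l1_mse(
    after_outs: Frames,
    before_outs: Frames,
    ys: Frames,
    olens: Sequence[int],
) -> Tuple[float, float]:
    """L1 and MSE over valid frames only; padded frames are skipped."""
    l1_sum = 0.0
    sq_sum = 0.0
    n_valid = 0
    for b, length in enumerate(olens):
        for t in range(length):  # frames with t >= length are padding
            for k, target in enumerate(ys[b][t]):
                for outs in (after_outs, before_outs):
                    err = outs[b][t][k] - target
                    l1_sum += abs(err)
                    sq_sum += err * err
                n_valid += 1
    return l1_sum / n_valid, sq_sum / n_valid


ys = [[[1.0], [2.0]], [[3.0], [0.0]]]  # second utterance has one valid frame
olens = [2, 1]
after_outs = [[[1.5], [2.0]], [[2.0], [9.0]]]   # 9.0 sits in the padded frame
before_outs = [[[1.0], [3.0]], [[3.0], [9.0]]]

l1, mse = masked_l1_mse(after_outs, before_outs, ys, olens)
print(round(l1, 4), round(mse, 4))
```

The large value 9.0 in the padded frame of the second utterance has no effect on either loss.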

espnet2.svs.naive_rnn.naive_rnn_dp

NaiveRNN-DP-SVS related modules.

class espnet2.svs.naive_rnn.naive_rnn_dp.NaiveRNNDP(idim: int, odim: int, midi_dim: int = 129, embed_dim: int = 512, duration_dim: int = 500, eprenet_conv_layers: int = 3, eprenet_conv_chans: int = 256, eprenet_conv_filts: int = 5, elayers: int = 3, eunits: int = 1024, ebidirectional: bool = True, midi_embed_integration_type: str = 'add', dlayers: int = 3, dunits: int = 1024, dbidirectional: bool = True, postnet_layers: int = 5, postnet_chans: int = 256, postnet_filts: int = 5, use_batch_norm: bool = True, duration_predictor_layers: int = 2, duration_predictor_chans: int = 384, duration_predictor_kernel_size: int = 3, duration_predictor_dropout_rate: float = 0.1, reduction_factor: int = 1, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'add', eprenet_dropout_rate: float = 0.5, edropout_rate: float = 0.1, ddropout_rate: float = 0.1, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', use_masking: bool = False, use_weighted_masking: bool = False)[source]

Bases: espnet2.svs.abs_svs.AbsSVS

NaiveRNNDP-SVS module.

This is an implementation of a naive RNN with duration prediction for singing voice synthesis. The features from the music score are processed directly over the time domain to predict the singing voice features.

Initialize NaiveRNNDP module.

Parameters:
  • idim (int) – Dimension of the label inputs.

  • odim (int) – Dimension of the outputs.

  • midi_dim (int) – Dimension of the midi inputs.

  • embed_dim (int) – Dimension of the token embedding.

  • eprenet_conv_layers (int) – Number of prenet conv layers.

  • eprenet_conv_filts (int) – Filter size of prenet conv layers.

  • eprenet_conv_chans (int) – Number of prenet conv filter channels.

  • elayers (int) – Number of encoder layers.

  • eunits (int) – Number of encoder hidden units.

  • ebidirectional (bool) – Whether the encoder is bidirectional.

  • midi_embed_integration_type (str) – How to integrate midi information (“add” or “cat”).

  • dlayers (int) – Number of decoder lstm layers.

  • dunits (int) – Number of decoder lstm units.

  • dbidirectional (bool) – Whether the decoder is bidirectional.

  • postnet_layers (int) – Number of postnet layers.

  • postnet_filts (int) – Filter size of postnet.

  • postnet_chans (int) – Number of postnet filter channels.

  • use_batch_norm (bool) – Whether to use batch normalization.

  • reduction_factor (int) – Reduction factor.

  • duration_predictor_layers (int) – Number of duration predictor layers.

  • duration_predictor_chans (int) – Number of duration predictor channels.

  • duration_predictor_kernel_size (int) – Kernel size of duration predictor.

  • duration_predictor_dropout_rate (float) – Dropout rate in duration predictor.

  • spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.

  • langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.

  • spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.

  • spk_embed_integration_type (str) – How to integrate speaker embedding.

  • eprenet_dropout_rate (float) – Prenet dropout rate.

  • edropout_rate (float) – Encoder dropout rate.

  • ddropout_rate (float) – Decoder dropout rate.

  • postnet_dropout_rate (float) – Postnet dropout rate.

  • init_type (str) – How to initialize the model parameters.

  • use_masking (bool) – Whether to mask padded part in loss calculation.

  • use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
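One plausible reading of weighted masking, assumed here rather than taken from the documented API, is that each utterance's error is first averaged over its own valid frames and the per-utterance means are then averaged over the batch, so that long utterances do not dominate the loss. The helper `weighted_masked_l1` is hypothetical and works on nested lists shaped (B, Lmax, odim).

```python
from typing import List, Sequence

Frames = List[List[List[float]]]  # nested lists shaped (B, Lmax, odim)


def weighted_masked_l1(outs: Frames, ys: Frames, olens: Sequence[int]) -> float:
    """Average of per-utterance mean absolute errors over valid frames."""
    total = 0.0
    for b, length in enumerate(olens):
        odim = len(ys[b][0])
        utt_sum = sum(
            abs(outs[b][t][k] - ys[b][t][k])
            for t in range(length)  # padded frames (t >= length) are skipped
            for k in range(odim)
        )
        total += utt_sum / (length * odim)
    return total / len(olens)


ys = [[[1.0], [2.0]], [[3.0], [0.0]]]
outs = [[[1.5], [2.0]], [[2.0], [9.0]]]
olens = [2, 1]

# Utterance 0 has mean error 0.25, utterance 1 has mean error 1.0.
print(weighted_masked_l1(outs, ys, olens))  # 0.625
```

For comparison, averaging over all three valid elements at once would give 0.5 for the same data, because the longer utterance then carries more weight.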

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, label: Optional[Dict[str, torch.Tensor]] = None, label_lengths: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, melody_lengths: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, duration_lengths: Optional[Dict[str, torch.Tensor]] = None, slur: torch.LongTensor = None, slur_lengths: torch.Tensor = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, joint_training: bool = False, flag_IsValid=False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • text (LongTensor) – Batch of padded character ids (B, Tmax).

  • text_lengths (LongTensor) – Batch of lengths of each input batch (B,).

  • feats (Tensor) – Batch of padded target features (B, Lmax, odim).

  • feats_lengths (LongTensor) – Batch of the lengths of each target (B,).

  • label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).

  • label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).

  • melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).

  • melody_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B, ).

  • pitch (FloatTensor) – Batch of padded f0 (B, Tmax).

  • pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B, ).

  • duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, Tmax).

  • duration_lengths (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of the lengths of padded duration (B, ).

  • slur (LongTensor) – Batch of padded slur (B, Tmax).

  • slur_lengths (LongTensor) – Batch of the lengths of padded slur (B, ).

  • spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).

  • sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).

  • lids (Optional[Tensor]) – Batch of language IDs (B, 1).

  • joint_training (bool) – Whether to perform joint training with vocoder.

GS Fix:

Arguments from the forward function vs. **batch from espnet_model.py: label == durations | phone sequence, melody -> pitch sequence

Returns:

  • Tensor: Loss scalar value.

  • Dict: Statistics to be monitored.

  • Tensor: Weight value if not joint training else model outputs.

Return type:

Tuple[Tensor, Dict[str, Tensor], Tensor]

inference(text: torch.Tensor, feats: Optional[torch.Tensor] = None, label: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, slur: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, joint_training: bool = False, use_teacher_forcing: torch.Tensor = False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Generate the sequence of features from the input sequences.

Parameters:
  • text (LongTensor) – Batch of padded character ids (Tmax).

  • feats (Tensor) – Batch of padded target features (Lmax, odim).

  • label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).

  • melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).

  • pitch (FloatTensor) – Batch of padded f0 (Tmax).

  • duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (Tmax).

  • slur (LongTensor) – Batch of padded slur (B, Tmax).

  • spembs (Optional[Tensor]) – Batch of speaker embeddings (spk_embed_dim).

  • sids (Optional[Tensor]) – Batch of speaker IDs (1).

  • lids (Optional[Tensor]) – Batch of language IDs (1).

Returns:

Output dict including the following items:
  • feat_gen (Tensor): Output sequence of features (T_feats, odim).

Return type:

Dict[str, Tensor]