espnet2.svs.naive_rnn.naive_rnn.NaiveRNN
class espnet2.svs.naive_rnn.naive_rnn.NaiveRNN(idim: int, odim: int, midi_dim: int = 129, embed_dim: int = 512, eprenet_conv_layers: int = 3, eprenet_conv_chans: int = 256, eprenet_conv_filts: int = 5, elayers: int = 3, eunits: int = 1024, ebidirectional: bool = True, midi_embed_integration_type: str = 'add', dlayers: int = 3, dunits: int = 1024, dbidirectional: bool = True, postnet_layers: int = 5, postnet_chans: int = 256, postnet_filts: int = 5, use_batch_norm: bool = True, reduction_factor: int = 1, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, spk_embed_integration_type: str = 'add', eprenet_dropout_rate: float = 0.5, edropout_rate: float = 0.1, ddropout_rate: float = 0.1, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', use_masking: bool = False, use_weighted_masking: bool = False, loss_type: str = 'L1')
Bases: AbsSVS
NaiveRNN-SVS module.
This is an implementation of a naive RNN for singing voice synthesis. Features from the music score are processed directly in the time domain to predict the singing voice features.
Initialize NaiveRNN module.
- Parameters:
- idim (int) – Dimension of the label inputs.
- odim (int) – Dimension of the outputs.
- midi_dim (int) – Dimension of the midi inputs.
- embed_dim (int) – Dimension of the token embedding.
- eprenet_conv_layers (int) – Number of prenet conv layers.
- eprenet_conv_filts (int) – Filter size of prenet conv layers.
- eprenet_conv_chans (int) – Number of prenet conv filter channels.
- elayers (int) – Number of encoder layers.
- eunits (int) – Number of encoder hidden units.
- ebidirectional (bool) – Whether the encoder is bidirectional.
- midi_embed_integration_type (str) – How to integrate midi information (“add” or “cat”).
- dlayers (int) – Number of decoder LSTM layers.
- dunits (int) – Number of decoder LSTM units.
- dbidirectional (bool) – Whether the decoder is bidirectional.
- postnet_layers (int) – Number of postnet layers.
- postnet_filts (int) – Filter size of postnet layers.
- postnet_chans (int) – Number of postnet filter channels.
- use_batch_norm (bool) – Whether to use batch normalization.
- reduction_factor (int) – Reduction factor.
- spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.
- langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
- spk_embed_integration_type (str) – How to integrate speaker embedding.
- eprenet_dropout_rate (float) – Prenet dropout rate.
- edropout_rate (float) – Encoder dropout rate.
- ddropout_rate (float) – Decoder dropout rate.
- postnet_dropout_rate (float) – Postnet dropout rate.
- init_type (str) – How to initialize model parameters.
- use_masking (bool) – Whether to mask padded part in loss calculation.
- use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
- loss_type (str) – Loss function type (“L1”, “L2”, or “L1+L2”).
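As a minimal sketch of how these arguments fit together, the keyword dictionary below overrides a few defaults and leaves the rest alone. The values of `idim` and `odim` (a 40-token label vocabulary and 80-dimensional features) and the helpers `check_conf` and `build_naive_rnn` are assumptions for illustration, not taken from an ESPnet recipe. The import inside `build_naive_rnn` requires ESPnet to be installed.

```python
# Hypothetical settings: idim/odim depend on your token list and feature extractor.
naive_rnn_conf = {
    "idim": 40,        # assumed size of the label vocabulary
    "odim": 80,        # assumed feature dimension, e.g. 80 mel bins
    "midi_dim": 129,   # default from the signature above
    "elayers": 2,      # smaller than the default of 3
    "eunits": 256,
    "dlayers": 2,
    "dunits": 256,
    "midi_embed_integration_type": "add",  # "add" or "cat"
    "loss_type": "L1",                     # "L1", "L2", or "L1+L2"
    "use_masking": True,
}

# Option values listed in the parameter descriptions above.
ALLOWED = {
    "midi_embed_integration_type": {"add", "cat"},
    "loss_type": {"L1", "L2", "L1+L2"},
}


def check_conf(conf):
    """Raise ValueError if a string option is outside the documented choices."""
    for key, allowed in ALLOWED.items():
        if key in conf and conf[key] not in allowed:
            raise ValueError(f"{key}={conf[key]!r} not in {sorted(allowed)}")
    return conf


def build_naive_rnn(conf):
    """Instantiate the model; the import is deferred because it needs ESPnet."""
    from espnet2.svs.naive_rnn.naive_rnn import NaiveRNN

    return NaiveRNN(**check_conf(conf))
```

Checking the string options before construction is a convenience of this sketch; it only catches typos in the two options whose choices are listed above.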
forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, label: Dict[str, Tensor] | None = None, label_lengths: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, melody_lengths: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, duration: Dict[str, Tensor] | None = None, duration_lengths: Dict[str, Tensor] | None = None, slur: LongTensor | None = None, slur_lengths: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, flag_IsValid=False) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Calculate forward propagation.
- Parameters:
- text (LongTensor) – Batch of padded character ids (B, Tmax).
- text_lengths (LongTensor) – Batch of lengths of each input batch (B,).
- feats (Tensor) – Batch of padded target features (B, Lmax, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).
- label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).
- melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).
- melody_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B, ).
- pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
- pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B, ).
- duration (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded duration (B, Tmax).
- duration_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded duration (B, ).
- slur (LongTensor) – Batch of padded slur (B, Tmax).
- slur_lengths (LongTensor) – Batch of the lengths of padded slur (B, ).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
- lids (Optional[Tensor]) – Batch of language IDs (B, 1).
GS Fix: arguments from the forward function vs. batch from espnet_model.py: label == durations | phone sequence; melody -> pitch sequence.
- Returns: Tensor: Loss scalar value. Dict: Statistics to be monitored. Tensor: Weight value if not joint training else model outputs.
- Return type: Tuple[Tensor, Dict[str, Tensor], Tensor]
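The padded-batch convention used by these arguments, and the effect of masking padded frames in the loss, can be sketched in plain Python. The helpers `pad_batch` and `masked_l1` are hypothetical names for illustration. In practice the batches are `torch` tensors and ESPnet's own collation and loss code do this work. The sketch assumes zero padding, scalar features instead of `odim`-dimensional ones, and a mean over unpadded positions.

```python
def pad_batch(seqs, pad_value=0):
    """Pad variable-length sequences to (B, Tmax) and return lengths (B,)."""
    lengths = [len(s) for s in seqs]
    tmax = max(lengths)
    padded = [list(s) + [pad_value] * (tmax - len(s)) for s in seqs]
    return padded, lengths


def masked_l1(pred, target, lengths):
    """Mean absolute error over unpadded frames only (scalar features)."""
    total, count = 0.0, 0
    for p_row, t_row, n in zip(pred, target, lengths):
        for p, t in zip(p_row[:n], t_row[:n]):
            total += abs(p - t)
            count += 1
    return total / count


# Two frame-level label sequences of different length.
label_ids, label_lengths = pad_batch([[5, 5, 9, 9], [7, 7]])
# label_ids     -> [[5, 5, 9, 9], [7, 7, 0, 0]]   shape (B=2, Tmax=4)
# label_lengths -> [4, 2]                          shape (B,)

target, feats_lengths = pad_batch([[1.0, 2.0, 3.0, 4.0], [1.0, 1.0]], pad_value=0.0)
pred = [[1.0, 2.0, 3.0, 2.0], [2.0, 1.0, 9.0, 9.0]]  # padded frames hold junk
loss = masked_l1(pred, target, feats_lengths)  # 0.5: junk in padded frames is ignored
```

Without the mask, the two junk frames in the second row would dominate the average, which is the situation `use_masking` is meant to avoid.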
inference(text: Tensor, feats: Tensor | None = None, label: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, duration: Dict[str, Tensor] | None = None, slur: Dict[str, Tensor] | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, use_teacher_forcing: Tensor = False) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Calculate forward propagation.
- Parameters:
- text (LongTensor) – Batch of padded character ids (Tmax).
- feats (Tensor) – Batch of padded target features (Lmax, odim).
- label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).
- melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).
- pitch (FloatTensor) – Batch of padded f0 (Tmax).
- slur (LongTensor) – Batch of padded slur (Tmax).
- duration (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded duration (Tmax).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (1).
- lids (Optional[Tensor]) – Batch of language IDs (1).
- Returns: Output dict including the following item: feat_gen (Tensor): Output sequence of features (T_feats, odim).
- Return type: Dict[str, Tensor]
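Because `inference` takes a single unbatched utterance, one way to prepare its arguments is to build the per-key dictionaries first and convert to tensors last. The sketch below rests on several assumptions: `build_inference_inputs` and its `to_tensor` hook are hypothetical, both the “lab” and “score” keys are filled with the same sequence, and which key the model actually reads is not stated in this reference.

```python
def build_inference_inputs(label_ids, midi_ids, durations, to_tensor=list):
    """Assemble keyword arguments shaped as in the parameter list above.

    `to_tensor` converts a Python list to the array type in use; with PyTorch
    one would pass e.g. `lambda x: torch.tensor(x, dtype=torch.long)`.
    """
    if not (len(label_ids) == len(midi_ids) == len(durations)):
        raise ValueError("label, melody and duration must share length Tmax")

    def both_keys(seq):
        # Assumption: reuse one sequence for both documented keys.
        return {"lab": to_tensor(seq), "score": to_tensor(seq)}

    return {
        "text": to_tensor(label_ids),
        "label": both_keys(label_ids),
        "melody": both_keys(midi_ids),
        "duration": both_keys(durations),
    }


inputs = build_inference_inputs(
    label_ids=[3, 3, 8, 8, 8],
    midi_ids=[60, 60, 62, 62, 62],  # MIDI note numbers, below midi_dim=129
    durations=[2, 2, 3, 3, 3],
)
# With a trained model (not constructed here), following the Returns entry above:
#     feat_gen = model.inference(**inputs)["feat_gen"]  # (T_feats, odim)
```

Keeping the conversion behind `to_tensor` lets the same helper be exercised without PyTorch, which is a choice made for this sketch only.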