espnet2.svs package¶

espnet2.svs.__init__¶

espnet2.svs.abs_svs¶

Singing-voice-synthesis abstract class.

class espnet2.svs.abs_svs.AbsSVS(*args, **kwargs)[source]¶

Bases: torch.nn.modules.module.Module, abc.ABC

SVS abstract class.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶: Calculate outputs and return the loss tensor.

abstract inference(text: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]¶: Return predicted output as a dict.

property require_raw_singing¶: Return whether or not raw_singing is required.

property require_vocoder¶: Return whether or not vocoder is required.
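The contract described above (two abstract methods plus two properties) can be illustrated with a torch-free stand-in. This is only a sketch: `SketchAbsSVS` and `ConstantSVS` are hypothetical names, plain Python lists stand in for torch.Tensor, and the property defaults shown are assumptions made for the example, not values confirmed by this page.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class SketchAbsSVS(ABC):
    """Hypothetical, torch-free stand-in mirroring the AbsSVS interface."""

    @abstractmethod
    def forward(
        self, text, text_lengths, feats, feats_lengths, **kwargs
    ) -> Tuple[float, Dict[str, float], float]:
        """Calculate outputs and return (loss, stats, weight)."""

    @abstractmethod
    def inference(self, text, **kwargs) -> Dict[str, List[float]]:
        """Return predicted output as a dict."""

    @property
    def require_raw_singing(self) -> bool:
        # Assumed default, for this sketch only.
        return False

    @property
    def require_vocoder(self) -> bool:
        # Assumed default, for this sketch only.
        return True


class ConstantSVS(SketchAbsSVS):
    """Toy model that predicts zero for every frame."""

    def forward(self, text, text_lengths, feats, feats_lengths, **kwargs):
        # Mean absolute error of an all-zero prediction over the valid frames.
        total, count = 0.0, 0
        for seq, n in zip(feats, feats_lengths):
            total += sum(abs(v) for v in seq[:n])
            count += n
        loss = total / count
        return loss, {"loss": loss}, float(len(feats))

    def inference(self, text, **kwargs):
        return {"feat_gen": [0.0] * len(text)}


model = ConstantSVS()
loss, stats, weight = model.forward(
    text=[[1, 2]], text_lengths=[2], feats=[[0.5, -0.5, 0.0]], feats_lengths=[2]
)
```

Instantiating `SketchAbsSVS` directly raises `TypeError`, which is the behavior an abstract base class is meant to enforce on incomplete subclasses.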

espnet2.svs.espnet_model¶

Singing-voice-synthesis ESPnet model.

class espnet2.svs.espnet_model.ESPnetSVSModel(text_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], feats_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], score_feats_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], label_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], pitch_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], ying_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], duration_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], energy_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], pitch_normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], energy_normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], svs: espnet2.svs.abs_svs.AbsSVS)[source]¶

Bases: espnet2.train.abs_espnet_model.AbsESPnetModel

ESPnet model for singing voice synthesis task.

Initialize ESPnetSVSModel module.

collect_feats(text: torch.Tensor, text_lengths: torch.Tensor, singing: torch.Tensor, singing_lengths: torch.Tensor, label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, phn_cnt: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, duration_phn: Optional[torch.Tensor] = None, duration_phn_lengths: Optional[torch.Tensor] = None, duration_ruled_phn: Optional[torch.Tensor] = None, duration_ruled_phn_lengths: Optional[torch.Tensor] = None, duration_syb: Optional[torch.Tensor] = None, duration_syb_lengths: Optional[torch.Tensor] = None, slur: Optional[torch.Tensor] = None, slur_lengths: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, energy_lengths: Optional[torch.Tensor] = None, ying: Optional[torch.Tensor] = None, ying_lengths: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, **kwargs) → Dict[str, torch.Tensor][source]¶

Calculate features and return them as a dict.

Parameters:

text (Tensor) – Text index tensor (B, T_text).
text_lengths (Tensor) – Text length tensor (B,).
singing (Tensor) – Singing waveform tensor (B, T_wav).
singing_lengths (Tensor) – Singing length tensor (B,).
label (Optional[Tensor]) – Label tensor (B, T_label).
label_lengths (Optional[Tensor]) – Label length tensor (B,).
phn_cnt (Optional[Tensor]) – Number of phones in each syllable (B, T_syb).
midi (Optional[Tensor]) – Midi tensor (B, T_label).
midi_lengths (Optional[Tensor]) – Midi length tensor (B,).
Note: the duration* arguments below are durations in time_shift.
duration_phn (Optional[Tensor]) – duration tensor (B, T_label).
duration_phn_lengths (Optional[Tensor]) – duration length tensor (B,).
duration_ruled_phn (Optional[Tensor]) – duration tensor (B, T_phone).
duration_ruled_phn_lengths (Optional[Tensor]) – duration length tensor (B,).
duration_syb (Optional[Tensor]) – duration tensor (B, T_syb).
duration_syb_lengths (Optional[Tensor]) – duration length tensor (B,).
slur (Optional[Tensor]) – slur tensor (B, T_slur).
slur_lengths (Optional[Tensor]) – slur length tensor (B,).
pitch (Optional[Tensor]) – Pitch tensor (B, T_wav). - f0 sequence
pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).
energy (Optional[Tensor]) – Energy tensor.
energy_lengths (Optional[Tensor]) – Energy length tensor (B,).
spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).
sids (Optional[Tensor]) – Speaker ID tensor (B, 1).
lids (Optional[Tensor]) – Language ID tensor (B, 1).

Returns:

Dict of features.

Return type:

Dict[str, Tensor]
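The shape annotations above follow a padded-batch convention: a `(B, T_*)` tensor holds B sequences padded to a common length, and the matching `(B,)` lengths tensor records how many entries of each row are real. A minimal sketch of that convention, using plain lists instead of tensors and a hypothetical helper `pad_batch` that is not part of ESPnet:

```python
from typing import List, Tuple


def pad_batch(seqs: List[List[int]], pad_value: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Pad variable-length sequences to a common length and keep the true lengths."""
    lengths = [len(s) for s in seqs]
    t_max = max(lengths)
    padded = [s + [pad_value] * (t_max - len(s)) for s in seqs]
    return padded, lengths


# Two utterances with 3 and 2 tokens become a (B=2, T_text=3) batch.
text, text_lengths = pad_batch([[5, 9, 2], [7, 1]])
```

Here `text` plays the role of the `(B, T_text)` argument and `text_lengths` the role of the `(B,)` argument.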

forward(text: torch.Tensor, text_lengths: torch.Tensor, singing: torch.Tensor, singing_lengths: torch.Tensor, feats: Optional[torch.Tensor] = None, feats_lengths: Optional[torch.Tensor] = None, label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, phn_cnt: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, duration_phn: Optional[torch.Tensor] = None, duration_phn_lengths: Optional[torch.Tensor] = None, duration_ruled_phn: Optional[torch.Tensor] = None, duration_ruled_phn_lengths: Optional[torch.Tensor] = None, duration_syb: Optional[torch.Tensor] = None, duration_syb_lengths: Optional[torch.Tensor] = None, slur: Optional[torch.Tensor] = None, slur_lengths: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, energy_lengths: Optional[torch.Tensor] = None, ying: Optional[torch.Tensor] = None, ying_lengths: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, flag_IsValid=False, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶

Calculate outputs and return the loss tensor.

Parameters:

text (Tensor) – Text index tensor (B, T_text).
text_lengths (Tensor) – Text length tensor (B,).
singing (Tensor) – Singing waveform tensor (B, T_wav).
singing_lengths (Tensor) – Singing length tensor (B,).
label (Optional[Tensor]) – Label tensor (B, T_label).
label_lengths (Optional[Tensor]) – Label length tensor (B,).
phn_cnt (Optional[Tensor]) – Number of phones in each syllable (B, T_syb).
midi (Optional[Tensor]) – Midi tensor (B, T_label).
midi_lengths (Optional[Tensor]) – Midi length tensor (B,).
duration_phn (Optional[Tensor]) – duration tensor (B, T_label).
duration_phn_lengths (Optional[Tensor]) – duration length tensor (B,).
duration_ruled_phn (Optional[Tensor]) – duration tensor (B, T_phone).
duration_ruled_phn_lengths (Optional[Tensor]) – duration length tensor (B,).
duration_syb (Optional[Tensor]) – duration tensor (B, T_syllable).
duration_syb_lengths (Optional[Tensor]) – duration length tensor (B,).
slur (Optional[Tensor]) – slur tensor (B, T_slur).
slur_lengths (Optional[Tensor]) – slur length tensor (B,).
pitch (Optional[Tensor]) – Pitch tensor (B, T_wav). - f0 sequence
pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).
energy (Optional[Tensor]) – Energy tensor.
energy_lengths (Optional[Tensor]) – Energy length tensor (B,).
spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).
sids (Optional[Tensor]) – Speaker ID tensor (B, 1).
lids (Optional[Tensor]) – Language ID tensor (B, 1).
kwargs – “utt_id” is among the input.

Returns:

A tuple of three items: the loss scalar tensor (Tensor), a dict of statistics to be monitored, and the weight tensor used to summarize losses (Tensor).

Return type:

Tuple[Tensor, Dict[str, Tensor], Tensor]
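The third returned item is described as a weight used to summarize losses. One way such a weight could be used is a weighted average of per-batch losses. The sketch below assumes the weight is the number of utterances in each batch; that assumption, the helper name `weighted_mean`, and the sample numbers are all illustrative, not taken from the ESPnet trainer.

```python
from typing import List


def weighted_mean(losses: List[float], weights: List[float]) -> float:
    """Average per-batch losses, giving larger batches proportionally more influence."""
    total_weight = sum(weights)
    return sum(l * w for l, w in zip(losses, weights)) / total_weight


# Each entry mimics a (loss, stats, weight) tuple returned by forward().
batch_outputs = [(0.50, {"loss": 0.50}, 4.0), (0.25, {"loss": 0.25}, 12.0)]
summary = weighted_mean(
    [out[0] for out in batch_outputs], [out[2] for out in batch_outputs]
)
```

With these inputs the smaller batch contributes a quarter of the total weight, so the summary sits closer to the loss of the larger batch.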

inference(text: torch.Tensor, singing: Optional[torch.Tensor] = None, label: Optional[torch.Tensor] = None, phn_cnt: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, duration_phn: Optional[torch.Tensor] = None, duration_ruled_phn: Optional[torch.Tensor] = None, duration_syb: Optional[torch.Tensor] = None, slur: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, **decode_config) → Dict[str, torch.Tensor][source]¶

Calculate features and return them as a dict.

Parameters:

text (Tensor) – Text index tensor (T_text).
singing (Tensor) – Singing waveform tensor (T_wav).
label (Optional[Tensor]) – Label tensor (T_label).
phn_cnt (Optional[Tensor]) – Number of phones in each syllable (T_syb).
midi (Optional[Tensor]) – Midi tensor (T_label).
duration_phn (Optional[Tensor]) – duration tensor (T_label).
duration_ruled_phn (Optional[Tensor]) – duration tensor (T_phone).
duration_syb (Optional[Tensor]) – duration tensor (T_phone).
slur (Optional[Tensor]) – slur tensor (T_phone).
spembs (Optional[Tensor]) – Speaker embedding tensor (D,).
sids (Optional[Tensor]) – Speaker ID tensor (1,).
lids (Optional[Tensor]) – Language ID tensor (1,).
pitch (Optional[Tensor]) – Pitch tensor (T_wav).
energy (Optional[Tensor]) – Energy tensor.

Returns:

Dict of outputs.

Return type:

Dict[str, Tensor]

espnet2.svs.feats_extract.__init__¶

espnet2.svs.feats_extract.score_feats_extract¶

class espnet2.svs.feats_extract.score_feats_extract.FrameScoreFeats(fs: Union[int, str] = 22050, n_fft: int = 1024, win_length: int = 512, hop_length: int = 128, window: str = 'hann', center: bool = True)[source]¶

Bases: espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract

extra_repr()[source]¶

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, duration: Optional[torch.Tensor] = None, duration_lengths: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶

FrameScoreFeats forward function.

Parameters:

label – (Batch, Nsamples)
label_lengths – (Batch)
midi – (Batch, Nsamples)
midi_lengths – (Batch)
duration – (Batch, Nsamples)
duration_lengths – (Batch)

Returns:

output – (Batch, Frames)

get_parameters() → Dict[str, Any][source]¶

label_aggregate(input: torch.Tensor, input_lengths: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶

label_aggregate function.

Parameters:

input – (Batch, Nsamples, Label_dim)
input_lengths – (Batch)

Returns:

output – (Batch, Frames, Label_dim)
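The change from `Nsamples` to `Frames` can be sketched as follows. This is not the ESPnet implementation: it assumes the same framing as a centered short-time transform (pad by `win_length // 2` on each side, then slide a window by `hop_length`) and, as one plausible aggregation rule, keeps the label at the center of each frame. The helpers `frame_count` and `aggregate_center` are hypothetical, and only a single 1-D sequence is handled.

```python
from typing import List


def frame_count(n_samples: int, win_length: int = 512, hop_length: int = 128,
                center: bool = True) -> int:
    """Number of frames produced by sliding a window over n_samples."""
    if center:
        n_samples += 2 * (win_length // 2)
    return (n_samples - win_length) // hop_length + 1


def aggregate_center(labels: List[int], win_length: int, hop_length: int) -> List[int]:
    """Keep the label at the center of each centered frame (edges are repeated as padding)."""
    pad = win_length // 2
    padded = [labels[0]] * pad + labels + [labels[-1]] * pad
    n_frames = (len(padded) - win_length) // hop_length + 1
    return [padded[i * hop_length + pad] for i in range(n_frames)]


# Eight sample-level labels reduced to frame level with a tiny window.
frames = aggregate_center([1, 1, 1, 1, 2, 2, 2, 2], win_length=4, hop_length=2)
```

With the default `win_length=512` and `hop_length=128`, 1024 samples map to 9 frames when centered and 5 frames otherwise under this framing assumption.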

output_size() → int[source]¶

espnet2.svs.feats_extract.score_feats_extract.ListsToTensor(xs)[source]¶

class espnet2.svs.feats_extract.score_feats_extract.SyllableScoreFeats(fs: Union[int, str] = 22050, n_fft: int = 1024, win_length: int = 512, hop_length: int = 128, window: str = 'hann', center: bool = True)[source]¶

Bases: espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract

extra_repr()[source]¶

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, duration: Optional[torch.Tensor] = None, duration_lengths: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶

SyllableScoreFeats forward function.

Parameters:

label – (Batch, Nsamples)
label_lengths – (Batch)
midi – (Batch, Nsamples)
midi_lengths – (Batch)
duration – (Batch, Nsamples)
duration_lengths – (Batch)

Returns:

output – (Batch, Frames)

get_parameters() → Dict[str, Any][source]¶

get_segments(label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, duration: Optional[torch.Tensor] = None, duration_lengths: Optional[torch.Tensor] = None)[source]¶

output_size() → int[source]¶

espnet2.svs.feats_extract.score_feats_extract.expand_to_frame(expand_len, len_size, label, midi, duration)[source]¶
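This page does not document what `expand_to_frame` does with its arguments. As a general illustration of the idea its name suggests, the sketch below repeats each note-level label and MIDI value for as many frames as its duration. The helper `expand_by_duration` is hypothetical and its signature deliberately differs from the real function.

```python
from typing import List, Tuple


def expand_by_duration(
    label: List[int], midi: List[int], duration: List[int]
) -> Tuple[List[int], List[int]]:
    """Repeat each (label, midi) pair for `duration` frames."""
    frame_label: List[int] = []
    frame_midi: List[int] = []
    for lab, note, dur in zip(label, midi, duration):
        frame_label.extend([lab] * dur)
        frame_midi.extend([note] * dur)
    return frame_label, frame_midi


# Two notes lasting 2 and 3 frames become 5 frame-level entries.
frame_label, frame_midi = expand_by_duration(
    label=[3, 7], midi=[60, 62], duration=[2, 3]
)
```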

espnet2.svs.naive_rnn.__init__¶

espnet2.svs.naive_rnn.naive_rnn_dp¶

NaiveRNN-DP-SVS related modules.

class espnet2.svs.naive_rnn.naive_rnn_dp.NaiveRNNDP(idim: int, odim: int, midi_dim: int = 129, embed_dim: int = 512, duration_dim: int = 500, eprenet_conv_layers: int = 3, eprenet_conv_chans: int = 256, eprenet_conv_filts: int = 5, elayers: int = 3, eunits: int = 1024, ebidirectional: bool = True, midi_embed_integration_type: str = 'add', dlayers: int = 3, dunits: int = 1024, dbidirectional: bool = True, postnet_layers: int = 5, postnet_chans: int = 256, postnet_filts: int = 5, use_batch_norm: bool = True, duration_predictor_layers: int = 2, duration_predictor_chans: int = 384, duration_predictor_kernel_size: int = 3, duration_predictor_dropout_rate: float = 0.1, reduction_factor: int = 1, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'add', eprenet_dropout_rate: float = 0.5, edropout_rate: float = 0.1, ddropout_rate: float = 0.1, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', use_masking: bool = False, use_weighted_masking: bool = False)[source]¶

Bases: espnet2.svs.abs_svs.AbsSVS

NaiveRNNDP-SVS module.

This is an implementation of a naive RNN with duration prediction for singing voice synthesis. The features are processed directly in the time domain from the music score to predict the singing voice features.

Initialize NaiveRNNDP module.

Parameters:

idim (int) – Dimension of the label inputs.
odim (int) – Dimension of the outputs.
midi_dim (int) – Dimension of the midi inputs.
embed_dim (int) – Dimension of the token embedding.
eprenet_conv_layers (int) – Number of prenet conv layers.
eprenet_conv_filts (int) – Number of prenet conv filter size.
eprenet_conv_chans (int) – Number of prenet conv filter channels.
elayers (int) – Number of encoder layers.
eunits (int) – Number of encoder hidden units.
ebidirectional (bool) – If bidirectional in encoder.
midi_embed_integration_type (str) – how to integrate midi information, (“add” or “cat”).
dlayers (int) – Number of decoder lstm layers.
dunits (int) – Number of decoder lstm units.
dbidirectional (bool) – if bidirectional in decoder.
postnet_layers (int) – Number of postnet layers.
postnet_filts (int) – Number of postnet filter size.
postnet_chans (int) – Number of postnet filter channels.
use_batch_norm (bool) – Whether to use batch normalization.
reduction_factor (int) – Reduction factor.
duration_predictor_layers (int) – Number of duration predictor layers.
duration_predictor_chans (int) – Number of duration predictor channels.
duration_predictor_kernel_size (int) – Kernel size of duration predictor.
duration_predictor_dropout_rate (float) – Dropout rate in duration predictor.
Extra-embedding-related parameters:
spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.
langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.
spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
spk_embed_integration_type (str) – How to integrate speaker embedding.
eprenet_dropout_rate (float) – Prenet dropout rate.
edropout_rate (float) – Encoder dropout rate.
ddropout_rate (float) – Decoder dropout rate.
postnet_dropout_rate (float) – Postnet dropout_rate.
init_type (str) – How to initialize transformer parameters.
use_masking (bool) – Whether to mask padded part in loss calculation.
use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
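The two values accepted by `midi_embed_integration_type` differ in how they affect the feature dimension: adding keeps it unchanged, while concatenating sums the two dimensions. A minimal sketch on plain lists, where `integrate` is a hypothetical helper rather than ESPnet code:

```python
from typing import List


def integrate(label_emb: List[float], midi_emb: List[float], how: str = "add") -> List[float]:
    """Combine a label embedding and a midi embedding by addition or concatenation."""
    if how == "add":
        if len(label_emb) != len(midi_emb):
            raise ValueError("'add' needs embeddings of equal dimension")
        return [a + b for a, b in zip(label_emb, midi_emb)]
    if how == "cat":
        return label_emb + midi_emb
    raise ValueError(f"unknown integration type: {how}")


added = integrate([1.0, 2.0], [0.5, 0.5], "add")   # dimension stays 2
catted = integrate([1.0, 2.0], [0.5, 0.5], "cat")  # dimension becomes 4
```

The practical consequence is that "cat" changes the input size seen by the following layer, whereas "add" requires both embeddings to share one size.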

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, label: Optional[Dict[str, torch.Tensor]] = None, label_lengths: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, melody_lengths: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, duration_lengths: Optional[Dict[str, torch.Tensor]] = None, slur: torch.LongTensor = None, slur_lengths: torch.Tensor = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, joint_training: bool = False, flag_IsValid=False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶

Calculate forward propagation.

Parameters:

text (LongTensor) – Batch of padded character ids (B, Tmax).
text_lengths (LongTensor) – Batch of lengths of each input batch (B,).
feats (Tensor) – Batch of padded target features (B, Lmax, odim).
feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).
label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).
melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).
melody_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B, ).
pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B, ).
duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, Tmax).
duration_lengths (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of the lengths of padded duration (B, ).
slur (LongTensor) – Batch of padded slur (B, Tmax).
slur_lengths (LongTensor) – Batch of the lengths of padded slur (B, ).
spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
lids (Optional[Tensor]) – Batch of language IDs (B, 1).
joint_training (bool) – Whether to perform joint training with vocoder.

GS Fix: arguments from the forward function vs. **batch from espnet_model.py: label == durations | phone sequence; melody -> pitch sequence.

Returns:

Loss scalar value. Dict: Statistics to be monitored. Tensor: Weight value if not joint training else model outputs.

Return type:

Tensor
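Several arguments of this `forward` are dicts keyed by annotation source rather than bare tensors. The sketch below shows what such a batch could look like and checks the keys against those listed in the parameter descriptions. Nested lists stand in for LongTensor values, and `ALLOWED_KEYS` and `check_keys` are hypothetical names introduced for illustration.

```python
from typing import Dict, List

# Keys taken from the parameter descriptions of NaiveRNNDP.forward.
ALLOWED_KEYS = {
    "label": {"lab", "score"},
    "melody": {"lab", "score"},
    "duration": {"lab", "score_phn", "score_syb"},
}


def check_keys(name: str, value: Dict[str, List[List[int]]]) -> None:
    """Raise KeyError if a dict argument carries a key the docs do not list."""
    unknown = set(value) - ALLOWED_KEYS[name]
    if unknown:
        raise KeyError(f"{name} has unexpected keys: {sorted(unknown)}")


# One utterance (B=1) with two real tokens and one padding position (Tmax=3).
batch = {
    "label": {"lab": [[3, 7, 0]], "score": [[3, 7, 0]]},
    "melody": {"lab": [[60, 62, 0]], "score": [[60, 62, 0]]},
    "duration": {
        "lab": [[2, 3, 0]],
        "score_phn": [[2, 3, 0]],
        "score_syb": [[5, 5, 0]],
    },
}
for name, value in batch.items():
    check_keys(name, value)
```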

inference(text: torch.Tensor, feats: Optional[torch.Tensor] = None, label: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, slur: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, joint_training: bool = False, use_teacher_forcing: torch.Tensor = False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶

Calculate forward propagation.

Parameters:

text (LongTensor) – Batch of padded character ids (Tmax).
feats (Tensor) – Batch of padded target features (Lmax, odim).
label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).
melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).
pitch (FloatTensor) – Batch of padded f0 (Tmax).
duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (Tmax).
slur (LongTensor) – Batch of padded slur (B, Tmax).
spembs (Optional[Tensor]) – Batch of speaker embeddings (spk_embed_dim).
sids (Optional[Tensor]) – Batch of speaker IDs (1).
lids (Optional[Tensor]) – Batch of language IDs (1).

Returns:

Output dict including the following items:

feat_gen (Tensor): Output sequence of features (T_feats, odim).

Return type:

Dict[str, Tensor]

espnet2.svs.naive_rnn.naive_rnn¶

Naive-SVS related modules.

class espnet2.svs.naive_rnn.naive_rnn.NaiveRNN(idim: int, odim: int, midi_dim: int = 129, embed_dim: int = 512, eprenet_conv_layers: int = 3, eprenet_conv_chans: int = 256, eprenet_conv_filts: int = 5, elayers: int = 3, eunits: int = 1024, ebidirectional: bool = True, midi_embed_integration_type: str = 'add', dlayers: int = 3, dunits: int = 1024, dbidirectional: bool = True, postnet_layers: int = 5, postnet_chans: int = 256, postnet_filts: int = 5, use_batch_norm: bool = True, reduction_factor: int = 1, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'add', eprenet_dropout_rate: float = 0.5, edropout_rate: float = 0.1, ddropout_rate: float = 0.1, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', use_masking: bool = False, use_weighted_masking: bool = False, loss_type: str = 'L1')[source]¶

Bases: espnet2.svs.abs_svs.AbsSVS

NaiveRNN-SVS module.

This is an implementation of a naive RNN for singing voice synthesis. The features are processed directly in the time domain from the music score to predict the singing voice features.

Initialize NaiveRNN module.

Parameters:

idim (int) – Dimension of the label inputs.
odim (int) – Dimension of the outputs.
midi_dim (int) – Dimension of the midi inputs.
embed_dim (int) – Dimension of the token embedding.
eprenet_conv_layers (int) – Number of prenet conv layers.
eprenet_conv_filts (int) – Number of prenet conv filter size.
eprenet_conv_chans (int) – Number of prenet conv filter channels.
elayers (int) – Number of encoder layers.
eunits (int) – Number of encoder hidden units.
ebidirectional (bool) – If bidirectional in encoder.
midi_embed_integration_type (str) – how to integrate midi information, (“add” or “cat”).
dlayers (int) – Number of decoder lstm layers.
dunits (int) – Number of decoder lstm units.
dbidirectional (bool) – if bidirectional in decoder.
postnet_layers (int) – Number of postnet layers.
postnet_filts (int) – Number of postnet filter size.
postnet_chans (int) – Number of postnet filter channels.
use_batch_norm (bool) – Whether to use batch normalization.
reduction_factor (int) – Reduction factor.
Extra-embedding-related parameters:
spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.
langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.
spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
spk_embed_integration_type (str) – How to integrate speaker embedding.
eprenet_dropout_rate (float) – Prenet dropout rate.
edropout_rate (float) – Encoder dropout rate.
ddropout_rate (float) – Decoder dropout rate.
postnet_dropout_rate (float) – Postnet dropout_rate.
init_type (str) – How to initialize transformer parameters.
use_masking (bool) – Whether to mask padded part in loss calculation.
use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
loss_type (str) – Loss function type (“L1”, “L2”, or “L1+L2”).
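The three `loss_type` values select which reconstruction terms contribute to the training loss. A minimal sketch of that selection, assuming "L1+L2" is an unweighted sum of the two terms (an assumption, since the combination rule is not stated here); `combine_losses` is a hypothetical helper:

```python
def combine_losses(l1: float, l2: float, loss_type: str = "L1") -> float:
    """Select or combine already-computed L1 and L2 loss values."""
    if loss_type == "L1":
        return l1
    if loss_type == "L2":
        return l2
    if loss_type == "L1+L2":
        # Assumed: plain sum without extra weighting.
        return l1 + l2
    raise ValueError(f"unknown loss_type: {loss_type}")


total = combine_losses(0.25, 0.5, "L1+L2")
```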

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, label: Optional[Dict[str, torch.Tensor]] = None, label_lengths: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, melody_lengths: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, duration_lengths: Optional[Dict[str, torch.Tensor]] = None, slur: torch.LongTensor = None, slur_lengths: torch.Tensor = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, flag_IsValid=False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶

Calculate forward propagation.

Parameters:

text (LongTensor) – Batch of padded character ids (B, Tmax).
text_lengths (LongTensor) – Batch of lengths of each input batch (B,).
feats (Tensor) – Batch of padded target features (B, Lmax, odim).
feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).
label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).
melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).
melody_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B, ).
pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B, ).
duration (Optional[Dict]) – key is “lab”, “score”; value (LongTensor): Batch of padded duration (B, Tmax).
duration_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded duration (B, ).
slur (LongTensor) – Batch of padded slur (B, Tmax).
slur_lengths (LongTensor) – Batch of the lengths of padded slur (B, ).
spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
lids (Optional[Tensor]) – Batch of language IDs (B, 1).

GS Fix: arguments from the forward function vs. **batch from espnet_model.py: label == durations | phone sequence; melody -> pitch sequence.

Returns:

Loss scalar value. Dict: Statistics to be monitored. Tensor: Weight value if not joint training else model outputs.

Return type:

Tensor

inference(text: torch.Tensor, feats: Optional[torch.Tensor] = None, label: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, slur: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, use_teacher_forcing: torch.Tensor = False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶

Calculate forward propagation.

Parameters:

text (LongTensor) – Batch of padded character ids (Tmax).
feats (Tensor) – Batch of padded target features (Lmax, odim).
label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).
melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).
pitch (FloatTensor) – Batch of padded f0 (Tmax).
slur (LongTensor) – Batch of padded slur (B, Tmax).
duration (Optional[Dict]) – key is “lab”, “score”; value (LongTensor): Batch of padded duration (Tmax).
spembs (Optional[Tensor]) – Batch of speaker embeddings (spk_embed_dim).
sids (Optional[Tensor]) – Batch of speaker IDs (1).
lids (Optional[Tensor]) – Batch of language IDs (1).

Returns:

Output dict including the following items:

feat_gen (Tensor): Output sequence of features (T_feats, odim).

Return type:

Dict[str, Tensor]

class espnet2.svs.naive_rnn.naive_rnn.NaiveRNNLoss(use_masking=True, use_weighted_masking=False)[source]¶

Bases: torch.nn.modules.module.Module

Loss function module for Tacotron2.

Initialize Tacotron2 loss module.

Parameters:

use_masking (bool) – Whether to apply masking for padded part in loss calculation.
use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.

forward(after_outs, before_outs, ys, olens)[source]¶

Calculate forward propagation.

Parameters:

after_outs (Tensor) – Batch of outputs after postnets (B, Lmax, odim).
before_outs (Tensor) – Batch of outputs before postnets (B, Lmax, odim).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
olens (LongTensor) – Batch of the lengths of each target (B,).

Returns:

Tensor: L1 loss value. Tensor: Mean square error loss value.

Return type:

Tensor
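Masking the padded part means that frames beyond each target length in `olens` do not contribute to the loss. The sketch below illustrates this on nested lists shaped (B, Lmax, odim). It is simplified relative to the module above: it scores a single output instead of both `after_outs` and `before_outs`, it does not cover weighted masking, and `masked_losses` is a hypothetical helper.

```python
from typing import List, Tuple


def masked_losses(
    outs: List[List[List[float]]], ys: List[List[List[float]]], olens: List[int]
) -> Tuple[float, float]:
    """Return (L1, MSE) averaged over the non-padded frames only."""
    abs_sum, sq_sum, count = 0.0, 0.0, 0
    for out_seq, y_seq, n in zip(outs, ys, olens):
        # Only the first n frames of each sequence are real; the rest is padding.
        for out_frame, y_frame in zip(out_seq[:n], y_seq[:n]):
            for o, y in zip(out_frame, y_frame):
                abs_sum += abs(o - y)
                sq_sum += (o - y) ** 2
                count += 1
    return abs_sum / count, sq_sum / count


outs = [[[1.0], [2.0], [9.0]]]  # B=1, Lmax=3, odim=1; the last frame is padding
ys = [[[0.0], [0.0], [0.0]]]
l1, mse = masked_losses(outs, ys, olens=[2])
```

The large error in the padded third frame is ignored, so both values reflect only the two valid frames.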

espnet2.svs.singing_tacotron.encoder¶

Singing Tacotron encoder related modules.

class espnet2.svs.singing_tacotron.encoder.Duration_Encoder(idim, embed_dim=512, dropout_rate=0.5, padding_idx=0)[source]¶

Bases: torch.nn.modules.module.Module

Duration_Encoder module of Spectrogram prediction network.

This is the duration encoder module of the spectrogram prediction network in Singing-Tacotron. It converts the sequence of durations and tempo features into a transition token.

Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis: https://arxiv.org/abs/2202.07907

Initialize Singing-Tacotron encoder module.

Parameters:

idim (int) –
embed_dim (int, optional) –
dropout_rate (float, optional) –

forward(xs)[source]¶

Calculate forward propagation.

Parameters:

xs (Tensor) – Batch of the duration sequence (B, Tmax, feature_len).

Returns:

Batch of the sequences of transition token (B, Tmax, 1). LongTensor: Batch of lengths of each sequence (B,).

Return type:

Tensor

inference(x)[source]¶: Inference.

class espnet2.svs.singing_tacotron.encoder.Encoder(idim, input_layer='embed', embed_dim=512, elayers=1, eunits=512, econv_layers=3, econv_chans=512, econv_filts=5, use_batch_norm=True, use_residual=False, dropout_rate=0.5, padding_idx=0)[source]¶

Bases: torch.nn.modules.module.Module

Encoder module of Spectrogram prediction network.

This is the encoder module of the spectrogram prediction network in Singing Tacotron, which is described in Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis. It converts either a sequence of characters or acoustic features into the sequence of hidden states.

Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis: https://arxiv.org/abs/2202.07907

Initialize Singing Tacotron encoder module.

Parameters:

idim (int) –
input_layer (str) – Input layer type.
embed_dim (int, optional) –
elayers (int, optional) –
eunits (int, optional) –
econv_layers (int, optional) –
econv_filts (int, optional) –
econv_chans (int, optional) –
use_batch_norm (bool, optional) –
use_residual (bool, optional) –
dropout_rate (float, optional) –

forward(xs, ilens=None)[source]¶

Calculate forward propagation.

Parameters:

xs (Tensor) – Batch of the padded sequence. Either character ids (B, Tmax) or acoustic feature (B, Tmax, idim * encoder_reduction_factor). Padded value should be 0.
ilens (LongTensor) – Batch of lengths of each input batch (B,).

Returns:

Batch of the sequences of encoder states (B, Tmax, eunits). LongTensor: Batch of lengths of each sequence (B,).

Return type:

Tensor

inference(x, ilens)[source]¶

Inference.

Parameters:

x (Tensor) – The sequence of character ids (T,) or acoustic feature (T, idim * encoder_reduction_factor).

Returns:

The sequences of encoder states (T, eunits).

Return type:

Tensor

espnet2.svs.singing_tacotron.encoder.encoder_init(m)[source]¶: Initialize encoder parameters.

espnet2.svs.singing_tacotron.__init__¶

espnet2.svs.singing_tacotron.singing_tacotron¶

Singing Tacotron related modules for ESPnet2.

class espnet2.svs.singing_tacotron.singing_tacotron.singing_tacotron(idim: int, odim: int, midi_dim: int = 129, duration_dim: int = 500, embed_dim: int = 512, elayers: int = 1, eunits: int = 512, econv_layers: int = 3, econv_chans: int = 512, econv_filts: int = 5, atype: str = 'GDCA', adim: int = 512, aconv_chans: int = 32, aconv_filts: int = 15, cumulate_att_w: bool = True, dlayers: int = 2, dunits: int = 1024, prenet_layers: int = 2, prenet_units: int = 256, postnet_layers: int = 5, postnet_chans: int = 512, postnet_filts: int = 5, output_activation: Optional[str] = None, use_batch_norm: bool = True, use_concate: bool = True, use_residual: bool = False, reduction_factor: int = 1, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'concat', use_gst: bool = False, gst_tokens: int = 10, gst_heads: int = 4, gst_conv_layers: int = 6, gst_conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), gst_conv_kernel_size: int = 3, gst_conv_stride: int = 2, gst_gru_layers: int = 1, gst_gru_units: int = 128, dropout_rate: float = 0.5, zoneout_rate: float = 0.1, use_masking: bool = True, use_weighted_masking: bool = False, bce_pos_weight: float = 5.0, loss_type: str = 'L1', use_guided_attn_loss: bool = True, guided_attn_loss_sigma: float = 0.4, guided_attn_loss_lambda: float = 1.0)[source]¶

Bases: espnet2.svs.abs_svs.AbsSVS

singing_Tacotron module for end-to-end singing-voice-synthesis.

This is a module of the spectrogram prediction network in Singing Tacotron described in “Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis” (https://arxiv.org/pdf/2202.07907v1.pdf), which learns accurate alignment information automatically.

Initialize Singing Tacotron module.

Parameters:

idim (int) – Dimension of the label inputs.
odim (int) – Dimension of the outputs.
embed_dim (int) – Dimension of the token embedding.
elayers (int) – Number of encoder blstm layers.
eunits (int) – Number of encoder blstm units.
econv_layers (int) – Number of encoder conv layers.
econv_filts (int) – Filter size of encoder conv layers.
econv_chans (int) – Number of encoder conv filter channels.
dlayers (int) – Number of decoder lstm layers.
dunits (int) – Number of decoder lstm units.
prenet_layers (int) – Number of prenet layers.
prenet_units (int) – Number of prenet units.
postnet_layers (int) – Number of postnet layers.
postnet_filts (int) – Filter size of postnet layers.
postnet_chans (int) – Number of postnet filter channels.
output_activation (str) – Name of activation function for outputs.
adim (int) – Dimension of the MLP in attention.
aconv_chans (int) – Number of attention conv filter channels.
aconv_filts (int) – Filter size of attention conv layers.
cumulate_att_w (bool) – Whether to cumulate previous attention weight.
use_batch_norm (bool) – Whether to use batch normalization.
use_concate (bool) – Whether to concat enc outputs w/ dec lstm outputs.
reduction_factor (int) – Reduction factor.
spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.
langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.
spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
spk_embed_integration_type (str) – How to integrate speaker embedding.
use_gst (bool) – Whether to use global style token.
gst_tokens (int) – Number of GST embeddings.
gst_heads (int) – Number of heads in GST multihead attention.
gst_conv_layers (int) – Number of conv layers in GST.
gst_conv_chans_list (Sequence[int]) – List of the number of channels of conv layers in GST.
gst_conv_kernel_size (int) – Kernel size of conv layers in GST.
gst_conv_stride (int) – Stride size of conv layers in GST.
gst_gru_layers (int) – Number of GRU layers in GST.
gst_gru_units (int) – Number of GRU units in GST.
dropout_rate (float) – Dropout rate.
zoneout_rate (float) – Zoneout rate.
use_masking (bool) – Whether to mask padded part in loss calculation.
use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
bce_pos_weight (float) – Weight of positive sample of stop token (only for use_masking=True).
loss_type (str) – Loss function type (“L1”, “L2”, or “L1+L2”).
use_guided_attn_loss (bool) – Whether to use guided attention loss.
guided_attn_loss_sigma (float) – Sigma in guided attention loss.
guided_attn_loss_lambda (float) – Lambda in guided attention loss.
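
The role of guided_attn_loss_sigma can be illustrated with the penalty matrix commonly used for guided attention loss. The sketch below assumes the usual formulation W[t][n] = 1 - exp(-(n/N - t/T)^2 / (2 * sigma^2)) and uses plain Python lists; the function name guided_attention_weights is hypothetical and this is not the ESPnet implementation.

```python
import math


def guided_attention_weights(ilen, olen, sigma=0.4):
    """Return an (olen, ilen) penalty matrix; entries near the diagonal are ~0."""
    weights = []
    for t in range(olen):  # decoder (output) step
        row = []
        for n in range(ilen):  # encoder (input) step
            diff = n / ilen - t / olen
            row.append(1.0 - math.exp(-(diff ** 2) / (2 * sigma ** 2)))
        weights.append(row)
    return weights


w = guided_attention_weights(ilen=4, olen=8, sigma=0.4)
print(len(w), len(w[0]))  # 8 4
```

A smaller sigma makes the penalty grow faster away from the diagonal. Under this assumption, the loss term would be the mean of W multiplied element-wise with the attention weights, scaled by guided_attn_loss_lambda.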

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, label: Optional[Dict[str, torch.Tensor]] = None, label_lengths: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, melody_lengths: Optional[Dict[str, torch.Tensor]] = None, duration: Optional[Dict[str, torch.Tensor]] = None, duration_lengths: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, slur: torch.LongTensor = None, slur_lengths: torch.Tensor = None, ying: torch.Tensor = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, joint_training: bool = False, flag_IsValid=False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶

Calculate forward propagation.

Parameters:

text (LongTensor) – Batch of padded character ids (B, T_text).
text_lengths (LongTensor) – Batch of lengths of each input batch (B,).
feats (Tensor) – Batch of padded target features (B, T_feats, odim).
feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).
label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).
melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).
melody_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B, ).
pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B, ).
duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, Tmax).
duration_lengths (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of the lengths of padded duration (B, ).
slur (LongTensor) – Batch of padded slur (B, Tmax).
slur_lengths (LongTensor) – Batch of the lengths of padded slur (B, ).
spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
lids (Optional[Tensor]) – Batch of language IDs (B, 1).
joint_training (bool) – Whether to perform joint training with vocoder.

Returns:

Loss scalar value. Dict: Statistics to be monitored. Tensor: Weight value if not joint training else model outputs.

Return type:

Tensor
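
As a rough illustration of how use_masking and loss_type interact, the following sketch averages a feature loss over non-padded frames only. It is a dependency-free approximation of the idea, not the module's actual loss code; masked_feature_loss and the nested-list layout (B, T_feats, odim) are assumptions made for this example.

```python
def masked_feature_loss(outs, ys, lengths, loss_type="L1"):
    """Average the loss over non-padded frames only."""
    total, count = 0.0, 0
    for out_seq, tgt_seq, length in zip(outs, ys, lengths):
        # Frames at index >= length are padding and are skipped.
        for out_frame, tgt_frame in zip(out_seq[:length], tgt_seq[:length]):
            for o, y in zip(out_frame, tgt_frame):
                l1, l2 = abs(o - y), (o - y) ** 2
                if loss_type == "L1":
                    total += l1
                elif loss_type == "L2":
                    total += l2
                elif loss_type == "L1+L2":
                    total += l1 + l2
                else:
                    raise ValueError(f"unknown loss_type: {loss_type}")
                count += 1
    return total / count


# Two sequences padded to T_feats=3, with odim=1 and true lengths 2 and 1.
outs = [[[1.0], [2.0], [9.0]], [[0.0], [5.0], [5.0]]]
ys = [[[0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0]]]
print(masked_feature_loss(outs, ys, lengths=[2, 1], loss_type="L1"))
```

The padded frames (9.0 and the two 5.0 values) do not contribute, so only three frames enter the average.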

inference(text: torch.Tensor, feats: Optional[torch.Tensor] = None, label: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, duration: Optional[Dict[str, torch.Tensor]] = None, slur: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 30.0, use_att_constraint: bool = False, use_dynamic_filter: bool = False, backward_window: int = 1, forward_window: int = 3, use_teacher_forcing: bool = False) → Dict[str, torch.Tensor][source]¶

Generate the sequence of features given the sequences of characters.

Parameters:

text (LongTensor) – Input sequence of characters (T_text,).
feats (Optional[Tensor]) – Feature sequence to extract style (N, idim).
label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).
melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).
pitch (FloatTensor) – Batch of padded f0 (Tmax).
duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (Tmax).
slur (LongTensor) – Batch of padded slur (B, Tmax).
spembs (Optional[Tensor]) – Speaker embedding (spk_embed_dim,).
sids (Optional[Tensor]) – Speaker ID (1,).
lids (Optional[Tensor]) – Language ID (1,).
threshold (float) – Threshold in inference.
minlenratio (float) – Minimum length ratio in inference.
maxlenratio (float) – Maximum length ratio in inference.
use_att_constraint (bool) – Whether to apply attention constraint.
use_dynamic_filter (bool) – Whether to apply dynamic filter.
backward_window (int) – Backward window in attention constraint or dynamic filter.
forward_window (int) – Forward window in attention constraint or dynamic filter.
use_teacher_forcing (bool) – Whether to use teacher forcing.

Returns:

Output dict including the following items:

feat_gen (Tensor): Output sequence of features (T_feats, odim).
prob (Tensor): Output sequence of stop probabilities (T_feats,).
att_w (Tensor): Attention weights (T_feats, T).

Return type:

Dict[str, Tensor]
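
The effect of use_att_constraint with backward_window and forward_window can be sketched as masking attention energies outside a window around the previously attended input position. This is a simplified pure-Python illustration; the function name and the exact window boundaries (inclusive lower bound, exclusive upper bound) are assumptions, not a statement about the library's internals.

```python
def apply_attention_constraint(energies, last_attended_idx,
                               backward_window=1, forward_window=3):
    """Mask attention energies outside the allowed window with -inf."""
    backward_idx = last_attended_idx - backward_window
    forward_idx = last_attended_idx + forward_window
    masked = list(energies)
    for i in range(len(masked)):
        if i < backward_idx or i >= forward_idx:
            masked[i] = float("-inf")
    return masked


energies = [0.1, 0.5, 0.2, 0.9, 0.3, 0.7, 0.4, 0.8]
print(apply_attention_constraint(energies, last_attended_idx=3))
```

After a softmax, masked positions receive zero weight, which keeps the alignment close to monotonic during inference.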

espnet2.svs.singing_tacotron.decoder¶

Singing Tacotron decoder related modules.

class espnet2.svs.singing_tacotron.decoder.Decoder(idim, odim, att, dlayers=2, dunits=1024, prenet_layers=2, prenet_units=256, postnet_layers=5, postnet_chans=512, postnet_filts=5, output_activation_fn=None, cumulate_att_w=True, use_batch_norm=True, use_concate=True, dropout_rate=0.5, zoneout_rate=0.1, reduction_factor=1)[source]¶

Bases: torch.nn.modules.module.Module

Decoder module of Spectrogram prediction network.

This is the decoder module of the spectrogram prediction network in Singing Tacotron, which is described in “Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis” (https://arxiv.org/pdf/2202.07907v1.pdf). The decoder generates the sequence of features from the sequence of the hidden states.

Initialize Singing Tacotron decoder module.

Parameters:

idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
att (torch.nn.Module) – Instance of attention class.
dlayers (int, optional) – The number of decoder lstm layers.
dunits (int, optional) – The number of decoder lstm units.
prenet_layers (int, optional) – The number of prenet layers.
prenet_units (int, optional) – The number of prenet units.
postnet_layers (int, optional) – The number of postnet layers.
postnet_filts (int, optional) – The filter size of postnet layers.
postnet_chans (int, optional) – The number of postnet filter channels.
output_activation_fn (torch.nn.Module, optional) – Activation function for outputs.
cumulate_att_w (bool, optional) – Whether to cumulate previous attention weight.
use_batch_norm (bool, optional) – Whether to use batch normalization.
use_concate (bool, optional) – Whether to concatenate encoder embedding with decoder lstm outputs.
dropout_rate (float, optional) – Dropout rate.
zoneout_rate (float, optional) – Zoneout rate.
reduction_factor (int, optional) – Reduction factor.

forward(hs, hlens, trans_token, ys)[source]¶

Calculate forward propagation.

Parameters:

hs (Tensor) – Batch of the sequences of padded hidden states (B, Tmax, idim).
hlens (LongTensor) – Batch of lengths of each input batch (B,).
trans_token (Tensor) – Global transition token for duration (B, Tmax, 1).
ys (Tensor) – Batch of the sequences of padded target features (B, Lmax, odim).

Returns:

Batch of output tensors after postnet (B, Lmax, odim). Tensor: Batch of output tensors before postnet (B, Lmax, odim). Tensor: Batch of logits of stop prediction (B, Lmax). Tensor: Batch of attention weights (B, Lmax, Tmax).

Return type:

Tensor

Note

This computation is performed in teacher-forcing manner.
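
Teacher forcing means that the decoder input at each step is the ground-truth frame of the previous step rather than the decoder's own prediction. A minimal sketch of that shift for a single sequence is given below; the helper name and the all-zero initial frame are assumptions for illustration.

```python
def teacher_forcing_inputs(ys):
    """Shift target frames right by one step, starting from an all-zero frame."""
    odim = len(ys[0])
    go_frame = [0.0] * odim
    return [go_frame] + [list(frame) for frame in ys[:-1]]


ys = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # (Lmax=3, odim=2)
prev_frames = teacher_forcing_inputs(ys)
print(prev_frames)  # [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
```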

inference(h, trans_token, threshold=0.5, minlenratio=0.0, maxlenratio=30.0, use_att_constraint=False, use_dynamic_filter=True, backward_window=1, forward_window=3)[source]¶

Generate the sequence of features given the sequences of characters.

Parameters:

h (Tensor) – Input sequence of encoder hidden states (T, C).
trans_token (Tensor) – Global transition token for duration.
threshold (float, optional) – Threshold to stop generation.
minlenratio (float, optional) – Minimum length ratio. If set to 1.0 and the length of input is 10, the minimum length of outputs will be 10 * 1 = 10.
maxlenratio (float, optional) – Maximum length ratio. If set to 10 and the length of input is 10, the maximum length of outputs will be 10 * 10 = 100.
use_att_constraint (bool) – Whether to apply attention constraint introduced in Deep Voice 3.
use_dynamic_filter (bool) – Whether to apply dynamic filter introduced in Singing Tacotron.
backward_window (int) – Backward window size in attention constraint.
forward_window (int) – Forward window size in attention constraint.

Returns:

Output sequence of features (L, odim). Tensor: Output sequence of stop probabilities (L,). Tensor: Attention weights (L, T).

Return type:

Tensor

Note

This computation is performed in auto-regressive manner.
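
How threshold, minlenratio and maxlenratio jointly decide when generation stops can be sketched as follows. The loop takes a precomputed list of stop probabilities in place of a real decoder; generate_length is a hypothetical helper and the exact comparison order is an assumption.

```python
def generate_length(stop_probs, input_length, threshold=0.5,
                    minlenratio=0.0, maxlenratio=30.0):
    """Return how many frames are generated before the loop stops."""
    minlen = int(input_length * minlenratio)
    maxlen = int(input_length * maxlenratio)
    n_frames = 0
    for prob in stop_probs:
        n_frames += 1
        if n_frames < minlen:
            continue  # too short: ignore the stop prediction
        if prob >= threshold or n_frames >= maxlen:
            break
    return n_frames


stop_probs = [0.1, 0.9, 0.2, 0.3, 0.8, 0.1]
print(generate_length(stop_probs, input_length=2))                   # 2
print(generate_length(stop_probs, input_length=2, minlenratio=2.0))  # 5
```

In the second call the early stop at step 2 is ignored because the minimum length is 2 * 2.0 = 4 frames.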

espnet2.svs.singing_tacotron.decoder.decoder_init(m)[source]¶: Initialize decoder parameters.

espnet2.svs.xiaoice.__init__¶

espnet2.svs.xiaoice.XiaoiceSing¶

XiaoiceSing related modules.

class espnet2.svs.xiaoice.XiaoiceSing.XiaoiceSing(idim: int, odim: int, midi_dim: int = 129, duration_dim: int = 500, adim: int = 384, aheads: int = 4, elayers: int = 6, eunits: int = 1536, dlayers: int = 6, dunits: int = 1536, postnet_layers: int = 5, postnet_chans: int = 512, postnet_filts: int = 5, postnet_dropout_rate: float = 0.5, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 1, use_scaled_pos_enc: bool = True, use_batch_norm: bool = True, encoder_normalize_before: bool = True, decoder_normalize_before: bool = True, encoder_concat_after: bool = False, decoder_concat_after: bool = False, duration_predictor_layers: int = 2, duration_predictor_chans: int = 384, duration_predictor_kernel_size: int = 3, duration_predictor_dropout_rate: float = 0.1, reduction_factor: int = 1, encoder_type: str = 'transformer', decoder_type: str = 'transformer', transformer_enc_dropout_rate: float = 0.1, transformer_enc_positional_dropout_rate: float = 0.1, transformer_enc_attn_dropout_rate: float = 0.1, transformer_dec_dropout_rate: float = 0.1, transformer_dec_positional_dropout_rate: float = 0.1, transformer_dec_attn_dropout_rate: float = 0.1, conformer_rel_pos_type: str = 'legacy', conformer_pos_enc_layer_type: str = 'rel_pos', conformer_self_attn_layer_type: str = 'rel_selfattn', conformer_activation_type: str = 'swish', use_macaron_style_in_conformer: bool = True, use_cnn_in_conformer: bool = True, zero_triu: bool = False, conformer_enc_kernel_size: int = 7, conformer_dec_kernel_size: int = 31, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'add', init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_masking: bool = False, use_weighted_masking: bool = False, loss_function: str = 'XiaoiceSing2', loss_type: str = 'L1', lambda_mel: float = 1, lambda_dur: float = 0.1, lambda_pitch: float = 0.01, lambda_vuv: float = 0.01)[source]¶

Bases: espnet2.svs.abs_svs.AbsSVS

XiaoiceSing module for Singing Voice Synthesis.

This is a module of XiaoiceSing, a high-quality singing voice synthesis system which employs an integrated network for spectrum, F0 and duration modeling. It follows the main architecture of FastSpeech while proposing some singing-specific designs:

1) Add features from the musical score (e.g. note pitch and length)

2) Add a residual connection in F0 prediction to attenuate off-key issues

3) The duration of all the phonemes in a musical note is accumulated to calculate the syllable duration loss for rhythm enhancement (syllable loss)
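
The syllable duration loss described in the last item can be sketched by summing phoneme durations per note before comparing prediction and target. The helper names, the note_ids mapping (one note index per phoneme) and the use of mean squared error are assumptions for this illustration.

```python
def syllable_durations(phoneme_durations, note_ids):
    """Sum phoneme durations (in frames) that belong to the same note."""
    totals = {}
    for dur, note in zip(phoneme_durations, note_ids):
        totals[note] = totals.get(note, 0) + dur
    return [totals[note] for note in sorted(totals)]


def syllable_duration_loss(pred, target, note_ids):
    """Mean squared error between accumulated predicted and target durations."""
    pred_syb = syllable_durations(pred, note_ids)
    tgt_syb = syllable_durations(target, note_ids)
    return sum((p - t) ** 2 for p, t in zip(pred_syb, tgt_syb)) / len(tgt_syb)


note_ids = [0, 0, 1, 1]   # two phonemes per note
target = [4, 6, 5, 5]     # note durations: 10 and 10 frames
print(syllable_duration_loss([3, 7, 4, 6], target, note_ids))  # 0.0
print(syllable_duration_loss([3, 5, 4, 6], target, note_ids))  # 2.0
```

In the first call the phoneme-level errors cancel within each note, so the rhythm is preserved and the syllable loss is zero.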

Initialize XiaoiceSing module.

Parameters:

idim (int) – Dimension of the label inputs.
odim (int) – Dimension of the outputs.
midi_dim (int) – Dimension of the midi inputs.
duration_dim (int) – Dimension of the duration inputs.
elayers (int) – Number of encoder layers.
eunits (int) – Number of encoder hidden units.
dlayers (int) – Number of decoder layers.
dunits (int) – Number of decoder hidden units.
postnet_layers (int) – Number of postnet layers.
postnet_chans (int) – Number of postnet channels.
postnet_filts (int) – Kernel size of postnet.
postnet_dropout_rate (float) – Dropout rate in postnet.
use_scaled_pos_enc (bool) – Whether to use trainable scaled pos encoding.
use_batch_norm (bool) – Whether to use batch normalization in encoder prenet.
encoder_normalize_before (bool) – Whether to apply layernorm layer before encoder block.
decoder_normalize_before (bool) – Whether to apply layernorm layer before decoder block.
encoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in encoder.
decoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in decoder.
duration_predictor_layers (int) – Number of duration predictor layers.
duration_predictor_chans (int) – Number of duration predictor channels.
duration_predictor_kernel_size (int) – Kernel size of duration predictor.
duration_predictor_dropout_rate (float) – Dropout rate in duration predictor.
reduction_factor (int) – Reduction factor.
encoder_type (str) – Encoder type (“transformer” or “conformer”).
decoder_type (str) – Decoder type (“transformer” or “conformer”).
transformer_enc_dropout_rate (float) – Dropout rate in encoder except attention and positional encoding.
transformer_enc_positional_dropout_rate (float) – Dropout rate after encoder positional encoding.
transformer_enc_attn_dropout_rate (float) – Dropout rate in encoder self-attention module.
transformer_dec_dropout_rate (float) – Dropout rate in decoder except attention & positional encoding.
transformer_dec_positional_dropout_rate (float) – Dropout rate after decoder positional encoding.
transformer_dec_attn_dropout_rate (float) – Dropout rate in decoder self-attention module.
spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.
langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.
spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
spk_embed_integration_type (str) – How to integrate speaker embedding.
init_type (str) – How to initialize transformer parameters.
init_enc_alpha (float) – Initial value of alpha in scaled pos encoding of the encoder.
init_dec_alpha (float) – Initial value of alpha in scaled pos encoding of the decoder.
use_masking (bool) – Whether to apply masking for padded part in loss calculation.
use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
loss_function (str) – Loss functions (“FastSpeech1” or “XiaoiceSing2”)
loss_type (str) – Mel loss type (“L1” (MAE), “L2” (MSE) or “L1+L2”)
lambda_mel (float) – Loss scaling coefficient for Mel loss.
lambda_dur (float) – Loss scaling coefficient for duration loss.
lambda_pitch (float) – Loss scaling coefficient for pitch loss.
lambda_vuv (float) – Loss scaling coefficient for VUV loss.

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, label: Optional[Dict[str, torch.Tensor]] = None, label_lengths: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, melody_lengths: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, duration_lengths: Optional[Dict[str, torch.Tensor]] = None, slur: torch.LongTensor = None, slur_lengths: torch.Tensor = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, joint_training: bool = False, flag_IsValid=False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶

Calculate forward propagation.

Parameters:

text (LongTensor) – Batch of padded character ids (B, T_text).
text_lengths (LongTensor) – Batch of lengths of each input (B,).
feats (Tensor) – Batch of padded target features (B, T_feats, odim).
feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).
label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).
melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).
melody_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B, ).
pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B, ).
duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, Tmax).
duration_lengths (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of the lengths of padded duration (B, ).
slur (LongTensor) – Batch of padded slur (B, Tmax).
slur_lengths (LongTensor) – Batch of the lengths of padded slur (B, ).
spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
lids (Optional[Tensor]) – Batch of language IDs (B, 1).
joint_training (bool) – Whether to perform joint training with vocoder.

Returns:

Loss scalar value. Dict: Statistics to be monitored. Tensor: Weight value if not joint training else model outputs.

Return type:

Tensor
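
The padded-batch convention used by the arguments above, a (B, Tmax) batch plus a (B,) vector of true lengths, can be illustrated with plain lists. The pad_batch helper and the pad value 0 are assumptions for this sketch; in practice these would be tensors.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad variable-length sequences to the longest one in the batch."""
    lengths = [len(seq) for seq in sequences]
    tmax = max(lengths)
    padded = [list(seq) + [pad_value] * (tmax - len(seq)) for seq in sequences]
    return padded, lengths


label_ids = [[7, 3, 9], [4]]
padded, lengths = pad_batch(label_ids)
print(padded)   # [[7, 3, 9], [4, 0, 0]]
print(lengths)  # [3, 1]

# Dict-valued arguments carry one such batch per key, e.g. "lab" and "score".
label = {"lab": padded, "score": padded}
label_lengths = {"lab": lengths, "score": lengths}
```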

inference(text: torch.Tensor, feats: Optional[torch.Tensor] = None, label: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, slur: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, use_teacher_forcing: torch.Tensor = False, joint_training: bool = False) → Dict[str, torch.Tensor][source]¶

Generate the sequence of features given the sequences of characters.

Parameters:

text (LongTensor) – Input sequence of characters (T_text,).
feats (Optional[Tensor]) – Feature sequence to extract style (N, idim).
durations (Optional[LongTensor]) – Groundtruth of duration (T_text + 1,).
label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).
melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).
pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (Tmax).
slur (LongTensor) – Batch of padded slur (B, Tmax).
spembs (Optional[Tensor]) – Speaker embedding (spk_embed_dim,).
sids (Optional[Tensor]) – Speaker ID (1,).
lids (Optional[Tensor]) – Language ID (1,).
alpha (float) – Alpha to control the speed.

Returns:

Output dict including the following items:

feat_gen (Tensor): Output sequence of features (T_feats, odim).
duration (Tensor): Duration sequence (T_text + 1,).

Return type:

Dict[str, Tensor]
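
Since XiaoiceSing is described as following the FastSpeech architecture, the returned duration sequence can be read as the number of output frames per input token. A FastSpeech-style length regulator is sketched below under that assumption; length_regulate is a hypothetical name and strings stand in for hidden-state vectors.

```python
def length_regulate(hidden_states, durations):
    """Repeat each token-level state durations[i] times to get frame-level states."""
    if len(hidden_states) != len(durations):
        raise ValueError("one duration per hidden state is required")
    frames = []
    for state, dur in zip(hidden_states, durations):
        frames.extend([state] * dur)
    return frames


print(length_regulate(["a", "b", "c"], [2, 0, 3]))  # ['a', 'a', 'c', 'c', 'c']
```

The output length equals the sum of the durations, which is how token-level inputs are aligned with T_feats frames.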

espnet2.svs.xiaoice.loss¶

XiaoiceSing2 related loss module for ESPnet2.

class espnet2.svs.xiaoice.loss.XiaoiceSing2Loss(use_masking: bool = True, use_weighted_masking: bool = False)[source]¶

Bases: torch.nn.modules.module.Module

Loss function module for XiaoiceSing2.

Initialize feed-forward Transformer loss module.

Parameters:

use_masking (bool) – Whether to apply masking for padded part in loss calculation.
use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.

forward(after_outs: torch.Tensor, before_outs: torch.Tensor, d_outs: torch.Tensor, p_outs: torch.Tensor, v_outs: torch.Tensor, ys: torch.Tensor, ds: torch.Tensor, ps: torch.Tensor, vs: torch.Tensor, ilens: torch.Tensor, olens: torch.Tensor, loss_type: str = 'L1') → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶

Calculate forward propagation.

Parameters:

after_outs (Tensor) – Batch of outputs after postnets (B, T_feats, odim).
before_outs (Tensor) – Batch of outputs before postnets (B, T_feats, odim).
d_outs (LongTensor) – Batch of outputs of duration predictor (B, T_text).
p_outs (Tensor) – Batch of outputs of log_f0 (B, T_text, 1).
v_outs (Tensor) – Batch of outputs of VUV (B, T_text, 1).
ys (Tensor) – Batch of target features (B, T_feats, odim).
ds (LongTensor) – Batch of durations (B, T_text).
ps (Tensor) – Batch of target log_f0 (B, T_text, 1).
vs (Tensor) – Batch of target VUV (B, T_text, 1).
ilens (LongTensor) – Batch of the lengths of each input (B,).
olens (LongTensor) – Batch of the lengths of each target (B,).
loss_type (str) – Mel loss type (“L1” (MAE), “L2” (MSE) or “L1+L2”)

Returns:

Mel loss value. Tensor: Duration predictor loss value. Tensor: Pitch predictor loss value. Tensor: VUV predictor loss value.

Return type:

Tensor
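
One plausible way to combine the four returned loss values with the lambda_mel, lambda_dur, lambda_pitch and lambda_vuv coefficients of the XiaoiceSing constructor is a weighted sum. The sketch below assumes that combination and uses the documented default coefficients; total_loss is a hypothetical helper, not part of the package.

```python
def total_loss(mel_loss, duration_loss, pitch_loss, vuv_loss,
               lambda_mel=1.0, lambda_dur=0.1,
               lambda_pitch=0.01, lambda_vuv=0.01):
    """Weighted sum of the four loss terms (assumed combination rule)."""
    return (lambda_mel * mel_loss
            + lambda_dur * duration_loss
            + lambda_pitch * pitch_loss
            + lambda_vuv * vuv_loss)


# 1.0 * 0.5 + 0.1 * 2.0 + 0.01 * 10.0 + 0.01 * 10.0, roughly 0.9
print(total_loss(0.5, 2.0, 10.0, 10.0))
```

With these defaults the mel term dominates, while the pitch and VUV terms act as small auxiliary penalties.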
