espnet2.svs package

espnet2.svs.abs_svs

Singing-voice-synthesis abstract class.

class espnet2.svs.abs_svs.AbsSVS[source]

Bases: torch.nn.modules.module.Module, abc.ABC

SVS abstract class.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Calculate outputs and return the loss tensor.

abstract inference(text: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]

Return predicted output as a dict.

property require_raw_singing

Return whether or not raw_singing is required.

property require_vocoder

Return whether or not vocoder is required.
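A concrete SVS module therefore only needs to implement forward (returning a loss tensor, a stats dict, and a weight tensor) and inference (returning an output dict). Below is a minimal sketch of such a subclass; the toy pooling model and MSE loss are illustrative assumptions, not an ESPnet recipe.

import torch

from espnet2.svs.abs_svs import AbsSVS


class ToySVS(AbsSVS):
    """Minimal sketch: embed phoneme ids, pool over time, regress pooled feats."""

    def __init__(self, idim: int, odim: int, hidden: int = 64):
        super().__init__()
        self.embed = torch.nn.Embedding(idim, hidden)
        self.out = torch.nn.Linear(hidden, odim)

    def forward(self, text, text_lengths, feats, feats_lengths, **kwargs):
        h = self.embed(text).mean(dim=1)        # (B, hidden)
        pred = self.out(h)                      # (B, odim)
        target = feats.mean(dim=1)              # (B, odim)
        loss = torch.nn.functional.mse_loss(pred, target)
        stats = {"loss": loss.detach()}         # scalars to be monitored
        weight = torch.tensor(float(text.size(0)))  # e.g. batch size
        return loss, stats, weight

    def inference(self, text, **kwargs):
        h = self.embed(text.unsqueeze(0)).mean(dim=1)
        return {"feat_gen": self.out(h).squeeze(0)}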

espnet2.svs.__init__

espnet2.svs.espnet_model

Singing-voice-synthesis ESPnet model.

class espnet2.svs.espnet_model.ESPnetSVSModel(text_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], feats_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], score_feats_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], label_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], pitch_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], tempo_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], beat_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], energy_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], pitch_normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], energy_normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], svs: espnet2.svs.abs_svs.AbsSVS)[source]

Bases: espnet2.train.abs_espnet_model.AbsESPnetModel

ESPnet model for singing voice synthesis task.

Initialize ESPnetSVSModel module.

collect_feats(text: torch.Tensor, text_lengths: torch.Tensor, singing: torch.Tensor, singing_lengths: torch.Tensor, label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, label_lab: Optional[torch.Tensor] = None, label_lab_lengths: Optional[torch.Tensor] = None, label_score: Optional[torch.Tensor] = None, label_score_lengths: Optional[torch.Tensor] = None, phn_cnt: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, midi_lab: Optional[torch.Tensor] = None, midi_lab_lengths: Optional[torch.Tensor] = None, midi_score: Optional[torch.Tensor] = None, midi_score_lengths: Optional[torch.Tensor] = None, tempo_lab: Optional[torch.Tensor] = None, tempo_lab_lengths: Optional[torch.Tensor] = None, tempo_score: Optional[torch.Tensor] = None, tempo_score_lengths: Optional[torch.Tensor] = None, beat_phn: Optional[torch.Tensor] = None, beat_phn_lengths: Optional[torch.Tensor] = None, beat_ruled_phn: Optional[torch.Tensor] = None, beat_ruled_phn_lengths: Optional[torch.Tensor] = None, beat_syb: Optional[torch.Tensor] = None, beat_syb_lengths: Optional[torch.Tensor] = None, beat_lab: Optional[torch.Tensor] = None, beat_lab_lengths: Optional[torch.Tensor] = None, beat_score_phn: Optional[torch.Tensor] = None, beat_score_phn_lengths: Optional[torch.Tensor] = None, beat_score_syb: Optional[torch.Tensor] = None, beat_score_syb_lengths: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, energy_lengths: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, **kwargs) → Dict[str, torch.Tensor][source]

Calculate features and return them as a dict.

Parameters:
  • text (Tensor) – Text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • singing (Tensor) – Singing waveform tensor (B, T_wav).

  • singing_lengths (Tensor) – Singing length tensor (B,).

  • label_* – label id sequence.

  • label (Optional[Tensor]) – Label tensor (B, T_label).

  • label_lengths (Optional[Tensor]) – Label length tensor (B,).

  • label_lab (Optional[Tensor]) – Label tensor (B, T_wav).

  • label_lab_lengths (Optional[Tensor]) – Label length tensor (B,).

  • label_score (Optional[Tensor]) – Label tensor (B, T_score).

  • label_score_lengths (Optional[Tensor]) – Label length tensor (B,).

  • phn_cnt (Optional[Tensor]) – Number of phones in each syllable (B, T_syb).

  • midi_* – midi id sequence.

  • midi (Optional[Tensor]) – Midi tensor (B, T_label).

  • midi_lengths (Optional[Tensor]) – Midi length tensor (B,).

  • midi_lab (Optional[Tensor]) – Midi tensor (B, T_wav).

  • midi_lab_lengths (Optional[Tensor]) – Midi length tensor (B,).

  • midi_score (Optional[Tensor]) – Midi tensor (B, T_score).

  • midi_score_lengths (Optional[Tensor]) – Midi length tensor (B,).

  • tempo_* – tempo in bpm.

  • tempo_lab (Optional[Tensor]) – Tempo tensor (B, T_wav).

  • tempo_lab_lengths (Optional[Tensor]) – Tempo length tensor (B,).

  • tempo_score (Optional[Tensor]) – Tempo tensor (B, T_score).

  • tempo_score_lengths (Optional[Tensor]) – Tempo length tensor (B,).

  • beat_* – duration in time_shift units.

  • beat_phn (Optional[Tensor]) – Beat tensor (B, T_label).

  • beat_phn_lengths (Optional[Tensor]) – Beat length tensor (B,).

  • beat_ruled_phn (Optional[Tensor]) – Beat tensor (B, T_phone).

  • beat_ruled_phn_lengths (Optional[Tensor]) – Beat length tensor (B,).

  • beat_syb (Optional[Tensor]) – Beat tensor (B, T_syb).

  • beat_syb_lengths (Optional[Tensor]) – Beat length tensor (B,).

  • beat_lab (Optional[Tensor]) – Beat tensor (B, T_wav).

  • beat_lab_lengths (Optional[Tensor]) – Beat length tensor (B,).

  • beat_score_phn (Optional[Tensor]) – Beat tensor (B, T_score).

  • beat_score_phn_lengths (Optional[Tensor]) – Beat length tensor (B,).

  • beat_score_syb (Optional[Tensor]) – Beat tensor (B, T_score).

  • beat_score_syb_lengths (Optional[Tensor]) – Beat length tensor (B,).

  • pitch (Optional[Tensor]) – Pitch tensor (B, T_wav). - f0 sequence

  • pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).

  • energy (Optional[Tensor]) – Energy tensor.

  • energy_lengths (Optional[Tensor]) – Energy length tensor (B,).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).

  • sids (Optional[Tensor]) – Speaker ID tensor (B, 1).

  • lids (Optional[Tensor]) – Language ID tensor (B, 1).

Returns:

Dict of features.

Return type:

Dict[str, Tensor]

forward(text: torch.Tensor, text_lengths: torch.Tensor, singing: torch.Tensor, singing_lengths: torch.Tensor, label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, label_lab: Optional[torch.Tensor] = None, label_lab_lengths: Optional[torch.Tensor] = None, label_score: Optional[torch.Tensor] = None, label_score_lengths: Optional[torch.Tensor] = None, phn_cnt: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, midi_lab: Optional[torch.Tensor] = None, midi_lab_lengths: Optional[torch.Tensor] = None, midi_score: Optional[torch.Tensor] = None, midi_score_lengths: Optional[torch.Tensor] = None, tempo_lab: Optional[torch.Tensor] = None, tempo_lab_lengths: Optional[torch.Tensor] = None, tempo_score: Optional[torch.Tensor] = None, tempo_score_lengths: Optional[torch.Tensor] = None, beat_phn: Optional[torch.Tensor] = None, beat_phn_lengths: Optional[torch.Tensor] = None, beat_ruled_phn: Optional[torch.Tensor] = None, beat_ruled_phn_lengths: Optional[torch.Tensor] = None, beat_syb: Optional[torch.Tensor] = None, beat_syb_lengths: Optional[torch.Tensor] = None, beat_lab: Optional[torch.Tensor] = None, beat_lab_lengths: Optional[torch.Tensor] = None, beat_score_phn: Optional[torch.Tensor] = None, beat_score_phn_lengths: Optional[torch.Tensor] = None, beat_score_syb: Optional[torch.Tensor] = None, beat_score_syb_lengths: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, energy_lengths: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, flag_IsValid=False, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Calculate outputs and return the loss tensor.

Parameters:
  • text (Tensor) – Text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • singing (Tensor) – Singing waveform tensor (B, T_wav).

  • singing_lengths (Tensor) – Singing length tensor (B,).

  • label_* – label id sequence.

  • label (Optional[Tensor]) – Label tensor (B, T_label).

  • label_lengths (Optional[Tensor]) – Label length tensor (B,).

  • label_lab (Optional[Tensor]) – Label tensor (B, T_wav).

  • label_lab_lengths (Optional[Tensor]) – Label length tensor (B,).

  • label_score (Optional[Tensor]) – Label tensor (B, T_score).

  • label_score_lengths (Optional[Tensor]) – Label length tensor (B,).

  • phn_cnt (Optional[Tensor]) – Number of phones in each syllable (B, T_syb).

  • midi_* – midi id sequence.

  • midi (Optional[Tensor]) – Midi tensor (B, T_label).

  • midi_lengths (Optional[Tensor]) – Midi length tensor (B,).

  • midi_lab (Optional[Tensor]) – Midi tensor (B, T_wav).

  • midi_lab_lengths (Optional[Tensor]) – Midi length tensor (B,).

  • midi_score (Optional[Tensor]) – Midi tensor (B, T_score).

  • midi_score_lengths (Optional[Tensor]) – Midi length tensor (B,).

  • tempo_* – tempo in bpm.

  • tempo_lab (Optional[Tensor]) – Tempo tensor (B, T_wav).

  • tempo_lab_lengths (Optional[Tensor]) – Tempo length tensor (B,).

  • tempo_score (Optional[Tensor]) – Tempo tensor (B, T_score).

  • tempo_score_lengths (Optional[Tensor]) – Tempo length tensor (B,).

  • beat_* – duration in time_shift units.

  • beat_phn (Optional[Tensor]) – Beat tensor (B, T_label).

  • beat_phn_lengths (Optional[Tensor]) – Beat length tensor (B,).

  • beat_ruled_phn (Optional[Tensor]) – Beat tensor (B, T_phone).

  • beat_ruled_phn_lengths (Optional[Tensor]) – Beat length tensor (B,).

  • beat_syb (Optional[Tensor]) – Beat tensor (B, T_syb).

  • beat_syb_lengths (Optional[Tensor]) – Beat length tensor (B,).

  • beat_lab (Optional[Tensor]) – Beat tensor (B, T_wav).

  • beat_lab_lengths (Optional[Tensor]) – Beat length tensor (B,).

  • beat_score_phn (Optional[Tensor]) – Beat tensor (B, T_score).

  • beat_score_phn_lengths (Optional[Tensor]) – Beat length tensor (B,).

  • beat_score_syb (Optional[Tensor]) – Beat tensor (B, T_score).

  • beat_score_syb_lengths (Optional[Tensor]) – Beat length tensor (B,).

  • pitch (Optional[Tensor]) – Pitch tensor (B, T_wav). - f0 sequence

  • pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).

  • energy (Optional[Tensor]) – Energy tensor.

  • energy_lengths (Optional[Tensor]) – Energy length tensor (B,).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).

  • sids (Optional[Tensor]) – Speaker ID tensor (B, 1).

  • lids (Optional[Tensor]) – Language ID tensor (B, 1).

  • kwargs – “utt_id” is among the inputs.

Returns:

Loss scalar tensor. Dict[str, float]: Statistics to be monitored. Tensor: Weight tensor to summarize losses.

Return type:

Tensor
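For orientation, a hedged sketch of how the returned triple is typically consumed in a training step; model, batch, and optimizer are assumed to exist and are not part of this API.

# `model` is an ESPnetSVSModel; `batch` is a dict of the tensors listed above
# (text, text_lengths, singing, singing_lengths, ...). Both are assumptions.
loss, stats, weight = model(**batch)

optimizer.zero_grad()
loss.backward()
optimizer.step()

# `stats` holds the monitored scalars; `weight` lets the trainer aggregate
# statistics across batches of different effective sizes.
print({k: float(v) for k, v in stats.items()})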

inference(text: torch.Tensor, singing: Optional[torch.Tensor] = None, label: Optional[torch.Tensor] = None, label_lab: Optional[torch.Tensor] = None, label_score: Optional[torch.Tensor] = None, phn_cnt: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lab: Optional[torch.Tensor] = None, midi_score: Optional[torch.Tensor] = None, tempo_lab: Optional[torch.Tensor] = None, tempo_score: Optional[torch.Tensor] = None, beat_phn: Optional[torch.Tensor] = None, beat_ruled_phn: Optional[torch.Tensor] = None, beat_syb: Optional[torch.Tensor] = None, beat_lab: Optional[torch.Tensor] = None, beat_score_phn: Optional[torch.Tensor] = None, beat_score_syb: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, **decode_config) → Dict[str, torch.Tensor][source]

Calculate outputs and return them as a dict.

Parameters:
  • text (Tensor) – Text index tensor (T_text).

  • singing (Tensor) – Singing waveform tensor (T_wav).

  • label_* – label id sequence.

  • label (Optional[Tensor]) – Label tensor (T_label).

  • label_lab (Optional[Tensor]) – Label tensor (T_wav).

  • label_score (Optional[Tensor]) – Label tensor (T_score).

  • phn_cnt (Optional[Tensor]) – Number of phones in each syllable (T_syb).

  • midi_* – midi id sequence.

  • midi (Optional[Tensor]) – Midi tensor (T_label).

  • midi_lab (Optional[Tensor]) – Midi tensor (T_wav).

  • midi_score (Optional[Tensor]) – Midi tensor (T_score).

  • tempo_* – tempo in bpm.

  • tempo_lab (Optional[Tensor]) – Tempo tensor (T_wav).

  • tempo_score (Optional[Tensor]) – Tempo tensor (T_score).

  • beat_* – duration in time_shift units.

  • beat_phn (Optional[Tensor]) – Beat tensor (T_label).

  • beat_ruled_phn (Optional[Tensor]) – Beat tensor (T_phone).

  • beat_syb (Optional[Tensor]) – Beat tensor (T_syb).

  • beat_lab (Optional[Tensor]) – Beat tensor (T_wav).

  • beat_score_phn (Optional[Tensor]) – Beat tensor (T_score).

  • beat_score_syb (Optional[Tensor]) – Beat tensor (T_score).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (D,).

  • sids (Optional[Tensor]) – Speaker ID tensor (1,).

  • lids (Optional[Tensor]) – Language ID tensor (1,).

  • pitch (Optional[Tensor]) – Pitch tensor (T_wav).

  • energy (Optional[Tensor]) – Energy tensor.

Returns:

Dict of outputs.

Return type:

Dict[str, Tensor]
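A hedged sketch of an inference call with the single-utterance shapes listed above; the input variables and the exact set of score inputs required depend on the recipe, and feat_gen is the output key used by the SVS modules documented below (an assumption for other modules).

# All inputs are single utterances (no batch dimension); `model`, `text`,
# `label`, `midi`, and `spembs` are assumed to have been prepared already.
output_dict = model.inference(
    text=text,        # (T_text,) phoneme ids
    label=label,      # (T_label,) label ids, optional
    midi=midi,        # (T_label,) midi ids, optional
    spembs=spembs,    # (D,) speaker embedding, optional
)
feat_gen = output_dict["feat_gen"]  # generated feature sequence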

espnet2.svs.espnet_model.cal_ds(ilen, label, midi, beat, ref_len, ref_label, ref_midi, ref_beat)[source]

Calculate frame expanding length for each label.

espnet2.svs.espnet_model.cal_ds_syb(ds, phn_cnt)[source]

Calculate frame expanding length for each syllable.
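As a self-contained sketch of the idea behind cal_ds_syb (an assumption about its intent, not the exact ESPnet implementation): per-phone frame counts ds are summed into per-syllable counts using phn_cnt.

import torch

def ds_to_syllable(ds: torch.Tensor, phn_cnt: torch.Tensor) -> torch.Tensor:
    """Sum per-phone frame counts into per-syllable frame counts."""
    out, i = [], 0
    for n in phn_cnt.tolist():
        out.append(ds[i : i + n].sum())
        i += n
    return torch.stack(out)

ds = torch.tensor([3, 2, 4, 1])     # frames per phone
phn_cnt = torch.tensor([2, 2])      # two syllables, two phones each
print(ds_to_syllable(ds, phn_cnt))  # tensor([5, 5])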

espnet2.svs.feats_extract.score_feats_extract

class espnet2.svs.feats_extract.score_feats_extract.FrameScoreFeats(fs: Union[int, str] = 22050, n_fft: int = 1024, win_length: int = 512, hop_length: int = 128, window: str = 'hann', center: bool = True)[source]

Bases: espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract

extra_repr()[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, tempo: Optional[torch.Tensor] = None, tempo_lengths: Optional[torch.Tensor] = None, beat: Optional[torch.Tensor] = None, beat_lengths: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

FrameScoreFeats forward function.

Parameters:
  • label – (Batch, Nsamples)

  • label_lengths – (Batch)

  • midi – (Batch, Nsamples)

  • midi_lengths – (Batch)

  • tempo – (Batch, Nsamples)

  • tempo_lengths – (Batch)

  • beat – (Batch, Nsamples)

  • beat_lengths – (Batch)

Returns:

(Batch, Frames)

Return type:

output

get_parameters() → Dict[str, Any][source]
label_aggregate(input: torch.Tensor, input_lengths: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

label_aggregate function.

Parameters:
  • input – (Batch, Nsamples, Label_dim)

  • input_lengths – (Batch)

Returns:

(Batch, Frames, Label_dim)

Return type:

output
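A hedged usage sketch of label_aggregate with the documented shapes; the value range, dtype, and the exact number of output frames (which follows from win_length, hop_length, and center) are assumptions here.

import torch

from espnet2.svs.feats_extract.score_feats_extract import FrameScoreFeats

extractor = FrameScoreFeats(fs=22050, n_fft=1024, win_length=512, hop_length=128)

labels = torch.randint(0, 40, (2, 22050, 1)).float()  # (Batch, Nsamples, Label_dim)
lengths = torch.tensor([22050, 18000])

frames, frame_lengths = extractor.label_aggregate(labels, lengths)
print(frames.shape, frame_lengths)  # (Batch, Frames, Label_dim), per-utterance frames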

output_size() → int[source]
espnet2.svs.feats_extract.score_feats_extract.ListsToTensor(xs)[source]
class espnet2.svs.feats_extract.score_feats_extract.SyllableScoreFeats(fs: Union[int, str] = 22050, n_fft: int = 1024, win_length: int = 512, hop_length: int = 128, window: str = 'hann', center: bool = True)[source]

Bases: espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract

extra_repr()[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, tempo: Optional[torch.Tensor] = None, tempo_lengths: Optional[torch.Tensor] = None, beat: Optional[torch.Tensor] = None, beat_lengths: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

SyllableScoreFeats forward function.

Parameters:
  • label – (Batch, Nsamples)

  • label_lengths – (Batch)

  • midi – (Batch, Nsamples)

  • midi_lengths – (Batch)

  • tempo – (Batch, Nsamples)

  • tempo_lengths – (Batch)

  • beat – (Batch, Nsamples)

  • beat_lengths – (Batch)

Returns:

(Batch, Frames)

Return type:

output

get_parameters() → Dict[str, Any][source]
get_segments(label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, tempo: Optional[torch.Tensor] = None, tempo_lengths: Optional[torch.Tensor] = None, beat: Optional[torch.Tensor] = None, beat_lengths: Optional[torch.Tensor] = None)[source]
output_size() → int[source]

espnet2.svs.feats_extract.__init__

espnet2.svs.naive_rnn.__init__

espnet2.svs.naive_rnn.naive_rnn

Naive-SVS related modules.

class espnet2.svs.naive_rnn.naive_rnn.NaiveRNN(idim: int, odim: int, midi_dim: int = 129, embed_dim: int = 512, eprenet_conv_layers: int = 3, eprenet_conv_chans: int = 256, eprenet_conv_filts: int = 5, elayers: int = 3, eunits: int = 1024, ebidirectional: bool = True, midi_embed_integration_type: str = 'add', dlayers: int = 3, dunits: int = 1024, dbidirectional: bool = True, postnet_layers: int = 5, postnet_chans: int = 256, postnet_filts: int = 5, use_batch_norm: bool = True, reduction_factor: int = 1, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'add', eprenet_dropout_rate: float = 0.5, edropout_rate: float = 0.1, ddropout_rate: float = 0.1, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', use_masking: bool = False, use_weighted_masking: bool = False, loss_type: str = 'L1')[source]

Bases: espnet2.svs.abs_svs.AbsSVS

NaiveRNN-SVS module.

This is an implementation of a naive RNN for singing voice synthesis. The features are processed directly in the time domain from the music score to predict the singing voice features.

Initialize NaiveRNN module.

Args: TODO(Yuning)
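Until the upstream argument documentation is filled in, a hedged construction sketch using the defaults from the signature above; idim (number of phoneme ids) and odim (output feature dimension) are illustrative values.

from espnet2.svs.naive_rnn.naive_rnn import NaiveRNN

# 40 phoneme ids in, 80-dim features out; everything else keeps the
# documented defaults (3-layer bidirectional encoder/decoder, 5-layer postnet).
model = NaiveRNN(idim=40, odim=80)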

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, label: Optional[Dict[str, torch.Tensor]] = None, label_lengths: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, melody_lengths: Optional[Dict[str, torch.Tensor]] = None, tempo: Optional[Dict[str, torch.Tensor]] = None, tempo_lengths: Optional[Dict[str, torch.Tensor]] = None, beat: Optional[Dict[str, torch.Tensor]] = None, beat_lengths: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, flag_IsValid=False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • text (LongTensor) – Batch of padded character ids (B, Tmax).

  • text_lengths (LongTensor) – Batch of lengths of each input batch (B,).

  • feats (Tensor) – Batch of padded target features (B, Lmax, odim).

  • feats_lengths (LongTensor) – Batch of the lengths of each target (B,).

  • label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).

  • label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).

  • melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).

  • melody_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B, ).

  • tempo (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded tempo (B, Tmax).

  • tempo_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded tempo (B, ).

  • beat (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded beat (B, Tmax).

  • beat_length (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of the lengths of padded beat (B, ).

  • pitch (FloatTensor) – Batch of padded f0 (B, Tmax).

  • pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B, ).

  • duration (Optional[Dict]) – key is “phn”, “syb”; value (LongTensor): Batch of padded beat (B, Tmax).

  • spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).

  • sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).

  • lids (Optional[Tensor]) – Batch of language IDs (B, 1).

GS Fix:

Arguments of this forward function vs. **batch from espnet_model.py: label corresponds to durations / the phone sequence; melody corresponds to the pitch sequence.

Returns:

Loss scalar value. Dict: Statistics to be monitored. Tensor: Weight value if not joint training else model outputs.

Return type:

Tensor

inference(text: torch.Tensor, feats: Optional[torch.Tensor] = None, label: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, tempo: Optional[Dict[str, torch.Tensor]] = None, beat: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, use_teacher_forcing: torch.Tensor = False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • text (LongTensor) – Batch of padded character ids (Tmax).

  • feats (Tensor) – Batch of padded target features (Lmax, odim).

  • label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).

  • melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).

  • tempo (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded tempo (Tmax).

  • beat (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded beat (Tmax).

  • pitch (FloatTensor) – Batch of padded f0 (Tmax).

  • duration (Optional[Dict]) – key is “phn”, “syb”; value (LongTensor): Batch of padded beat (Tmax).

  • spembs (Optional[Tensor]) – Batch of speaker embeddings (spk_embed_dim).

  • sids (Optional[Tensor]) – Batch of speaker IDs (1).

  • lids (Optional[Tensor]) – Batch of language IDs (1).

Returns:

Output dict including the following items:
  • feat_gen (Tensor): Output sequence of features (T_feats, odim).

Return type:

Dict[str, Tensor]

class espnet2.svs.naive_rnn.naive_rnn.NaiveRNNLoss(use_masking=True, use_weighted_masking=False)[source]

Bases: torch.nn.modules.module.Module

Loss function module for Tacotron2.

Initialize Tacotron2 loss module.

Parameters:
  • use_masking (bool) – Whether to apply masking for padded part in loss calculation.

  • use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.

forward(after_outs, before_outs, ys, olens)[source]

Calculate forward propagation.

Parameters:
  • after_outs (Tensor) – Batch of outputs after postnets (B, Lmax, odim).

  • before_outs (Tensor) – Batch of outputs before postnets (B, Lmax, odim).

  • ys (Tensor) – Batch of padded target features (B, Lmax, odim).

  • olens (LongTensor) – Batch of the lengths of each target (B,).

Returns:

L1 loss value. Tensor: Mean square error loss value.

Return type:

Tensor
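A minimal sketch of NaiveRNNLoss with dummy tensors; the shapes follow the docstring above, and the (l1_loss, mse_loss) unpacking follows the documented return values.

import torch

from espnet2.svs.naive_rnn.naive_rnn import NaiveRNNLoss

criterion = NaiveRNNLoss(use_masking=True)

B, Lmax, odim = 2, 50, 80
after_outs = torch.randn(B, Lmax, odim)   # outputs after the postnet
before_outs = torch.randn(B, Lmax, odim)  # outputs before the postnet
ys = torch.randn(B, Lmax, odim)           # padded targets
olens = torch.tensor([50, 42])            # true target lengths

l1_loss, mse_loss = criterion(after_outs, before_outs, ys, olens)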

espnet2.svs.naive_rnn.naive_rnn_dp

NaiveRNN-DP-SVS related modules.

class espnet2.svs.naive_rnn.naive_rnn_dp.NaiveRNNDP(idim: int, odim: int, midi_dim: int = 129, embed_dim: int = 512, tempo_dim: int = 500, eprenet_conv_layers: int = 3, eprenet_conv_chans: int = 256, eprenet_conv_filts: int = 5, elayers: int = 3, eunits: int = 1024, ebidirectional: bool = True, midi_embed_integration_type: str = 'add', dlayers: int = 3, dunits: int = 1024, dbidirectional: bool = True, postnet_layers: int = 5, postnet_chans: int = 256, postnet_filts: int = 5, use_batch_norm: bool = True, duration_predictor_layers: int = 2, duration_predictor_chans: int = 384, duration_predictor_kernel_size: int = 3, duration_predictor_dropout_rate: float = 0.1, reduction_factor: int = 1, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'add', eprenet_dropout_rate: float = 0.5, edropout_rate: float = 0.1, ddropout_rate: float = 0.1, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', use_masking: bool = False, use_weighted_masking: bool = False)[source]

Bases: espnet2.svs.abs_svs.AbsSVS

NaiveRNNDP-SVS module.

This is an implementation of a naive RNN with duration prediction for singing voice synthesis. The features are processed directly in the time domain from the music score to predict the singing voice features.

Initialize NaiveRNNDP module.

Args: TODO(Yuning)
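As with NaiveRNN, a hedged construction sketch; the duration-predictor arguments (shown with their documented defaults) are what distinguish this class, and idim/odim are illustrative.

from espnet2.svs.naive_rnn.naive_rnn_dp import NaiveRNNDP

model = NaiveRNNDP(
    idim=40, odim=80,                  # illustrative dimensions
    duration_predictor_layers=2,       # documented defaults
    duration_predictor_chans=384,
    duration_predictor_kernel_size=3,
)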

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, label: Optional[Dict[str, torch.Tensor]] = None, label_lengths: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, melody_lengths: Optional[Dict[str, torch.Tensor]] = None, tempo: Optional[Dict[str, torch.Tensor]] = None, tempo_lengths: Optional[Dict[str, torch.Tensor]] = None, beat: Optional[Dict[str, torch.Tensor]] = None, beat_lengths: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, flag_IsValid=False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • text (LongTensor) – Batch of padded character ids (B, Tmax).

  • text_lengths (LongTensor) – Batch of lengths of each input batch (B,).

  • feats (Tensor) – Batch of padded target features (B, Lmax, odim).

  • feats_lengths (LongTensor) – Batch of the lengths of each target (B,).

  • label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).

  • label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).

  • melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).

  • melody_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B, ).

  • tempo (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded tempo (B, Tmax).

  • tempo_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded tempo (B, ).

  • beat (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded beat (B, Tmax).

  • beat_length (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of the lengths of padded beat (B, ).

  • pitch (FloatTensor) – Batch of padded f0 (B, Tmax).

  • pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B, ).

  • duration (Optional[Dict]) – key is “phn”, “syb”; value (LongTensor): Batch of padded beat (B, Tmax).

  • spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).

  • sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).

  • lids (Optional[Tensor]) – Batch of language IDs (B, 1).

GS Fix:

Arguments of this forward function vs. **batch from espnet_model.py: label corresponds to durations / the phone sequence; melody corresponds to the pitch sequence.

Returns:

Loss scalar value. Dict: Statistics to be monitored. Tensor: Weight value if not joint training else model outputs.

Return type:

Tensor

inference(text: torch.Tensor, feats: Optional[torch.Tensor] = None, label: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, tempo: Optional[Dict[str, torch.Tensor]] = None, beat: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, use_teacher_forcing: torch.Tensor = False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • text (LongTensor) – Batch of padded character ids (Tmax).

  • feats (Tensor) – Batch of padded target features (Lmax, odim).

  • label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).

  • melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).

  • tempo (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded tempo (Tmax).

  • beat (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded beat (Tmax).

  • pitch (FloatTensor) – Batch of padded f0 (Tmax).

  • duration (Optional[Dict]) – key is “phn”, “syb”; value (LongTensor): Batch of padded beat (Tmax).

  • spembs (Optional[Tensor]) – Batch of speaker embeddings (spk_embed_dim).

  • sids (Optional[Tensor]) – Batch of speaker IDs (1).

  • lids (Optional[Tensor]) – Batch of language IDs (1).

Returns:

Output dict including the following items:
  • feat_gen (Tensor): Output sequence of features (T_feats, odim).

Return type:

Dict[str, Tensor]

espnet2.svs.xiaoice.XiaoiceSing

XiaoiceSing related modules.

class espnet2.svs.xiaoice.XiaoiceSing.XiaoiceSing(idim: int, odim: int, midi_dim: int = 129, tempo_dim: int = 500, embed_dim: int = 512, adim: int = 384, aheads: int = 4, elayers: int = 6, eunits: int = 1536, dlayers: int = 6, dunits: int = 1536, postnet_layers: int = 5, postnet_chans: int = 512, postnet_filts: int = 5, postnet_dropout_rate: float = 0.5, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 1, use_scaled_pos_enc: bool = True, use_batch_norm: bool = True, encoder_normalize_before: bool = True, decoder_normalize_before: bool = True, encoder_concat_after: bool = False, decoder_concat_after: bool = False, duration_predictor_layers: int = 2, duration_predictor_chans: int = 384, duration_predictor_kernel_size: int = 3, duration_predictor_dropout_rate: float = 0.1, reduction_factor: int = 1, encoder_type: str = 'transformer', decoder_type: str = 'transformer', transformer_enc_dropout_rate: float = 0.1, transformer_enc_positional_dropout_rate: float = 0.1, transformer_enc_attn_dropout_rate: float = 0.1, transformer_dec_dropout_rate: float = 0.1, transformer_dec_positional_dropout_rate: float = 0.1, transformer_dec_attn_dropout_rate: float = 0.1, conformer_rel_pos_type: str = 'legacy', conformer_pos_enc_layer_type: str = 'rel_pos', conformer_self_attn_layer_type: str = 'rel_selfattn', conformer_activation_type: str = 'swish', use_macaron_style_in_conformer: bool = True, use_cnn_in_conformer: bool = True, zero_triu: bool = False, conformer_enc_kernel_size: int = 7, conformer_dec_kernel_size: int = 31, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'add', init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_masking: bool = False, use_weighted_masking: bool = False, loss_type: str = 'L1')[source]

Bases: espnet2.svs.abs_svs.AbsSVS

XiaoiceSing module for Singing Voice Synthesis.

This is a module of XiaoiceSing, a high-quality singing voice synthesis system that employs an integrated network for spectrum, F0, and duration modeling. It follows the main architecture of FastSpeech while introducing some singing-specific designs:

  1. Add features from the musical score (e.g., note pitch and length)

  2. Add a residual connection in F0 prediction to attenuate off-key issues

  3. The duration of all the phonemes in a musical note is accumulated to calculate the syllable duration loss for rhythm enhancement (syllable loss)
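To make design point 2 concrete, a toy sketch of an F0 residual connection: the network predicts a small deviation that is added back to the note pitch from the score, so the output cannot drift far off key. All names and values here are illustrative, not the ESPnet internals.

import torch

note_pitch = torch.tensor([220.0, 220.0, 246.9, 261.6])  # Hz, from the score
f0_residual = torch.tensor([2.1, -1.5, 0.8, -0.3])       # network prediction
f0_pred = note_pitch + f0_residual                        # residual connection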

Initialize XiaoiceSing module.

Parameters:
  • idim (int) – Dimension of the inputs.

  • odim (int) – Dimension of the outputs.

  • elayers (int) – Number of encoder layers.

  • eunits (int) – Number of encoder hidden units.

  • dlayers (int) – Number of decoder layers.

  • dunits (int) – Number of decoder hidden units.

  • postnet_layers (int) – Number of postnet layers.

  • postnet_chans (int) – Number of postnet channels.

  • postnet_filts (int) – Kernel size of postnet.

  • postnet_dropout_rate (float) – Dropout rate in postnet.

  • use_scaled_pos_enc (bool) – Whether to use trainable scaled pos encoding.

  • use_batch_norm (bool) – Whether to use batch normalization in encoder prenet.

  • encoder_normalize_before (bool) – Whether to apply layernorm layer before encoder block.

  • decoder_normalize_before (bool) – Whether to apply layernorm layer before decoder block.

  • encoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in encoder.

  • decoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in decoder.

  • duration_predictor_layers (int) – Number of duration predictor layers.

  • duration_predictor_chans (int) – Number of duration predictor channels.

  • duration_predictor_kernel_size (int) – Kernel size of duration predictor.

  • duration_predictor_dropout_rate (float) – Dropout rate in duration predictor.

  • reduction_factor (int) – Reduction factor.

  • encoder_type (str) – Encoder type (“transformer” or “conformer”).

  • decoder_type (str) – Decoder type (“transformer” or “conformer”).

  • transformer_enc_dropout_rate (float) – Dropout rate in encoder except attention and positional encoding.

  • transformer_enc_positional_dropout_rate (float) – Dropout rate after encoder positional encoding.

  • transformer_enc_attn_dropout_rate (float) – Dropout rate in encoder self-attention module.

  • transformer_dec_dropout_rate (float) – Dropout rate in decoder except attention & positional encoding.

  • transformer_dec_positional_dropout_rate (float) – Dropout rate after decoder positional encoding.

  • transformer_dec_attn_dropout_rate (float) – Dropout rate in decoder self-attention module.

  • spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.

  • langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.

  • spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.

  • spk_embed_integration_type – How to integrate speaker embedding.

  • init_type (str) – How to initialize transformer parameters.

  • init_enc_alpha (float) – Initial value of alpha in scaled pos encoding of the encoder.

  • init_dec_alpha (float) – Initial value of alpha in scaled pos encoding of the decoder.

  • use_masking (bool) – Whether to apply masking for padded part in loss calculation.

  • use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, label: Optional[Dict[str, torch.Tensor]] = None, label_lengths: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, melody_lengths: Optional[Dict[str, torch.Tensor]] = None, tempo: Optional[Dict[str, torch.Tensor]] = None, tempo_lengths: Optional[Dict[str, torch.Tensor]] = None, beat: Optional[Dict[str, torch.Tensor]] = None, beat_lengths: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, flag_IsValid=False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • text (LongTensor) – Batch of padded character ids (B, T_text).

  • text_lengths (LongTensor) – Batch of lengths of each input (B,).

  • feats (Tensor) – Batch of padded target features (B, T_feats, odim).

  • feats_lengths (LongTensor) – Batch of the lengths of each target (B,).

  • label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).

  • label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).

  • melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).

  • melody_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B, ).

  • tempo (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded tempo (B, Tmax).

  • tempo_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded tempo (B, ).

  • beat (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded beat (B, Tmax).

  • beat_length (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of the lengths of padded beat (B, ).

  • pitch (FloatTensor) – Batch of padded f0 (B, Tmax).

  • pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B, ).

  • duration (Optional[Dict]) – key is “phn”, “syb”; value (LongTensor): Batch of padded beat (B, Tmax).

  • spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).

  • sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).

  • lids (Optional[Tensor]) – Batch of language IDs (B, 1).

Returns:

Loss scalar value. Dict: Statistics to be monitored. Tensor: Weight value if not joint training else model outputs.

Return type:

Tensor

inference(text: torch.Tensor, feats: Optional[torch.Tensor] = None, label: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, tempo: Optional[Dict[str, torch.Tensor]] = None, beat: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None) → Dict[str, torch.Tensor][source]

Generate the sequence of features given the sequences of characters.

Parameters:
  • text (LongTensor) – Input sequence of characters (T_text,).

  • feats (Optional[Tensor]) – Feature sequence to extract style (N, idim).

  • durations (Optional[LongTensor]) – Groundtruth of duration (T_text + 1,).

  • label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).

  • melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).

  • tempo (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded tempo (Tmax).

  • beat (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded beat (Tmax).

  • pitch (FloatTensor) – Batch of padded f0 (B, Tmax).

  • duration (Optional[Dict]) – key is “phn”, “syb”; value (LongTensor): Batch of padded beat (Tmax).

  • spembs (Optional[Tensor]) – Speaker embedding (spk_embed_dim,).

  • sids (Optional[Tensor]) – Speaker ID (1,).

  • lids (Optional[Tensor]) – Language ID (1,).

  • alpha (float) – Alpha to control the speed.

Returns:

Output dict including the following items:
  • feat_gen (Tensor): Output sequence of features (T_feats, odim).

  • duration (Tensor): Duration sequence (T_text + 1,).

Return type:

Dict[str, Tensor]
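A hedged sketch of constructing XiaoiceSing with the documented defaults; idim and odim are illustrative, and the printed parameter count is just a sanity check, not a published figure.

from espnet2.svs.xiaoice.XiaoiceSing import XiaoiceSing

model = XiaoiceSing(idim=40, odim=80)  # defaults: 6-layer transformer enc/dec
n_params = sum(p.numel() for p in model.parameters())
print(f"XiaoiceSing parameters: {n_params / 1e6:.1f}M")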

espnet2.svs.xiaoice.__init__