espnet2.svs package¶
espnet2.svs.abs_svs¶
Singing-voice-synthesis abstract class.
-
class
espnet2.svs.abs_svs.
AbsSVS
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
SVS abstract class.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Calculate outputs and return the loss tensor.
-
abstract
inference
(text: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]¶ Return predicted output as a dict.
-
property
require_raw_singing
¶ Return whether or not raw_singing is required.
-
property
require_vocoder
¶ Return whether or not vocoder is required.
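Concrete SVS modules (e.g., the NaiveRNN and XiaoiceSing classes below) subclass AbsSVS and implement the two abstract methods. The following is a minimal hedged sketch of the interface contract; the class MySVS and its dummy loss are purely illustrative and not part of ESPnet:
```python
import torch
from typing import Dict, Tuple

from espnet2.svs.abs_svs import AbsSVS


class MySVS(AbsSVS):
    """Hypothetical minimal SVS module (illustration only)."""

    def __init__(self, idim: int, odim: int):
        super().__init__()
        self.embed = torch.nn.Embedding(idim, odim)

    def forward(
        self,
        text: torch.Tensor,
        text_lengths: torch.Tensor,
        feats: torch.Tensor,
        feats_lengths: torch.Tensor,
        **kwargs,
    ) -> Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor]:
        # A real module would predict feats from the score inputs; here we
        # only produce a dummy loss so the interface contract is visible.
        loss = self.embed(text).pow(2).mean()
        stats = {"loss": loss.detach()}
        weight = text.new_tensor(text.size(0), dtype=torch.float)
        return loss, stats, weight

    def inference(self, text: torch.Tensor, **kwargs) -> Dict[str, torch.Tensor]:
        # Return the predicted output as a dict, per the abstract interface.
        return {"feat_gen": self.embed(text)}
```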
espnet2.svs.__init__¶
espnet2.svs.espnet_model¶
Singing-voice-synthesis ESPnet model.
-
class
espnet2.svs.espnet_model.
ESPnetSVSModel
(text_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], feats_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], score_feats_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], label_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], pitch_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], tempo_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], beat_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], energy_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], pitch_normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], energy_normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], svs: espnet2.svs.abs_svs.AbsSVS)[source]¶ Bases:
espnet2.train.abs_espnet_model.AbsESPnetModel
ESPnet model for singing voice synthesis task.
Initialize ESPnetSVSModel module.
-
collect_feats
(text: torch.Tensor, text_lengths: torch.Tensor, singing: torch.Tensor, singing_lengths: torch.Tensor, label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, label_lab: Optional[torch.Tensor] = None, label_lab_lengths: Optional[torch.Tensor] = None, label_score: Optional[torch.Tensor] = None, label_score_lengths: Optional[torch.Tensor] = None, phn_cnt: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, midi_lab: Optional[torch.Tensor] = None, midi_lab_lengths: Optional[torch.Tensor] = None, midi_score: Optional[torch.Tensor] = None, midi_score_lengths: Optional[torch.Tensor] = None, tempo_lab: Optional[torch.Tensor] = None, tempo_lab_lengths: Optional[torch.Tensor] = None, tempo_score: Optional[torch.Tensor] = None, tempo_score_lengths: Optional[torch.Tensor] = None, beat_phn: Optional[torch.Tensor] = None, beat_phn_lengths: Optional[torch.Tensor] = None, beat_ruled_phn: Optional[torch.Tensor] = None, beat_ruled_phn_lengths: Optional[torch.Tensor] = None, beat_syb: Optional[torch.Tensor] = None, beat_syb_lengths: Optional[torch.Tensor] = None, beat_lab: Optional[torch.Tensor] = None, beat_lab_lengths: Optional[torch.Tensor] = None, beat_score_phn: Optional[torch.Tensor] = None, beat_score_phn_lengths: Optional[torch.Tensor] = None, beat_score_syb: Optional[torch.Tensor] = None, beat_score_syb_lengths: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, energy_lengths: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, **kwargs) → Dict[str, torch.Tensor][source]¶ Calculate features and return them as a dict.
- Parameters:
text (Tensor) – Text index tensor (B, T_text).
text_lengths (Tensor) – Text length tensor (B,).
singing (Tensor) – Singing waveform tensor (B, T_wav).
singing_lengths (Tensor) – Singing length tensor (B,).
label_* is the label ID sequence:
label (Optional[Tensor]) – Label tensor (B, T_label).
label_lengths (Optional[Tensor]) – Label length tensor (B,).
label_lab (Optional[Tensor]) – Label tensor (B, T_wav).
label_lab_lengths (Optional[Tensor]) – Label length tensor (B,).
label_score (Optional[Tensor]) – Label tensor (B, T_score).
label_score_lengths (Optional[Tensor]) – Label length tensor (B,).
phn_cnt (Optional[Tensor]) – Number of phones in each syllable (B, T_syb).
midi_* is the MIDI ID sequence:
midi (Optional[Tensor]) – Midi tensor (B, T_label).
midi_lengths (Optional[Tensor]) – Midi length tensor (B,).
midi_lab (Optional[Tensor]) – Midi tensor (B, T_wav).
midi_lab_lengths (Optional[Tensor]) – Midi length tensor (B,).
midi_score (Optional[Tensor]) – Midi tensor (B, T_score).
midi_score_lengths (Optional[Tensor]) – Midi length tensor (B,).
tempo_* is the tempo (BPM) sequence:
tempo_lab (Optional[Tensor]) – Tempo tensor (B, T_wav).
tempo_lab_lengths (Optional[Tensor]) – Tempo length tensor (B,).
tempo_score (Optional[Tensor]) – Tempo tensor (B, T_score).
tempo_score_lengths (Optional[Tensor]) – Tempo length tensor (B,).
beat_* is the duration in time_shift units:
beat_phn (Optional[Tensor]) – Beat tensor (B, T_label).
beat_phn_lengths (Optional[Tensor]) – Beat length tensor (B,).
beat_ruled_phn (Optional[Tensor]) – Beat tensor (B, T_phone).
beat_ruled_phn_lengths (Optional[Tensor]) – Beat length tensor (B,).
beat_syb (Optional[Tensor]) – Beat tensor (B, T_syb).
beat_syb_lengths (Optional[Tensor]) – Beat length tensor (B,).
beat_lab (Optional[Tensor]) – Beat tensor (B, T_wav).
beat_lab_lengths (Optional[Tensor]) – Beat length tensor (B,).
beat_score_phn (Optional[Tensor]) – Beat tensor (B, T_score).
beat_score_phn_lengths (Optional[Tensor]) – Beat length tensor (B,).
beat_score_syb (Optional[Tensor]) – Beat tensor (B, T_score).
beat_score_syb_lengths (Optional[Tensor]) – Beat length tensor (B,).
pitch (Optional[Tensor]) – Pitch (f0) tensor (B, T_wav).
pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).
energy (Optional[Tensor]) – Energy tensor.
energy_lengths (Optional[Tensor]) – Energy length tensor (B,).
spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).
sids (Optional[Tensor]) – Speaker ID tensor (B, 1).
lids (Optional[Tensor]) – Language ID tensor (B, 1).
- Returns:
Dict of features.
- Return type:
Dict[str, Tensor]
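A hedged usage sketch: it assumes `model` is an already-constructed ESPnetSVSModel (normally built by the SVS task from a training config), and the dummy shapes are illustrative only; the keys of the returned dict depend on which extractors are configured:
```python
import torch

text = torch.randint(0, 40, (2, 10))            # (B, T_text)
text_lengths = torch.tensor([10, 8])            # (B,)
singing = torch.randn(2, 16000)                 # (B, T_wav)
singing_lengths = torch.tensor([16000, 12000])  # (B,)

feats_dict = model.collect_feats(
    text=text,
    text_lengths=text_lengths,
    singing=singing,
    singing_lengths=singing_lengths,
)
for name, tensor in feats_dict.items():  # keys depend on configured extractors
    print(name, tuple(tensor.shape))
```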
-
forward
(text: torch.Tensor, text_lengths: torch.Tensor, singing: torch.Tensor, singing_lengths: torch.Tensor, label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, label_lab: Optional[torch.Tensor] = None, label_lab_lengths: Optional[torch.Tensor] = None, label_score: Optional[torch.Tensor] = None, label_score_lengths: Optional[torch.Tensor] = None, phn_cnt: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, midi_lab: Optional[torch.Tensor] = None, midi_lab_lengths: Optional[torch.Tensor] = None, midi_score: Optional[torch.Tensor] = None, midi_score_lengths: Optional[torch.Tensor] = None, tempo_lab: Optional[torch.Tensor] = None, tempo_lab_lengths: Optional[torch.Tensor] = None, tempo_score: Optional[torch.Tensor] = None, tempo_score_lengths: Optional[torch.Tensor] = None, beat_phn: Optional[torch.Tensor] = None, beat_phn_lengths: Optional[torch.Tensor] = None, beat_ruled_phn: Optional[torch.Tensor] = None, beat_ruled_phn_lengths: Optional[torch.Tensor] = None, beat_syb: Optional[torch.Tensor] = None, beat_syb_lengths: Optional[torch.Tensor] = None, beat_lab: Optional[torch.Tensor] = None, beat_lab_lengths: Optional[torch.Tensor] = None, beat_score_phn: Optional[torch.Tensor] = None, beat_score_phn_lengths: Optional[torch.Tensor] = None, beat_score_syb: Optional[torch.Tensor] = None, beat_score_syb_lengths: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, energy_lengths: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, flag_IsValid=False, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Calculate outputs and return the loss tensor.
- Parameters:
text (Tensor) – Text index tensor (B, T_text).
text_lengths (Tensor) – Text length tensor (B,).
singing (Tensor) – Singing waveform tensor (B, T_wav).
singing_lengths (Tensor) – Singing length tensor (B,).
label_* is the label ID sequence:
label (Optional[Tensor]) – Label tensor (B, T_label).
label_lengths (Optional[Tensor]) – Label length tensor (B,).
label_lab (Optional[Tensor]) – Label tensor (B, T_wav).
label_lab_lengths (Optional[Tensor]) – Label length tensor (B,).
label_score (Optional[Tensor]) – Label tensor (B, T_score).
label_score_lengths (Optional[Tensor]) – Label length tensor (B,).
phn_cnt (Optional[Tensor]) – Number of phones in each syllable (B, T_syb).
midi_* is the MIDI ID sequence:
midi (Optional[Tensor]) – Midi tensor (B, T_label).
midi_lengths (Optional[Tensor]) – Midi length tensor (B,).
midi_lab (Optional[Tensor]) – Midi tensor (B, T_wav).
midi_lab_lengths (Optional[Tensor]) – Midi length tensor (B,).
midi_score (Optional[Tensor]) – Midi tensor (B, T_score).
midi_score_lengths (Optional[Tensor]) – Midi length tensor (B,).
tempo_* is the tempo (BPM) sequence:
tempo_lab (Optional[Tensor]) – Tempo tensor (B, T_wav).
tempo_lab_lengths (Optional[Tensor]) – Tempo length tensor (B,).
tempo_score (Optional[Tensor]) – Tempo tensor (B, T_score).
tempo_score_lengths (Optional[Tensor]) – Tempo length tensor (B,).
beat_* is the duration in time_shift units:
beat_phn (Optional[Tensor]) – Beat tensor (B, T_label).
beat_phn_lengths (Optional[Tensor]) – Beat length tensor (B,).
beat_ruled_phn (Optional[Tensor]) – Beat tensor (B, T_phone).
beat_ruled_phn_lengths (Optional[Tensor]) – Beat length tensor (B,).
beat_syb (Optional[Tensor]) – Beat tensor (B, T_syb).
beat_syb_lengths (Optional[Tensor]) – Beat length tensor (B,).
beat_lab (Optional[Tensor]) – Beat tensor (B, T_wav).
beat_lab_lengths (Optional[Tensor]) – Beat length tensor (B,).
beat_score_phn (Optional[Tensor]) – Beat tensor (B, T_score).
beat_score_phn_lengths (Optional[Tensor]) – Beat length tensor (B,).
beat_score_syb (Optional[Tensor]) – Beat tensor (B, T_score).
beat_score_syb_lengths (Optional[Tensor]) – Beat length tensor (B,).
pitch (Optional[Tensor]) – Pitch (f0) tensor (B, T_wav).
pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).
energy (Optional[Tensor]) – Energy tensor.
energy_lengths (Optional[Tensor]) – Energy length tensor (B,).
spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).
sids (Optional[Tensor]) – Speaker ID tensor (B, 1).
lids (Optional[Tensor]) – Language ID tensor (B, 1).
kwargs – “utt_id” is among the inputs.
- Returns:
Loss scalar tensor. Dict[str, float]: Statistics to be monitored. Tensor: Weight tensor to summarize losses.
- Return type:
Tensor
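Reusing the dummy batch from the collect_feats sketch above, a single training step would look like the following. Which optional score inputs (label, midi, beat, ...) are actually required depends on the configured extractors and the wrapped SVS module, so treat this as a shape-level sketch:
```python
loss, stats, weight = model(
    text=text,
    text_lengths=text_lengths,
    singing=singing,
    singing_lengths=singing_lengths,
    # plus label/midi/beat/... inputs as required by the configuration
)
loss.backward()
print(stats)  # statistics to be monitored, e.g. individual loss terms
```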
-
inference
(text: torch.Tensor, singing: Optional[torch.Tensor] = None, label: Optional[torch.Tensor] = None, label_lab: Optional[torch.Tensor] = None, label_score: Optional[torch.Tensor] = None, phn_cnt: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lab: Optional[torch.Tensor] = None, midi_score: Optional[torch.Tensor] = None, tempo_lab: Optional[torch.Tensor] = None, tempo_score: Optional[torch.Tensor] = None, beat_phn: Optional[torch.Tensor] = None, beat_ruled_phn: Optional[torch.Tensor] = None, beat_syb: Optional[torch.Tensor] = None, beat_lab: Optional[torch.Tensor] = None, beat_score_phn: Optional[torch.Tensor] = None, beat_score_syb: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, **decode_config) → Dict[str, torch.Tensor][source]¶ Calculate features and return them as a dict.
- Parameters:
text (Tensor) – Text index tensor (T_text).
singing (Tensor) – Singing waveform tensor (T_wav).
label_* is the label ID sequence:
label (Optional[Tensor]) – Label tensor (T_label).
label_lab (Optional[Tensor]) – Label tensor (T_wav).
label_score (Optional[Tensor]) – Label tensor (T_score).
phn_cnt (Optional[Tensor]) – Number of phones in each syllable (T_syb)
midi_* is the MIDI ID sequence:
midi (Optional[Tensor]) – Midi tensor (T_label).
midi_lab (Optional[Tensor]) – Midi tensor (T_wav).
midi_score (Optional[Tensor]) – Midi tensor (T_score).
tempo_* is the tempo (BPM) sequence:
tempo_lab (Optional[Tensor]) – Tempo tensor (T_wav).
tempo_score (Optional[Tensor]) – Tempo tensor (T_score).
beat_* is the duration in time_shift units:
beat_phn (Optional[Tensor]) – Beat tensor (T_label).
beat_ruled_phn (Optional[Tensor]) – Beat tensor (T_phone).
beat_syb (Optional[Tensor]) – Beat tensor (T_syb).
beat_lab (Optional[Tensor]) – Beat tensor (T_wav).
beat_score_phn (Optional[Tensor]) – Beat tensor (T_score).
beat_score_syb (Optional[Tensor]) – Beat tensor (T_score).
spembs (Optional[Tensor]) – Speaker embedding tensor (D,).
sids (Optional[Tensor]) – Speaker ID tensor (1,).
lids (Optional[Tensor]) – Language ID tensor (1,).
pitch (Optional[Tensor]) – Pitch tensor (T_wav).
energy (Optional[Tensor]) – Energy tensor.
- Returns:
Dict of outputs.
- Return type:
Dict[str, Tensor]
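A hedged decoding sketch, again assuming a constructed `model`; the required score inputs and the exact output keys follow the wrapped SVS module (“feat_gen” is the key documented for the modules below, so treat it as an assumption here):
```python
import torch

with torch.no_grad():
    output_dict = model.inference(
        text=torch.randint(0, 40, (10,)),  # (T_text,) for a single utterance
        # plus label/midi/beat/... score inputs as required by the module
    )
feat_gen = output_dict["feat_gen"]  # assumed key; see the SVS modules below
```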
-
espnet2.svs.feats_extract.score_feats_extract¶
-
class
espnet2.svs.feats_extract.score_feats_extract.
FrameScoreFeats
(fs: Union[int, str] = 22050, n_fft: int = 1024, win_length: int = 512, hop_length: int = 128, window: str = 'hann', center: bool = True)[source]¶ Bases:
espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract
-
extra_repr
()[source]¶ Set the extra representation of the module
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
-
forward
(label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, tempo: Optional[torch.Tensor] = None, tempo_lengths: Optional[torch.Tensor] = None, beat: Optional[torch.Tensor] = None, beat_lengths: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶ FrameScoreFeats forward function.
- Parameters:
label – (Batch, Nsamples)
label_lengths – (Batch)
midi – (Batch, Nsamples)
midi_lengths – (Batch)
tempo – (Batch, Nsamples)
tempo_lengths – (Batch)
beat – (Batch, Nsamples)
beat_lengths – (Batch)
- Returns:
(Batch, Frames)
- Return type:
output
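A hedged sketch of frame-level score feature extraction; the constructor values are the documented defaults and the dummy sample-level inputs are assumptions:
```python
import torch
from espnet2.svs.feats_extract.score_feats_extract import FrameScoreFeats

extractor = FrameScoreFeats(fs=22050, n_fft=1024, win_length=512, hop_length=128)

lengths = torch.tensor([16000, 12000])  # (Batch,)
# forward() returns the frame-level sequences and their lengths as a
# 6-tuple; the exact ordering follows the source implementation.
outputs = extractor(
    label=torch.randint(0, 40, (2, 16000)), label_lengths=lengths,
    midi=torch.randint(0, 129, (2, 16000)), midi_lengths=lengths,
    tempo=torch.full((2, 16000), 120, dtype=torch.long), tempo_lengths=lengths,
    beat=torch.randint(0, 100, (2, 16000)), beat_lengths=lengths,
)
```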
-
-
class
espnet2.svs.feats_extract.score_feats_extract.
SyllableScoreFeats
(fs: Union[int, str] = 22050, n_fft: int = 1024, win_length: int = 512, hop_length: int = 128, window: str = 'hann', center: bool = True)[source]¶ Bases:
espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract
-
extra_repr
()[source]¶ Set the extra representation of the module
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
-
forward
(label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, tempo: Optional[torch.Tensor] = None, tempo_lengths: Optional[torch.Tensor] = None, beat: Optional[torch.Tensor] = None, beat_lengths: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶ SyllableScoreFeats forward function.
- Parameters:
label – (Batch, Nsamples)
label_lengths – (Batch)
midi – (Batch, Nsamples)
midi_lengths – (Batch)
tempo – (Batch, Nsamples)
tempo_lengths – (Batch)
beat – (Batch, Nsamples)
beat_lengths – (Batch)
- Returns:
(Batch, Frames)
- Return type:
output
-
get_segments
(label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, tempo: Optional[torch.Tensor] = None, tempo_lengths: Optional[torch.Tensor] = None, beat: Optional[torch.Tensor] = None, beat_lengths: Optional[torch.Tensor] = None)[source]¶
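SyllableScoreFeats takes the same constructor arguments and keyword inputs as FrameScoreFeats but segments the sequences at syllable level (forward returns an 8-tuple, and get_segments performs the segmentation). A hedged sketch with assumed dummy inputs:
```python
import torch
from espnet2.svs.feats_extract.score_feats_extract import SyllableScoreFeats

extractor = SyllableScoreFeats(fs=22050, n_fft=1024, win_length=512, hop_length=128)

lengths = torch.tensor([16000, 12000])  # (Batch,)
# Same keyword inputs as FrameScoreFeats; the outputs are syllable-level
# sequences and lengths (8-tuple ordering per the source implementation).
outputs = extractor(
    label=torch.randint(0, 40, (2, 16000)), label_lengths=lengths,
    midi=torch.randint(0, 129, (2, 16000)), midi_lengths=lengths,
    tempo=torch.full((2, 16000), 120, dtype=torch.long), tempo_lengths=lengths,
    beat=torch.randint(0, 100, (2, 16000)), beat_lengths=lengths,
)
```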
-
espnet2.svs.feats_extract.__init__¶
espnet2.svs.naive_rnn.__init__¶
espnet2.svs.naive_rnn.naive_rnn¶
Naive-SVS related modules.
-
class
espnet2.svs.naive_rnn.naive_rnn.
NaiveRNN
(idim: int, odim: int, midi_dim: int = 129, embed_dim: int = 512, eprenet_conv_layers: int = 3, eprenet_conv_chans: int = 256, eprenet_conv_filts: int = 5, elayers: int = 3, eunits: int = 1024, ebidirectional: bool = True, midi_embed_integration_type: str = 'add', dlayers: int = 3, dunits: int = 1024, dbidirectional: bool = True, postnet_layers: int = 5, postnet_chans: int = 256, postnet_filts: int = 5, use_batch_norm: bool = True, reduction_factor: int = 1, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'add', eprenet_dropout_rate: float = 0.5, edropout_rate: float = 0.1, ddropout_rate: float = 0.1, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', use_masking: bool = False, use_weighted_masking: bool = False, loss_type: str = 'L1')[source]¶ Bases:
espnet2.svs.abs_svs.AbsSVS
NaiveRNN-SVS module.
This is an implementation of a naive RNN for singing voice synthesis. The features from the music score are processed directly in the time domain to predict the singing voice features.
Initialize NaiveRNN module.
Args: TODO(Yuning)
-
forward
(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, label: Optional[Dict[str, torch.Tensor]] = None, label_lengths: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, melody_lengths: Optional[Dict[str, torch.Tensor]] = None, tempo: Optional[Dict[str, torch.Tensor]] = None, tempo_lengths: Optional[Dict[str, torch.Tensor]] = None, beat: Optional[Dict[str, torch.Tensor]] = None, beat_lengths: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, flag_IsValid=False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Calculate forward propagation.
- Parameters:
text (LongTensor) – Batch of padded character ids (B, Tmax).
text_lengths (LongTensor) – Batch of lengths of each input batch (B,).
feats (Tensor) – Batch of padded target features (B, Lmax, odim).
feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).
label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).
melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).
melody_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B, ).
tempo (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded tempo (B, Tmax).
tempo_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded tempo (B, ).
beat (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded beat (B, Tmax).
beat_lengths (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of the lengths of padded beat (B, ).
pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B, ).
duration (Optional[Dict]) – key is “phn”, “syb”; value (LongTensor): Batch of padded duration (B, Tmax).
spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- GS Fix:
arguments from the forward function vs. **batch from espnet_model.py; label == durations | phone sequence, melody -> pitch sequence
- Returns:
Loss scalar value. Dict: Statistics to be monitored. Tensor: Weight value if not joint training else model outputs.
- Return type:
Tensor
-
inference
(text: torch.Tensor, feats: Optional[torch.Tensor] = None, label: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, tempo: Optional[Dict[str, torch.Tensor]] = None, beat: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, use_teacher_forcing: torch.Tensor = False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Calculate forward propagation.
- Parameters:
text (LongTensor) – Batch of padded character ids (Tmax).
feats (Tensor) – Batch of padded target features (Lmax, odim).
label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).
melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).
tempo (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded tempo (Tmax).
beat (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded beat (Tmax).
pitch (FloatTensor) – Batch of padded f0 (Tmax).
duration (Optional[Dict]) – key is “phn”, “syb”; value (LongTensor): Batch of padded duration (Tmax).
spembs (Optional[Tensor]) – Batch of speaker embeddings (spk_embed_dim).
sids (Optional[Tensor]) – Batch of speaker IDs (1).
lids (Optional[Tensor]) – Batch of language IDs (1).
- Returns:
- Output dict including the following items:
feat_gen (Tensor): Output sequence of features (T_feats, odim).
- Return type:
Dict[str, Tensor]
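A hedged construction sketch using only documented constructor arguments; the forward call below is shape-level only, since which of the optional dict-valued score inputs (keyed “lab”/“score”) are actually required follows the source implementation:
```python
import torch
from espnet2.svs.naive_rnn.naive_rnn import NaiveRNN

model = NaiveRNN(idim=40, odim=80, midi_dim=129)  # other args keep defaults

B, Tmax, Lmax = 2, 12, 50
text = torch.randint(1, 40, (B, Tmax))
text_lengths = torch.tensor([12, 10])
feats = torch.randn(B, Lmax, 80)
feats_lengths = torch.tensor([50, 45])
label = {"lab": text.clone(), "score": text.clone()}
label_lengths = {"lab": text_lengths, "score": text_lengths}
melody = {"lab": torch.randint(0, 129, (B, Tmax)),
          "score": torch.randint(0, 129, (B, Tmax))}
melody_lengths = {"lab": text_lengths, "score": text_lengths}

loss, stats, weight = model(
    text, text_lengths, feats, feats_lengths,
    label=label, label_lengths=label_lengths,
    melody=melody, melody_lengths=melody_lengths,
)
```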
-
-
class
espnet2.svs.naive_rnn.naive_rnn.
NaiveRNNLoss
(use_masking=True, use_weighted_masking=False)[source]¶ Bases:
torch.nn.modules.module.Module
Loss function module for NaiveRNN (Tacotron2-style loss).
Initialize NaiveRNN loss module.
- Parameters:
use_masking (bool) – Whether to apply masking for padded part in loss calculation.
use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
-
forward
(after_outs, before_outs, ys, olens)[source]¶ Calculate forward propagation.
- Parameters:
after_outs (Tensor) – Batch of outputs after postnets (B, Lmax, odim).
before_outs (Tensor) – Batch of outputs before postnets (B, Lmax, odim).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
olens (LongTensor) – Batch of the lengths of each target (B,).
- Returns:
L1 loss value. Tensor: Mean square error loss value.
- Return type:
Tensor
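A small hedged example of the masked L1 + MSE computation with dummy tensors, using only the documented signature:
```python
import torch
from espnet2.svs.naive_rnn.naive_rnn import NaiveRNNLoss

criterion = NaiveRNNLoss(use_masking=True, use_weighted_masking=False)

B, Lmax, odim = 2, 50, 80
after_outs = torch.randn(B, Lmax, odim)   # outputs after the postnet
before_outs = torch.randn(B, Lmax, odim)  # outputs before the postnet
ys = torch.randn(B, Lmax, odim)           # padded target features
olens = torch.tensor([50, 42])            # true target lengths

l1_loss, mse_loss = criterion(after_outs, before_outs, ys, olens)
loss = l1_loss + mse_loss
```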
espnet2.svs.naive_rnn.naive_rnn_dp¶
NaiveRNN-DP-SVS related modules.
-
class
espnet2.svs.naive_rnn.naive_rnn_dp.
NaiveRNNDP
(idim: int, odim: int, midi_dim: int = 129, embed_dim: int = 512, tempo_dim: int = 500, eprenet_conv_layers: int = 3, eprenet_conv_chans: int = 256, eprenet_conv_filts: int = 5, elayers: int = 3, eunits: int = 1024, ebidirectional: bool = True, midi_embed_integration_type: str = 'add', dlayers: int = 3, dunits: int = 1024, dbidirectional: bool = True, postnet_layers: int = 5, postnet_chans: int = 256, postnet_filts: int = 5, use_batch_norm: bool = True, duration_predictor_layers: int = 2, duration_predictor_chans: int = 384, duration_predictor_kernel_size: int = 3, duration_predictor_dropout_rate: float = 0.1, reduction_factor: int = 1, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'add', eprenet_dropout_rate: float = 0.5, edropout_rate: float = 0.1, ddropout_rate: float = 0.1, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', use_masking: bool = False, use_weighted_masking: bool = False)[source]¶ Bases:
espnet2.svs.abs_svs.AbsSVS
NaiveRNNDP-SVS module.
This is an implementation of a naive RNN with duration prediction for singing voice synthesis. The features from the music score are processed directly in the time domain to predict the singing voice features.
Initialize NaiveRNNDP module.
Args: TODO(Yuning)
-
forward
(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, label: Optional[Dict[str, torch.Tensor]] = None, label_lengths: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, melody_lengths: Optional[Dict[str, torch.Tensor]] = None, tempo: Optional[Dict[str, torch.Tensor]] = None, tempo_lengths: Optional[Dict[str, torch.Tensor]] = None, beat: Optional[Dict[str, torch.Tensor]] = None, beat_lengths: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, flag_IsValid=False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Calculate forward propagation.
- Parameters:
text (LongTensor) – Batch of padded character ids (B, Tmax).
text_lengths (LongTensor) – Batch of lengths of each input batch (B,).
feats (Tensor) – Batch of padded target features (B, Lmax, odim).
feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).
label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).
melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).
melody_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B, ).
tempo (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded tempo (B, Tmax).
tempo_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded tempo (B, ).
beat (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded beat (B, Tmax).
beat_lengths (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of the lengths of padded beat (B, ).
pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B, ).
duration (Optional[Dict]) – key is “phn”, “syb”; value (LongTensor): Batch of padded duration (B, Tmax).
spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- GS Fix:
arguments from the forward function vs. **batch from espnet_model.py; label == durations | phone sequence, melody -> pitch sequence
- Returns:
Loss scalar value. Dict: Statistics to be monitored. Tensor: Weight value if not joint training else model outputs.
- Return type:
Tensor
-
inference
(text: torch.Tensor, feats: Optional[torch.Tensor] = None, label: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, tempo: Optional[Dict[str, torch.Tensor]] = None, beat: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, use_teacher_forcing: torch.Tensor = False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Calculate forward propagation.
- Parameters:
text (LongTensor) – Batch of padded character ids (Tmax).
feats (Tensor) – Batch of padded target features (Lmax, odim).
label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).
melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).
tempo (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded tempo (Tmax).
beat (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded beat (Tmax).
pitch (FloatTensor) – Batch of padded f0 (Tmax).
duration (Optional[Dict]) – key is “phn”, “syb”; value (LongTensor): Batch of padded duration (Tmax).
spembs (Optional[Tensor]) – Batch of speaker embeddings (spk_embed_dim).
sids (Optional[Tensor]) – Batch of speaker IDs (1).
lids (Optional[Tensor]) – Batch of language IDs (1).
- Returns:
- Output dict including the following items:
feat_gen (Tensor): Output sequence of features (T_feats, odim).
- Return type:
Dict[str, Tensor]
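NaiveRNNDP exposes the same forward/inference interface as NaiveRNN; the main additions are the duration predictor arguments. A hedged construction sketch using documented arguments only:
```python
from espnet2.svs.naive_rnn.naive_rnn_dp import NaiveRNNDP

model = NaiveRNNDP(
    idim=40,
    odim=80,
    midi_dim=129,
    tempo_dim=500,
    duration_predictor_layers=2,       # the duration predictor replaces
    duration_predictor_chans=384,      # ground-truth alignment at inference
    duration_predictor_kernel_size=3,
)
```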
-
espnet2.svs.xiaoice.XiaoiceSing¶
XiaoiceSing related modules.
-
class
espnet2.svs.xiaoice.XiaoiceSing.
XiaoiceSing
(idim: int, odim: int, midi_dim: int = 129, tempo_dim: int = 500, embed_dim: int = 512, adim: int = 384, aheads: int = 4, elayers: int = 6, eunits: int = 1536, dlayers: int = 6, dunits: int = 1536, postnet_layers: int = 5, postnet_chans: int = 512, postnet_filts: int = 5, postnet_dropout_rate: float = 0.5, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 1, use_scaled_pos_enc: bool = True, use_batch_norm: bool = True, encoder_normalize_before: bool = True, decoder_normalize_before: bool = True, encoder_concat_after: bool = False, decoder_concat_after: bool = False, duration_predictor_layers: int = 2, duration_predictor_chans: int = 384, duration_predictor_kernel_size: int = 3, duration_predictor_dropout_rate: float = 0.1, reduction_factor: int = 1, encoder_type: str = 'transformer', decoder_type: str = 'transformer', transformer_enc_dropout_rate: float = 0.1, transformer_enc_positional_dropout_rate: float = 0.1, transformer_enc_attn_dropout_rate: float = 0.1, transformer_dec_dropout_rate: float = 0.1, transformer_dec_positional_dropout_rate: float = 0.1, transformer_dec_attn_dropout_rate: float = 0.1, conformer_rel_pos_type: str = 'legacy', conformer_pos_enc_layer_type: str = 'rel_pos', conformer_self_attn_layer_type: str = 'rel_selfattn', conformer_activation_type: str = 'swish', use_macaron_style_in_conformer: bool = True, use_cnn_in_conformer: bool = True, zero_triu: bool = False, conformer_enc_kernel_size: int = 7, conformer_dec_kernel_size: int = 31, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'add', init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_masking: bool = False, use_weighted_masking: bool = False, loss_type: str = 'L1')[source]¶ Bases:
espnet2.svs.abs_svs.AbsSVS
XiaoiceSing module for Singing Voice Synthesis.
This is a module of XiaoiceSing, a high-quality singing voice synthesis system which employs an integrated network for spectrum, F0, and duration modeling. It follows the main architecture of FastSpeech while proposing some singing-specific designs:
1) Add features from the musical score (e.g., note pitch and length).
2) Add a residual connection in F0 prediction to attenuate off-key issues.
3) Accumulate the duration of all phonemes in a musical note to calculate the syllable duration loss for rhythm enhancement (syllable loss).
Initialize XiaoiceSing module.
- Parameters:
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
elayers (int) – Number of encoder layers.
eunits (int) – Number of encoder hidden units.
dlayers (int) – Number of decoder layers.
dunits (int) – Number of decoder hidden units.
postnet_layers (int) – Number of postnet layers.
postnet_chans (int) – Number of postnet channels.
postnet_filts (int) – Kernel size of postnet.
postnet_dropout_rate (float) – Dropout rate in postnet.
use_scaled_pos_enc (bool) – Whether to use trainable scaled pos encoding.
use_batch_norm (bool) – Whether to use batch normalization in encoder prenet.
encoder_normalize_before (bool) – Whether to apply layernorm layer before encoder block.
decoder_normalize_before (bool) – Whether to apply layernorm layer before decoder block.
encoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in encoder.
decoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in decoder.
duration_predictor_layers (int) – Number of duration predictor layers.
duration_predictor_chans (int) – Number of duration predictor channels.
duration_predictor_kernel_size (int) – Kernel size of duration predictor.
duration_predictor_dropout_rate (float) – Dropout rate in duration predictor.
reduction_factor (int) – Reduction factor.
encoder_type (str) – Encoder type (“transformer” or “conformer”).
decoder_type (str) – Decoder type (“transformer” or “conformer”).
transformer_enc_dropout_rate (float) – Dropout rate in encoder except attention and positional encoding.
transformer_enc_positional_dropout_rate (float) – Dropout rate after encoder positional encoding.
transformer_enc_attn_dropout_rate (float) – Dropout rate in encoder self-attention module.
transformer_dec_dropout_rate (float) – Dropout rate in decoder except attention & positional encoding.
transformer_dec_positional_dropout_rate (float) – Dropout rate after decoder positional encoding.
transformer_dec_attn_dropout_rate (float) – Dropout rate in decoder self-attention module.
spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.
langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.
spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
spk_embed_integration_type – How to integrate speaker embedding.
init_type (str) – How to initialize transformer parameters.
init_enc_alpha (float) – Initial value of alpha in scaled pos encoding of the encoder.
init_dec_alpha (float) – Initial value of alpha in scaled pos encoding of the decoder.
use_masking (bool) – Whether to apply masking for padded part in loss calculation.
use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
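A hedged construction sketch using a subset of the documented parameters (the idim/odim values are arbitrary and defaults apply to everything omitted):
```python
from espnet2.svs.xiaoice.XiaoiceSing import XiaoiceSing

model = XiaoiceSing(
    idim=40,
    odim=80,
    midi_dim=129,
    adim=384,
    aheads=4,
    elayers=6,
    eunits=1536,
    dlayers=6,
    dunits=1536,
    encoder_type="transformer",
    decoder_type="transformer",
    reduction_factor=1,
)
```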
-
forward
(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, label: Optional[Dict[str, torch.Tensor]] = None, label_lengths: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, melody_lengths: Optional[Dict[str, torch.Tensor]] = None, tempo: Optional[Dict[str, torch.Tensor]] = None, tempo_lengths: Optional[Dict[str, torch.Tensor]] = None, beat: Optional[Dict[str, torch.Tensor]] = None, beat_lengths: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, flag_IsValid=False) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Calculate forward propagation.
- Parameters:
text (LongTensor) – Batch of padded character ids (B, T_text).
text_lengths (LongTensor) – Batch of lengths of each input (B,).
feats (Tensor) – Batch of padded target features (B, T_feats, odim).
feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).
label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).
melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).
melody_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B, ).
tempo (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded tempo (B, Tmax).
tempo_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded tempo (B, ).
beat (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded beat (B, Tmax).
beat_lengths (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of the lengths of padded beat (B, ).
pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B, ).
duration (Optional[Dict]) – key is “phn”, “syb”; value (LongTensor): Batch of padded duration (B, Tmax).
spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- Returns:
Loss scalar value. Dict: Statistics to be monitored. Tensor: Weight value if not joint training else model outputs.
- Return type:
Tensor
-
inference
(text: torch.Tensor, feats: Optional[torch.Tensor] = None, label: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, tempo: Optional[Dict[str, torch.Tensor]] = None, beat: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None) → Dict[str, torch.Tensor][source]¶ Generate the sequence of features given the sequences of characters.
- Parameters:
text (LongTensor) – Input sequence of characters (T_text,).
feats (Optional[Tensor]) – Feature sequence to extract style (N, idim).
durations (Optional[LongTensor]) – Groundtruth of duration (T_text + 1,).
label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).
melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).
tempo (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded tempo (Tmax).
beat (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded beat (Tmax).
pitch (FloatTensor) – Batch of padded f0 (Tmax).
duration (Optional[Dict]) – key is “phn”, “syb”; value (LongTensor): Batch of padded duration (Tmax).
spembs (Optional[Tensor]) – Speaker embedding (spk_embed_dim,).
sids (Optional[Tensor]) – Speaker ID (1,).
lids (Optional[Tensor]) – Language ID (1,).
alpha (float) – Alpha to control the speed.
- Returns:
- Output dict including the following items:
feat_gen (Tensor): Output sequence of features (T_feats, odim).
duration (Tensor): Duration sequence (T_text + 1,).
- Return type:
Dict[str, Tensor]
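A hedged decoding sketch, assuming the model constructed above; the dict-valued score inputs use the “lab”/“score” keying described in the parameter list, and the dummy values are placeholders (which inputs are strictly required follows the source implementation):
```python
import torch

T_text = 12
with torch.no_grad():
    output_dict = model.inference(
        text=torch.randint(1, 40, (T_text,)),
        label={"lab": torch.randint(1, 40, (T_text,)),
               "score": torch.randint(1, 40, (T_text,))},
        melody={"lab": torch.randint(0, 129, (T_text,)),
                "score": torch.randint(0, 129, (T_text,))},
    )
feat_gen = output_dict["feat_gen"]   # (T_feats, odim)
duration = output_dict["duration"]   # (T_text + 1,) per the docs above
```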