espnet2.svs.espnet_model.ESPnetSVSModel
class espnet2.svs.espnet_model.ESPnetSVSModel(text_extract: AbsFeatsExtract | None, feats_extract: AbsFeatsExtract | None, score_feats_extract: AbsFeatsExtract | None, label_extract: AbsFeatsExtract | None, pitch_extract: AbsFeatsExtract | None, ying_extract: AbsFeatsExtract | None, duration_extract: AbsFeatsExtract | None, energy_extract: AbsFeatsExtract | None, normalize: InversibleInterface | None, pitch_normalize: InversibleInterface | None, energy_normalize: InversibleInterface | None, svs: AbsSVS)
Bases: AbsESPnetModel
ESPnet model for the singing voice synthesis (SVS) task.
Initialize ESPnetSVSModel module.
collect_feats(text: Tensor, text_lengths: Tensor, singing: Tensor, singing_lengths: Tensor, label: Tensor | None = None, label_lengths: Tensor | None = None, phn_cnt: Tensor | None = None, midi: Tensor | None = None, midi_lengths: Tensor | None = None, duration_phn: Tensor | None = None, duration_phn_lengths: Tensor | None = None, duration_ruled_phn: Tensor | None = None, duration_ruled_phn_lengths: Tensor | None = None, duration_syb: Tensor | None = None, duration_syb_lengths: Tensor | None = None, slur: Tensor | None = None, slur_lengths: Tensor | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, energy: Tensor | None = None, energy_lengths: Tensor | None = None, ying: Tensor | None = None, ying_lengths: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, **kwargs) → Dict[str, Tensor]
Calculate features and return them as a dict.
- Parameters:
- text (Tensor) – Text index tensor (B, T_text).
- text_lengths (Tensor) – Text length tensor (B,).
- singing (Tensor) – Singing waveform tensor (B, T_wav).
- singing_lengths (Tensor) – Singing length tensor (B,).
- label (Optional[Tensor]) – Label tensor (B, T_label).
- label_lengths (Optional[Tensor]) – Label length tensor (B,).
- phn_cnt (Optional[Tensor]) – Number of phones in each syllable (B, T_syb).
- midi (Optional[Tensor]) – MIDI tensor (B, T_label).
- midi_lengths (Optional[Tensor]) – MIDI length tensor (B,).
- Note: every duration* tensor gives durations in units of time_shift.
- duration_phn (Optional[Tensor]) – Duration tensor (B, T_label).
- duration_phn_lengths (Optional[Tensor]) – Duration length tensor (B,).
- duration_ruled_phn (Optional[Tensor]) – Duration tensor (B, T_phone).
- duration_ruled_phn_lengths (Optional[Tensor]) – Duration length tensor (B,).
- duration_syb (Optional[Tensor]) – Duration tensor (B, T_syb).
- duration_syb_lengths (Optional[Tensor]) – Duration length tensor (B,).
- slur (Optional[Tensor]) – Slur tensor (B, T_slur).
- slur_lengths (Optional[Tensor]) – Slur length tensor (B,).
- pitch (Optional[Tensor]) – Pitch tensor (B, T_wav), i.e. the f0 sequence.
- pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).
- energy (Optional[Tensor]) – Energy tensor.
- energy_lengths (Optional[Tensor]) – Energy length tensor (B,).
- spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).
- sids (Optional[Tensor]) – Speaker ID tensor (B, 1).
- lids (Optional[Tensor]) – Language ID tensor (B, 1).
- Returns: Dict of features.
- Return type: Dict[str, Tensor]
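The parameters above pair each padded tensor of shape (B, T) with a length tensor of shape (B,). The following is a minimal sketch of that convention only, using plain Python lists as stand-ins for tensors so that it runs without dependencies. The helper name `pad_batch`, the padding value 0, and the token ids are assumptions for illustration and are not part of ESPnet.

```python
from typing import List, Sequence, Tuple


def pad_batch(
    seqs: Sequence[Sequence[int]], pad_value: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Pad variable-length sequences to a common length.

    Returns the padded batch, shaped (B, T_max), and the true
    lengths, shaped (B,). Hypothetical helper, not an ESPnet API.
    """
    lengths = [len(s) for s in seqs]
    t_max = max(lengths) if lengths else 0
    padded = [list(s) + [pad_value] * (t_max - len(s)) for s in seqs]
    return padded, lengths


# Two utterances with 3 and 5 text tokens (made-up token ids).
text, text_lengths = pad_batch([[4, 9, 2], [7, 1, 3, 8, 5]])
print(text)          # [[4, 9, 2, 0, 0], [7, 1, 3, 8, 5]]
print(text_lengths)  # [3, 5]
```

In real use these would be `torch.Tensor` objects, and the same pairing applies to the other inputs that have a matching `*_lengths` argument.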
forward(text: Tensor, text_lengths: Tensor, singing: Tensor, singing_lengths: Tensor, feats: Tensor | None = None, feats_lengths: Tensor | None = None, label: Tensor | None = None, label_lengths: Tensor | None = None, phn_cnt: Tensor | None = None, midi: Tensor | None = None, midi_lengths: Tensor | None = None, duration_phn: Tensor | None = None, duration_phn_lengths: Tensor | None = None, duration_ruled_phn: Tensor | None = None, duration_ruled_phn_lengths: Tensor | None = None, duration_syb: Tensor | None = None, duration_syb_lengths: Tensor | None = None, slur: Tensor | None = None, slur_lengths: Tensor | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, energy: Tensor | None = None, energy_lengths: Tensor | None = None, ying: Tensor | None = None, ying_lengths: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, flag_IsValid=False, **kwargs) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Calculate outputs and return the loss tensor.
- Parameters:
- text (Tensor) – Text index tensor (B, T_text).
- text_lengths (Tensor) – Text length tensor (B,).
- singing (Tensor) – Singing waveform tensor (B, T_wav).
- singing_lengths (Tensor) – Singing length tensor (B,).
- label (Optional[Tensor]) – Label tensor (B, T_label).
- label_lengths (Optional[Tensor]) – Label length tensor (B,).
- phn_cnt (Optional[Tensor]) – Number of phones in each syllable (B, T_syb).
- midi (Optional[Tensor]) – MIDI tensor (B, T_label).
- midi_lengths (Optional[Tensor]) – MIDI length tensor (B,).
- duration_phn (Optional[Tensor]) – Duration tensor (B, T_label).
- duration_phn_lengths (Optional[Tensor]) – Duration length tensor (B,).
- duration_ruled_phn (Optional[Tensor]) – Duration tensor (B, T_phone).
- duration_ruled_phn_lengths (Optional[Tensor]) – Duration length tensor (B,).
- duration_syb (Optional[Tensor]) – Duration tensor (B, T_syllable).
- duration_syb_lengths (Optional[Tensor]) – Duration length tensor (B,).
- slur (Optional[Tensor]) – Slur tensor (B, T_slur).
- slur_lengths (Optional[Tensor]) – Slur length tensor (B,).
- pitch (Optional[Tensor]) – Pitch tensor (B, T_wav), i.e. the f0 sequence.
- pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).
- energy (Optional[Tensor]) – Energy tensor.
- energy_lengths (Optional[Tensor]) – Energy length tensor (B,).
- spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).
- sids (Optional[Tensor]) – Speaker ID tensor (B, 1).
- lids (Optional[Tensor]) – Language ID tensor (B, 1).
- kwargs – Additional keyword arguments; “utt_id” is among the inputs.
- Returns: A tuple of the loss scalar tensor (Tensor), the statistics to be monitored (Dict[str, Tensor]), and the weight tensor used to summarize losses (Tensor).
- Return type: Tuple[Tensor, Dict[str, Tensor], Tensor]
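The third returned value is described as a weight for summarizing losses. One plausible use is a weighted mean of per-batch losses, sketched below with plain floats in place of tensors. The assumption that the weight behaves like a batch size, the helper name `weighted_mean_loss`, and the numbers are illustrative only and are not taken from the ESPnet source.

```python
from typing import Dict, Sequence, Tuple

# (loss, stats, weight) triples, mirroring the shape of the return value.
ForwardResult = Tuple[float, Dict[str, float], float]


def weighted_mean_loss(results: Sequence[ForwardResult]) -> float:
    """Average per-batch losses, weighting each one by its weight.

    Hypothetical helper: assumes larger weights mean the batch
    should count more (for example, more utterances in the batch).
    """
    total_weight = sum(weight for _, _, weight in results)
    if total_weight == 0:
        raise ValueError("total weight must be non-zero")
    return sum(loss * weight for loss, _, weight in results) / total_weight


# Made-up results from two batches with weights 4 and 12.
results = [
    (0.5, {"loss": 0.5}, 4.0),
    (1.0, {"loss": 1.0}, 12.0),
]
print(weighted_mean_loss(results))  # 0.875
```

An unweighted mean of the two losses would give 0.75; the weighted mean leans toward the batch with the larger weight.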
inference(text: Tensor, singing: Tensor | None = None, label: Tensor | None = None, phn_cnt: Tensor | None = None, midi: Tensor | None = None, duration_phn: Tensor | None = None, duration_ruled_phn: Tensor | None = None, duration_syb: Tensor | None = None, slur: Tensor | None = None, pitch: Tensor | None = None, energy: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, **decode_config) → Dict[str, Tensor]
Calculate features and return them as a dict.
- Parameters:
- text (Tensor) – Text index tensor (T_text).
- singing (Tensor) – Singing waveform tensor (T_wav).
- label (Optional[Tensor]) – Label tensor (T_label).
- phn_cnt (Optional[Tensor]) – Number of phones in each syllable (T_syb).
- midi (Optional[Tensor]) – MIDI tensor (T_label).
- duration_phn (Optional[Tensor]) – Duration tensor (T_label).
- duration_ruled_phn (Optional[Tensor]) – Duration tensor (T_phone).
- duration_syb (Optional[Tensor]) – Duration tensor (T_phone).
- slur (Optional[Tensor]) – Slur tensor (T_phone).
- spembs (Optional[Tensor]) – Speaker embedding tensor (D,).
- sids (Optional[Tensor]) – Speaker ID tensor (1,).
- lids (Optional[Tensor]) – Language ID tensor (1,).
- pitch (Optional[Tensor]) – Pitch tensor (T_wav).
- energy (Optional[Tensor]) – Energy tensor.
- Returns: Dict of outputs.
- Return type: Dict[str, Tensor]
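Unlike forward, inference takes unbatched inputs with no leading B dimension and no length arguments. A minimal sketch of recovering per-utterance sequences from a padded batch follows, again with plain Python lists standing in for tensors. The helper name `unbatch` and the sample values are assumptions for illustration, not part of ESPnet.

```python
from typing import List, Sequence


def unbatch(
    padded: Sequence[Sequence[int]], lengths: Sequence[int]
) -> List[List[int]]:
    """Trim each row of a padded (B, T_max) batch to its true length.

    Hypothetical helper: each returned item has shape (T,), which is
    the unbatched form that the inference parameters describe.
    """
    if len(padded) != len(lengths):
        raise ValueError("padded and lengths must have the same batch size")
    return [list(row[:n]) for row, n in zip(padded, lengths)]


# A padded batch of two text sequences with true lengths 3 and 5.
padded_text = [[4, 9, 2, 0, 0], [7, 1, 3, 8, 5]]
items = unbatch(padded_text, [3, 5])
print(items[0])  # [4, 9, 2]
print(items[1])  # [7, 1, 3, 8, 5]
```

Each trimmed item could then be passed on its own as `text`, with the other optional inputs trimmed the same way.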