espnet2.gan_svs.joint.joint_score2wav.JointScore2Wav

About 4 min

espnet2.gan_svs.joint.joint_score2wav.JointScore2Wav

class espnet2.gan_svs.joint.joint_score2wav.JointScore2Wav(idim: int, odim: int, segment_size: int = 32, sampling_rate: int = 22050, score2mel_type: str = 'xiaoice', score2mel_params: Dict[str, Any] = {'adim': 384, 'aheads': 4, 'conformer_activation_type': 'swish', 'conformer_dec_kernel_size': 31, 'conformer_enc_kernel_size': 7, 'conformer_pos_enc_layer_type': 'rel_pos', 'conformer_rel_pos_type': 'latest', 'conformer_self_attn_layer_type': 'rel_selfattn', 'decoder_concat_after': False, 'decoder_normalize_before': True, 'decoder_type': 'transformer', 'dlayers': 6, 'dunits': 1536, 'duration_predictor_chans': 384, 'duration_predictor_dropout_rate': 0.1, 'duration_predictor_kernel_size': 3, 'duration_predictor_layers': 2, 'elayers': 6, 'encoder_concat_after': False, 'encoder_normalize_before': True, 'encoder_type': 'transformer', 'eunits': 1536, 'init_dec_alpha': 1.0, 'init_enc_alpha': 1.0, 'init_type': 'xavier_uniform', 'lambda_dur': 0.1, 'lambda_mel': 1, 'lambda_pitch': 0.01, 'lambda_vuv': 0.01, 'langs': None, 'loss_function': 'XiaoiceSing2', 'loss_type': 'L1', 'midi_dim': 129, 'positionwise_conv_kernel_size': 1, 'positionwise_layer_type': 'conv1d', 'postnet_chans': 512, 'postnet_dropout_rate': 0.5, 'postnet_filts': 5, 'postnet_layers': 5, 'reduction_factor': 1, 'spk_embed_dim': None, 'spk_embed_integration_type': 'add', 'spks': None, 'tempo_dim': 500, 'transformer_dec_attn_dropout_rate': 0.1, 'transformer_dec_dropout_rate': 0.1, 'transformer_dec_positional_dropout_rate': 0.1, 'transformer_enc_attn_dropout_rate': 0.1, 'transformer_enc_dropout_rate': 0.1, 'transformer_enc_positional_dropout_rate': 0.1, 'use_batch_norm': True, 'use_cnn_in_conformer': True, 'use_macaron_style_in_conformer': True, 'use_masking': False, 'use_scaled_pos_enc': True, 'use_weighted_masking': False, 'zero_triu': False}, vocoder_type: str = 'hifigan_generator', vocoder_params: Dict[str, Any] = {'bias': True, 'channels': 512, 'global_channels': -1, 'kernel_size': 7, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'resblock_dilations': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'resblock_kernel_sizes': [3, 7, 11], 'upsample_kernel_sizes': [16, 16, 4, 4], 'upsample_scales': [8, 8, 2, 2], 'use_additional_convs': True, 'use_weight_norm': True}, use_pqmf: bool = False, pqmf_params: Dict[str, Any] = {'beta': 9.0, 'cutoff_ratio': 0.142, 'subbands': 4, 'taps': 62}, discriminator_type: str = 'hifigan_multi_scale_multi_period_discriminator', discriminator_params: Dict[str, Any] = {'follow_official_norm': False, 'period_discriminator_params': {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'periods': [2, 3, 5, 7, 11], 'scale_discriminator_params': {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'scale_downsample_pooling': 'AvgPool1d', 'scale_downsample_pooling_params': {'kernel_size': 4, 'padding': 2, 'stride': 2}, 'scales': 1}, generator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, discriminator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, use_feat_match_loss: bool = True, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, use_mel_loss: bool = True, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 22050, 'hop_length': 256, 'log_base': None, 'n_fft': 1024, 'n_mels': 80, 'win_length': None, 'window': 'hann'}, lambda_score2mel: float = 1.0, lambda_adv: float = 1.0, lambda_feat_match: float = 2.0, lambda_mel: float = 45.0, cache_generator_outputs: bool = False)

Bases: AbsGANSVS

General class to jointly train score2mel and vocoder parts.

Initialize JointScore2Wav module.

Parameters:
- idim (int) – Input vocabrary size.
- odim (int) – Acoustic feature dimension. The actual output channels will be 1 since the model is the end-to-end text-to-wave model but for the compatibility odim is used to indicate the acoustic feature dimension.
- segment_size (int) – Segment size for random windowed inputs.
- sampling_rate (int) – Sampling rate, not used for the training but it will be referred in saving waveform during the inference.
- text2mel_type (str) – The text2mel model type.
- text2mel_params (Dict *[*str , Any ]) – Parameter dict for text2mel model.
- use_pqmf (bool) – Whether to use PQMF for multi-band vocoder.
- pqmf_params (Dict *[*str , Any ]) – Parameter dict for PQMF module.
- vocoder_type (str) – The vocoder model type.
- vocoder_params (Dict *[*str , Any ]) – Parameter dict for vocoder model.
- discriminator_type (str) – Discriminator type.
- discriminator_params (Dict *[*str , Any ]) – Parameter dict for discriminator.
- generator_adv_loss_params (Dict *[*str , Any ]) – Parameter dict for generator adversarial loss.
- discriminator_adv_loss_params (Dict *[*str , Any ]) – Parameter dict for discriminator adversarial loss.
- use_feat_match_loss (bool) – Whether to use feat match loss.
- feat_match_loss_params (Dict *[*str , Any ]) – Parameter dict for feat match loss.
- use_mel_loss (bool) – Whether to use mel loss.
- mel_loss_params (Dict *[*str , Any ]) – Parameter dict for mel loss.
- lambda_text2mel (float) – Loss scaling coefficient for text2mel model loss.
- lambda_adv (float) – Loss scaling coefficient for adversarial loss.
- lambda_feat_match (float) – Loss scaling coefficient for feat match loss.
- lambda_mel (float) – Loss scaling coefficient for mel loss.
- cache_generator_outputs (bool) – Whether to cache generator outputs.

forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, singing: Tensor, singing_lengths: Tensor, label: Dict[str, Tensor] | None = None, label_lengths: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, pitch: LongTensor | None = None, duration: Dict[str, Tensor] | None = None, slur: LongTensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, forward_generator: bool = True) → Dict[str, Any]

Perform generator forward.

Parameters:
- text (LongTensor) – Batch of padded character ids (B, Tmax).
- text_lengths (LongTensor) – Batch of lengths of each input batch (B,).
- feats (Tensor) – Batch of padded target features (B, Lmax, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- singing (Tensor) – Singing waveform tensor (B, T_wav).
- singing_lengths (Tensor) – Singing length tensor (B,).
- label (Optional *[*Dict ]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).
- label_lengths (Optional *[*Dict ]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).
- melody (Optional *[*Dict ]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).
- pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
- duration (Optional *[*Dict ]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, Tmax).
- slur (FloatTensor) – Batch of padded slur (B, Tmax).
- spembs (Optional *[*Tensor ]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional *[*Tensor ]) – Batch of speaker IDs (B, 1).
- lids (Optional *[*Tensor ]) – Batch of language IDs (B, 1).
- forward_generator (bool) – Whether to forward generator.
Returns:
- loss (Tensor): Loss scalar tensor.
- stats (Dict[str, float]): Statistics to be monitored.
- weight (Tensor): Weight tensor to summarize losses.
- optim_idx (int): Optimizer index (0 for G and 1 for D).
Return type: Dict[str, Any]

inference(text: Tensor, feats: Tensor | None = None, label: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, duration: Dict[str, Tensor] | None = None, slur: Dict[str, Tensor] | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, alpha: float = 1.0, max_len: int | None = None, use_teacher_forcing: bool = False) → Dict[str, Tensor]

Run inference.

Parameters:
- text (Tensor) – Input text index tensor (T_text,).
- feats (Tensor) – Feature tensor (T_feats, aux_channels).
- label (Optional *[*Dict ]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).
- melody (Optional *[*Dict ]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).
- duration (Optional *[*Dict ]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, Tmax).
- pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
- slur (LongTensor) – Batch of padded slur (B, Tmax).
- sids (Tensor) – Speaker index tensor (1,).
- spembs (Optional *[*Tensor ]) – Speaker embedding tensor (spk_embed_dim,).
- lids (Tensor) – Language index tensor (1,).
- noise_scale (float) – Noise scale value for flow.
- noise_scale_dur (float) – Noise scale value for duration predictor.
- alpha (float) – Alpha parameter to control the speed of generated singing.
- max_len (Optional *[*int ]) – Maximum length.
- use_teacher_forcing (bool) – Whether to use teacher forcing.
Returns:
- wav (Tensor): Generated waveform tensor (T_wav,).
- feat_gan (Tensor): Generated feature tensor (T_text, C).
Return type: Dict[str, Tensor]

property require_raw_singing

Return whether or not singing is required.

property require_vocoder

Return whether or not vocoder is required.