espnet2.gan_svs.vits.vits.VITS
espnet2.gan_svs.vits.vits.VITS
class espnet2.gan_svs.vits.vits.VITS(idim: int, odim: int, sampling_rate: int = 22050, generator_type: str = 'visinger', vocoder_generator_type: str = 'hifigan', generator_params: Dict[str, Any] = {'decoder_channels': 512, 'decoder_kernel_size': 7, 'decoder_resblock_dilations': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'decoder_resblock_kernel_sizes': [3, 7, 11], 'decoder_upsample_kernel_sizes': [16, 16, 4, 4], 'decoder_upsample_scales': [8, 8, 2, 2], 'expand_f0_method': 'repeat', 'flow_base_dilation': 1, 'flow_dropout_rate': 0.0, 'flow_flows': 4, 'flow_kernel_size': 5, 'flow_layers': 4, 'global_channels': -1, 'hidden_channels': 192, 'hubert_channels': 0, 'langs': None, 'posterior_encoder_base_dilation': 1, 'posterior_encoder_dropout_rate': 0.0, 'posterior_encoder_kernel_size': 5, 'posterior_encoder_layers': 16, 'posterior_encoder_stacks': 1, 'projection_filters': [0, 1, 1, 1], 'projection_kernels': [0, 5, 7, 11], 'segment_size': 32, 'spk_embed_dim': None, 'spks': None, 'text_encoder_activation_type': 'swish', 'text_encoder_attention_dropout_rate': 0.0, 'text_encoder_attention_heads': 2, 'text_encoder_blocks': 6, 'text_encoder_conformer_kernel_size': 7, 'text_encoder_dropout_rate': 0.1, 'text_encoder_ffn_expand': 4, 'text_encoder_normalize_before': True, 'text_encoder_positional_dropout_rate': 0.0, 'text_encoder_positional_encoding_layer_type': 'rel_pos', 'text_encoder_positionwise_conv_kernel_size': 1, 'text_encoder_positionwise_layer_type': 'conv1d', 'text_encoder_self_attention_layer_type': 'rel_selfattn', 'use_conformer_conv_in_text_encoder': True, 'use_macaron_style_in_text_encoder': True, 'use_only_mean_in_flow': True, 'use_phoneme_predictor': False, 'use_weight_norm_in_decoder': True, 'use_weight_norm_in_flow': True, 'use_weight_norm_in_posterior_encoder': True}, discriminator_type: str = 'hifigan_multi_scale_multi_period_discriminator', discriminator_params: Dict[str, Any] = {'avocodo': {'combd': {'combd_d_d': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]], 'combd_d_g': [[1, 4, 16, 64, 256, 1], [1, 4, 16, 64, 256, 1], [1, 4, 16, 64, 256, 1]], 'combd_d_k': [[7, 11, 11, 11, 11, 5], [11, 21, 21, 21, 21, 5], [15, 41, 41, 41, 41, 5]], 'combd_d_p': [[3, 5, 5, 5, 5, 2], [5, 10, 10, 10, 10, 2], [7, 20, 20, 20, 20, 2]], 'combd_d_s': [[1, 1, 4, 4, 4, 1], [1, 1, 4, 4, 4, 1], [1, 1, 4, 4, 4, 1]], 'combd_h_u': [[16, 64, 256, 1024, 1024, 1024], [16, 64, 256, 1024, 1024, 1024], [16, 64, 256, 1024, 1024, 1024]], 'combd_op_f': [1, 1, 1], 'combd_op_g': [1, 1, 1], 'combd_op_k': [3, 3, 3]}, 'pqmf_config': {'lv1': [2, 256, 0.25, 10.0], 'lv2': [4, 192, 0.13, 10.0]}, 'sbd': {'pqmf_config': {'fsbd': [64, 256, 0.1, 9.0], 'sbd': [16, 256, 0.03, 10.0]}, 'sbd_band_ranges': [[0, 6], [0, 11], [0, 16], [0, 64]], 'sbd_dilations': [[[5, 7, 11], [5, 7, 11], [5, 7, 11], [5, 7, 11], [5, 7, 11]], [[3, 5, 7], [3, 5, 7], [3, 5, 7], [3, 5, 7], [3, 5, 7]], [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]], [[1, 2, 3], [1, 2, 3], [1, 2, 3], [2, 3, 5], [2, 3, 5]]], 'sbd_filters': [[64, 128, 256, 256, 256], [64, 128, 256, 256, 256], [64, 128, 256, 256, 256], [32, 64, 128, 128, 128]], 'sbd_kernel_sizes': [[[7, 7, 7], [7, 7, 7], [7, 7, 7], [7, 7, 7], [7, 7, 7]], [[5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5]], [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], [[5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5]]], 'sbd_strides': [[1, 1, 3, 3, 1], [1, 1, 3, 3, 1], [1, 1, 3, 3, 1], [1, 1, 3, 3, 1]], 'sbd_transpose': [False, False, False, True], 'use_sbd': True}}, 'hifigan_multi_scale_multi_period_discriminator': {'follow_official_norm': False, 'period_discriminator_params': {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'periods': [2, 3, 5, 7, 11], 'scale_discriminator_params': {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'scale_downsample_pooling': 'AvgPool1d', 'scale_downsample_pooling_params': {'kernel_size': 4, 'padding': 2, 'stride': 2}, 'scales': 1}}, generator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, discriminator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 22050, 'hop_length': 256, 'log_base': None, 'n_fft': 1024, 'n_mels': 80, 'win_length': None, 'window': 'hann'}, lambda_adv: float = 1.0, lambda_mel: float = 45.0, lambda_feat_match: float = 2.0, lambda_dur: float = 0.1, lambda_kl: float = 1.0, lambda_pitch: float = 10.0, lambda_phoneme: float = 1.0, lambda_c_yin: float = 45.0, cache_generator_outputs: bool = True)
Bases: AbsGANSVS
VITS module (generator + discriminator).
This is a module of VITS described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.
Initialize VITS module.
- Parameters:
- idim (int) – Input vocabrary size.
- odim (int) – Acoustic feature dimension. The actual output channels will be 1 since VITS is the end-to-end text-to-wave model but for the compatibility odim is used to indicate the acoustic feature dimension.
- sampling_rate (int) – Sampling rate, not used for the training but it will be referred in saving waveform during the inference.
- generator_type (str) – Generator type.
- vocoder_generator_type (str) – Type of vocoder generator to use in the model.
- generator_params (Dict *[*str , Any ]) – Parameter dict for generator.
- discriminator_type (str) – Discriminator type.
- discriminator_params (Dict *[*str , Any ]) – Parameter dict for discriminator.
- generator_adv_loss_params (Dict *[*str , Any ]) – Parameter dict for generator adversarial loss.
- discriminator_adv_loss_params (Dict *[*str , Any ]) – Parameter dict for discriminator adversarial loss.
- feat_match_loss_params (Dict *[*str , Any ]) – Parameter dict for feat match loss.
- mel_loss_params (Dict *[*str , Any ]) – Parameter dict for mel loss.
- lambda_adv (float) – Loss scaling coefficient for adversarial loss.
- lambda_mel (float) – Loss scaling coefficient for mel spectrogram loss.
- lambda_feat_match (float) – Loss scaling coefficient for feat match loss.
- lambda_dur (float) – Loss scaling coefficient for duration loss.
- lambda_kl (float) – Loss scaling coefficient for KL divergence loss.
- lambda_pitch (float) – Loss scaling coefficient for pitch loss.
- lambda_phoneme (float) – Loss scaling coefficient for phoneme loss.
- lambda_c_yin (float) – Loss scaling coefficient for yin loss.
- cache_generator_outputs (bool) – Whether to cache generator outputs.
forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, singing: Tensor, singing_lengths: Tensor, ssl_feats: Tensor | None = None, ssl_feats_lengths: Tensor | None = None, label: Dict[str, Tensor] | None = None, label_lengths: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, pitch: LongTensor | None = None, ying: Tensor | None = None, duration: Dict[str, Tensor] | None = None, slur: LongTensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, forward_generator: bool = True) → Dict[str, Any]
Perform generator forward.
- Parameters:
- text (LongTensor) – Batch of padded character ids (B, T_text).
- text_lengths (LongTensor) – Batch of lengths of each input batch (B,).
- feats (Tensor) – Batch of padded target features (B, T_feats, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- singing (Tensor) – Singing waveform tensor (B, T_wav).
- singing_lengths (Tensor) – Singing length tensor (B,).
- ssl_feats (Tensor) – SSL feature tensor (B, T_feats, hubert_channels).
- ssl_feats_lengths (Tensor) – SSL feature length tensor (B,).
- label (Optional *[*Dict ]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, T_text).
- label_lengths (Optional *[*Dict ]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).
- melody (Optional *[*Dict ]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, T_text).
- pitch (FloatTensor) – Batch of padded f0 (B, T_feats).
- ying (Optional *[*Tensor ]) – Batch of padded ying (B, T_feats).
- duration (Optional *[*Dict ]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, T_text).
- slur (FloatTensor) – Batch of padded slur (B, T_text).
- spembs (Optional *[*Tensor ]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional *[*Tensor ]) – Batch of speaker IDs (B, 1).
- lids (Optional *[*Tensor ]) – Batch of language IDs (B, 1).
- forward_generator (bool) – Whether to forward generator.
- Returns:
- loss (Tensor): Loss scalar tensor.
- stats (Dict[str, float]): Statistics to be monitored.
- weight (Tensor): Weight tensor to summarize losses.
- optim_idx (int): Optimizer index (0 for G and 1 for D).
- Return type: Dict[str, Any]
inference(text: Tensor, feats: Tensor | None = None, ssl_feats: Tensor | None = None, label: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, duration: Dict[str, Tensor] | None = None, slur: Dict[str, Tensor] | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, alpha: float = 1.0, max_len: int | None = None, use_teacher_forcing: bool = False) → Dict[str, Tensor]
Run inference.
- Parameters:
- text (Tensor) – Input text index tensor (T_text,).
- feats (Tensor) – Feature tensor (T_feats, aux_channels).
- ssl_feats (Tensor) – SSL Feature tensor (T_feats, hubert_channels).
- label (Optional *[*Dict ]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, T_text).
- melody (Optional *[*Dict ]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, T_text).
- pitch (FloatTensor) – Batch of padded f0 (B, T_feats).
- slur (LongTensor) – Batch of padded slur (B, T_text).
- sids (Tensor) – Speaker index tensor (1,).
- spembs (Optional *[*Tensor ]) – Speaker embedding tensor (spk_embed_dim,).
- lids (Tensor) – Language index tensor (1,).
- noise_scale (float) – Noise scale value for flow.
- noise_scale_dur (float) – Noise scale value for duration predictor.
- alpha (float) – Alpha parameter to control the speed of generated singing.
- max_len (Optional *[*int ]) – Maximum length.
- use_teacher_forcing (bool) – Whether to use teacher forcing.
- duration (Optional *[*Dict ]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, T_text).
- Returns:
- wav (Tensor): Generated waveform tensor (T_wav,).
- Return type: Dict[str, Tensor]
property require_raw_singing
Return whether or not singing is required.
property require_vocoder
Return whether or not vocoder is required.