espnet2.gan_tts.vits.vits.VITS
class espnet2.gan_tts.vits.vits.VITS(idim: int, odim: int, sampling_rate: int = 22050, generator_type: str = 'vits_generator', generator_params: Dict[str, Any] = {'decoder_channels': 512, 'decoder_kernel_size': 7, 'decoder_resblock_dilations': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'decoder_resblock_kernel_sizes': [3, 7, 11], 'decoder_upsample_kernel_sizes': [16, 16, 4, 4], 'decoder_upsample_scales': [8, 8, 2, 2], 'flow_base_dilation': 1, 'flow_dropout_rate': 0.0, 'flow_flows': 4, 'flow_kernel_size': 5, 'flow_layers': 4, 'global_channels': -1, 'hidden_channels': 192, 'langs': None, 'posterior_encoder_base_dilation': 1, 'posterior_encoder_dropout_rate': 0.0, 'posterior_encoder_kernel_size': 5, 'posterior_encoder_layers': 16, 'posterior_encoder_stacks': 1, 'segment_size': 32, 'spk_embed_dim': None, 'spks': None, 'stochastic_duration_predictor_dds_conv_layers': 3, 'stochastic_duration_predictor_dropout_rate': 0.5, 'stochastic_duration_predictor_flows': 4, 'stochastic_duration_predictor_kernel_size': 3, 'text_encoder_activation_type': 'swish', 'text_encoder_attention_dropout_rate': 0.0, 'text_encoder_attention_heads': 2, 'text_encoder_blocks': 6, 'text_encoder_conformer_kernel_size': 7, 'text_encoder_dropout_rate': 0.1, 'text_encoder_ffn_expand': 4, 'text_encoder_normalize_before': True, 'text_encoder_positional_dropout_rate': 0.0, 'text_encoder_positional_encoding_layer_type': 'rel_pos', 'text_encoder_positionwise_conv_kernel_size': 1, 'text_encoder_positionwise_layer_type': 'conv1d', 'text_encoder_self_attention_layer_type': 'rel_selfattn', 'use_conformer_conv_in_text_encoder': True, 'use_macaron_style_in_text_encoder': True, 'use_only_mean_in_flow': True, 'use_weight_norm_in_decoder': True, 'use_weight_norm_in_flow': True, 'use_weight_norm_in_posterior_encoder': True}, discriminator_type: str = 'hifigan_multi_scale_multi_period_discriminator', discriminator_params: Dict[str, Any] = {'follow_official_norm': False, 'period_discriminator_params': {'bias': True, 'channels': 
32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'periods': [2, 3, 5, 7, 11], 'scale_discriminator_params': {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'scale_downsample_pooling': 'AvgPool1d', 'scale_downsample_pooling_params': {'kernel_size': 4, 'padding': 2, 'stride': 2}, 'scales': 1}, generator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, discriminator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 22050, 'hop_length': 256, 'log_base': None, 'n_fft': 1024, 'n_mels': 80, 'win_length': None, 'window': 'hann'}, lambda_adv: float = 1.0, lambda_mel: float = 45.0, lambda_feat_match: float = 2.0, lambda_dur: float = 1.0, lambda_kl: float = 1.0, cache_generator_outputs: bool = True, plot_pred_mos: bool = False, mos_pred_tool: str = 'utmos')
Bases: AbsGANTTS
VITS module (generator + discriminator).
This is a module of VITS described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.
Initialize VITS module.
- Parameters:
- idim (int) – Input vocabulary size.
- odim (int) – Acoustic feature dimension. The actual number of output channels is 1, since VITS is an end-to-end text-to-waveform model, but odim is kept to indicate the acoustic feature dimension for compatibility.
- sampling_rate (int) – Sampling rate. Not used during training, but referenced when saving the waveform during inference.
- generator_type (str) – Generator type.
- generator_params (Dict[str, Any]) – Parameter dict for generator.
- discriminator_type (str) – Discriminator type.
- discriminator_params (Dict[str, Any]) – Parameter dict for discriminator.
- generator_adv_loss_params (Dict[str, Any]) – Parameter dict for generator adversarial loss.
- discriminator_adv_loss_params (Dict[str, Any]) – Parameter dict for discriminator adversarial loss.
- feat_match_loss_params (Dict[str, Any]) – Parameter dict for feature matching loss.
- mel_loss_params (Dict[str, Any]) – Parameter dict for mel-spectrogram loss.
- lambda_adv (float) – Loss scaling coefficient for adversarial loss.
- lambda_mel (float) – Loss scaling coefficient for mel-spectrogram loss.
- lambda_feat_match (float) – Loss scaling coefficient for feature matching loss.
- lambda_dur (float) – Loss scaling coefficient for duration loss.
- lambda_kl (float) – Loss scaling coefficient for KL divergence loss.
- cache_generator_outputs (bool) – Whether to cache generator outputs.
- plot_pred_mos (bool) – Whether to plot predicted MOS during training.
- mos_pred_tool (str) – MOS prediction tool name.
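The lambda_* coefficients above combine the individual generator losses into the single scalar loss returned by forward(). A minimal sketch of that weighting, using the default coefficients from the signature (the function name weighted_generator_loss is hypothetical; the actual implementation lives inside espnet2.gan_tts.vits.vits.VITS and may differ in detail):

```python
# Hedged sketch: how the lambda_* coefficients weight the generator loss terms.
# Defaults mirror the class signature; the term values are placeholders, and in
# practice each term is a scalar torch.Tensor rather than a float.
def weighted_generator_loss(
    adv: float,
    mel: float,
    feat_match: float,
    dur: float,
    kl: float,
    lambda_adv: float = 1.0,
    lambda_mel: float = 45.0,
    lambda_feat_match: float = 2.0,
    lambda_dur: float = 1.0,
    lambda_kl: float = 1.0,
) -> float:
    # Total generator loss = weighted sum of adversarial, mel-spectrogram,
    # feature matching, duration, and KL divergence losses.
    return (
        lambda_adv * adv
        + lambda_mel * mel
        + lambda_feat_match * feat_match
        + lambda_dur * dur
        + lambda_kl * kl
    )
```

Note how heavily the mel-spectrogram loss is weighted by default (45.0) relative to the other terms, which is typical for VITS-style training recipes.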
forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, speech: Tensor, speech_lengths: Tensor, sids: Tensor | None = None, spembs: Tensor | None = None, lids: Tensor | None = None, forward_generator: bool = True) → Dict[str, Any]
Perform generator or discriminator forward, depending on forward_generator.
- Parameters:
- text (Tensor) – Text index tensor (B, T_text).
- text_lengths (Tensor) – Text length tensor (B,).
- feats (Tensor) – Feature tensor (B, T_feats, aux_channels).
- feats_lengths (Tensor) – Feature length tensor (B,).
- speech (Tensor) – Speech waveform tensor (B, T_wav).
- speech_lengths (Tensor) – Speech length tensor (B,).
- sids (Optional[Tensor]) – Speaker index tensor (B,) or (B, 1).
- spembs (Optional[Tensor]) – Speaker embedding tensor (B, spk_embed_dim).
- lids (Optional[Tensor]) – Language index tensor (B,) or (B, 1).
- forward_generator (bool) – Whether to forward generator.
- Returns:
- loss (Tensor): Loss scalar tensor.
- stats (Dict[str, float]): Statistics to be monitored.
- weight (Tensor): Weight tensor to summarize losses.
- optim_idx (int): Optimizer index (0 for G and 1 for D).
- Return type: Dict[str, Any]
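The returned optim_idx tells the training loop which optimizer to step (0 for the generator, 1 for the discriminator). A hedged sketch of that dispatch, where _FakeVITS is a hypothetical stand-in that only mimics the documented return dict (it is not the real model) and select_optimizer is an illustrative helper:

```python
# Hypothetical stand-in mimicking only the documented return dict of VITS.forward().
class _FakeVITS:
    def forward(self, forward_generator: bool = True, **batch):
        return {
            "loss": 0.0,       # scalar loss (a torch.Tensor in practice)
            "stats": {},       # statistics to be monitored
            "weight": 1,       # weight tensor to summarize losses
            "optim_idx": 0 if forward_generator else 1,
        }

def select_optimizer(outputs: dict, optimizers: list):
    """Pick the optimizer matching the sub-model that was forwarded."""
    # optim_idx is 0 for the generator optimizer and 1 for the discriminator's.
    return optimizers[outputs["optim_idx"]]

model = _FakeVITS()
opts = ["generator_opt", "discriminator_opt"]  # placeholders for two optimizers
g_out = model.forward(forward_generator=True)
d_out = model.forward(forward_generator=False)
```

In an actual GAN training loop, the trainer alternates forward_generator=True and False, backpropagates outputs["loss"], and steps only the optimizer selected by optim_idx.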
inference(text: Tensor, feats: Tensor | None = None, sids: Tensor | None = None, spembs: Tensor | None = None, lids: Tensor | None = None, durations: Tensor | None = None, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, alpha: float = 1.0, max_len: int | None = None, use_teacher_forcing: bool = False) → Dict[str, Tensor]
Run inference.
- Parameters:
- text (Tensor) – Input text index tensor (T_text,).
- feats (Optional[Tensor]) – Feature tensor (T_feats, aux_channels).
- sids (Optional[Tensor]) – Speaker index tensor (1,).
- spembs (Optional[Tensor]) – Speaker embedding tensor (spk_embed_dim,).
- lids (Optional[Tensor]) – Language index tensor (1,).
- durations (Optional[Tensor]) – Ground-truth duration tensor (T_text,).
- noise_scale (float) – Noise scale value for flow.
- noise_scale_dur (float) – Noise scale value for duration predictor.
- alpha (float) – Alpha parameter to control the speed of generated speech.
- max_len (Optional[int]) – Maximum length.
- use_teacher_forcing (bool) – Whether to use teacher forcing.
- Returns:
- wav (Tensor): Generated waveform tensor (T_wav,).
- att_w (Tensor): Monotonic attention weight tensor (T_feats, T_text).
- duration (Tensor): Predicted duration tensor (T_text,).
- Return type: Dict[str, Tensor]
property require_raw_speech
Return whether or not raw speech is required (True, since the discriminator operates on waveforms).
property require_vocoder
Return whether or not a vocoder is required (False, since VITS generates waveforms directly).