espnet2.gan_tts.joint.joint_text2wav.JointText2Wav
espnet2.gan_tts.joint.joint_text2wav.JointText2Wav
class espnet2.gan_tts.joint.joint_text2wav.JointText2Wav(idim: int, odim: int, segment_size: int = 32, sampling_rate: int = 22050, text2mel_type: str = 'fastspeech2', text2mel_params: Dict[str, Any] = {'adim': 384, 'aheads': 2, 'conformer_activation_type': 'swish', 'conformer_dec_kernel_size': 31, 'conformer_enc_kernel_size': 7, 'conformer_pos_enc_layer_type': 'rel_pos', 'conformer_rel_pos_type': 'latest', 'conformer_self_attn_layer_type': 'rel_selfattn', 'decoder_concat_after': False, 'decoder_normalize_before': True, 'decoder_type': 'conformer', 'dlayers': 4, 'dunits': 1536, 'duration_predictor_chans': 384, 'duration_predictor_dropout_rate': 0.1, 'duration_predictor_kernel_size': 3, 'duration_predictor_layers': 2, 'elayers': 4, 'encoder_concat_after': False, 'encoder_normalize_before': True, 'encoder_type': 'conformer', 'energy_embed_dropout': 0.5, 'energy_embed_kernel_size': 1, 'energy_predictor_chans': 384, 'energy_predictor_dropout': 0.5, 'energy_predictor_kernel_size': 3, 'energy_predictor_layers': 2, 'eunits': 1536, 'gst_conv_chans_list': [32, 32, 64, 64, 128, 128], 'gst_conv_kernel_size': 3, 'gst_conv_layers': 6, 'gst_conv_stride': 2, 'gst_gru_layers': 1, 'gst_gru_units': 128, 'gst_heads': 4, 'gst_tokens': 10, 'init_dec_alpha': 1.0, 'init_enc_alpha': 1.0, 'init_type': 'xavier_uniform', 'langs': -1, 'pitch_embed_dropout': 0.5, 'pitch_embed_kernel_size': 1, 'pitch_predictor_chans': 384, 'pitch_predictor_dropout': 0.5, 'pitch_predictor_kernel_size': 5, 'pitch_predictor_layers': 5, 'positionwise_conv_kernel_size': 1, 'positionwise_layer_type': 'conv1d', 'postnet_chans': 512, 'postnet_dropout_rate': 0.5, 'postnet_filts': 5, 'postnet_layers': 5, 'reduction_factor': 1, 'spk_embed_dim': None, 'spk_embed_integration_type': 'add', 'spks': -1, 'stop_gradient_from_energy_predictor': False, 'stop_gradient_from_pitch_predictor': True, 'transformer_dec_attn_dropout_rate': 0.1, 'transformer_dec_dropout_rate': 0.1, 'transformer_dec_positional_dropout_rate': 0.1, 'transformer_enc_attn_dropout_rate': 0.1, 'transformer_enc_dropout_rate': 0.1, 'transformer_enc_positional_dropout_rate': 0.1, 'use_batch_norm': True, 'use_cnn_in_conformer': True, 'use_gst': False, 'use_macaron_style_in_conformer': True, 'use_masking': False, 'use_scaled_pos_enc': True, 'use_weighted_masking': False, 'zero_triu': False}, vocoder_type: str = 'hifigan_generator', vocoder_params: Dict[str, Any] = {'bias': True, 'channels': 512, 'global_channels': -1, 'kernel_size': 7, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'resblock_dilations': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'resblock_kernel_sizes': [3, 7, 11], 'upsample_kernel_sizes': [16, 16, 4, 4], 'upsample_scales': [8, 8, 2, 2], 'use_additional_convs': True, 'use_weight_norm': True}, use_pqmf: bool = False, pqmf_params: Dict[str, Any] = {'beta': 9.0, 'cutoff_ratio': 0.142, 'subbands': 4, 'taps': 62}, discriminator_type: str = 'hifigan_multi_scale_multi_period_discriminator', discriminator_params: Dict[str, Any] = {'follow_official_norm': False, 'period_discriminator_params': {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'periods': [2, 3, 5, 7, 11], 'scale_discriminator_params': {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'scale_downsample_pooling': 'AvgPool1d', 'scale_downsample_pooling_params': {'kernel_size': 4, 'padding': 2, 'stride': 2}, 'scales': 1}, generator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, discriminator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, use_feat_match_loss: bool = True, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, use_mel_loss: bool = True, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 22050, 'hop_length': 256, 'log_base': None, 'n_fft': 1024, 'n_mels': 80, 'win_length': None, 'window': 'hann'}, lambda_text2mel: float = 1.0, lambda_adv: float = 1.0, lambda_feat_match: float = 2.0, lambda_mel: float = 45.0, cache_generator_outputs: bool = False)
Bases: AbsGANTTS
General class to jointly train text2mel and vocoder parts.
Initialize JointText2Wav module.
- Parameters:
- idim (int) – Input vocabrary size.
- odim (int) – Acoustic feature dimension. The actual output channels will be 1 since the model is the end-to-end text-to-wave model but for the compatibility odim is used to indicate the acoustic feature dimension.
- segment_size (int) – Segment size for random windowed inputs.
- sampling_rate (int) – Sampling rate, not used for the training but it will be referred in saving waveform during the inference.
- text2mel_type (str) – The text2mel model type.
- text2mel_params (Dict *[*str , Any ]) – Parameter dict for text2mel model.
- use_pqmf (bool) – Whether to use PQMF for multi-band vocoder.
- pqmf_params (Dict *[*str , Any ]) – Parameter dict for PQMF module.
- vocoder_type (str) – The vocoder model type.
- vocoder_params (Dict *[*str , Any ]) – Parameter dict for vocoder model.
- discriminator_type (str) – Discriminator type.
- discriminator_params (Dict *[*str , Any ]) – Parameter dict for discriminator.
- generator_adv_loss_params (Dict *[*str , Any ]) – Parameter dict for generator adversarial loss.
- discriminator_adv_loss_params (Dict *[*str , Any ]) – Parameter dict for discriminator adversarial loss.
- use_feat_match_loss (bool) – Whether to use feat match loss.
- feat_match_loss_params (Dict *[*str , Any ]) – Parameter dict for feat match loss.
- use_mel_loss (bool) – Whether to use mel loss.
- mel_loss_params (Dict *[*str , Any ]) – Parameter dict for mel loss.
- lambda_text2mel (float) – Loss scaling coefficient for text2mel model loss.
- lambda_adv (float) – Loss scaling coefficient for adversarial loss.
- lambda_feat_match (float) – Loss scaling coefficient for feat match loss.
- lambda_mel (float) – Loss scaling coefficient for mel loss.
- cache_generator_outputs (bool) – Whether to cache generator outputs.
forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, speech: Tensor, speech_lengths: Tensor, forward_generator: bool = True, **kwargs) → Dict[str, Any]
Perform generator forward.
- Parameters:
- text (Tensor) – Text index tensor (B, T_text).
- text_lengths (Tensor) – Text length tensor (B,).
- feats (Tensor) – Feature tensor (B, T_feats, aux_channels).
- feats_lengths (Tensor) – Feature length tensor (B,).
- speech (Tensor) – Speech waveform tensor (B, T_wav).
- speech_lengths (Tensor) – Speech length tensor (B,).
- forward_generator (bool) – Whether to forward generator.
- Returns:
- loss (Tensor): Loss scalar tensor.
- stats (Dict[str, float]): Statistics to be monitored.
- weight (Tensor): Weight tensor to summarize losses.
- optim_idx (int): Optimizer index (0 for G and 1 for D).
- Return type: Dict[str, Any]
inference(text: Tensor, **kwargs) → Dict[str, Tensor]
Run inference.
- Parameters:text (Tensor) – Input text index tensor (T_text,).
- Returns:
- wav (Tensor): Generated waveform tensor (T_wav,).
- feat_gan (Tensor): Generated feature tensor (T_text, C).
- Return type: Dict[str, Tensor]
property require_raw_speech
Return whether or not speech is required.
property require_vocoder
Return whether or not vocoder is required.