espnet2.gan_tts package

espnet2.gan_tts.espnet_model

GAN-based text-to-speech ESPnet model.

class espnet2.gan_tts.espnet_model.ESPnetGANTTSModel(feats_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], pitch_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], pitch_normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], energy_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], energy_normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], tts: espnet2.gan_tts.abs_gan_tts.AbsGANTTS)[source]

Bases: espnet2.train.abs_gan_espnet_model.AbsGANESPnetModel

ESPnet model for GAN-based text-to-speech task.

Initialize ESPnetGANTTSModel module.

collect_feats(text: torch.Tensor, text_lengths: torch.Tensor, speech: torch.Tensor, speech_lengths: torch.Tensor, durations: Optional[torch.Tensor] = None, durations_lengths: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, energy_lengths: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, **kwargs) → Dict[str, torch.Tensor][source]

Calculate features and return them as a dict.

Parameters:
  • text (Tensor) – Text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • speech (Tensor) – Speech waveform tensor (B, T_wav).

  • speech_lengths (Tensor) – Speech length tensor (B,).

  • durations (Optional[Tensor]) – Duration tensor.

  • durations_lengths (Optional[Tensor]) – Duration length tensor (B,).

  • pitch (Optional[Tensor]) – Pitch tensor.

  • pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).

  • energy (Optional[Tensor]) – Energy tensor.

  • energy_lengths (Optional[Tensor]) – Energy length tensor (B,).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).

  • sids (Optional[Tensor]) – Speaker index tensor (B, 1).

  • lids (Optional[Tensor]) – Language ID tensor (B, 1).

Returns:

Dict of features.

Return type:

Dict[str, Tensor]

forward(text: torch.Tensor, text_lengths: torch.Tensor, speech: torch.Tensor, speech_lengths: torch.Tensor, durations: Optional[torch.Tensor] = None, durations_lengths: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, energy_lengths: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, forward_generator: bool = True, **kwargs) → Dict[str, Any][source]

Return generator or discriminator loss in dict format.

Parameters:
  • text (Tensor) – Text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • speech (Tensor) – Speech waveform tensor (B, T_wav).

  • speech_lengths (Tensor) – Speech length tensor (B,).

  • durations (Optional[Tensor]) – Duration tensor.

  • durations_lengths (Optional[Tensor]) – Duration length tensor (B,).

  • pitch (Optional[Tensor]) – Pitch tensor.

  • pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).

  • energy (Optional[Tensor]) – Energy tensor.

  • energy_lengths (Optional[Tensor]) – Energy length tensor (B,).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).

  • sids (Optional[Tensor]) – Speaker ID tensor (B, 1).

  • lids (Optional[Tensor]) – Language ID tensor (B, 1).

  • forward_generator (bool) – Whether to forward generator.

  • kwargs – “utt_id” is among the inputs.

Returns:

  • loss (Tensor): Loss scalar tensor.

  • stats (Dict[str, float]): Statistics to be monitored.

  • weight (Tensor): Weight tensor to summarize losses.

  • optim_idx (int): Optimizer index (0 for G and 1 for D).

Return type:

Dict[str, Any]
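
For orientation, a minimal training-step sketch using this interface (the model, batch, and optimizer objects here are placeholders, not ESPnet API; ESPnet's trainer performs this dispatch internally):

    import torch

    def gan_train_step(model, batch, optimizers):
        # One generator update followed by one discriminator update.
        for forward_generator in (True, False):
            out = model(**batch, forward_generator=forward_generator)
            loss = out["loss"]                        # scalar loss tensor
            optimizer = optimizers[out["optim_idx"]]  # 0 for G, 1 for D
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()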

espnet2.gan_tts.__init__

espnet2.gan_tts.abs_gan_tts

GAN-based TTS abstract class.

class espnet2.gan_tts.abs_gan_tts.AbsGANTTS(*args, **kwargs)[source]

Bases: espnet2.tts.abs_tts.AbsTTS, abc.ABC

GAN-based TTS model abstract class.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(forward_generator, *args, **kwargs) → Dict[str, Union[torch.Tensor, Dict[str, torch.Tensor], int]][source]

Return generator or discriminator loss.
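
A minimal subclass sketch illustrating the expected return format (a hypothetical toy model, not part of ESPnet):

    import torch
    from espnet2.gan_tts.abs_gan_tts import AbsGANTTS

    class ToyGANTTS(AbsGANTTS):
        """Toy model returning a dummy loss in the required dict format."""

        def forward(self, forward_generator: bool, *args, **kwargs):
            loss = torch.tensor(0.0, requires_grad=True)
            return {
                "loss": loss,
                "stats": {"loss": loss.item()},
                "weight": torch.tensor(1.0),
                "optim_idx": 0 if forward_generator else 1,
            }

        def inference(self, text: torch.Tensor, **kwargs):
            return {"wav": torch.zeros(1)}  # placeholder waveform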

espnet2.gan_tts.parallel_wavegan.upsample

Upsampling module.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.parallel_wavegan.upsample.Conv2d(*args, **kwargs)[source]

Bases: torch.nn.modules.conv.Conv2d

Conv2d module with customized initialization.

Initialize Conv2d module.

reset_parameters()[source]

Reset parameters.

class espnet2.gan_tts.parallel_wavegan.upsample.ConvInUpsampleNetwork(upsample_scales: List[int], nonlinear_activation: Optional[str] = None, nonlinear_activation_params: Dict[str, Any] = {}, interpolate_mode: str = 'nearest', freq_axis_kernel_size: int = 1, aux_channels: int = 80, aux_context_window: int = 0)[source]

Bases: torch.nn.modules.module.Module

Convolution + upsampling network module.

Initialize ConvInUpsampleNetwork module.

Parameters:
  • upsample_scales (list) – List of upsampling scales.

  • nonlinear_activation (Optional[str]) – Activation function name.

  • nonlinear_activation_params (Dict[str, Any]) – Arguments for the specified activation function.

  • interpolate_mode (str) – Interpolation mode.

  • freq_axis_kernel_size (int) – Kernel size in the direction of frequency axis.

  • aux_channels (int) – Number of channels of pre-conv layer.

  • aux_context_window (int) – Context window size of the pre-conv layer.

forward(c: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters:

c (Tensor) – Input tensor (B, C, T_feats).

Returns:

Upsampled tensor (B, C, T_wav), where T_wav = T_feats * prod(upsample_scales).

Return type:

Tensor
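
A small shape-check sketch (assumes espnet2 and torch are installed; aux_context_window is left at its default of 0 so no context frames are trimmed):

    import torch
    from espnet2.gan_tts.parallel_wavegan.upsample import ConvInUpsampleNetwork

    net = ConvInUpsampleNetwork(upsample_scales=[4, 4, 4, 4], aux_channels=80)
    c = torch.randn(2, 80, 10)                         # (B, C, T_feats)
    out = net(c)                                       # (B, C, T_wav)
    assert out.shape == (2, 80, 10 * 256)              # 256 = 4 * 4 * 4 * 4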

class espnet2.gan_tts.parallel_wavegan.upsample.Stretch2d(x_scale: int, y_scale: int, mode: str = 'nearest')[source]

Bases: torch.nn.modules.module.Module

Stretch2d module.

Initialize Stretch2d module.

Parameters:
  • x_scale (int) – X scaling factor (Time axis in spectrogram).

  • y_scale (int) – Y scaling factor (Frequency axis in spectrogram).

  • mode (str) – Interpolation mode.

forward(x: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters:

x (Tensor) – Input tensor (B, C, F, T).

Returns:

Interpolated tensor (B, C, F * y_scale, T * x_scale).

Return type:

Tensor
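
For example, a quick shape check:

    import torch
    from espnet2.gan_tts.parallel_wavegan.upsample import Stretch2d

    stretch = Stretch2d(x_scale=4, y_scale=2)
    x = torch.randn(1, 1, 80, 10)                  # (B, C, F, T)
    y = stretch(x)                                 # nearest-neighbor interpolation
    assert y.shape == (1, 1, 80 * 2, 10 * 4)       # (B, C, F * y_scale, T * x_scale)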

class espnet2.gan_tts.parallel_wavegan.upsample.UpsampleNetwork(upsample_scales: List[int], nonlinear_activation: Optional[str] = None, nonlinear_activation_params: Dict[str, Any] = {}, interpolate_mode: str = 'nearest', freq_axis_kernel_size: int = 1)[source]

Bases: torch.nn.modules.module.Module

Upsampling network module.

Initialize UpsampleNetwork module.

Parameters:
  • upsample_scales (List[int]) – List of upsampling scales.

  • nonlinear_activation (Optional[str]) – Activation function name.

  • nonlinear_activation_params (Dict[str, Any]) – Arguments for the specified activation function.

  • interpolate_mode (str) – Interpolation mode.

  • freq_axis_kernel_size (int) – Kernel size in the direction of frequency axis.

forward(c: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters:

c (Tensor) – Input tensor (B, C, T_feats).

Returns:

Upsampled tensor (B, C, T_wav).

Return type:

Tensor

espnet2.gan_tts.parallel_wavegan.__init__

class espnet2.gan_tts.parallel_wavegan.__init__.ParallelWaveGANDiscriminator(in_channels: int = 1, out_channels: int = 1, kernel_size: int = 3, layers: int = 10, conv_channels: int = 64, dilation_factor: int = 1, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.2}, bias: bool = True, use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

Parallel WaveGAN Discriminator module.

Initialize ParallelWaveGANDiscriminator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • kernel_size (int) – Kernel size of conv layers.

  • layers (int) – Number of conv layers.

  • conv_channels (int) – Number of channels in conv layers.

  • dilation_factor (int) – Dilation factor. For example, if dilation_factor = 2, the dilation will be 2, 4, 8, …, and so on.

  • nonlinear_activation (str) – Nonlinear function after each conv.

  • nonlinear_activation_params (Dict[str, Any]) – Nonlinear function parameters.

  • bias (bool) – Whether to use bias parameter in conv.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

apply_weight_norm()[source]

Apply weight normalization module to all of the layers.

forward(x: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters:

x (Tensor) – Input audio signal (B, 1, T).

Returns:

Output tensor (B, 1, T).

Return type:

Tensor

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

class espnet2.gan_tts.parallel_wavegan.__init__.ParallelWaveGANGenerator(in_channels: int = 1, out_channels: int = 1, kernel_size: int = 3, layers: int = 30, stacks: int = 3, residual_channels: int = 64, gate_channels: int = 128, skip_channels: int = 64, aux_channels: int = 80, aux_context_window: int = 2, dropout_rate: float = 0.0, bias: bool = True, use_weight_norm: bool = True, upsample_conditional_features: bool = True, upsample_net: str = 'ConvInUpsampleNetwork', upsample_params: Dict[str, Any] = {'upsample_scales': [4, 4, 4, 4]})[source]

Bases: torch.nn.modules.module.Module

Parallel WaveGAN Generator module.

Initialize ParallelWaveGANGenerator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • kernel_size (int) – Kernel size of dilated convolution.

  • layers (int) – Number of residual block layers.

  • stacks (int) – Number of stacks, i.e., dilation cycles.

  • residual_channels (int) – Number of channels in residual conv.

  • gate_channels (int) – Number of channels in gated conv.

  • skip_channels (int) – Number of channels in skip conv.

  • aux_channels (int) – Number of channels for auxiliary feature conv.

  • aux_context_window (int) – Context window size for auxiliary feature.

  • dropout_rate (float) – Dropout rate. 0.0 means no dropout applied.

  • bias (bool) – Whether to use bias parameter in conv layer.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

  • upsample_conditional_features (bool) – Whether to use upsampling network.

  • upsample_net (str) – Upsampling network architecture.

  • upsample_params (Dict[str, Any]) – Upsampling network parameters.

apply_weight_norm()[source]

Apply weight normalization module to all of the layers.

forward(c: torch.Tensor, z: Optional[torch.Tensor] = None) → torch.Tensor[source]

Calculate forward propagation.

Parameters:
  • c (Tensor) – Local conditioning auxiliary features (B, C, T_feats).

  • z (Tensor) – Input noise signal (B, 1, T_wav).

Returns:

Output tensor (B, out_channels, T_wav).

Return type:

Tensor

inference(c: torch.Tensor, z: Optional[torch.Tensor] = None) → torch.Tensor[source]

Perform inference.

Parameters:
  • c (Tensor) – Local conditioning auxiliary features (T_feats, C).

  • z (Optional[Tensor]) – Input noise signal (T_wav, 1).

Returns:

Output tensor (T_wav, out_channels).

Return type:

Tensor

property receptive_field_size

Return receptive field size.

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.
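
A minimal vocoding sketch (assumes espnet2 is installed; aux_context_window is set to 0 here so that T_wav is exactly T_feats times the product of the upsampling scales, and the noise z is passed explicitly):

    import torch
    from espnet2.gan_tts.parallel_wavegan import ParallelWaveGANGenerator

    generator = ParallelWaveGANGenerator(aux_context_window=0)
    c = torch.randn(1, 80, 20)                 # (B, aux_channels, T_feats)
    z = torch.randn(1, 1, 20 * 256)            # (B, 1, T_wav), 256 = prod([4, 4, 4, 4])
    wav = generator(c, z)                      # (B, out_channels, T_wav)
    assert wav.shape == (1, 1, 20 * 256)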

espnet2.gan_tts.parallel_wavegan.parallel_wavegan

Parallel WaveGAN Modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.parallel_wavegan.parallel_wavegan.ParallelWaveGANDiscriminator(in_channels: int = 1, out_channels: int = 1, kernel_size: int = 3, layers: int = 10, conv_channels: int = 64, dilation_factor: int = 1, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.2}, bias: bool = True, use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

Parallel WaveGAN Discriminator module.

Initialize ParallelWaveGANDiscriminator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • kernel_size (int) – Kernel size of conv layers.

  • layers (int) – Number of conv layers.

  • conv_channels (int) – Number of channels in conv layers.

  • dilation_factor (int) – Dilation factor. For example, if dilation_factor = 2, the dilation will be 2, 4, 8, …, and so on.

  • nonlinear_activation (str) – Nonlinear function after each conv.

  • nonlinear_activation_params (Dict[str, Any]) – Nonlinear function parameters.

  • bias (bool) – Whether to use bias parameter in conv.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

apply_weight_norm()[source]

Apply weight normalization module to all of the layers.

forward(x: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters:

x (Tensor) – Input audio signal (B, 1, T).

Returns:

Output tensor (B, 1, T).

Return type:

Tensor

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

class espnet2.gan_tts.parallel_wavegan.parallel_wavegan.ParallelWaveGANGenerator(in_channels: int = 1, out_channels: int = 1, kernel_size: int = 3, layers: int = 30, stacks: int = 3, residual_channels: int = 64, gate_channels: int = 128, skip_channels: int = 64, aux_channels: int = 80, aux_context_window: int = 2, dropout_rate: float = 0.0, bias: bool = True, use_weight_norm: bool = True, upsample_conditional_features: bool = True, upsample_net: str = 'ConvInUpsampleNetwork', upsample_params: Dict[str, Any] = {'upsample_scales': [4, 4, 4, 4]})[source]

Bases: torch.nn.modules.module.Module

Parallel WaveGAN Generator module.

Initialize ParallelWaveGANGenerator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • kernel_size (int) – Kernel size of dilated convolution.

  • layers (int) – Number of residual block layers.

  • stacks (int) – Number of stacks, i.e., dilation cycles.

  • residual_channels (int) – Number of channels in residual conv.

  • gate_channels (int) – Number of channels in gated conv.

  • skip_channels (int) – Number of channels in skip conv.

  • aux_channels (int) – Number of channels for auxiliary feature conv.

  • aux_context_window (int) – Context window size for auxiliary feature.

  • dropout_rate (float) – Dropout rate. 0.0 means no dropout applied.

  • bias (bool) – Whether to use bias parameter in conv layer.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

  • upsample_conditional_features (bool) – Whether to use upsampling network.

  • upsample_net (str) – Upsampling network architecture.

  • upsample_params (Dict[str, Any]) – Upsampling network parameters.

apply_weight_norm()[source]

Apply weight normalization module to all of the layers.

forward(c: torch.Tensor, z: Optional[torch.Tensor] = None) → torch.Tensor[source]

Calculate forward propagation.

Parameters:
  • c (Tensor) – Local conditioning auxiliary features (B, C, T_feats).

  • z (Tensor) – Input noise signal (B, 1, T_wav).

Returns:

Output tensor (B, out_channels, T_wav).

Return type:

Tensor

inference(c: torch.Tensor, z: Optional[torch.Tensor] = None) → torch.Tensor[source]

Perform inference.

Parameters:
  • c (Tensor) – Local conditioning auxiliary features (T_feats, C).

  • z (Optional[Tensor]) – Input noise signal (T_wav, 1).

Returns:

Output tensor (T_wav, out_channels).

Return type:

Tensor

property receptive_field_size

Return receptive field size.

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

espnet2.gan_tts.joint.__init__

espnet2.gan_tts.joint.joint_text2wav

Joint text-to-wav module for end-to-end training.

class espnet2.gan_tts.joint.joint_text2wav.JointText2Wav(idim: int, odim: int, segment_size: int = 32, sampling_rate: int = 22050, text2mel_type: str = 'fastspeech2', text2mel_params: Dict[str, Any] = {'adim': 384, 'aheads': 2, 'conformer_activation_type': 'swish', 'conformer_dec_kernel_size': 31, 'conformer_enc_kernel_size': 7, 'conformer_pos_enc_layer_type': 'rel_pos', 'conformer_rel_pos_type': 'latest', 'conformer_self_attn_layer_type': 'rel_selfattn', 'decoder_concat_after': False, 'decoder_normalize_before': True, 'decoder_type': 'conformer', 'dlayers': 4, 'dunits': 1536, 'duration_predictor_chans': 384, 'duration_predictor_dropout_rate': 0.1, 'duration_predictor_kernel_size': 3, 'duration_predictor_layers': 2, 'elayers': 4, 'encoder_concat_after': False, 'encoder_normalize_before': True, 'encoder_type': 'conformer', 'energy_embed_dropout': 0.5, 'energy_embed_kernel_size': 1, 'energy_predictor_chans': 384, 'energy_predictor_dropout': 0.5, 'energy_predictor_kernel_size': 3, 'energy_predictor_layers': 2, 'eunits': 1536, 'gst_conv_chans_list': [32, 32, 64, 64, 128, 128], 'gst_conv_kernel_size': 3, 'gst_conv_layers': 6, 'gst_conv_stride': 2, 'gst_gru_layers': 1, 'gst_gru_units': 128, 'gst_heads': 4, 'gst_tokens': 10, 'init_dec_alpha': 1.0, 'init_enc_alpha': 1.0, 'init_type': 'xavier_uniform', 'langs': -1, 'pitch_embed_dropout': 0.5, 'pitch_embed_kernel_size': 1, 'pitch_predictor_chans': 384, 'pitch_predictor_dropout': 0.5, 'pitch_predictor_kernel_size': 5, 'pitch_predictor_layers': 5, 'positionwise_conv_kernel_size': 1, 'positionwise_layer_type': 'conv1d', 'postnet_chans': 512, 'postnet_dropout_rate': 0.5, 'postnet_filts': 5, 'postnet_layers': 5, 'reduction_factor': 1, 'spk_embed_dim': None, 'spk_embed_integration_type': 'add', 'spks': -1, 'stop_gradient_from_energy_predictor': False, 'stop_gradient_from_pitch_predictor': True, 'transformer_dec_attn_dropout_rate': 0.1, 'transformer_dec_dropout_rate': 0.1, 'transformer_dec_positional_dropout_rate': 0.1, 'transformer_enc_attn_dropout_rate': 0.1, 'transformer_enc_dropout_rate': 0.1, 'transformer_enc_positional_dropout_rate': 0.1, 'use_batch_norm': True, 'use_cnn_in_conformer': True, 'use_gst': False, 'use_macaron_style_in_conformer': True, 'use_masking': False, 'use_scaled_pos_enc': True, 'use_weighted_masking': False, 'zero_triu': False}, vocoder_type: str = 'hifigan_generator', vocoder_params: Dict[str, Any] = {'bias': True, 'channels': 512, 'global_channels': -1, 'kernel_size': 7, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'resblock_dilations': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'resblock_kernel_sizes': [3, 7, 11], 'upsample_kernel_sizes': [16, 16, 4, 4], 'upsample_scales': [8, 8, 2, 2], 'use_additional_convs': True, 'use_weight_norm': True}, use_pqmf: bool = False, pqmf_params: Dict[str, Any] = {'beta': 9.0, 'cutoff_ratio': 0.142, 'subbands': 4, 'taps': 62}, discriminator_type: str = 'hifigan_multi_scale_multi_period_discriminator', discriminator_params: Dict[str, Any] = {'follow_official_norm': False, 'period_discriminator_params': {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'periods': [2, 3, 5, 7, 11], 'scale_discriminator_params': {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 
'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'scale_downsample_pooling': 'AvgPool1d', 'scale_downsample_pooling_params': {'kernel_size': 4, 'padding': 2, 'stride': 2}, 'scales': 1}, generator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, discriminator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, use_feat_match_loss: bool = True, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, use_mel_loss: bool = True, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 22050, 'hop_length': 256, 'log_base': None, 'n_fft': 1024, 'n_mels': 80, 'win_length': None, 'window': 'hann'}, lambda_text2mel: float = 1.0, lambda_adv: float = 1.0, lambda_feat_match: float = 2.0, lambda_mel: float = 45.0, cache_generator_outputs: bool = False)[source]

Bases: espnet2.gan_tts.abs_gan_tts.AbsGANTTS

General class to jointly train text2mel and vocoder parts.

Initialize JointText2Wav module.

Parameters:
  • idim (int) – Input vocabulary size.

  • odim (int) – Acoustic feature dimension. The actual output channels will be 1, since the model is an end-to-end text-to-wave model, but odim is used to indicate the acoustic feature dimension for compatibility.

  • segment_size (int) – Segment size for random windowed inputs.

  • sampling_rate (int) – Sampling rate. Not used for training, but referenced when saving the waveform during inference.

  • text2mel_type (str) – The text2mel model type.

  • text2mel_params (Dict[str, Any]) – Parameter dict for text2mel model.

  • use_pqmf (bool) – Whether to use PQMF for multi-band vocoder.

  • pqmf_params (Dict[str, Any]) – Parameter dict for PQMF module.

  • vocoder_type (str) – The vocoder model type.

  • vocoder_params (Dict[str, Any]) – Parameter dict for vocoder model.

  • discriminator_type (str) – Discriminator type.

  • discriminator_params (Dict[str, Any]) – Parameter dict for discriminator.

  • generator_adv_loss_params (Dict[str, Any]) – Parameter dict for generator adversarial loss.

  • discriminator_adv_loss_params (Dict[str, Any]) – Parameter dict for discriminator adversarial loss.

  • use_feat_match_loss (bool) – Whether to use feat match loss.

  • feat_match_loss_params (Dict[str, Any]) – Parameter dict for feat match loss.

  • use_mel_loss (bool) – Whether to use mel loss.

  • mel_loss_params (Dict[str, Any]) – Parameter dict for mel loss.

  • lambda_text2mel (float) – Loss scaling coefficient for text2mel model loss.

  • lambda_adv (float) – Loss scaling coefficient for adversarial loss.

  • lambda_feat_match (float) – Loss scaling coefficient for feat match loss.

  • lambda_mel (float) – Loss scaling coefficient for mel loss.

  • cache_generator_outputs (bool) – Whether to cache generator outputs.

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, speech: torch.Tensor, speech_lengths: torch.Tensor, forward_generator: bool = True, **kwargs) → Dict[str, Any][source]

Perform generator forward.

Parameters:
  • text (Tensor) – Text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • feats (Tensor) – Feature tensor (B, T_feats, aux_channels).

  • feats_lengths (Tensor) – Feature length tensor (B,).

  • speech (Tensor) – Speech waveform tensor (B, T_wav).

  • speech_lengths (Tensor) – Speech length tensor (B,).

  • forward_generator (bool) – Whether to forward generator.

Returns:

  • loss (Tensor): Loss scalar tensor.

  • stats (Dict[str, float]): Statistics to be monitored.

  • weight (Tensor): Weight tensor to summarize losses.

  • optim_idx (int): Optimizer index (0 for G and 1 for D).

Return type:

Dict[str, Any]

inference(text: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]

Run inference.

Parameters:

text (Tensor) – Input text index tensor (T_text,).

Returns:

  • wav (Tensor): Generated waveform tensor (T_wav,).

  • feat_gen (Tensor): Generated feature tensor (T_text, C).

Return type:

Dict[str, Tensor]

property require_raw_speech

Return whether or not speech is required.

property require_vocoder

Return whether or not vocoder is required.
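
A hedged end-to-end inference sketch (idim=40 and odim=80 are placeholder sizes; the default configuration builds a FastSpeech2 text2mel plus a HiFiGAN vocoder, and with randomly initialized weights the output is noise, so a trained model is assumed in practice):

    import torch
    from espnet2.gan_tts.joint.joint_text2wav import JointText2Wav

    model = JointText2Wav(idim=40, odim=80).eval()
    text = torch.randint(0, 40, (12,))         # (T_text,)
    with torch.no_grad():
        output = model.inference(text)
    wav = output["wav"]                        # (T_wav,)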

espnet2.gan_tts.vits.transform

Flow-related transformation.

This code is derived from https://github.com/bayesiains/nflows.

espnet2.gan_tts.vits.transform.piecewise_rational_quadratic_transform(inputs, unnormalized_widths, unnormalized_heights, unnormalized_derivatives, inverse=False, tails=None, tail_bound=1.0, min_bin_width=0.001, min_bin_height=0.001, min_derivative=0.001)[source]
espnet2.gan_tts.vits.transform.rational_quadratic_spline(inputs, unnormalized_widths, unnormalized_heights, unnormalized_derivatives, inverse=False, left=0.0, right=1.0, bottom=0.0, top=1.0, min_bin_width=0.001, min_bin_height=0.001, min_derivative=0.001)[source]
espnet2.gan_tts.vits.transform.unconstrained_rational_quadratic_spline(inputs, unnormalized_widths, unnormalized_heights, unnormalized_derivatives, inverse=False, tails='linear', tail_bound=1.0, min_bin_width=0.001, min_bin_height=0.001, min_derivative=0.001)[source]
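
For the plain (tail-free) spline, inputs must lie in [0, 1] and the unnormalized derivative tensor carries num_bins + 1 knot entries; a small round-trip sketch (shapes follow the nflows convention and are treated here as an assumption):

    import torch
    from espnet2.gan_tts.vits.transform import piecewise_rational_quadratic_transform

    num_bins = 10
    inputs = torch.rand(4)                         # values in [0, 1)
    widths = torch.randn(4, num_bins)              # unnormalized bin widths
    heights = torch.randn(4, num_bins)             # unnormalized bin heights
    derivs = torch.randn(4, num_bins + 1)          # unnormalized knot derivatives
    outputs, logabsdet = piecewise_rational_quadratic_transform(
        inputs, widths, heights, derivs
    )
    recon, _ = piecewise_rational_quadratic_transform(
        outputs, widths, heights, derivs, inverse=True
    )
    assert torch.allclose(inputs, recon, atol=1e-4)   # the spline is invertible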

espnet2.gan_tts.vits.generator

Generator module in VITS.

This code is based on https://github.com/jaywalnut310/vits.

class espnet2.gan_tts.vits.generator.VITSGenerator(vocabs: int, aux_channels: int = 513, hidden_channels: int = 192, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, global_channels: int = -1, segment_size: int = 32, text_encoder_attention_heads: int = 2, text_encoder_ffn_expand: int = 4, text_encoder_blocks: int = 6, text_encoder_positionwise_layer_type: str = 'conv1d', text_encoder_positionwise_conv_kernel_size: int = 1, text_encoder_positional_encoding_layer_type: str = 'rel_pos', text_encoder_self_attention_layer_type: str = 'rel_selfattn', text_encoder_activation_type: str = 'swish', text_encoder_normalize_before: bool = True, text_encoder_dropout_rate: float = 0.1, text_encoder_positional_dropout_rate: float = 0.0, text_encoder_attention_dropout_rate: float = 0.0, text_encoder_conformer_kernel_size: int = 7, use_macaron_style_in_text_encoder: bool = True, use_conformer_conv_in_text_encoder: bool = True, decoder_kernel_size: int = 7, decoder_channels: int = 512, decoder_upsample_scales: List[int] = [8, 8, 2, 2], decoder_upsample_kernel_sizes: List[int] = [16, 16, 4, 4], decoder_resblock_kernel_sizes: List[int] = [3, 7, 11], decoder_resblock_dilations: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], use_weight_norm_in_decoder: bool = True, posterior_encoder_kernel_size: int = 5, posterior_encoder_layers: int = 16, posterior_encoder_stacks: int = 1, posterior_encoder_base_dilation: int = 1, posterior_encoder_dropout_rate: float = 0.0, use_weight_norm_in_posterior_encoder: bool = True, flow_flows: int = 4, flow_kernel_size: int = 5, flow_base_dilation: int = 1, flow_layers: int = 4, flow_dropout_rate: float = 0.0, use_weight_norm_in_flow: bool = True, use_only_mean_in_flow: bool = True, stochastic_duration_predictor_kernel_size: int = 3, stochastic_duration_predictor_dropout_rate: float = 0.5, stochastic_duration_predictor_flows: int = 4, stochastic_duration_predictor_dds_conv_layers: int = 3)[source]

Bases: torch.nn.modules.module.Module

Generator module in VITS.

This is a module of VITS described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

As the text encoder, we use a conformer architecture, which contains additional convolution layers, instead of the relative positional Transformer.

Initialize VITS generator module.

Parameters:
  • vocabs (int) – Input vocabulary size.

  • aux_channels (int) – Number of acoustic feature channels.

  • hidden_channels (int) – Number of hidden channels.

  • spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.

  • langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.

  • spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.

  • global_channels (int) – Number of global conditioning channels.

  • segment_size (int) – Segment size for decoder.

  • text_encoder_attention_heads (int) – Number of heads in conformer block of text encoder.

  • text_encoder_ffn_expand (int) – Expansion ratio of FFN in conformer block of text encoder.

  • text_encoder_blocks (int) – Number of conformer blocks in text encoder.

  • text_encoder_positionwise_layer_type (str) – Position-wise layer type in conformer block of text encoder.

  • text_encoder_positionwise_conv_kernel_size (int) – Position-wise convolution kernel size in conformer block of text encoder. Only used when the above layer type is conv1d or conv1d-linear.

  • text_encoder_positional_encoding_layer_type (str) – Positional encoding layer type in conformer block of text encoder.

  • text_encoder_self_attention_layer_type (str) – Self-attention layer type in conformer block of text encoder.

  • text_encoder_activation_type (str) – Activation function type in conformer block of text encoder.

  • text_encoder_normalize_before (bool) – Whether to apply layer norm before self-attention in conformer block of text encoder.

  • text_encoder_dropout_rate (float) – Dropout rate in conformer block of text encoder.

  • text_encoder_positional_dropout_rate (float) – Dropout rate for positional encoding in conformer block of text encoder.

  • text_encoder_attention_dropout_rate (float) – Dropout rate for attention in conformer block of text encoder.

  • text_encoder_conformer_kernel_size (int) – Conformer conv kernel size. Only used when use_conformer_conv_in_text_encoder = True.

  • use_macaron_style_in_text_encoder (bool) – Whether to use macaron style FFN in conformer block of text encoder.

  • use_conformer_conv_in_text_encoder (bool) – Whether to use convolution in conformer block of text encoder.

  • decoder_kernel_size (int) – Decoder kernel size.

  • decoder_channels (int) – Number of decoder initial channels.

  • decoder_upsample_scales (List[int]) – List of upsampling scales in decoder.

  • decoder_upsample_kernel_sizes (List[int]) – List of kernel size for upsampling layers in decoder.

  • decoder_resblock_kernel_sizes (List[int]) – List of kernel size for resblocks in decoder.

  • decoder_resblock_dilations (List[List[int]]) – List of list of dilations for resblocks in decoder.

  • use_weight_norm_in_decoder (bool) – Whether to apply weight normalization in decoder.

  • posterior_encoder_kernel_size (int) – Posterior encoder kernel size.

  • posterior_encoder_layers (int) – Number of layers of posterior encoder.

  • posterior_encoder_stacks (int) – Number of stacks of posterior encoder.

  • posterior_encoder_base_dilation (int) – Base dilation of posterior encoder.

  • posterior_encoder_dropout_rate (float) – Dropout rate for posterior encoder.

  • use_weight_norm_in_posterior_encoder (bool) – Whether to apply weight normalization in posterior encoder.

  • flow_flows (int) – Number of flows in flow.

  • flow_kernel_size (int) – Kernel size in flow.

  • flow_base_dilation (int) – Base dilation in flow.

  • flow_layers (int) – Number of layers in flow.

  • flow_dropout_rate (float) – Dropout rate in flow.

  • use_weight_norm_in_flow (bool) – Whether to apply weight normalization in flow.

  • use_only_mean_in_flow (bool) – Whether to use only mean in flow.

  • stochastic_duration_predictor_kernel_size (int) – Kernel size in stochastic duration predictor.

  • stochastic_duration_predictor_dropout_rate (float) – Dropout rate in stochastic duration predictor.

  • stochastic_duration_predictor_flows (int) – Number of flows in stochastic duration predictor.

  • stochastic_duration_predictor_dds_conv_layers (int) – Number of DDS conv layers in stochastic duration predictor.

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, sids: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]][source]

Calculate forward propagation.

Parameters:
  • text (Tensor) – Text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • feats (Tensor) – Feature tensor (B, aux_channels, T_feats).

  • feats_lengths (Tensor) – Feature length tensor (B,).

  • sids (Optional[Tensor]) – Speaker index tensor (B,) or (B, 1).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, spk_embed_dim).

  • lids (Optional[Tensor]) – Language index tensor (B,) or (B, 1).

Returns:

  • Tensor: Waveform tensor (B, 1, segment_size * upsample_factor).

  • Tensor: Duration negative log-likelihood (NLL) tensor (B,).

  • Tensor: Monotonic attention weight tensor (B, 1, T_feats, T_text).

  • Tensor: Segments start index tensor (B,).

  • Tensor: Text mask tensor (B, 1, T_text).

  • Tensor: Feature mask tensor (B, 1, T_feats).

  • Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]:

    • Tensor: Posterior encoder hidden representation (B, H, T_feats).

    • Tensor: Flow hidden representation (B, H, T_feats).

    • Tensor: Expanded text encoder projected mean (B, H, T_feats).

    • Tensor: Expanded text encoder projected scale (B, H, T_feats).

    • Tensor: Posterior encoder projected mean (B, H, T_feats).

    • Tensor: Posterior encoder projected scale (B, H, T_feats).

Return type:

Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tuple[Tensor, ...]]
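
A forward-pass sketch showing how these outputs unpack (toy sizes; assumes espnet2 with its monotonic alignment search dependency, e.g. numba, is installed; feature lengths must be at least segment_size):

    import torch
    from espnet2.gan_tts.vits.generator import VITSGenerator

    gen = VITSGenerator(vocabs=40)                 # hypothetical vocabulary size
    text = torch.randint(0, 40, (2, 8))            # (B, T_text)
    text_lengths = torch.tensor([8, 6])
    feats = torch.randn(2, 513, 40)                # (B, aux_channels, T_feats)
    feats_lengths = torch.tensor([40, 36])
    wav, dur_nll, attn, start_idxs, x_mask, y_mask, latents = gen(
        text, text_lengths, feats, feats_lengths
    )
    z, z_p, m_p, logs_p, m_q, logs_q = latents     # each (B, H, T_feats)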

inference(text: torch.Tensor, text_lengths: torch.Tensor, feats: Optional[torch.Tensor] = None, feats_lengths: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, dur: Optional[torch.Tensor] = None, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, alpha: float = 1.0, max_len: Optional[int] = None, use_teacher_forcing: bool = False) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Run inference.

Parameters:
  • text (Tensor) – Input text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • feats (Tensor) – Feature tensor (B, aux_channels, T_feats).

  • feats_lengths (Tensor) – Feature length tensor (B,).

  • sids (Optional[Tensor]) – Speaker index tensor (B,) or (B, 1).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, spk_embed_dim).

  • lids (Optional[Tensor]) – Language index tensor (B,) or (B, 1).

  • dur (Optional[Tensor]) – Ground-truth duration (B, T_text). If provided, skip the prediction of durations (i.e., teacher forcing).

  • noise_scale (float) – Noise scale parameter for flow.

  • noise_scale_dur (float) – Noise scale parameter for duration predictor.

  • alpha (float) – Alpha parameter to control the speed of generated speech.

  • max_len (Optional[int]) – Maximum length of acoustic feature sequence.

  • use_teacher_forcing (bool) – Whether to use teacher forcing.

Returns:

  • Tensor: Generated waveform tensor (B, T_wav).

  • Tensor: Monotonic attention weight tensor (B, T_feats, T_text).

  • Tensor: Duration tensor (B, T_text).

Return type:

Tuple[Tensor, Tensor, Tensor]

espnet2.gan_tts.vits.vits

VITS module for GAN-TTS task.

class espnet2.gan_tts.vits.vits.VITS(idim: int, odim: int, sampling_rate: int = 22050, generator_type: str = 'vits_generator', generator_params: Dict[str, Any] = {'decoder_channels': 512, 'decoder_kernel_size': 7, 'decoder_resblock_dilations': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'decoder_resblock_kernel_sizes': [3, 7, 11], 'decoder_upsample_kernel_sizes': [16, 16, 4, 4], 'decoder_upsample_scales': [8, 8, 2, 2], 'flow_base_dilation': 1, 'flow_dropout_rate': 0.0, 'flow_flows': 4, 'flow_kernel_size': 5, 'flow_layers': 4, 'global_channels': -1, 'hidden_channels': 192, 'langs': None, 'posterior_encoder_base_dilation': 1, 'posterior_encoder_dropout_rate': 0.0, 'posterior_encoder_kernel_size': 5, 'posterior_encoder_layers': 16, 'posterior_encoder_stacks': 1, 'segment_size': 32, 'spk_embed_dim': None, 'spks': None, 'stochastic_duration_predictor_dds_conv_layers': 3, 'stochastic_duration_predictor_dropout_rate': 0.5, 'stochastic_duration_predictor_flows': 4, 'stochastic_duration_predictor_kernel_size': 3, 'text_encoder_activation_type': 'swish', 'text_encoder_attention_dropout_rate': 0.0, 'text_encoder_attention_heads': 2, 'text_encoder_blocks': 6, 'text_encoder_conformer_kernel_size': 7, 'text_encoder_dropout_rate': 0.1, 'text_encoder_ffn_expand': 4, 'text_encoder_normalize_before': True, 'text_encoder_positional_dropout_rate': 0.0, 'text_encoder_positional_encoding_layer_type': 'rel_pos', 'text_encoder_positionwise_conv_kernel_size': 1, 'text_encoder_positionwise_layer_type': 'conv1d', 'text_encoder_self_attention_layer_type': 'rel_selfattn', 'use_conformer_conv_in_text_encoder': True, 'use_macaron_style_in_text_encoder': True, 'use_only_mean_in_flow': True, 'use_weight_norm_in_decoder': True, 'use_weight_norm_in_flow': True, 'use_weight_norm_in_posterior_encoder': True}, discriminator_type: str = 'hifigan_multi_scale_multi_period_discriminator', discriminator_params: Dict[str, Any] = {'follow_official_norm': False, 'period_discriminator_params': {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'periods': [2, 3, 5, 7, 11], 'scale_discriminator_params': {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'scale_downsample_pooling': 'AvgPool1d', 'scale_downsample_pooling_params': {'kernel_size': 4, 'padding': 2, 'stride': 2}, 'scales': 1}, generator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, discriminator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 22050, 'hop_length': 256, 'log_base': None, 'n_fft': 1024, 'n_mels': 80, 'win_length': None, 'window': 'hann'}, lambda_adv: float = 1.0, lambda_mel: float = 45.0, lambda_feat_match: float = 2.0, lambda_dur: float = 1.0, lambda_kl: float = 1.0, cache_generator_outputs: bool = True, plot_pred_mos: bool = False, mos_pred_tool: 
str = 'utmos')[source]

Bases: espnet2.gan_tts.abs_gan_tts.AbsGANTTS

VITS module (generator + discriminator).

This is a module of VITS described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Initialize VITS module.

Parameters:
  • idim (int) – Input vocabulary size.

  • odim (int) – Acoustic feature dimension. The actual output channels will be 1, since VITS is an end-to-end text-to-wave model, but odim is used to indicate the acoustic feature dimension for compatibility.

  • sampling_rate (int) – Sampling rate. Not used for training, but referenced when saving the waveform during inference.

  • generator_type (str) – Generator type.

  • generator_params (Dict[str, Any]) – Parameter dict for generator.

  • discriminator_type (str) – Discriminator type.

  • discriminator_params (Dict[str, Any]) – Parameter dict for discriminator.

  • generator_adv_loss_params (Dict[str, Any]) – Parameter dict for generator adversarial loss.

  • discriminator_adv_loss_params (Dict[str, Any]) – Parameter dict for discriminator adversarial loss.

  • feat_match_loss_params (Dict[str, Any]) – Parameter dict for feat match loss.

  • mel_loss_params (Dict[str, Any]) – Parameter dict for mel loss.

  • lambda_adv (float) – Loss scaling coefficient for adversarial loss.

  • lambda_mel (float) – Loss scaling coefficient for mel spectrogram loss.

  • lambda_feat_match (float) – Loss scaling coefficient for feat match loss.

  • lambda_dur (float) – Loss scaling coefficient for duration loss.

  • lambda_kl (float) – Loss scaling coefficient for KL divergence loss.

  • cache_generator_outputs (bool) – Whether to cache generator outputs.

  • plot_pred_mos (bool) – Whether to plot predicted MOS during the training.

  • mos_pred_tool (str) – MOS prediction tool name.

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, speech: torch.Tensor, speech_lengths: torch.Tensor, sids: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, forward_generator: bool = True) → Dict[str, Any][source]

Perform generator forward.

Parameters:
  • text (Tensor) – Text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • feats (Tensor) – Feature tensor (B, T_feats, aux_channels).

  • feats_lengths (Tensor) – Feature length tensor (B,).

  • speech (Tensor) – Speech waveform tensor (B, T_wav).

  • speech_lengths (Tensor) – Speech length tensor (B,).

  • sids (Optional[Tensor]) – Speaker index tensor (B,) or (B, 1).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, spk_embed_dim).

  • lids (Optional[Tensor]) – Language index tensor (B,) or (B, 1).

  • forward_generator (bool) – Whether to forward generator.

Returns:

  • loss (Tensor): Loss scalar tensor.

  • stats (Dict[str, float]): Statistics to be monitored.

  • weight (Tensor): Weight tensor to summarize losses.

  • optim_idx (int): Optimizer index (0 for G and 1 for D).

Return type:

Dict[str, Any]

inference(text: torch.Tensor, feats: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, durations: Optional[torch.Tensor] = None, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, alpha: float = 1.0, max_len: Optional[int] = None, use_teacher_forcing: bool = False) → Dict[str, torch.Tensor][source]

Run inference.

Parameters:
  • text (Tensor) – Input text index tensor (T_text,).

  • feats (Tensor) – Feature tensor (T_feats, aux_channels).

  • sids (Tensor) – Speaker index tensor (1,).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (spk_embed_dim,).

  • lids (Tensor) – Language index tensor (1,).

  • durations (Tensor) – Ground-truth duration tensor (T_text,).

  • noise_scale (float) – Noise scale value for flow.

  • noise_scale_dur (float) – Noise scale value for duration predictor.

  • alpha (float) – Alpha parameter to control the speed of generated speech.

  • max_len (Optional[int]) – Maximum length.

  • use_teacher_forcing (bool) – Whether to use teacher forcing.

Returns:

  • wav (Tensor): Generated waveform tensor (T_wav,).

  • att_w (Tensor): Monotonic attention weight tensor (T_feats, T_text).

  • duration (Tensor): Predicted duration tensor (T_text,).

Return type:

Dict[str, Tensor]

property require_raw_speech

Return whether or not speech is required.

property require_vocoder

Return whether or not vocoder is required.
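
A hedged inference sketch (idim=40 and odim=80 are placeholder sizes; the default configuration synthesizes at 22.05 kHz, and a trained model is assumed for meaningful output):

    import torch
    from espnet2.gan_tts.vits.vits import VITS

    model = VITS(idim=40, odim=80).eval()
    text = torch.randint(0, 40, (10,))         # (T_text,)
    with torch.no_grad():
        output = model.inference(text, noise_scale=0.667, alpha=1.0)
    wav = output["wav"]                        # (T_wav,)
    duration = output["duration"]              # (T_text,)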

espnet2.gan_tts.vits.residual_coupling

Residual affine coupling modules in VITS.

This code is based on https://github.com/jaywalnut310/vits.

class espnet2.gan_tts.vits.residual_coupling.ResidualAffineCouplingBlock(in_channels: int = 192, hidden_channels: int = 192, flows: int = 4, kernel_size: int = 5, base_dilation: int = 1, layers: int = 4, global_channels: int = -1, dropout_rate: float = 0.0, use_weight_norm: bool = True, bias: bool = True, use_only_mean: bool = True)[source]

Bases: torch.nn.modules.module.Module

Residual affine coupling block module.

This is a module of residual affine coupling block, which is used as the “Flow” in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Initialize ResidualAffineCouplingBlock module.

Parameters:
  • in_channels (int) – Number of input channels.

  • hidden_channels (int) – Number of hidden channels.

  • flows (int) – Number of flows.

  • kernel_size (int) – Kernel size for WaveNet.

  • base_dilation (int) – Base dilation factor for WaveNet.

  • layers (int) – Number of layers of WaveNet.

  • global_channels (int) – Number of global channels.

  • dropout_rate (float) – Dropout rate.

  • use_weight_norm (bool) – Whether to use weight normalization in WaveNet.

  • bias (bool) – Whether to use bias parameters in WaveNet.

  • use_only_mean (bool) – Whether to estimate only mean.

forward(x: torch.Tensor, x_mask: torch.Tensor, g: Optional[torch.Tensor] = None, inverse: bool = False) → torch.Tensor[source]

Calculate forward propagation.

Parameters:
  • x (Tensor) – Input tensor (B, in_channels, T).

  • x_mask (Tensor) – Mask tensor (B, 1, T).

  • g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).

  • inverse (bool) – Whether to inverse the flow.

Returns:

Output tensor (B, in_channels, T).

Return type:

Tensor
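
Because each coupling layer is exactly invertible, a round trip through the block reconstructs its input; a quick sketch:

    import torch
    from espnet2.gan_tts.vits.residual_coupling import ResidualAffineCouplingBlock

    flow = ResidualAffineCouplingBlock()
    x = torch.randn(2, 192, 20)                # (B, in_channels, T)
    x_mask = torch.ones(2, 1, 20)
    z = flow(x, x_mask)                        # forward direction
    x_rec = flow(z, x_mask, inverse=True)      # inverse direction
    assert torch.allclose(x, x_rec, atol=1e-4)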

class espnet2.gan_tts.vits.residual_coupling.ResidualAffineCouplingLayer(in_channels: int = 192, hidden_channels: int = 192, kernel_size: int = 5, base_dilation: int = 1, layers: int = 5, stacks: int = 1, global_channels: int = -1, dropout_rate: float = 0.0, use_weight_norm: bool = True, bias: bool = True, use_only_mean: bool = True)[source]

Bases: torch.nn.modules.module.Module

Residual affine coupling layer.

Initialize ResidualAffineCouplingLayer module.

Parameters:
  • in_channels (int) – Number of input channels.

  • hidden_channels (int) – Number of hidden channels.

  • kernel_size (int) – Kernel size for WaveNet.

  • base_dilation (int) – Base dilation factor for WaveNet.

  • layers (int) – Number of layers of WaveNet.

  • stacks (int) – Number of stacks of WaveNet.

  • global_channels (int) – Number of global channels.

  • dropout_rate (float) – Dropout rate.

  • use_weight_norm (bool) – Whether to use weight normalization in WaveNet.

  • bias (bool) – Whether to use bias parameters in WaveNet.

  • use_only_mean (bool) – Whether to estimate only mean.

forward(x: torch.Tensor, x_mask: torch.Tensor, g: Optional[torch.Tensor] = None, inverse: bool = False) → Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]

Calculate forward propagation.

Parameters:
  • x (Tensor) – Input tensor (B, in_channels, T).

  • x_mask (Tensor) – Mask tensor (B, 1, T).

  • g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).

  • inverse (bool) – Whether to inverse the flow.

Returns:

  • Tensor: Output tensor (B, in_channels, T).

  • Tensor: Log-determinant tensor for NLL (B,) if not inverse.

Return type:

Union[Tensor, Tuple[Tensor, Tensor]]

espnet2.gan_tts.vits.duration_predictor

Stochastic duration predictor modules in VITS.

This code is based on https://github.com/jaywalnut310/vits.

class espnet2.gan_tts.vits.duration_predictor.StochasticDurationPredictor(channels: int = 192, kernel_size: int = 3, dropout_rate: float = 0.5, flows: int = 4, dds_conv_layers: int = 3, global_channels: int = -1)[source]

Bases: torch.nn.modules.module.Module

Stochastic duration predictor module.

This is a module of stochastic duration predictor described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Initialize StochasticDurationPredictor module.

Parameters:
  • channels (int) – Number of channels.

  • kernel_size (int) – Kernel size.

  • dropout_rate (float) – Dropout rate.

  • flows (int) – Number of flows.

  • dds_conv_layers (int) – Number of conv layers in DDS conv.

  • global_channels (int) – Number of global conditioning channels.

forward(x: torch.Tensor, x_mask: torch.Tensor, w: Optional[torch.Tensor] = None, g: Optional[torch.Tensor] = None, inverse: bool = False, noise_scale: float = 1.0) → torch.Tensor[source]

Calculate forward propagation.

Parameters:
  • x (Tensor) – Input tensor (B, channels, T_text).

  • x_mask (Tensor) – Mask tensor (B, 1, T_text).

  • w (Optional[Tensor]) – Duration tensor (B, 1, T_text).

  • g (Optional[Tensor]) – Global conditioning tensor (B, channels, 1).

  • inverse (bool) – Whether to inverse the flow.

  • noise_scale (float) – Noise scale value.

Returns:

If not inverse, negative log-likelihood (NLL) tensor (B,).

If inverse, log-duration tensor (B, 1, T_text).

Return type:

Tensor
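
A usage sketch covering both directions (toy shapes; in practice the predictor operates on text-encoder hidden states):

    import torch
    from espnet2.gan_tts.vits.duration_predictor import StochasticDurationPredictor

    dp = StochasticDurationPredictor()
    x = torch.randn(2, 192, 15)                    # (B, channels, T_text)
    x_mask = torch.ones(2, 1, 15)
    w = torch.randint(1, 5, (2, 1, 15)).float()    # ground-truth durations
    nll = dp(x, x_mask, w=w)                       # training direction: NLL (B,)
    logw = dp(x, x_mask, inverse=True, noise_scale=0.8)  # sampled log-durations (B, 1, T_text)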

espnet2.gan_tts.vits.loss

VITS-related loss modules.

This code is based on https://github.com/jaywalnut310/vits.

class espnet2.gan_tts.vits.loss.KLDivergenceLoss(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module

KL divergence loss.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(z_p: torch.Tensor, logs_q: torch.Tensor, m_p: torch.Tensor, logs_p: torch.Tensor, z_mask: torch.Tensor) → torch.Tensor[source]

Calculate KL divergence loss.

Parameters:
  • z_p (Tensor) – Flow hidden representation (B, H, T_feats).

  • logs_q (Tensor) – Posterior encoder projected scale (B, H, T_feats).

  • m_p (Tensor) – Expanded text encoder projected mean (B, H, T_feats).

  • logs_p (Tensor) – Expanded text encoder projected scale (B, H, T_feats).

  • z_mask (Tensor) – Mask tensor (B, 1, T_feats).

Returns:

KL divergence loss.

Return type:

Tensor
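
The computation follows the upstream VITS objective: an elementwise Gaussian KL term evaluated at the flow output z_p and averaged over unmasked positions. A minimal sketch of the formula (the exact reduction is stated here as an assumption matching the upstream implementation):

    import torch

    def kl_divergence(z_p, logs_q, m_p, logs_p, z_mask):
        # KL(q || p) with p = N(m_p, exp(logs_p)^2), evaluated at the
        # posterior sample z_p (single-sample Monte Carlo estimate)
        kl = logs_p - logs_q - 0.5
        kl = kl + 0.5 * (z_p - m_p) ** 2 * torch.exp(-2.0 * logs_p)
        return (kl * z_mask).sum() / z_mask.sum()  # mean over unmasked positions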

class espnet2.gan_tts.vits.loss.KLDivergenceLossWithoutFlow(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module

KL divergence loss without flow.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(m_q: torch.Tensor, logs_q: torch.Tensor, m_p: torch.Tensor, logs_p: torch.Tensor) → torch.Tensor[source]

Calculate KL divergence loss without flow.

Parameters:
  • m_q (Tensor) – Posterior encoder projected mean (B, H, T_feats).

  • logs_q (Tensor) – Posterior encoder projected scale (B, H, T_feats).

  • m_p (Tensor) – Expanded text encoder projected mean (B, H, T_feats).

  • logs_p (Tensor) – Expanded text encoder projected scale (B, H, T_feats).

espnet2.gan_tts.vits.posterior_encoder

Posterior encoder module in VITS.

This code is based on https://github.com/jaywalnut310/vits.

class espnet2.gan_tts.vits.posterior_encoder.PosteriorEncoder(in_channels: int = 513, out_channels: int = 192, hidden_channels: int = 192, kernel_size: int = 5, layers: int = 16, stacks: int = 1, base_dilation: int = 1, global_channels: int = -1, dropout_rate: float = 0.0, bias: bool = True, use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

Posterior encoder module in VITS.

This is a module of posterior encoder described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Initialize PosteriorEncoder module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • hidden_channels (int) – Number of hidden channels.

  • kernel_size (int) – Kernel size in WaveNet.

  • layers (int) – Number of layers of WaveNet.

  • stacks (int) – Number of repeat stacking of WaveNet.

  • base_dilation (int) – Base dilation factor.

  • global_channels (int) – Number of global conditioning channels.

  • dropout_rate (float) – Dropout rate.

  • bias (bool) – Whether to use bias parameters in conv.

  • use_weight_norm (bool) – Whether to apply weight norm.

forward(x: torch.Tensor, x_lengths: torch.Tensor, g: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • x (Tensor) – Input tensor (B, in_channels, T_feats).

  • x_lengths (Tensor) – Length tensor (B,).

  • g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).

Returns:

  • Tensor: Encoded hidden representation tensor (B, out_channels, T_feats).

  • Tensor: Projected mean tensor (B, out_channels, T_feats).

  • Tensor: Projected scale tensor (B, out_channels, T_feats).

  • Tensor: Mask tensor for input tensor (B, 1, T_feats).

Return type:

Tuple[Tensor, Tensor, Tensor, Tensor]
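
A shape-oriented usage sketch (the defaults expect 513-dim linear spectrogram frames):

    import torch
    from espnet2.gan_tts.vits.posterior_encoder import PosteriorEncoder

    enc = PosteriorEncoder()                   # 513 -> 192 channels by default
    x = torch.randn(2, 513, 30)                # (B, in_channels, T_feats)
    x_lengths = torch.tensor([30, 24])
    z, m, logs, x_mask = enc(x, x_lengths)     # z, m, logs: (B, 192, 30); mask: (B, 1, 30)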

espnet2.gan_tts.vits.flow

Basic Flow modules used in VITS.

This code is based on https://github.com/jaywalnut310/vits.

class espnet2.gan_tts.vits.flow.ConvFlow(in_channels: int, hidden_channels: int, kernel_size: int, layers: int, bins: int = 10, tail_bound: float = 5.0)[source]

Bases: torch.nn.modules.module.Module

Convolutional flow module.

Initialize ConvFlow module.

Parameters:
  • in_channels (int) – Number of input channels.

  • hidden_channels (int) – Number of hidden channels.

  • kernel_size (int) – Kernel size.

  • layers (int) – Number of layers.

  • bins (int) – Number of bins.

  • tail_bound (float) – Tail bound value.

forward(x: torch.Tensor, x_mask: torch.Tensor, g: Optional[torch.Tensor] = None, inverse: bool = False) → Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]

Calculate forward propagation.

Parameters:
  • x (Tensor) – Input tensor (B, channels, T).

  • x_mask (Tensor) – Mask tensor (B, 1, T).

  • g (Optional[Tensor]) – Global conditioning tensor (B, channels, 1).

  • inverse (bool) – Whether to inverse the flow.

Returns:

  • Tensor: Output tensor (B, channels, T).

  • Tensor: Log-determinant tensor for NLL (B,) if not inverse.

Return type:

Union[Tensor, Tuple[Tensor, Tensor]]

class espnet2.gan_tts.vits.flow.DilatedDepthSeparableConv(channels: int, kernel_size: int, layers: int, dropout_rate: float = 0.0, eps: float = 1e-05)[source]

Bases: torch.nn.modules.module.Module

Dilated depth-separable conv module.

Initialize DilatedDepthSeparableConv module.

Parameters:
  • channels (int) – Number of channels.

  • kernel_size (int) – Kernel size.

  • layers (int) – Number of layers.

  • dropout_rate (float) – Dropout rate.

  • eps (float) – Epsilon for layer norm.

forward(x: torch.Tensor, x_mask: torch.Tensor, g: Optional[torch.Tensor] = None) → torch.Tensor[source]

Calculate forward propagation.

Parameters:
  • x (Tensor) – Input tensor (B, in_channels, T).

  • x_mask (Tensor) – Mask tensor (B, 1, T).

  • g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).

Returns:

Output tensor (B, channels, T).

Return type:

Tensor

class espnet2.gan_tts.vits.flow.ElementwiseAffineFlow(channels: int)[source]

Bases: torch.nn.modules.module.Module

Elementwise affine flow module.

Initialize ElementwiseAffineFlow module.

Parameters:

channels (int) – Number of channels.

forward(x: torch.Tensor, x_mask: torch.Tensor, inverse: bool = False, **kwargs) → Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]

Calculate forward propagation.

Parameters:
  • x (Tensor) – Input tensor (B, channels, T).

  • x_mask (Tensor) – Mask tensor (B, 1, T).

  • inverse (bool) – Whether to inverse the flow.

Returns:

  • Tensor: Output tensor (B, channels, T).

  • Tensor: Log-determinant tensor for NLL (B,) if not inverse.

Return type:

Union[Tensor, Tuple[Tensor, Tensor]]

class espnet2.gan_tts.vits.flow.FlipFlow(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module

Flip flow module.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: torch.Tensor, *args, inverse: bool = False, **kwargs) → Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]

Calculate forward propagation.

Parameters:
  • x (Tensor) – Input tensor (B, channels, T).

  • inverse (bool) – Whether to inverse the flow.

Returns:

  • Tensor: Flipped tensor (B, channels, T).

  • Tensor: Log-determinant tensor for NLL (B,) if not inverse.

Return type:

Union[Tensor, Tuple[Tensor, Tensor]]

class espnet2.gan_tts.vits.flow.LogFlow(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module

Log flow module.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: torch.Tensor, x_mask: torch.Tensor, inverse: bool = False, eps: float = 1e-05, **kwargs) → Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]

Calculate forward propagation.

Parameters:
  • x (Tensor) – Input tensor (B, channels, T).

  • x_mask (Tensor) – Mask tensor (B, 1, T).

  • inverse (bool) – Whether to inverse the flow.

  • eps (float) – Epsilon for log.

Returns:

  • Tensor: Output tensor (B, channels, T).

  • Tensor: Log-determinant tensor for NLL (B,) if not inverse.

Return type:

Union[Tensor, Tuple[Tensor, Tensor]]

class espnet2.gan_tts.vits.flow.Transpose(dim1: int, dim2: int)[source]

Bases: torch.nn.modules.module.Module

Transpose module for torch.nn.Sequential().

Initialize Transpose module.

forward(x: torch.Tensor) → torch.Tensor[source]

Transpose.
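
Example (an illustrative sketch of the intended use inside torch.nn.Sequential; the channel size 192 is arbitrary):

>>> import torch
>>> from espnet2.gan_tts.vits.flow import Transpose
>>> seq = torch.nn.Sequential(
...     Transpose(1, 2),           # (B, C, T) -> (B, T, C)
...     torch.nn.LayerNorm(192),   # normalize over the channel axis
...     Transpose(1, 2),           # back to (B, C, T)
... )
>>> y = seq(torch.randn(2, 192, 50))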

espnet2.gan_tts.vits.__init__

espnet2.gan_tts.vits.text_encoder

Text encoder module in VITS.

This code is based on https://github.com/jaywalnut310/vits.

class espnet2.gan_tts.vits.text_encoder.TextEncoder(vocabs: int, attention_dim: int = 192, attention_heads: int = 2, linear_units: int = 768, blocks: int = 6, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 3, positional_encoding_layer_type: str = 'rel_pos', self_attention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', normalize_before: bool = True, use_macaron_style: bool = False, use_conformer_conv: bool = False, conformer_kernel_size: int = 7, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0)[source]

Bases: torch.nn.modules.module.Module

Text encoder module in VITS.

This is the text encoder module described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Instead of the relative positional Transformer, we use the conformer architecture as the encoder module, which contains additional convolution layers.

Initialize TextEncoder module.

Parameters:
  • vocabs (int) – Vocabulary size.

  • attention_dim (int) – Attention dimension.

  • attention_heads (int) – Number of attention heads.

  • linear_units (int) – Number of linear units of positionwise layers.

  • blocks (int) – Number of encoder blocks.

  • positionwise_layer_type (str) – Positionwise layer type.

  • positionwise_conv_kernel_size (int) – Positionwise layer’s kernel size.

  • positional_encoding_layer_type (str) – Positional encoding layer type.

  • self_attention_layer_type (str) – Self-attention layer type.

  • activation_type (str) – Activation function type.

  • normalize_before (bool) – Whether to apply LayerNorm before attention.

  • use_macaron_style (bool) – Whether to use macaron style components.

  • use_conformer_conv (bool) – Whether to use conformer conv layers.

  • conformer_kernel_size (int) – Conformer’s conv kernel size.

  • dropout_rate (float) – Dropout rate.

  • positional_dropout_rate (float) – Dropout rate for positional encoding.

  • attention_dropout_rate (float) – Dropout rate for attention.

forward(x: torch.Tensor, x_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • x (Tensor) – Input index tensor (B, T_text).

  • x_lengths (Tensor) – Length tensor (B,).

Returns:

Encoded hidden representation (B, attention_dim, T_text), projected mean tensor (B, attention_dim, T_text), projected scale tensor (B, attention_dim, T_text), and mask tensor for the input (B, 1, T_text).

Return type:

Tuple[Tensor, Tensor, Tensor, Tensor]
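
Example (an illustrative sketch; the vocabulary size, batch, and lengths are arbitrary):

>>> import torch
>>> from espnet2.gan_tts.vits.text_encoder import TextEncoder
>>> encoder = TextEncoder(vocabs=40)
>>> x = torch.randint(0, 40, (2, 30))   # (B, T_text) token indices
>>> x_lengths = torch.tensor([30, 22])
>>> h, m, logs, x_mask = encoder(x, x_lengths)
>>> h.shape                             # (B, attention_dim, T_text)
torch.Size([2, 192, 30])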

espnet2.gan_tts.vits.monotonic_align.setup

espnet2.gan_tts.vits.monotonic_align.__init__

Maximum path calculation module.

This code is based on https://github.com/jaywalnut310/vits.

espnet2.gan_tts.vits.monotonic_align.__init__.maximum_path(neg_x_ent: torch.Tensor, attn_mask: torch.Tensor) → torch.Tensor[source]

Calculate maximum path.

Parameters:
  • neg_x_ent (Tensor) – Negative cross-entropy tensor (B, T_feats, T_text).

  • attn_mask (Tensor) – Attention mask (B, T_feats, T_text).

Returns:

Maximum path tensor (B, T_feats, T_text).

Return type:

Tensor
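
Example (an illustrative sketch; in VITS, neg_x_ent holds the prior log-likelihoods of each (frame, token) pair, for which random values stand in here):

>>> import torch
>>> from espnet2.gan_tts.vits.monotonic_align import maximum_path
>>> neg_x_ent = torch.randn(1, 20, 5)   # (B, T_feats, T_text)
>>> attn_mask = torch.ones(1, 20, 5)    # (B, T_feats, T_text)
>>> attn = maximum_path(neg_x_ent, attn_mask)  # hard 0/1 monotonic alignment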

espnet2.gan_tts.vits.monotonic_align.__init__.maximum_path_each_numba[source]

Calculate a single maximum path with numba.

espnet2.gan_tts.vits.monotonic_align.__init__.maximum_path_numba[source]

Calculate batch maximum path with numba.

espnet2.gan_tts.style_melgan.tade_res_block

StyleMelGAN’s TADEResBlock Modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.style_melgan.tade_res_block.TADELayer(in_channels: int = 64, aux_channels: int = 80, kernel_size: int = 9, bias: bool = True, upsample_factor: int = 2, upsample_mode: str = 'nearest')[source]

Bases: torch.nn.modules.module.Module

TADE Layer module.

Initialize TADELayer module.

Parameters:
  • in_channels (int) – Number of input channels.

  • aux_channels (int) – Number of auxiliary channels.

  • kernel_size (int) – Kernel size.

  • bias (bool) – Whether to use bias parameter in conv.

  • upsample_factor (int) – Upsample factor.

  • upsample_mode (str) – Upsample mode.

forward(x: torch.Tensor, c: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters:
  • x (Tensor) – Input tensor (B, in_channels, T).

  • c (Tensor) – Auxiliary input tensor (B, aux_channels, T’).

Returns:

Output tensor (B, in_channels, T * in_upsample_factor) and upsampled auxiliary tensor (B, in_channels, T * aux_upsample_factor).

Return type:

Tuple[Tensor, Tensor]
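
Example (an illustrative sketch; x and c are assumed to share the same length T, as inside StyleMelGAN, so with upsample_factor=2 both outputs have length T * 2):

>>> import torch
>>> from espnet2.gan_tts.style_melgan.tade_res_block import TADELayer
>>> layer = TADELayer(in_channels=64, aux_channels=80, upsample_factor=2)
>>> x = torch.randn(2, 64, 100)         # (B, in_channels, T)
>>> c = torch.randn(2, 80, 100)         # (B, aux_channels, T')
>>> y, c_up = layer(x, c)               # both (B, in_channels, T * 2)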

class espnet2.gan_tts.style_melgan.tade_res_block.TADEResBlock(in_channels: int = 64, aux_channels: int = 80, kernel_size: int = 9, dilation: int = 2, bias: bool = True, upsample_factor: int = 2, upsample_mode: str = 'nearest', gated_function: str = 'softmax')[source]

Bases: torch.nn.modules.module.Module

TADEResBlock module.

Initialize TADEResBlock module.

Parameters:
  • in_channels (int) – Number of input channels.

  • aux_channels (int) – Number of auxiliary channels.

  • kernel_size (int) – Kernel size.

  • dilation (int) – Dilation factor.

  • bias (bool) – Whether to use bias parameter in conv.

  • upsample_factor (int) – Upsample factor.

  • upsample_mode (str) – Upsample mode.

  • gated_function (str) – Gated function type (“softmax” or “sigmoid”).

forward(x: torch.Tensor, c: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters:
  • x (Tensor) – Input tensor (B, in_channels, T).

  • c (Tensor) – Auxiliary input tensor (B, aux_channels, T’).

Returns:

Output tensor (B, in_channels, T * in_upsample_factor) and upsampled auxiliary tensor (B, in_channels, T * in_upsample_factor).

Return type:

Tuple[Tensor, Tensor]

espnet2.gan_tts.style_melgan.__init__

espnet2.gan_tts.style_melgan.style_melgan

StyleMelGAN Modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.style_melgan.style_melgan.StyleMelGANDiscriminator(repeats: int = 2, window_sizes: List[int] = [512, 1024, 2048, 4096], pqmf_params: List[List[int]] = [[1, None, None, None], [2, 62, 0.267, 9.0], [4, 62, 0.142, 9.0], [8, 62, 0.07949, 9.0]], discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 16, 'downsample_scales': [4, 4, 4, 1], 'kernel_sizes': [5, 3], 'max_downsample_channels': 512, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.2}, 'out_channels': 1, 'pad': 'ReflectionPad1d', 'pad_params': {}}, use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

Style MelGAN discriminator module.

Initialize StyleMelGANDiscriminator module.

Parameters:
  • repeats (int) – Number of repetitions to apply RWD (random window discriminators).

  • window_sizes (List[int]) – List of random window sizes.

  • pqmf_params (List[List[int]]) – List of parameter lists for the PQMF modules.

  • discriminator_params (Dict[str, Any]) – Parameters for base discriminator module.

  • use_weight_norm (bool) – Whether to apply weight normalization.

apply_weight_norm()[source]

Apply weight normalization to all of the layers.

forward(x: torch.Tensor) → List[torch.Tensor][source]

Calculate forward propagation.

Parameters:

x (Tensor) – Input tensor (B, 1, T).

Returns:

List of discriminator outputs; the number of items in the list equals repeats * #discriminators.

Return type:

List

reset_parameters()[source]

Reset parameters.

class espnet2.gan_tts.style_melgan.style_melgan.StyleMelGANGenerator(in_channels: int = 128, aux_channels: int = 80, channels: int = 64, out_channels: int = 1, kernel_size: int = 9, dilation: int = 2, bias: bool = True, noise_upsample_scales: List[int] = [11, 2, 2, 2], noise_upsample_activation: str = 'LeakyReLU', noise_upsample_activation_params: Dict[str, Any] = {'negative_slope': 0.2}, upsample_scales: List[int] = [2, 2, 2, 2, 2, 2, 2, 2, 1], upsample_mode: str = 'nearest', gated_function: str = 'softmax', use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

Style MelGAN generator module.

Initialize StyleMelGANGenerator module.

Parameters:
  • in_channels (int) – Number of input noise channels.

  • aux_channels (int) – Number of auxiliary input channels.

  • channels (int) – Number of channels for conv layer.

  • out_channels (int) – Number of output channels.

  • kernel_size (int) – Kernel size of conv layers.

  • dilation (int) – Dilation factor for conv layers.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • noise_upsample_scales (List[int]) – List of noise upsampling scales.

  • noise_upsample_activation (str) – Activation function module name for noise upsampling.

  • noise_upsample_activation_params (Dict[str, Any]) – Hyperparameters for the above activation function.

  • upsample_scales (List[int]) – List of upsampling scales.

  • upsample_mode (str) – Upsampling mode in TADE layer.

  • gated_function (str) – Gated function used in TADEResBlock (“softmax” or “sigmoid”).

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

apply_weight_norm()[source]

Apply weight normalization to all of the layers.

forward(c: torch.Tensor, z: Optional[torch.Tensor] = None) → torch.Tensor[source]

Calculate forward propagation.

Parameters:
  • c (Tensor) – Auxiliary input tensor (B, aux_channels, T).

  • z (Tensor) – Input noise tensor (B, in_channels, 1).

Returns:

Output tensor (B, out_channels, T * prod(upsample_scales)).

Return type:

Tensor

inference(c: torch.Tensor) → torch.Tensor[source]

Perform inference.

Parameters:

c (Tensor) – Input tensor (T, aux_channels).

Returns:

Output tensor (T * prod(upsample_scales), out_channels).

Return type:

Tensor

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

reset_parameters()[source]

Reset parameters.
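
Example (an inference sketch with the default configuration; the number of mel frames, 40, is arbitrary, and inference pads the input internally):

>>> import torch
>>> from espnet2.gan_tts.style_melgan.style_melgan import StyleMelGANGenerator
>>> gen = StyleMelGANGenerator()        # defaults: 80-dim mel, 128-dim noise
>>> mel = torch.randn(40, 80)           # (T, aux_channels)
>>> wav = gen.inference(mel)            # (T * prod(upsample_scales), out_channels)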

espnet2.gan_tts.melgan.melgan

MelGAN Modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.melgan.melgan.MelGANDiscriminator(in_channels: int = 1, out_channels: int = 1, kernel_sizes: List[int] = [5, 3], channels: int = 16, max_downsample_channels: int = 1024, bias: bool = True, downsample_scales: List[int] = [4, 4, 4, 4], nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.2}, pad: str = 'ReflectionPad1d', pad_params: Dict[str, Any] = {})[source]

Bases: torch.nn.modules.module.Module

MelGAN discriminator module.

Initialize MelGANDiscriminator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • kernel_sizes (List[int]) – List of two kernel sizes. The prod will be used for the first conv layer, and the first and the second kernel sizes will be used for the last two layers. For example if kernel_sizes = [5, 3], the first layer kernel size will be 5 * 3 = 15, the last two layers’ kernel size will be 5 and 3, respectively.

  • channels (int) – Initial number of channels for conv layer.

  • max_downsample_channels (int) – Maximum number of channels for downsampling layers.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • downsample_scales (List[int]) – List of downsampling scales.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • pad (str) – Padding function module name before dilated convolution layer.

  • pad_params (Dict[str, Any]) – Hyperparameters for padding function.

forward(x: torch.Tensor) → List[torch.Tensor][source]

Calculate forward propagation.

Parameters:

x (Tensor) – Input signal (B, 1, T).

Returns:

List of output tensors of each layer.

Return type:

List[Tensor]

class espnet2.gan_tts.melgan.melgan.MelGANGenerator(in_channels: int = 80, out_channels: int = 1, kernel_size: int = 7, channels: int = 512, bias: bool = True, upsample_scales: List[int] = [8, 8, 2, 2], stack_kernel_size: int = 3, stacks: int = 3, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.2}, pad: str = 'ReflectionPad1d', pad_params: Dict[str, Any] = {}, use_final_nonlinear_activation: bool = True, use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

MelGAN generator module.

Initialize MelGANGenerator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • kernel_size (int) – Kernel size of initial and final conv layer.

  • channels (int) – Initial number of channels for conv layer.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • upsample_scales (List[int]) – List of upsampling scales.

  • stack_kernel_size (int) – Kernel size of dilated conv layers in residual stack.

  • stacks (int) – Number of stacks in a single residual stack.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • pad (str) – Padding function module name before dilated convolution layer.

  • pad_params (Dict[str, Any]) – Hyperparameters for padding function.

  • use_final_nonlinear_activation (bool) – Whether to use a nonlinear activation after the final conv layer.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

apply_weight_norm()[source]

Apply weight normalization to all of the layers.

forward(c: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters:

c (Tensor) – Input tensor (B, in_channels, T).

Returns:

Output tensor (B, 1, T * prod(upsample_scales)).

Return type:

Tensor

inference(c: torch.Tensor) → torch.Tensor[source]

Perform inference.

Parameters:

c (Tensor) – Input tensor (T, in_channels).

Returns:

Output tensor (T * prod(upsample_scales), out_channels).

Return type:

Tensor

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

reset_parameters()[source]

Reset parameters.

This initialization follows the official implementation manner. https://github.com/descriptinc/melgan-neurips/blob/master/mel2wav/modules.py
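
Example (an illustrative sketch with the default configuration, i.e. 80-dim mel input and prod([8, 8, 2, 2]) = 256x upsampling):

>>> import torch
>>> from espnet2.gan_tts.melgan.melgan import MelGANGenerator
>>> gen = MelGANGenerator()
>>> c = torch.randn(2, 80, 40)          # (B, in_channels, T)
>>> wav = gen(c)                        # (B, out_channels, T * prod(upsample_scales))
>>> wav.shape
torch.Size([2, 1, 10240])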

class espnet2.gan_tts.melgan.melgan.MelGANMultiScaleDiscriminator(in_channels: int = 1, out_channels: int = 1, scales: int = 3, downsample_pooling: str = 'AvgPool1d', downsample_pooling_params: Dict[str, Any] = {'count_include_pad': False, 'kernel_size': 4, 'padding': 1, 'stride': 2}, kernel_sizes: List[int] = [5, 3], channels: int = 16, max_downsample_channels: int = 1024, bias: bool = True, downsample_scales: List[int] = [4, 4, 4, 4], nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.2}, pad: str = 'ReflectionPad1d', pad_params: Dict[str, Any] = {}, use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

MelGAN multi-scale discriminator module.

Initialize MelGANMultiScaleDiscriminator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • scales (int) – Number of multi-scales.

  • downsample_pooling (str) – Pooling module name for downsampling of the inputs.

  • downsample_pooling_params (Dict[str, Any]) – Parameters for the above pooling module.

  • kernel_sizes (List[int]) – List of two kernel sizes. The prod will be used for the first conv layer, and the first and the second kernel sizes will be used for the last two layers.

  • channels (int) – Initial number of channels for conv layer.

  • max_downsample_channels (int) – Maximum number of channels for downsampling layers.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • downsample_scales (List[int]) – List of downsampling scales.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • pad (str) – Padding function module name before dilated convolution layer.

  • pad_params (Dict[str, Any]) – Hyperparameters for padding function.

  • use_weight_norm (bool) – Whether to use weight norm.

apply_weight_norm()[source]

Apply weight normalization to all of the layers.

forward(x: torch.Tensor) → List[List[torch.Tensor]][source]

Calculate forward propagation.

Parameters:

x (Tensor) – Input signal (B, 1, T).

Returns:

List of lists of each discriminator's outputs, which consist of each layer's output tensors.

Return type:

List[List[Tensor]]

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

reset_parameters()[source]

Reset parameters.

This initialization follows the official implementation manner. https://github.com/descriptinc/melgan-neurips/blob/master/mel2wav/modules.py

espnet2.gan_tts.melgan.pqmf

Pseudo QMF modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.melgan.pqmf.PQMF(subbands: int = 4, taps: int = 62, cutoff_ratio: float = 0.142, beta: float = 9.0)[source]

Bases: torch.nn.modules.module.Module

PQMF module.

This module is based on Near-perfect-reconstruction pseudo-QMF banks.

Initialize PQMF module.

The cutoff_ratio and beta parameters are optimized for #subbands = 4. See discussion in https://github.com/kan-bayashi/ParallelWaveGAN/issues/195.

Parameters:
  • subbands (int) – The number of subbands.

  • taps (int) – The number of filter taps.

  • cutoff_ratio (float) – Cut-off frequency ratio.

  • beta (float) – Beta coefficient for kaiser window.

analysis(x: torch.Tensor) → torch.Tensor[source]

Analysis with PQMF.

Parameters:

x (Tensor) – Input tensor (B, 1, T).

Returns:

Output tensor (B, subbands, T // subbands).

Return type:

Tensor

synthesis(x: torch.Tensor) → torch.Tensor[source]

Synthesis with PQMF.

Parameters:

x (Tensor) – Input tensor (B, subbands, T // subbands).

Returns:

Output tensor (B, 1, T).

Return type:

Tensor
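
Example (an analysis/synthesis round trip; the signal is random and its length is arbitrary, but near-perfect reconstruction is expected):

>>> import torch
>>> from espnet2.gan_tts.melgan.pqmf import PQMF
>>> pqmf = PQMF(subbands=4)
>>> x = torch.randn(1, 1, 8000)         # (B, 1, T)
>>> subbands = pqmf.analysis(x)         # (B, subbands, T // subbands)
>>> x_hat = pqmf.synthesis(subbands)    # (B, 1, T)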

espnet2.gan_tts.melgan.pqmf.design_prototype_filter(taps: int = 62, cutoff_ratio: float = 0.142, beta: float = 9.0) → numpy.ndarray[source]

Design prototype filter for PQMF.

This method is based on A Kaiser window approach for the design of prototype filters of cosine modulated filterbanks.

Parameters:
  • taps (int) – The number of filter taps.

  • cutoff_ratio (float) – Cut-off frequency ratio.

  • beta (float) – Beta coefficient for kaiser window.

Returns:

Impulse response of prototype filter (taps + 1,).

Return type:

ndarray

espnet2.gan_tts.melgan.__init__

espnet2.gan_tts.melgan.residual_stack

Residual stack module in MelGAN.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.melgan.residual_stack.ResidualStack(kernel_size: int = 3, channels: int = 32, dilation: int = 1, bias: bool = True, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.2}, pad: str = 'ReflectionPad1d', pad_params: Dict[str, Any] = {})[source]

Bases: torch.nn.modules.module.Module

Residual stack module introduced in MelGAN.

Initialize ResidualStack module.

Parameters:
  • kernel_size (int) – Kernel size of dilation convolution layer.

  • channels (int) – Number of channels of convolution layers.

  • dilation (int) – Dilation factor.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • pad (str) – Padding function module name before dilated convolution layer.

  • pad_params (Dict[str, Any]) – Hyperparameters for padding function.

forward(c: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters:

c (Tensor) – Input tensor (B, channels, T).

Returns:

Output tensor (B, channels, T).

Return type:

Tensor

espnet2.gan_tts.utils.get_random_segments

Function to get random segments.

espnet2.gan_tts.utils.get_random_segments.get_random_segments(x: torch.Tensor, x_lengths: torch.Tensor, segment_size: int) → Tuple[torch.Tensor, torch.Tensor][source]

Get random segments.

Parameters:
  • x (Tensor) – Input tensor (B, C, T).

  • x_lengths (Tensor) – Length tensor (B,).

  • segment_size (int) – Segment size.

Returns:

Segmented tensor (B, C, segment_size) and start index tensor (B,).

Return type:

Tuple[Tensor, Tensor]

espnet2.gan_tts.utils.get_random_segments.get_segments(x: torch.Tensor, start_idxs: torch.Tensor, segment_size: int) → torch.Tensor[source]

Get segments.

Parameters:
  • x (Tensor) – Input tensor (B, C, T).

  • start_idxs (Tensor) – Start index tensor (B,).

  • segment_size (int) – Segment size.

Returns:

Segmented tensor (B, C, segment_size).

Return type:

Tensor
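
Example (an illustrative sketch; in VITS/JETS the returned start indices are reused, rescaled by the upsampling factor, to crop the matching waveform segment):

>>> import torch
>>> from espnet2.gan_tts.utils.get_random_segments import (
...     get_random_segments,
...     get_segments,
... )
>>> x = torch.randn(2, 80, 100)         # (B, C, T)
>>> x_lengths = torch.tensor([100, 80])
>>> seg, start_idxs = get_random_segments(x, x_lengths, segment_size=32)
>>> y = torch.randn(2, 80, 100)         # another tensor with the same layout
>>> seg_y = get_segments(y, start_idxs, segment_size=32)  # same crop positions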

espnet2.gan_tts.utils.__init__

espnet2.gan_tts.wavenet.residual_block

Residual block modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.wavenet.residual_block.Conv1d(*args, **kwargs)[source]

Bases: torch.nn.modules.conv.Conv1d

Conv1d module with customized initialization.

Initialize Conv1d module.

reset_parameters()[source]

Reset parameters.

class espnet2.gan_tts.wavenet.residual_block.Conv1d1x1(in_channels: int, out_channels: int, bias: bool)[source]

Bases: espnet2.gan_tts.wavenet.residual_block.Conv1d

1x1 Conv1d with customized initialization.

Initialize 1x1 Conv1d module.

class espnet2.gan_tts.wavenet.residual_block.ResidualBlock(kernel_size: int = 3, residual_channels: int = 64, gate_channels: int = 128, skip_channels: int = 64, aux_channels: int = 80, global_channels: int = -1, dropout_rate: float = 0.0, dilation: int = 1, bias: bool = True, scale_residual: bool = False)[source]

Bases: torch.nn.modules.module.Module

Residual block module in WaveNet.

Initialize ResidualBlock module.

Parameters:
  • kernel_size (int) – Kernel size of dilation convolution layer.

  • residual_channels (int) – Number of channels for residual connection.

  • skip_channels (int) – Number of channels for skip connection.

  • aux_channels (int) – Number of local conditioning channels.

  • gate_channels (int) – Number of channels for the gated convolution.

  • global_channels (int) – Number of global conditioning channels.

  • dropout_rate (float) – Dropout rate.

  • dilation (int) – Dilation factor.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • scale_residual (bool) – Whether to scale the residual outputs.

forward(x: torch.Tensor, x_mask: Optional[torch.Tensor] = None, c: Optional[torch.Tensor] = None, g: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • x (Tensor) – Input tensor (B, residual_channels, T).

  • x_mask (Optional[Tensor]) – Mask tensor (B, 1, T).

  • c (Optional[Tensor]) – Local conditioning tensor (B, aux_channels, T).

  • g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).

Returns:

Output tensor for the residual connection (B, residual_channels, T) and output tensor for the skip connection (B, skip_channels, T).

Return type:

Tuple[Tensor, Tensor]

espnet2.gan_tts.wavenet.__init__

espnet2.gan_tts.wavenet.wavenet

WaveNet modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.wavenet.wavenet.WaveNet(in_channels: int = 1, out_channels: int = 1, kernel_size: int = 3, layers: int = 30, stacks: int = 3, base_dilation: int = 2, residual_channels: int = 64, aux_channels: int = -1, gate_channels: int = 128, skip_channels: int = 64, global_channels: int = -1, dropout_rate: float = 0.0, bias: bool = True, use_weight_norm: bool = True, use_first_conv: bool = False, use_last_conv: bool = False, scale_residual: bool = False, scale_skip_connect: bool = False)[source]

Bases: torch.nn.modules.module.Module

WaveNet with global conditioning.

Initialize WaveNet module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • kernel_size (int) – Kernel size of dilated convolution.

  • layers (int) – Number of residual block layers.

  • stacks (int) – Number of stacks, i.e., dilation cycles.

  • base_dilation (int) – Base dilation factor.

  • residual_channels (int) – Number of channels in residual conv.

  • gate_channels (int) – Number of channels in gated conv.

  • skip_channels (int) – Number of channels in skip conv.

  • aux_channels (int) – Number of channels for local conditioning feature.

  • global_channels (int) – Number of channels for global conditioning feature.

  • dropout_rate (float) – Dropout rate. 0.0 means no dropout applied.

  • bias (bool) – Whether to use bias parameter in conv layer.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

  • use_first_conv (bool) – Whether to use the first conv layers.

  • use_last_conv (bool) – Whether to use the last conv layers.

  • scale_residual (bool) – Whether to scale the residual outputs.

  • scale_skip_connect (bool) – Whether to scale the skip connection outputs.

apply_weight_norm()[source]

Apply weight normalization to all of the layers.

forward(x: torch.Tensor, x_mask: Optional[torch.Tensor] = None, c: Optional[torch.Tensor] = None, g: Optional[torch.Tensor] = None) → torch.Tensor[source]

Calculate forward propagation.

Parameters:
  • x (Tensor) – Input tensor (B, 1, T) if use_first_conv else (B, residual_channels, T).

  • x_mask (Optional[Tensor]) – Mask tensor (B, 1, T).

  • c (Optional[Tensor]) – Local conditioning features (B, aux_channels, T).

  • g (Optional[Tensor]) – Global conditioning features (B, global_channels, 1).

Returns:

Output tensor (B, out_channels, T) if use_last_conv else (B, residual_channels, T).

Return type:

Tensor

property receptive_field_size

Return receptive field size.

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.
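
Example (an illustrative sketch of the mask-aware encoder usage, as in the VITS posterior encoder, with use_first_conv and use_last_conv disabled; all sizes are arbitrary):

>>> import torch
>>> from espnet2.gan_tts.wavenet.wavenet import WaveNet
>>> net = WaveNet(
...     layers=6, stacks=2, residual_channels=64, global_channels=128,
...     use_first_conv=False, use_last_conv=False,
... )
>>> x = torch.randn(2, 64, 50)          # (B, residual_channels, T)
>>> x_mask = torch.ones(2, 1, 50)       # (B, 1, T)
>>> g = torch.randn(2, 128, 1)          # (B, global_channels, 1)
>>> h = net(x, x_mask, g=g)             # (B, residual_channels, T)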

espnet2.gan_tts.hifigan.hifigan

HiFi-GAN Modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.hifigan.hifigan.HiFiGANGenerator(in_channels: int = 80, out_channels: int = 1, channels: int = 512, global_channels: int = -1, kernel_size: int = 7, upsample_scales: List[int] = [8, 8, 2, 2], upsample_kernel_sizes: List[int] = [16, 16, 4, 4], resblock_kernel_sizes: List[int] = [3, 7, 11], resblock_dilations: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], use_additional_convs: bool = True, bias: bool = True, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.1}, use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

HiFiGAN generator module.

Initialize HiFiGANGenerator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • channels (int) – Number of hidden representation channels.

  • global_channels (int) – Number of global conditioning channels.

  • kernel_size (int) – Kernel size of initial and final conv layer.

  • upsample_scales (List[int]) – List of upsampling scales.

  • upsample_kernel_sizes (List[int]) – List of kernel sizes for upsample layers.

  • resblock_kernel_sizes (List[int]) – List of kernel sizes for residual blocks.

  • resblock_dilations (List[List[int]]) – List of list of dilations for residual blocks.

  • use_additional_convs (bool) – Whether to use additional conv layers in residual blocks.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

apply_weight_norm()[source]

Apply weight normalization to all of the layers.

forward(c: torch.Tensor, g: Optional[torch.Tensor] = None) → torch.Tensor[source]

Calculate forward propagation.

Parameters:
  • c (Tensor) – Input tensor (B, in_channels, T).

  • g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).

Returns:

Output tensor (B, out_channels, T).

Return type:

Tensor

inference(c: torch.Tensor, g: Optional[torch.Tensor] = None) → torch.Tensor[source]

Perform inference.

Parameters:
  • c (torch.Tensor) – Input tensor (T, in_channels).

  • g (Optional[Tensor]) – Global conditioning tensor (global_channels, 1).

Returns:

Output tensor (T * upsample_factor, out_channels).

Return type:

Tensor

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

reset_parameters()[source]

Reset parameters.

This initialization follows the official implementation manner. https://github.com/jik876/hifi-gan/blob/master/models.py
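
Example (an illustrative sketch with the default configuration, i.e. 80-dim mel input and prod([8, 8, 2, 2]) = 256x upsampling):

>>> import torch
>>> from espnet2.gan_tts.hifigan.hifigan import HiFiGANGenerator
>>> gen = HiFiGANGenerator()
>>> c = torch.randn(2, 80, 40)          # (B, in_channels, T)
>>> wav = gen(c)                        # (B, out_channels, T * 256)
>>> wav.shape
torch.Size([2, 1, 10240])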

class espnet2.gan_tts.hifigan.hifigan.HiFiGANMultiPeriodDiscriminator(periods: List[int] = [2, 3, 5, 7, 11], discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True})[source]

Bases: torch.nn.modules.module.Module

HiFiGAN multi-period discriminator module.

Initialize HiFiGANMultiPeriodDiscriminator module.

Parameters:
  • periods (List[int]) – List of periods.

  • discriminator_params (Dict[str, Any]) – Parameters for hifi-gan period discriminator module. The period parameter will be overwritten.

forward(x: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters:

x (Tensor) – Input signal (B, 1, T).

Returns:

List of lists of each discriminator's outputs, which consist of each layer's output tensors.

Return type:

List

class espnet2.gan_tts.hifigan.hifigan.HiFiGANMultiScaleDiscriminator(scales: int = 3, downsample_pooling: str = 'AvgPool1d', downsample_pooling_params: Dict[str, Any] = {'kernel_size': 4, 'padding': 2, 'stride': 2}, discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1}, follow_official_norm: bool = False)[source]

Bases: torch.nn.modules.module.Module

HiFi-GAN multi-scale discriminator module.

Initialize HiFiGAN multi-scale discriminator module.

Parameters:
  • scales (int) – Number of multi-scales.

  • downsample_pooling (str) – Pooling module name for downsampling of the inputs.

  • downsample_pooling_params (Dict[str, Any]) – Parameters for the above pooling module.

  • discriminator_params (Dict[str, Any]) – Parameters for hifi-gan scale discriminator module.

  • follow_official_norm (bool) – Whether to follow the norm setting of the official implementation. The first discriminator uses spectral norm and the other discriminators use weight norm.

forward(x: torch.Tensor) → List[List[torch.Tensor]][source]

Calculate forward propagation.

Parameters:

x (Tensor) – Input signal (B, 1, T).

Returns:

List of lists of each discriminator's outputs, which consist of each layer's output tensors.

Return type:

List[List[torch.Tensor]]

class espnet2.gan_tts.hifigan.hifigan.HiFiGANMultiScaleMultiPeriodDiscriminator(scales: int = 3, scale_downsample_pooling: str = 'AvgPool1d', scale_downsample_pooling_params: Dict[str, Any] = {'kernel_size': 4, 'padding': 2, 'stride': 2}, scale_discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1}, follow_official_norm: bool = True, periods: List[int] = [2, 3, 5, 7, 11], period_discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True})[source]

Bases: torch.nn.modules.module.Module

HiFi-GAN multi-scale + multi-period discriminator module.

Initialize HiFiGAN multi-scale + multi-period discriminator module.

Parameters:
  • scales (int) – Number of multi-scales.

  • scale_downsample_pooling (str) – Pooling module name for downsampling of the inputs.

  • scale_downsample_pooling_params (dict) – Parameters for the above pooling module.

  • scale_discriminator_params (dict) – Parameters for hifi-gan scale discriminator module.

  • follow_official_norm (bool) – Whether to follow the norm setting of the official implementation. The first discriminator uses spectral norm and the other discriminators use weight norm.

  • periods (list) – List of periods.

  • period_discriminator_params (dict) – Parameters for hifi-gan period discriminator module. The period parameter will be overwritten.

forward(x: torch.Tensor) → List[List[torch.Tensor]][source]

Calculate forward propagation.

Parameters:

x (Tensor) – Input signal (B, 1, T).

Returns:

List of lists of each discriminator's outputs, which consist of each layer's output tensors. Multi-scale and multi-period outputs are concatenated.

Return type:

List[List[Tensor]]
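
Example (an illustrative sketch; the waveform length is arbitrary, and with the defaults the output list has scales + len(periods) = 3 + 5 = 8 entries, each a list of per-layer feature maps):

>>> import torch
>>> from espnet2.gan_tts.hifigan.hifigan import (
...     HiFiGANMultiScaleMultiPeriodDiscriminator,
... )
>>> disc = HiFiGANMultiScaleMultiPeriodDiscriminator()
>>> wav = torch.randn(2, 1, 8192)       # real or generated waveform
>>> outs = disc(wav)
>>> len(outs)
8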

class espnet2.gan_tts.hifigan.hifigan.HiFiGANPeriodDiscriminator(in_channels: int = 1, out_channels: int = 1, period: int = 3, kernel_sizes: List[int] = [5, 3], channels: int = 32, downsample_scales: List[int] = [3, 3, 3, 3, 1], max_downsample_channels: int = 1024, bias: bool = True, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.1}, use_weight_norm: bool = True, use_spectral_norm: bool = False)[source]

Bases: torch.nn.modules.module.Module

HiFiGAN period discriminator module.

Initialize HiFiGANPeriodDiscriminator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • period (int) – Period.

  • kernel_sizes (list) – Kernel sizes of initial conv layers and the final conv layer.

  • channels (int) – Number of initial channels.

  • downsample_scales (List[int]) – List of downsampling scales.

  • max_downsample_channels (int) – Number of maximum downsampling channels.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

  • use_spectral_norm (bool) – Whether to use spectral norm. If set to true, it will be applied to all of the conv layers.

apply_spectral_norm()[source]

Apply spectral normalization to all of the layers.

apply_weight_norm()[source]

Apply weight normalization to all of the layers.

forward(x: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters:

x (Tensor) – Input tensor (B, in_channels, T).

Returns:

List of each layer’s tensors.

Return type:

list

class espnet2.gan_tts.hifigan.hifigan.HiFiGANScaleDiscriminator(in_channels: int = 1, out_channels: int = 1, kernel_sizes: List[int] = [15, 41, 5, 3], channels: int = 128, max_downsample_channels: int = 1024, max_groups: int = 16, bias: int = True, downsample_scales: List[int] = [2, 2, 4, 4, 1], nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.1}, use_weight_norm: bool = True, use_spectral_norm: bool = False)[source]

Bases: torch.nn.modules.module.Module

HiFi-GAN scale discriminator module.

Initialize HiFiGAN scale discriminator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • kernel_sizes (List[int]) – List of four kernel sizes. The first will be used for the first conv layer, the second for the downsampling part, and the remaining two for the last two output layers.

  • channels (int) – Initial number of channels for conv layer.

  • max_downsample_channels (int) – Maximum number of channels for downsampling layers.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • downsample_scales (List[int]) – List of downsampling scales.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

  • use_spectral_norm (bool) – Whether to use spectral norm. If set to true, it will be applied to all of the conv layers.

apply_spectral_norm()[source]

Apply spectral normalization to all of the layers.

apply_weight_norm()[source]

Apply weight normalization to all of the layers.

forward(x: torch.Tensor) → List[torch.Tensor][source]

Calculate forward propagation.

Parameters:

x (Tensor) – Input signal (B, 1, T).

Returns:

List of output tensors of each layer.

Return type:

List[Tensor]

remove_spectral_norm()[source]

Remove spectral normalization module from all of the layers.

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

espnet2.gan_tts.hifigan.residual_block

HiFiGAN Residual block modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.hifigan.residual_block.ResidualBlock(kernel_size: int = 3, channels: int = 512, dilations: List[int] = [1, 3, 5], bias: bool = True, use_additional_convs: bool = True, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.1})[source]

Bases: torch.nn.modules.module.Module

Residual block module in HiFiGAN.

Initialize ResidualBlock module.

Parameters:
  • kernel_size (int) – Kernel size of dilation convolution layer.

  • channels (int) – Number of channels for convolution layer.

  • dilations (List[int]) – List of dilation factors.

  • use_additional_convs (bool) – Whether to use additional convolution layers.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

forward(x: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters:

x (Tensor) – Input tensor (B, channels, T).

Returns:

Output tensor (B, channels, T).

Return type:

Tensor

espnet2.gan_tts.hifigan.loss

HiFiGAN-related loss modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.hifigan.loss.DiscriminatorAdversarialLoss(average_by_discriminators: bool = True, loss_type: str = 'mse')[source]

Bases: torch.nn.modules.module.Module

Discriminator adversarial loss module.

Initialize DiscriminatorAdversarialLoss module.

Parameters:
  • average_by_discriminators (bool) – Whether to average the loss by the number of discriminators.

  • loss_type (str) – Loss type, “mse” or “hinge”.

forward(outputs_hat: Union[List[List[torch.Tensor]], List[torch.Tensor], torch.Tensor], outputs: Union[List[List[torch.Tensor]], List[torch.Tensor], torch.Tensor]) → Tuple[torch.Tensor, torch.Tensor][source]

Calculate discriminator adversarial loss.

Parameters:
  • outputs_hat (Union[List[List[Tensor]], List[Tensor], Tensor]) – Discriminator outputs, list of discriminator outputs, or list of list of discriminator outputs calculated from generator.

  • outputs (Union[List[List[Tensor]], List[Tensor], Tensor]) – Discriminator outputs, list of discriminator outputs, or list of list of discriminator outputs calculated from groundtruth.

Returns:

Discriminator real loss value and discriminator fake loss value.

Return type:

Tuple[Tensor, Tensor]

class espnet2.gan_tts.hifigan.loss.FeatureMatchLoss(average_by_layers: bool = True, average_by_discriminators: bool = True, include_final_outputs: bool = False)[source]

Bases: torch.nn.modules.module.Module

Feature matching loss module.

Initialize FeatureMatchLoss module.

Parameters:
  • average_by_layers (bool) – Whether to average the loss by the number of layers.

  • average_by_discriminators (bool) – Whether to average the loss by the number of discriminators.

  • include_final_outputs (bool) – Whether to include the final output of each discriminator for loss calculation.

forward(feats_hat: Union[List[List[torch.Tensor]], List[torch.Tensor]], feats: Union[List[List[torch.Tensor]], List[torch.Tensor]]) → torch.Tensor[source]

Calculate feature matching loss.

Parameters:
  • feats_hat (Union[List[List[Tensor]], List[Tensor]]) – List of lists of discriminator outputs or list of discriminator outputs calculated from the generator's outputs.

  • feats (Union[List[List[Tensor]], List[Tensor]]) – List of lists of discriminator outputs or list of discriminator outputs calculated from the groundtruth.

Returns:

Feature matching loss value.

Return type:

Tensor
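
Example (an illustrative sketch; in training the two arguments come from running the discriminator on generated and real audio, for which random feature maps stand in here):

>>> import torch
>>> from espnet2.gan_tts.hifigan.loss import FeatureMatchLoss
>>> criterion = FeatureMatchLoss()
>>> feats = [[torch.randn(2, 32, 100), torch.randn(2, 64, 50)]]      # real
>>> feats_hat = [[torch.randn(2, 32, 100), torch.randn(2, 64, 50)]]  # generated
>>> loss = criterion(feats_hat, feats)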

class espnet2.gan_tts.hifigan.loss.GeneratorAdversarialLoss(average_by_discriminators: bool = True, loss_type: str = 'mse')[source]

Bases: torch.nn.modules.module.Module

Generator adversarial loss module.

Initialize GeneratorAdversarialLoss module.

Parameters:
  • average_by_discriminators (bool) – Whether to average the loss by the number of discriminators.

  • loss_type (str) – Loss type, “mse” or “hinge”.

forward(outputs: Union[List[List[torch.Tensor]], List[torch.Tensor], torch.Tensor]) → torch.Tensor[source]

Calculate generator adversarial loss.

Parameters:

outputs (Union[List[List[Tensor]], List[Tensor], Tensor]) – Discriminator outputs, list of discriminator outputs, or list of list of discriminator outputs.

Returns:

Generator adversarial loss value.

Return type:

Tensor
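
Example (an illustrative sketch of how the generator and discriminator adversarial losses pair up during training; random tensors stand in for discriminator outputs):

>>> import torch
>>> from espnet2.gan_tts.hifigan.loss import (
...     DiscriminatorAdversarialLoss,
...     GeneratorAdversarialLoss,
... )
>>> gen_adv_loss = GeneratorAdversarialLoss(loss_type="mse")
>>> dis_adv_loss = DiscriminatorAdversarialLoss(loss_type="mse")
>>> outputs_hat = [torch.randn(2, 1, 100)]   # outputs for generated audio
>>> outputs = [torch.randn(2, 1, 100)]       # outputs for real audio
>>> adv_loss = gen_adv_loss(outputs_hat)                       # generator side
>>> real_loss, fake_loss = dis_adv_loss(outputs_hat, outputs)  # discriminator side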

class espnet2.gan_tts.hifigan.loss.MelSpectrogramLoss(fs: int = 22050, n_fft: int = 1024, hop_length: int = 256, win_length: Optional[int] = None, window: str = 'hann', n_mels: int = 80, fmin: Optional[int] = 0, fmax: Optional[int] = None, center: bool = True, normalized: bool = False, onesided: bool = True, log_base: Optional[float] = 10.0)[source]

Bases: torch.nn.modules.module.Module

Mel-spectrogram loss.

Initialize Mel-spectrogram loss.

Parameters:
  • fs (int) – Sampling rate.

  • n_fft (int) – FFT points.

  • hop_length (int) – Hop length.

  • win_length (Optional[int]) – Window length.

  • window (str) – Window type.

  • n_mels (int) – Number of Mel basis.

  • fmin (Optional[int]) – Minimum frequency for Mel.

  • fmax (Optional[int]) – Maximum frequency for Mel.

  • center (bool) – Whether to use center window.

  • normalized (bool) – Whether to use a normalized STFT.

  • onesided (bool) – Whether to use a one-sided STFT.

  • log_base (Optional[float]) – Log base value.

forward(y_hat: torch.Tensor, y: torch.Tensor, spec: Optional[torch.Tensor] = None) → torch.Tensor[source]

Calculate Mel-spectrogram loss.

Parameters:
  • y_hat (Tensor) – Generated waveform tensor (B, 1, T).

  • y (Tensor) – Groundtruth waveform tensor (B, 1, T).

  • spec (Optional[Tensor]) – Groundtruth linear amplitude spectrum tensor (B, T, n_fft // 2 + 1). If provided, it is used instead of the groundtruth waveform.

Returns:

Mel-spectrogram loss value.

Return type:

Tensor
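
Example (an illustrative sketch; random waveforms stand in for generated and groundtruth audio):

>>> import torch
>>> from espnet2.gan_tts.hifigan.loss import MelSpectrogramLoss
>>> criterion = MelSpectrogramLoss(fs=22050, n_fft=1024, hop_length=256, n_mels=80)
>>> y = torch.randn(2, 1, 8192)         # groundtruth waveform (B, 1, T)
>>> y_hat = torch.randn(2, 1, 8192)     # generated waveform (B, 1, T)
>>> loss = criterion(y_hat, y)          # scalar loss between mel spectrograms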

espnet2.gan_tts.hifigan.__init__

class espnet2.gan_tts.hifigan.__init__.HiFiGANGenerator(in_channels: int = 80, out_channels: int = 1, channels: int = 512, global_channels: int = -1, kernel_size: int = 7, upsample_scales: List[int] = [8, 8, 2, 2], upsample_kernel_sizes: List[int] = [16, 16, 4, 4], resblock_kernel_sizes: List[int] = [3, 7, 11], resblock_dilations: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], use_additional_convs: bool = True, bias: bool = True, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.1}, use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

HiFiGAN generator module.

Initialize HiFiGANGenerator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • channels (int) – Number of hidden representation channels.

  • global_channels (int) – Number of global conditioning channels.

  • kernel_size (int) – Kernel size of initial and final conv layer.

  • upsample_scales (List[int]) – List of upsampling scales.

  • upsample_kernel_sizes (List[int]) – List of kernel sizes for upsample layers.

  • resblock_kernel_sizes (List[int]) – List of kernel sizes for residual blocks.

  • resblock_dilations (List[List[int]]) – List of list of dilations for residual blocks.

  • use_additional_convs (bool) – Whether to use additional conv layers in residual blocks.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

apply_weight_norm()[source]

Apply weight normalization to all of the layers.

forward(c: torch.Tensor, g: Optional[torch.Tensor] = None) → torch.Tensor[source]

Calculate forward propagation.

Parameters:
  • c (Tensor) – Input tensor (B, in_channels, T).

  • g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).

Returns:

Output tensor (B, out_channels, T).

Return type:

Tensor

inference(c: torch.Tensor, g: Optional[torch.Tensor] = None) → torch.Tensor[source]

Perform inference.

Parameters:
  • c (torch.Tensor) – Input tensor (T, in_channels).

  • g (Optional[Tensor]) – Global conditioning tensor (global_channels, 1).

Returns:

Output tensor (T * upsample_factor, out_channels).

Return type:

Tensor

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

reset_parameters()[source]

Reset parameters.

This initialization follows the official implementation manner. https://github.com/jik876/hifi-gan/blob/master/models.py

class espnet2.gan_tts.hifigan.__init__.HiFiGANMultiPeriodDiscriminator(periods: List[int] = [2, 3, 5, 7, 11], discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True})[source]

Bases: torch.nn.modules.module.Module

HiFiGAN multi-period discriminator module.

Initialize HiFiGANMultiPeriodDiscriminator module.

Parameters:
  • periods (List[int]) – List of periods.

  • discriminator_params (Dict[str, Any]) – Parameters for hifi-gan period discriminator module. The period parameter will be overwritten.

forward(x: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters:

x (Tensor) – Input signal (B, 1, T).

Returns:

List of lists of each discriminator's outputs, which consist of each layer's output tensors.

Return type:

List

class espnet2.gan_tts.hifigan.__init__.HiFiGANMultiScaleDiscriminator(scales: int = 3, downsample_pooling: str = 'AvgPool1d', downsample_pooling_params: Dict[str, Any] = {'kernel_size': 4, 'padding': 2, 'stride': 2}, discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1}, follow_official_norm: bool = False)[source]

Bases: torch.nn.modules.module.Module

HiFi-GAN multi-scale discriminator module.

Initialize HiFiGAN multi-scale discriminator module.

Parameters:
  • scales (int) – Number of multi-scales.

  • downsample_pooling (str) – Pooling module name for downsampling of the inputs.

  • downsample_pooling_params (Dict[str, Any]) – Parameters for the above pooling module.

  • discriminator_params (Dict[str, Any]) – Parameters for hifi-gan scale discriminator module.

  • follow_official_norm (bool) – Whether to follow the norm setting of the official implementation. The first discriminator uses spectral norm and the other discriminators use weight norm.

forward(x: torch.Tensor) → List[List[torch.Tensor]][source]

Calculate forward propagation.

Parameters:

x (Tensor) – Input signal (B, 1, T).

Returns:

List of lists of each discriminator's outputs, which consist of each layer's output tensors.

Return type:

List[List[torch.Tensor]]

class espnet2.gan_tts.hifigan.__init__.HiFiGANMultiScaleMultiPeriodDiscriminator(scales: int = 3, scale_downsample_pooling: str = 'AvgPool1d', scale_downsample_pooling_params: Dict[str, Any] = {'kernel_size': 4, 'padding': 2, 'stride': 2}, scale_discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1}, follow_official_norm: bool = True, periods: List[int] = [2, 3, 5, 7, 11], period_discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True})[source]

Bases: torch.nn.modules.module.Module

HiFi-GAN multi-scale + multi-period discriminator module.

Initialize HiFiGAN multi-scale + multi-period discriminator module.

Parameters:
  • scales (int) – Number of multi-scales.

  • scale_downsample_pooling (str) – Pooling module name for downsampling of the inputs.

  • scale_downsample_pooling_params (dict) – Parameters for the above pooling module.

  • scale_discriminator_params (dict) – Parameters for hifi-gan scale discriminator module.

  • follow_official_norm (bool) – Whether to follow the norm setting of the official implementation. The first discriminator uses spectral norm and the other discriminators use weight norm.

  • periods (list) – List of periods.

  • period_discriminator_params (dict) – Parameters for hifi-gan period discriminator module. The period parameter will be overwritten.

forward(x: torch.Tensor) → List[List[torch.Tensor]][source]

Calculate forward propagation.

Parameters:

x (Tensor) – Input signal (B, 1, T).

Returns:

List of lists of each discriminator's outputs, which consist of each layer's output tensors. Multi-scale and multi-period outputs are concatenated.

Return type:

List[List[Tensor]]

class espnet2.gan_tts.hifigan.__init__.HiFiGANPeriodDiscriminator(in_channels: int = 1, out_channels: int = 1, period: int = 3, kernel_sizes: List[int] = [5, 3], channels: int = 32, downsample_scales: List[int] = [3, 3, 3, 3, 1], max_downsample_channels: int = 1024, bias: bool = True, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.1}, use_weight_norm: bool = True, use_spectral_norm: bool = False)[source]

Bases: torch.nn.modules.module.Module

HiFiGAN period discriminator module.

Initialize HiFiGANPeriodDiscriminator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • period (int) – Period.

  • kernel_sizes (list) – Kernel sizes of initial conv layers and the final conv layer.

  • channels (int) – Number of initial channels.

  • downsample_scales (List[int]) – List of downsampling scales.

  • max_downsample_channels (int) – Number of maximum downsampling channels.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

  • use_spectral_norm (bool) – Whether to use spectral norm. If set to true, it will be applied to all of the conv layers.

apply_spectral_norm()[source]

Apply spectral normalization to all of the layers.

apply_weight_norm()[source]

Apply weight normalization to all of the layers.

forward(x: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters:

x (Tensor) – Input tensor (B, in_channels, T).

Returns:

List of each layer’s tensors.

Return type:

list

class espnet2.gan_tts.hifigan.__init__.HiFiGANScaleDiscriminator(in_channels: int = 1, out_channels: int = 1, kernel_sizes: List[int] = [15, 41, 5, 3], channels: int = 128, max_downsample_channels: int = 1024, max_groups: int = 16, bias: int = True, downsample_scales: List[int] = [2, 2, 4, 4, 1], nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.1}, use_weight_norm: bool = True, use_spectral_norm: bool = False)[source]

Bases: torch.nn.modules.module.Module

HiFi-GAN scale discriminator module.

Initialize HiFiGAN scale discriminator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • kernel_sizes (List[int]) – List of four kernel sizes. The first will be used for the first conv layer, the second for the downsampling part, and the remaining two for the last two output layers.

  • channels (int) – Initial number of channels for conv layer.

  • max_downsample_channels (int) – Maximum number of channels for downsampling layers.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • downsample_scales (List[int]) – List of downsampling scales.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

  • use_spectral_norm (bool) – Whether to use spectral norm. If set to true, it will be applied to all of the conv layers.

apply_spectral_norm()[source]

Apply spectral normalization to all of the layers.

apply_weight_norm()[source]

Apply weight normalization to all of the layers.

forward(x: torch.Tensor) → List[torch.Tensor][source]

Calculate forward propagation.

Parameters:

x (Tensor) – Input signal (B, 1, T).

Returns:

List of output tensors of each layer.

Return type:

List[Tensor]

remove_spectral_norm()[source]

Remove spectral normalization module from all of the layers.

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

espnet2.gan_tts.jets.generator

Generator module in JETS.

class espnet2.gan_tts.jets.generator.JETSGenerator(idim: int, odim: int, adim: int = 256, aheads: int = 2, elayers: int = 4, eunits: int = 1024, dlayers: int = 4, dunits: int = 1024, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 1, use_scaled_pos_enc: bool = True, use_batch_norm: bool = True, encoder_normalize_before: bool = True, decoder_normalize_before: bool = True, encoder_concat_after: bool = False, decoder_concat_after: bool = False, reduction_factor: int = 1, encoder_type: str = 'transformer', decoder_type: str = 'transformer', transformer_enc_dropout_rate: float = 0.1, transformer_enc_positional_dropout_rate: float = 0.1, transformer_enc_attn_dropout_rate: float = 0.1, transformer_dec_dropout_rate: float = 0.1, transformer_dec_positional_dropout_rate: float = 0.1, transformer_dec_attn_dropout_rate: float = 0.1, conformer_rel_pos_type: str = 'legacy', conformer_pos_enc_layer_type: str = 'rel_pos', conformer_self_attn_layer_type: str = 'rel_selfattn', conformer_activation_type: str = 'swish', use_macaron_style_in_conformer: bool = True, use_cnn_in_conformer: bool = True, zero_triu: bool = False, conformer_enc_kernel_size: int = 7, conformer_dec_kernel_size: int = 31, duration_predictor_layers: int = 2, duration_predictor_chans: int = 384, duration_predictor_kernel_size: int = 3, duration_predictor_dropout_rate: float = 0.1, energy_predictor_layers: int = 2, energy_predictor_chans: int = 384, energy_predictor_kernel_size: int = 3, energy_predictor_dropout: float = 0.5, energy_embed_kernel_size: int = 9, energy_embed_dropout: float = 0.5, stop_gradient_from_energy_predictor: bool = False, pitch_predictor_layers: int = 2, pitch_predictor_chans: int = 384, pitch_predictor_kernel_size: int = 3, pitch_predictor_dropout: float = 0.5, pitch_embed_kernel_size: int = 9, pitch_embed_dropout: float = 0.5, stop_gradient_from_pitch_predictor: bool = False, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'add', use_gst: bool = False, gst_tokens: int = 10, gst_heads: int = 4, gst_conv_layers: int = 6, gst_conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), gst_conv_kernel_size: int = 3, gst_conv_stride: int = 2, gst_gru_layers: int = 1, gst_gru_units: int = 128, init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_masking: bool = False, use_weighted_masking: bool = False, segment_size: int = 64, generator_out_channels: int = 1, generator_channels: int = 512, generator_global_channels: int = -1, generator_kernel_size: int = 7, generator_upsample_scales: List[int] = [8, 8, 2, 2], generator_upsample_kernel_sizes: List[int] = [16, 16, 4, 4], generator_resblock_kernel_sizes: List[int] = [3, 7, 11], generator_resblock_dilations: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], generator_use_additional_convs: bool = True, generator_bias: bool = True, generator_nonlinear_activation: str = 'LeakyReLU', generator_nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.1}, generator_use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

Generator module in JETS.

Initialize JETS generator module.

Parameters:
  • idim (int) – Dimension of the inputs.

  • odim (int) – Dimension of the outputs.

  • elayers (int) – Number of encoder layers.

  • eunits (int) – Number of encoder hidden units.

  • dlayers (int) – Number of decoder layers.

  • dunits (int) – Number of decoder hidden units.

  • use_scaled_pos_enc (bool) – Whether to use trainable scaled pos encoding.

  • use_batch_norm (bool) – Whether to use batch normalization in encoder prenet.

  • encoder_normalize_before (bool) – Whether to apply layernorm layer before encoder block.

  • decoder_normalize_before (bool) – Whether to apply layernorm layer before decoder block.

  • encoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in encoder.

  • decoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in decoder.

  • reduction_factor (int) – Reduction factor.

  • encoder_type (str) – Encoder type (“transformer” or “conformer”).

  • decoder_type (str) – Decoder type (“transformer” or “conformer”).

  • transformer_enc_dropout_rate (float) – Dropout rate in encoder except attention and positional encoding.

  • transformer_enc_positional_dropout_rate (float) – Dropout rate after encoder positional encoding.

  • transformer_enc_attn_dropout_rate (float) – Dropout rate in encoder self-attention module.

  • transformer_dec_dropout_rate (float) – Dropout rate in decoder except attention & positional encoding.

  • transformer_dec_positional_dropout_rate (float) – Dropout rate after decoder positional encoding.

  • transformer_dec_attn_dropout_rate (float) – Dropout rate in decoder self-attention module.

  • conformer_rel_pos_type (str) – Relative pos encoding type in conformer.

  • conformer_pos_enc_layer_type (str) – Pos encoding layer type in conformer.

  • conformer_self_attn_layer_type (str) – Self-attention layer type in conformer.

  • conformer_activation_type (str) – Activation function type in conformer.

  • use_macaron_style_in_conformer (bool) – Whether to use macaron style FFN.

  • use_cnn_in_conformer (bool) – Whether to use CNN in conformer.

  • zero_triu (bool) – Whether to use zero triu in relative self-attention module.

  • conformer_enc_kernel_size (int) – Kernel size of encoder conformer.

  • conformer_dec_kernel_size (int) – Kernel size of decoder conformer.

  • duration_predictor_layers (int) – Number of duration predictor layers.

  • duration_predictor_chans (int) – Number of duration predictor channels.

  • duration_predictor_kernel_size (int) – Kernel size of duration predictor.

  • duration_predictor_dropout_rate (float) – Dropout rate in duration predictor.

  • pitch_predictor_layers (int) – Number of pitch predictor layers.

  • pitch_predictor_chans (int) – Number of pitch predictor channels.

  • pitch_predictor_kernel_size (int) – Kernel size of pitch predictor.

  • pitch_predictor_dropout (float) – Dropout rate in pitch predictor.

  • pitch_embed_kernel_size (int) – Kernel size of pitch embedding.

  • pitch_embed_dropout (float) – Dropout rate for pitch embedding.

  • stop_gradient_from_pitch_predictor (bool) – Whether to stop gradient from pitch predictor to encoder.

  • energy_predictor_layers (int) – Number of energy predictor layers.

  • energy_predictor_chans (int) – Number of energy predictor channels.

  • energy_predictor_kernel_size (int) – Kernel size of energy predictor.

  • energy_predictor_dropout (float) – Dropout rate in energy predictor.

  • energy_embed_kernel_size (int) – Kernel size of energy embedding.

  • energy_embed_dropout (float) – Dropout rate for energy embedding.

  • stop_gradient_from_energy_predictor (bool) – Whether to stop gradient from energy predictor to encoder.

  • spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.

  • langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.

  • spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.

  • spk_embed_integration_type (str) – How to integrate speaker embedding.

  • use_gst (bool) – Whether to use global style token.

  • gst_tokens (int) – The number of GST embeddings.

  • gst_heads (int) – The number of heads in GST multihead attention.

  • gst_conv_layers (int) – The number of conv layers in GST.

  • gst_conv_chans_list (Sequence[int]) – List of the number of channels of conv layers in GST.

  • gst_conv_kernel_size (int) – Kernel size of conv layers in GST.

  • gst_conv_stride (int) – Stride size of conv layers in GST.

  • gst_gru_layers (int) – The number of GRU layers in GST.

  • gst_gru_units (int) – The number of GRU units in GST.

  • init_type (str) – How to initialize transformer parameters.

  • init_enc_alpha (float) – Initial value of alpha in scaled pos encoding of the encoder.

  • init_dec_alpha (float) – Initial value of alpha in scaled pos encoding of the decoder.

  • use_masking (bool) – Whether to apply masking for padded part in loss calculation.

  • use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.

  • segment_size (int) – Segment size for the random windowed discriminator.

  • generator_out_channels (int) – Number of output channels.

  • generator_channels (int) – Number of hidden representation channels.

  • generator_global_channels (int) – Number of global conditioning channels.

  • generator_kernel_size (int) – Kernel size of initial and final conv layer.

  • generator_upsample_scales (List[int]) – List of upsampling scales.

  • generator_upsample_kernel_sizes (List[int]) – List of kernel sizes for upsample layers.

  • generator_resblock_kernel_sizes (List[int]) – List of kernel sizes for residual blocks.

  • generator_resblock_dilations (List[List[int]]) – List of list of dilations for residual blocks.

  • generator_use_additional_convs (bool) – Whether to use additional conv layers in residual blocks.

  • generator_bias (bool) – Whether to add bias parameter in convolution layers.

  • generator_nonlinear_activation (str) – Activation function module name.

  • generator_nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • generator_use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.
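
Example (a minimal instantiation sketch; the vocabulary size 78 and feature dimension 80 are illustrative values, not defaults of this module):

>>> import torch
>>> from espnet2.gan_tts.jets.generator import JETSGenerator
>>> generator = JETSGenerator(idim=78, odim=80)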

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, pitch: torch.Tensor, pitch_lengths: torch.Tensor, energy: torch.Tensor, energy_lengths: torch.Tensor, sids: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • text (Tensor) – Text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • feats (Tensor) – Feature tensor (B, T_feats, aux_channels).

  • feats_lengths (Tensor) – Feature length tensor (B,).

  • pitch (Tensor) – Batch of padded token-averaged pitch (B, T_text, 1).

  • pitch_lengths (LongTensor) – Batch of pitch lengths (B,).

  • energy (Tensor) – Batch of padded token-averaged energy (B, T_text, 1).

  • energy_lengths (LongTensor) – Batch of energy lengths (B,).

  • sids (Optional[Tensor]) – Speaker index tensor (B,) or (B, 1).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, spk_embed_dim).

  • lids (Optional[Tensor]) – Language index tensor (B,) or (B, 1).

Returns:

  • Tensor: Waveform tensor (B, 1, segment_size * upsample_factor).

  • Tensor: Binarization loss ().

  • Tensor: Log probability attention matrix (B, T_feats, T_text).

  • Tensor: Segment start index tensor (B,).

  • Tensor: Predicted duration (B, T_text).

  • Tensor: Ground-truth duration obtained from an alignment module (B, T_text).

  • Tensor: Predicted pitch (B, T_text, 1).

  • Tensor: Ground-truth averaged pitch (B, T_text, 1).

  • Tensor: Predicted energy (B, T_text, 1).

  • Tensor: Ground-truth averaged energy (B, T_text, 1).

Return type:

Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]
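
Example (a hedged call sketch continuing the instantiation above; shapes follow the parameter list, tensor contents are random placeholders, and the unpacked names mirror the Returns entries but are otherwise illustrative):

>>> B, T_text, T_feats = 2, 10, 100
>>> outs = generator(
...     text=torch.randint(0, 78, (B, T_text)),
...     text_lengths=torch.tensor([T_text, T_text]),
...     feats=torch.randn(B, T_feats, 80),
...     feats_lengths=torch.tensor([T_feats, T_feats]),
...     pitch=torch.randn(B, T_text, 1),
...     pitch_lengths=torch.tensor([T_text, T_text]),
...     energy=torch.randn(B, T_text, 1),
...     energy_lengths=torch.tensor([T_text, T_text]),
... )
>>> wav, bin_loss, log_p_attn, start_idxs, d_outs, ds, p_outs, ps, e_outs, es = outs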

inference(text: torch.Tensor, text_lengths: torch.Tensor, feats: Optional[torch.Tensor] = None, feats_lengths: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, use_teacher_forcing: bool = False) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Run inference.

Parameters:
  • text (Tensor) – Input text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • feats (Tensor) – Feature tensor (B, T_feats, aux_channels).

  • feats_lengths (Tensor) – Feature length tensor (B,).

  • pitch (Tensor) – Pitch tensor (B, T_feats, 1).

  • energy (Tensor) – Energy tensor (B, T_feats, 1).

  • sids (Optional[Tensor]) – Speaker index tensor (B,) or (B, 1).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, spk_embed_dim).

  • lids (Optional[Tensor]) – Language index tensor (B,) or (B, 1).

  • use_teacher_forcing (bool) – Whether to use teacher forcing.

Returns:

  • Tensor: Generated waveform tensor (B, T_wav).

  • Tensor: Duration tensor (B, T_text).

Return type:

Tuple[Tensor, Tensor, Tensor]
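
Example (a hedged inference sketch continuing the sketch above; per the signature three tensors are returned, of which the Returns entries describe the waveform and duration):

>>> with torch.no_grad():
...     outs = generator.inference(
...         text=torch.randint(0, 78, (1, 10)),
...         text_lengths=torch.tensor([10]),
...     )
>>> wav = outs[0]  # generated waveform, (B, T_wav) per the Returns above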

espnet2.gan_tts.jets.jets

JETS module for GAN-TTS task.

class espnet2.gan_tts.jets.jets.JETS(idim: int, odim: int, sampling_rate: int = 22050, generator_type: str = 'jets_generator', generator_params: Dict[str, Any] = {'adim': 256, 'aheads': 2, 'conformer_activation_type': 'swish', 'conformer_dec_kernel_size': 31, 'conformer_enc_kernel_size': 7, 'conformer_pos_enc_layer_type': 'rel_pos', 'conformer_rel_pos_type': 'latest', 'conformer_self_attn_layer_type': 'rel_selfattn', 'decoder_concat_after': False, 'decoder_normalize_before': True, 'decoder_type': 'transformer', 'dlayers': 4, 'dunits': 1024, 'duration_predictor_chans': 384, 'duration_predictor_dropout_rate': 0.1, 'duration_predictor_kernel_size': 3, 'duration_predictor_layers': 2, 'elayers': 4, 'encoder_concat_after': False, 'encoder_normalize_before': True, 'encoder_type': 'transformer', 'energy_embed_dropout': 0.5, 'energy_embed_kernel_size': 1, 'energy_predictor_chans': 384, 'energy_predictor_dropout': 0.5, 'energy_predictor_kernel_size': 3, 'energy_predictor_layers': 2, 'eunits': 1024, 'generator_bias': True, 'generator_channels': 512, 'generator_global_channels': -1, 'generator_kernel_size': 7, 'generator_nonlinear_activation': 'LeakyReLU', 'generator_nonlinear_activation_params': {'negative_slope': 0.1}, 'generator_out_channels': 1, 'generator_resblock_dilations': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'generator_resblock_kernel_sizes': [3, 7, 11], 'generator_upsample_kernel_sizes': [16, 16, 4, 4], 'generator_upsample_scales': [8, 8, 2, 2], 'generator_use_additional_convs': True, 'generator_use_weight_norm': True, 'gst_conv_chans_list': [32, 32, 64, 64, 128, 128], 'gst_conv_kernel_size': 3, 'gst_conv_layers': 6, 'gst_conv_stride': 2, 'gst_gru_layers': 1, 'gst_gru_units': 128, 'gst_heads': 4, 'gst_tokens': 10, 'init_dec_alpha': 1.0, 'init_enc_alpha': 1.0, 'init_type': 'xavier_uniform', 'langs': -1, 'pitch_embed_dropout': 0.5, 'pitch_embed_kernel_size': 1, 'pitch_predictor_chans': 384, 'pitch_predictor_dropout': 0.5, 'pitch_predictor_kernel_size': 5, 'pitch_predictor_layers': 5, 'positionwise_conv_kernel_size': 1, 'positionwise_layer_type': 'conv1d', 'reduction_factor': 1, 'segment_size': 64, 'spk_embed_dim': None, 'spk_embed_integration_type': 'add', 'spks': -1, 'stop_gradient_from_energy_predictor': False, 'stop_gradient_from_pitch_predictor': True, 'transformer_dec_attn_dropout_rate': 0.1, 'transformer_dec_dropout_rate': 0.1, 'transformer_dec_positional_dropout_rate': 0.1, 'transformer_enc_attn_dropout_rate': 0.1, 'transformer_enc_dropout_rate': 0.1, 'transformer_enc_positional_dropout_rate': 0.1, 'use_batch_norm': True, 'use_cnn_in_conformer': True, 'use_gst': False, 'use_macaron_style_in_conformer': True, 'use_masking': False, 'use_scaled_pos_enc': True, 'use_weighted_masking': False, 'zero_triu': False}, discriminator_type: str = 'hifigan_multi_scale_multi_period_discriminator', discriminator_params: Dict[str, Any] = {'follow_official_norm': False, 'period_discriminator_params': {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'periods': [2, 3, 5, 7, 11], 'scale_discriminator_params': {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': 
{'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'scale_downsample_pooling': 'AvgPool1d', 'scale_downsample_pooling_params': {'kernel_size': 4, 'padding': 2, 'stride': 2}, 'scales': 1}, generator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, discriminator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 22050, 'hop_length': 256, 'log_base': None, 'n_fft': 1024, 'n_mels': 80, 'win_length': None, 'window': 'hann'}, lambda_adv: float = 1.0, lambda_mel: float = 45.0, lambda_feat_match: float = 2.0, lambda_var: float = 1.0, lambda_align: float = 2.0, cache_generator_outputs: bool = True, plot_pred_mos: bool = False, mos_pred_tool: str = 'utmos')[source]

Bases: espnet2.gan_tts.abs_gan_tts.AbsGANTTS

JETS module (generator + discriminator).

This is a module of JETS described in `JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech`_.

Initialize JETS module.

Parameters:
  • idim (int) – Input vocabulary size.

  • odim (int) – Acoustic feature dimension. The actual number of output channels is 1, since JETS is an end-to-end text-to-waveform model, but odim is kept to indicate the acoustic feature dimension for compatibility.

  • sampling_rate (int) – Sampling rate. Not used for training, but referenced when saving the waveform during inference.

  • generator_type (str) – Generator type.

  • generator_params (Dict[str, Any]) – Parameter dict for generator.

  • discriminator_type (str) – Discriminator type.

  • discriminator_params (Dict[str, Any]) – Parameter dict for discriminator.

  • generator_adv_loss_params (Dict[str, Any]) – Parameter dict for generator adversarial loss.

  • discriminator_adv_loss_params (Dict[str, Any]) – Parameter dict for discriminator adversarial loss.

  • feat_match_loss_params (Dict[str, Any]) – Parameter dict for feat match loss.

  • mel_loss_params (Dict[str, Any]) – Parameter dict for mel loss.

  • lambda_adv (float) – Loss scaling coefficient for adversarial loss.

  • lambda_mel (float) – Loss scaling coefficient for mel spectrogram loss.

  • lambda_feat_match (float) – Loss scaling coefficient for feat match loss.

  • lambda_var (float) – Loss scaling coefficient for variance loss.

  • lambda_align (float) – Loss scaling coefficient for alignment loss.

  • cache_generator_outputs (bool) – Whether to cache generator outputs.

  • plot_pred_mos (bool) – Whether to plot predicted MOS during training.

  • mos_pred_tool (str) – MOS prediction tool name.
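
Example (a minimal instantiation sketch; the idim/odim values are illustrative, as for JETSGenerator above):

>>> import torch
>>> from espnet2.gan_tts.jets.jets import JETS
>>> model = JETS(idim=78, odim=80)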

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, speech: torch.Tensor, speech_lengths: torch.Tensor, sids: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, forward_generator: bool = True, **kwargs) → Dict[str, Any][source]

Perform generator forward.

Parameters:
  • text (Tensor) – Text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • feats (Tensor) – Feature tensor (B, T_feats, aux_channels).

  • feats_lengths (Tensor) – Feature length tensor (B,).

  • speech (Tensor) – Speech waveform tensor (B, T_wav).

  • speech_lengths (Tensor) – Speech length tensor (B,).

  • sids (Optional[Tensor]) – Speaker index tensor (B,) or (B, 1).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, spk_embed_dim).

  • lids (Optional[Tensor]) – Language index tensor (B,) or (B, 1).

  • forward_generator (bool) – Whether to forward generator.

Returns:

  • loss (Tensor): Loss scalar tensor.

  • stats (Dict[str, float]): Statistics to be monitored.

  • weight (Tensor): Weight tensor to summarize losses.

  • optim_idx (int): Optimizer index (0 for G and 1 for D).

Return type:

Dict[str, Any]
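
Example (a hedged single-step training sketch built on the returned dict; the generator/discriminator attribute names, the pitch/energy entries passed through **kwargs, and all shapes are assumptions for illustration, not API guarantees):

>>> B, T_text, T_feats, hop = 2, 10, 100, 256  # hop matches mel_loss_params' hop_length
>>> batch = dict(
...     text=torch.randint(0, 78, (B, T_text)),
...     text_lengths=torch.tensor([T_text, T_text]),
...     feats=torch.randn(B, T_feats, 80),
...     feats_lengths=torch.tensor([T_feats, T_feats]),
...     speech=torch.randn(B, T_feats * hop),
...     speech_lengths=torch.tensor([T_feats * hop, T_feats * hop]),
...     pitch=torch.randn(B, T_text, 1),
...     pitch_lengths=torch.tensor([T_text, T_text]),
...     energy=torch.randn(B, T_text, 1),
...     energy_lengths=torch.tensor([T_text, T_text]),
... )
>>> optimizers = [
...     torch.optim.Adam(model.generator.parameters(), lr=2e-4),      # index 0: G
...     torch.optim.Adam(model.discriminator.parameters(), lr=2e-4),  # index 1: D
... ]
>>> for forward_generator in (True, False):
...     model.zero_grad()
...     out = model(forward_generator=forward_generator, **batch)
...     out["loss"].backward()
...     optimizers[out["optim_idx"]].step()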

inference(text: torch.Tensor, feats: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, use_teacher_forcing: bool = False, **kwargs) → Dict[str, torch.Tensor][source]

Run inference.

Parameters:
  • text (Tensor) – Input text index tensor (T_text,).

  • feats (Tensor) – Feature tensor (T_feats, aux_channels).

  • pitch (Tensor) – Pitch tensor (T_feats, 1).

  • energy (Tensor) – Energy tensor (T_feats, 1).

  • use_teacher_forcing (bool) – Whether to use teacher forcing.

Returns:

  • wav (Tensor): Generated waveform tensor (T_wav,).

  • duration (Tensor): Predicted duration tensor (T_text,).

Return type:

Dict[str, Tensor]
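
Example (a hedged sketch; the text tensor is random and its vocabulary size is an assumption carried over from the instantiation sketch above):

>>> model.eval()
>>> with torch.no_grad():
...     out = model.inference(text=torch.randint(0, 78, (10,)))
>>> wav = out["wav"]            # (T_wav,)
>>> duration = out["duration"]  # (T_text,)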

property require_raw_speech

Return whether or not speech is required.

property require_vocoder

Return whether or not vocoder is required.

espnet2.gan_tts.jets.loss

JETS related loss module for ESPnet2.

class espnet2.gan_tts.jets.loss.ForwardSumLoss[source]

Bases: torch.nn.modules.module.Module

Forwardsum loss described at https://openreview.net/forum?id=0NQwnnwAORi

Initialize forwardsum loss module.

forward(log_p_attn: torch.Tensor, ilens: torch.Tensor, olens: torch.Tensor, blank_prob: float = 0.36787944117144233) → torch.Tensor[source]

Calculate forward propagation.

Parameters:
  • log_p_attn (Tensor) – Batch of log probability of attention matrix (B, T_feats, T_text).

  • ilens (Tensor) – Batch of the lengths of each input (B,).

  • olens (Tensor) – Batch of the lengths of each target (B,).

  • blank_prob (float) – Blank symbol probability. The default value corresponds to exp(-1).

Returns:

forwardsum loss value.

Return type:

Tensor
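
Example (a hedged sketch with random attention log-probabilities; shapes follow the parameter list above):

>>> import torch
>>> from espnet2.gan_tts.jets.loss import ForwardSumLoss
>>> loss_fn = ForwardSumLoss()
>>> log_p_attn = torch.randn(2, 60, 10).log_softmax(dim=-1)  # (B, T_feats, T_text)
>>> loss = loss_fn(log_p_attn, ilens=torch.tensor([10, 8]), olens=torch.tensor([60, 50]))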

class espnet2.gan_tts.jets.loss.VarianceLoss(use_masking: bool = True, use_weighted_masking: bool = False)[source]

Bases: torch.nn.modules.module.Module

Initialize JETS variance loss module.

Parameters:
  • use_masking (bool) – Whether to apply masking for padded part in loss calculation.

  • use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.

forward(d_outs: torch.Tensor, ds: torch.Tensor, p_outs: torch.Tensor, ps: torch.Tensor, e_outs: torch.Tensor, es: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • d_outs (LongTensor) – Batch of outputs of duration predictor (B, T_text).

  • ds (LongTensor) – Batch of durations (B, T_text).

  • p_outs (Tensor) – Batch of outputs of pitch predictor (B, T_text, 1).

  • ps (Tensor) – Batch of target token-averaged pitch (B, T_text, 1).

  • e_outs (Tensor) – Batch of outputs of energy predictor (B, T_text, 1).

  • es (Tensor) – Batch of target token-averaged energy (B, T_text, 1).

  • ilens (LongTensor) – Batch of the lengths of each input (B,).

Returns:

  • Tensor: Duration predictor loss value.

  • Tensor: Pitch predictor loss value.

  • Tensor: Energy predictor loss value.

Return type:

Tuple[Tensor, Tensor, Tensor]
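
Example (a hedged sketch; treating d_outs as float log-domain predictor outputs is an assumption, and all values are random placeholders):

>>> import torch
>>> from espnet2.gan_tts.jets.loss import VarianceLoss
>>> criterion = VarianceLoss(use_masking=True)
>>> B, T_text = 2, 10
>>> losses = criterion(
...     d_outs=torch.randn(B, T_text),        # predicted durations (assumed log domain)
...     ds=torch.randint(1, 5, (B, T_text)),  # ground-truth durations
...     p_outs=torch.randn(B, T_text, 1), ps=torch.randn(B, T_text, 1),
...     e_outs=torch.randn(B, T_text, 1), es=torch.randn(B, T_text, 1),
...     ilens=torch.tensor([10, 8]),
... )  # tuple of per-variance losses per the Returns above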

espnet2.gan_tts.jets.length_regulator

class espnet2.gan_tts.jets.length_regulator.GaussianUpsampling(delta=0.1)[source]

Bases: torch.nn.modules.module.Module

Gaussian upsampling with fixed temperature as in:

https://arxiv.org/abs/2010.04301

forward(hs, ds, h_masks=None, d_masks=None)[source]

Upsample hidden states according to durations.

Parameters:
  • hs (Tensor) – Batched hidden state to be expanded (B, T_text, adim).

  • ds (Tensor) – Batched token duration (B, T_text).

  • h_masks (Tensor) – Mask tensor (B, T_feats).

  • d_masks (Tensor) – Mask tensor (B, T_text).

Returns:

Expanded hidden state (B, T_feats, adim).

Return type:

Tensor
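
Example (a hedged sketch; with batch size 1, the expanded length equals the total duration, here 2 + 3 + 1 = 6 frames):

>>> import torch
>>> from espnet2.gan_tts.jets.length_regulator import GaussianUpsampling
>>> gu = GaussianUpsampling()       # default temperature delta=0.1
>>> hs = torch.randn(1, 3, 8)       # (B, T_text, adim)
>>> ds = torch.tensor([[2, 3, 1]])  # per-token durations in frames
>>> gu(hs, ds).shape
torch.Size([1, 6, 8])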

espnet2.gan_tts.jets.__init__

espnet2.gan_tts.jets.alignments

class espnet2.gan_tts.jets.alignments.AlignmentModule(adim, odim, cache_prior=True)[source]

Bases: torch.nn.modules.module.Module

Alignment Learning Framework proposed for parallel TTS models in:

https://arxiv.org/abs/2108.10447

Initialize AlignmentModule.

Parameters:
  • adim (int) – Dimension of attention.

  • odim (int) – Dimension of feats.

  • cache_prior (bool) – Whether to cache beta-binomial prior.

forward(text, feats, text_lengths, feats_lengths, x_masks=None)[source]

Calculate the log-probability attention matrix used for the alignment loss.

Parameters:
  • text (Tensor) – Batched text embedding (B, T_text, adim).

  • feats (Tensor) – Batched acoustic feature (B, T_feats, odim).

  • text_lengths (Tensor) – Text length tensor (B,).

  • feats_lengths (Tensor) – Feature length tensor (B,).

  • x_masks (Tensor) – Mask tensor (B, T_text).

Returns:

Log probability of attention matrix (B, T_feats, T_text).

Return type:

Tensor
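
Example (a hedged sketch; note that text is a batch of embeddings, not token indices, and all values are random placeholders):

>>> import torch
>>> from espnet2.gan_tts.jets.alignments import AlignmentModule
>>> aligner = AlignmentModule(adim=8, odim=10)
>>> log_p_attn = aligner(
...     text=torch.randn(1, 3, 8),
...     feats=torch.randn(1, 6, 10),
...     text_lengths=torch.tensor([3]),
...     feats_lengths=torch.tensor([6]),
... )
>>> log_p_attn.shape
torch.Size([1, 6, 3])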

espnet2.gan_tts.jets.alignments.average_by_duration(ds, xs, text_lengths, feats_lengths)[source]

Average frame-level features into token-level features according to durations.

Parameters:
  • ds (Tensor) – Batched token duration (B, T_text).

  • xs (Tensor) – Batched feature sequences to be averaged (B, T_feats).

  • text_lengths (Tensor) – Text length tensor (B,).

  • feats_lengths (Tensor) – Feature length tensor (B,).

Returns:

Batched feature averaged according to the token duration (B, T_text).

Return type:

Tensor
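
Example (a hedged sketch; xs stands in for a frame-level track such as pitch, and the durations sum to the feature length):

>>> import torch
>>> from espnet2.gan_tts.jets.alignments import average_by_duration
>>> ds = torch.tensor([[2, 3, 1]])  # token durations summing to T_feats = 6
>>> xs = torch.randn(1, 6)          # frame-level feature track
>>> average_by_duration(ds, xs, torch.tensor([3]), torch.tensor([6])).shape
torch.Size([1, 3])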

espnet2.gan_tts.jets.alignments.viterbi_decode(log_p_attn, text_lengths, feats_lengths)[source]

Extract durations from an attention probability matrix by Viterbi decoding.

Parameters:
  • log_p_attn (Tensor) – Batched log probability of attention matrix (B, T_feats, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • feats_lengths (Tensor) – Feature length tensor (B,).

Returns:

  • Tensor: Batched token duration extracted from log_p_attn (B, T_text).

  • Tensor: Binarization loss tensor ().

Return type:

Tuple[Tensor, Tensor]
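
Example (a hedged sketch with a random attention matrix; the two-way unpacking follows the Returns entries above):

>>> import torch
>>> from espnet2.gan_tts.jets.alignments import viterbi_decode
>>> log_p_attn = torch.randn(1, 6, 3).log_softmax(dim=-1)  # (B, T_feats, T_text)
>>> ds, bin_loss = viterbi_decode(log_p_attn, torch.tensor([3]), torch.tensor([6]))
>>> ds.shape
torch.Size([1, 3])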