espnet2.gan_tts package

espnet2.gan_tts.espnet_model

GAN-based text-to-speech ESPnet model.

class espnet2.gan_tts.espnet_model.ESPnetGANTTSModel(feats_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], pitch_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], pitch_normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], energy_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], energy_normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], tts: espnet2.gan_tts.abs_gan_tts.AbsGANTTS)[source]

Bases: espnet2.train.abs_gan_espnet_model.AbsGANESPnetModel

ESPnet model for GAN-based text-to-speech task.

Initialize ESPnetGANTTSModel module.

collect_feats(text: torch.Tensor, text_lengths: torch.Tensor, speech: torch.Tensor, speech_lengths: torch.Tensor, durations: Optional[torch.Tensor] = None, durations_lengths: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, energy_lengths: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None) → Dict[str, torch.Tensor][source]

Calculate features and return them as a dict.

Parameters
  • text (Tensor) – Text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • speech (Tensor) – Speech waveform tensor (B, T_wav).

  • speech_lengths (Tensor) – Speech length tensor (B,).

  • durations (Optional[Tensor]) – Duration tensor.

  • durations_lengths (Optional[Tensor]) – Duration length tensor (B,).

  • pitch (Optional[Tensor]) – Pitch tensor.

  • pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).

  • energy (Optional[Tensor]) – Energy tensor.

  • energy_lengths (Optional[Tensor]) – Energy length tensor (B,).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).

  • sids (Optional[Tensor]) – Speaker index tensor (B, 1).

  • lids (Optional[Tensor]) – Language ID tensor (B, 1).

Returns

Dict of features.

Return type

Dict[str, Tensor]

forward(text: torch.Tensor, text_lengths: torch.Tensor, speech: torch.Tensor, speech_lengths: torch.Tensor, durations: Optional[torch.Tensor] = None, durations_lengths: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, energy_lengths: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, forward_generator: bool = True) → Dict[str, Any][source]

Return generator or discriminator loss with dict format.

Parameters
  • text (Tensor) – Text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • speech (Tensor) – Speech waveform tensor (B, T_wav).

  • speech_lengths (Tensor) – Speech length tensor (B,).

  • durations (Optional[Tensor]) – Duration tensor.

  • durations_lengths (Optional[Tensor]) – Duration length tensor (B,).

  • pitch (Optional[Tensor]) – Pitch tensor.

  • pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).

  • energy (Optional[Tensor]) – Energy tensor.

  • energy_lengths (Optional[Tensor]) – Energy length tensor (B,).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).

  • sids (Optional[Tensor]) – Speaker ID tensor (B, 1).

  • lids (Optional[Tensor]) – Language ID tensor (B, 1).

  • forward_generator (bool) – Whether to forward generator.

Returns

  • loss (Tensor): Loss scalar tensor.

  • stats (Dict[str, float]): Statistics to be monitored.

  • weight (Tensor): Weight tensor to summarize losses.

  • optim_idx (int): Optimizer index (0 for G and 1 for D).

Return type

Dict[str, Any]
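
The returned optim_idx tells the training loop which optimizer a loss belongs to. The following is a minimal sketch of that dispatch. It uses a hypothetical stand-in, toy_forward, instead of the real model, and the alternation order and loss values are assumptions made for illustration only.

```python
from typing import Any, Dict


def toy_forward(forward_generator: bool = True) -> Dict[str, Any]:
    """Hypothetical stand-in that mimics the documented return dict."""
    if forward_generator:
        return {"loss": 1.5, "stats": {"generator_loss": 1.5}, "weight": 1.0, "optim_idx": 0}
    return {"loss": 0.7, "stats": {"discriminator_loss": 0.7}, "weight": 1.0, "optim_idx": 1}


# Index 0 is the generator (G) optimizer and index 1 the discriminator (D) one.
optimizer_names = ["G", "D"]
updates = []
for turn in ("generator", "discriminator"):
    out = toy_forward(forward_generator=(turn == "generator"))
    # Route the loss to the optimizer selected by optim_idx.
    updates.append((optimizer_names[out["optim_idx"]], out["loss"]))

print(updates)
```

In a real loop the selected optimizer would be stepped on out["loss"]; plain floats stand in for tensors here.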

espnet2.gan_tts.__init__

espnet2.gan_tts.abs_gan_tts

GAN-based TTS abstract class.

class espnet2.gan_tts.abs_gan_tts.AbsGANTTS[source]

Bases: espnet2.tts.abs_tts.AbsTTS, abc.ABC

GAN-based TTS model abstract class.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(forward_generator, *args, **kwargs) → Dict[str, Union[torch.Tensor, Dict[str, torch.Tensor], int]][source]

Return generator or discriminator loss.

espnet2.gan_tts.parallel_wavegan.parallel_wavegan

Parallel WaveGAN Modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.parallel_wavegan.parallel_wavegan.ParallelWaveGANDiscriminator(in_channels: int = 1, out_channels: int = 1, kernel_size: int = 3, layers: int = 10, conv_channels: int = 64, dilation_factor: int = 1, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.2}, bias: bool = True, use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

Parallel WaveGAN Discriminator module.

Initialize ParallelWaveGANDiscriminator module.

Parameters
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • kernel_size (int) – Kernel size of conv layers.

  • layers (int) – Number of conv layers.

  • conv_channels (int) – Number of channels in conv layers.

  • dilation_factor (int) – Dilation factor. For example, if dilation_factor = 2, the dilation will be 2, 4, 8, …, and so on.

  • nonlinear_activation (str) – Nonlinear function after each conv.

  • nonlinear_activation_params (Dict[str, Any]) – Nonlinear function parameters.

  • bias (bool) – Whether to use bias parameter in conv.

  • use_weight_norm (bool) – If set to true, it will be applied to all of the conv layers.

apply_weight_norm()[source]

Apply weight normalization module to all of the layers.

forward(x: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters

x (Tensor) – Input signal (B, 1, T).

Returns

Output tensor (B, 1, T).

Return type

Tensor

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

class espnet2.gan_tts.parallel_wavegan.parallel_wavegan.ParallelWaveGANGenerator(in_channels: int = 1, out_channels: int = 1, kernel_size: int = 3, layers: int = 30, stacks: int = 3, residual_channels: int = 64, gate_channels: int = 128, skip_channels: int = 64, aux_channels: int = 80, aux_context_window: int = 2, dropout_rate: float = 0.0, bias: bool = True, use_weight_norm: bool = True, upsample_conditional_features: bool = True, upsample_net: str = 'ConvInUpsampleNetwork', upsample_params: Dict[str, Any] = {'upsample_scales': [4, 4, 4, 4]})[source]

Bases: torch.nn.modules.module.Module

Parallel WaveGAN Generator module.

Initialize ParallelWaveGANGenerator module.

Parameters
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • kernel_size (int) – Kernel size of dilated convolution.

  • layers (int) – Number of residual block layers.

  • stacks (int) – Number of stacks i.e., dilation cycles.

  • residual_channels (int) – Number of channels in residual conv.

  • gate_channels (int) – Number of channels in gated conv.

  • skip_channels (int) – Number of channels in skip conv.

  • aux_channels (int) – Number of channels for auxiliary feature conv.

  • aux_context_window (int) – Context window size for auxiliary feature.

  • dropout_rate (float) – Dropout rate. 0.0 means no dropout applied.

  • bias (bool) – Whether to use bias parameter in conv layer.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

  • upsample_conditional_features (bool) – Whether to use upsampling network.

  • upsample_net (str) – Upsampling network architecture.

  • upsample_params (Dict[str, Any]) – Upsampling network parameters.

apply_weight_norm()[source]

Apply weight normalization module to all of the layers.

forward(c: torch.Tensor, z: Optional[torch.Tensor] = None) → torch.Tensor[source]

Calculate forward propagation.

Parameters
  • c (Tensor) – Local conditioning auxiliary features (B, C, T_feats).

  • z (Optional[Tensor]) – Input noise signal (B, 1, T_wav).

Returns

Output tensor (B, out_channels, T_wav).

Return type

Tensor

inference(c: torch.Tensor, z: Optional[torch.Tensor] = None) → torch.Tensor[source]

Perform inference.

Parameters
  • c (Tensor) – Local conditioning auxiliary features (T_feats, C).

  • z (Optional[Tensor]) – Input noise signal (T_wav, 1).

Returns

Output tensor (T_wav, out_channels).

Return type

Tensor

property receptive_field_size

Return receptive field size.
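
As a worked example of the receptive field, the sketch below assumes WaveNet-style dilations that double within each stack and restart from 1 at every new stack. The function is a hypothetical helper written for this example, not the property's actual implementation.

```python
def receptive_field_size(layers: int, stacks: int, kernel_size: int, base: int = 2) -> int:
    """Receptive field in samples of stacked dilated convolutions (assumed schedule)."""
    layers_per_stack = layers // stacks
    # Geometric series 1 + base + ... + base ** (layers_per_stack - 1).
    dilation_sum_per_stack = (base ** layers_per_stack - 1) // (base - 1)
    # Every layer widens the field by (kernel_size - 1) * dilation samples.
    return (kernel_size - 1) * stacks * dilation_sum_per_stack + 1


rf = receptive_field_size(layers=30, stacks=3, kernel_size=3)
print(rf)  # 6139
```

With the default layers = 30, stacks = 3, and kernel_size = 3, this schedule gives 6139 samples.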

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

espnet2.gan_tts.parallel_wavegan.upsample

Upsampling module.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.parallel_wavegan.upsample.Conv2d(*args, **kwargs)[source]

Bases: torch.nn.modules.conv.Conv2d

Conv2d module with customized initialization.

Initialize Conv2d module.

reset_parameters()[source]

Reset parameters.

class espnet2.gan_tts.parallel_wavegan.upsample.ConvInUpsampleNetwork(upsample_scales: List[int], nonlinear_activation: Optional[str] = None, nonlinear_activation_params: Dict[str, Any] = {}, interpolate_mode: str = 'nearest', freq_axis_kernel_size: int = 1, aux_channels: int = 80, aux_context_window: int = 0)[source]

Bases: torch.nn.modules.module.Module

Convolution + upsampling network module.

Initialize ConvInUpsampleNetwork module.

Parameters
  • upsample_scales (List[int]) – List of upsampling scales.

  • nonlinear_activation (Optional[str]) – Activation function name.

  • nonlinear_activation_params (Dict[str, Any]) – Arguments for the specified activation function.

  • interpolate_mode (str) – Interpolation mode.

  • freq_axis_kernel_size (int) – Kernel size in the direction of frequency axis.

  • aux_channels (int) – Number of channels of pre-conv layer.

  • aux_context_window (int) – Context window size of the pre-conv layer.

forward(c: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters

c (Tensor) – Input tensor (B, C, T_feats).

Returns

Upsampled tensor (B, C, T_wav), where T_wav = T_feats * prod(upsample_scales).

Return type

Tensor
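
The length relation can be checked with a few lines of arithmetic. The scales below are the default upsample_params of ParallelWaveGANGenerator; the frame count is an arbitrary example value.

```python
import math

upsample_scales = [4, 4, 4, 4]  # default upsample_params["upsample_scales"]
t_feats = 100  # example number of feature frames (arbitrary)

# Each scale stretches the time axis, so the total factor is the product.
t_wav = t_feats * math.prod(upsample_scales)
print(t_wav)  # 25600
```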

class espnet2.gan_tts.parallel_wavegan.upsample.Stretch2d(x_scale: int, y_scale: int, mode: str = 'nearest')[source]

Bases: torch.nn.modules.module.Module

Stretch2d module.

Initialize Stretch2d module.

Parameters
  • x_scale (int) – X scaling factor (Time axis in spectrogram).

  • y_scale (int) – Y scaling factor (Frequency axis in spectrogram).

  • mode (str) – Interpolation mode.

forward(x: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters

x (Tensor) – Input tensor (B, C, F, T).

Returns

Interpolated tensor (B, C, F * y_scale, T * x_scale).

Return type

Tensor
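
For intuition, nearest-neighbour stretching with integer scales repeats every element. The sketch below applies it to a single (F, T) plane stored as nested lists. The stretch2d helper is hypothetical and only mirrors the shape behaviour described above for mode = 'nearest'.

```python
from typing import List


def stretch2d(x: List[List[float]], x_scale: int, y_scale: int) -> List[List[float]]:
    """Nearest-neighbour stretch of one (F, T) plane to (F * y_scale, T * x_scale)."""
    out = []
    for row in x:
        # Repeat every time step x_scale times along the time axis.
        stretched = [v for v in row for _ in range(x_scale)]
        # Repeat the stretched row y_scale times along the frequency axis.
        out.extend(list(stretched) for _ in range(y_scale))
    return out


plane = [[1.0, 2.0], [3.0, 4.0]]  # F = 2, T = 2
result = stretch2d(plane, x_scale=3, y_scale=1)
print(result)  # [[1.0, 1.0, 1.0, 2.0, 2.0, 2.0], [3.0, 3.0, 3.0, 4.0, 4.0, 4.0]]
```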

class espnet2.gan_tts.parallel_wavegan.upsample.UpsampleNetwork(upsample_scales: List[int], nonlinear_activation: Optional[str] = None, nonlinear_activation_params: Dict[str, Any] = {}, interpolate_mode: str = 'nearest', freq_axis_kernel_size: int = 1)[source]

Bases: torch.nn.modules.module.Module

Upsampling network module.

Initialize UpsampleNetwork module.

Parameters
  • upsample_scales (List[int]) – List of upsampling scales.

  • nonlinear_activation (Optional[str]) – Activation function name.

  • nonlinear_activation_params (Dict[str, Any]) – Arguments for the specified activation function.

  • interpolate_mode (str) – Interpolation mode.

  • freq_axis_kernel_size (int) – Kernel size in the direction of frequency axis.

forward(c: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters

c (Tensor) – Input tensor (B, C, T_feats).

Returns

Upsampled tensor (B, C, T_wav).

Return type

Tensor

espnet2.gan_tts.parallel_wavegan.__init__

espnet2.gan_tts.hifigan.loss

HiFiGAN-related loss modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.hifigan.loss.DiscriminatorAdversarialLoss(average_by_discriminators: bool = True, loss_type: str = 'mse')[source]

Bases: torch.nn.modules.module.Module

Discriminator adversarial loss module.

Initialize DiscriminatorAdversarialLoss module.

Parameters
  • average_by_discriminators (bool) – Whether to average the loss by the number of discriminators.

  • loss_type (str) – Loss type, “mse” or “hinge”.

forward(outputs_hat: Union[List[List[torch.Tensor]], List[torch.Tensor], torch.Tensor], outputs: Union[List[List[torch.Tensor]], List[torch.Tensor], torch.Tensor]) → Tuple[torch.Tensor, torch.Tensor][source]

Calculate discriminator adversarial loss.

Parameters
  • outputs_hat (Union[List[List[Tensor]], List[Tensor], Tensor]) – Discriminator outputs, list of discriminator outputs, or list of list of discriminator outputs calculated from generator.

  • outputs (Union[List[List[Tensor]], List[Tensor], Tensor]) – Discriminator outputs, list of discriminator outputs, or list of list of discriminator outputs calculated from groundtruth.

Returns

  • Tensor: Discriminator real loss value.

  • Tensor: Discriminator fake loss value.

Return type

Tuple[Tensor, Tensor]
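
The two loss types can be sketched on plain lists of discriminator scores. The sketch assumes the usual least-squares ("mse") and hinge GAN formulations and handles one discriminator's flattened outputs only. The function name is hypothetical, and averaging over several discriminators is omitted.

```python
from typing import List, Tuple


def discriminator_adv_loss(
    outputs_hat: List[float], outputs: List[float], loss_type: str = "mse"
) -> Tuple[float, float]:
    """Return (real_loss, fake_loss) for one discriminator's flattened outputs."""
    if loss_type == "mse":
        # Real scores are pushed towards 1 and generated scores towards 0.
        real = sum((o - 1.0) ** 2 for o in outputs) / len(outputs)
        fake = sum(o ** 2 for o in outputs_hat) / len(outputs_hat)
    elif loss_type == "hinge":
        # Real scores below 1 and generated scores above -1 are penalised.
        real = sum(max(0.0, 1.0 - o) for o in outputs) / len(outputs)
        fake = sum(max(0.0, 1.0 + o) for o in outputs_hat) / len(outputs_hat)
    else:
        raise ValueError(f"unsupported loss_type: {loss_type}")
    return real, fake


real_loss, fake_loss = discriminator_adv_loss(outputs_hat=[0.0, 0.5], outputs=[1.0, 0.5])
print(real_loss, fake_loss)  # 0.125 0.125
```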

class espnet2.gan_tts.hifigan.loss.FeatureMatchLoss(average_by_layers: bool = True, average_by_discriminators: bool = True, include_final_outputs: bool = False)[source]

Bases: torch.nn.modules.module.Module

Feature matching loss module.

Initialize FeatureMatchLoss module.

Parameters
  • average_by_layers (bool) – Whether to average the loss by the number of layers.

  • average_by_discriminators (bool) – Whether to average the loss by the number of discriminators.

  • include_final_outputs (bool) – Whether to include the final output of each discriminator for loss calculation.

forward(feats_hat: Union[List[List[torch.Tensor]], List[torch.Tensor]], feats: Union[List[List[torch.Tensor]], List[torch.Tensor]]) → torch.Tensor[source]

Calculate feature matching loss.

Parameters
  • feats_hat (Union[List[List[Tensor]], List[Tensor]]) – List of list of discriminator outputs or list of discriminator outputs calculated from generator’s outputs.

  • feats (Union[List[List[Tensor]], List[Tensor]]) – List of list of discriminator outputs or list of discriminator outputs calculated from groundtruth.

Returns

Feature matching loss value.

Return type

Tensor
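
The three flags can be illustrated on nested lists. This is a sketch that assumes an L1 distance per layer, with each discriminator contributing a list of flattened layer outputs whose last entry is its final output. The helper names are hypothetical.

```python
from typing import List

Feature = List[float]  # one flattened layer output


def l1(a: Feature, b: Feature) -> float:
    """Mean absolute difference between two equally sized features."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def feature_match_loss(
    feats_hat: List[List[Feature]],
    feats: List[List[Feature]],
    average_by_layers: bool = True,
    average_by_discriminators: bool = True,
    include_final_outputs: bool = False,
) -> float:
    total = 0.0
    for layers_hat, layers in zip(feats_hat, feats):
        if not include_final_outputs:
            # Drop the last entry, i.e. the discriminator's final output.
            layers_hat, layers = layers_hat[:-1], layers[:-1]
        loss = sum(l1(f_hat, f) for f_hat, f in zip(layers_hat, layers))
        if average_by_layers:
            loss /= len(layers_hat)
        total += loss
    if average_by_discriminators:
        total /= len(feats_hat)
    return total


feats_hat = [[[0.0, 0.0], [1.0, 1.0], [9.0]]]  # one discriminator, three outputs
feats = [[[1.0, 1.0], [1.0, 3.0], [0.0]]]
fm = feature_match_loss(feats_hat, feats)
print(fm)  # 1.0
```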

class espnet2.gan_tts.hifigan.loss.GeneratorAdversarialLoss(average_by_discriminators: bool = True, loss_type: str = 'mse')[source]

Bases: torch.nn.modules.module.Module

Generator adversarial loss module.

Initialize GeneratorAdversarialLoss module.

Parameters
  • average_by_discriminators (bool) – Whether to average the loss by the number of discriminators.

  • loss_type (str) – Loss type, “mse” or “hinge”.

forward(outputs: Union[List[List[torch.Tensor]], List[torch.Tensor], torch.Tensor]) → torch.Tensor[source]

Calculate generator adversarial loss.

Parameters

outputs (Union[List[List[Tensor]], List[Tensor], Tensor]) – Discriminator outputs, list of discriminator outputs, or list of list of discriminator outputs.

Returns

Generator adversarial loss value.

Return type

Tensor
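
On the generator side, the same two loss types reduce to short expressions. The sketch assumes the usual least-squares and hinge formulations applied to one discriminator's flattened scores for generated audio. The function name is hypothetical.

```python
from typing import List


def generator_adv_loss(outputs: List[float], loss_type: str = "mse") -> float:
    """Adversarial loss for the generator from discriminator scores."""
    if loss_type == "mse":
        # The generator wants its samples to be scored as real (1).
        return sum((o - 1.0) ** 2 for o in outputs) / len(outputs)
    if loss_type == "hinge":
        # The generator wants the scores to be as large as possible.
        return -sum(outputs) / len(outputs)
    raise ValueError(f"unsupported loss_type: {loss_type}")


g_loss = generator_adv_loss([1.0, 0.0])
print(g_loss)  # 0.5
```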

class espnet2.gan_tts.hifigan.loss.MelSpectrogramLoss(fs: int = 22050, n_fft: int = 1024, hop_length: int = 256, win_length: Optional[int] = None, window: str = 'hann', n_mels: int = 80, fmin: Optional[int] = 0, fmax: Optional[int] = None, center: bool = True, normalized: bool = False, onesided: bool = True, log_base: Optional[float] = 10.0)[source]

Bases: torch.nn.modules.module.Module

Mel-spectrogram loss.

Initialize Mel-spectrogram loss.

Parameters
  • fs (int) – Sampling rate.

  • n_fft (int) – FFT points.

  • hop_length (int) – Hop length.

  • win_length (Optional[int]) – Window length.

  • window (str) – Window type.

  • n_mels (int) – Number of Mel basis.

  • fmin (Optional[int]) – Minimum frequency for Mel.

  • fmax (Optional[int]) – Maximum frequency for Mel.

  • center (bool) – Whether to use center window.

  • normalized (bool) – Whether to use a normalized STFT.

  • onesided (bool) – Whether to use a one-sided STFT.

  • log_base (Optional[float]) – Log base value.

forward(y_hat: torch.Tensor, y: torch.Tensor, spec: Optional[torch.Tensor] = None) → torch.Tensor[source]

Calculate Mel-spectrogram loss.

Parameters
  • y_hat (Tensor) – Generated waveform tensor (B, 1, T).

  • y (Tensor) – Groundtruth waveform tensor (B, 1, T).

  • spec (Optional[Tensor]) – Groundtruth linear amplitude spectrum tensor (B, n_fft, T). If provided, it is used instead of the groundtruth waveform.

Returns

Mel-spectrogram loss value.

Return type

Tensor
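
The sketch below covers only the final comparison step. It assumes the loss is an L1 distance between log-compressed mel values. The STFT and mel filterbank are omitted, and log_mel_l1 and its eps floor are hypothetical names chosen for this example.

```python
import math
from typing import List


def log_mel_l1(
    mel_hat: List[float], mel: List[float], log_base: float = 10.0, eps: float = 1e-10
) -> float:
    """Mean absolute difference between log-compressed mel magnitudes."""
    total = 0.0
    for m_hat, m in zip(mel_hat, mel):
        # Clamp before the logarithm to avoid log(0).
        log_hat = math.log(max(m_hat, eps), log_base)
        log_ref = math.log(max(m, eps), log_base)
        total += abs(log_hat - log_ref)
    return total / len(mel)


loss = log_mel_l1(mel_hat=[1.0, 100.0], mel=[10.0, 100.0])
print(loss)
```

In this toy input the first bin differs by one decade and the second bin matches, so the mean is about 0.5.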

espnet2.gan_tts.hifigan.__init__

espnet2.gan_tts.hifigan.residual_block

HiFiGAN Residual block modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.hifigan.residual_block.ResidualBlock(kernel_size: int = 3, channels: int = 512, dilations: List[int] = [1, 3, 5], bias: bool = True, use_additional_convs: bool = True, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.1})[source]

Bases: torch.nn.modules.module.Module

Residual block module in HiFiGAN.

Initialize ResidualBlock module.

Parameters
  • kernel_size (int) – Kernel size of dilated convolution layer.

  • channels (int) – Number of channels for convolution layer.

  • dilations (List[int]) – List of dilation factors.

  • use_additional_convs (bool) – Whether to use additional convolution layers.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

forward(x: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters

x (Tensor) – Input tensor (B, channels, T).

Returns

Output tensor (B, channels, T).

Return type

Tensor

espnet2.gan_tts.hifigan.hifigan

HiFi-GAN Modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.hifigan.hifigan.HiFiGANGenerator(in_channels: int = 80, out_channels: int = 1, channels: int = 512, global_channels: int = -1, kernel_size: int = 7, upsample_scales: List[int] = [8, 8, 2, 2], upsample_kernel_sizes: List[int] = [16, 16, 4, 4], resblock_kernel_sizes: List[int] = [3, 7, 11], resblock_dilations: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], use_additional_convs: bool = True, bias: bool = True, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.1}, use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

HiFiGAN generator module.

Initialize HiFiGANGenerator module.

Parameters
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • channels (int) – Number of hidden representation channels.

  • global_channels (int) – Number of global conditioning channels.

  • kernel_size (int) – Kernel size of initial and final conv layer.

  • upsample_scales (List[int]) – List of upsampling scales.

  • upsample_kernel_sizes (List[int]) – List of kernel sizes for upsample layers.

  • resblock_kernel_sizes (List[int]) – List of kernel sizes for residual blocks.

  • resblock_dilations (List[List[int]]) – List of list of dilations for residual blocks.

  • use_additional_convs (bool) – Whether to use additional conv layers in residual blocks.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

apply_weight_norm()[source]

Apply weight normalization module to all of the layers.

forward(c: torch.Tensor, g: Optional[torch.Tensor] = None) → torch.Tensor[source]

Calculate forward propagation.

Parameters
  • c (Tensor) – Input tensor (B, in_channels, T).

  • g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).

Returns

Output tensor (B, out_channels, T).

Return type

Tensor

inference(c: torch.Tensor, g: Optional[torch.Tensor] = None) → torch.Tensor[source]

Perform inference.

Parameters
  • c (torch.Tensor) – Input tensor (T, in_channels).

  • g (Optional[Tensor]) – Global conditioning tensor (global_channels, 1).

Returns

Output tensor (T * upsample_factor, out_channels), where upsample_factor = prod(upsample_scales).

Return type

Tensor
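
The generator lengthens its input by the product of upsample_scales. The check below uses the default values from the signature above; the frame count is an arbitrary example.

```python
import math

upsample_scales = [8, 8, 2, 2]  # default from the signature
upsample_kernel_sizes = [16, 16, 4, 4]  # default from the signature
upsample_factor = math.prod(upsample_scales)

t_frames = 50  # example number of input frames (arbitrary)
t_samples = t_frames * upsample_factor
print(upsample_factor, t_samples)  # 256 12800

# In the defaults, every upsampling kernel is twice its scale.
kernel_is_double = all(k == 2 * s for k, s in zip(upsample_kernel_sizes, upsample_scales))
print(kernel_is_double)  # True
```

The default factor of 256 equals the default hop_length of MelSpectrogramLoss, so one input frame corresponds to one hop of waveform samples.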

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

reset_parameters()[source]

Reset parameters.

This initialization follows the official implementation manner. https://github.com/jik876/hifi-gan/blob/master/models.py

class espnet2.gan_tts.hifigan.hifigan.HiFiGANMultiPeriodDiscriminator(periods: List[int] = [2, 3, 5, 7, 11], discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True})[source]

Bases: torch.nn.modules.module.Module

HiFiGAN multi-period discriminator module.

Initialize HiFiGANMultiPeriodDiscriminator module.

Parameters
  • periods (List[int]) – List of periods.

  • discriminator_params (Dict[str, Any]) – Parameters for hifi-gan period discriminator module. The period parameter will be overwritten.

forward(x: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters

x (Tensor) – Input signal (B, 1, T).

Returns

List of list of each discriminator outputs, which consists of each layer output tensors.

Return type

List

class espnet2.gan_tts.hifigan.hifigan.HiFiGANMultiScaleDiscriminator(scales: int = 3, downsample_pooling: str = 'AvgPool1d', downsample_pooling_params: Dict[str, Any] = {'kernel_size': 4, 'padding': 2, 'stride': 2}, discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1}, follow_official_norm: bool = False)[source]

Bases: torch.nn.modules.module.Module

HiFi-GAN multi-scale discriminator module.

Initialize HiFiGAN multi-scale discriminator module.

Parameters
  • scales (int) – Number of multi-scales.

  • downsample_pooling (str) – Pooling module name for downsampling of the inputs.

  • downsample_pooling_params (Dict[str, Any]) – Parameters for the above pooling module.

  • discriminator_params (Dict[str, Any]) – Parameters for hifi-gan scale discriminator module.

  • follow_official_norm (bool) – Whether to follow the norm setting of the official implementation. The first discriminator uses spectral norm and the other discriminators use weight norm.

forward(x: torch.Tensor) → List[List[torch.Tensor]][source]

Calculate forward propagation.

Parameters

x (Tensor) – Input signal (B, 1, T).

Returns

List of list of each discriminator outputs, which consists of each layer output tensors.

Return type

List[List[torch.Tensor]]
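
To see what each scale receives, the sketch below applies the standard 1D pooling output-length formula with the default downsample_pooling_params. It assumes the first discriminator sees the raw input and each later one sees the pooled output of the previous stage. The helper names are hypothetical.

```python
from typing import List


def avg_pool1d_length(length: int, kernel_size: int = 4, stride: int = 2, padding: int = 2) -> int:
    """Output length of a 1D pooling layer in floor mode."""
    return (length + 2 * padding - kernel_size) // stride + 1


def scale_input_lengths(t: int, scales: int = 3) -> List[int]:
    """Input length seen by each scale discriminator (assumed ordering)."""
    lengths = []
    for _ in range(scales):
        lengths.append(t)
        t = avg_pool1d_length(t)
    return lengths


lengths = scale_input_lengths(8192)
print(lengths)  # [8192, 4097, 2049]
```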

class espnet2.gan_tts.hifigan.hifigan.HiFiGANMultiScaleMultiPeriodDiscriminator(scales: int = 3, scale_downsample_pooling: str = 'AvgPool1d', scale_downsample_pooling_params: Dict[str, Any] = {'kernel_size': 4, 'padding': 2, 'stride': 2}, scale_discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1}, follow_official_norm: bool = True, periods: List[int] = [2, 3, 5, 7, 11], period_discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True})[source]

Bases: torch.nn.modules.module.Module

HiFi-GAN multi-scale + multi-period discriminator module.

Initialize HiFiGAN multi-scale + multi-period discriminator module.

Parameters
  • scales (int) – Number of multi-scales.

  • scale_downsample_pooling (str) – Pooling module name for downsampling of the inputs.

  • scale_downsample_pooling_params (dict) – Parameters for the above pooling module.

  • scale_discriminator_params (dict) – Parameters for hifi-gan scale discriminator module.

  • follow_official_norm (bool) – Whether to follow the norm setting of the official implementation. The first discriminator uses spectral norm and the other discriminators use weight norm.

  • periods (list) – List of periods.

  • period_discriminator_params (dict) – Parameters for hifi-gan period discriminator module. The period parameter will be overwritten.

forward(x: torch.Tensor) → List[List[torch.Tensor]][source]

Calculate forward propagation.

Parameters

x (Tensor) – Input signal (B, 1, T).

Returns

List of list of each discriminator outputs, which consists of each layer output tensors. Multi scale and multi period ones are concatenated.

Return type

List[List[Tensor]]

class espnet2.gan_tts.hifigan.hifigan.HiFiGANPeriodDiscriminator(in_channels: int = 1, out_channels: int = 1, period: int = 3, kernel_sizes: List[int] = [5, 3], channels: int = 32, downsample_scales: List[int] = [3, 3, 3, 3, 1], max_downsample_channels: int = 1024, bias: bool = True, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.1}, use_weight_norm: bool = True, use_spectral_norm: bool = False)[source]

Bases: torch.nn.modules.module.Module

HiFiGAN period discriminator module.

Initialize HiFiGANPeriodDiscriminator module.

Parameters
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • period (int) – Period.

  • kernel_sizes (list) – Kernel sizes of initial conv layers and the final conv layer.

  • channels (int) – Number of initial channels.

  • downsample_scales (List[int]) – List of downsampling scales.

  • max_downsample_channels (int) – Number of maximum downsampling channels.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

  • use_spectral_norm (bool) – Whether to use spectral norm. If set to true, it will be applied to all of the conv layers.

apply_spectral_norm()[source]

Apply spectral normalization module to all of the layers.

apply_weight_norm()[source]

Apply weight normalization module to all of the layers.

forward(x: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters

x (Tensor) – Input tensor (B, in_channels, T).

Returns

List of each layer’s tensors.

Return type

list
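
The role of period can be pictured as folding the waveform into a 2D grid with period columns. The sketch below does this for a single channel stored as a list. It assumes reflect padding on the right when T is not a multiple of period. The to_period_grid helper is hypothetical.

```python
from typing import List


def to_period_grid(x: List[float], period: int) -> List[List[float]]:
    """Fold a 1D signal of length T into rows of `period` samples."""
    remainder = len(x) % period
    if remainder != 0:
        n_pad = period - remainder
        # Reflect-pad on the right: mirror without repeating the edge sample.
        x = x + x[-2 : -2 - n_pad : -1]
    return [x[i : i + period] for i in range(0, len(x), period)]


grid = to_period_grid([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0], period=3)
print(grid)  # [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 5.0, 4.0]]
```

Samples in the same column are exactly period steps apart, which is what lets 2D convolutions over the grid pick up periodic structure.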

class espnet2.gan_tts.hifigan.hifigan.HiFiGANScaleDiscriminator(in_channels: int = 1, out_channels: int = 1, kernel_sizes: List[int] = [15, 41, 5, 3], channels: int = 128, max_downsample_channels: int = 1024, max_groups: int = 16, bias: int = True, downsample_scales: List[int] = [2, 2, 4, 4, 1], nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.1}, use_weight_norm: bool = True, use_spectral_norm: bool = False)[source]

Bases: torch.nn.modules.module.Module

HiFi-GAN scale discriminator module.

Initialize HiFiGAN scale discriminator module.

Parameters
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • kernel_sizes (List[int]) – List of four kernel sizes. The first will be used for the first conv layer, and the second is for downsampling part, and the remaining two are for the last two output layers.

  • channels (int) – Initial number of channels for conv layer.

  • max_downsample_channels (int) – Maximum number of channels for downsampling layers.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • downsample_scales (List[int]) – List of downsampling scales.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

  • use_spectral_norm (bool) – Whether to use spectral norm. If set to true, it will be applied to all of the conv layers.

apply_spectral_norm()[source]

Apply spectral normalization module to all of the layers.

apply_weight_norm()[source]

Apply weight normalization module to all of the layers.

forward(x: torch.Tensor) → List[torch.Tensor][source]

Calculate forward propagation.

Parameters

x (Tensor) – Input signal (B, 1, T).

Returns

List of output tensors of each layer.

Return type

List[Tensor]

espnet2.gan_tts.wavenet.__init__

espnet2.gan_tts.wavenet.residual_block

Residual block modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.wavenet.residual_block.Conv1d(*args, **kwargs)[source]

Bases: torch.nn.modules.conv.Conv1d

Conv1d module with customized initialization.

Initialize Conv1d module.

reset_parameters()[source]

Reset parameters.

class espnet2.gan_tts.wavenet.residual_block.Conv1d1x1(in_channels: int, out_channels: int, bias: bool)[source]

Bases: espnet2.gan_tts.wavenet.residual_block.Conv1d

1x1 Conv1d with customized initialization.

Initialize 1x1 Conv1d module.

class espnet2.gan_tts.wavenet.residual_block.ResidualBlock(kernel_size: int = 3, residual_channels: int = 64, gate_channels: int = 128, skip_channels: int = 64, aux_channels: int = 80, global_channels: int = -1, dropout_rate: float = 0.0, dilation: int = 1, bias: bool = True, scale_residual: bool = False)[source]

Bases: torch.nn.modules.module.Module

Residual block module in WaveNet.

Initialize ResidualBlock module.

Parameters
  • kernel_size (int) – Kernel size of dilated convolution layer.

  • residual_channels (int) – Number of channels for residual connection.

  • skip_channels (int) – Number of channels for skip connection.

  • aux_channels (int) – Number of local conditioning channels.

  • dropout_rate (float) – Dropout probability.

  • dilation (int) – Dilation factor.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • scale_residual (bool) – Whether to scale the residual outputs.

forward(x: torch.Tensor, x_mask: Optional[torch.Tensor] = None, c: Optional[torch.Tensor] = None, g: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Calculate forward propagation.

Parameters
  • x (Tensor) – Input tensor (B, residual_channels, T).

  • x_mask (Optional[Tensor]) – Mask tensor (B, 1, T).

  • c (Optional[Tensor]) – Local conditioning tensor (B, aux_channels, T).

  • g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).

Returns

  • Tensor: Output tensor for residual connection (B, residual_channels, T).

  • Tensor: Output tensor for skip connection (B, skip_channels, T).

Return type

Tuple[Tensor, Tensor]
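
The core of a WaveNet-style residual block is the gated activation followed by the residual sum. The scalar sketch below assumes a tanh/sigmoid gate over the two halves of the gate channels and a sqrt(0.5) factor when scale_residual is enabled. The convolutions, conditioning inputs, and skip projection are omitted, and both helper names are hypothetical.

```python
import math


def gated_unit(xa: float, xb: float) -> float:
    """Gated activation: tanh on one half, sigmoid gate on the other."""
    return math.tanh(xa) * (1.0 / (1.0 + math.exp(-xb)))


def residual_output(residual: float, block_out: float, scale_residual: bool = False) -> float:
    """Add the block output to its input, optionally rescaled."""
    out = residual + block_out
    return out * math.sqrt(0.5) if scale_residual else out


print(gated_unit(0.0, 0.0))  # 0.0
print(residual_output(1.0, 1.0))  # 2.0
```

The sqrt(0.5) factor keeps the variance of the sum roughly constant when both terms have similar scale.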

espnet2.gan_tts.wavenet.wavenet

WaveNet modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.wavenet.wavenet.WaveNet(in_channels: int = 1, out_channels: int = 1, kernel_size: int = 3, layers: int = 30, stacks: int = 3, base_dilation: int = 2, residual_channels: int = 64, aux_channels: int = -1, gate_channels: int = 128, skip_channels: int = 64, global_channels: int = -1, dropout_rate: float = 0.0, bias: bool = True, use_weight_norm: bool = True, use_first_conv: bool = False, use_last_conv: bool = False, scale_residual: bool = False, scale_skip_connect: bool = False)[source]

Bases: torch.nn.modules.module.Module

WaveNet with global conditioning.

Initialize WaveNet module.

Parameters
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • kernel_size (int) – Kernel size of dilated convolution.

  • layers (int) – Number of residual block layers.

  • stacks (int) – Number of stacks, i.e., dilation cycles.

  • base_dilation (int) – Base dilation factor.

  • residual_channels (int) – Number of channels in residual conv.

  • gate_channels (int) – Number of channels in gated conv.

  • skip_channels (int) – Number of channels in skip conv.

  • aux_channels (int) – Number of channels for local conditioning feature.

  • global_channels (int) – Number of channels for global conditioning feature.

  • dropout_rate (float) – Dropout rate. 0.0 means no dropout applied.

  • bias (bool) – Whether to use bias parameter in conv layer.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

  • use_first_conv (bool) – Whether to use the first conv layers.

  • use_last_conv (bool) – Whether to use the last conv layers.

  • scale_residual (bool) – Whether to scale the residual outputs.

  • scale_skip_connect (bool) – Whether to scale the skip connection outputs.

apply_weight_norm()[source]

Apply weight normalization to all of the layers.

forward(x: torch.Tensor, x_mask: Optional[torch.Tensor] = None, c: Optional[torch.Tensor] = None, g: Optional[torch.Tensor] = None) → torch.Tensor[source]

Calculate forward propagation.

Parameters
  • x (Tensor) – Input noise signal (B, 1, T) if use_first_conv else (B, residual_channels, T).

  • x_mask (Optional[Tensor]) – Mask tensor (B, 1, T).

  • c (Optional[Tensor]) – Local conditioning features (B, aux_channels, T).

  • g (Optional[Tensor]) – Global conditioning features (B, global_channels, 1).

Returns

Output tensor (B, out_channels, T) if use_last_conv else (B, residual_channels, T).

Return type

Tensor

property receptive_field_size

Return receptive field size.
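
As a rough illustration of what this property reports, the sketch below computes the receptive field of a stack of dilated convolutions from the constructor arguments. It assumes that each layer's dilation is base_dilation raised to the layer's index within its stack and that layers are split evenly across stacks. The function name is hypothetical, and the library's own computation may differ.

```python
def receptive_field_size_sketch(layers=30, stacks=3, kernel_size=3, base_dilation=2):
    """Receptive field of stacked dilated convs (assumed dilation schedule)."""
    assert layers % stacks == 0, "layers must be divisible by stacks"
    layers_per_stack = layers // stacks
    # Assumption: the dilation restarts at 1 at the beginning of every stack.
    dilations = [base_dilation ** (i % layers_per_stack) for i in range(layers)]
    # Each layer widens the field by (kernel_size - 1) * dilation samples.
    return (kernel_size - 1) * sum(dilations) + 1


# Defaults taken from the WaveNet signature above.
print(receptive_field_size_sketch())
```

Under these assumptions, increasing stacks while keeping layers fixed shortens each dilation cycle and therefore shrinks the receptive field.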

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

espnet2.gan_tts.melgan.melgan

MelGAN Modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.melgan.melgan.MelGANDiscriminator(in_channels: int = 1, out_channels: int = 1, kernel_sizes: List[int] = [5, 3], channels: int = 16, max_downsample_channels: int = 1024, bias: bool = True, downsample_scales: List[int] = [4, 4, 4, 4], nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.2}, pad: str = 'ReflectionPad1d', pad_params: Dict[str, Any] = {})[source]

Bases: torch.nn.modules.module.Module

MelGAN discriminator module.

Initialize MelGANDiscriminator module.

Parameters
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • kernel_sizes (List[int]) – List of two kernel sizes. Their product is used as the kernel size of the first conv layer, and the first and second kernel sizes are used for the last two layers. For example, if kernel_sizes = [5, 3], the first layer’s kernel size is 5 * 3 = 15, and the last two layers’ kernel sizes are 5 and 3, respectively.

  • channels (int) – Initial number of channels for conv layer.

  • max_downsample_channels (int) – Maximum number of channels for downsampling layers.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • downsample_scales (List[int]) – List of downsampling scales.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • pad (str) – Padding function module name before dilated convolution layer.

  • pad_params (Dict[str, Any]) – Hyperparameters for padding function.
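
The kernel_sizes rule above can be checked with a few lines of arithmetic. This is only a sketch using the default values from the signature. The overall downsampling factor additionally assumes that each downsampling layer reduces the time resolution by exactly its scale, which the text above does not state.

```python
import math

kernel_sizes = [5, 3]             # default from the signature above
downsample_scales = [4, 4, 4, 4]  # default from the signature above

# Kernel size of the first conv layer: product of the two entries.
first_kernel = math.prod(kernel_sizes)
# Kernel sizes of the last two layers: the entries themselves.
last_two_kernels = tuple(kernel_sizes)
# Assumed overall time reduction across the downsampling layers.
total_downsampling = math.prod(downsample_scales)

print(first_kernel, last_two_kernels, total_downsampling)
```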

forward(x: torch.Tensor) → List[torch.Tensor][source]

Calculate forward propagation.

Parameters

x (Tensor) – Input signal (B, 1, T).

Returns

List of output tensors of each layer.

Return type

List[Tensor]

class espnet2.gan_tts.melgan.melgan.MelGANGenerator(in_channels: int = 80, out_channels: int = 1, kernel_size: int = 7, channels: int = 512, bias: bool = True, upsample_scales: List[int] = [8, 8, 2, 2], stack_kernel_size: int = 3, stacks: int = 3, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.2}, pad: str = 'ReflectionPad1d', pad_params: Dict[str, Any] = {}, use_final_nonlinear_activation: bool = True, use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

MelGAN generator module.

Initialize MelGANGenerator module.

Parameters
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • kernel_size (int) – Kernel size of initial and final conv layer.

  • channels (int) – Initial number of channels for conv layer.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • upsample_scales (List[int]) – List of upsampling scales.

  • stack_kernel_size (int) – Kernel size of dilated conv layers in residual stack.

  • stacks (int) – Number of stacks in a single residual stack.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • pad (str) – Padding function module name before dilated convolution layer.

  • pad_params (Dict[str, Any]) – Hyperparameters for padding function.

  • use_final_nonlinear_activation (bool) – Whether to apply a nonlinear activation function to the final layer.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

apply_weight_norm()[source]

Apply weight normalization to all of the layers.

forward(c: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters

c (Tensor) – Input tensor (B, in_channels, T).

Returns

Output tensor (B, 1, T * prod(upsample_scales)).

Return type

Tensor
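
The output length is the input length multiplied by the product of the upsampling scales. A minimal sketch of that arithmetic, using the default upsample_scales from the signature (the helper name is hypothetical):

```python
import math


def melgan_output_length(num_frames, upsample_scales=(8, 8, 2, 2)):
    """Number of waveform samples produced for num_frames input frames."""
    return num_frames * math.prod(upsample_scales)


# 100 input frames with the default scales (product 256).
print(melgan_output_length(100))
```

For a vocoder, this product is typically chosen to equal the hop size used when extracting the input features, so that the generated waveform lines up with the target.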

inference(c: torch.Tensor) → torch.Tensor[source]

Perform inference.

Parameters

c (Tensor) – Input tensor (T, in_channels).

Returns

Output tensor (T * prod(upsample_scales), out_channels).

Return type

Tensor

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

reset_parameters()[source]

Reset parameters.

This initialization follows the official implementation: https://github.com/descriptinc/melgan-neurips/blob/master/mel2wav/modules.py

class espnet2.gan_tts.melgan.melgan.MelGANMultiScaleDiscriminator(in_channels: int = 1, out_channels: int = 1, scales: int = 3, downsample_pooling: str = 'AvgPool1d', downsample_pooling_params: Dict[str, Any] = {'count_include_pad': False, 'kernel_size': 4, 'padding': 1, 'stride': 2}, kernel_sizes: List[int] = [5, 3], channels: int = 16, max_downsample_channels: int = 1024, bias: bool = True, downsample_scales: List[int] = [4, 4, 4, 4], nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.2}, pad: str = 'ReflectionPad1d', pad_params: Dict[str, Any] = {}, use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

MelGAN multi-scale discriminator module.

Initialize MelGANMultiScaleDiscriminator module.

Parameters
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • scales (int) – Number of multi-scales.

  • downsample_pooling (str) – Pooling module name for downsampling of the inputs.

  • downsample_pooling_params (Dict[str, Any]) – Parameters for the above pooling module.

  • kernel_sizes (List[int]) – List of two kernel sizes. Their product is used as the kernel size of the first conv layer, and the first and second kernel sizes are used for the last two layers.

  • channels (int) – Initial number of channels for conv layer.

  • max_downsample_channels (int) – Maximum number of channels for downsampling layers.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • downsample_scales (List[int]) – List of downsampling scales.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • pad (str) – Padding function module name before dilated convolution layer.

  • pad_params (Dict[str, Any]) – Hyperparameters for padding function.

  • use_weight_norm (bool) – Whether to use weight norm.
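
To get a feel for the scales and downsample_pooling_params arguments, the sketch below applies the standard output-length formula of torch.nn.AvgPool1d with the default pooling parameters. It assumes that the input is pooled once more before each scale after the first. That wiring is an assumption here, and the helper names are hypothetical.

```python
def avg_pool1d_length(length, kernel_size=4, stride=2, padding=1):
    """Output length of a 1D average pooling layer (ceil_mode disabled)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def input_lengths_per_scale(length, scales=3):
    """Input length seen at each scale, assuming pooling between scales."""
    lengths = [length]
    for _ in range(scales - 1):
        lengths.append(avg_pool1d_length(lengths[-1]))
    return lengths


print(input_lengths_per_scale(8192))
```

With the default parameters each pooling step halves the length, so under this assumption the three discriminators would see the signal at full, half, and quarter resolution.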

apply_weight_norm()[source]

Apply weight normalization to all of the layers.

forward(x: torch.Tensor) → List[List[torch.Tensor]][source]

Calculate forward propagation.

Parameters

x (Tensor) – Input signal (B, 1, T).

Returns

List of lists of discriminator outputs, where each inner list consists of the output tensors of each layer of one discriminator.

Return type

List[List[Tensor]]

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

reset_parameters()[source]

Reset parameters.

This initialization follows the official implementation: https://github.com/descriptinc/melgan-neurips/blob/master/mel2wav/modules.py

espnet2.gan_tts.melgan.__init__

espnet2.gan_tts.melgan.residual_stack

Residual stack module in MelGAN.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.melgan.residual_stack.ResidualStack(kernel_size: int = 3, channels: int = 32, dilation: int = 1, bias: bool = True, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.2}, pad: str = 'ReflectionPad1d', pad_params: Dict[str, Any] = {})[source]

Bases: torch.nn.modules.module.Module

Residual stack module introduced in MelGAN.

Initialize ResidualStack module.

Parameters
  • kernel_size (int) – Kernel size of dilation convolution layer.

  • channels (int) – Number of channels of convolution layers.

  • dilation (int) – Dilation factor.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • pad (str) – Padding function module name before dilated convolution layer.

  • pad_params (Dict[str, Any]) – Hyperparameters for padding function.

forward(c: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters

c (Tensor) – Input tensor (B, channels, T).

Returns

Output tensor (B, channels, T).

Return type

Tensor

espnet2.gan_tts.melgan.pqmf

Pseudo QMF modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.melgan.pqmf.PQMF(subbands: int = 4, taps: int = 62, cutoff_ratio: float = 0.142, beta: float = 9.0)[source]

Bases: torch.nn.modules.module.Module

PQMF module.

This module is based on Near-perfect-reconstruction pseudo-QMF banks.

Initialize PQMF module.

The cutoff_ratio and beta parameters are optimized for #subbands = 4. See the discussion in https://github.com/kan-bayashi/ParallelWaveGAN/issues/195.

Parameters
  • subbands (int) – The number of subbands.

  • taps (int) – The number of filter taps.

  • cutoff_ratio (float) – Cut-off frequency ratio.

  • beta (float) – Beta coefficient for kaiser window.

analysis(x: torch.Tensor) → torch.Tensor[source]

Analysis with PQMF.

Parameters

x (Tensor) – Input tensor (B, 1, T).

Returns

Output tensor (B, subbands, T // subbands).

Return type

Tensor

synthesis(x: torch.Tensor) → torch.Tensor[source]

Synthesis with PQMF.

Parameters

x (Tensor) – Input tensor (B, subbands, T // subbands).

Returns

Output tensor (B, 1, T).

Return type

Tensor

espnet2.gan_tts.melgan.pqmf.design_prototype_filter(taps: int = 62, cutoff_ratio: float = 0.142, beta: float = 9.0) → numpy.ndarray[source]

Design prototype filter for PQMF.

This method is based on A Kaiser window approach for the design of prototype filters of cosine modulated filterbanks.

Parameters
  • taps (int) – The number of filter taps.

  • cutoff_ratio (float) – Cut-off frequency ratio.

  • beta (float) – Beta coefficient for kaiser window.

Returns

Impulse response of prototype filter (taps + 1,).

Return type

ndarray
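
For intuition, a Kaiser-window prototype filter can be sketched in plain Python as an ideal low-pass (sinc) response multiplied by a Kaiser window. This is an illustration of the general windowed-sinc technique under the assumption that cutoff_ratio is expressed as a fraction of pi. It returns a list rather than an ndarray and is not claimed to match the library's implementation. Both function names are hypothetical.

```python
import math


def bessel_i0(x, terms=50):
    """Zeroth-order modified Bessel function via its power series."""
    total, term = 0.0, 1.0
    for k in range(terms):
        if k > 0:
            term *= (x / 2.0) / k  # term == (x / 2) ** k / k!
        total += term * term
    return total


def design_prototype_filter_sketch(taps=62, cutoff_ratio=0.142, beta=9.0):
    """Windowed-sinc low-pass filter with taps + 1 coefficients."""
    assert taps % 2 == 0, "taps must be even"
    assert 0.0 < cutoff_ratio < 1.0
    omega_c = math.pi * cutoff_ratio  # assumed cutoff in radians per sample
    center = taps // 2
    i0_beta = bessel_i0(beta)
    h = []
    for n in range(taps + 1):
        m = n - center
        # Ideal low-pass impulse response; the m == 0 case is its limit.
        ideal = cutoff_ratio if m == 0 else math.sin(omega_c * m) / (math.pi * m)
        # Kaiser window of length taps + 1 centred on the middle tap.
        ratio = m / center
        window = bessel_i0(beta * math.sqrt(1.0 - ratio * ratio)) / i0_beta
        h.append(ideal * window)
    return h


h = design_prototype_filter_sketch()
print(len(h), h[len(h) // 2])
```

The resulting response is symmetric around the middle tap, which gives the filter linear phase.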

espnet2.gan_tts.utils.get_random_segments

Function to get random segments.

espnet2.gan_tts.utils.get_random_segments.get_random_segments(x: torch.Tensor, x_lengths: torch.Tensor, segment_size: int) → Tuple[torch.Tensor, torch.Tensor][source]

Get random segments.

Parameters
  • x (Tensor) – Input tensor (B, C, T).

  • x_lengths (Tensor) – Length tensor (B,).

  • segment_size (int) – Segment size.

Returns

  • Tensor: Segmented tensor (B, C, segment_size).

  • Tensor: Start index tensor (B,).

Return type

Tuple[Tensor, Tensor]

espnet2.gan_tts.utils.get_random_segments.get_segments(x: torch.Tensor, start_idxs: torch.Tensor, segment_size: int) → torch.Tensor[source]

Get segments.

Parameters
  • x (Tensor) – Input tensor (B, C, T).

  • start_idxs (Tensor) – Start index tensor (B,).

  • segment_size (int) – Segment size.

Returns

Segmented tensor (B, C, segment_size).

Return type

Tensor
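
The behaviour of the two helpers above can be mimicked with nested Python lists in place of tensors. This is a plain-Python analogue for illustration only, assuming that start indices are drawn uniformly from the positions where a full segment still fits inside each item's valid length. The function names are hypothetical.

```python
import random


def get_segments_sketch(x, start_idxs, segment_size):
    """Slice segment_size steps from every channel of every batch item."""
    return [
        [channel[start:start + segment_size] for channel in item]
        for item, start in zip(x, start_idxs)
    ]


def get_random_segments_sketch(x, x_lengths, segment_size, rng=random):
    """Pick a random valid start per item, then slice as above."""
    # Assumption: every length is at least segment_size.
    start_idxs = [rng.randint(0, length - segment_size) for length in x_lengths]
    return get_segments_sketch(x, start_idxs, segment_size), start_idxs


# Batch of B=2 items, C=1 channel, T=6 steps; the second item has 4 valid steps.
x = [[[0, 1, 2, 3, 4, 5]], [[10, 11, 12, 13, 0, 0]]]
segments, starts = get_random_segments_sketch(x, x_lengths=[6, 4], segment_size=3)
print(segments, starts)
```

Returning the start indices alongside the segments makes it possible to cut the matching region out of another aligned tensor with the second helper.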

espnet2.gan_tts.utils.__init__

espnet2.gan_tts.joint.joint_text2wav

Joint text-to-wav module for end-to-end training.

class espnet2.gan_tts.joint.joint_text2wav.JointText2Wav(idim: int, odim: int, segment_size: int = 32, sampling_rate: int = 22050, text2mel_type: str = 'fastspeech2', text2mel_params: Dict[str, Any] = {'adim': 384, 'aheads': 2, 'conformer_activation_type': 'swish', 'conformer_dec_kernel_size': 31, 'conformer_enc_kernel_size': 7, 'conformer_pos_enc_layer_type': 'rel_pos', 'conformer_rel_pos_type': 'latest', 'conformer_self_attn_layer_type': 'rel_selfattn', 'decoder_concat_after': False, 'decoder_normalize_before': True, 'decoder_type': 'conformer', 'dlayers': 4, 'dunits': 1536, 'duration_predictor_chans': 384, 'duration_predictor_dropout_rate': 0.1, 'duration_predictor_kernel_size': 3, 'duration_predictor_layers': 2, 'elayers': 4, 'encoder_concat_after': False, 'encoder_normalize_before': True, 'encoder_type': 'conformer', 'energy_embed_dropout': 0.5, 'energy_embed_kernel_size': 1, 'energy_predictor_chans': 384, 'energy_predictor_dropout': 0.5, 'energy_predictor_kernel_size': 3, 'energy_predictor_layers': 2, 'eunits': 1536, 'gst_conv_chans_list': [32, 32, 64, 64, 128, 128], 'gst_conv_kernel_size': 3, 'gst_conv_layers': 6, 'gst_conv_stride': 2, 'gst_gru_layers': 1, 'gst_gru_units': 128, 'gst_heads': 4, 'gst_tokens': 10, 'init_dec_alpha': 1.0, 'init_enc_alpha': 1.0, 'init_type': 'xavier_uniform', 'langs': -1, 'pitch_embed_dropout': 0.5, 'pitch_embed_kernel_size': 1, 'pitch_predictor_chans': 384, 'pitch_predictor_dropout': 0.5, 'pitch_predictor_kernel_size': 5, 'pitch_predictor_layers': 5, 'positionwise_conv_kernel_size': 1, 'positionwise_layer_type': 'conv1d', 'postnet_chans': 512, 'postnet_dropout_rate': 0.5, 'postnet_filts': 5, 'postnet_layers': 5, 'reduction_factor': 1, 'spk_embed_dim': None, 'spk_embed_integration_type': 'add', 'spks': -1, 'stop_gradient_from_energy_predictor': False, 'stop_gradient_from_pitch_predictor': True, 'transformer_dec_attn_dropout_rate': 0.1, 'transformer_dec_dropout_rate': 0.1, 'transformer_dec_positional_dropout_rate': 0.1, 
'transformer_enc_attn_dropout_rate': 0.1, 'transformer_enc_dropout_rate': 0.1, 'transformer_enc_positional_dropout_rate': 0.1, 'use_batch_norm': True, 'use_cnn_in_conformer': True, 'use_gst': False, 'use_macaron_style_in_conformer': True, 'use_masking': False, 'use_scaled_pos_enc': True, 'use_weighted_masking': False, 'zero_triu': False}, vocoder_type: str = 'hifigan_generator', vocoder_params: Dict[str, Any] = {'bias': True, 'channels': 512, 'global_channels': -1, 'kernel_size': 7, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'resblock_dilations': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'resblock_kernel_sizes': [3, 7, 11], 'upsample_kernel_sizes': [16, 16, 4, 4], 'upsample_scales': [8, 8, 2, 2], 'use_additional_convs': True, 'use_weight_norm': True}, use_pqmf: bool = False, pqmf_params: Dict[str, Any] = {'beta': 9.0, 'cutoff_ratio': 0.142, 'subbands': 4, 'taps': 62}, discriminator_type: str = 'hifigan_multi_scale_multi_period_discriminator', discriminator_params: Dict[str, Any] = {'follow_official_norm': False, 'period_discriminator_params': {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'periods': [2, 3, 5, 7, 11], 'scale_discriminator_params': {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'scale_downsample_pooling': 'AvgPool1d', 'scale_downsample_pooling_params': {'kernel_size': 4, 'padding': 2, 'stride': 2}, 'scales': 1}, generator_adv_loss_params: 
Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, discriminator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, use_feat_match_loss: bool = True, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, use_mel_loss: bool = True, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 22050, 'hop_length': 256, 'log_base': None, 'n_fft': 1024, 'n_mels': 80, 'win_length': None, 'window': 'hann'}, lambda_text2mel: float = 1.0, lambda_adv: float = 1.0, lambda_feat_match: float = 2.0, lambda_mel: float = 45.0, cache_generator_outputs: bool = False)[source]

Bases: espnet2.gan_tts.abs_gan_tts.AbsGANTTS

General class to jointly train text2mel and vocoder parts.

Initialize JointText2Wav module.

Parameters
  • idim (int) – Input vocabulary size.

  • odim (int) – Acoustic feature dimension. The actual number of output channels is 1 because this is an end-to-end text-to-wave model; odim is kept for compatibility and indicates the acoustic feature dimension.

  • segment_size (int) – Segment size for random windowed inputs.

  • sampling_rate (int) – Sampling rate. It is not used for training, but it is referred to when saving the waveform during inference.

  • text2mel_type (str) – The text2mel model type.

  • text2mel_params (Dict[str, Any]) – Parameter dict for text2mel model.

  • use_pqmf (bool) – Whether to use PQMF for multi-band vocoder.

  • pqmf_params (Dict[str, Any]) – Parameter dict for PQMF module.

  • vocoder_type (str) – The vocoder model type.

  • vocoder_params (Dict[str, Any]) – Parameter dict for vocoder model.

  • discriminator_type (str) – Discriminator type.

  • discriminator_params (Dict[str, Any]) – Parameter dict for discriminator.

  • generator_adv_loss_params (Dict[str, Any]) – Parameter dict for generator adversarial loss.

  • discriminator_adv_loss_params (Dict[str, Any]) – Parameter dict for discriminator adversarial loss.

  • use_feat_match_loss (bool) – Whether to use feat match loss.

  • feat_match_loss_params (Dict[str, Any]) – Parameter dict for feat match loss.

  • use_mel_loss (bool) – Whether to use mel loss.

  • mel_loss_params (Dict[str, Any]) – Parameter dict for mel loss.

  • lambda_text2mel (float) – Loss scaling coefficient for text2mel model loss.

  • lambda_adv (float) – Loss scaling coefficient for adversarial loss.

  • lambda_feat_match (float) – Loss scaling coefficient for feat match loss.

  • lambda_mel (float) – Loss scaling coefficient for mel loss.

  • cache_generator_outputs (bool) – Whether to cache generator outputs.
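
The lambda_* coefficients scale the individual loss terms. A minimal sketch of how such terms could be combined into one generator loss follows, assuming a plain weighted sum in which the feature matching and mel terms are included only when enabled. The function name and the loss values are made up for illustration.

```python
def generator_loss_sketch(
    text2mel_loss,
    adv_loss,
    feat_match_loss,
    mel_loss,
    lambda_text2mel=1.0,
    lambda_adv=1.0,
    lambda_feat_match=2.0,
    lambda_mel=45.0,
    use_feat_match_loss=True,
    use_mel_loss=True,
):
    """Weighted sum of generator loss terms (assumed combination rule)."""
    loss = lambda_text2mel * text2mel_loss + lambda_adv * adv_loss
    if use_feat_match_loss:
        loss += lambda_feat_match * feat_match_loss
    if use_mel_loss:
        loss += lambda_mel * mel_loss
    return loss


# Made-up loss values, for illustration only.
total = generator_loss_sketch(
    text2mel_loss=0.5, adv_loss=2.0, feat_match_loss=3.0, mel_loss=0.25
)
print(total)
```

With the default coefficients the mel term carries by far the largest weight, so even a small mel loss contributes noticeably to the total.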

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, speech: torch.Tensor, speech_lengths: torch.Tensor, forward_generator: bool = True, **kwargs) → Dict[str, Any][source]

Perform generator forward.

Parameters
  • text (Tensor) – Text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • feats (Tensor) – Feature tensor (B, T_feats, aux_channels).

  • feats_lengths (Tensor) – Feature length tensor (B,).

  • speech (Tensor) – Speech waveform tensor (B, T_wav).

  • speech_lengths (Tensor) – Speech length tensor (B,).

  • forward_generator (bool) – Whether to forward generator.

Returns

  • loss (Tensor): Loss scalar tensor.

  • stats (Dict[str, float]): Statistics to be monitored.

  • weight (Tensor): Weight tensor to summarize losses.

  • optim_idx (int): Optimizer index (0 for G and 1 for D).

Return type

Dict[str, Any]
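
The optim_idx entry lets a training loop pick the optimizer that matches the returned loss. The sketch below uses a stand-in function instead of the real model, so only the dictionary layout is taken from the description above. The optimizer names and loss values are placeholders.

```python
def fake_forward(forward_generator=True):
    """Stand-in for the documented forward(); the numbers are placeholders."""
    return {
        "loss": 1.0 if forward_generator else 0.5,
        "stats": {},
        "weight": 1,
        "optim_idx": 0 if forward_generator else 1,
    }


optimizers = ["generator_optimizer", "discriminator_optimizer"]  # hypothetical

stepped = []
for forward_generator in (True, False):  # one generator turn, one discriminator turn
    out = fake_forward(forward_generator=forward_generator)
    # optim_idx says which optimizer the returned loss belongs to.
    stepped.append(optimizers[out["optim_idx"]])

print(stepped)
```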

inference(text: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]

Run inference.

Parameters

text (Tensor) – Input text index tensor (T_text,).

Returns

  • wav (Tensor): Generated waveform tensor (T_wav,).

  • feat_gan (Tensor): Generated feature tensor (T_text, C).

Return type

Dict[str, Tensor]

property require_raw_speech

Return whether or not speech is required.

property require_vocoder

Return whether or not vocoder is required.

espnet2.gan_tts.joint.__init__

espnet2.gan_tts.style_melgan.style_melgan

StyleMelGAN Modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.style_melgan.style_melgan.StyleMelGANDiscriminator(repeats: int = 2, window_sizes: List[int] = [512, 1024, 2048, 4096], pqmf_params: List[List[int]] = [[1, None, None, None], [2, 62, 0.267, 9.0], [4, 62, 0.142, 9.0], [8, 62, 0.07949, 9.0]], discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 16, 'downsample_scales': [4, 4, 4, 1], 'kernel_sizes': [5, 3], 'max_downsample_channels': 512, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.2}, 'out_channels': 1, 'pad': 'ReflectionPad1d', 'pad_params': {}}, use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

Style MelGAN discriminator module.

Initialize StyleMelGANDiscriminator module.

Parameters
  • repeats (int) – Number of repetitions to apply RWD.

  • window_sizes (List[int]) – List of random window sizes.

  • pqmf_params (List[List[int]]) – List of lists of parameters for PQMF modules.

  • discriminator_params (Dict[str, Any]) – Parameters for base discriminator module.

  • use_weight_norm (bool) – Whether to apply weight normalization.

apply_weight_norm()[source]

Apply weight normalization to all of the layers.

forward(x: torch.Tensor) → List[torch.Tensor][source]

Calculate forward propagation.

Parameters

x (Tensor) – Input tensor (B, 1, T).

Returns

List of discriminator outputs. The number of items in the list is equal to repeats * #discriminators.

Return type

List
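
The size of the returned list can be worked out from the default arguments. The sketch below assumes one base discriminator per random window size, and that each pqmf_params entry follows the PQMF argument order (subbands, taps, cutoff_ratio, beta), with the i-th window analysed into the i-th number of subbands.

```python
repeats = 2
window_sizes = [512, 1024, 2048, 4096]
pqmf_params = [
    [1, None, None, None],
    [2, 62, 0.267, 9.0],
    [4, 62, 0.142, 9.0],
    [8, 62, 0.07949, 9.0],
]

# Assumption: one base discriminator per window size.
num_discriminators = len(window_sizes)
num_outputs = repeats * num_discriminators

# Assumption: PQMF analysis divides the time length by the number of subbands.
lengths_after_analysis = [w // p[0] for w, p in zip(window_sizes, pqmf_params)]

print(num_outputs, lengths_after_analysis)
```

Under these assumptions, every base discriminator receives an input of the same length, with longer windows spread over more subband channels.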

reset_parameters()[source]

Reset parameters.

class espnet2.gan_tts.style_melgan.style_melgan.StyleMelGANGenerator(in_channels: int = 128, aux_channels: int = 80, channels: int = 64, out_channels: int = 1, kernel_size: int = 9, dilation: int = 2, bias: bool = True, noise_upsample_scales: List[int] = [11, 2, 2, 2], noise_upsample_activation: str = 'LeakyReLU', noise_upsample_activation_params: Dict[str, Any] = {'negative_slope': 0.2}, upsample_scales: List[int] = [2, 2, 2, 2, 2, 2, 2, 2, 1], upsample_mode: str = 'nearest', gated_function: str = 'softmax', use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

Style MelGAN generator module.

Initialize StyleMelGANGenerator module.

Parameters
  • in_channels (int) – Number of input noise channels.

  • aux_channels (int) – Number of auxiliary input channels.

  • channels (int) – Number of channels for conv layer.

  • out_channels (int) – Number of output channels.

  • kernel_size (int) – Kernel size of conv layers.

  • dilation (int) – Dilation factor for conv layers.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • noise_upsample_scales (List[int]) – List of noise upsampling scales.

  • noise_upsample_activation (str) – Activation function module name for noise upsampling.

  • noise_upsample_activation_params (Dict[str, Any]) – Hyperparameters for the above activation function.

  • upsample_scales (List[int]) – List of upsampling scales.

  • upsample_mode (str) – Upsampling mode in TADE layer.

  • gated_function (str) – Gated function used in TADEResBlock (“softmax” or “sigmoid”).

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

apply_weight_norm()[source]

Apply weight normalization to all of the layers.

forward(c: torch.Tensor, z: Optional[torch.Tensor] = None) → torch.Tensor[source]

Calculate forward propagation.

Parameters
  • c (Tensor) – Auxiliary input tensor (B, aux_channels, T).

  • z (Optional[Tensor]) – Input noise tensor (B, in_channels, 1).

Returns

Output tensor (B, out_channels, T * prod(upsample_scales)).

Return type

Tensor

inference(c: torch.Tensor) → torch.Tensor[source]

Perform inference.

Parameters

c (Tensor) – Input tensor (T, in_channels).

Returns

Output tensor (T * prod(upsample_scales), out_channels).

Return type

Tensor

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

reset_parameters()[source]

Reset parameters.

espnet2.gan_tts.style_melgan.tade_res_block

StyleMelGAN’s TADEResBlock Modules.

This code is modified from https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_tts.style_melgan.tade_res_block.TADELayer(in_channels: int = 64, aux_channels: int = 80, kernel_size: int = 9, bias: bool = True, upsample_factor: int = 2, upsample_mode: str = 'nearest')[source]

Bases: torch.nn.modules.module.Module

TADE Layer module.

Initialize TADELayer module.

Parameters
  • in_channels (int) – Number of input channels.

  • aux_channels (int) – Number of auxiliary channels.

  • kernel_size (int) – Kernel size.

  • bias (bool) – Whether to use bias parameter in conv.

  • upsample_factor (int) – Upsample factor.

  • upsample_mode (str) – Upsample mode.

forward(x: torch.Tensor, c: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters
  • x (Tensor) – Input tensor (B, in_channels, T).

  • c (Tensor) – Auxiliary input tensor (B, aux_channels, T’).

Returns

  • Tensor: Output tensor (B, in_channels, T * in_upsample_factor).

  • Tensor: Upsampled aux tensor (B, in_channels, T * aux_upsample_factor).

Return type

Tensor

class espnet2.gan_tts.style_melgan.tade_res_block.TADEResBlock(in_channels: int = 64, aux_channels: int = 80, kernel_size: int = 9, dilation: int = 2, bias: bool = True, upsample_factor: int = 2, upsample_mode: str = 'nearest', gated_function: str = 'softmax')[source]

Bases: torch.nn.modules.module.Module

TADEResBlock module.

Initialize TADEResBlock module.

Parameters
  • in_channels (int) – Number of input channels.

  • aux_channels (int) – Number of auxiliary channels.

  • kernel_size (int) – Kernel size.

  • bias (bool) – Whether to use bias parameter in conv.

  • upsample_factor (int) – Upsample factor.

  • upsample_mode (str) – Upsample mode.

  • gated_function (str) – Gated function type (softmax or sigmoid).

forward(x: torch.Tensor, c: torch.Tensor) → torch.Tensor[source]

Calculate forward propagation.

Parameters
  • x (Tensor) – Input tensor (B, in_channels, T).

  • c (Tensor) – Auxiliary input tensor (B, aux_channels, T’).

Returns

  • Tensor: Output tensor (B, in_channels, T * in_upsample_factor).

  • Tensor: Upsampled auxiliary tensor (B, in_channels, T * in_upsample_factor).

Return type

Tensor
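
The gated_function option selects how the gate is computed. As a rough sketch, assume the block splits its activations into two halves xa and xb along the channel axis and multiplies a gate derived from xa with tanh(xb). That structure is an assumption based on common gated activation units and is not confirmed by the text above. The example operates on the channel values of a single time step, and the function name is hypothetical.

```python
import math


def gated_activation_sketch(xa, xb, gated_function="softmax"):
    """Gate tanh(xb) with softmax (over channels) or sigmoid of xa."""
    if gated_function == "softmax":
        peak = max(xa)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in xa]
        denom = sum(exps)
        gate = [e / denom for e in exps]
    elif gated_function == "sigmoid":
        gate = [1.0 / (1.0 + math.exp(-v)) for v in xa]
    else:
        raise ValueError(f"unsupported gated_function: {gated_function}")
    return [g * math.tanh(v) for g, v in zip(gate, xb)]


print(gated_activation_sketch([2.0, 0.0], [1.0, -1.0], "softmax"))
print(gated_activation_sketch([2.0, 0.0], [1.0, -1.0], "sigmoid"))
```

The practical difference is that softmax gates compete across channels and sum to one, while sigmoid gates are computed independently for each channel.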

espnet2.gan_tts.style_melgan.__init__

espnet2.gan_tts.vits.transform

Flow-related transformation.

This code is derived from https://github.com/bayesiains/nflows.

espnet2.gan_tts.vits.transform.piecewise_rational_quadratic_transform(inputs, unnormalized_widths, unnormalized_heights, unnormalized_derivatives, inverse=False, tails=None, tail_bound=1.0, min_bin_width=0.001, min_bin_height=0.001, min_derivative=0.001)[source]

espnet2.gan_tts.vits.transform.rational_quadratic_spline(inputs, unnormalized_widths, unnormalized_heights, unnormalized_derivatives, inverse=False, left=0.0, right=1.0, bottom=0.0, top=1.0, min_bin_width=0.001, min_bin_height=0.001, min_derivative=0.001)[source]

espnet2.gan_tts.vits.transform.unconstrained_rational_quadratic_spline(inputs, unnormalized_widths, unnormalized_heights, unnormalized_derivatives, inverse=False, tails='linear', tail_bound=1.0, min_bin_width=0.001, min_bin_height=0.001, min_derivative=0.001)[source]

espnet2.gan_tts.vits.generator

Generator module in VITS.

This code is based on https://github.com/jaywalnut310/vits.

class espnet2.gan_tts.vits.generator.VITSGenerator(vocabs: int, aux_channels: int = 513, hidden_channels: int = 192, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, global_channels: int = -1, segment_size: int = 32, text_encoder_attention_heads: int = 2, text_encoder_ffn_expand: int = 4, text_encoder_blocks: int = 6, text_encoder_positionwise_layer_type: str = 'conv1d', text_encoder_positionwise_conv_kernel_size: int = 1, text_encoder_positional_encoding_layer_type: str = 'rel_pos', text_encoder_self_attention_layer_type: str = 'rel_selfattn', text_encoder_activation_type: str = 'swish', text_encoder_normalize_before: bool = True, text_encoder_dropout_rate: float = 0.1, text_encoder_positional_dropout_rate: float = 0.0, text_encoder_attention_dropout_rate: float = 0.0, text_encoder_conformer_kernel_size: int = 7, use_macaron_style_in_text_encoder: bool = True, use_conformer_conv_in_text_encoder: bool = True, decoder_kernel_size: int = 7, decoder_channels: int = 512, decoder_upsample_scales: List[int] = [8, 8, 2, 2], decoder_upsample_kernel_sizes: List[int] = [16, 16, 4, 4], decoder_resblock_kernel_sizes: List[int] = [3, 7, 11], decoder_resblock_dilations: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], use_weight_norm_in_decoder: bool = True, posterior_encoder_kernel_size: int = 5, posterior_encoder_layers: int = 16, posterior_encoder_stacks: int = 1, posterior_encoder_base_dilation: int = 1, posterior_encoder_dropout_rate: float = 0.0, use_weight_norm_in_posterior_encoder: bool = True, flow_flows: int = 4, flow_kernel_size: int = 5, flow_base_dilation: int = 1, flow_layers: int = 4, flow_dropout_rate: float = 0.0, use_weight_norm_in_flow: bool = True, use_only_mean_in_flow: bool = True, stochastic_duration_predictor_kernel_size: int = 3, stochastic_duration_predictor_dropout_rate: float = 0.5, stochastic_duration_predictor_flows: int = 4, stochastic_duration_predictor_dds_conv_layers: int = 3)[source]

Bases: torch.nn.modules.module.Module

Generator module in VITS.

This is a module of VITS described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

As the text encoder, we use the conformer architecture, which contains additional convolution layers, instead of the relative positional Transformer.

Initialize VITS generator module.

Parameters
  • vocabs (int) – Input vocabulary size.

  • aux_channels (int) – Number of acoustic feature channels.

  • hidden_channels (int) – Number of hidden channels.

  • spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.

  • langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.

  • spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.

  • global_channels (int) – Number of global conditioning channels.

  • segment_size (int) – Segment size for decoder.

  • text_encoder_attention_heads (int) – Number of heads in conformer block of text encoder.

  • text_encoder_ffn_expand (int) – Expansion ratio of FFN in conformer block of text encoder.

  • text_encoder_blocks (int) – Number of conformer blocks in text encoder.

  • text_encoder_positionwise_layer_type (str) – Position-wise layer type in conformer block of text encoder.

  • text_encoder_positionwise_conv_kernel_size (int) – Position-wise convolution kernel size in conformer block of text encoder. Only used when the above layer type is conv1d or conv1d-linear.

  • text_encoder_positional_encoding_layer_type (str) – Positional encoding layer type in conformer block of text encoder.

  • text_encoder_self_attention_layer_type (str) – Self-attention layer type in conformer block of text encoder.

  • text_encoder_activation_type (str) – Activation function type in conformer block of text encoder.

  • text_encoder_normalize_before (bool) – Whether to apply layer norm before self-attention in conformer block of text encoder.

  • text_encoder_dropout_rate (float) – Dropout rate in conformer block of text encoder.

  • text_encoder_positional_dropout_rate (float) – Dropout rate for positional encoding in conformer block of text encoder.

  • text_encoder_attention_dropout_rate (float) – Dropout rate for attention in conformer block of text encoder.

  • text_encoder_conformer_kernel_size (int) – Conformer conv kernel size. Only used when use_conformer_conv_in_text_encoder = True.

  • use_macaron_style_in_text_encoder (bool) – Whether to use macaron style FFN in conformer block of text encoder.

  • use_conformer_conv_in_text_encoder (bool) – Whether to use convolution in conformer block of text encoder.

  • decoder_kernel_size (int) – Decoder kernel size.

  • decoder_channels (int) – Number of decoder initial channels.

  • decoder_upsample_scales (List[int]) – List of upsampling scales in decoder.

  • decoder_upsample_kernel_sizes (List[int]) – List of kernel size for upsampling layers in decoder.

  • decoder_resblock_kernel_sizes (List[int]) – List of kernel size for resblocks in decoder.

  • decoder_resblock_dilations (List[List[int]]) – List of list of dilations for resblocks in decoder.

  • use_weight_norm_in_decoder (bool) – Whether to apply weight normalization in decoder.

  • posterior_encoder_kernel_size (int) – Posterior encoder kernel size.

  • posterior_encoder_layers (int) – Number of layers of posterior encoder.

  • posterior_encoder_stacks (int) – Number of stacks of posterior encoder.

  • posterior_encoder_base_dilation (int) – Base dilation of posterior encoder.

  • posterior_encoder_dropout_rate (float) – Dropout rate for posterior encoder.

  • use_weight_norm_in_posterior_encoder (bool) – Whether to apply weight normalization in posterior encoder.

  • flow_flows (int) – Number of flows in flow.

  • flow_kernel_size (int) – Kernel size in flow.

  • flow_base_dilation (int) – Base dilation in flow.

  • flow_layers (int) – Number of layers in flow.

  • flow_dropout_rate (float) – Dropout rate in flow.

  • use_weight_norm_in_flow (bool) – Whether to apply weight normalization in flow.

  • use_only_mean_in_flow (bool) – Whether to use only mean in flow.

  • stochastic_duration_predictor_kernel_size (int) – Kernel size in stochastic duration predictor.

  • stochastic_duration_predictor_dropout_rate (float) – Dropout rate in stochastic duration predictor.

  • stochastic_duration_predictor_flows (int) – Number of flows in stochastic duration predictor.

  • stochastic_duration_predictor_dds_conv_layers (int) – Number of DDS conv layers in stochastic duration predictor.

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, sids: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]][source]

Calculate forward propagation.

Parameters
  • text (Tensor) – Text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • feats (Tensor) – Feature tensor (B, aux_channels, T_feats).

  • feats_lengths (Tensor) – Feature length tensor (B,).

  • sids (Optional[Tensor]) – Speaker index tensor (B,) or (B, 1).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, spk_embed_dim).

  • lids (Optional[Tensor]) – Language index tensor (B,) or (B, 1).

Returns

  • Tensor: Waveform tensor (B, 1, segment_size * upsample_factor).

  • Tensor: Duration negative log-likelihood (NLL) tensor (B,).

  • Tensor: Monotonic attention weight tensor (B, 1, T_feats, T_text).

  • Tensor: Segments start index tensor (B,).

  • Tensor: Text mask tensor (B, 1, T_text).

  • Tensor: Feature mask tensor (B, 1, T_feats).

  • tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]:

      • Tensor: Posterior encoder hidden representation (B, H, T_feats).

      • Tensor: Flow hidden representation (B, H, T_feats).

      • Tensor: Expanded text encoder projected mean (B, H, T_feats).

      • Tensor: Expanded text encoder projected scale (B, H, T_feats).

      • Tensor: Posterior encoder projected mean (B, H, T_feats).

      • Tensor: Posterior encoder projected scale (B, H, T_feats).

Return type

Tensor
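
The last axis of the returned waveform segment can be worked out from the constructor defaults. The following is a minimal sketch, assuming that upsample_factor is the product of decoder_upsample_scales (the name upsample_factor appears in the return description, but its definition is an assumption here).

```python
import math

# Defaults copied from the VITSGenerator signature.
segment_size = 32
decoder_upsample_scales = [8, 8, 2, 2]

# Assumption: upsample_factor is the product of the decoder upsampling scales.
upsample_factor = math.prod(decoder_upsample_scales)

# Length of the last axis of the (B, 1, segment_size * upsample_factor) waveform.
segment_samples = segment_size * upsample_factor

print(upsample_factor, segment_samples)
```

Under that assumption, each training segment of 32 feature frames corresponds to 8192 waveform samples with the default settings.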

inference(text: torch.Tensor, text_lengths: torch.Tensor, feats: Optional[torch.Tensor] = None, feats_lengths: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, dur: Optional[torch.Tensor] = None, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, alpha: float = 1.0, max_len: Optional[int] = None, use_teacher_forcing: bool = False) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Run inference.

Parameters
  • text (Tensor) – Input text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • feats (Optional[Tensor]) – Feature tensor (B, aux_channels, T_feats).

  • feats_lengths (Optional[Tensor]) – Feature length tensor (B,).

  • sids (Optional[Tensor]) – Speaker index tensor (B,) or (B, 1).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, spk_embed_dim).

  • lids (Optional[Tensor]) – Language index tensor (B,) or (B, 1).

  • dur (Optional[Tensor]) – Ground-truth duration (B, T_text). If provided, skip the prediction of durations (i.e., teacher forcing).

  • noise_scale (float) – Noise scale parameter for flow.

  • noise_scale_dur (float) – Noise scale parameter for duration predictor.

  • alpha (float) – Alpha parameter to control the speed of generated speech.

  • max_len (Optional[int]) – Maximum length of acoustic feature sequence.

  • use_teacher_forcing (bool) – Whether to use teacher forcing.

Returns

  • Tensor: Generated waveform tensor (B, T_wav).

  • Tensor: Monotonic attention weight tensor (B, T_feats, T_text).

  • Tensor: Duration tensor (B, T_text).

Return type

Tensor
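
To illustrate how alpha controls the speed of generated speech, here is a minimal pure-Python sketch. It assumes the recipe of the original VITS implementation, where the predicted log-durations are exponentiated, masked, scaled by alpha, and rounded up. The helper name durations_from_log is hypothetical and not part of ESPnet.

```python
import math
from typing import List


def durations_from_log(logw: List[float], mask: List[int], alpha: float = 1.0) -> List[int]:
    """Convert predicted log-durations into integer frame counts per token."""
    # Assumption: ceil(exp(logw) * mask * alpha), as in the original VITS code.
    return [math.ceil(math.exp(lw) * m * alpha) for lw, m in zip(logw, mask)]


logw = [0.5, 1.2, 0.0]  # toy log-durations for three tokens
mask = [1, 1, 0]        # the last token is padding

normal = durations_from_log(logw, mask)           # alpha = 1.0
slow = durations_from_log(logw, mask, alpha=2.0)  # larger alpha, longer durations

print(normal, sum(normal))
print(slow, sum(slow))
```

In this sketch a larger alpha lengthens every token and therefore slows the speech down, while padded tokens always receive zero frames.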

espnet2.gan_tts.vits.duration_predictor

Stochastic duration predictor modules in VITS.

This code is based on https://github.com/jaywalnut310/vits.

class espnet2.gan_tts.vits.duration_predictor.StochasticDurationPredictor(channels: int = 192, kernel_size: int = 3, dropout_rate: float = 0.5, flows: int = 4, dds_conv_layers: int = 3, global_channels: int = -1)[source]

Bases: torch.nn.modules.module.Module

Stochastic duration predictor module.

This is a module of stochastic duration predictor described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Initialize StochasticDurationPredictor module.

Parameters
  • channels (int) – Number of channels.

  • kernel_size (int) – Kernel size.

  • dropout_rate (float) – Dropout rate.

  • flows (int) – Number of flows.

  • dds_conv_layers (int) – Number of conv layers in DDS conv.

  • global_channels (int) – Number of global conditioning channels.

forward(x: torch.Tensor, x_mask: torch.Tensor, w: Optional[torch.Tensor] = None, g: Optional[torch.Tensor] = None, inverse: bool = False, noise_scale: float = 1.0) → torch.Tensor[source]

Calculate forward propagation.

Parameters
  • x (Tensor) – Input tensor (B, channels, T_text).

  • x_mask (Tensor) – Mask tensor (B, 1, T_text).

  • w (Optional[Tensor]) – Duration tensor (B, 1, T_text).

  • g (Optional[Tensor]) – Global conditioning tensor (B, channels, 1).

  • inverse (bool) – Whether to inverse the flow.

  • noise_scale (float) – Noise scale value.

Returns

If not inverse, negative log-likelihood (NLL) tensor (B,).

If inverse, log-duration tensor (B, 1, T_text).

Return type

Tensor

espnet2.gan_tts.vits.residual_coupling

Residual affine coupling modules in VITS.

This code is based on https://github.com/jaywalnut310/vits.

class espnet2.gan_tts.vits.residual_coupling.ResidualAffineCouplingBlock(in_channels: int = 192, hidden_channels: int = 192, flows: int = 4, kernel_size: int = 5, base_dilation: int = 1, layers: int = 4, global_channels: int = -1, dropout_rate: float = 0.0, use_weight_norm: bool = True, bias: bool = True, use_only_mean: bool = True)[source]

Bases: torch.nn.modules.module.Module

Residual affine coupling block module.

This is a module of residual affine coupling block, which is used as “Flow” in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Initialize ResidualAffineCouplingBlock module.

Parameters
  • in_channels (int) – Number of input channels.

  • hidden_channels (int) – Number of hidden channels.

  • flows (int) – Number of flows.

  • kernel_size (int) – Kernel size for WaveNet.

  • base_dilation (int) – Base dilation factor for WaveNet.

  • layers (int) – Number of layers of WaveNet.

  • global_channels (int) – Number of global channels.

  • dropout_rate (float) – Dropout rate.

  • use_weight_norm (bool) – Whether to use weight normalization in WaveNet.

  • bias (bool) – Whether to use bias parameters in WaveNet.

  • use_only_mean (bool) – Whether to estimate only mean.

forward(x: torch.Tensor, x_mask: torch.Tensor, g: Optional[torch.Tensor] = None, inverse: bool = False) → torch.Tensor[source]

Calculate forward propagation.

Parameters
  • x (Tensor) – Input tensor (B, in_channels, T).

  • x_mask (Tensor) – Mask tensor (B, 1, T).

  • g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).

  • inverse (bool) – Whether to inverse the flow.

Returns

Output tensor (B, in_channels, T).

Return type

Tensor

class espnet2.gan_tts.vits.residual_coupling.ResidualAffineCouplingLayer(in_channels: int = 192, hidden_channels: int = 192, kernel_size: int = 5, base_dilation: int = 1, layers: int = 5, stacks: int = 1, global_channels: int = -1, dropout_rate: float = 0.0, use_weight_norm: bool = True, bias: bool = True, use_only_mean: bool = True)[source]

Bases: torch.nn.modules.module.Module

Residual affine coupling layer.

Initialize ResidualAffineCouplingLayer module.

Parameters
  • in_channels (int) – Number of input channels.

  • hidden_channels (int) – Number of hidden channels.

  • kernel_size (int) – Kernel size for WaveNet.

  • base_dilation (int) – Base dilation factor for WaveNet.

  • layers (int) – Number of layers of WaveNet.

  • stacks (int) – Number of stacks of WaveNet.

  • global_channels (int) – Number of global channels.

  • dropout_rate (float) – Dropout rate.

  • use_weight_norm (bool) – Whether to use weight normalization in WaveNet.

  • bias (bool) – Whether to use bias parameters in WaveNet.

  • use_only_mean (bool) – Whether to estimate only mean.

forward(x: torch.Tensor, x_mask: torch.Tensor, g: Optional[torch.Tensor] = None, inverse: bool = False) → Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]

Calculate forward propagation.

Parameters
  • x (Tensor) – Input tensor (B, in_channels, T).

  • x_mask (Tensor) – Mask tensor (B, 1, T).

  • g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).

  • inverse (bool) – Whether to inverse the flow.

Returns

  • Tensor: Output tensor (B, in_channels, T).

  • Tensor: Log-determinant tensor for NLL (B,) if not inverse.

Return type

Tensor
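
The idea of an affine coupling layer can be sketched on a single feature vector. This is an illustration under stated assumptions, not the ESPnet implementation: toy_conditioner is a hypothetical stand-in for the WaveNet, and the functions operate on plain lists for one time step rather than on (B, in_channels, T) tensors. The first half of the channels passes through unchanged and conditions an affine transform of the second half.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Conditioner = Callable[[Vector], Tuple[Vector, Vector]]


def toy_conditioner(xa: Vector) -> Tuple[Vector, Vector]:
    """Hypothetical stand-in for the WaveNet that predicts mean and log-scale."""
    mean = [2.0 * a + 1.0 for a in xa]
    log_scale = [0.1 * a for a in xa]
    return mean, log_scale


def coupling_forward(x: Vector, conditioner: Conditioner) -> Tuple[Vector, float]:
    """Transform the second half of x; return the output and the log-determinant."""
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    mean, log_scale = conditioner(xa)
    yb = [m + b * math.exp(s) for m, b, s in zip(mean, xb, log_scale)]
    return xa + yb, sum(log_scale)


def coupling_inverse(y: Vector, conditioner: Conditioner) -> Vector:
    """Undo coupling_forward; the untouched half reproduces the same statistics."""
    half = len(y) // 2
    ya, yb = y[:half], y[half:]
    mean, log_scale = conditioner(ya)
    xb = [(b - m) * math.exp(-s) for m, b, s in zip(mean, yb, log_scale)]
    return ya + xb


x = [0.5, -1.0, 2.0, 3.0]
y, logdet = coupling_forward(x, toy_conditioner)
x_rec = coupling_inverse(y, toy_conditioner)
print(y, logdet, x_rec)
```

When only the mean is estimated, the log-scale in this sketch would be all zeros, so the transform reduces to a shift and the log-determinant is zero.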

espnet2.gan_tts.vits.loss

VITS-related loss modules.

This code is based on https://github.com/jaywalnut310/vits.

class espnet2.gan_tts.vits.loss.KLDivergenceLoss[source]

Bases: torch.nn.modules.module.Module

KL divergence loss.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(z_p: torch.Tensor, logs_q: torch.Tensor, m_p: torch.Tensor, logs_p: torch.Tensor, z_mask: torch.Tensor) → torch.Tensor[source]

Calculate KL divergence loss.

Parameters
  • z_p (Tensor) – Flow hidden representation (B, H, T_feats).

  • logs_q (Tensor) – Posterior encoder projected scale (B, H, T_feats).

  • m_p (Tensor) – Expanded text encoder projected mean (B, H, T_feats).

  • logs_p (Tensor) – Expanded text encoder projected scale (B, H, T_feats).

  • z_mask (Tensor) – Mask tensor (B, 1, T_feats).

Returns

KL divergence loss.

Return type

Tensor
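
As a rough illustration of what this loss computes, the sketch below follows the formula used in the original VITS repository, written in pure Python for a single utterance. It is an assumption that the ESPnet module matches it term by term. The inputs are nested lists of shape (H, T_feats) and a mask of length T_feats.

```python
import math
from typing import List

Matrix = List[List[float]]  # (H, T_feats) for a single utterance


def kl_divergence_loss(
    z_p: Matrix, logs_q: Matrix, m_p: Matrix, logs_p: Matrix, z_mask: List[float]
) -> float:
    """Masked KL term between posterior and prior, averaged over valid frames."""
    total = 0.0
    for h in range(len(z_p)):
        for t, mask in enumerate(z_mask):
            kl = logs_p[h][t] - logs_q[h][t] - 0.5
            kl += 0.5 * (z_p[h][t] - m_p[h][t]) ** 2 * math.exp(-2.0 * logs_p[h][t])
            total += kl * mask
    # Normalize by the number of valid frames, not by frames times channels.
    return total / sum(z_mask)


zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
ones = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
mask = [1.0, 1.0, 0.0]  # the last frame is padding

loss_same = kl_divergence_loss(zeros, zeros, zeros, zeros, mask)
loss_shifted = kl_divergence_loss(ones, zeros, zeros, zeros, mask)
print(loss_same, loss_shifted)
```

Because z_p is a single sample rather than an expectation, the value can be negative for individual batches.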

espnet2.gan_tts.vits.__init__

espnet2.gan_tts.vits.posterior_encoder

Posterior encoder module in VITS.

This code is based on https://github.com/jaywalnut310/vits.

class espnet2.gan_tts.vits.posterior_encoder.PosteriorEncoder(in_channels: int = 513, out_channels: int = 192, hidden_channels: int = 192, kernel_size: int = 5, layers: int = 16, stacks: int = 1, base_dilation: int = 1, global_channels: int = -1, dropout_rate: float = 0.0, bias: bool = True, use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

Posterior encoder module in VITS.

This is a module of posterior encoder described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Initialize PosteriorEncoder module.

Parameters
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • hidden_channels (int) – Number of hidden channels.

  • kernel_size (int) – Kernel size in WaveNet.

  • layers (int) – Number of layers of WaveNet.

  • stacks (int) – Number of repeat stacking of WaveNet.

  • base_dilation (int) – Base dilation factor.

  • global_channels (int) – Number of global conditioning channels.

  • dropout_rate (float) – Dropout rate.

  • bias (bool) – Whether to use bias parameters in conv.

  • use_weight_norm (bool) – Whether to apply weight norm.

forward(x: torch.Tensor, x_lengths: torch.Tensor, g: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

Calculate forward propagation.

Parameters
  • x (Tensor) – Input tensor (B, in_channels, T_feats).

  • x_lengths (Tensor) – Length tensor (B,).

  • g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).

Returns

  • Tensor: Encoded hidden representation tensor (B, out_channels, T_feats).

  • Tensor: Projected mean tensor (B, out_channels, T_feats).

  • Tensor: Projected scale tensor (B, out_channels, T_feats).

  • Tensor: Mask tensor for input tensor (B, 1, T_feats).

Return type

Tensor
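
The relationship between the length tensor, the returned mask, and the sampled hidden representation can be sketched in pure Python. The helper names are hypothetical, and the sampling rule (mean plus Gaussian noise scaled by the exponentiated log-scale, then masked) is assumed from the original VITS implementation.

```python
import math
import random
from typing import List


def lengths_to_mask(lengths: List[int]) -> List[List[int]]:
    """Build a (B, T) non-padding mask from per-utterance lengths."""
    max_len = max(lengths)
    return [[1 if t < n else 0 for t in range(max_len)] for n in lengths]


def sample_latent(
    mean: List[float], log_scale: List[float], mask: List[int], rng: random.Random
) -> List[float]:
    """Reparameterized sample for one channel: (m + eps * exp(logs)) * mask."""
    return [
        (m + rng.gauss(0.0, 1.0) * math.exp(s)) * k
        for m, s, k in zip(mean, log_scale, mask)
    ]


mask = lengths_to_mask([3, 1])
rng = random.Random(0)
z = sample_latent([0.1, 0.2, 0.3], [-1.0, -1.0, -1.0], mask[1], rng)
print(mask, z)
```

Padded frames are zeroed by the mask, so they contribute nothing to downstream modules.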

espnet2.gan_tts.vits.text_encoder

Text encoder module in VITS.

This code is based on https://github.com/jaywalnut310/vits.

class espnet2.gan_tts.vits.text_encoder.TextEncoder(vocabs: int, attention_dim: int = 192, attention_heads: int = 2, linear_units: int = 768, blocks: int = 6, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 3, positional_encoding_layer_type: str = 'rel_pos', self_attention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', normalize_before: bool = True, use_macaron_style: bool = False, use_conformer_conv: bool = False, conformer_kernel_size: int = 7, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0)[source]

Bases: torch.nn.modules.module.Module

Text encoder module in VITS.

This is a module of text encoder described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Instead of the relative positional Transformer, we use a conformer architecture, which contains additional convolution layers, as the encoder module.

Initialize TextEncoder module.

Parameters
  • vocabs (int) – Vocabulary size.

  • attention_dim (int) – Attention dimension.

  • attention_heads (int) – Number of attention heads.

  • linear_units (int) – Number of linear units of positionwise layers.

  • blocks (int) – Number of encoder blocks.

  • positionwise_layer_type (str) – Positionwise layer type.

  • positionwise_conv_kernel_size (int) – Positionwise layer’s kernel size.

  • positional_encoding_layer_type (str) – Positional encoding layer type.

  • self_attention_layer_type (str) – Self-attention layer type.

  • activation_type (str) – Activation function type.

  • normalize_before (bool) – Whether to apply LayerNorm before attention.

  • use_macaron_style (bool) – Whether to use macaron style components.

  • use_conformer_conv (bool) – Whether to use conformer conv layers.

  • conformer_kernel_size (int) – Conformer’s conv kernel size.

  • dropout_rate (float) – Dropout rate.

  • positional_dropout_rate (float) – Dropout rate for positional encoding.

  • attention_dropout_rate (float) – Dropout rate for attention.

forward(x: torch.Tensor, x_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

Calculate forward propagation.

Parameters
  • x (Tensor) – Input index tensor (B, T_text).

  • x_lengths (Tensor) – Length tensor (B,).

Returns

  • Tensor: Encoded hidden representation (B, attention_dim, T_text).

  • Tensor: Projected mean tensor (B, attention_dim, T_text).

  • Tensor: Projected scale tensor (B, attention_dim, T_text).

  • Tensor: Mask tensor for input tensor (B, 1, T_text).

Return type

Tensor

espnet2.gan_tts.vits.vits

VITS module for GAN-TTS task.

class espnet2.gan_tts.vits.vits.VITS(idim: int, odim: int, sampling_rate: int = 22050, generator_type: str = 'vits_generator', generator_params: Dict[str, Any] = {'decoder_channels': 512, 'decoder_kernel_size': 7, 'decoder_resblock_dilations': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'decoder_resblock_kernel_sizes': [3, 7, 11], 'decoder_upsample_kernel_sizes': [16, 16, 4, 4], 'decoder_upsample_scales': [8, 8, 2, 2], 'flow_base_dilation': 1, 'flow_dropout_rate': 0.0, 'flow_flows': 4, 'flow_kernel_size': 5, 'flow_layers': 4, 'global_channels': -1, 'hidden_channels': 192, 'langs': None, 'posterior_encoder_base_dilation': 1, 'posterior_encoder_dropout_rate': 0.0, 'posterior_encoder_kernel_size': 5, 'posterior_encoder_layers': 16, 'posterior_encoder_stacks': 1, 'segment_size': 32, 'spk_embed_dim': None, 'spks': None, 'stochastic_duration_predictor_dds_conv_layers': 3, 'stochastic_duration_predictor_dropout_rate': 0.5, 'stochastic_duration_predictor_flows': 4, 'stochastic_duration_predictor_kernel_size': 3, 'text_encoder_activation_type': 'swish', 'text_encoder_attention_dropout_rate': 0.0, 'text_encoder_attention_heads': 2, 'text_encoder_blocks': 6, 'text_encoder_conformer_kernel_size': 7, 'text_encoder_dropout_rate': 0.1, 'text_encoder_ffn_expand': 4, 'text_encoder_normalize_before': True, 'text_encoder_positional_dropout_rate': 0.0, 'text_encoder_positional_encoding_layer_type': 'rel_pos', 'text_encoder_positionwise_conv_kernel_size': 1, 'text_encoder_positionwise_layer_type': 'conv1d', 'text_encoder_self_attention_layer_type': 'rel_selfattn', 'use_conformer_conv_in_text_encoder': True, 'use_macaron_style_in_text_encoder': True, 'use_only_mean_in_flow': True, 'use_weight_norm_in_decoder': True, 'use_weight_norm_in_flow': True, 'use_weight_norm_in_posterior_encoder': True}, discriminator_type: str = 'hifigan_multi_scale_multi_period_discriminator', discriminator_params: Dict[str, Any] = {'follow_official_norm': False, 'period_discriminator_params': {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'periods': [2, 3, 5, 7, 11], 'scale_discriminator_params': {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'scale_downsample_pooling': 'AvgPool1d', 'scale_downsample_pooling_params': {'kernel_size': 4, 'padding': 2, 'stride': 2}, 'scales': 1}, generator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, discriminator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 22050, 'hop_length': 256, 'log_base': None, 'n_fft': 1024, 'n_mels': 80, 'win_length': None, 'window': 'hann'}, lambda_adv: float = 1.0, lambda_mel: float = 45.0, lambda_feat_match: float = 2.0, lambda_dur: float = 1.0, lambda_kl: float = 1.0, cache_generator_outputs: bool = True)[source]

Bases: espnet2.gan_tts.abs_gan_tts.AbsGANTTS

VITS module (generator + discriminator).

This is a module of VITS described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Initialize VITS module.

Parameters
  • idim (int) – Input vocabulary size.

  • odim (int) – Acoustic feature dimension. The actual number of output channels is 1 because VITS is an end-to-end text-to-wave model, but odim is kept for compatibility and indicates the acoustic feature dimension.

  • sampling_rate (int) – Sampling rate. It is not used for training but is referred to when saving the waveform during inference.

  • generator_type (str) – Generator type.

  • generator_params (Dict[str, Any]) – Parameter dict for generator.

  • discriminator_type (str) – Discriminator type.

  • discriminator_params (Dict[str, Any]) – Parameter dict for discriminator.

  • generator_adv_loss_params (Dict[str, Any]) – Parameter dict for generator adversarial loss.

  • discriminator_adv_loss_params (Dict[str, Any]) – Parameter dict for discriminator adversarial loss.

  • feat_match_loss_params (Dict[str, Any]) – Parameter dict for feat match loss.

  • mel_loss_params (Dict[str, Any]) – Parameter dict for mel loss.

  • lambda_adv (float) – Loss scaling coefficient for adversarial loss.

  • lambda_mel (float) – Loss scaling coefficient for mel spectrogram loss.

  • lambda_feat_match (float) – Loss scaling coefficient for feat match loss.

  • lambda_dur (float) – Loss scaling coefficient for duration loss.

  • lambda_kl (float) – Loss scaling coefficient for KL divergence loss.

  • cache_generator_outputs (bool) – Whether to cache generator outputs.
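
The lambda_* coefficients scale the individual generator loss terms before they are summed. The sketch below shows that weighting with the default values from the signature. The function name and the assumption that the total is a plain weighted sum are illustrative and not taken from the ESPnet source.

```python
def weighted_generator_loss(
    adv: float,
    mel: float,
    feat_match: float,
    dur: float,
    kl: float,
    lambda_adv: float = 1.0,
    lambda_mel: float = 45.0,
    lambda_feat_match: float = 2.0,
    lambda_dur: float = 1.0,
    lambda_kl: float = 1.0,
) -> float:
    """Combine generator loss terms with their scaling coefficients."""
    # Assumption: the total generator loss is the plain weighted sum of the terms.
    return (
        lambda_adv * adv
        + lambda_mel * mel
        + lambda_feat_match * feat_match
        + lambda_dur * dur
        + lambda_kl * kl
    )


# With every raw term equal to 1.0, the mel term dominates the total.
total = weighted_generator_loss(adv=1.0, mel=1.0, feat_match=1.0, dur=1.0, kl=1.0)
print(total)
```

With the defaults, the mel spectrogram loss is weighted 45 times more heavily than the adversarial, duration, and KL terms.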

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, speech: torch.Tensor, speech_lengths: torch.Tensor, sids: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, forward_generator: bool = True) → Dict[str, Any][source]

Perform generator or discriminator forward, depending on forward_generator.

Parameters
  • text (Tensor) – Text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • feats (Tensor) – Feature tensor (B, T_feats, aux_channels).

  • feats_lengths (Tensor) – Feature length tensor (B,).

  • speech (Tensor) – Speech waveform tensor (B, T_wav).

  • speech_lengths (Tensor) – Speech length tensor (B,).

  • sids (Optional[Tensor]) – Speaker index tensor (B,) or (B, 1).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, spk_embed_dim).

  • lids (Optional[Tensor]) – Language index tensor (B,) or (B, 1).

  • forward_generator (bool) – Whether to forward generator.

Returns

  • loss (Tensor): Loss scalar tensor.

  • stats (Dict[str, float]): Statistics to be monitored.

  • weight (Tensor): Weight tensor to summarize losses.

  • optim_idx (int): Optimizer index (0 for G and 1 for D).

Return type

Dict[str, Any]

inference(text: torch.Tensor, feats: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, durations: Optional[torch.Tensor] = None, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, alpha: float = 1.0, max_len: Optional[int] = None, use_teacher_forcing: bool = False) → Dict[str, torch.Tensor][source]

Run inference.

Parameters
  • text (Tensor) – Input text index tensor (T_text,).

  • feats (Optional[Tensor]) – Feature tensor (T_feats, aux_channels).

  • sids (Optional[Tensor]) – Speaker index tensor (1,).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (spk_embed_dim,).

  • lids (Optional[Tensor]) – Language index tensor (1,).

  • durations (Optional[Tensor]) – Ground-truth duration tensor (T_text,).

  • noise_scale (float) – Noise scale value for flow.

  • noise_scale_dur (float) – Noise scale value for duration predictor.

  • alpha (float) – Alpha parameter to control the speed of generated speech.

  • max_len (Optional[int]) – Maximum length.

  • use_teacher_forcing (bool) – Whether to use teacher forcing.

Returns

  • wav (Tensor): Generated waveform tensor (T_wav,).

  • att_w (Tensor): Monotonic attention weight tensor (T_feats, T_text).

  • duration (Tensor): Predicted duration tensor (T_text,).

Return type

Dict[str, Tensor]

property require_raw_speech

Return whether or not speech is required.

property require_vocoder

Return whether or not vocoder is required.

espnet2.gan_tts.vits.flow

Basic Flow modules used in VITS.

This code is based on https://github.com/jaywalnut310/vits.

class espnet2.gan_tts.vits.flow.ConvFlow(in_channels: int, hidden_channels: int, kernel_size: int, layers: int, bins: int = 10, tail_bound: float = 5.0)[source]

Bases: torch.nn.modules.module.Module

Convolutional flow module.

Initialize ConvFlow module.

Parameters
  • in_channels (int) – Number of input channels.

  • hidden_channels (int) – Number of hidden channels.

  • kernel_size (int) – Kernel size.

  • layers (int) – Number of layers.

  • bins (int) – Number of bins.

  • tail_bound (float) – Tail bound value.

forward(x: torch.Tensor, x_mask: torch.Tensor, g: Optional[torch.Tensor] = None, inverse: bool = False) → Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]

Calculate forward propagation.

Parameters
  • x (Tensor) – Input tensor (B, channels, T).

  • x_mask (Tensor) – Mask tensor (B, 1, T).

  • g (Optional[Tensor]) – Global conditioning tensor (B, channels, 1).

  • inverse (bool) – Whether to inverse the flow.

Returns

  • Tensor: Output tensor (B, channels, T).

  • Tensor: Log-determinant tensor for NLL (B,) if not inverse.

Return type

Tensor

class espnet2.gan_tts.vits.flow.DilatedDepthSeparableConv(channels: int, kernel_size: int, layers: int, dropout_rate: float = 0.0, eps: float = 1e-05)[source]

Bases: torch.nn.modules.module.Module

Dilated depth-separable conv module.

Initialize DilatedDepthSeparableConv module.

Parameters
  • channels (int) – Number of channels.

  • kernel_size (int) – Kernel size.

  • layers (int) – Number of layers.

  • dropout_rate (float) – Dropout rate.

  • eps (float) – Epsilon for layer norm.

forward(x: torch.Tensor, x_mask: torch.Tensor, g: Optional[torch.Tensor] = None) → torch.Tensor[source]

Calculate forward propagation.

Parameters
  • x (Tensor) – Input tensor (B, in_channels, T).

  • x_mask (Tensor) – Mask tensor (B, 1, T).

  • g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).

Returns

Output tensor (B, channels, T).

Return type

Tensor
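
To give a feel for how the dilation grows across layers, the sketch below assumes the schedule used in the original VITS repository, where layer i uses dilation kernel_size ** i with "same" padding. Whether the ESPnet module uses exactly this schedule is an assumption, and the helper names are hypothetical.

```python
from typing import List, Tuple


def dds_conv_schedule(kernel_size: int, layers: int) -> List[Tuple[int, int]]:
    """Return (dilation, padding) for each depthwise conv layer."""
    schedule = []
    for i in range(layers):
        dilation = kernel_size ** i
        # "Same" padding so that the time axis keeps its length.
        padding = (kernel_size * dilation - dilation) // 2
        schedule.append((dilation, padding))
    return schedule


def receptive_field(kernel_size: int, layers: int) -> int:
    """Number of input frames that influence one output frame."""
    dilations = [d for d, _ in dds_conv_schedule(kernel_size, layers)]
    return 1 + sum((kernel_size - 1) * d for d in dilations)


schedule = dds_conv_schedule(kernel_size=3, layers=3)
field = receptive_field(kernel_size=3, layers=3)
print(schedule, field)
```

Under this schedule the receptive field grows exponentially with the number of layers while the parameter count grows only linearly.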

class espnet2.gan_tts.vits.flow.ElementwiseAffineFlow(channels: int)[source]

Bases: torch.nn.modules.module.Module

Elementwise affine flow module.

Initialize ElementwiseAffineFlow module.

Parameters

channels (int) – Number of channels.

forward(x: torch.Tensor, x_mask: torch.Tensor, inverse: bool = False, **kwargs) → Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]

Calculate forward propagation.

Parameters
  • x (Tensor) – Input tensor (B, channels, T).

  • x_mask (Tensor) – Mask tensor (B, 1, T).

  • inverse (bool) – Whether to inverse the flow.

Returns

  • Tensor: Output tensor (B, channels, T).

  • Tensor: Log-determinant tensor for NLL (B,) if not inverse.

Return type

Tensor

class espnet2.gan_tts.vits.flow.FlipFlow[source]

Bases: torch.nn.modules.module.Module

Flip flow module.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: torch.Tensor, *args, inverse: bool = False, **kwargs) → Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]

Calculate forward propagation.

Parameters
  • x (Tensor) – Input tensor (B, channels, T).

  • inverse (bool) – Whether to inverse the flow.

Returns

  • Tensor: Flipped tensor (B, channels, T).

  • Tensor: Log-determinant tensor for NLL (B,) if not inverse.

Return type

Tensor
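
A flip flow only reorders channels, so it is its own inverse and leaves the density unchanged. The following is a minimal sketch for a single utterance stored as a (channels, T) nested list. It assumes the flip is applied along the channel axis, as in the original VITS code, and the function name is hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # (channels, T) for a single utterance


def flip_flow(x: Matrix) -> Tuple[Matrix, float]:
    """Reverse the channel order; a permutation has zero log-determinant."""
    return x[::-1], 0.0


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y, logdet = flip_flow(x)
x_rec, _ = flip_flow(y)
print(y, logdet, x_rec)
```

Placed between coupling layers, such a flip lets the half of the channels that was left untouched by one layer be transformed by the next.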

class espnet2.gan_tts.vits.flow.LogFlow[source]

Bases: torch.nn.modules.module.Module

Log flow module.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: torch.Tensor, x_mask: torch.Tensor, inverse: bool = False, eps: float = 1e-05, **kwargs) → Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]

Calculate forward propagation.

Parameters
  • x (Tensor) – Input tensor (B, channels, T).

  • x_mask (Tensor) – Mask tensor (B, 1, T).

  • inverse (bool) – Whether to inverse the flow.

  • eps (float) – Epsilon for log.

Returns

  • Tensor: Output tensor (B, channels, T).

  • Tensor: Log-determinant tensor for NLL (B,) if not inverse.

Return type

Tensor
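
The role of eps and the mask can be seen in a small pure-Python sketch for one channel of one utterance. It assumes the behavior of the original VITS implementation (clamp to eps, take the log, mask, and use the negated sum as the log-determinant). The function name is hypothetical.

```python
import math
from typing import List, Tuple, Union


def log_flow(
    x: List[float], mask: List[int], inverse: bool = False, eps: float = 1e-5
) -> Union[List[float], Tuple[List[float], float]]:
    """Elementwise log transform with a masked log-determinant."""
    if not inverse:
        # Clamp to eps so that zero or negative inputs do not break the log.
        y = [math.log(max(v, eps)) * m for v, m in zip(x, mask)]
        logdet = sum(-v for v in y)
        return y, logdet
    # Inverse direction: exponentiate and re-apply the mask.
    return [math.exp(v) * m for v, m in zip(x, mask)]


x = [1.0, 2.0, 4.0]
mask = [1, 1, 0]  # the last frame is padding
y, logdet = log_flow(x, mask)
x_rec = log_flow(y, mask, inverse=True)
print(y, logdet, x_rec)
```

Valid frames are recovered by the inverse pass, while padded frames stay at zero in both directions.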

class espnet2.gan_tts.vits.flow.Transpose(dim1: int, dim2: int)[source]

Bases: torch.nn.modules.module.Module

Transpose module for torch.nn.Sequential().

Initialize Transpose module.

forward(x: torch.Tensor) → torch.Tensor[source]

Transpose.

espnet2.gan_tts.vits.monotonic_align.__init__

Maximum path calculation module.

This code is based on https://github.com/jaywalnut310/vits.

espnet2.gan_tts.vits.monotonic_align.__init__.maximum_path(neg_x_ent: torch.Tensor, attn_mask: torch.Tensor) → torch.Tensor[source]

Calculate maximum path.

Parameters
  • neg_x_ent (Tensor) – Negative cross-entropy tensor (B, T_feats, T_text).

  • attn_mask (Tensor) – Attention mask (B, T_feats, T_text).

Returns

Maximum path tensor (B, T_feats, T_text).

Return type

Tensor
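
The maximum path search is a dynamic program over a (T_feats, T_text) score matrix. The sketch below is a pure-Python version for a single utterance, modeled on the monotonic alignment search in the original VITS repository. It is an illustration rather than the numba code used here, and the function name is hypothetical.

```python
import math
from typing import List


def maximum_path_single(value: List[List[float]]) -> List[List[int]]:
    """Find the best monotonic alignment for one (T_feats, T_text) score matrix."""
    t_y, t_x = len(value), len(value[0])  # frames, tokens
    if t_y < t_x:
        raise ValueError("need at least one frame per token")
    q = [row[:] for row in value]  # accumulated scores
    for y in range(t_y):
        # Only cells that can still reach the last token at the last frame.
        for x in range(max(0, t_x + y - t_y), min(t_x, y + 1)):
            v_cur = -math.inf if x == y else q[y - 1][x]  # stay on the same token
            if x == 0:
                v_prev = 0.0 if y == 0 else -math.inf
            else:
                v_prev = q[y - 1][x - 1]  # advance from the previous token
            q[y][x] += max(v_prev, v_cur)
    # Backtrack from the last frame and the last token.
    path = [[0] * t_x for _ in range(t_y)]
    index = t_x - 1
    for y in range(t_y - 1, -1, -1):
        path[y][index] = 1
        if index != 0 and (index == y or q[y - 1][index] < q[y - 1][index - 1]):
            index -= 1
    return path


scores = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
path = maximum_path_single(scores)
durations = [sum(row[x] for row in path) for x in range(len(scores[0]))]
print(path, durations)
```

Summing the resulting path over the frame axis gives the number of frames assigned to each token, which is how such an alignment can serve as a duration target.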

espnet2.gan_tts.vits.monotonic_align.__init__.maximum_path_each_numba[source]

Calculate a single maximum path with numba.

espnet2.gan_tts.vits.monotonic_align.__init__.maximum_path_numba[source]

Calculate batch maximum path with numba.

espnet2.gan_tts.vits.monotonic_align.setup