espnet2.gan_tts.vits.duration_predictor.StochasticDurationPredictor

class espnet2.gan_tts.vits.duration_predictor.StochasticDurationPredictor(channels: int = 192, kernel_size: int = 3, dropout_rate: float = 0.5, flows: int = 4, dds_conv_layers: int = 3, global_channels: int = -1)

Bases: torch.nn.Module

Stochastic duration predictor module.

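This module implements the stochastic duration predictor described in "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech" (VITS). It models token durations with normalizing flows, so durations sampled at inference time can vary from run to run.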

Initialize StochasticDurationPredictor module.

Parameters:
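- channels (int) – Number of hidden channels; also the channel dimension of the input x.
- kernel_size (int) – Kernel size of the convolution layers.
- dropout_rate (float) – Dropout rate.
- flows (int) – Number of flow layers.
- dds_conv_layers (int) – Number of convolution layers in the dilated depth-separable (DDS) convolution block.
- global_channels (int) – Number of channels of the global conditioning input g (e.g., a speaker embedding). Global conditioning is disabled when this value is not positive (default: -1).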
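The tensor shapes that forward() expects depend on the constructor arguments. The sketch here is a hypothetical helper, not part of ESPnet, that records that relationship in plain Python. It assumes that g is consumed only when global_channels is positive and that g carries global_channels channels before being projected to channels internally.

```python
from typing import Dict, Optional, Tuple

Shape = Tuple[int, ...]


def expected_input_shapes(
    batch: int,
    t_text: int,
    channels: int = 192,
    global_channels: int = -1,
) -> Dict[str, Optional[Shape]]:
    """Return the input shapes forward() expects for a given configuration.

    Hypothetical helper for illustration only; not an ESPnet API.
    Assumption: g is used only when global_channels > 0.
    """
    shapes: Dict[str, Optional[Shape]] = {
        "x": (batch, channels, t_text),  # text encoder hidden states
        "x_mask": (batch, 1, t_text),    # 1 for valid tokens, 0 for padding
        "w": (batch, 1, t_text),         # durations, needed for the NLL path
        "g": None,                       # no global conditioning by default
    }
    if global_channels > 0:
        # Assumed: g has global_channels channels and is projected to `channels`.
        shapes["g"] = (batch, global_channels, 1)
    return shapes


print(expected_input_shapes(batch=2, t_text=7))
print(expected_input_shapes(batch=2, t_text=7, global_channels=256)["g"])  # (2, 256, 1)
```

With the default global_channels=-1 the helper reports g as None, matching the default of the g argument in forward().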

forward(x: Tensor, x_mask: Tensor, w: Tensor | None = None, g: Tensor | None = None, inverse: bool = False, noise_scale: float = 1.0) → Tensor

Calculate forward propagation.

Parameters:
- x (Tensor) – Input tensor (B, channels, T_text).
- x_mask (Tensor) – Mask tensor (B, 1, T_text).
- w (Optional[Tensor]) – Duration tensor (B, 1, T_text). Required when inverse is False.
- g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).
- inverse (bool) – If True, run the flows in the inverse direction to sample durations; if False, compute the NLL of w.
- noise_scale (float) – Scale applied to the Gaussian noise sampled in inverse mode.
Returns: If inverse is False, the negative log-likelihood (NLL) tensor (B,). If inverse is True, the log-duration tensor (B, 1, T_text).
Return type: Tensor
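In inverse mode the module returns log-durations, not frame counts. The original VITS inference code turns them into integer durations by exponentiating, applying the mask, scaling, and rounding up. A minimal plain-Python sketch of that post-processing for a single utterance follows; the function name is hypothetical, and length_scale is a caller-side parameter assumed for illustration, not an argument of StochasticDurationPredictor.

```python
import math
from typing import List


def log_durations_to_frames(
    logw: List[float],
    mask: List[int],
    length_scale: float = 1.0,
) -> List[int]:
    """Convert one utterance's predicted log-durations into frame counts.

    Follows the VITS recipe: exp -> mask -> scale -> ceil.
    Hypothetical helper; `length_scale` is an assumed caller-side knob
    (values > 1 lengthen every token, i.e. slower speech).
    """
    if len(logw) != len(mask):
        raise ValueError("logw and mask must have the same length (T_text)")
    frames = []
    for lw, m in zip(logw, mask):
        w = math.exp(lw) * m * length_scale  # padded tokens (m == 0) get 0 frames
        frames.append(math.ceil(w))          # round up so no valid token gets 0 frames
    return frames
```

For example, log_durations_to_frames([0.0, 1.0], [1, 0]) returns [1, 0]: the first token gets exp(0) = 1 frame and the padded token gets none. Rounding up rather than to the nearest integer guarantees at least one frame for every valid token.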