espnet2.gan_tts.vits.posterior_encoder.PosteriorEncoder

class espnet2.gan_tts.vits.posterior_encoder.PosteriorEncoder(in_channels: int = 513, out_channels: int = 192, hidden_channels: int = 192, kernel_size: int = 5, layers: int = 16, stacks: int = 1, base_dilation: int = 1, global_channels: int = -1, dropout_rate: float = 0.0, bias: bool = True, use_weight_norm: bool = True)

Bases: Module

Posterior encoder module in VITS.

This module implements the posterior encoder described in "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech".

Initialize PosteriorEncoder module.

Parameters:
- in_channels (int) – Number of input channels.
- out_channels (int) – Number of output channels.
- hidden_channels (int) – Number of hidden channels.
- kernel_size (int) – Kernel size in WaveNet.
- layers (int) – Number of layers of WaveNet.
- stacks (int) – Number of repeat stacking of WaveNet.
- base_dilation (int) – Base dilation factor.
- global_channels (int) – Number of global conditioning channels.
- dropout_rate (float) – Dropout rate.
- bias (bool) – Whether to use bias parameters in conv.
- use_weight_norm (bool) – Whether to apply weight norm.
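
To give a feel for the default values, the sketch below works through two small calculations. It is illustrative only and rests on assumptions not stated on this page: that the input is a one-sided linear spectrogram computed with an FFT size of 1024 (which would give the default in_channels of 513), and that layer i of the WaveNet uses a dilation of base_dilation ** (i % layers_per_stack). The helper names n_linear_bins and receptive_field are hypothetical and not part of ESPnet.

```python
def n_linear_bins(n_fft: int) -> int:
    """Number of frequency bins in a one-sided linear spectrogram."""
    return n_fft // 2 + 1


def receptive_field(
    kernel_size: int, layers: int, stacks: int, base_dilation: int
) -> int:
    """Receptive field in frames of a stack of dilated 1D convolutions.

    Assumes (hypothetically) that layer i uses
    dilation = base_dilation ** (i % layers_per_stack).
    """
    layers_per_stack = layers // stacks
    field = 1
    for layer in range(layers):
        dilation = base_dilation ** (layer % layers_per_stack)
        # Each layer widens the receptive field by (kernel_size - 1) * dilation.
        field += (kernel_size - 1) * dilation
    return field


# Assumed FFT size of 1024 matches the default in_channels.
print(n_linear_bins(1024))  # 513

# Defaults: kernel_size=5, layers=16, stacks=1, base_dilation=1.
print(receptive_field(5, 16, 1, 1))  # 65
```

Under these assumptions, base_dilation=1 keeps every layer undilated, so the receptive field grows only linearly with the number of layers.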

forward(x: Tensor, x_lengths: Tensor, g: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor, Tensor]

Calculate forward propagation.

Parameters:
- x (Tensor) – Input tensor (B, in_channels, T_feats).
- x_lengths (Tensor) – Length tensor (B,).
- g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).

Returns:
- Tensor – Encoded hidden representation tensor (B, out_channels, T_feats).
- Tensor – Projected mean tensor (B, out_channels, T_feats).
- Tensor – Projected scale tensor (B, out_channels, T_feats).
- Tensor – Mask tensor for input tensor (B, 1, T_feats).

Return type: Tuple[Tensor, Tensor, Tensor, Tensor]
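
The shapes of the four outputs can be illustrated with a minimal PyTorch sketch. It does not call ESPnet and is not the library's implementation: it assumes the mask marks valid frames with 1.0, that the scale tensor holds a log standard deviation, and that the encoded representation is drawn as mean + noise * exp(log scale). The helpers make_mask and sample_posterior are hypothetical names introduced here.

```python
import torch


def make_mask(x_lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    """Build a (B, 1, T) mask: 1.0 on valid frames, 0.0 on padding."""
    positions = torch.arange(max_len, device=x_lengths.device).unsqueeze(0)  # (1, T)
    mask = positions < x_lengths.unsqueeze(1)  # (B, T), boolean
    return mask.unsqueeze(1).float()  # (B, 1, T)


def sample_posterior(
    m: torch.Tensor, logs: torch.Tensor, x_mask: torch.Tensor
) -> torch.Tensor:
    """Reparameterized sample z = m + eps * exp(logs), zeroed on padding.

    Assumes logs is a log standard deviation (an assumption of this sketch).
    """
    eps = torch.randn_like(m)
    return (m + eps * torch.exp(logs)) * x_mask


# Toy batch: 2 utterances, out_channels=192, 7 frames, second one padded to 4.
B, C, T = 2, 192, 7
x_lengths = torch.tensor([7, 4])
x_mask = make_mask(x_lengths, T)

# Stand-ins for the projected mean and scale a real encoder would produce.
m = torch.zeros(B, C, T)
logs = torch.zeros(B, C, T)

z = sample_posterior(m, logs, x_mask)
print(z.shape, x_mask.shape)  # torch.Size([2, 192, 7]) torch.Size([2, 1, 7])
```

Because the mask has a singleton channel dimension, it broadcasts across all out_channels, so padded frames are zeroed in every channel of z.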