espnet2.tts.gst.style_encoder.StyleEncoder


class espnet2.tts.gst.style_encoder.StyleEncoder(idim: int = 80, gst_tokens: int = 10, gst_token_dim: int = 256, gst_heads: int = 4, conv_layers: int = 6, conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), conv_kernel_size: int = 3, conv_stride: int = 2, gru_layers: int = 1, gru_units: int = 128)

Bases: Module

Style encoder.

This module is the style encoder introduced in "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis".

Parameters:
- idim (int , optional) – Dimension of the input mel-spectrogram.
- gst_tokens (int , optional) – The number of GST embeddings.
- gst_token_dim (int , optional) – Dimension of each GST embedding.
- gst_heads (int , optional) – The number of heads in GST multihead attention.
- conv_layers (int , optional) – The number of conv layers in the reference encoder.
- conv_chans_list (Sequence[int], optional) – The number of channels of each conv layer in the reference encoder.
- conv_kernel_size (int , optional) – Kernel size of conv layers in the reference encoder.
- conv_stride (int , optional) – Stride size of conv layers in the reference encoder.
- gru_layers (int , optional) – The number of GRU layers in the reference encoder.
- gru_units (int , optional) – The number of GRU units in the reference encoder.

Initialize global style encoder module.
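
The constructor arguments constrain one another, so it can help to check them before building the module. The following is a minimal sketch of such a check, not part of ESPnet: `check_style_encoder_args` is a hypothetical helper. It assumes that conv_chans_list must hold one entry per conv layer, as the parameter descriptions suggest, and that gst_token_dim must divide evenly across gst_heads, as is usual for multi-head attention.

```python
from typing import Sequence


def check_style_encoder_args(
    gst_token_dim: int = 256,
    gst_heads: int = 4,
    conv_layers: int = 6,
    conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128),
) -> None:
    """Hypothetical helper: raise ValueError on inconsistent arguments."""
    # Assumption: one channel count is needed per conv layer.
    if len(conv_chans_list) != conv_layers:
        raise ValueError(
            f"conv_chans_list has {len(conv_chans_list)} entries, "
            f"but conv_layers is {conv_layers}"
        )
    # Assumption: the attention splits gst_token_dim evenly across heads.
    if gst_token_dim % gst_heads != 0:
        raise ValueError(
            f"gst_token_dim ({gst_token_dim}) is not divisible "
            f"by gst_heads ({gst_heads})"
        )


# The documented defaults are consistent with each other.
check_style_encoder_args()
```

Changing conv_layers without also changing conv_chans_list is the case this check is meant to catch.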

forward(speech: Tensor) → Tensor

Calculate forward propagation.

Parameters: speech (Tensor) – Batch of padded target features (B, Lmax, odim). The feature dimension is expected to match idim.
Returns: Style token embeddings (B, token_dim), where token_dim corresponds to gst_token_dim.
Return type: Tensor
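
To make the shape contract of forward concrete, the sketch below traces tensor shapes in plain Python without running the model. Only the input and output shapes come from this page. The intermediate shapes are assumptions about the reference encoder: each conv layer is taken to pad by (kernel_size - 1) // 2 and to stride over both the time and frequency axes, and the GRU is taken to emit a single vector of size gru_units per utterance. `conv_out_len` and `trace_shapes` are hypothetical names.

```python
from typing import Dict, Sequence, Tuple


def conv_out_len(n: int, kernel: int = 3, stride: int = 2) -> int:
    """Length after one strided conv, assuming padding of (kernel - 1) // 2."""
    padding = (kernel - 1) // 2
    return (n + 2 * padding - kernel) // stride + 1


def trace_shapes(
    batch: int,
    lmax: int,
    idim: int = 80,
    gst_token_dim: int = 256,
    conv_layers: int = 6,
    conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128),
    conv_kernel_size: int = 3,
    conv_stride: int = 2,
    gru_units: int = 128,
) -> Dict[str, Tuple[int, ...]]:
    """Hypothetical helper: shapes at each stage, under the stated assumptions."""
    time, freq = lmax, idim
    # Each conv layer shrinks both the time and the frequency axis.
    for _ in range(conv_layers):
        time = conv_out_len(time, conv_kernel_size, conv_stride)
        freq = conv_out_len(freq, conv_kernel_size, conv_stride)
    return {
        "speech": (batch, lmax, idim),
        # Channels and remaining frequency bins are flattened for the GRU.
        "gru_input": (batch, time, freq * conv_chans_list[-1]),
        "reference_embedding": (batch, gru_units),
        "style_embedding": (batch, gst_token_dim),
    }


shapes = trace_shapes(batch=2, lmax=100)
print(shapes["style_embedding"])  # (2, 256)
```

Under these assumptions the output shape does not depend on Lmax, which matches the documented return shape of (B, token_dim).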