espnet2.gan_tts.jets.generator.JETSGenerator
class espnet2.gan_tts.jets.generator.JETSGenerator(idim: int, odim: int, adim: int = 256, aheads: int = 2, elayers: int = 4, eunits: int = 1024, dlayers: int = 4, dunits: int = 1024, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 1, use_scaled_pos_enc: bool = True, use_batch_norm: bool = True, encoder_normalize_before: bool = True, decoder_normalize_before: bool = True, encoder_concat_after: bool = False, decoder_concat_after: bool = False, reduction_factor: int = 1, encoder_type: str = 'transformer', decoder_type: str = 'transformer', transformer_enc_dropout_rate: float = 0.1, transformer_enc_positional_dropout_rate: float = 0.1, transformer_enc_attn_dropout_rate: float = 0.1, transformer_dec_dropout_rate: float = 0.1, transformer_dec_positional_dropout_rate: float = 0.1, transformer_dec_attn_dropout_rate: float = 0.1, conformer_rel_pos_type: str = 'legacy', conformer_pos_enc_layer_type: str = 'rel_pos', conformer_self_attn_layer_type: str = 'rel_selfattn', conformer_activation_type: str = 'swish', use_macaron_style_in_conformer: bool = True, use_cnn_in_conformer: bool = True, zero_triu: bool = False, conformer_enc_kernel_size: int = 7, conformer_dec_kernel_size: int = 31, duration_predictor_layers: int = 2, duration_predictor_chans: int = 384, duration_predictor_kernel_size: int = 3, duration_predictor_dropout_rate: float = 0.1, energy_predictor_layers: int = 2, energy_predictor_chans: int = 384, energy_predictor_kernel_size: int = 3, energy_predictor_dropout: float = 0.5, energy_embed_kernel_size: int = 9, energy_embed_dropout: float = 0.5, stop_gradient_from_energy_predictor: bool = False, pitch_predictor_layers: int = 2, pitch_predictor_chans: int = 384, pitch_predictor_kernel_size: int = 3, pitch_predictor_dropout: float = 0.5, pitch_embed_kernel_size: int = 9, pitch_embed_dropout: float = 0.5, stop_gradient_from_pitch_predictor: bool = False, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, 
spk_embed_integration_type: str = 'add', use_gst: bool = False, gst_tokens: int = 10, gst_heads: int = 4, gst_conv_layers: int = 6, gst_conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), gst_conv_kernel_size: int = 3, gst_conv_stride: int = 2, gst_gru_layers: int = 1, gst_gru_units: int = 128, init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_masking: bool = False, use_weighted_masking: bool = False, segment_size: int = 64, generator_out_channels: int = 1, generator_channels: int = 512, generator_global_channels: int = -1, generator_kernel_size: int = 7, generator_upsample_scales: List[int] = [8, 8, 2, 2], generator_upsample_kernel_sizes: List[int] = [16, 16, 4, 4], generator_resblock_kernel_sizes: List[int] = [3, 7, 11], generator_resblock_dilations: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], generator_use_additional_convs: bool = True, generator_bias: bool = True, generator_nonlinear_activation: str = 'LeakyReLU', generator_nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.1}, generator_use_weight_norm: bool = True)
Bases: Module
Generator module in JETS.
Initialize JETS generator module.
- Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- elayers (int) – Number of encoder layers.
- eunits (int) – Number of encoder hidden units.
- dlayers (int) – Number of decoder layers.
- dunits (int) – Number of decoder hidden units.
- use_scaled_pos_enc (bool) – Whether to use trainable scaled pos encoding.
- use_batch_norm (bool) – Whether to use batch normalization in encoder prenet.
- encoder_normalize_before (bool) – Whether to apply layernorm layer before encoder block.
- decoder_normalize_before (bool) – Whether to apply layernorm layer before decoder block.
- encoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in encoder.
- decoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in decoder.
- reduction_factor (int) – Reduction factor.
- encoder_type (str) – Encoder type (“transformer” or “conformer”).
- decoder_type (str) – Decoder type (“transformer” or “conformer”).
- transformer_enc_dropout_rate (float) – Dropout rate in encoder except attention and positional encoding.
- transformer_enc_positional_dropout_rate (float) – Dropout rate after encoder positional encoding.
- transformer_enc_attn_dropout_rate (float) – Dropout rate in encoder self-attention module.
- transformer_dec_dropout_rate (float) – Dropout rate in decoder except attention & positional encoding.
- transformer_dec_positional_dropout_rate (float) – Dropout rate after decoder positional encoding.
- transformer_dec_attn_dropout_rate (float) – Dropout rate in decoder self-attention module.
- conformer_rel_pos_type (str) – Relative pos encoding type in conformer.
- conformer_pos_enc_layer_type (str) – Pos encoding layer type in conformer.
- conformer_self_attn_layer_type (str) – Self-attention layer type in conformer.
- conformer_activation_type (str) – Activation function type in conformer.
- use_macaron_style_in_conformer (bool) – Whether to use macaron style FFN.
- use_cnn_in_conformer (bool) – Whether to use CNN in conformer.
- zero_triu (bool) – Whether to use zero triu in relative self-attention module.
- conformer_enc_kernel_size (int) – Kernel size of the conformer convolution in the encoder.
- conformer_dec_kernel_size (int) – Kernel size of the conformer convolution in the decoder.
- duration_predictor_layers (int) – Number of duration predictor layers.
- duration_predictor_chans (int) – Number of duration predictor channels.
- duration_predictor_kernel_size (int) – Kernel size of duration predictor.
- duration_predictor_dropout_rate (float) – Dropout rate in duration predictor.
- pitch_predictor_layers (int) – Number of pitch predictor layers.
- pitch_predictor_chans (int) – Number of pitch predictor channels.
- pitch_predictor_kernel_size (int) – Kernel size of pitch predictor.
- pitch_predictor_dropout (float) – Dropout rate in pitch predictor.
- pitch_embed_kernel_size (int) – Kernel size of pitch embedding.
- pitch_embed_dropout (float) – Dropout rate for pitch embedding.
- stop_gradient_from_pitch_predictor (bool) – Whether to stop gradient from pitch predictor to encoder.
- energy_predictor_layers (int) – Number of energy predictor layers.
- energy_predictor_chans (int) – Number of energy predictor channels.
- energy_predictor_kernel_size (int) – Kernel size of energy predictor.
- energy_predictor_dropout (float) – Dropout rate in energy predictor.
- energy_embed_kernel_size (int) – Kernel size of energy embedding.
- energy_embed_dropout (float) – Dropout rate for energy embedding.
- stop_gradient_from_energy_predictor (bool) – Whether to stop gradient from energy predictor to encoder.
- spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.
- langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
- spk_embed_integration_type (str) – How to integrate speaker embedding.
- use_gst (bool) – Whether to use global style token.
- gst_tokens (int) – The number of GST embeddings.
- gst_heads (int) – The number of heads in GST multihead attention.
- gst_conv_layers (int) – The number of conv layers in GST.
- gst_conv_chans_list (Sequence[int]) – List of the number of channels of conv layers in GST.
- gst_conv_kernel_size (int) – Kernel size of conv layers in GST.
- gst_conv_stride (int) – Stride size of conv layers in GST.
- gst_gru_layers (int) – The number of GRU layers in GST.
- gst_gru_units (int) – The number of GRU units in GST.
- init_type (str) – How to initialize transformer parameters.
- init_enc_alpha (float) – Initial value of alpha in scaled pos encoding of the encoder.
- init_dec_alpha (float) – Initial value of alpha in scaled pos encoding of the decoder.
- use_masking (bool) – Whether to apply masking for padded part in loss calculation.
- use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
- segment_size (int) – Segment size for random windowed discriminator.
- generator_out_channels (int) – Number of output channels.
- generator_channels (int) – Number of hidden representation channels.
- generator_global_channels (int) – Number of global conditioning channels.
- generator_kernel_size (int) – Kernel size of initial and final conv layer.
- generator_upsample_scales (List[int]) – List of upsampling scales.
- generator_upsample_kernel_sizes (List[int]) – List of kernel sizes for upsample layers.
- generator_resblock_kernel_sizes (List[int]) – List of kernel sizes for residual blocks.
- generator_resblock_dilations (List[List[int]]) – List of list of dilations for residual blocks.
- generator_use_additional_convs (bool) – Whether to use additional conv layers in residual blocks.
- generator_bias (bool) – Whether to add bias parameter in convolution layers.
- generator_nonlinear_activation (str) – Activation function module name.
- generator_nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.
- generator_use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.
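As a sanity check on the generator defaults above, the total upsampling factor (the number of waveform samples produced per feature frame) is the product of generator_upsample_scales, and each default upsample kernel size is twice its scale. A minimal stdlib sketch, not ESPnet code (upsample_factor is a hypothetical helper):

```python
from math import prod

def upsample_factor(scales):
    """Total hop size realized by a HiFi-GAN-style upsample stack."""
    return prod(scales)

# Defaults from the signature above.
scales = [8, 8, 2, 2]
kernel_sizes = [16, 16, 4, 4]

factor = upsample_factor(scales)  # 8 * 8 * 2 * 2 = 256 samples per frame
print(factor)

# Each default kernel size is 2x its scale, a common choice that lets
# adjacent transposed-conv outputs overlap and reduces artifacts.
assert all(k == 2 * s for k, s in zip(kernel_sizes, scales))
```

With these defaults, one feature frame maps to 256 waveform samples, which is why the forward pass below produces segment_size * upsample_factor samples per segment.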
forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, pitch: Tensor, pitch_lengths: Tensor, energy: Tensor, energy_lengths: Tensor, sids: Tensor | None = None, spembs: Tensor | None = None, lids: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]
Calculate forward propagation.
- Parameters:
- text (Tensor) – Text index tensor (B, T_text).
- text_lengths (Tensor) – Text length tensor (B,).
- feats (Tensor) – Feature tensor (B, T_feats, aux_channels).
- feats_lengths (Tensor) – Feature length tensor (B,).
- pitch (Tensor) – Batch of padded token-averaged pitch (B, T_text, 1).
- pitch_lengths (LongTensor) – Batch of pitch lengths (B,).
- energy (Tensor) – Batch of padded token-averaged energy (B, T_text, 1).
- energy_lengths (LongTensor) – Batch of energy lengths (B,).
- sids (Optional[Tensor]) – Speaker index tensor (B,) or (B, 1).
- spembs (Optional[Tensor]) – Speaker embedding tensor (B, spk_embed_dim).
- lids (Optional[Tensor]) – Language index tensor (B,) or (B, 1).
- Returns: Tuple of ten tensors:
  - Tensor: Waveform tensor (B, 1, segment_size * upsample_factor).
  - Tensor: Binarization loss (scalar).
  - Tensor: Log probability attention matrix (B, T_feats, T_text).
  - Tensor: Segments start index tensor (B,).
  - Tensor: Predicted duration (B, T_text).
  - Tensor: Ground-truth duration obtained from an alignment module (B, T_text).
  - Tensor: Predicted pitch (B, T_text, 1).
  - Tensor: Ground-truth averaged pitch (B, T_text, 1).
  - Tensor: Predicted energy (B, T_text, 1).
  - Tensor: Ground-truth averaged energy (B, T_text, 1).
- Return type: Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]
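The segments start index tensor comes from random windowing: a fixed-length window of segment_size frames is cut from each item so the discriminator only ever sees valid (non-padded) frames. A stdlib sketch of that selection logic, using plain lists instead of tensors (get_random_segments here is a hypothetical stand-in for the actual ESPnet helper):

```python
import random

def get_random_segments(feats, feats_lengths, segment_size):
    """Pick one valid start index per batch item and slice a fixed window.

    feats: list of per-item frame lists, padded to a common length.
    feats_lengths: true number of valid frames per item.
    """
    segments, start_idxs = [], []
    for frames, length in zip(feats, feats_lengths):
        # Choose a start such that the whole window stays in the valid region.
        max_start = max(length - segment_size, 0)
        start = random.randint(0, max_start)
        start_idxs.append(start)
        segments.append(frames[start:start + segment_size])
    return segments, start_idxs

random.seed(0)
feats = [list(range(100)), list(range(80)) + [0] * 20]  # item 2 padded to 100
segs, starts = get_random_segments(feats, [100, 80], segment_size=64)
assert all(len(s) == 64 for s in segs)
assert all(st + 64 <= length for st, length in zip(starts, [100, 80]))
```

The matching waveform segment is then segment_size * upsample_factor samples long, as in the forward return shape above.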
inference(text: Tensor, text_lengths: Tensor, feats: Tensor | None = None, feats_lengths: Tensor | None = None, pitch: Tensor | None = None, energy: Tensor | None = None, sids: Tensor | None = None, spembs: Tensor | None = None, lids: Tensor | None = None, use_teacher_forcing: bool = False) → Tuple[Tensor, Tensor, Tensor]
Run inference.
- Parameters:
- text (Tensor) – Input text index tensor (B, T_text).
- text_lengths (Tensor) – Text length tensor (B,).
- feats (Tensor) – Feature tensor (B, T_feats, aux_channels).
- feats_lengths (Tensor) – Feature length tensor (B,).
- pitch (Tensor) – Pitch tensor (B, T_feats, 1).
- energy (Tensor) – Energy tensor (B, T_feats, 1).
- sids (Optional[Tensor]) – Speaker index tensor (B,) or (B, 1).
- spembs (Optional[Tensor]) – Speaker embedding tensor (B, spk_embed_dim).
- lids (Optional[Tensor]) – Language index tensor (B,) or (B, 1).
- use_teacher_forcing (bool) – Whether to use teacher forcing.
- Returns:
  - Tensor: Generated waveform tensor (B, T_wav).
  - Tensor: Duration tensor (B, T_text).
- Return type: Tuple[Tensor, Tensor, Tensor]
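At inference time the predicted durations drive length regulation: each encoder state is repeated according to its (rounded, non-negative) predicted duration before decoding, so the output frame count equals the sum of the duration tensor. A minimal stdlib sketch of that expansion (length_regulate is a hypothetical illustration, not the ESPnet implementation):

```python
def length_regulate(hidden_states, durations):
    """Repeat each per-token state by its predicted duration (in frames).

    hidden_states: one entry per text token.
    durations: non-negative integer frame counts, as obtained by rounding
        the duration predictor's output.
    """
    expanded = []
    for state, dur in zip(hidden_states, durations):
        expanded.extend([state] * dur)
    return expanded

states = ["h1", "h2", "h3"]
durs = [2, 0, 3]  # a zero duration drops that token entirely
frames = length_regulate(states, durs)
assert frames == ["h1", "h1", "h3", "h3", "h3"]
assert len(frames) == sum(durs)
```

This is why the generated waveform length T_wav scales with the duration tensor: the expanded frame sequence is decoded and then upsampled by the generator's total upsampling factor.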