espnet.nets.pytorch_backend.e2e_tts_fastspeech.FeedForwardTransformer
class espnet.nets.pytorch_backend.e2e_tts_fastspeech.FeedForwardTransformer(idim, odim, args=None)
Bases: TTSInterface, Module
Feed Forward Transformer for TTS a.k.a. FastSpeech.
This is a module of FastSpeech, a feed-forward Transformer with a duration predictor, described in FastSpeech: Fast, Robust and Controllable Text to Speech. It does not require any autoregressive processing during inference, resulting in fast decoding compared with an autoregressive Transformer.
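The fast, non-autoregressive decoding hinges on a length regulator: each encoder hidden state is repeated according to its predicted duration, so the decoder can produce all output frames in parallel. A minimal pure-Python sketch of that expansion (illustrative only, not the actual ESPnet implementation):

```python
def length_regulate(hidden_states, durations):
    """Expand each encoder state by its predicted duration.

    hidden_states: one vector per input token (plain lists here)
    durations: non-negative int frame counts, one per token
    """
    expanded = []
    for h, d in zip(hidden_states, durations):
        expanded.extend([h] * d)  # repeat the state d times
    return expanded

# Three "tokens" with durations 2, 1, 3 yield a 6-frame sequence.
frames = length_regulate([[0.1], [0.2], [0.3]], [2, 1, 3])
print(len(frames))  # 6
```

Because the output length is fixed up front by the durations, no frame depends on the previously generated frame, which is what removes the autoregressive bottleneck.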
Initialize feed-forward Transformer module.
- Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- args (Namespace , optional) –
- elayers (int): Number of encoder layers.
- eunits (int): Number of encoder hidden units.
- adim (int): Number of attention transformation dimensions.
- aheads (int): Number of heads for multi-head attention.
- dlayers (int): Number of decoder layers.
- dunits (int): Number of decoder hidden units.
- use_scaled_pos_enc (bool): Whether to use trainable scaled positional encoding.
- encoder_normalize_before (bool): Whether to perform layer normalization before encoder block.
- decoder_normalize_before (bool): Whether to perform layer normalization before decoder block.
- encoder_concat_after (bool): Whether to concatenate attention layer’s input and output in encoder.
- decoder_concat_after (bool): Whether to concatenate attention layer’s input and output in decoder.
- duration_predictor_layers (int): Number of duration predictor layers.
- duration_predictor_chans (int): Number of duration predictor channels.
- duration_predictor_kernel_size (int): Kernel size of duration predictor.
- spk_embed_dim (int): Number of speaker embedding dimensions.
- spk_embed_integration_type (str): How to integrate speaker embedding.
- teacher_model (str): Teacher auto-regressive transformer model path.
- reduction_factor (int): Reduction factor.
- transformer_init (str): How to initialize transformer parameters.
- transformer_lr (float): Initial value of learning rate.
- transformer_warmup_steps (int): Optimizer warmup steps.
- transformer_enc_dropout_rate (float): Dropout rate in encoder except attention & positional encoding.
- transformer_enc_positional_dropout_rate (float): Dropout rate after encoder positional encoding.
- transformer_enc_attn_dropout_rate (float): Dropout rate in encoder self-attention module.
- transformer_dec_dropout_rate (float): Dropout rate in decoder except attention & positional encoding.
- transformer_dec_positional_dropout_rate (float): Dropout rate after decoder positional encoding.
- transformer_dec_attn_dropout_rate (float): Dropout rate in decoder self-attention module.
- transformer_enc_dec_attn_dropout_rate (float): Dropout rate in encoder-decoder attention module.
- use_masking (bool): Whether to apply masking for padded part in loss calculation.
- use_weighted_masking (bool): Whether to apply weighted masking in loss calculation.
- transfer_encoder_from_teacher: Whether to transfer encoder using teacher encoder parameters.
- transferred_encoder_module: Encoder module to be initialized using teacher parameters.
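In the ESPnet training scripts these hyperparameters arrive as an argparse Namespace, and add_arguments(parser) registers the model-specific flags on the parser. A hedged, stdlib-only sketch of that pattern; the flag names below mirror a few of the parameters above, but the values are illustrative, not ESPnet's actual defaults:

```python
import argparse

parser = argparse.ArgumentParser()
# A few of the model-specific flags; defaults here are illustrative.
parser.add_argument("--elayers", type=int, default=6)
parser.add_argument("--adim", type=int, default=384)
parser.add_argument("--aheads", type=int, default=4)
parser.add_argument("--reduction-factor", type=int, default=1)

args = parser.parse_args([])  # empty argv -> all defaults
print(args.adim)  # 384
```

The resulting args object is what gets passed to FeedForwardTransformer(idim, odim, args=args); the real flag set is defined in the model's add_arguments static method.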
static add_arguments(parser)
Add model-specific arguments to the parser.
property attention_plot_class
Return plot class for attention weight plot.
property base_plot_keys
Return base key names to plot during training.
The keys should match what chainer.reporter reports. If you add the key loss, the reporter will report main/loss and validation/main/loss values; loss.png will also be created as a figure visualizing main/loss and validation/main/loss values.
- Returns: List of strings which are base keys to plot during training.
- Return type: list
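The base-key-to-reporter-name mapping described above can be sketched in a few lines (this helper is hypothetical, shown only to make the naming convention concrete):

```python
def reporter_keys(base_keys):
    """Map base plot keys to the names the chainer reporter emits."""
    out = []
    for k in base_keys:
        out.append(f"main/{k}")             # training value
        out.append(f"validation/main/{k}")  # validation value
    return out

print(reporter_keys(["loss"]))  # ['main/loss', 'validation/main/loss']
```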
calculate_all_attentions(xs, ilens, ys, olens, spembs=None, extras=None, *args, **kwargs)
Calculate all of the attention weights.
- Parameters:
- xs (Tensor) – Batch of padded character ids (B, Tmax).
- ilens (LongTensor) – Batch of lengths of each input (B,).
- ys (Tensor) – Batch of padded target features (B, Lmax, odim).
- olens (LongTensor) – Batch of the lengths of each target (B,).
- spembs (Tensor , optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
- extras (Tensor , optional) – Batch of precalculated durations (B, Tmax, 1).
- Returns: Dict of attention weights and outputs.
- Return type: dict
forward(xs, ilens, ys, olens, spembs=None, extras=None, *args, **kwargs)
Calculate forward propagation.
- Parameters:
- xs (Tensor) – Batch of padded character ids (B, Tmax).
- ilens (LongTensor) – Batch of lengths of each input (B,).
- ys (Tensor) – Batch of padded target features (B, Lmax, odim).
- olens (LongTensor) – Batch of the lengths of each target (B,).
- spembs (Tensor , optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
- extras (Tensor , optional) – Batch of precalculated durations (B, Tmax, 1).
- Returns: Loss value.
- Return type: Tensor
inference(x, inference_args, spemb=None, *args, **kwargs)
Generate the sequence of features given the sequences of characters.
- Parameters:
- x (Tensor) – Input sequence of characters (T,).
- inference_args (Namespace) – Dummy for compatibility.
- spemb (Tensor , optional) – Speaker embedding vector (spk_embed_dim,).
- Returns: Output sequence of features (L, odim), followed by two None placeholders kept for interface compatibility.
- Return type: Tensor