espnet2.tts package¶
espnet2.tts.fastspeech2¶
Fastspeech2 related modules for ESPnet2.

class espnet2.tts.fastspeech2.FastSpeech2
(idim: int, odim: int, adim: int = 384, aheads: int = 4, elayers: int = 6, eunits: int = 1536, dlayers: int = 6, dunits: int = 1536, postnet_layers: int = 5, postnet_chans: int = 512, postnet_filts: int = 5, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 1, use_scaled_pos_enc: bool = True, use_batch_norm: bool = True, encoder_normalize_before: bool = True, decoder_normalize_before: bool = True, encoder_concat_after: bool = False, decoder_concat_after: bool = False, reduction_factor: int = 1, encoder_type: str = 'transformer', decoder_type: str = 'transformer', conformer_pos_enc_layer_type: str = 'rel_pos', conformer_self_attn_layer_type: str = 'rel_selfattn', conformer_activation_type: str = 'swish', use_macaron_style_in_conformer: bool = True, use_cnn_in_conformer: bool = True, conformer_enc_kernel_size: int = 7, conformer_dec_kernel_size: int = 31, duration_predictor_layers: int = 2, duration_predictor_chans: int = 384, duration_predictor_kernel_size: int = 3, energy_predictor_layers: int = 2, energy_predictor_chans: int = 384, energy_predictor_kernel_size: int = 3, energy_predictor_dropout: float = 0.5, energy_embed_kernel_size: int = 9, energy_embed_dropout: float = 0.5, stop_gradient_from_energy_predictor: bool = False, pitch_predictor_layers: int = 2, pitch_predictor_chans: int = 384, pitch_predictor_kernel_size: int = 3, pitch_predictor_dropout: float = 0.5, pitch_embed_kernel_size: int = 9, pitch_embed_dropout: float = 0.5, stop_gradient_from_pitch_predictor: bool = False, spk_embed_dim: int = None, spk_embed_integration_type: str = 'add', use_gst: bool = False, gst_tokens: int = 10, gst_heads: int = 4, gst_conv_layers: int = 6, gst_conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), gst_conv_kernel_size: int = 3, gst_conv_stride: int = 2, gst_gru_layers: int = 1, gst_gru_units: int = 128, transformer_enc_dropout_rate: float = 0.1, transformer_enc_positional_dropout_rate: float = 0.1, 
transformer_enc_attn_dropout_rate: float = 0.1, transformer_dec_dropout_rate: float = 0.1, transformer_dec_positional_dropout_rate: float = 0.1, transformer_dec_attn_dropout_rate: float = 0.1, duration_predictor_dropout_rate: float = 0.1, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_masking: bool = False, use_weighted_masking: bool = False)[source]¶ Bases:
espnet2.tts.abs_tts.AbsTTS
FastSpeech2 module.
This is a module of FastSpeech2 described in FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. Instead of quantized pitch and energy, we use token-averaged values as introduced in FastPitch: Parallel Text-to-speech with Pitch Prediction.
Initialize FastSpeech2 module.
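As a hedged illustration of the token-averaging mentioned above (a sketch, not the ESPnet implementation; the function name is illustrative), a frame-level pitch contour can be averaged over each token's duration:

```python
import torch

def token_average(frame_values: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Average a frame-level contour over each token's frames.

    frame_values: (L,) frame-level values (e.g. pitch in Hz).
    durations: (T,) integer frames per token, summing to L.
    Returns: (T,) one averaged value per token.
    """
    out = frame_values.new_zeros(len(durations))
    start = 0
    for i, d in enumerate(durations.tolist()):
        if d > 0:  # guard against zero-duration tokens
            out[i] = frame_values[start:start + d].mean()
        start += d
    return out

# two tokens spanning 3 and 2 frames respectively
token_average(torch.tensor([100.0, 110.0, 120.0, 200.0, 210.0]),
              torch.tensor([3, 2]))  # tensor([110., 205.])
```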

forward
(text: torch.Tensor, text_lengths: torch.Tensor, speech: torch.Tensor, speech_lengths: torch.Tensor, durations: torch.Tensor, durations_lengths: torch.Tensor, pitch: torch.Tensor, pitch_lengths: torch.Tensor, energy: torch.Tensor, energy_lengths: torch.Tensor, spembs: torch.Tensor = None) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Calculate forward propagation.
 Parameters
text (LongTensor) – Batch of padded token ids (B, Tmax).
text_lengths (LongTensor) – Batch of lengths of each input (B,).
speech (Tensor) – Batch of padded target features (B, Lmax, odim).
speech_lengths (LongTensor) – Batch of the lengths of each target (B,).
durations (LongTensor) – Batch of padded durations (B, Tmax + 1).
durations_lengths (LongTensor) – Batch of duration lengths (B, Tmax + 1).
pitch (Tensor) – Batch of padded token-averaged pitch (B, Tmax + 1, 1).
pitch_lengths (LongTensor) – Batch of pitch lengths (B, Tmax + 1).
energy (Tensor) – Batch of padded token-averaged energy (B, Tmax + 1, 1).
energy_lengths (LongTensor) – Batch of energy lengths (B, Tmax + 1).
spembs (Tensor, optional) – Batch of speaker embeddings (B, spk_embed_dim).
 Returns
Loss scalar value. Dict: Statistics to be monitored. Tensor: Weight value.
 Return type
Tensor

inference
(text: torch.Tensor, speech: torch.Tensor = None, spembs: torch.Tensor = None, durations: torch.Tensor = None, pitch: torch.Tensor = None, energy: torch.Tensor = None, alpha: float = 1.0, use_teacher_forcing: bool = False) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Generate the sequence of features given the sequences of characters.
 Parameters
text (LongTensor) – Input sequence of characters (T,).
speech (Tensor, optional) – Feature sequence to extract style (N, idim).
spembs (Tensor, optional) – Speaker embedding vector (spk_embed_dim,).
durations (LongTensor, optional) – Ground-truth duration (T + 1,).
pitch (Tensor, optional) – Ground-truth token-averaged pitch (T + 1, 1).
energy (Tensor, optional) – Ground-truth token-averaged energy (T + 1, 1).
alpha (float, optional) – Alpha to control the speed.
use_teacher_forcing (bool, optional) – Whether to use teacher forcing. If true, the ground-truth duration, pitch, and energy will be used.
 Returns
Output sequence of features (L, odim). None: Dummy for compatibility. None: Dummy for compatibility.
 Return type
Tensor
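The alpha argument above scales token durations before they are expanded into frames. A minimal length-regulator sketch (an illustration, not ESPnet's LengthRegulator class) makes the mechanism concrete:

```python
import torch

def length_regulate(hs: torch.Tensor, durations: torch.Tensor,
                    alpha: float = 1.0) -> torch.Tensor:
    """Expand token hiddens to frame resolution.

    hs: (T, adim) per-token hidden vectors.
    durations: (T,) integer frame counts per token.
    alpha: duration scaling factor controlling speed.
    """
    if alpha != 1.0:
        durations = torch.round(durations.float() * alpha).long()
    # repeat each token's vector durations[i] times along the time axis
    return hs.repeat_interleave(durations, dim=0)

hs = torch.eye(3)  # 3 tokens, 3-dim hiddens
length_regulate(hs, torch.tensor([2, 1, 3])).shape  # torch.Size([6, 3])
```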


class espnet2.tts.fastspeech2.FastSpeech2Loss
(use_masking: bool = True, use_weighted_masking: bool = False)[source]¶ Bases:
torch.nn.modules.module.Module
Loss function module for FastSpeech2.
Initialize feed-forward Transformer loss module.
 Parameters
use_masking (bool) – Whether to apply masking for padded part in loss calculation.
use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.

forward
(after_outs: torch.Tensor, before_outs: torch.Tensor, d_outs: torch.Tensor, p_outs: torch.Tensor, e_outs: torch.Tensor, ys: torch.Tensor, ds: torch.Tensor, ps: torch.Tensor, es: torch.Tensor, ilens: torch.Tensor, olens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Calculate forward propagation.
 Parameters
after_outs (Tensor) – Batch of outputs after postnets (B, Lmax, odim).
before_outs (Tensor) – Batch of outputs before postnets (B, Lmax, odim).
d_outs (LongTensor) – Batch of outputs of duration predictor (B, Tmax).
p_outs (Tensor) – Batch of outputs of pitch predictor (B, Tmax, 1).
e_outs (Tensor) – Batch of outputs of energy predictor (B, Tmax, 1).
ys (Tensor) – Batch of target features (B, Lmax, odim).
ds (LongTensor) – Batch of durations (B, Tmax).
ps (Tensor) – Batch of target token-averaged pitch (B, Tmax, 1).
es (Tensor) – Batch of target token-averaged energy (B, Tmax, 1).
ilens (LongTensor) – Batch of the lengths of each input (B,).
olens (LongTensor) – Batch of the lengths of each target (B,).
 Returns
L1 loss value. Tensor: Duration predictor loss value. Tensor: Pitch predictor loss value. Tensor: Energy predictor loss value.
 Return type
Tensor
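When use_masking is enabled, losses are computed only over non-padded frames. A hedged sketch of that masking logic in plain PyTorch (not the FastSpeech2Loss code itself):

```python
import torch
import torch.nn.functional as F

def masked_l1(outs: torch.Tensor, ys: torch.Tensor, olens: torch.Tensor) -> torch.Tensor:
    """L1 loss over valid frames only.

    outs, ys: (B, Lmax, odim); olens: (B,) valid frame counts per utterance.
    """
    Lmax = ys.size(1)
    # (B, Lmax): True where the frame index is within the valid length
    mask = torch.arange(Lmax).unsqueeze(0) < olens.unsqueeze(1)
    mask = mask.unsqueeze(-1)  # broadcasts over the odim axis
    return F.l1_loss(outs.masked_select(mask), ys.masked_select(mask))
```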
espnet2.tts.duration_calculator¶
Duration calculator for ESPnet2.
espnet2.tts.transformer¶
TTS-Transformer related modules.

class espnet2.tts.transformer.Transformer
(idim: int, odim: int, embed_dim: int = 512, eprenet_conv_layers: int = 3, eprenet_conv_chans: int = 256, eprenet_conv_filts: int = 5, dprenet_layers: int = 2, dprenet_units: int = 256, elayers: int = 6, eunits: int = 1024, adim: int = 512, aheads: int = 4, dlayers: int = 6, dunits: int = 1024, postnet_layers: int = 5, postnet_chans: int = 256, postnet_filts: int = 5, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 1, use_scaled_pos_enc: bool = True, use_batch_norm: bool = True, encoder_normalize_before: bool = True, decoder_normalize_before: bool = True, encoder_concat_after: bool = False, decoder_concat_after: bool = False, reduction_factor: int = 1, spk_embed_dim: int = None, spk_embed_integration_type: str = 'add', use_gst: bool = False, gst_tokens: int = 10, gst_heads: int = 4, gst_conv_layers: int = 6, gst_conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), gst_conv_kernel_size: int = 3, gst_conv_stride: int = 2, gst_gru_layers: int = 1, gst_gru_units: int = 128, transformer_enc_dropout_rate: float = 0.1, transformer_enc_positional_dropout_rate: float = 0.1, transformer_enc_attn_dropout_rate: float = 0.1, transformer_dec_dropout_rate: float = 0.1, transformer_dec_positional_dropout_rate: float = 0.1, transformer_dec_attn_dropout_rate: float = 0.1, transformer_enc_dec_attn_dropout_rate: float = 0.1, eprenet_dropout_rate: float = 0.5, dprenet_dropout_rate: float = 0.5, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_masking: bool = False, use_weighted_masking: bool = False, bce_pos_weight: float = 5.0, loss_type: str = 'L1', use_guided_attn_loss: bool = True, num_heads_applied_guided_attn: int = 2, num_layers_applied_guided_attn: int = 2, modules_applied_guided_attn: Sequence[str] = 'encoderdecoder', guided_attn_loss_sigma: float = 0.4, guided_attn_loss_lambda: float = 1.0)[source]¶ Bases:
espnet2.tts.abs_tts.AbsTTS
TTS-Transformer module.
This is a module of the text-to-speech Transformer described in Neural Speech Synthesis with Transformer Network, which converts a sequence of tokens into a sequence of Mel-filterbanks.
 Parameters
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
embed_dim (int, optional) – Dimension of character embedding.
eprenet_conv_layers (int, optional) – Number of encoder prenet convolution layers.
eprenet_conv_chans (int, optional) – Number of encoder prenet convolution channels.
eprenet_conv_filts (int, optional) – Filter size of encoder prenet convolution.
dprenet_layers (int, optional) – Number of decoder prenet layers.
dprenet_units (int, optional) – Number of decoder prenet hidden units.
elayers (int, optional) – Number of encoder layers.
eunits (int, optional) – Number of encoder hidden units.
adim (int, optional) – Number of attention transformation dimensions.
aheads (int, optional) – Number of heads for multi-head attention.
dlayers (int, optional) – Number of decoder layers.
dunits (int, optional) – Number of decoder hidden units.
postnet_layers (int, optional) – Number of postnet layers.
postnet_chans (int, optional) – Number of postnet channels.
postnet_filts (int, optional) – Filter size of postnet.
use_scaled_pos_enc (bool, optional) – Whether to use trainable scaled positional encoding.
use_batch_norm (bool, optional) – Whether to use batch normalization in encoder prenet.
encoder_normalize_before (bool, optional) – Whether to perform layer normalization before encoder block.
decoder_normalize_before (bool, optional) – Whether to perform layer normalization before decoder block.
encoder_concat_after (bool, optional) – Whether to concatenate attention layer’s input and output in encoder.
decoder_concat_after (bool, optional) – Whether to concatenate attention layer’s input and output in decoder.
positionwise_layer_type (str, optional) – Positionwise operation type.
positionwise_conv_kernel_size (int, optional) – Kernel size in position-wise conv1d.
reduction_factor (int, optional) – Reduction factor.
spk_embed_dim (int, optional) – Number of speaker embedding dimensions.
spk_embed_integration_type (str, optional) – How to integrate speaker embedding.
use_gst (bool, optional) – Whether to use global style token.
gst_tokens (int, optional) – The number of GST embeddings.
gst_heads (int, optional) – The number of heads in GST multi-head attention.
gst_conv_layers (int, optional) – The number of conv layers in GST.
gst_conv_chans_list (Sequence[int], optional) – List of the number of channels of conv layers in GST.
gst_conv_kernel_size (int, optional) – Kernel size of conv layers in GST.
gst_conv_stride (int, optional) – Stride size of conv layers in GST.
gst_gru_layers (int, optional) – The number of GRU layers in GST.
gst_gru_units (int, optional) – The number of GRU units in GST.
transformer_lr (float, optional) – Initial value of learning rate.
transformer_warmup_steps (int, optional) – Optimizer warmup steps.
transformer_enc_dropout_rate (float, optional) – Dropout rate in encoder except attention and positional encoding.
transformer_enc_positional_dropout_rate (float, optional) – Dropout rate after encoder positional encoding.
transformer_enc_attn_dropout_rate (float, optional) – Dropout rate in encoder self-attention module.
transformer_dec_dropout_rate (float, optional) – Dropout rate in decoder except attention & positional encoding.
transformer_dec_positional_dropout_rate (float, optional) – Dropout rate after decoder positional encoding.
transformer_dec_attn_dropout_rate (float, optional) – Dropout rate in decoder self-attention module.
transformer_enc_dec_attn_dropout_rate (float, optional) – Dropout rate in encoder-decoder attention module.
init_type (str, optional) – How to initialize transformer parameters.
init_enc_alpha (float, optional) – Initial value of alpha in scaled pos encoding of the encoder.
init_dec_alpha (float, optional) – Initial value of alpha in scaled pos encoding of the decoder.
eprenet_dropout_rate (float, optional) – Dropout rate in encoder prenet.
dprenet_dropout_rate (float, optional) – Dropout rate in decoder prenet.
postnet_dropout_rate (float, optional) – Dropout rate in postnet.
use_masking (bool, optional) – Whether to apply masking for padded part in loss calculation.
use_weighted_masking (bool, optional) – Whether to apply weighted masking in loss calculation.
bce_pos_weight (float, optional) – Positive sample weight in bce calculation (only for use_masking=true).
loss_type (str, optional) – How to calculate loss.
use_guided_attn_loss (bool, optional) – Whether to use guided attention loss.
num_heads_applied_guided_attn (int, optional) – Number of heads in each layer to apply guided attention loss.
num_layers_applied_guided_attn (int, optional) – Number of layers to apply guided attention loss.
modules_applied_guided_attn (Sequence[str], optional) – List of module names to apply guided attention loss.
guided_attn_loss_sigma (float, optional) – Sigma in guided attention loss.
guided_attn_loss_lambda (float, optional) – Lambda in guided attention loss.
Initialize Transformer module.
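The guided attention loss penalizes attention mass far from the diagonal, using a soft mask parameterized by guided_attn_loss_sigma and scaled by guided_attn_loss_lambda. A sketch of the idea for a single utterance (illustrative names, not the exact ESPnet implementation):

```python
import torch

def guided_attn_mask(ilen: int, olen: int, sigma: float = 0.4) -> torch.Tensor:
    # mask is ~0 near the diagonal and approaches 1 far from it
    grid_y, grid_x = torch.meshgrid(torch.arange(olen), torch.arange(ilen),
                                    indexing="ij")
    return 1.0 - torch.exp(-((grid_x / ilen - grid_y / olen) ** 2)
                           / (2 * sigma ** 2))

def guided_attn_loss(att: torch.Tensor, sigma: float = 0.4,
                     lam: float = 1.0) -> torch.Tensor:
    """att: (olen, ilen) attention weights for one utterance."""
    olen, ilen = att.shape
    return lam * (att * guided_attn_mask(ilen, olen, sigma)).mean()
```

Diagonal (monotonic) attention incurs almost no penalty, while off-diagonal attention is pushed toward zero early in training.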

forward
(text: torch.Tensor, text_lengths: torch.Tensor, speech: torch.Tensor, speech_lengths: torch.Tensor, spembs: torch.Tensor = None) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Calculate forward propagation.
 Parameters
text (LongTensor) – Batch of padded character ids (B, Tmax).
text_lengths (LongTensor) – Batch of lengths of each input batch (B,).
speech (Tensor) – Batch of padded target features (B, Lmax, odim).
speech_lengths (LongTensor) – Batch of the lengths of each target (B,).
spembs (Tensor, optional) – Batch of speaker embeddings (B, spk_embed_dim).
 Returns
Loss scalar value. Dict: Statistics to be monitored. Tensor: Weight value.
 Return type
Tensor

inference
(text: torch.Tensor, speech: torch.Tensor = None, spembs: torch.Tensor = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_teacher_forcing: bool = False) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Generate the sequence of features given the sequences of characters.
 Parameters
text (LongTensor) – Input sequence of characters (T,).
speech (Tensor, optional) – Feature sequence to extract style (N, idim).
spembs (Tensor, optional) – Speaker embedding vector (spk_embed_dim,).
threshold (float, optional) – Threshold in inference.
minlenratio (float, optional) – Minimum length ratio in inference.
maxlenratio (float, optional) – Maximum length ratio in inference.
use_teacher_forcing (bool, optional) – Whether to use teacher forcing.
 Returns
Output sequence of features (L, odim). Tensor: Output sequence of stop probabilities (L,). Tensor: Encoder-decoder (source) attention weights (#layers, #heads, L, T).
 Return type
Tensor
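The threshold, minlenratio, and maxlenratio arguments bound autoregressive decoding: generation stops once the stop probability exceeds threshold, but never before minlen or beyond maxlen frames, both derived from the input length. A simplified sketch of just the stopping logic (the real loop also carries decoder state):

```python
def decode_length(stop_probs, in_len: int, threshold: float = 0.5,
                  minlenratio: float = 0.0, maxlenratio: float = 10.0) -> int:
    """Return the frame index at which decoding would stop."""
    minlen = int(in_len * minlenratio)
    maxlen = int(in_len * maxlenratio)
    for t, p in enumerate(stop_probs, start=1):
        if t >= maxlen:
            return maxlen          # hard cap on output length
        if t >= minlen and p > threshold:
            return t               # stop token fired after the minimum
    return len(stop_probs)

decode_length([0.1, 0.2, 0.9, 0.9], in_len=1)  # stops at frame 3
```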
espnet2.tts.tacotron2¶
Tacotron 2 related modules for ESPnet2.

class espnet2.tts.tacotron2.Tacotron2
(idim: int, odim: int, embed_dim: int = 512, elayers: int = 1, eunits: int = 512, econv_layers: int = 3, econv_chans: int = 512, econv_filts: int = 5, atype: str = 'location', adim: int = 512, aconv_chans: int = 32, aconv_filts: int = 15, cumulate_att_w: bool = True, dlayers: int = 2, dunits: int = 1024, prenet_layers: int = 2, prenet_units: int = 256, postnet_layers: int = 5, postnet_chans: int = 512, postnet_filts: int = 5, output_activation: str = None, use_batch_norm: bool = True, use_concate: bool = True, use_residual: bool = False, reduction_factor: int = 1, spk_embed_dim: int = None, spk_embed_integration_type: str = 'concat', use_gst: bool = False, gst_tokens: int = 10, gst_heads: int = 4, gst_conv_layers: int = 6, gst_conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), gst_conv_kernel_size: int = 3, gst_conv_stride: int = 2, gst_gru_layers: int = 1, gst_gru_units: int = 128, dropout_rate: float = 0.5, zoneout_rate: float = 0.1, use_masking: bool = True, use_weighted_masking: bool = False, bce_pos_weight: float = 5.0, loss_type: str = 'L1+L2', use_guided_attn_loss: bool = True, guided_attn_loss_sigma: float = 0.4, guided_attn_loss_lambda: float = 1.0)[source]¶ Bases:
espnet2.tts.abs_tts.AbsTTS
Tacotron2 module for end-to-end text-to-speech.
This is a module of the spectrogram prediction network in Tacotron2 described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, which converts a sequence of characters into a sequence of Mel-filterbanks.
 Parameters
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
spk_embed_dim (int, optional) – Dimension of the speaker embedding.
embed_dim (int, optional) – Dimension of character embedding.
elayers (int, optional) – The number of encoder blstm layers.
eunits (int, optional) – The number of encoder blstm units.
econv_layers (int, optional) – The number of encoder conv layers.
econv_filts (int, optional) – Filter size of the encoder conv layers.
econv_chans (int, optional) – The number of encoder conv filter channels.
dlayers (int, optional) – The number of decoder lstm layers.
dunits (int, optional) – The number of decoder lstm units.
prenet_layers (int, optional) – The number of prenet layers.
prenet_units (int, optional) – The number of prenet units.
postnet_layers (int, optional) – The number of postnet layers.
postnet_filts (int, optional) – Filter size of the postnet layers.
postnet_chans (int, optional) – The number of postnet filter channels.
output_activation (str, optional) – The name of activation function for outputs.
adim (int, optional) – The number of dimensions of the MLP in attention.
aconv_chans (int, optional) – The number of attention conv filter channels.
aconv_filts (int, optional) – Filter size of the attention conv layers.
cumulate_att_w (bool, optional) – Whether to cumulate previous attention weight.
use_batch_norm (bool, optional) – Whether to use batch normalization.
use_concate (bool, optional) – Whether to concatenate encoder embedding with decoder lstm outputs.
reduction_factor (int, optional) – Reduction factor.
spk_embed_dim – Number of speaker embedding dimensions.
spk_embed_integration_type (str, optional) – How to integrate speaker embedding.
use_gst (bool, optional) – Whether to use global style token.
gst_tokens (int, optional) – The number of GST embeddings.
gst_heads (int, optional) – The number of heads in GST multi-head attention.
gst_conv_layers (int, optional) – The number of conv layers in GST.
gst_conv_chans_list (Sequence[int], optional) – List of the number of channels of conv layers in GST.
gst_conv_kernel_size (int, optional) – Kernel size of conv layers in GST.
gst_conv_stride (int, optional) – Stride size of conv layers in GST.
gst_gru_layers (int, optional) – The number of GRU layers in GST.
gst_gru_units (int, optional) – The number of GRU units in GST.
dropout_rate (float, optional) – Dropout rate.
zoneout_rate (float, optional) – Zoneout rate.
use_masking (bool, optional) – Whether to mask padded part in loss calculation.
use_weighted_masking (bool, optional) – Whether to apply weighted masking in loss calculation.
bce_pos_weight (float, optional) – Weight of positive sample of stop token (only for use_masking=True).
loss_type (str, optional) – How to calculate loss.
use_guided_attn_loss (bool, optional) – Whether to use guided attention loss.
guided_attn_loss_sigma (float, optional) – Sigma in guided attention loss.
guided_attn_loss_lambda (float, optional) – Lambda in guided attention loss.
Initialize Tacotron2 module.

forward
(text: torch.Tensor, text_lengths: torch.Tensor, speech: torch.Tensor, speech_lengths: torch.Tensor, spembs: torch.Tensor = None) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Calculate forward propagation.
 Parameters
text (LongTensor) – Batch of padded character ids (B, Tmax).
text_lengths (LongTensor) – Batch of lengths of each input batch (B,).
speech (Tensor) – Batch of padded target features (B, Lmax, odim).
speech_lengths (LongTensor) – Batch of the lengths of each target (B,).
spembs (Tensor, optional) – Batch of speaker embeddings (B, spk_embed_dim).
 Returns
Loss scalar value. Dict: Statistics to be monitored. Tensor: Weight value.
 Return type
Tensor

inference
(text: torch.Tensor, speech: torch.Tensor = None, spembs: torch.Tensor = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_att_constraint: bool = False, backward_window: int = 1, forward_window: int = 3, use_teacher_forcing: bool = False) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Generate the sequence of features given the sequences of characters.
 Parameters
text (LongTensor) – Input sequence of characters (T,).
speech (Tensor, optional) – Feature sequence to extract style (N, idim).
spembs (Tensor, optional) – Speaker embedding vector (spk_embed_dim,).
threshold (float, optional) – Threshold in inference.
minlenratio (float, optional) – Minimum length ratio in inference.
maxlenratio (float, optional) – Maximum length ratio in inference.
use_att_constraint (bool, optional) – Whether to apply attention constraint.
backward_window (int, optional) – Backward window in attention constraint.
forward_window (int, optional) – Forward window in attention constraint.
use_teacher_forcing (bool, optional) – Whether to use teacher forcing.
 Returns
Output sequence of features (L, odim). Tensor: Output sequence of stop probabilities (L,). Tensor: Attention weights (L, T).
 Return type
Tensor
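The use_att_constraint option restricts attention at each decoding step to a window around the previously attended encoder index, defined by backward_window and forward_window. A hedged sketch of that windowing (illustrative, not the exact ESPnet code):

```python
import torch

def constrain_attn(e: torch.Tensor, last_idx: int,
                   backward_window: int = 1, forward_window: int = 3) -> torch.Tensor:
    """e: (T,) unnormalized attention energies over encoder steps."""
    mask = torch.full_like(e, float("-inf"))
    lo = max(0, last_idx - backward_window)
    hi = min(len(e), last_idx + forward_window + 1)
    mask[lo:hi] = 0.0  # keep only the window around the last attended index
    return torch.softmax(e + mask, dim=0)

constrain_attn(torch.zeros(10), last_idx=4)  # mass only on indices 3..7
```

Forcing the window to move forward in small steps encourages monotonic alignments during long-utterance inference.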
espnet2.tts.espnet_model¶

class espnet2.tts.espnet_model.ESPnetTTSModel
(feats_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], pitch_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], energy_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], pitch_normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], energy_normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], tts: espnet2.tts.abs_tts.AbsTTS)[source]¶ Bases:
espnet2.train.abs_espnet_model.AbsESPnetModel

collect_feats
(text: torch.Tensor, text_lengths: torch.Tensor, speech: torch.Tensor, speech_lengths: torch.Tensor, durations: torch.Tensor = None, durations_lengths: torch.Tensor = None, pitch: torch.Tensor = None, pitch_lengths: torch.Tensor = None, energy: torch.Tensor = None, energy_lengths: torch.Tensor = None, spembs: torch.Tensor = None) → Dict[str, torch.Tensor][source]¶

forward
(text: torch.Tensor, text_lengths: torch.Tensor, speech: torch.Tensor, speech_lengths: torch.Tensor, durations: torch.Tensor = None, durations_lengths: torch.Tensor = None, pitch: torch.Tensor = None, pitch_lengths: torch.Tensor = None, energy: torch.Tensor = None, energy_lengths: torch.Tensor = None, spembs: torch.Tensor = None, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should afterwards call the Module instance instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
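A generic PyTorch illustration of the note above (a toy module, not an ESPnet class): only calling the module instance runs registered hooks; calling forward directly skips them.

```python
import torch

class Toy(torch.nn.Module):
    def forward(self, x):
        return x + 1

m = Toy()
calls = []
m.register_forward_hook(lambda mod, inp, out: calls.append(out))

m(torch.tensor(1))          # goes through __call__, so the hook fires
m.forward(torch.tensor(1))  # bypasses __call__, hook silently ignored
len(calls)                  # 1
```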

espnet2.tts.abs_tts¶

class espnet2.tts.abs_tts.AbsTTS
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward
(text: torch.Tensor, text_lengths: torch.Tensor, speech: torch.Tensor, speech_lengths: torch.Tensor, spembs: torch.Tensor = None, spcs: torch.Tensor = None, spcs_lengths: torch.Tensor = None) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should afterwards call the Module instance instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.tts.variance_predictor¶
Variance predictor related modules.

class espnet2.tts.variance_predictor.VariancePredictor
(idim: int, n_layers: int = 2, n_chans: int = 384, kernel_size: int = 3, bias: bool = True, dropout_rate: float = 0.5)[source]¶ Bases:
torch.nn.modules.module.Module
Variance predictor module.
This is a module of the variance predictor described in FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.
Initialize variance predictor module.
 Parameters
idim (int) – Input dimension.
n_layers (int, optional) – Number of convolutional layers.
n_chans (int, optional) – Number of channels of convolutional layers.
kernel_size (int, optional) – Kernel size of convolutional layers.
dropout_rate (float, optional) – Dropout rate.

forward
(xs: torch.Tensor, x_masks: torch.Tensor = None) → torch.Tensor[source]¶ Calculate forward propagation.
 Parameters
xs (Tensor) – Batch of input sequences (B, Tmax, idim).
x_masks (ByteTensor, optional) – Batch of masks indicating padded part (B, Tmax).
 Returns
Batch of predicted sequences (B, Tmax, 1).
 Return type
Tensor
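The architecture described above, stacked Conv1d blocks followed by a linear projection to one value per token, can be sketched as follows. This is a simplified illustration mirroring the constructor's parameter names, not the ESPnet class (for instance, the layer normalization between blocks is omitted here):

```python
import torch

class VariancePredictorSketch(torch.nn.Module):
    def __init__(self, idim: int, n_layers: int = 2, n_chans: int = 384,
                 kernel_size: int = 3, dropout_rate: float = 0.5):
        super().__init__()
        self.convs = torch.nn.ModuleList()
        for i in range(n_layers):
            in_chans = idim if i == 0 else n_chans
            self.convs.append(torch.nn.Sequential(
                torch.nn.Conv1d(in_chans, n_chans, kernel_size,
                                padding=(kernel_size - 1) // 2),
                torch.nn.ReLU(),
                torch.nn.Dropout(dropout_rate),
            ))
        self.linear = torch.nn.Linear(n_chans, 1)

    def forward(self, xs: torch.Tensor) -> torch.Tensor:
        # xs: (B, Tmax, idim) -> (B, Tmax, 1)
        xs = xs.transpose(1, 2)  # Conv1d expects (B, C, T)
        for conv in self.convs:
            xs = conv(xs)
        return self.linear(xs.transpose(1, 2))
```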
espnet2.tts.__init__¶
espnet2.tts.fastspeech¶
FastSpeech related modules for ESPnet2.

class espnet2.tts.fastspeech.FastSpeech
(idim: int, odim: int, adim: int = 384, aheads: int = 4, elayers: int = 6, eunits: int = 1536, dlayers: int = 6, dunits: int = 1536, postnet_layers: int = 5, postnet_chans: int = 512, postnet_filts: int = 5, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 1, use_scaled_pos_enc: bool = True, use_batch_norm: bool = True, encoder_normalize_before: bool = True, decoder_normalize_before: bool = True, encoder_concat_after: bool = False, decoder_concat_after: bool = False, duration_predictor_layers: int = 2, duration_predictor_chans: int = 384, duration_predictor_kernel_size: int = 3, reduction_factor: int = 1, encoder_type: str = 'transformer', decoder_type: str = 'transformer', conformer_pos_enc_layer_type: str = 'rel_pos', conformer_self_attn_layer_type: str = 'rel_selfattn', conformer_activation_type: str = 'swish', use_macaron_style_in_conformer: bool = True, use_cnn_in_conformer: bool = True, conformer_enc_kernel_size: int = 7, conformer_dec_kernel_size: int = 31, spk_embed_dim: int = None, spk_embed_integration_type: str = 'add', use_gst: bool = False, gst_tokens: int = 10, gst_heads: int = 4, gst_conv_layers: int = 6, gst_conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), gst_conv_kernel_size: int = 3, gst_conv_stride: int = 2, gst_gru_layers: int = 1, gst_gru_units: int = 128, transformer_enc_dropout_rate: float = 0.1, transformer_enc_positional_dropout_rate: float = 0.1, transformer_enc_attn_dropout_rate: float = 0.1, transformer_dec_dropout_rate: float = 0.1, transformer_dec_positional_dropout_rate: float = 0.1, transformer_dec_attn_dropout_rate: float = 0.1, duration_predictor_dropout_rate: float = 0.1, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_masking: bool = False, use_weighted_masking: bool = False)[source]¶ Bases:
espnet2.tts.abs_tts.AbsTTS
FastSpeech module for end-to-end text-to-speech.
This is a module of FastSpeech, a feed-forward Transformer with a duration predictor described in FastSpeech: Fast, Robust and Controllable Text to Speech, which does not require any autoregressive processing during inference, resulting in fast decoding compared with an autoregressive Transformer.
 Parameters
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
elayers (int, optional) – Number of encoder layers.
eunits (int, optional) – Number of encoder hidden units.
dlayers (int, optional) – Number of decoder layers.
dunits (int, optional) – Number of decoder hidden units.
use_scaled_pos_enc (bool, optional) – Whether to use trainable scaled positional encoding.
encoder_normalize_before (bool, optional) – Whether to perform layer normalization before encoder block.
decoder_normalize_before (bool, optional) – Whether to perform layer normalization before decoder block.
encoder_concat_after (bool, optional) – Whether to concatenate attention layer’s input and output in encoder.
decoder_concat_after (bool, optional) – Whether to concatenate attention layer’s input and output in decoder.
duration_predictor_layers (int, optional) – Number of duration predictor layers.
duration_predictor_chans (int, optional) – Number of duration predictor channels.
duration_predictor_kernel_size (int, optional) – Kernel size of duration predictor.
spk_embed_dim (int, optional) – Number of speaker embedding dimensions.
spk_embed_integration_type (str, optional) – How to integrate speaker embedding.
use_gst (bool, optional) – Whether to use global style token.
gst_tokens (int, optional) – The number of GST embeddings.
gst_heads (int, optional) – The number of heads in GST multi-head attention.
gst_conv_layers (int, optional) – The number of conv layers in GST.
gst_conv_chans_list (Sequence[int], optional) – List of the number of channels of conv layers in GST.
gst_conv_kernel_size (int, optional) – Kernel size of conv layers in GST.
gst_conv_stride (int, optional) – Stride size of conv layers in GST.
gst_gru_layers (int, optional) – The number of GRU layers in GST.
gst_gru_units (int, optional) – The number of GRU units in GST.
reduction_factor (int, optional) – Reduction factor.
transformer_enc_dropout_rate (float, optional) – Dropout rate in encoder except attention & positional encoding.
transformer_enc_positional_dropout_rate (float, optional) – Dropout rate after encoder positional encoding.
transformer_enc_attn_dropout_rate (float, optional) – Dropout rate in encoder self-attention module.
transformer_dec_dropout_rate (float, optional) – Dropout rate in decoder except attention & positional encoding.
transformer_dec_positional_dropout_rate (float, optional) – Dropout rate after decoder positional encoding.
transformer_dec_attn_dropout_rate (float, optional) – Dropout rate in decoder self-attention module.
init_type (str, optional) – How to initialize transformer parameters.
init_enc_alpha (float, optional) – Initial value of alpha in scaled pos encoding of the encoder.
init_dec_alpha (float, optional) – Initial value of alpha in scaled pos encoding of the decoder.
use_masking (bool, optional) – Whether to apply masking for padded part in loss calculation.
use_weighted_masking (bool, optional) – Whether to apply weighted masking in loss calculation.
Initialize FastSpeech module.

forward
(text: torch.Tensor, text_lengths: torch.Tensor, speech: torch.Tensor, speech_lengths: torch.Tensor, durations: torch.Tensor, durations_lengths: torch.Tensor, spembs: torch.Tensor = None) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Calculate forward propagation.
 Parameters
text (LongTensor) – Batch of padded character ids (B, Tmax).
text_lengths (LongTensor) – Batch of lengths of each input (B,).
speech (Tensor) – Batch of padded target features (B, Lmax, odim).
speech_lengths (LongTensor) – Batch of the lengths of each target (B,).
durations (LongTensor) – Batch of padded durations (B, Tmax + 1).
durations_lengths (LongTensor) – Batch of duration lengths (B, Tmax + 1).
spembs (Tensor, optional) – Batch of speaker embeddings (B, spk_embed_dim).
 Returns
Loss scalar value. Dict: Statistics to be monitored. Tensor: Weight value.
 Return type
Tensor

inference
(text: torch.Tensor, speech: torch.Tensor = None, spembs: torch.Tensor = None, durations: torch.Tensor = None, alpha: float = 1.0, use_teacher_forcing: bool = False) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Generate the sequence of features given the sequences of characters.
 Parameters
text (LongTensor) – Input sequence of characters (T,).
speech (Tensor, optional) – Feature sequence to extract style (N, idim).
spembs (Tensor, optional) – Speaker embedding vector (spk_embed_dim,).
durations (LongTensor, optional) – Groundtruth of duration (T + 1,).
alpha (float, optional) – Alpha to control the speed.
use_teacher_forcing (bool, optional) – Whether to use teacher forcing. If true, groundtruth of duration, pitch and energy will be used.
 Returns
Output sequence of features (L, odim). None: Dummy for compatibility. None: Dummy for compatibility.
 Return type
Tensor
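The alpha argument to inference controls speed by rescaling predicted durations before the hidden states are expanded to frame resolution. A pure-Python sketch of that length regulation (an illustration, not ESPnet's LengthRegulator itself):

```python
def length_regulate(hs, durations, alpha=1.0):
    """Expand per-token hidden states to frame resolution.

    Each token's state is repeated for its (scaled) number of frames:
    alpha > 1.0 lengthens the output (slower speech),
    alpha < 1.0 shortens it (faster speech).
    """
    out = []
    for h, d in zip(hs, durations):
        # round the rescaled duration to an integer frame count
        out.extend([h] * int(round(d * alpha)))
    return out
```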
espnet2.tts.feats_extract.energy¶
Energy extractor.

class
espnet2.tts.feats_extract.energy.
Energy
(fs: Union[int, str] = 22050, n_fft: int = 1024, win_length: int = None, hop_length: int = 256, window: str = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True, use_token_averaged_energy: bool = True, reduction_factor: int = None)[source]¶ Bases:
espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract
Energy extractor.

forward
(input: torch.Tensor, input_lengths: torch.Tensor = None, feats_lengths: torch.Tensor = None, durations: torch.Tensor = None, durations_lengths: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
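When use_token_averaged_energy is enabled, frame-level energies are averaged within each token's duration span so the predictor operates at token resolution. A hedged numpy sketch of that averaging step (the helper name is ours, not ESPnet's):

```python
import numpy as np

def token_average(frame_energy, durations):
    """Average frame-level energy over each token's duration span.

    frame_energy: (T,) array of per-frame energies.
    durations: per-token frame counts summing to T.
    Returns one averaged value per token (0.0 for zero-length tokens).
    """
    out, start = [], 0
    for d in durations:
        seg = frame_energy[start:start + d]
        out.append(seg.mean() if d > 0 else 0.0)
        start += d
    return np.array(out)
```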

espnet2.tts.feats_extract.log_spectrogram¶

class
espnet2.tts.feats_extract.log_spectrogram.
LogSpectrogram
(n_fft: int = 1024, win_length: int = None, hop_length: int = 256, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True)[source]¶ Bases:
espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract
Conventional frontend structure for ASR
Stft -> log-amplitude-spec

forward
(input: torch.Tensor, input_lengths: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
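The Stft -> log-amplitude-spec pipeline can be sketched in numpy as follows, mirroring the defaults above (n_fft=1024, hop_length=256, Hann window, center=True). The real module builds on torch STFT; this is only an illustration of the framing and log-magnitude steps:

```python
import numpy as np

def log_spectrogram(x, n_fft=1024, hop_length=256, win_length=None):
    """STFT followed by log amplitude; a small floor avoids log(0)."""
    win_length = win_length or n_fft
    window = np.hanning(win_length)
    # center=True: reflect-pad so frames are centered on their samples
    pad = n_fft // 2
    x = np.pad(x, pad, mode="reflect")
    n_frames = 1 + (len(x) - n_fft) // hop_length
    frames = np.stack([
        x[i * hop_length:i * hop_length + n_fft] for i in range(n_frames)
    ])
    # zero-pad the window up to n_fft if win_length < n_fft
    w = np.pad(window, (0, n_fft - win_length))
    spec = np.abs(np.fft.rfft(frames * w, n=n_fft, axis=1))
    return np.log(np.maximum(spec, 1e-10))
```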

espnet2.tts.feats_extract.log_mel_fbank¶

class
espnet2.tts.feats_extract.log_mel_fbank.
LogMelFbank
(fs: Union[int, str] = 16000, n_fft: int = 1024, win_length: int = None, hop_length: int = 256, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True, n_mels: int = 80, fmin: Optional[int] = 80, fmax: Optional[int] = 7600, htk: bool = False)[source]¶ Bases:
espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract
Conventional frontend structure for ASR
Stft -> amplitude-spec -> LogMelFbank

forward
(input: torch.Tensor, input_lengths: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
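The htk flag selects between two mel-scale conventions for placing the filterbank. A sketch of the Hz-to-mel mapping under both (the formulas are the standard HTK and Slaney conventions; the helper name is ours):

```python
import numpy as np

def hz_to_mel(f, htk=False):
    """Hz -> mel.  htk=True uses the HTK formula; htk=False (matching
    the default above) follows the Slaney convention, which is linear
    below 1 kHz and logarithmic above it."""
    f = np.asarray(f, dtype=float)
    if htk:
        return 2595.0 * np.log10(1.0 + f / 700.0)
    linear = f / (200.0 / 3.0)
    # floor keeps log() well-defined; these values are masked out anyway
    log_part = 15.0 + np.log(np.maximum(f, 1e-3) / 1000.0) / (np.log(6.4) / 27.0)
    return np.where(f < 1000.0, linear, log_part)
```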

espnet2.tts.feats_extract.dio¶
F0 extractor using DIO + Stonemask algorithm.

class
espnet2.tts.feats_extract.dio.
Dio
(fs: Union[int, str] = 22050, n_fft: int = 1024, hop_length: int = 256, f0min: int = 80, f0max: int = 400, use_token_averaged_f0: bool = True, use_continuous_f0: bool = True, use_log_f0: bool = True, reduction_factor: int = None)[source]¶ Bases:
espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract
F0 estimation with dio + stonemask algorithm.
This is an F0 extractor based on the dio + stonemask algorithm introduced in WORLD: a vocoder-based high-quality speech synthesis system for real-time applications.
Note
This module is based on NumPy implementation. Therefore, the computational graph is not connected.

forward
(input: torch.Tensor, input_lengths: torch.Tensor = None, feats_lengths: torch.Tensor = None, durations: torch.Tensor = None, durations_lengths: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
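The use_continuous_f0 and use_log_f0 flags describe post-processing applied to the raw F0 track. A numpy sketch of those two steps (the DIO + stonemask estimation itself comes from WORLD and is not reproduced here):

```python
import numpy as np

def postprocess_f0(f0, use_continuous_f0=True, use_log_f0=True):
    """Interpolate over unvoiced (zero) frames, then take the log."""
    f0 = np.asarray(f0, dtype=float)
    if use_continuous_f0:
        voiced = f0 > 0
        if voiced.any():
            idx = np.arange(len(f0))
            # linear interpolation across unvoiced gaps
            f0 = np.interp(idx, idx[voiced], f0[voiced])
    if use_log_f0:
        # floor avoids log(0) on any remaining unvoiced frames
        f0 = np.log(np.maximum(f0, 1e-10))
    return f0
```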

espnet2.tts.feats_extract.__init__¶
espnet2.tts.feats_extract.abs_feats_extract¶

class
espnet2.tts.feats_extract.abs_feats_extract.
AbsFeatsExtract
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
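The contract every feature extractor implements can be sketched without torch: forward maps (input, input_lengths) to (feats, feats_lengths). The classes below are dependency-free stand-ins for illustration, not ESPnet code:

```python
from abc import ABC, abstractmethod
from typing import List, Tuple

class FeatsExtractSketch(ABC):
    """Sketch of the AbsFeatsExtract contract."""

    @abstractmethod
    def forward(self, input: List[float], input_lengths: int) -> Tuple[list, int]:
        raise NotImplementedError

class FrameEnergySketch(FeatsExtractSketch):
    """Toy extractor: one 'feature' per 4-sample frame (hypothetical hop)."""

    def forward(self, input, input_lengths):
        hop = 4
        n = input_lengths // hop  # feats_lengths: frames per utterance
        feats = [sum(abs(v) for v in input[i * hop:(i + 1) * hop])
                 for i in range(n)]
        return feats, n
```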

espnet2.tts.gst.style_encoder¶
Style encoder of GST-Tacotron.

class
espnet2.tts.gst.style_encoder.
MultiHeadedAttention
(q_dim, k_dim, v_dim, n_head, n_feat, dropout_rate=0.0)[source]¶ Bases:
espnet.nets.pytorch_backend.transformer.attention.MultiHeadedAttention
Multi-head attention module with different input dimensions.
Initialize multi-head attention module.

class
espnet2.tts.gst.style_encoder.
ReferenceEncoder
(idim=80, conv_layers: int = 6, conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), conv_kernel_size: int = 3, conv_stride: int = 2, gru_layers: int = 1, gru_units: int = 128)[source]¶ Bases:
torch.nn.modules.module.Module
Reference encoder module.
This module is the reference encoder introduced in Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis.
 Parameters
idim (int, optional) – Dimension of the input melspectrogram.
conv_layers (int, optional) – The number of conv layers in the reference encoder.
conv_chans_list (Sequence[int], optional) – List of the number of channels of conv layers in the reference encoder.
conv_kernel_size (int, optional) – Kernel size of conv layers in the reference encoder.
conv_stride (int, optional) – Stride size of conv layers in the reference encoder.
gru_layers (int, optional) – The number of GRU layers in the reference encoder.
gru_units (int, optional) – The number of GRU units in the reference encoder.
Initialize reference encoder module.
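With the defaults above, each of the six stride-2 convolutions roughly halves the time axis, so the GRU sees a much shorter sequence than the input mel-spectrogram. A sketch of that length arithmetic, assuming padding of kernel_size // 2 (the exact padding is an assumption on our part):

```python
def ref_encoder_out_length(in_length, conv_layers=6, kernel_size=3,
                           stride=2, padding=1):
    """Apply the standard conv output-length formula once per layer."""
    length = in_length
    for _ in range(conv_layers):
        length = (length + 2 * padding - kernel_size) // stride + 1
    return length
```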

class
espnet2.tts.gst.style_encoder.
StyleEncoder
(idim: int = 80, gst_tokens: int = 10, gst_token_dim: int = 256, gst_heads: int = 4, conv_layers: int = 6, conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), conv_kernel_size: int = 3, conv_stride: int = 2, gru_layers: int = 1, gru_units: int = 128)[source]¶ Bases:
torch.nn.modules.module.Module
Style encoder.
This module is the style encoder introduced in Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis.
 Parameters
idim (int, optional) – Dimension of the input melspectrogram.
gst_tokens (int, optional) – The number of GST embeddings.
gst_token_dim (int, optional) – Dimension of each GST embedding.
gst_heads (int, optional) – The number of heads in GST multi-head attention.
conv_layers (int, optional) – The number of conv layers in the reference encoder.
conv_chans_list (Sequence[int], optional) – List of the number of channels of conv layers in the reference encoder.
conv_kernel_size (int, optional) – Kernel size of conv layers in the reference encoder.
conv_stride (int, optional) – Stride size of conv layers in the reference encoder.
gru_layers (int, optional) – The number of GRU layers in the reference encoder.
gru_units (int, optional) – The number of GRU units in the reference encoder.
Initialize global style encoder module.

class
espnet2.tts.gst.style_encoder.
StyleTokenLayer
(ref_embed_dim: int = 128, gst_tokens: int = 10, gst_token_dim: int = 256, gst_heads: int = 4, dropout_rate: float = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
Style token layer module.
This module is the style token layer introduced in Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis.
 Parameters
ref_embed_dim (int, optional) – Dimension of the input reference embedding.
gst_tokens (int, optional) – The number of GST embeddings.
gst_token_dim (int, optional) – Dimension of each GST embedding.
gst_heads (int, optional) – The number of heads in GST multi-head attention.
dropout_rate (float, optional) – Dropout rate in multi-head attention.
Initialize style token layer module.
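The layer's core computation can be sketched single-headed: the reference embedding attends over the bank of style tokens, and the style embedding is the attention-weighted sum of the tokens. The real layer uses multi-head attention with learned q/k/v projections, which are omitted here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def style_token_attention(ref_embed, tokens):
    """Single-head sketch: ref_embed is (D,), tokens is (n_tokens, D).

    Scores are scaled dot products between the reference embedding and
    each token; the output is the weighted sum of the token bank.
    """
    scores = tokens @ ref_embed / np.sqrt(ref_embed.shape[0])
    weights = softmax(scores)
    return weights @ tokens  # (D,) style embedding
```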