espnet2.s2st.synthesizer.translatotron.Translatotron
class espnet2.s2st.synthesizer.translatotron.Translatotron(idim: int, odim: int, embed_dim: int = 512, atype: str = 'multihead', adim: int = 512, aheads: int = 4, aconv_chans: int = 32, aconv_filts: int = 15, cumulate_att_w: bool = True, dlayers: int = 4, dunits: int = 1024, prenet_layers: int = 2, prenet_units: int = 32, postnet_layers: int = 5, postnet_chans: int = 512, postnet_filts: int = 5, output_activation: str | None = None, use_batch_norm: bool = True, use_concate: bool = True, use_residual: bool = False, reduction_factor: int = 2, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, spk_embed_integration_type: str = 'concat', dropout_rate: float = 0.5, zoneout_rate: float = 0.1)
Bases: AbsSynthesizer
Translatotron Synthesizer related modules for speech-to-speech translation.
This is a module of the spectrogram prediction network in Translatotron, described in Direct speech-to-speech translation with a sequence-to-sequence model, which converts a sequence of hidden states into a sequence of Mel-filterbanks.
Initialize Translatotron module.
- Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- adim (int) – Dimension of the MLP in attention.
- atype (str) – Type of attention mechanism.
- aconv_chans (int) – Number of attention conv filter channels.
- aconv_filts (int) – Attention conv filter size.
- embed_dim (int) – Dimension of the token embedding.
- dlayers (int) – Number of decoder lstm layers.
- dunits (int) – Number of decoder lstm units.
- prenet_layers (int) – Number of prenet layers.
- prenet_units (int) – Number of prenet units.
- postnet_layers (int) – Number of postnet layers.
- postnet_filts (int) – Postnet filter size.
- postnet_chans (int) – Number of postnet filter channels.
- output_activation (str) – Name of activation function for outputs.
- cumulate_att_w (bool) – Whether to cumulate previous attention weight.
- use_batch_norm (bool) – Whether to use batch normalization.
- use_concate (bool) – Whether to concatenate encoder outputs with decoder LSTM outputs.
- reduction_factor (int) – Reduction factor.
- spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.
- langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
- spk_embed_integration_type (str) – How to integrate speaker embedding.
- dropout_rate (float) – Dropout rate.
- zoneout_rate (float) – Zoneout rate.
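A minimal construction sketch, assuming espnet2 is installed. The dimensions below are illustrative placeholders, not canonical values:

```python
from espnet2.s2st.synthesizer.translatotron import Translatotron

# Illustrative dimensions: 512-dim encoder hidden states in,
# 80-dim Mel-filterbank frames out.
synthesizer = Translatotron(
    idim=512,
    odim=80,
    atype="multihead",
    adim=512,
    aheads=4,
    reduction_factor=2,
)
```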
forward(enc_outputs: Tensor, enc_outputs_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None) → Tuple[Tensor, Tensor]
Calculate forward propagation.
- Parameters:
- enc_outputs (Tensor) – Batch of padded encoder output sequences (B, T, idim).
- enc_outputs_lengths (LongTensor) – Batch of lengths of each input batch (B,).
- feats (Tensor) – Batch of padded target features (B, T_feats, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- spembs (Optional *[*Tensor ]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional *[*Tensor ]) – Batch of speaker IDs (B, 1).
- lids (Optional *[*Tensor ]) – Batch of language IDs (B, 1).
- Returns: Tuple containing the following items:
  - after_outs (Tensor): Output feature sequence after postnet (B, T_feats, odim).
  - before_outs (Tensor): Output feature sequence before postnet (B, T_feats, odim).
  - logits (Tensor): Stop token prediction logits.
  - att_ws (Tensor): Attention weights.
  - ys (Tensor): Padded target features.
  - stop_labels (Tensor): Stop token labels.
  - olens (LongTensor): Lengths of each output sequence.
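A sketch of a training-style forward call with random dummy tensors. The shapes follow the parameter descriptions above and the synthesizer constructed earlier (idim=512, odim=80); the returned tuple is left unpacked since its exact arity follows the returns list above:

```python
import torch

# Dummy batch: B=2, encoder frames T=100, target frames T_feats=300.
enc_outputs = torch.randn(2, 100, 512)          # (B, T, idim)
enc_outputs_lengths = torch.tensor([100, 80])   # (B,)
feats = torch.randn(2, 300, 80)                 # (B, T_feats, odim)
feats_lengths = torch.tensor([300, 240])        # (B,)

outputs = synthesizer(enc_outputs, enc_outputs_lengths, feats, feats_lengths)
# outputs contains after_outs, before_outs, logits, att_ws, ...
```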
inference(enc_outputs: Tensor, feats: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_att_constraint: bool = False, backward_window: int = 1, forward_window: int = 3, use_teacher_forcing: bool = False) → Dict[str, Tensor]
Generate the sequence of features given the sequences of characters.
- Parameters:
- enc_outputs (Tensor) – Input sequence of encoder hidden states (N, idim).
- feats (Optional *[*Tensor ]) – Feature sequence to extract style (N, odim).
- spembs (Optional *[*Tensor ]) – Speaker embedding (spk_embed_dim,).
- sids (Optional *[*Tensor ]) – Speaker ID (1,).
- lids (Optional *[*Tensor ]) – Language ID (1,).
- threshold (float) – Threshold in inference.
- minlenratio (float) – Minimum length ratio in inference.
- maxlenratio (float) – Maximum length ratio in inference.
- use_att_constraint (bool) – Whether to apply attention constraint.
- backward_window (int) – Backward window in attention constraint.
- forward_window (int) – Forward window in attention constraint.
- use_teacher_forcing (bool) – Whether to use teacher forcing.
- Returns: Output dict including the following items:
  - feat_gen (Tensor): Output sequence of features (T_feats, odim).
  - prob (Tensor): Output sequence of stop probabilities (T_feats,).
  - att_w (Tensor): Attention weights (T_feats, T).
- Return type: Dict[str, Tensor]
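A sketch of single-utterance inference with a random dummy input; the dict keys are those listed in the returns above:

```python
import torch

# Single (unbatched) sequence of encoder outputs: T=100 frames.
enc_out = torch.randn(100, 512)  # (T, idim)

synthesizer.eval()
with torch.no_grad():
    out = synthesizer.inference(enc_out, threshold=0.5, maxlenratio=10.0)

feat_gen = out["feat_gen"]  # generated features (T_feats, odim)
prob = out["prob"]          # stop probabilities (T_feats,)
att_w = out["att_w"]        # attention weights (T_feats, T)
```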