espnet2.s2st package¶
espnet2.s2st.espnet_model¶
class espnet2.s2st.espnet_model.ESPnetS2STModel(s2st_type: str, frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], tgt_feats_extract: Optional[espnet2.s2st.tgt_feats_extract.abs_tgt_feats_extract.AbsTgtFeatsExtract], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], src_normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], tgt_normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], asr_decoder: Optional[espnet2.asr.decoder.abs_decoder.AbsDecoder], st_decoder: Optional[espnet2.asr.decoder.abs_decoder.AbsDecoder], aux_attention: Optional[espnet2.s2st.aux_attention.abs_aux_attention.AbsS2STAuxAttention], unit_encoder: Optional[espnet2.asr.encoder.abs_encoder.AbsEncoder], synthesizer: Optional[espnet2.s2st.synthesizer.abs_synthesizer.AbsSynthesizer], asr_ctc: Optional[espnet2.asr.ctc.CTC], st_ctc: Optional[espnet2.asr.ctc.CTC], losses: Dict[str, espnet2.s2st.losses.abs_loss.AbsS2STLoss], tgt_vocab_size: Optional[int], tgt_token_list: Union[Tuple[str, ...], List[str], None], src_vocab_size: Optional[int], src_token_list: Union[Tuple[str, ...], List[str], None], unit_vocab_size: Optional[int], unit_token_list: Union[Tuple[str, ...], List[str], None], ignore_id: int = -1, report_cer: bool = True, report_wer: bool = True, report_bleu: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', extract_feats_in_collect_stats: bool = True)[source]¶
Bases: espnet2.train.abs_espnet_model.AbsESPnetModel
ESPnet speech-to-speech translation model.
collect_feats(src_speech: torch.Tensor, src_speech_lengths: torch.Tensor, tgt_speech: torch.Tensor, tgt_speech_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]¶
encode(speech: torch.Tensor, speech_lengths: torch.Tensor, return_all_hs: bool = False, **kwargs) → Tuple[torch.Tensor, torch.Tensor][source]¶
Frontend + Encoder. Note that this method is used by st_inference.py.
- Parameters:
speech – (Batch, Length, …)
speech_lengths – (Batch, )
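As a quick orientation, here is a minimal sketch of calling encode() at inference time; model is assumed to be an already-constructed ESPnetS2STModel (construction via the S2ST task is elided) and all shapes are illustrative:

    import torch

    # model: an ESPnetS2STModel built elsewhere, e.g. by the S2ST task (assumed)
    speech = torch.randn(2, 16000)                 # (Batch, Length) raw waveforms
    speech_lengths = torch.tensor([16000, 12000])  # (Batch,)

    # Frontend feature extraction + encoder forward in one call
    encoder_out, encoder_out_lens = model.encode(speech, speech_lengths)
    # encoder_out: (Batch, T_enc, D_enc); encoder_out_lens: (Batch,)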
forward(src_speech: torch.Tensor, src_speech_lengths: torch.Tensor, tgt_speech: torch.Tensor, tgt_speech_lengths: torch.Tensor, tgt_text: Optional[torch.Tensor] = None, tgt_text_lengths: Optional[torch.Tensor] = None, src_text: Optional[torch.Tensor] = None, src_text_lengths: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
inference(src_speech: torch.Tensor, src_speech_lengths: Optional[torch.Tensor] = None, tgt_speech: Optional[torch.Tensor] = None, tgt_speech_lengths: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_att_constraint: bool = False, backward_window: int = 1, forward_window: int = 3, use_teacher_forcing: bool = False) → Dict[str, torch.Tensor][source]¶
property require_vocoder¶
Return whether or not vocoder is required.
espnet2.s2st.__init__¶
espnet2.s2st.synthesizer.unity_synthesizer¶
UnitY Synthesizer related modules for ESPnet2.
class espnet2.s2st.synthesizer.unity_synthesizer.UnitYSynthesizer(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, layer_drop_rate: float = 0.0, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'concat')[source]¶
Bases: espnet2.s2st.synthesizer.abs_synthesizer.AbsSynthesizer
UnitY Synthesizer related modules for speech-to-speech translation.
This module implements the discrete unit prediction network described in Direct speech-to-speech translation with discrete units, which converts a sequence of hidden states into a sequence of discrete units (derived from SSL models).
Transformer decoder for the discrete unit module.
- Parameters:
vocab_size – output dim
encoder_output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of decoder blocks
dropout_rate – dropout rate
self_attention_dropout_rate – dropout rate for attention
input_layer – input layer type
use_output_layer – whether to use output layer
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concatenate the attention layer's input and output. If True, an additional linear projection is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear is applied, i.e. x -> x + att(x)
spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.
langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use the lid embedding layer.
spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
spk_embed_integration_type (str) – How to integrate speaker embedding.
forward(enc_outputs: torch.Tensor, enc_outputs_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, return_last_hidden: bool = False, return_all_hiddens: bool = False) → Tuple[torch.Tensor, torch.Tensor][source]¶
Calculate forward propagation.
- Parameters:
enc_outputs (Tensor) – Batch of padded encoder hidden states (B, T, idim).
enc_outputs_lengths (LongTensor) – Batch of lengths of each input batch (B,).
feats (Tensor) – Batch of padded target features (B, T_feats, odim).
feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- Returns:
hs (decoder hidden states) and hlens (their lengths).
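A hedged sketch of a training-time call to UnitYSynthesizer.forward(). The constructor values are arbitrary, the tensors are random stand-ins, and feats is given as discrete-unit ids on the assumption that input_layer='embed' expects token ids (the (B, T_feats, odim) shape in the docstring appears to be inherited from the spectrogram synthesizers):

    import torch
    from espnet2.s2st.synthesizer.unity_synthesizer import UnitYSynthesizer

    synth = UnitYSynthesizer(vocab_size=1000, encoder_output_size=512)
    enc_outputs = torch.randn(2, 50, 512)        # (B, T, idim) encoder hidden states
    enc_outputs_lengths = torch.tensor([50, 42])
    feats = torch.randint(0, 1000, (2, 60))      # assumed (B, T_feats) target unit ids
    feats_lengths = torch.tensor([60, 55])

    hs, hlens = synth(enc_outputs, enc_outputs_lengths, feats, feats_lengths)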
espnet2.s2st.synthesizer.__init__¶
espnet2.s2st.synthesizer.abs_synthesizer¶
Text-to-speech abstract class.
class espnet2.s2st.synthesizer.abs_synthesizer.AbsSynthesizer(*args, **kwargs)[source]¶
Bases: torch.nn.modules.module.Module, abc.ABC
TTS abstract class.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
abstract forward(input_states: torch.Tensor, input_states_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶
Calculate outputs and return the loss tensor.
abstract inference(input_states: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]¶
Return predicted output as a dict.
property require_raw_speech¶
Return whether or not raw_speech is required.
property require_vocoder¶
Return whether or not vocoder is required.
espnet2.s2st.synthesizer.discrete_synthesizer¶
Discrete unit Synthesizer related modules for ESPnet2.
class espnet2.s2st.synthesizer.discrete_synthesizer.TransformerDiscreteSynthesizer(odim: int, idim: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, layer_drop_rate: float = 0.0, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'concat')[source]¶
Bases: espnet2.s2st.synthesizer.abs_synthesizer.AbsSynthesizer, espnet.nets.scorer_interface.BatchScorerInterface
Discrete unit Synthesizer related modules for speech-to-speech translation.
This module implements the discrete unit prediction network described in Direct speech-to-speech translation with discrete units, which converts a sequence of hidden states into a sequence of discrete units (derived from SSL models).
Transformer decoder for the discrete unit module.
- Parameters:
odim – output dimension (discrete unit vocabulary size)
idim – dimension of attention (encoder output size)
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of decoder blocks
dropout_rate – dropout rate
self_attention_dropout_rate – dropout rate for attention
input_layer – input layer type
use_output_layer – whether to use output layer
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concatenate the attention layer's input and output. If True, an additional linear projection is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear is applied, i.e. x -> x + att(x)
spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.
langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use the lid embedding layer.
spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
spk_embed_integration_type (str) – How to integrate speaker embedding.
batch_score(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]¶
Score new token batch.
- Parameters:
ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).
states (List[Any]) – Scorer states for prefix tokens.
xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).
- Returns:
Tuple of batchified scores for the next token, with shape (n_batch, n_vocab), and the next state list for ys.
- Return type:
tuple[torch.Tensor, List[Any]]
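A hedged sketch of how beam search queries batch_score() through ESPnet's BatchScorerInterface; synth is an already-built TransformerDiscreteSynthesizer, sos_id is a hypothetical start symbol, and the beam-search bookkeeping is simplified away:

    import torch

    sos_id = 0                                   # hypothetical start-of-sequence id
    n_batch, xlen, n_feat = 4, 50, 512
    ys = torch.full((n_batch, 1), sos_id, dtype=torch.long)  # prefix tokens (n_batch, ylen)
    states = [None] * n_batch                    # fresh scorer states per hypothesis
    xs = torch.randn(n_batch, xlen, n_feat)      # encoder features that generate ys

    logp, states = synth.batch_score(ys, states, xs)
    # logp: (n_batch, n_vocab) scores for the next discrete unit
    next_units = logp.argmax(dim=-1)             # greedy choice per hypothesis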
forward(enc_outputs: torch.Tensor, enc_outputs_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, return_hs: bool = False, return_all_hs: bool = False) → Tuple[torch.Tensor, torch.Tensor][source]¶
Calculate forward propagation.
- Parameters:
enc_outputs (Tensor) – Batch of padded encoder hidden states (B, T, idim).
enc_outputs_lengths (LongTensor) – Batch of lengths of each input batch (B,).
feats (Tensor) – Batch of padded target features (B, T_feats, odim).
feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- Returns:
hs (decoder hidden states) and hlens (their lengths).
forward_one_step(tgt: torch.Tensor, tgt_mask: torch.Tensor, memory: torch.Tensor, cache: List[torch.Tensor] = None, **kwargs) → Tuple[torch.Tensor, List[torch.Tensor]][source]¶
Forward one step.
- Parameters:
tgt – input token ids, int64 (batch, maxlen_out)
tgt_mask – input token mask, (batch, maxlen_out); dtype is torch.uint8 before PyTorch 1.2 and torch.bool in PyTorch 1.2 and later
memory – encoded memory, float32 (batch, maxlen_in, feat)
cache – cached output list of (batch, max_time_out-1, size)
- Returns:
NN output value and cache per self.decoders. y.shape is (batch, maxlen_out, token).
- Return type:
y, cache
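A hedged sketch of greedy incremental decoding with forward_one_step(); subsequent_mask is ESPnet's causal-mask helper, sos_id is a hypothetical start symbol, and the per-step output is assumed to score the last position, as in ESPnet's Transformer decoder:

    import torch
    from espnet.nets.pytorch_backend.transformer.mask import subsequent_mask

    sos_id = 0                            # hypothetical start-of-sequence id
    memory = torch.randn(1, 50, 512)      # encoded memory (batch, maxlen_in, feat)
    ys = torch.tensor([[sos_id]])         # running hypothesis (batch, ylen)
    cache = None

    for _ in range(10):                   # decode ten units greedily
        mask = subsequent_mask(ys.size(1)).unsqueeze(0)   # causal mask (1, ylen, ylen)
        logp, cache = synth.forward_one_step(ys, mask, memory, cache=cache)
        next_unit = logp.argmax(dim=-1).view(1, 1)
        ys = torch.cat([ys, next_unit], dim=1)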
espnet2.s2st.synthesizer.translatotron2¶
Translatotron2 related modules for ESPnet2.
class espnet2.s2st.synthesizer.translatotron2.DurationPredictor(cfg)[source]¶
Bases: torch.nn.modules.module.Module
Non-Attentive Tacotron (NAT) Duration Predictor module.
class espnet2.s2st.synthesizer.translatotron2.GaussianUpsampling[source]¶
Bases: torch.nn.modules.module.Module
Gaussian upsampling.
Non-Attentive Tacotron: this source code is an implementation of the ExpressiveTacotron from BridgetteSong.
forward(encoder_outputs, durations, vars, input_lengths=None)[source]¶
Gaussian upsampling.
- Parameters:
encoder_outputs – encoder outputs [batch_size, hidden_length, dim]
durations – phoneme durations [batch_size, hidden_length]
vars – phoneme attended ranges [batch_size, hidden_length]
input_lengths – [batch_size]
- Returns:
upsampled encoder outputs [batch_size, frame_length, dim]
- Return type:
encoder_upsampling_outputs
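To make the operation concrete, here is a simplified re-implementation sketch of Gaussian upsampling as described for Non-Attentive Tacotron (illustrative only, not the module's exact code): token i contributes to output frame t with a normalized Gaussian weight centered at c_i = cumsum(d)_i - d_i / 2 with variance vars_i, and each output frame is the weighted sum of encoder outputs:

    import torch

    def gaussian_upsample(h, d, var):
        # h: (B, L, D) encoder outputs; d: (B, L) durations in frames; var: (B, L) ranges
        T = int(d.sum(dim=1).max())                       # number of output frames
        c = torch.cumsum(d, dim=1) - 0.5 * d              # Gaussian center per token (B, L)
        t = torch.arange(T, dtype=h.dtype).view(1, T, 1)  # output frame grid
        w = torch.exp(-0.5 * (t - c.unsqueeze(1)) ** 2 / var.unsqueeze(1))
        w = w / w.sum(dim=2, keepdim=True)                # normalize weights over tokens
        return torch.bmm(w, h)                            # (B, T, D) upsampled outputs

    h = torch.randn(2, 5, 8)
    d = torch.tensor([[2.0, 3.0, 1.0, 4.0, 2.0], [1.0, 2.0, 2.0, 1.0, 1.0]])
    out = gaussian_upsample(h, d, torch.ones(2, 5))       # (2, 12, 8)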
class espnet2.s2st.synthesizer.translatotron2.Prenet(idim, units=128, num_layers=2, dropout=0.5)[source]¶
Bases: torch.nn.modules.module.Module
Non-Attentive Tacotron (NAT) Prenet.
forward(x)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
class espnet2.s2st.synthesizer.translatotron2.Translatotron2(idim: int, odim: int, synthesizer_type: str = 'rnn', layers: int = 2, units: int = 1024, prenet_layers: int = 2, prenet_units: int = 128, prenet_dropout_rate: float = 0.5, postnet_layers: int = 5, postnet_chans: int = 512, postnet_dropout_rate: float = 0.5, adim: int = 384, aheads: int = 4, conformer_rel_pos_type: str = 'legacy', conformer_pos_enc_layer_type: str = 'rel_pos', conformer_self_attn_layer_type: str = 'rel_selfattn', conformer_activation_type: str = 'swish', use_macaron_style_in_conformer: bool = True, use_cnn_in_conformer: bool = True, zero_triu: bool = False, conformer_enc_kernel_size: int = 7, conformer_dec_kernel_size: int = 31, duration_predictor_layers: int = 2, duration_predictor_type: str = 'rnn', duration_predictor_units: int = 128, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'add', init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_masking: bool = False, use_weighted_masking: bool = False)[source]¶
Bases: espnet2.s2st.synthesizer.abs_synthesizer.AbsSynthesizer
Translatotron2 module.
This is a module of the synthesizer in Translatotron 2 described in Translatotron 2: High-quality direct speech-to-speech translation with voice preservation.
espnet2.s2st.synthesizer.translatotron¶
Translatotron Synthesizer related modules for ESPnet2.
class espnet2.s2st.synthesizer.translatotron.Translatotron(idim: int, odim: int, embed_dim: int = 512, atype: str = 'multihead', adim: int = 512, aheads: int = 4, aconv_chans: int = 32, aconv_filts: int = 15, cumulate_att_w: bool = True, dlayers: int = 4, dunits: int = 1024, prenet_layers: int = 2, prenet_units: int = 32, postnet_layers: int = 5, postnet_chans: int = 512, postnet_filts: int = 5, output_activation: Optional[str] = None, use_batch_norm: bool = True, use_concate: bool = True, use_residual: bool = False, reduction_factor: int = 2, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'concat', dropout_rate: float = 0.5, zoneout_rate: float = 0.1)[source]¶
Bases: espnet2.s2st.synthesizer.abs_synthesizer.AbsSynthesizer
Translatotron Synthesizer related modules for speech-to-speech translation.
This is a module of the spectrogram prediction network in Translatotron described in Direct speech-to-speech translation with a sequence-to-sequence model, which converts a sequence of hidden states into a sequence of Mel filterbank features.
Initialize the Translatotron module.
- Parameters:
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
adim (int) – Number of dimension of mlp in attention.
atype (str) – type of attention
aconv_chans (int) – Number of attention conv filter channels.
aconv_filts (int) – Number of attention conv filter size.
embed_dim (int) – Dimension of the token embedding.
dlayers (int) – Number of decoder lstm layers.
dunits (int) – Number of decoder lstm units.
prenet_layers (int) – Number of prenet layers.
prenet_units (int) – Number of prenet units.
postnet_layers (int) – Number of postnet layers.
postnet_filts (int) – Number of postnet filter size.
postnet_chans (int) – Number of postnet filter channels.
output_activation (str) – Name of activation function for outputs.
cumulate_att_w (bool) – Whether to cumulate previous attention weight.
use_batch_norm (bool) – Whether to use batch normalization.
use_concate (bool) – Whether to concat enc outputs w/ dec lstm outputs.
reduction_factor (int) – Reduction factor.
spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.
langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use the lid embedding layer.
spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
spk_embed_integration_type (str) – How to integrate speaker embedding.
dropout_rate (float) – Dropout rate.
zoneout_rate (float) – Zoneout rate.
forward(enc_outputs: torch.Tensor, enc_outputs_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]¶
Calculate forward propagation.
- Parameters:
enc_outputs (Tensor) – Batch of padded encoder hidden states (B, T, idim).
enc_outputs_lengths (LongTensor) – Batch of lengths of each input batch (B,).
feats (Tensor) – Batch of padded target features (B, T_feats, odim).
feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- Returns:
after_outs, before_outs, logits, att_ws, ys, stop_labels, olens (TODO(jiatong): add full comments)
inference(enc_outputs: torch.Tensor, feats: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_att_constraint: bool = False, backward_window: int = 1, forward_window: int = 3, use_teacher_forcing: bool = False) → Dict[str, torch.Tensor][source]¶
Generate the sequence of features given the encoder outputs.
- Parameters:
enc_outputs (Tensor) – Input sequence of encoder hidden states (N, idim).
feats (Optional[Tensor]) – Feature sequence to extract style (N, odim).
spembs (Optional[Tensor]) – Speaker embedding (spk_embed_dim,).
sids (Optional[Tensor]) – Speaker ID (1,).
lids (Optional[Tensor]) – Language ID (1,).
threshold (float) – Threshold in inference.
minlenratio (float) – Minimum length ratio in inference.
maxlenratio (float) – Maximum length ratio in inference.
use_att_constraint (bool) – Whether to apply attention constraint.
backward_window (int) – Backward window in attention constraint.
forward_window (int) – Forward window in attention constraint.
use_teacher_forcing (bool) – Whether to use teacher forcing.
- Returns:
- Output dict including the following items:
feat_gen (Tensor): Output sequence of features (T_feats, odim).
prob (Tensor): Output sequence of stop probabilities (T_feats,).
att_w (Tensor): Attention weights (T_feats, T).
- Return type:
Dict[str, Tensor]
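A hedged sketch of feature generation with inference(); translatotron is an already-built Translatotron instance, and enc_outputs is a random stand-in for real S2ST encoder states:

    import torch

    enc_outputs = torch.randn(100, 512)   # (N, idim) encoder hidden states (stand-in)
    out = translatotron.inference(
        enc_outputs,
        threshold=0.5,                    # stop-token probability threshold
        minlenratio=0.0,
        maxlenratio=10.0,
    )
    feat_gen = out["feat_gen"]            # (T_feats, odim) generated features
    prob = out["prob"]                    # (T_feats,) stop probabilities
    att_w = out["att_w"]                  # (T_feats, T) attention weights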
espnet2.s2st.aux_attention.multihead¶
class espnet2.s2st.aux_attention.multihead.MultiHeadAttention(n_head: int = 4, n_feat: int = 512, dropout_rate: float = 0.0)[source]¶
Bases: espnet2.s2st.aux_attention.abs_aux_attention.AbsS2STAuxAttention
Multihead Attention for S2ST.
forward(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, mask: torch.Tensor)[source]¶
Forward.
- Parameters:
query (torch.Tensor) – Query tensor (#batch, time1, size).
key (torch.Tensor) – Key tensor (#batch, time2, size).
value (torch.Tensor) – Value tensor (#batch, time2, size).
mask (torch.Tensor) – Mask tensor (#batch, 1, time2) or (#batch, time1, time2).
- Returns:
Output tensor (#batch, time1, d_model).
- Return type:
torch.Tensor
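A minimal usage sketch with the documented defaults; all tensors are random and follow the shapes given above:

    import torch
    from espnet2.s2st.aux_attention.multihead import MultiHeadAttention

    att = MultiHeadAttention(n_head=4, n_feat=512, dropout_rate=0.0)
    query = torch.randn(2, 10, 512)                  # (#batch, time1, size)
    key = torch.randn(2, 20, 512)                    # (#batch, time2, size)
    value = torch.randn(2, 20, 512)                  # (#batch, time2, size)
    mask = torch.ones(2, 1, 20, dtype=torch.bool)    # (#batch, 1, time2)

    out = att(query, key, value, mask)               # (#batch, time1, d_model)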
espnet2.s2st.aux_attention.__init__¶
espnet2.s2st.aux_attention.abs_aux_attention¶
class espnet2.s2st.aux_attention.abs_aux_attention.AbsS2STAuxAttention(*args, **kwargs)[source]¶
Bases: torch.nn.modules.module.Module, abc.ABC
Base class for all S2ST auxiliary attention modules.
Refer to https://arxiv.org/abs/2107.08661
Initializes internal Module state, shared by both nn.Module and ScriptModule.
abstract forward() → torch.Tensor[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
property name¶
espnet2.s2st.losses.guided_attention_loss¶
class espnet2.s2st.losses.guided_attention_loss.S2STGuidedAttentionLoss(weight: float = 1.0, sigma: float = 0.4, alpha: float = 1.0)[source]¶
Bases: espnet2.s2st.losses.abs_loss.AbsS2STLoss
Guided attention loss for S2ST.
espnet2.s2st.losses.tacotron_loss¶
class espnet2.s2st.losses.tacotron_loss.S2STTacotron2Loss(weight: float = 1.0, loss_type: str = 'L1+L2', use_masking: espnet2.utils.types.str2bool = True, use_weighted_masking: espnet2.utils.types.str2bool = False, bce_pos_weight: float = 20.0)[source]¶
Bases: espnet2.s2st.losses.abs_loss.AbsS2STLoss
Tacotron-based loss for S2ST.
forward(after_outs: torch.Tensor, before_outs: torch.Tensor, logits: torch.Tensor, ys: torch.Tensor, labels: torch.Tensor, olens: torch.Tensor)[source]¶
Forward.
- Parameters:
after_outs (Tensor) – Batch of outputs after postnets (B, Lmax, odim).
before_outs (Tensor) – Batch of outputs before postnets (B, Lmax, odim).
logits (Tensor) – Batch of stop logits (B, Lmax).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
labels (LongTensor) – Batch of the sequences of stop token labels (B, Lmax).
olens (LongTensor) – Batch of the lengths of each target (B,).
- Returns:
L1 loss value (Tensor), mean square error loss value (Tensor), and binary cross entropy loss value (Tensor).
- Return type:
Tensor
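A hedged sketch of computing the loss; the tensors are random stand-ins with the docstring's shapes, and unpacking into three loss tensors follows the Returns section above:

    import torch
    from espnet2.s2st.losses.tacotron_loss import S2STTacotron2Loss

    loss_fn = S2STTacotron2Loss(weight=1.0, loss_type="L1+L2")
    B, Lmax, odim = 2, 30, 80
    after_outs = torch.randn(B, Lmax, odim)     # decoder outputs after postnet
    before_outs = torch.randn(B, Lmax, odim)    # decoder outputs before postnet
    logits = torch.randn(B, Lmax)               # stop-token logits
    ys = torch.randn(B, Lmax, odim)             # padded target features
    labels = torch.zeros(B, Lmax)               # stop labels: 1 at the final frame
    labels[:, -1] = 1.0
    olens = torch.tensor([30, 25])              # target lengths

    l1_loss, mse_loss, bce_loss = loss_fn(after_outs, before_outs, logits, ys, labels, olens)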
espnet2.s2st.losses.abs_loss¶
class espnet2.s2st.losses.abs_loss.AbsS2STLoss(*args, **kwargs)[source]¶
Bases: torch.nn.modules.module.Module, abc.ABC
Base class for all S2ST loss modules.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
abstract forward() → torch.Tensor[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
property
name
¶
espnet2.s2st.losses.ctc_loss¶
class espnet2.s2st.losses.ctc_loss.S2STCTCLoss(weight: float = 1.0)[source]¶
Bases: espnet2.s2st.losses.abs_loss.AbsS2STLoss
CTC-based loss for S2ST.
forward()[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.s2st.losses.attention_loss¶
class espnet2.s2st.losses.attention_loss.S2STAttentionLoss(vocab_size: int, padding_idx: int = -1, weight: float = 1.0, smoothing: float = 0.0, normalize_length: espnet2.utils.types.str2bool = False, criterion: torch.nn.modules.module.Module = KLDivLoss())[source]¶
Bases: espnet2.s2st.losses.abs_loss.AbsS2STLoss
Attention-based label smoothing loss for S2ST.
espnet2.s2st.losses.__init__¶
espnet2.s2st.tgt_feats_extract.log_mel_fbank¶
class espnet2.s2st.tgt_feats_extract.log_mel_fbank.LogMelFbank(fs: Union[int, str] = 16000, n_fft: int = 1024, win_length: int = None, hop_length: int = 256, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True, n_mels: int = 80, fmin: Optional[int] = 80, fmax: Optional[int] = 7600, htk: bool = False, log_base: Optional[float] = 10.0)[source]¶
Bases: espnet2.s2st.tgt_feats_extract.abs_tgt_feats_extract.AbsTgtFeatsExtract
Conventional frontend structure for TTS.
Stft -> amplitude-spec -> Log-Mel-Fbank
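A minimal sketch of target feature extraction with the documented defaults; the waveform is random noise and the shapes are illustrative:

    import torch
    from espnet2.s2st.tgt_feats_extract.log_mel_fbank import LogMelFbank

    fbank = LogMelFbank(fs=16000, n_fft=1024, hop_length=256, n_mels=80)
    speech = torch.randn(2, 16000)                 # (Batch, Nsamples) waveforms
    speech_lengths = torch.tensor([16000, 12000])  # (Batch,)

    feats, feats_lengths = fbank(speech, speech_lengths)
    # feats: (Batch, Frames, n_mels) log-Mel features; feats_lengths: (Batch,)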
forward(input: torch.Tensor, input_lengths: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.s2st.tgt_feats_extract.linear_spectrogram¶
class espnet2.s2st.tgt_feats_extract.linear_spectrogram.LinearSpectrogram(n_fft: int = 1024, win_length: int = None, hop_length: int = 256, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True)[source]¶
Bases: espnet2.s2st.tgt_feats_extract.abs_tgt_feats_extract.AbsTgtFeatsExtract
Linear amplitude spectrogram.
Stft -> amplitude-spec
forward(input: torch.Tensor, input_lengths: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.s2st.tgt_feats_extract.log_spectrogram¶
class espnet2.s2st.tgt_feats_extract.log_spectrogram.LogSpectrogram(n_fft: int = 1024, win_length: int = None, hop_length: int = 256, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True)[source]¶
Bases: espnet2.s2st.tgt_feats_extract.abs_tgt_feats_extract.AbsTgtFeatsExtract
Conventional frontend structure for ASR.
Stft -> log-amplitude-spec
forward(input: torch.Tensor, input_lengths: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.s2st.tgt_feats_extract.__init__¶
espnet2.s2st.tgt_feats_extract.abs_tgt_feats_extract¶
class espnet2.s2st.tgt_feats_extract.abs_tgt_feats_extract.AbsTgtFeatsExtract(*args, **kwargs)[source]¶
Bases: espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract, abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
abstract forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.