espnet2.s2st package

espnet2.s2st.espnet_model

class espnet2.s2st.espnet_model.ESPnetS2STModel(s2st_type: str, frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], tgt_feats_extract: Optional[espnet2.s2st.tgt_feats_extract.abs_tgt_feats_extract.AbsTgtFeatsExtract], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], src_normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], tgt_normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], asr_decoder: Optional[espnet2.asr.decoder.abs_decoder.AbsDecoder], st_decoder: Optional[espnet2.asr.decoder.abs_decoder.AbsDecoder], aux_attention: Optional[espnet2.s2st.aux_attention.abs_aux_attention.AbsS2STAuxAttention], unit_encoder: Optional[espnet2.asr.encoder.abs_encoder.AbsEncoder], synthesizer: Optional[espnet2.s2st.synthesizer.abs_synthesizer.AbsSynthesizer], asr_ctc: Optional[espnet2.asr.ctc.CTC], st_ctc: Optional[espnet2.asr.ctc.CTC], losses: Dict[str, espnet2.s2st.losses.abs_loss.AbsS2STLoss], tgt_vocab_size: Optional[int], tgt_token_list: Union[Tuple[str, ...], List[str], None], src_vocab_size: Optional[int], src_token_list: Union[Tuple[str, ...], List[str], None], unit_vocab_size: Optional[int], unit_token_list: Union[Tuple[str, ...], List[str], None], ignore_id: int = -1, report_cer: bool = True, report_wer: bool = True, report_bleu: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', extract_feats_in_collect_stats: bool = True)[source]

Bases: espnet2.train.abs_espnet_model.AbsESPnetModel

ESPnet speech-to-speech translation model

collect_feats(src_speech: torch.Tensor, src_speech_lengths: torch.Tensor, tgt_speech: torch.Tensor, tgt_speech_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]
encode(speech: torch.Tensor, speech_lengths: torch.Tensor, return_all_hs: bool = False, **kwargs) → Tuple[torch.Tensor, torch.Tensor][source]

Frontend + Encoder. Note that this method is used by st_inference.py

Parameters:
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch, )
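The (Batch, Length, …) and (Batch, ) shapes describe a padded batch together with the true length of each utterance. A minimal torch-free sketch of that layout, using plain lists; the helper name pad_batch is hypothetical and not part of ESPnet:

```python
from typing import List, Tuple


def pad_batch(
    utterances: List[List[float]], pad_value: float = 0.0
) -> Tuple[List[List[float]], List[int]]:
    """Pad variable-length utterances to a common length (hypothetical helper)."""
    lengths = [len(u) for u in utterances]
    max_len = max(lengths)
    padded = [u + [pad_value] * (max_len - len(u)) for u in utterances]
    return padded, lengths


# Two utterances of 3 and 1 samples become a (2, 3) batch plus their lengths.
speech, speech_lengths = pad_batch([[0.1, 0.2, 0.3], [0.5]])
print(speech)
print(speech_lengths)
```

In the real model both values would be torch.Tensor objects; the lengths let later stages ignore the padded positions.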

forward(src_speech: torch.Tensor, src_speech_lengths: torch.Tensor, tgt_speech: torch.Tensor, tgt_speech_lengths: torch.Tensor, tgt_text: Optional[torch.Tensor] = None, tgt_text_lengths: Optional[torch.Tensor] = None, src_text: Optional[torch.Tensor] = None, src_text_lengths: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
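The note concerns the call protocol of torch.nn.Module: calling the instance runs the registered hooks around forward, while calling forward directly skips them. A minimal pure-Python stand-in for that distinction; the TinyModule class is hypothetical and only imitates the idea, it is not the PyTorch implementation:

```python
from typing import Callable, List, Tuple


class TinyModule:
    """Hypothetical stand-in that imitates hook dispatch around forward()."""

    def __init__(self) -> None:
        self._forward_hooks: List[Callable[[float, float], None]] = []

    def register_forward_hook(self, hook: Callable[[float, float], None]) -> None:
        self._forward_hooks.append(hook)

    def forward(self, x: float) -> float:
        return 2.0 * x

    def __call__(self, x: float) -> float:
        out = self.forward(x)
        # Hooks run only on this path, never when forward() is called directly.
        for hook in self._forward_hooks:
            hook(x, out)
        return out


seen: List[Tuple[float, float]] = []
module = TinyModule()
module.register_forward_hook(lambda x, out: seen.append((x, out)))

module.forward(1.0)  # bypasses the hook
module(3.0)          # runs the hook
print(seen)
```

Only the second call is recorded, which is why model(...) is preferred over model.forward(...).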

inference(src_speech: torch.Tensor, src_speech_lengths: Optional[torch.Tensor] = None, tgt_speech: Optional[torch.Tensor] = None, tgt_speech_lengths: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_att_constraint: bool = False, backward_window: int = 1, forward_window: int = 3, use_teacher_forcing: bool = False) → Dict[str, torch.Tensor][source]
property require_vocoder

Return whether or not vocoder is required.

espnet2.s2st.__init__

espnet2.s2st.losses.abs_loss

class espnet2.s2st.losses.abs_loss.AbsS2STLoss(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Base class for all S2ST loss modules.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward() → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

property name

espnet2.s2st.losses.ctc_loss

class espnet2.s2st.losses.ctc_loss.S2STCTCLoss(weight: float = 1.0)[source]

Bases: espnet2.s2st.losses.abs_loss.AbsS2STLoss

CTC-based loss for S2ST.

forward()[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.s2st.losses.guided_attention_loss

class espnet2.s2st.losses.guided_attention_loss.S2STGuidedAttentionLoss(weight: float = 1.0, sigma: float = 0.4, alpha: float = 1.0)[source]

Bases: espnet2.s2st.losses.abs_loss.AbsS2STLoss

Guided attention loss for S2ST.

forward(att_ws: torch.Tensor, ilens: torch.Tensor, olens_in: torch.Tensor)[source]

Forward.

Returns:

Guided attention loss value.

Return type:

Tensor
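A guided attention loss penalizes attention weights that stray from the diagonal. The sketch below uses the formulation commonly paired with Tacotron-style models; it is assumed here rather than read from the ESPnet source, it handles a single unpadded attention matrix, and the function names are hypothetical. sigma and alpha mirror the constructor defaults.

```python
import math
from typing import List


def guided_attention_weights(ilen: int, olen: int, sigma: float = 0.4) -> List[List[float]]:
    """Penalty W[t][n] = 1 - exp(-((n / ilen - t / olen) ** 2) / (2 * sigma ** 2))."""
    return [
        [1.0 - math.exp(-((n / ilen - t / olen) ** 2) / (2.0 * sigma ** 2)) for n in range(ilen)]
        for t in range(olen)
    ]


def guided_attention_loss(att_w: List[List[float]], sigma: float = 0.4, alpha: float = 1.0) -> float:
    """Mean of penalty-weighted attention over one (olen, ilen) matrix."""
    olen, ilen = len(att_w), len(att_w[0])
    w = guided_attention_weights(ilen, olen, sigma)
    total = sum(w[t][n] * att_w[t][n] for t in range(olen) for n in range(ilen))
    return alpha * total / (olen * ilen)


# Diagonal attention is not penalized; anti-diagonal attention is.
diagonal = [[1.0 if t == n else 0.0 for n in range(4)] for t in range(4)]
anti_diagonal = [[1.0 if t == 3 - n else 0.0 for n in range(4)] for t in range(4)]
print(guided_attention_loss(diagonal))
print(guided_attention_loss(anti_diagonal))
```

A smaller sigma narrows the band around the diagonal that goes unpenalized.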

espnet2.s2st.losses.__init__

espnet2.s2st.losses.attention_loss

class espnet2.s2st.losses.attention_loss.S2STAttentionLoss(vocab_size: int, padding_idx: int = -1, weight: float = 1.0, smoothing: float = 0.0, normalize_length: espnet2.utils.types.str2bool = False, criterion: torch.nn.modules.module.Module = KLDivLoss())[source]

Bases: espnet2.s2st.losses.abs_loss.AbsS2STLoss

Attention-based label smoothing loss for S2ST.

forward(dense_y: torch.Tensor, token_y: torch.Tensor)[source]

Forward.
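Label smoothing with a KL-divergence criterion can be sketched as follows. This is a single-token, torch-free illustration under the usual assumption that the target class keeps 1 - smoothing and the remainder is spread uniformly over the other classes; the function names are hypothetical, and padding and length normalization are left out.

```python
import math
from typing import List


def smoothed_target(vocab_size: int, target: int, smoothing: float) -> List[float]:
    """Put 1 - smoothing on the target and spread the rest uniformly."""
    off_value = smoothing / (vocab_size - 1)
    dist = [off_value] * vocab_size
    dist[target] = 1.0 - smoothing
    return dist


def kl_divergence(true_dist: List[float], log_probs: List[float]) -> float:
    """KL(true || model) for one token; zero-probability terms contribute nothing."""
    return sum(p * (math.log(p) - lq) for p, lq in zip(true_dist, log_probs) if p > 0.0)


log_probs = [math.log(p) for p in (0.7, 0.2, 0.1)]
hard = smoothed_target(3, target=0, smoothing=0.0)
soft = smoothed_target(3, target=0, smoothing=0.1)
print(kl_divergence(hard, log_probs))  # reduces to -log(0.7) without smoothing
print(kl_divergence(soft, log_probs))
```

With smoothing set to 0.0 the criterion reduces to the ordinary negative log-likelihood of the target token.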

espnet2.s2st.losses.tacotron_loss

class espnet2.s2st.losses.tacotron_loss.S2STTacotron2Loss(weight: float = 1.0, loss_type: str = 'L1+L2', use_masking: espnet2.utils.types.str2bool = True, use_weighted_masking: espnet2.utils.types.str2bool = False, bce_pos_weight: float = 20.0)[source]

Bases: espnet2.s2st.losses.abs_loss.AbsS2STLoss

Tacotron-based loss for S2ST.

forward(after_outs: torch.Tensor, before_outs: torch.Tensor, logits: torch.Tensor, ys: torch.Tensor, labels: torch.Tensor, olens: torch.Tensor)[source]

Forward.

Parameters:
  • after_outs (Tensor) – Batch of outputs after postnets (B, Lmax, odim).

  • before_outs (Tensor) – Batch of outputs before postnets (B, Lmax, odim).

  • logits (Tensor) – Batch of stop logits (B, Lmax).

  • ys (Tensor) – Batch of padded target features (B, Lmax, odim).

  • labels (LongTensor) – Batch of the sequences of stop token labels (B, Lmax).

  • olens (LongTensor) – Batch of the lengths of each target (B,).

Returns:

  • Tensor: L1 loss value.

  • Tensor: Mean square error loss value.

  • Tensor: Binary cross entropy loss value.
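The three returned values can be illustrated for a single utterance with odim = 1. This is a simplified sketch, assuming masking means that frames at or beyond olen are ignored and that the positive stop label is up-weighted by bce_pos_weight; the function name is hypothetical.

```python
import math
from typing import List, Tuple


def tacotron2_style_losses(
    after_outs: List[float],
    before_outs: List[float],
    logits: List[float],
    ys: List[float],
    labels: List[float],
    olen: int,
    bce_pos_weight: float = 20.0,
) -> Tuple[float, float, float]:
    """Masked L1, MSE and stop-token BCE for one utterance with odim = 1."""
    valid = range(olen)  # frames at index >= olen are padding and are ignored
    l1 = sum(abs(after_outs[t] - ys[t]) + abs(before_outs[t] - ys[t]) for t in valid) / olen
    mse = sum((after_outs[t] - ys[t]) ** 2 + (before_outs[t] - ys[t]) ** 2 for t in valid) / olen
    bce = 0.0
    for t in valid:
        p = 1.0 / (1.0 + math.exp(-logits[t]))  # sigmoid of the stop logit
        # Stop frames are rare, so the positive class is up-weighted.
        bce -= bce_pos_weight * labels[t] * math.log(p) + (1.0 - labels[t]) * math.log(1.0 - p)
    return l1, mse, bce / olen


l1, mse, bce = tacotron2_style_losses(
    after_outs=[1.5, 2.0, 9.0],
    before_outs=[1.0, 1.0, 9.0],
    logits=[-2.0, 2.0, 0.0],
    ys=[1.0, 2.0, 0.0],
    labels=[0.0, 1.0, 0.0],
    olen=2,  # the third frame is padding
)
print(l1, mse, bce)
```

The padded third frame has a large error (9.0 against 0.0) yet does not affect any of the three values.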

espnet2.s2st.tgt_feats_extract.log_mel_fbank

class espnet2.s2st.tgt_feats_extract.log_mel_fbank.LogMelFbank(fs: Union[int, str] = 16000, n_fft: int = 1024, win_length: int = None, hop_length: int = 256, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True, n_mels: int = 80, fmin: Optional[int] = 80, fmax: Optional[int] = 7600, htk: bool = False, log_base: Optional[float] = 10.0)[source]

Bases: espnet2.s2st.tgt_feats_extract.abs_tgt_feats_extract.AbsTgtFeatsExtract

Conventional frontend structure for TTS.

Stft -> amplitude-spec -> Log-Mel-Fbank

forward(input: torch.Tensor, input_lengths: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_parameters() → Dict[str, Any][source]

Return the parameters required by Vocoder.

output_size() → int[source]
spectrogram() → bool[source]
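The n_fft, hop_length and center arguments determine how many feature frames a waveform yields. A small sketch of the standard STFT frame count, assuming an even n_fft and that centering pads n_fft // 2 samples on each side; the helper name num_frames is hypothetical:

```python
def num_frames(num_samples: int, n_fft: int = 1024, hop_length: int = 256, center: bool = True) -> int:
    """Number of STFT frames for a waveform of num_samples samples."""
    if center:
        # A centered STFT pads n_fft // 2 samples on both sides.
        num_samples += 2 * (n_fft // 2)
    return (num_samples - n_fft) // hop_length + 1


one_second = 16000  # samples at fs = 16000
print(num_frames(one_second))                # 63
print(num_frames(one_second, center=False))  # 59
```

With the defaults shown in the signature, one second of 16 kHz audio therefore maps to 63 frames of n_mels = 80 features.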

espnet2.s2st.tgt_feats_extract.abs_tgt_feats_extract

class espnet2.s2st.tgt_feats_extract.abs_tgt_feats_extract.AbsTgtFeatsExtract(*args, **kwargs)[source]

Bases: espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract get_parameters() → Dict[str, Any][source]
abstract output_size() → int[source]
abstract spectrogram() → bool[source]

espnet2.s2st.tgt_feats_extract.log_spectrogram

class espnet2.s2st.tgt_feats_extract.log_spectrogram.LogSpectrogram(n_fft: int = 1024, win_length: int = None, hop_length: int = 256, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True)[source]

Bases: espnet2.s2st.tgt_feats_extract.abs_tgt_feats_extract.AbsTgtFeatsExtract

Conventional frontend structure for ASR.

Stft -> log-amplitude-spec

forward(input: torch.Tensor, input_lengths: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_parameters() → Dict[str, Any][source]

Return the parameters required by Vocoder.

output_size() → int[source]
spectrogram() → bool[source]
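The Stft -> log-amplitude-spec pipeline can be illustrated per frame. The sketch assumes that a one-sided spectrum keeps n_fft // 2 + 1 bins and that power values are clamped by a small eps before the logarithm; the eps value and both helper names are assumptions, not taken from the ESPnet source.

```python
import math
from typing import List


def num_bins(n_fft: int = 1024, onesided: bool = True) -> int:
    """Frequency bins per frame; a real signal needs only n_fft // 2 + 1 of them."""
    return n_fft // 2 + 1 if onesided else n_fft


def log_amplitude(power_frame: List[float], eps: float = 1e-10) -> List[float]:
    """Log10 amplitude of one frame of power values; eps (assumed) avoids log(0)."""
    return [0.5 * math.log10(max(p, eps)) for p in power_frame]


print(num_bins())                   # 513
print(log_amplitude([100.0, 0.0]))  # amplitude 10 maps to 1.0; silence is clamped
```

The factor 0.5 converts log-power to log-amplitude, since amplitude is the square root of power.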

espnet2.s2st.tgt_feats_extract.__init__

espnet2.s2st.tgt_feats_extract.linear_spectrogram

class espnet2.s2st.tgt_feats_extract.linear_spectrogram.LinearSpectrogram(n_fft: int = 1024, win_length: int = None, hop_length: int = 256, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True)[source]

Bases: espnet2.s2st.tgt_feats_extract.abs_tgt_feats_extract.AbsTgtFeatsExtract

Linear amplitude spectrogram.

Stft -> amplitude-spec

forward(input: torch.Tensor, input_lengths: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_parameters() → Dict[str, Any][source]

Return the parameters required by Vocoder.

output_size() → int[source]
spectrogram() → bool[source]

espnet2.s2st.aux_attention.abs_aux_attention

class espnet2.s2st.aux_attention.abs_aux_attention.AbsS2STAuxAttention(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Base class for all S2ST auxiliary attention modules. Refer to https://arxiv.org/abs/2107.08661

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward() → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

property name

espnet2.s2st.aux_attention.multihead

class espnet2.s2st.aux_attention.multihead.MultiHeadAttention(n_head: int = 4, n_feat: int = 512, dropout_rate: float = 0.0)[source]

Bases: espnet2.s2st.aux_attention.abs_aux_attention.AbsS2STAuxAttention

Multihead Attention for S2ST.

forward(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, mask: torch.Tensor)[source]

Forward.

Parameters:
  • query (torch.Tensor) – Query tensor (#batch, time1, size).

  • key (torch.Tensor) – Key tensor (#batch, time2, size).

  • value (torch.Tensor) – Value tensor (#batch, time2, size).

  • mask (torch.Tensor) – Mask tensor (#batch, 1, time2) or (#batch, time1, time2).

Returns:

Output tensor (#batch, time1, d_model).

Return type:

torch.Tensor
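The role of query, key, value and mask can be sketched with single-head scaled dot-product attention for one utterance. This is a torch-free illustration, not the module's code: the real class uses n_head heads with learned projections, and the function name is hypothetical.

```python
import math
from typing import List


def masked_attention(
    query: List[List[float]],
    key: List[List[float]],
    value: List[List[float]],
    mask: List[bool],
) -> List[List[float]]:
    """Single-head scaled dot-product attention for one utterance.

    query is (time1, size), key and value are (time2, size), and mask is
    (time2,) with False marking padded key positions.
    """
    scale = math.sqrt(len(query[0]))
    outputs = []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in key]
        # Subtract the largest kept score for numerical stability.
        peak = max(s for s, keep in zip(scores, mask) if keep)
        # Masked positions get zero weight after the softmax.
        exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(scores, mask)]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[d] for w, v in zip(weights, value)) for d in range(len(value[0]))])
    return outputs


out = masked_attention(
    query=[[1.0, 0.0]],
    key=[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
    value=[[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]],
    mask=[True, True, False],  # the third key position is padding
)
print(out)
```

Because the third position is masked, its large value vector never reaches the output, and the output row holds the two remaining attention weights.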

espnet2.s2st.aux_attention.__init__

espnet2.s2st.synthesizer.abs_synthesizer

Text-to-speech abstract class.

class espnet2.s2st.synthesizer.abs_synthesizer.AbsSynthesizer(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, abc.ABC

TTS abstract class.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input_states: torch.Tensor, input_states_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Calculate outputs and return the loss tensor.

abstract inference(input_states: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]

Return predicted output as a dict.

property require_raw_speech

Return whether or not raw_speech is required.

property require_vocoder

Return whether or not vocoder is required.

espnet2.s2st.synthesizer.unity_synthesizer

UnitY Synthesizer related modules for ESPnet2.

class espnet2.s2st.synthesizer.unity_synthesizer.UnitYSynthesizer(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, layer_drop_rate: float = 0.0, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'concat')[source]

Bases: espnet2.s2st.synthesizer.abs_synthesizer.AbsSynthesizer

UnitY Synthesizer related modules for speech-to-speech translation.

This module is the discrete unit prediction network described in "Direct speech-to-speech translation with discrete units". It converts a sequence of hidden states into a sequence of discrete units (from SSLs).

Transformer decoder for discrete unit module.

Parameters:
  • vocab_size – output dim

  • encoder_output_size – dimension of attention

  • attention_heads – the number of heads of multi head attention

  • linear_units – the number of units of position-wise feed forward

  • num_blocks – the number of decoder blocks

  • dropout_rate – dropout rate

  • self_attention_dropout_rate – dropout rate for attention

  • input_layer – input layer type

  • use_output_layer – whether to use output layer

  • pos_enc_class – PositionalEncoding or ScaledPositionalEncoding

  • normalize_before – whether to use layer_norm before the first block

  • concat_after – whether to concatenate the attention layer's input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))). If False, no additional linear layer is applied, i.e. x -> x + att(x).

  • spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.

  • langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.

  • spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.

  • spk_embed_integration_type (str) – How to integrate speaker embedding.
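The two residual variants selected by concat_after can be sketched on plain vectors. The att and linear callables below are toy stand-ins (hypothetical) for the attention sub-layer and the extra linear projection:

```python
from typing import Callable, List

Vector = List[float]


def residual_block(
    x: Vector,
    att: Callable[[Vector], Vector],
    linear: Callable[[Vector], Vector],
    concat_after: bool,
) -> Vector:
    """Apply the attention sub-layer with either residual variant."""
    a = att(x)
    if concat_after:
        # x -> x + linear(concat(x, att(x)))
        update = linear(x + a)  # list "+" concatenates the two vectors
    else:
        # x -> x + att(x)
        update = a
    return [xi + ui for xi, ui in zip(x, update)]


def toy_att(v: Vector) -> Vector:
    """Toy attention stand-in: doubles every element."""
    return [2.0 * e for e in v]


def toy_linear(v: Vector) -> Vector:
    """Toy projection from 2 * size back to size: averages the two halves."""
    half = len(v) // 2
    return [(v[i] + v[i + half]) / 2.0 for i in range(half)]


print(residual_block([1.0, 2.0], toy_att, toy_linear, concat_after=False))  # [3.0, 6.0]
print(residual_block([1.0, 2.0], toy_att, toy_linear, concat_after=True))   # [2.5, 5.0]
```

The concatenating variant needs the linear layer to map the doubled width back to the model dimension before the residual sum.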

forward(enc_outputs: torch.Tensor, enc_outputs_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, return_last_hidden: bool = False, return_all_hiddens: bool = False) → Tuple[torch.Tensor, torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • enc_outputs (Tensor) – Batch of padded encoder outputs (B, T, idim).

  • enc_outputs_lengths (LongTensor) – Batch of lengths of each input sequence (B,).

  • feats (Tensor) – Batch of padded target features (B, T_feats, odim).

  • feats_lengths (LongTensor) – Batch of the lengths of each target (B,).

  • spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).

  • sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).

  • lids (Optional[Tensor]) – Batch of language IDs (B, 1).

Returns:

hs, hlens

espnet2.s2st.synthesizer.translatotron

Translatotron Synthesizer related modules for ESPnet2.

class espnet2.s2st.synthesizer.translatotron.Translatotron(idim: int, odim: int, embed_dim: int = 512, atype: str = 'multihead', adim: int = 512, aheads: int = 4, aconv_chans: int = 32, aconv_filts: int = 15, cumulate_att_w: bool = True, dlayers: int = 4, dunits: int = 1024, prenet_layers: int = 2, prenet_units: int = 32, postnet_layers: int = 5, postnet_chans: int = 512, postnet_filts: int = 5, output_activation: str = None, use_batch_norm: bool = True, use_concate: bool = True, use_residual: bool = False, reduction_factor: int = 2, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'concat', dropout_rate: float = 0.5, zoneout_rate: float = 0.1)[source]

Bases: espnet2.s2st.synthesizer.abs_synthesizer.AbsSynthesizer

Translatotron Synthesizer related modules for speech-to-speech translation.

This module is the spectrogram prediction network of Translatotron, described in "Direct speech-to-speech translation with a sequence-to-sequence model". It converts a sequence of hidden states into a sequence of Mel-filterbanks.

Initialize Translatotron module.

Parameters:
  • idim (int) – Dimension of the inputs.

  • odim (int) – Dimension of the outputs.

  • adim (int) – Dimension of the MLP in attention.

  • atype (str) – Type of attention.

  • aconv_chans (int) – Number of attention conv filter channels.

  • aconv_filts (int) – Attention conv filter size.

  • embed_dim (int) – Dimension of the token embedding.

  • dlayers (int) – Number of decoder lstm layers.

  • dunits (int) – Number of decoder lstm units.

  • prenet_layers (int) – Number of prenet layers.

  • prenet_units (int) – Number of prenet units.

  • postnet_layers (int) – Number of postnet layers.

  • postnet_filts (int) – Postnet filter size.

  • postnet_chans (int) – Number of postnet filter channels.

  • output_activation (str) – Name of activation function for outputs.

  • cumulate_att_w (bool) – Whether to cumulate previous attention weight.

  • use_batch_norm (bool) – Whether to use batch normalization.

  • use_concate (bool) – Whether to concatenate encoder outputs with decoder LSTM outputs.

  • reduction_factor (int) – Reduction factor.

  • spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.

  • langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.

  • spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.

  • spk_embed_integration_type (str) – How to integrate speaker embedding.

  • dropout_rate (float) – Dropout rate.

  • zoneout_rate (float) – Zoneout rate.
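The effect of reduction_factor on sequence lengths can be sketched as follows. The sketch assumes the usual Tacotron-style convention that each decoder step emits reduction_factor frames and that target lengths are trimmed to a multiple of it; both helper names are hypothetical.

```python
def trim_to_reduction_factor(olen: int, reduction_factor: int = 2) -> int:
    """Largest length <= olen that is a multiple of reduction_factor."""
    return olen - olen % reduction_factor


def decoder_steps(olen: int, reduction_factor: int = 2) -> int:
    """Decoder steps needed when each step emits reduction_factor frames."""
    return trim_to_reduction_factor(olen, reduction_factor) // reduction_factor


print(trim_to_reduction_factor(101))  # 100
print(decoder_steps(101))             # 50
```

A larger reduction factor shortens the decoder loop at the cost of coarser step-level attention.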

forward(enc_outputs: torch.Tensor, enc_outputs_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • enc_outputs (Tensor) – Batch of padded encoder outputs (B, T, idim).

  • enc_outputs_lengths (LongTensor) – Batch of lengths of each input sequence (B,).

  • feats (Tensor) – Batch of padded target features (B, T_feats, odim).

  • feats_lengths (LongTensor) – Batch of the lengths of each target (B,).

  • spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).

  • sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).

  • lids (Optional[Tensor]) – Batch of language IDs (B, 1).

Returns:

after_outs, before_outs, logits, att_ws, ys, stop_labels, olens

inference(enc_outputs: torch.Tensor, feats: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_att_constraint: bool = False, backward_window: int = 1, forward_window: int = 3, use_teacher_forcing: bool = False) → Dict[str, torch.Tensor][source]

Generate the sequence of features given the sequence of encoder outputs.

Parameters:
  • enc_outputs (Tensor) – Input sequence of encoder outputs (N, idim).

  • feats (Optional[Tensor]) – Feature sequence to extract style (N, odim).

  • spembs (Optional[Tensor]) – Speaker embedding (spk_embed_dim,).

  • sids (Optional[Tensor]) – Speaker ID (1,).

  • lids (Optional[Tensor]) – Language ID (1,).

  • threshold (float) – Threshold in inference.

  • minlenratio (float) – Minimum length ratio in inference.

  • maxlenratio (float) – Maximum length ratio in inference.

  • use_att_constraint (bool) – Whether to apply attention constraint.

  • backward_window (int) – Backward window in attention constraint.

  • forward_window (int) – Forward window in attention constraint.

  • use_teacher_forcing (bool) – Whether to use teacher forcing.

Returns:

Output dict including the following items:
  • feat_gen (Tensor): Output sequence of features (T_feats, odim).

  • prob (Tensor): Output sequence of stop probabilities (T_feats,).

  • att_w (Tensor): Attention weights (T_feats, T).

Return type:

Dict[str, Tensor]
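How threshold, minlenratio and maxlenratio interact can be sketched with a simple stopping rule. This is an assumed, simplified rule (stop once the stop probability reaches the threshold after the minimum length, or when the maximum length is hit); the function name is hypothetical and reduction_factor is ignored.

```python
from typing import List


def generated_length(
    stop_probs: List[float],
    input_length: int,
    threshold: float = 0.5,
    minlenratio: float = 0.0,
    maxlenratio: float = 10.0,
) -> int:
    """Number of decoder steps run before generation stops."""
    minlen = int(input_length * minlenratio)
    maxlen = int(input_length * maxlenratio)
    for step, prob in enumerate(stop_probs, start=1):
        if step >= maxlen:
            return step  # hard cap relative to the input length
        if prob >= threshold and step >= minlen:
            return step  # stop token fired after the minimum length
    return len(stop_probs)


probs = [0.1, 0.6, 0.2, 0.9, 0.1]
print(generated_length(probs, input_length=2))                                  # 2
print(generated_length(probs, input_length=2, minlenratio=1.5))                 # 4
print(generated_length(probs, input_length=2, threshold=0.95, maxlenratio=1.5)) # 3
```

Raising minlenratio guards against premature stops, while maxlenratio bounds runaway generation when the stop probability never reaches the threshold.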

espnet2.s2st.synthesizer.__init__

espnet2.s2st.synthesizer.translatotron2

espnet2.s2st.synthesizer.discrete_synthesizer

Discrete unit Synthesizer related modules for ESPnet2.

class espnet2.s2st.synthesizer.discrete_synthesizer.TransformerDiscreteSynthesizer(odim: int, idim: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, layer_drop_rate: float = 0.0, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'concat')[source]

Bases: espnet2.s2st.synthesizer.abs_synthesizer.AbsSynthesizer, espnet.nets.scorer_interface.BatchScorerInterface

Discrete unit Synthesizer related modules for speech-to-speech translation.

This module is the discrete unit prediction network described in "Direct speech-to-speech translation with discrete units". It converts a sequence of hidden states into a sequence of discrete units (from SSLs).

Transformer decoder for discrete unit module.

Parameters:
  • odim – output dimension (vocabulary size)

  • idim – dimension of attention (encoder output size)

  • attention_heads – the number of heads of multi head attention

  • linear_units – the number of units of position-wise feed forward

  • num_blocks – the number of decoder blocks

  • dropout_rate – dropout rate

  • self_attention_dropout_rate – dropout rate for attention

  • input_layer – input layer type

  • use_output_layer – whether to use output layer

  • pos_enc_class – PositionalEncoding or ScaledPositionalEncoding

  • normalize_before – whether to use layer_norm before the first block

  • concat_after – whether to concatenate the attention layer's input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))). If False, no additional linear layer is applied, i.e. x -> x + att(x).

  • spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.

  • langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.

  • spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.

  • spk_embed_integration_type (str) – How to integrate speaker embedding.

batch_score(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]

Score new token batch.

Parameters:
  • ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).

  • states (List[Any]) – Scorer states for prefix tokens.

  • xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).

Returns:

Tuple of batchified scores for the next token with shape (n_batch, n_vocab) and the next state list for ys.

Return type:

tuple[torch.Tensor, List[Any]]

forward(enc_outputs: torch.Tensor, enc_outputs_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, return_hs: bool = False, return_all_hs: bool = False) → Tuple[torch.Tensor, torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • enc_outputs (Tensor) – Batch of padded encoder outputs (B, T, idim).

  • enc_outputs_lengths (LongTensor) – Batch of lengths of each input sequence (B,).

  • feats (Tensor) – Batch of padded target features (B, T_feats, odim).

  • feats_lengths (LongTensor) – Batch of the lengths of each target (B,).

  • spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).

  • sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).

  • lids (Optional[Tensor]) – Batch of language IDs (B, 1).

Returns:

hs, hlens

forward_one_step(tgt: torch.Tensor, tgt_mask: torch.Tensor, memory: torch.Tensor, cache: List[torch.Tensor] = None) → Tuple[torch.Tensor, List[torch.Tensor]][source]

Forward one step.

Parameters:
  • tgt – input token ids, int64 (batch, maxlen_out)

  • tgt_mask – input token mask, (batch, maxlen_out). dtype=torch.uint8 in PyTorch versions before 1.2; dtype=torch.bool in PyTorch 1.2 and later.

  • memory – encoded memory, float32 (batch, maxlen_in, feat)

  • cache – cached output list of (batch, max_time_out-1, size)

Returns:

NN output value and cache per self.decoders. y.shape is (batch, maxlen_out, token).

Return type:

y, cache
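The cache argument exists so that step-by-step decoding does not recompute earlier positions. A torch-free sketch of that loop pattern with greedy selection; toy_step is a hypothetical stand-in for a one-step scorer and does not reproduce the signature or behavior of forward_one_step:

```python
from typing import Callable, List, Tuple

Step = Callable[[List[int], List[int]], Tuple[List[float], List[int]]]


def greedy_decode(step: Step, sos: int, eos: int, max_len: int) -> List[int]:
    """Greedy decoding that passes the cache from one step to the next."""
    tokens: List[int] = [sos]
    cache: List[int] = []
    for _ in range(max_len):
        scores, cache = step(tokens, cache)
        best = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(best)
        if best == eos:
            break
    return tokens


def toy_step(tokens: List[int], cache: List[int]) -> Tuple[List[float], List[int]]:
    """Hypothetical scorer: prefers last token + 1 and caches processed positions."""
    cache = cache + [tokens[-1]]  # only the newest position is processed
    scores = [0.0] * 4
    scores[min(tokens[-1] + 1, 3)] = 1.0
    return scores, cache


print(greedy_decode(toy_step, sos=0, eos=3, max_len=10))  # [0, 1, 2, 3]
```

In a beam search the same pattern applies per hypothesis, which is where batch_score and its list of states come in.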

inference()[source]

Return predicted output as a dict.

score(ys, state, x)[source]

Score.