espnet2.asr package

espnet2.asr.espnet_model

class espnet2.asr.espnet_model.ESPnetASRModel(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: Optional[espnet2.asr.decoder.abs_decoder.AbsDecoder], ctc: espnet2.asr.ctc.CTC, joint_network: Optional[torch.nn.modules.module.Module], aux_ctc: Optional[dict] = None, ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', transducer_multi_blank_durations: List = [], transducer_multi_blank_sigma: float = 0.05, sym_sos: str = '<sos/eos>', sym_eos: str = '<sos/eos>', extract_feats_in_collect_stats: bool = True, lang_token_id: int = -1)[source]

Bases: espnet2.train.abs_espnet_model.AbsESPnetModel

CTC-attention hybrid Encoder-Decoder model

batchify_nll(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor, batch_size: int = 100)[source]

Compute negative log likelihood (nll) from the transformer decoder

To avoid OOM, this function separates the input into batches, then calls nll for each batch and combines the results.

Parameters:
  • encoder_out – (Batch, Length, Dim)

  • encoder_out_lens – (Batch,)

  • ys_pad – (Batch, Length)

  • ys_pad_lens – (Batch,)

  • batch_size – int, number of samples in each batch when computing nll; you may change this to avoid OOM or to increase GPU memory usage
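
For illustration, a minimal sketch of this batching idea (batched_nll is a hypothetical helper, not the library implementation), assuming a model that exposes the nll() method documented below:

    import torch

    def batched_nll(model, encoder_out, encoder_out_lens, ys_pad, ys_pad_lens, batch_size=100):
        # split the utterances into chunks of `batch_size` and concatenate the per-chunk nll
        nlls = []
        for start in range(0, encoder_out.size(0), batch_size):
            end = start + batch_size
            nlls.append(
                model.nll(
                    encoder_out[start:end],
                    encoder_out_lens[start:end],
                    ys_pad[start:end],
                    ys_pad_lens[start:end],
                )
            )
        return torch.cat(nlls)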

collect_feats(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]
encode(speech: torch.Tensor, speech_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Frontend + Encoder. Note that this method is used by asr_inference.py

Parameters:
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch, )
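
A hedged usage sketch: model below is assumed to be an already-built ESPnetASRModel (e.g. restored from a training run) with a raw-audio frontend; shapes follow the docstring.

    import torch

    speech = torch.randn(2, 16000)                   # (Batch, Length) raw waveform
    speech_lengths = torch.tensor([16000, 12000])    # (Batch,)
    with torch.no_grad():
        encoder_out, encoder_out_lens = model.encode(speech, speech_lengths)  # model: assumed pre-built
    # encoder_out: (Batch, T', D); encoder_out_lens: (Batch,)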

forward(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Frontend + Encoder + Decoder + Calc loss

Parameters:
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch, )

  • text – (Batch, Length)

  • text_lengths – (Batch,)

  • kwargs – “utt_id” is among the inputs.

nll(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor) → torch.Tensor[source]

Compute negative log likelihood (nll) from the transformer decoder

Normally, this function is called in batchify_nll.

Parameters:
  • encoder_out – (Batch, Length, Dim)

  • encoder_out_lens – (Batch,)

  • ys_pad – (Batch, Length)

  • ys_pad_lens – (Batch,)

espnet2.asr.bayes_risk_ctc

espnet2.asr.discrete_asr_espnet_model

class espnet2.asr.discrete_asr_espnet_model.ESPnetDiscreteASRModel(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, ctc: Optional[espnet2.asr.ctc.CTC], ctc_weight: float = 0.5, interctc_weight: float = 0.0, src_vocab_size: int = 0, src_token_list: Union[Tuple[str, ...], List[str]] = [], ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_bleu: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', extract_feats_in_collect_stats: bool = True, share_decoder_input_output_embed: bool = False, share_encoder_decoder_input_embed: bool = False)[source]

Bases: espnet2.mt.espnet_model.ESPnetMTModel

Encoder-Decoder model

encode(src_text: torch.Tensor, src_text_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Frontend + Encoder. Note that this method is used by mt_inference.py

Parameters:
  • src_text – (Batch, Length, …)

  • src_text_lengths – (Batch, )

forward(text: torch.Tensor, text_lengths: torch.Tensor, src_text: torch.Tensor, src_text_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Frontend + Encoder + Decoder + Calc loss

Parameters:
  • text – (Batch, Length)

  • text_lengths – (Batch,)

  • src_text – (Batch, length)

  • src_text_lengths – (Batch,)

  • kwargs – “utt_id” is among the inputs.

espnet2.asr.__init__

espnet2.asr.maskctc_model

class espnet2.asr.maskctc_model.MaskCTCInference(asr_model: espnet2.asr.maskctc_model.MaskCTCModel, n_iterations: int, threshold_probability: float)[source]

Bases: torch.nn.modules.module.Module

Mask-CTC-based non-autoregressive inference

Initialize Mask-CTC inference

forward(enc_out: torch.Tensor) → List[espnet.nets.beam_search.Hypothesis][source]

Perform Mask-CTC inference

ids2text(ids: List[int])[source]
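
A hedged inference sketch: model is assumed to be a trained MaskCTCModel and enc_out a single utterance's encoder output (shape assumed (T, D)); n_iterations and threshold_probability are illustrative values.

    from espnet2.asr.maskctc_model import MaskCTCInference

    inference = MaskCTCInference(asr_model=model, n_iterations=10, threshold_probability=0.99)
    hyps = inference(enc_out)                         # List[Hypothesis]
    text = inference.ids2text(hyps[0].yseq.tolist())  # token ids -> readable text
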
class espnet2.asr.maskctc_model.MaskCTCModel(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: espnet2.asr.decoder.mlm_decoder.MLMDecoder, ctc: espnet2.asr.ctc.CTC, joint_network: Optional[torch.nn.modules.module.Module] = None, ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', sym_mask: str = '<mask>', extract_feats_in_collect_stats: bool = True)[source]

Bases: espnet2.asr.espnet_model.ESPnetASRModel

Hybrid CTC/Masked LM Encoder-Decoder model (Mask-CTC)

batchify_nll(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor, batch_size: int = 100)[source]

Compute negative log likelihood (nll) from the transformer decoder

To avoid OOM, this function separates the input into batches, then calls nll for each batch and combines the results.

Parameters:
  • encoder_out – (Batch, Length, Dim)

  • encoder_out_lens – (Batch,)

  • ys_pad – (Batch, Length)

  • ys_pad_lens – (Batch,)

  • batch_size – int, number of samples in each batch when computing nll; you may change this to avoid OOM or to increase GPU memory usage

forward(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Frontend + Encoder + Decoder + Calc loss

Parameters:
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch, )

  • text – (Batch, Length)

  • text_lengths – (Batch,)

nll(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor) → torch.Tensor[source]

Compute negative log likelihood (nll) from the transformer decoder

Normally, this function is called in batchify_nll.

Parameters:
  • encoder_out – (Batch, Length, Dim)

  • encoder_out_lens – (Batch,)

  • ys_pad – (Batch, Length)

  • ys_pad_lens – (Batch,)

espnet2.asr.ctc

class espnet2.asr.ctc.CTC(odim: int, encoder_output_size: int, dropout_rate: float = 0.0, ctc_type: str = 'builtin', reduce: bool = True, ignore_nan_grad: Optional[bool] = None, zero_infinity: bool = True, brctc_risk_strategy: str = 'exp', brctc_group_strategy: str = 'end', brctc_risk_factor: float = 0.0)[source]

Bases: torch.nn.modules.module.Module

CTC module.

Parameters:
  • odim – dimension of outputs

  • encoder_output_size – number of encoder projection units

  • dropout_rate – dropout rate (0.0 ~ 1.0)

  • ctc_type – builtin or gtnctc

  • reduce – reduce the CTC loss into a scalar

  • ignore_nan_grad – Same as zero_infinity (kept for backward compatibility)

  • zero_infinity – Whether to zero infinite losses and the associated gradients.

argmax(hs_pad)[source]

argmax of frame activations

Parameters:

hs_pad (torch.Tensor) – 3d tensor (B, Tmax, eprojs)

Returns:

argmax applied 2d tensor (B, Tmax)

Return type:

torch.Tensor

forward(hs_pad, hlens, ys_pad, ys_lens)[source]

Calculate CTC loss.

Parameters:
  • hs_pad – batch of padded hidden state sequences (B, Tmax, D)

  • hlens – batch of lengths of hidden state sequences (B)

  • ys_pad – batch of padded character id sequence tensor (B, Lmax)

  • ys_lens – batch of lengths of character sequence (B)
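
A minimal usage sketch based on the signatures documented here (dimensions are illustrative):

    import torch
    from espnet2.asr.ctc import CTC

    ctc = CTC(odim=50, encoder_output_size=256)   # 50 output tokens, 256-dim encoder projections
    hs_pad = torch.randn(2, 120, 256)             # (B, Tmax, D) encoder outputs
    hlens = torch.tensor([120, 90])               # (B,)
    ys_pad = torch.randint(1, 50, (2, 20))        # (B, Lmax) target token ids (0 is the blank)
    ys_lens = torch.tensor([20, 15])              # (B,)

    loss = ctc(hs_pad, hlens, ys_pad, ys_lens)    # scalar CTC loss (reduce=True)
    post = ctc.log_softmax(hs_pad)                # (B, Tmax, odim) frame-wise log-posteriors
    best = ctc.argmax(hs_pad)                     # (B, Tmax) greedy frame labels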

log_softmax(hs_pad)[source]

log_softmax of frame activations

Parameters:

hs_pad (Tensor) – 3d tensor (B, Tmax, eprojs)

Returns:

log softmax applied 3d tensor (B, Tmax, odim)

Return type:

torch.Tensor

loss_fn(th_pred, th_target, th_ilen, th_olen) → torch.Tensor[source]
softmax(hs_pad)[source]

softmax of frame activations

Parameters:

hs_pad (Tensor) – 3d tensor (B, Tmax, eprojs)

Returns:

softmax applied 3d tensor (B, Tmax, odim)

Return type:

torch.Tensor

espnet2.asr.pit_espnet_model

class espnet2.asr.pit_espnet_model.ESPnetASRModel(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: Optional[espnet2.asr.decoder.abs_decoder.AbsDecoder], ctc: espnet2.asr.ctc.CTC, joint_network: Optional[torch.nn.modules.module.Module], ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', sym_sos: str = '<sos/eos>', sym_eos: str = '<sos/eos>', extract_feats_in_collect_stats: bool = True, lang_token_id: int = -1, num_inf: int = 1, num_ref: int = 1)[source]

Bases: espnet2.asr.espnet_model.ESPnetASRModel

CTC-attention hybrid Encoder-Decoder model

forward(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Frontend + Encoder + Decoder + Calc loss

Parameters:
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch, )

  • text – (Batch, Length)

  • text_lengths – (Batch,)

  • kwargs – “utt_id” is among the inputs.

class espnet2.asr.pit_espnet_model.PITLossWrapper(criterion_fn: Callable, num_ref: int)[source]

Bases: espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper

forward(inf: torch.Tensor, inf_lens: torch.Tensor, ref: torch.Tensor, ref_lens: torch.Tensor, others: Dict = None)[source]

PITLoss Wrapper function. Similar to espnet2/enh/loss/wrapper/pit_solver.py

Parameters:
  • inf – Iterable[torch.Tensor], (batch, num_inf, …)

  • inf_lens – Iterable[torch.Tensor], (batch, num_inf, …)

  • ref – Iterable[torch.Tensor], (batch, num_ref, …)

  • ref_lens – Iterable[torch.Tensor], (batch, num_ref, …)

  • permute_inf – If true, permute the inference and inference_lens according to the optimal permutation.
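
As an illustration of the permutation-invariant idea this wrapper implements, a self-contained sketch follows (pit_min_loss is a hypothetical helper, simplified to a batch-level minimum rather than the wrapper's per-sample search):

    from itertools import permutations
    import torch

    def pit_min_loss(criterion_fn, inf, ref):
        """inf, ref: (batch, num_spk, ...); return the loss of the best speaker permutation."""
        num_spk = inf.size(1)
        losses = []
        for perm in permutations(range(num_spk)):
            # mean pairwise loss for this speaker assignment
            pair = [criterion_fn(inf[:, i], ref[:, p]) for i, p in enumerate(perm)]
            losses.append(torch.stack(pair).mean())
        return torch.stack(losses).min()

    # e.g. with an MSE criterion on (batch, num_spk, T) tensors:
    loss = pit_min_loss(lambda a, b: ((a - b) ** 2).mean(), torch.randn(2, 2, 100), torch.randn(2, 2, 100))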

classmethod permutate(perm, *args)[source]

espnet2.asr.frontend.fused

class espnet2.asr.frontend.fused.FusedFrontends(frontends=None, align_method='linear_projection', proj_dim=100, fs=16000)[source]

Bases: espnet2.asr.frontend.abs_frontend.AbsFrontend

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size() → int[source]

espnet2.asr.frontend.abs_frontend

class espnet2.asr.frontend.abs_frontend.AbsFrontend(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract output_size() → int[source]

espnet2.asr.frontend.s3prl

class espnet2.asr.frontend.s3prl.S3prlFrontend(fs: Union[int, str] = 16000, frontend_conf: Optional[dict] = {'badim': 320, 'bdropout_rate': 0.0, 'blayers': 3, 'bnmask': 2, 'bprojs': 320, 'btype': 'blstmp', 'bunits': 300, 'delay': 3, 'ref_channel': -1, 'taps': 5, 'use_beamformer': False, 'use_dnn_mask_for_wpe': True, 'use_wpe': False, 'wdropout_rate': 0.0, 'wlayers': 3, 'wprojs': 320, 'wtype': 'blstmp', 'wunits': 300}, download_dir: Optional[str] = None, multilayer_feature: bool = False, layer: int = -1)[source]

Bases: espnet2.asr.frontend.abs_frontend.AbsFrontend

Pretrained speech representation frontend structure for ASR.

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size() → int[source]
reload_pretrained_parameters()[source]

espnet2.asr.frontend.windowing

Sliding Window for raw audio input data.

class espnet2.asr.frontend.windowing.SlidingWindow(win_length: int = 400, hop_length: int = 160, channels: int = 1, padding: Optional[int] = None, fs=None)[source]

Bases: espnet2.asr.frontend.abs_frontend.AbsFrontend

Sliding Window.

Provides a sliding window over a batched continuous raw audio tensor. Optionally, provides padding (Currently not implemented). Combine this module with a pre-encoder compatible with raw audio data, for example Sinc convolutions.

Known issues: Output length is calculated incorrectly if the audio is shorter than win_length. WARNING: trailing values are discarded - padding is not implemented yet. There is currently no additional window function applied to input values.

Initialize.

Parameters:
  • win_length – Length of frame.

  • hop_length – Relative starting point of next frame.

  • channels – Number of input channels.

  • padding – Padding (placeholder, currently not implemented).

  • fs – Sampling rate (placeholder for compatibility, not used).

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Apply a sliding window on the input.

Parameters:
  • input – Input (B, T, C*D) or (B, T*C*D), with D=C=1.

  • input_lengths – Input lengths within batch.

Returns:

Output with dimensions (B, T, C, D), with D=win_length. Tensor: Output lengths within batch.

Return type:

Tensor

output_size() → int[source]

Return output length of feature dimension D, i.e. the window length.
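
A usage sketch following the shapes documented above:

    import torch
    from espnet2.asr.frontend.windowing import SlidingWindow

    frontend = SlidingWindow(win_length=400, hop_length=160)
    wav = torch.randn(2, 16000, 1)                # (B, T, C*D) raw audio with C = D = 1
    wav_lengths = torch.tensor([16000, 12000])    # (B,)
    frames, frame_lengths = frontend(wav, wav_lengths)
    # frames: (B, T', C, D) with D = win_length = frontend.output_size()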

espnet2.asr.frontend.__init__

espnet2.asr.frontend.asteroid_frontend

Asteroid filterbank frontend for raw audio input data.

class espnet2.asr.frontend.asteroid_frontend.AsteroidFrontend(sinc_filters: int = 256, sinc_kernel_size: int = 251, sinc_stride: int = 16, preemph_coef: float = 0.97, log_term: float = 1e-06)[source]

Bases: espnet2.asr.frontend.abs_frontend.AbsFrontend

Asteroid Filterbank Frontend.

Provides a Sinc-convolutional-based audio feature extractor. The same function can be achieved by using the sliding_window frontend + sinc preencoder.

NOTE(jiatong): this function is used in sentence-level classification tasks (e.g., spk). Other usages are not fully investigated.

NOTE(jeeweon): this function implements the parameterized analytic filterbank layer in M. Pariente, S. Cornell, A. Deleforge and E. Vincent, “Filterbank design for end-to-end speech separation,” in Proc. ICASSP, 2020

Initialize.

Parameters:
  • sinc_filters – the filter numbers for sinc.

  • sinc_kernel_size – the kernel size for sinc.

  • sinc_stride – the stride of the first sinc-conv layer, which determines the compression rate (Hz).

  • preemph_coef – the coefficient for pre-emphasis.

  • log_term – the log term to prevent infinity.

forward(input: torch.Tensor, input_length: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Apply the Asteroid filterbank frontend to the input.

Parameters:
  • input – Input (B, T).

  • input_length – Input length (B,).

Returns:

Frame-wise output (B, T’, D).

Return type:

Tensor

output_size() → int[source]

Return output length of feature dimension D.
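
A usage sketch (requires the asteroid_filterbanks dependency; shapes follow the docstring):

    import torch
    from espnet2.asr.frontend.asteroid_frontend import AsteroidFrontend

    frontend = AsteroidFrontend(sinc_filters=256, sinc_kernel_size=251, sinc_stride=16)
    wav = torch.randn(2, 16000)                   # (B, T) raw audio
    wav_lengths = torch.tensor([16000, 12000])    # (B,)
    feats, feat_lengths = frontend(wav, wav_lengths)
    # feats: (B, T', D) with D = frontend.output_size()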

espnet2.asr.frontend.melspec_torch

Torchaudio Mel-Spectrogram

class espnet2.asr.frontend.melspec_torch.MelSpectrogramTorch(preemp: bool = True, n_fft: int = 512, log: bool = False, win_length: int = 400, hop_length: int = 160, f_min: int = 20, f_max: int = 7600, n_mels: int = 80, window_fn: str = 'hamming', mel_scale: str = 'htk', normalize: Optional[str] = None)[source]

Bases: espnet2.asr.frontend.abs_frontend.AbsFrontend

Mel-Spectrogram using Torchaudio Implementation.

forward(input: torch.Tensor, input_length: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size() → int[source]

Return output length of feature dimension D.
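
A usage sketch (requires torchaudio; parameter values are illustrative):

    import torch
    from espnet2.asr.frontend.melspec_torch import MelSpectrogramTorch

    frontend = MelSpectrogramTorch(n_fft=512, win_length=400, hop_length=160, n_mels=80, log=True)
    wav = torch.randn(2, 16000)                   # (B, T) raw audio
    wav_lengths = torch.tensor([16000, 12000])    # (B,)
    feats, feat_lengths = frontend(wav, wav_lengths)
    # feats: (B, T', n_mels) (log-)mel spectrogram features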

espnet2.asr.frontend.default

class espnet2.asr.frontend.default.DefaultFrontend(fs: Union[int, str] = 16000, n_fft: int = 512, win_length: Optional[int] = None, hop_length: int = 128, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True, n_mels: int = 80, fmin: Optional[int] = None, fmax: Optional[int] = None, htk: bool = False, frontend_conf: Optional[dict] = {'badim': 320, 'bdropout_rate': 0.0, 'blayers': 3, 'bnmask': 2, 'bprojs': 320, 'btype': 'blstmp', 'bunits': 300, 'delay': 3, 'ref_channel': -1, 'taps': 5, 'use_beamformer': False, 'use_dnn_mask_for_wpe': True, 'use_wpe': False, 'wdropout_rate': 0.0, 'wlayers': 3, 'wprojs': 320, 'wtype': 'blstmp', 'wunits': 300}, apply_stft: bool = True)[source]

Bases: espnet2.asr.frontend.abs_frontend.AbsFrontend

Conventional frontend structure for ASR.

Stft -> WPE -> MVDR-Beamformer -> Power-spec -> Log-Mel-Fbank

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size() → int[source]
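
A usage sketch of the conventional frontend with its default settings (the WPE/beamformer stages are disabled by default, per frontend_conf above):

    import torch
    from espnet2.asr.frontend.default import DefaultFrontend

    frontend = DefaultFrontend(fs=16000, n_fft=512, hop_length=128, n_mels=80)
    wav = torch.randn(2, 16000)                   # (B, T) raw waveform
    wav_lengths = torch.tensor([16000, 12000])    # (B,)
    feats, feat_lengths = frontend(wav, wav_lengths)
    # feats: (B, T', n_mels) log-mel filterbank features; frontend.output_size() == 80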

espnet2.asr.frontend.whisper

class espnet2.asr.frontend.whisper.WhisperFrontend(whisper_model: str = 'small', freeze_weights: bool = True, download_dir: Optional[str] = None)[source]

Bases: espnet2.asr.frontend.abs_frontend.AbsFrontend

Speech Representation Using Encoder Outputs from OpenAI’s Whisper Model:

URL: https://github.com/openai/whisper

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

log_mel_spectrogram(audio: torch.Tensor, ilens: torch.Tensor = None) → torch.Tensor[source]
output_size() → int[source]
whisper_encode(input: torch.Tensor, ilens: torch.Tensor = None) → torch.Tensor[source]

espnet2.asr.state_spaces.s4

Standalone version of Structured (Sequence) State Space (S4) model.

class espnet2.asr.state_spaces.s4.OptimModule(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module

Interface for Module that allows registering buffers/parameters with configurable optimizer hyperparameters.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

register(name, tensor, lr=None)[source]

Register a tensor with a configurable learning rate and 0 weight decay.

class espnet2.asr.state_spaces.s4.S4(d_model, d_state=64, l_max=None, channels=1, bidirectional=False, activation='gelu', postact='glu', hyper_act=None, dropout=0.0, tie_dropout=False, bottleneck=None, gate=None, transposed=True, verbose=False, **kernel_args)[source]

Bases: torch.nn.modules.module.Module

Initialize S4 module.

Parameters:
  • d_state – the dimension of the state, also denoted by N

  • l_max – the maximum kernel length, also denoted by L. Set l_max=None to always use a global kernel

  • channels – can be interpreted as a number of “heads”; the SSM is a map from a 1-dim to C-dim sequence. It’s not recommended to change this unless desperate for things to tune; instead, increase d_model for larger models

  • bidirectional – if True, the convolution kernel will be two-sided

  • activation – activation in between SS and FF

  • postact – activation after FF

  • hyper_act – use a “hypernetwork” multiplication (experimental)

  • dropout – standard dropout argument. tie_dropout=True ties the dropout mask across the sequence length, emulating nn.Dropout1d

  • transposed – choose backbone axis ordering of (B, L, H) (if False) or (B, H, L) (if True) [B=batch size, L=sequence length, H=hidden dimension]

  • gate – add gated activation (GSS)

  • bottleneck – reduce SSM dimension (GSS)

See the class SSKernel for the kernel constructor which accepts kernel_args. Relevant options that are worth considering and tuning include “mode” + “measure”, “dt_min”, “dt_max”, and “lr”.

Other options are all experimental and should not need to be configured.

property d_output
default_state(*batch_shape, device=None)[source]
forward(u, state=None, rate=1.0, lengths=None, **kwargs)[source]

Forward pass.

u: (B H L) if self.transposed else (B L H)
state: (H N) never needed unless you know what you’re doing

Returns: same shape as u

setup_step(**kwargs)[source]
step(u, state, **kwargs)[source]

Step one time step as a recurrent model.

Intended to be used during validation.

u: (B H)
state: (B H N)

Returns: output (B H), state (B H N)
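
A hedged smoke-test sketch of the S4 layer, assuming forward returns an (output, state) pair as in the reference standalone S4 implementation:

    import torch
    from espnet2.asr.state_spaces.s4 import S4

    layer = S4(d_model=64, d_state=64, l_max=None, transposed=True)  # expects (B, H, L) inputs
    u = torch.randn(2, 64, 100)                                      # batch=2, H=64, L=100
    y, state = layer(u)                                              # y has the same shape as u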

class espnet2.asr.state_spaces.s4.SSKernel(H, N=64, L=None, measure='legs', rank=1, channels=1, dt_min=0.001, dt_max=0.1, deterministic=False, lr=None, mode='nplr', n_ssm=None, verbose=False, measure_args={}, **kernel_args)[source]

Bases: torch.nn.modules.module.Module

Wrapper around SSKernel parameterizations.

The SSKernel is expected to support the interface forward() default_state() _setup_step() step()

State Space Kernel which computes the convolution kernel $\bar{K}$.

Parameters:
  • H – Number of independent SSM copies; controls the size of the model. Also called d_model in the config.

  • N – State size (dimensionality of parameters A, B, C). Also called d_state in the config. Generally shouldn’t need to be adjusted and doesn’t affect speed much.

  • L – Maximum length of convolution kernel, if known. Should work in the majority of cases even if not known.

  • measure – Options for initialization of (A, B). For NPLR mode, recommendations are “legs”, “fout”, “hippo” (combination of both). For Diag mode, recommendations are “diag-inv”, “diag-lin”, “diag-legs”, and “diag” (combination of diag-inv and diag-lin).

  • rank – Rank of low-rank correction for NPLR mode. Needs to be increased for measure “legt”.

  • channels – C channels turns the SSM from a 1-dim to C-dim map; can think of it as having C separate “heads” per SSM. This was partly a feature to make it easier to implement bidirectionality; it is recommended to set channels=1 and adjust H to control parameters instead.

  • dt_min, dt_max – min and max values for the step size dt (Delta).

  • mode – Which kernel algorithm to use. ‘nplr’ is the full S4 model; ‘diag’ is the simpler S4D; ‘slow’ is a dense version for testing.

  • n_ssm – Number of independent trainable (A, B) SSMs, e.g. n_ssm=1 means all A/B parameters are tied across the H different instantiations of C. n_ssm=None means all H SSMs are completely independent. Generally, changing this option can save parameters but doesn’t affect performance or speed much. This parameter must divide H.

  • lr – Passing in a number (e.g. 0.001) sets attributes of SSM parameters (A, B, dt). A custom optimizer hook is needed to configure the optimizer to set the learning rates appropriately for these parameters.

default_state(*args, **kwargs)[source]
forward(state=None, L=None, rate=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

forward_state(u, state)[source]

Forward the state through a sequence.

i.e. computes the state after passing chunk through SSM

state: (B, H, N)
u: (B, H, L)

Returns: (B, H, N)

step(u, state, **kwargs)[source]
class espnet2.asr.state_spaces.s4.SSKernelDiag(A, B, C, log_dt, L=None, disc='bilinear', real_type='exp', lr=None, bandlimit=None)[source]

Bases: espnet2.asr.state_spaces.s4.OptimModule

Version using (complex) diagonal state matrix (S4D).

default_state(*batch_shape)[source]
forward(L, state=None, rate=1.0, u=None)[source]

Forward pass.

state: (B, H, N) initial state
rate: sampling rate factor
L: target length

returns:
(C, H, L) convolution kernel (generally C=1)
(B, H, L) output from initial state

forward_state(u, state)[source]
step(u, state)[source]
class espnet2.asr.state_spaces.s4.SSKernelNPLR(w, P, B, C, log_dt, L=None, lr=None, verbose=False, keops=False, real_type='exp', real_tolerance=0.001, bandlimit=None)[source]

Bases: espnet2.asr.state_spaces.s4.OptimModule

Stores a representation of and computes the SSKernel function.

K_L(A^dt, B^dt, C) corresponding to a discretized state space, where A is Normal + Low Rank (NPLR)

Initialize kernel.

L: Maximum length; this module computes an SSM kernel of length L.

A is represented by diag(w) - PP^*:
  w: (S, N) diagonal part
  P: (R, S, N) low-rank part
  B: (S, N)
  C: (C, H, N)
  dt: (H) timescale per feature
  lr: [dict | float | None] hook to set lr of special parameters (A, B, dt)

Dimensions:
  N (or d_state): state size
  H (or d_model): total SSM copies
  S (or n_ssm): number of trainable copies of (A, B, dt); must divide H
  R (or rank): rank of low-rank part
  C (or channels): system is 1-dim to C-dim

The forward pass of this Module returns a tensor of shape (C, H, L)

Note: tensor shape N here denotes half the true state size,

because of conjugate symmetry

default_state(*batch_shape)[source]
forward(state=None, rate=1.0, L=None)[source]

Forward pass.

state: (B, H, N) initial state
rate: sampling rate factor
L: target length

returns:
(C, H, L) convolution kernel (generally C=1)
(B, H, L) output from initial state

step(u, state)[source]

Step one time step as a recurrent model.

Must have called self._setup_step() and created state with self.default_state() before calling this

espnet2.asr.state_spaces.s4.cauchy_naive(v, z, w)[source]

Naive version.

v, w: (…, N)
z: (…, L)
returns: (…, L)

espnet2.asr.state_spaces.s4.combination(measures, N, R, S, **ssm_args)[source]
espnet2.asr.state_spaces.s4.dplr(scaling, N, rank=1, H=1, dtype=torch.float32, real_scale=1.0, imag_scale=1.0, random_real=False, random_imag=False, normalize=False, diagonal=True, random_B=False)[source]
espnet2.asr.state_spaces.s4.get_logger(name='espnet2.asr.state_spaces.s4', level=20) → logging.Logger[source]

Initialize multi-GPU-friendly python logger.

espnet2.asr.state_spaces.s4.log = <Logger espnet2.asr.state_spaces.s4 (INFO)>[source]

Cauchy and Vandermonde kernels

espnet2.asr.state_spaces.s4.log_vandermonde(v, x, L)[source]

Compute Vandermonde product.

v: (…, N)
x: (…, N)
returns: (…, L) sum v x^l

espnet2.asr.state_spaces.s4.log_vandermonde_transpose(u, v, x, L)[source]
espnet2.asr.state_spaces.s4.nplr(measure, N, rank=1, dtype=torch.float32, diagonalize_precision=True)[source]

Decompose as Normal Plus Low-Rank (NPLR).

Return w, p, q, V, B such that (w - p q^*, B) is unitarily equivalent to the original HiPPO A, B by the matrix V i.e. A = V[w - p q^*]V^*, B = V B

espnet2.asr.state_spaces.s4.power(L, A, v=None)[source]

Compute A^L and the scan sum_i A^i v_i.

A: (…, N, N)
v: (…, N, L)

espnet2.asr.state_spaces.s4.rank_correction(measure, N, rank=1, dtype=torch.float32)[source]

Return low-rank matrix L such that A + L is normal.

espnet2.asr.state_spaces.s4.rank_zero_only(fn: Callable) → Callable[source]

Decorator function from PyTorch Lightning.

Function that can be used as a decorator to enable a function/method being called only on global rank 0.

espnet2.asr.state_spaces.s4.ssm(measure, N, R, H, **ssm_args)[source]

Dispatcher to create single SSM initialization.

N: state size
R: rank (for DPLR parameterization)
H: number of independent SSM copies

espnet2.asr.state_spaces.s4.transition(measure, N)[source]

A, B transition matrices for different measures.

espnet2.asr.state_spaces.base

class espnet2.asr.state_spaces.base.SequenceIdentity(*args, transposed=False, **kwargs)[source]

Bases: espnet2.asr.state_spaces.base.SequenceIdentity

Simple SequenceModule for testing purposes.

forward(x, state=None, **kwargs)[source]
class espnet2.asr.state_spaces.base.SequenceModule(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module

Abstract sequence model class.

All models must adhere to this interface

A SequenceModule is generally a model that transforms an input of shape (n_batch, l_sequence, d_model) to (n_batch, l_sequence, d_output)

REQUIRED methods and attributes:
  forward, d_model, d_output: controls standard forward pass, a sequence-to-sequence transformation

__init__ should also satisfy the following interface; see SequenceIdentity for an example:

def __init__(self, d_model, transposed=False, **kwargs)

OPTIONAL methods:
  default_state, step: allows stepping the model recurrently with a hidden state
  state_to_tensor, d_state: allows decoding from hidden state

Initializes internal Module state, shared by both nn.Module and ScriptModule.

property d_model

Model dimension (generally same as input dimension).

This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, encoder) to track the internal shapes of the full model.

property d_output

Output dimension of model.

This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.

property d_state

Return dimension of output of self.state_to_tensor.

default_state(*batch_shape, device=None)[source]

Create initial state for a batch of inputs.

forward(x, state=None, **kwargs)[source]

Forward pass.

A sequence-to-sequence transformation with an optional state.

Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)

Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well

property state_to_tensor

Return a function mapping a state to a single tensor.

This method should be implemented if one wants to use the hidden state instead of the output sequence for final prediction. Currently only used with the StateDecoder.

step(x, state=None, **kwargs)[source]

Step the model recurrently for one step of the input sequence.

For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.
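
An illustrative sketch of the required interface (MyPointwise is hypothetical; it assumes d_model/d_output can be assigned as in the reference state-spaces SequenceModule):

    import torch
    from espnet2.asr.state_spaces.base import SequenceModule

    class MyPointwise(SequenceModule):
        def __init__(self, d_model, transposed=False, **kwargs):
            super().__init__()
            self.d_model = d_model     # required attribute
            self.d_output = d_model    # required attribute
            self.linear = torch.nn.Linear(d_model, d_model)

        def forward(self, x, state=None, **kwargs):
            # (batch, length, d_model) -> (batch, length, d_output), plus a pass-through state
            return self.linear(x), state

    y, _ = MyPointwise(16)(torch.randn(2, 10, 16))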

espnet2.asr.state_spaces.base.TransposedModule(module)[source]

Transpose module.

Wrap a SequenceModule class to accept transposed parameter, handle state, absorb kwargs

espnet2.asr.state_spaces.components

espnet2.asr.state_spaces.components.Activation(activation=None, size=None, dim=-1)[source]
class espnet2.asr.state_spaces.components.DropoutNd(p: float = 0.5, tie=True, transposed=True)[source]

Bases: torch.nn.modules.module.Module

Initialize dropout module.

tie: tie dropout mask across sequence lengths (Dropout1d/2d/3d)

forward(X)[source]

Forward pass.

X: (batch, dim, lengths…)
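
A small usage sketch of tied dropout over the length axis (values are illustrative):

    import torch
    from espnet2.asr.state_spaces.components import DropoutNd

    drop = DropoutNd(p=0.25, tie=True, transposed=True)
    x = torch.randn(2, 16, 100)    # (batch, dim, length)
    y = drop(x)                    # in training mode, one mask is shared across the length axis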

espnet2.asr.state_spaces.components.LinearActivation(d_input, d_output, bias=True, zero_bias_init=False, transposed=False, initializer=None, activation=None, activate=False, weight_norm=False, **kwargs)[source]

Return a linear module, initialization, and activation.

class espnet2.asr.state_spaces.components.Normalization(d, transposed=False, _name_='layer', **kwargs)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

step(x, **kwargs)[source]
class espnet2.asr.state_spaces.components.ReversibleInstanceNorm1dInput(d, transposed=False)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.asr.state_spaces.components.ReversibleInstanceNorm1dOutput(norm_input)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.asr.state_spaces.components.SquaredReLU(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.asr.state_spaces.components.StochasticDepth(p: float, mode: str)[source]

Bases: torch.nn.modules.module.Module

Stochastic depth module.

See stochastic_depth().

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.asr.state_spaces.components.TSInverseNormalization(method, normalizer)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.asr.state_spaces.components.TSNormalization(method, horizon)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.asr.state_spaces.components.TransposedLN(d, scalar=True)[source]

Bases: torch.nn.modules.module.Module

Transposed LayerNorm module.

LayerNorm module over the second dimension. Assumes shape (B, D, L), where L can be 1 or more axes.

This is slow and a dedicated CUDA/Triton implementation should provide substantial end-to-end speedup

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.asr.state_spaces.components.TransposedLinear(d_input, d_output, bias=True)[source]

Bases: torch.nn.modules.module.Module

Transposed linear module.

Linear module on the second-to-last dimension. Assumes shape (B, D, L), where L can be 1 or more axes.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.asr.state_spaces.components.get_initializer(name, activation=None)[source]
espnet2.asr.state_spaces.components.stochastic_depth(input: torch._VariableFunctionsClass.tensor, p: float, mode: str, training: bool = True)[source]

Apply stochastic depth.

Implements the Stochastic Depth from “Deep Networks with Stochastic Depth” used for randomly dropping residual branches of residual architectures.

Parameters:
  • input (Tensor[N, ..]) – The input tensor of arbitrary dimensions with the first one being its batch, i.e. a batch with N rows.

  • p (float) – probability of the input to be zeroed.

  • mode (str) – "batch" or "row". "batch" randomly zeroes the entire input, "row" zeroes randomly selected rows from the batch.

  • training – apply stochastic depth if is True. Default: True

Returns:

The randomly zeroed tensor.

Return type:

Tensor[N, ..]
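
A small example of the documented function in “row” mode:

    import torch
    from espnet2.asr.state_spaces.components import stochastic_depth

    x = torch.ones(4, 8)                                        # a batch of 4 rows
    out = stochastic_depth(x, p=0.5, mode="row", training=True)
    # on average, half of the rows are zeroed during training;
    # with training=False the input is returned unchanged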

espnet2.asr.state_spaces.residual

Implementations of different types of residual functions.

class espnet2.asr.state_spaces.residual.Affine(*args, scalar=True, gamma=0.0, **kwargs)[source]

Bases: espnet2.asr.state_spaces.residual.Residual

Residual connection with learnable scalar multipliers on the main branch.

scalar: Single scalar multiplier, or one per dimension
scale, power: Initialize to scale * layer_num**(-power)

forward(x, y, transposed)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.asr.state_spaces.residual.DecayResidual(*args, power=0.5, l2=True)[source]

Bases: espnet2.asr.state_spaces.residual.Residual

Residual connection that can decay the linear combination depending on depth.

forward(x, y, transposed)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.asr.state_spaces.residual.Feedforward(*args)[source]

Bases: espnet2.asr.state_spaces.residual.Residual

class espnet2.asr.state_spaces.residual.Highway(*args, scaling_correction=False, elemwise=False)[source]

Bases: espnet2.asr.state_spaces.residual.Residual

forward(x, y, transposed=False)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.asr.state_spaces.residual.Residual(i_layer, d_input, d_model, alpha=1.0, beta=1.0)[source]

Bases: torch.nn.modules.module.Module

Residual connection with constant affine weights.

Can simulate standard residual, no residual, and “constant gates”.

property d_output
forward(x, y, transposed)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.asr.state_spaces.pool

Implements downsampling and upsampling on sequences.

class espnet2.asr.state_spaces.pool.DownAvgPool(d_input, stride=1, expand=1, transposed=True)[source]

Bases: espnet2.asr.state_spaces.base.SequenceModule

property d_output

Output dimension of model.

This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.

forward(x)[source]

Forward pass.

A sequence-to-sequence transformation with an optional state.

Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)

Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well

step(x, state, **kwargs)[source]

Step the model recurrently for one step of the input sequence.

For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.

class espnet2.asr.state_spaces.pool.DownLinearPool(d_input, stride=1, expand=1, transposed=True)[source]

Bases: espnet2.asr.state_spaces.base.SequenceModule

property d_output

Output dimension of model.

This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.

forward(x)[source]

Forward pass.

A sequence-to-sequence transformation with an optional state.

Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)

Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well

step(x, state, **kwargs)[source]

Step the model recurrently for one step of the input sequence.

For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.

class espnet2.asr.state_spaces.pool.DownPool(d_input, d_output=None, expand=None, stride=1, transposed=True, weight_norm=True, initializer=None, activation=None)[source]

Bases: espnet2.asr.state_spaces.base.SequenceModule

default_state(*batch_shape, device=None)[source]

Create initial state for a batch of inputs.

forward(x)[source]

Forward pass.

A sequence-to-sequence transformation with an optional state.

Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)

Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well

step(x, state, **kwargs)[source]

Step one time step as a recurrent model.

x: (…, H)

class espnet2.asr.state_spaces.pool.DownPool2d(d_input, d_output, stride=1, transposed=True, weight_norm=True)[source]

Bases: espnet2.asr.state_spaces.base.SequenceModule

forward(x)[source]

Forward pass.

A sequence-to-sequence transformation with an optional state.

Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)

Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well

class espnet2.asr.state_spaces.pool.DownSample(d_input, stride=1, expand=1, transposed=True)[source]

Bases: espnet2.asr.state_spaces.base.SequenceModule

property d_output

Output dimension of model.

This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.

forward(x)[source]

Forward pass.

A sequence-to-sequence transformation with an optional state.

Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)

Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well

step(x, state, **kwargs)[source]

Step the model recurrently for one step of the input sequence.

For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.

class espnet2.asr.state_spaces.pool.DownSpectralPool(d_input, stride=1, expand=1, transposed=True)[source]

Bases: espnet2.asr.state_spaces.base.SequenceModule

property d_output

Output dimension of model.

This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.

forward(x)[source]

Forward pass.

x: (B, L…, D)

step(x, state, **kwargs)[source]

Step the model recurrently for one step of the input sequence.

For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.

class espnet2.asr.state_spaces.pool.UpPool(d_input, d_output, stride, transposed=True, weight_norm=True, initializer=None, activation=None)[source]

Bases: espnet2.asr.state_spaces.base.SequenceModule

property d_output

Output dimension of model.

This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.

default_state(*batch_shape, device=None)[source]

Create initial state for a batch of inputs.

forward(x, skip=None)[source]

Forward pass.

A sequence-to-sequence transformation with an optional state.

Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)

Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well

step(x, state, **kwargs)[source]

Step one time step as a recurrent model.

x: (…, H)

class espnet2.asr.state_spaces.pool.UpSample(d_input, stride=1, expand=1, transposed=True)[source]

Bases: torch.nn.modules.module.Module

property d_output
forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

step(x, state, **kwargs)[source]
espnet2.asr.state_spaces.pool.downsample(x, stride=1, expand=1, transposed=False)[source]
espnet2.asr.state_spaces.pool.upsample(x, stride=1, expand=1, transposed=False)[source]

espnet2.asr.state_spaces.registry

espnet2.asr.state_spaces.attention

Multi-Head Attention layer definition.

class espnet2.asr.state_spaces.attention.MultiHeadedAttention(n_feat, n_head, dropout=0.0, transposed=False, **kwargs)[source]

Bases: espnet2.asr.state_spaces.base.SequenceModule

Multi-Head Attention layer inheriting SequenceModule.

Compared to the default MHA module in ESPnet, this module returns an additional dummy state and has a step function for autoregressive inference.

Parameters:
  • n_head (int) – The number of heads.

  • n_feat (int) – The number of features.

  • dropout_rate (float) – Dropout rate.

Construct a MultiHeadedAttention object.

forward(query, memory=None, mask=None, *args, **kwargs)[source]

Compute scaled dot product attention.

Parameters:
  • query (torch.Tensor) – Query tensor (#batch, time1, size).

  • key (torch.Tensor) – Key tensor (#batch, time2, size).

  • value (torch.Tensor) – Value tensor (#batch, time2, size).

  • mask (torch.Tensor) – Mask tensor (#batch, 1, time2) or (#batch, time1, time2).

Returns:

Output tensor (#batch, time1, d_model).

Return type:

torch.Tensor
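
A hedged self-attention sketch (memory is passed explicitly; per the class note above, the module is assumed to return a dummy state alongside the output):

    import torch
    from espnet2.asr.state_spaces.attention import MultiHeadedAttention

    mha = MultiHeadedAttention(n_feat=256, n_head=4, dropout=0.1)
    x = torch.randn(2, 50, 256)        # (batch, time, n_feat)
    out, state = mha(x, memory=x)      # out: (batch, time, n_feat); state is a dummy placeholder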

forward_attention(value, scores, mask)[source]

Compute attention context vector.

Parameters:
  • value (torch.Tensor) – Transformed value (#batch, n_head, time2, d_k).

  • scores (torch.Tensor) – Attention score (#batch, n_head, time1, time2).

  • mask (torch.Tensor) – Mask (#batch, 1, time2) or (#batch, time1, time2).

Returns:

Transformed value (#batch, time1, d_model)

weighted by the attention score (#batch, time1, time2).

Return type:

torch.Tensor

forward_qkv(query, key, value)[source]

Transform query, key and value.

Parameters:
  • query (torch.Tensor) – Query tensor (#batch, time1, size).

  • key (torch.Tensor) – Key tensor (#batch, time2, size).

  • value (torch.Tensor) – Value tensor (#batch, time2, size).

Returns:

Transformed query tensor (#batch, n_head, time1, d_k). torch.Tensor: Transformed key tensor (#batch, n_head, time2, d_k). torch.Tensor: Transformed value tensor (#batch, n_head, time2, d_k).

Return type:

torch.Tensor

step(query, state, memory=None, mask=None, **kwargs)[source]

Step the model recurrently for one step of the input sequence.

For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.

espnet2.asr.state_spaces.__init__

Initialize sub package.

espnet2.asr.state_spaces.ff

Implementation of FFN block in the style of Transformers.

class espnet2.asr.state_spaces.ff.FF(d_input, expand=2, d_output=None, transposed=False, activation='gelu', initializer=None, dropout=0.0, tie_dropout=False)[source]

Bases: espnet2.asr.state_spaces.base.SequenceModule

forward(x, *args, **kwargs)[source]

Forward pass.

A sequence-to-sequence transformation with an optional state.

Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)

Additionally, it returns a “state” which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well

step(x, state, **kwargs)[source]

Step the model recurrently for one step of the input sequence.

For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.

espnet2.asr.state_spaces.utils

Utilities for dealing with collection objects (lists, dicts) and configs.

espnet2.asr.state_spaces.utils.extract_attrs_from_obj(obj, *attrs)[source]
espnet2.asr.state_spaces.utils.get_class(registry, _name_)[source]
espnet2.asr.state_spaces.utils.instantiate(registry, config, *args, partial=False, wrap=None, **kwargs)[source]

Instantiate registered module.

registry: Dictionary mapping names to functions or target paths (e.g. {‘model’: ‘models.SequenceModel’})

config: Dictionary with a ‘_name_’ key indicating which element of the registry to grab, and kwargs to be passed into the target constructor

wrap: wrap the target class (e.g. ema optimizer or tasks.wrap)

*args, **kwargs: additional arguments to override the config to pass into the target constructor
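
A toy illustration of the registry/config pattern described above (the “linear” registry entry is hypothetical and not part of espnet2):

    import torch
    from espnet2.asr.state_spaces.utils import instantiate

    registry = {"linear": torch.nn.Linear}                             # name -> callable (or import path)
    config = {"_name_": "linear", "in_features": 8, "out_features": 4}
    layer = instantiate(registry, config)                              # builds torch.nn.Linear(8, 4)
    print(layer(torch.randn(2, 8)).shape)                              # torch.Size([2, 4])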

espnet2.asr.state_spaces.utils.is_dict(x)[source]
espnet2.asr.state_spaces.utils.is_list(x)[source]
espnet2.asr.state_spaces.utils.omegaconf_filter_keys(d, fn=None)[source]

Only keep keys where fn(key) is True. Support nested DictConfig.

espnet2.asr.state_spaces.utils.to_dict(x, recursive=True)[source]

Convert Sequence or Mapping object to dict.

lists get converted to {0: x[0], 1: x[1], …}

espnet2.asr.state_spaces.utils.to_list(x, recursive=False)[source]

Convert an object to list.

If Sequence (e.g. list, tuple, ListConfig): just return it

Special case: If non-recursive and not a list, wrap in list

espnet2.asr.state_spaces.model

class espnet2.asr.state_spaces.model.SequenceModel(d_model, n_layers=1, transposed=False, dropout=0.0, tie_dropout=False, prenorm=True, n_repeat=1, layer=None, residual=None, norm=None, pool=None, track_norms=True, dropinp=0.0, drop_path=0.0)[source]

Bases: espnet2.asr.state_spaces.base.SequenceModule

Isotropic deep sequence model backbone, in the style of ResNets / Transformers.

The SequenceModel class implements a generic (batch, length, d_input) -> (batch, length, d_output) transformation

Parameters:
  • d_model – Resize input (useful for deep models with residuals)

  • n_layers – Number of layers

  • transposed – Transpose inputs so each layer receives (batch, dim, length)

  • dropout – Dropout parameter applied on every residual and every layer

  • tie_dropout – Tie dropout mask across sequence like nn.Dropout1d/nn.Dropout2d

  • prenorm – Pre-norm vs. post-norm

  • n_repeat – Each layer is repeated n times per stage before applying pooling

  • layer – Layer config, must be specified

  • residual – Residual config

  • norm – Normalization config (e.g. layer vs batch)

  • pool – Config for pooling layer per stage

  • track_norms – Log norms of each layer output

  • dropinp – Input dropout

  • drop_path – Stochastic depth for each residual path

property d_state

Return dimension of output of self.state_to_tensor.

default_state(*batch_shape, device=None)[source]

Create initial state for a batch of inputs.

forward(inputs, *args, state=None, **kwargs)[source]

Forward pass.

A sequence-to-sequence transformation with an optional state.

Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)

Additionally, it returns a “state”, which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.

property state_to_tensor

Return a function mapping a state to a single tensor.

This method should be implemented if one wants to use the hidden state instead of the output sequence for final prediction. Currently only used with the StateDecoder.

step(x, state, **kwargs)[source]

Step the model recurrently for one step of the input sequence.

For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.

espnet2.asr.state_spaces.cauchy

espnet2.asr.state_spaces.block

Implements a full residual block around a black box layer.

Configurable options include:
  • normalization position: prenorm or postnorm

  • normalization type: batchnorm, layernorm, etc.

  • subsampling/pooling

  • residual options: feedforward, residual, affine scalars, depth-dependent scaling, etc.

class espnet2.asr.state_spaces.block.SequenceResidualBlock(d_input, i_layer=None, prenorm=True, dropout=0.0, tie_dropout=False, transposed=False, layer=None, residual=None, norm=None, pool=None, drop_path=0.0)[source]

Bases: espnet2.asr.state_spaces.base.SequenceModule

Residual block wrapper for black box layer.

The SequenceResidualBlock class implements a generic (batch, length, d_input) -> (batch, length, d_input) transformation

Parameters:
  • d_input – Input feature dimension

  • i_layer – Layer index, only needs to be passed into certain residuals like Decay

  • dropout – Dropout for black box module

  • tie_dropout – Tie dropout mask across sequence like nn.Dropout1d/nn.Dropout2d

  • transposed – Transpose inputs so each layer receives (batch, dim, length)

  • layer – Config for black box module

  • residual – Config for residual function

  • norm – Config for normalization layer

  • pool – Config for pooling layer per stage

  • drop_path – Drop ratio for stochastic depth

property d_output

Output dimension of model.

This attribute is required for all SequenceModule instantiations. It is used by the rest of the pipeline (e.g. model backbone, decoder) to track the internal shapes of the full model.

property d_state

Return dimension of output of self.state_to_tensor.

default_state(*args, **kwargs)[source]

Create initial state for a batch of inputs.

forward(x, state=None, **kwargs)[source]

Forward pass.

A sequence-to-sequence transformation with an optional state.

Generally, this should map a tensor of shape (batch, length, self.d_model) to (batch, length, self.d_output)

Additionally, it returns a “state”, which can be any additional information. For example, RNN and SSM layers may return their hidden state, while some types of transformer layers (e.g. Transformer-XL) may want to pass a state as well.

property state_to_tensor

Return a function mapping a state to a single tensor.

This method should be implemented if one wants to use the hidden state instead of the output sequence for final prediction. Currently only used with the StateDecoder.

step(x, state, **kwargs)[source]

Step the model recurrently for one step of the input sequence.

For example, this should correspond to unrolling an RNN for one step. If the forward pass has signature (B, L, H1) -> (B, L, H2), this method should generally have signature (B, H1) -> (B, H2) with an optional recurrent state.

espnet2.asr.preencoder.sinc

Sinc convolutions for raw audio input.

class espnet2.asr.preencoder.sinc.LightweightSincConvs(fs: Union[int, str, float] = 16000, in_channels: int = 1, out_channels: int = 256, activation_type: str = 'leakyrelu', dropout_type: str = 'dropout', windowing_type: str = 'hamming', scale_type: str = 'mel')[source]

Bases: espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder

Lightweight Sinc Convolutions.

Instead of using precomputed features, end-to-end speech recognition can also be done directly from raw audio using sinc convolutions, as described in “Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions” by Kürzinger et al. https://arxiv.org/abs/2010.07597

To use Sinc convolutions in your model instead of the default f-bank frontend, set this module as your pre-encoder with preencoder: sinc and feed it the output of the sliding-window frontend with frontend: sliding_window in your YAML configuration file, so that the processing flow is:

Frontend (SlidingWindow) -> SpecAug -> Normalization -> Pre-encoder (LightweightSincConvs) -> Encoder -> Decoder

Note that this method also performs data augmentation in time domain (vs. in spectral domain in the default frontend). Use plot_sinc_filters.py to visualize the learned Sinc filters.

Initialize the module.

Parameters:
  • fs – Sample rate.

  • in_channels – Number of input channels.

  • out_channels – Number of output channels (for each input channel).

  • activation_type – Choice of activation function.

  • dropout_type – Choice of dropout function.

  • windowing_type – Choice of windowing function.

  • scale_type – Choice of filter-bank initialization scale.

espnet_initialization_fn()[source]

Initialize sinc filters with filterbank values.

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Apply Lightweight Sinc Convolutions.

The input shall be formatted as (B, T, C_in, D_in) with B as batch size, T as time dimension, C_in as channels, and D_in as feature dimension.

The output will then be (B, T, C_out*D_out) with C_out and D_out as output dimensions.

The current module structure only handles D_in=400, so that D_out=1. Remark for the multichannel case: C_out is the number of out_channels given at initialization multiplied by C_in.
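
A minimal shape-level sketch, following the layout documented above (two utterances with 50 sliding-window frames of 400 raw samples each; sizes are illustrative):

    import torch
    from espnet2.asr.preencoder.sinc import LightweightSincConvs

    preencoder = LightweightSincConvs(fs=16000, in_channels=1, out_channels=256)
    frames = torch.randn(2, 50, 1, 400)   # (B, T, C_in, D_in) with D_in=400
    lengths = torch.tensor([50, 42])
    out, out_lengths = preencoder(frames, lengths)
    print(out.shape)                      # (2, 50, 256), since C_out * D_out = 256 * 1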

gen_lsc_block(in_channels: int, out_channels: int, depthwise_kernel_size: int = 9, depthwise_stride: int = 1, depthwise_groups=None, pointwise_groups=0, dropout_probability: float = 0.15, avgpool=False)[source]

Generate a convolutional block for Lightweight Sinc convolutions.

Each block consists of either a depthwise or a depthwise-separable convolution, together with dropout, a (batch-)normalization layer, and an optional average-pooling layer.

Parameters:
  • in_channels – Number of input channels.

  • out_channels – Number of output channels.

  • depthwise_kernel_size – Kernel size of the depthwise convolution.

  • depthwise_stride – Stride of the depthwise convolution.

  • depthwise_groups – Number of groups of the depthwise convolution.

  • pointwise_groups – Number of groups of the pointwise convolution.

  • dropout_probability – Dropout probability in the block.

  • avgpool – If True, an AvgPool layer is inserted.

Returns:

Neural network building block.

Return type:

torch.nn.Sequential

output_size() → int[source]

Get the output size.

class espnet2.asr.preencoder.sinc.SpatialDropout(dropout_probability: float = 0.15, shape: Union[tuple, list, None] = None)[source]

Bases: torch.nn.modules.module.Module

Spatial dropout module.

Apply dropout to full channels of input tensors of shape (B, C, D).

Initialize.

Parameters:
  • dropout_probability – Dropout probability.

  • shape (tuple, list) – Shape of input tensors.

forward(x: torch.Tensor) → torch.Tensor[source]

Forward of spatial dropout module.

espnet2.asr.preencoder.abs_preencoder

class espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract output_size() → int[source]

espnet2.asr.preencoder.__init__

espnet2.asr.preencoder.linear

Linear Projection.

class espnet2.asr.preencoder.linear.LinearProjection(input_size: int, output_size: int, dropout: float = 0.0)[source]

Bases: espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder

Linear Projection Preencoder.

Initialize the module.

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward.

output_size() → int[source]

Get the output size.
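
A minimal sketch of the projection pre-encoder; the feature sizes are arbitrary, and the lengths are expected to pass through unchanged since only the feature dimension is transformed:

    import torch
    from espnet2.asr.preencoder.linear import LinearProjection

    preencoder = LinearProjection(input_size=80, output_size=256)
    feats = torch.randn(2, 120, 80)        # e.g. frontend output (B, T, D_in)
    feats_lengths = torch.tensor([120, 97])
    out, out_lengths = preencoder(feats, feats_lengths)
    print(out.shape, preencoder.output_size())  # torch.Size([2, 120, 256]) 256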

espnet2.asr.decoder.transformer_decoder

Decoder definition.

class espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder(vocab_size: int, encoder_output_size: int, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True)[source]

Bases: espnet2.asr.decoder.abs_decoder.AbsDecoder, espnet.nets.scorer_interface.BatchScorerInterface

Base class of Transformer decoder module.

Parameters:
  • vocab_size – output dim

  • encoder_output_size – dimension of attention

  • attention_heads – the number of heads of multi head attention

  • linear_units – the number of units of position-wise feed forward

  • num_blocks – the number of decoder blocks

  • dropout_rate – dropout rate

  • self_attention_dropout_rate – dropout rate for attention

  • input_layer – input layer type

  • use_output_layer – whether to use output layer

  • pos_enc_class – PositionalEncoding or ScaledPositionalEncoding

  • normalize_before – whether to use layer_norm before the first block

  • concat_after – whether to concatenate the attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x)

batch_score(ys: torch.Tensor, states: List[Any], xs: torch.Tensor, return_hs: bool = False) → Tuple[torch.Tensor, List[Any]][source]

Score new token batch.

Parameters:
  • ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).

  • states (List[Any]) – Scorer states for prefix tokens.

  • xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).

Returns:

Tuple of

batchfied scores for next token with shape of (n_batch, n_vocab) and next state list for ys.

Return type:

tuple[torch.Tensor, List[Any]]

forward(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor, return_hs: bool = False, return_all_hs: bool = False) → Tuple[torch.Tensor, torch.Tensor][source]

Forward decoder.

Parameters:
  • hs_pad – encoded memory, float32 (batch, maxlen_in, feat)

  • hlens – (batch)

  • ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed” input tensor (batch, maxlen_out, #mels) in the other cases

  • ys_in_lens – (batch)

  • return_hs – (bool) whether to return the last hidden output before output layer

  • return_all_hs – (bool) whether to return all the hidden intermediates

Returns:

tuple containing:

x: decoded token score before softmax (batch, maxlen_out, token)

if use_output_layer is True,

olens: (batch, )

Return type:

(tuple)
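
A minimal sketch of a forward pass with the shapes documented above; the vocabulary and feature sizes are arbitrary:

    import torch
    from espnet2.asr.decoder.transformer_decoder import TransformerDecoder

    decoder = TransformerDecoder(vocab_size=100, encoder_output_size=256)
    hs_pad = torch.randn(2, 30, 256)            # encoded memory (batch, maxlen_in, feat)
    hlens = torch.tensor([30, 25])
    ys_in_pad = torch.randint(0, 100, (2, 8))   # input token ids (batch, maxlen_out)
    ys_in_lens = torch.tensor([8, 6])

    scores, olens = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)
    print(scores.shape)                         # (2, 8, 100): token scores before softmax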

forward_one_step(tgt: torch.Tensor, tgt_mask: torch.Tensor, memory: torch.Tensor, memory_mask: torch.Tensor = None, *, cache: List[torch.Tensor] = None, return_hs: bool = False) → Tuple[torch.Tensor, List[torch.Tensor]][source]

Forward one step.

Parameters:
  • tgt – input token ids, int64 (batch, maxlen_out)

  • tgt_mask – input token mask, (batch, maxlen_out); dtype=torch.uint8 before PyTorch 1.2, dtype=torch.bool in PyTorch 1.2 and later

  • memory – encoded memory, float32 (batch, maxlen_in, feat)

  • memory_mask – encoded memory mask (batch, 1, maxlen_in)

  • cache – cached output list of (batch, max_time_out-1, size)

  • return_hs – whether to return the decoder hidden state corresponding to ys (used for searchable hidden intermediates)

Returns:

NN output value and cache per self.decoders. y.shape is (batch, maxlen_out, token).

Return type:

y, cache

score(ys, state, x, return_hs=False)[source]

Score.

class espnet2.asr.decoder.transformer_decoder.DynamicConvolution2DTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]

Bases: espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder

class espnet2.asr.decoder.transformer_decoder.DynamicConvolutionTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]

Bases: espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder

class espnet2.asr.decoder.transformer_decoder.LightweightConvolution2DTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]

Bases: espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder

class espnet2.asr.decoder.transformer_decoder.LightweightConvolutionTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]

Bases: espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder

class espnet2.asr.decoder.transformer_decoder.TransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, layer_drop_rate: float = 0.0)[source]

Bases: espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder

class espnet2.asr.decoder.transformer_decoder.TransformerMDDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, use_speech_attn: bool = True)[source]

Bases: espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder

batch_score(ys: torch.Tensor, states: List[Any], xs: torch.Tensor, speech: torch.Tensor = None) → Tuple[torch.Tensor, List[Any]][source]

Score new token batch.

Parameters:
  • ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).

  • states (List[Any]) – Scorer states for prefix tokens.

  • xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).

Returns:

Tuple of

batchfied scores for next token with shape of (n_batch, n_vocab) and next state list for ys.

Return type:

tuple[torch.Tensor, List[Any]]

forward(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor, speech: torch.Tensor = None, speech_lens: torch.Tensor = None, return_hs: bool = False) → Tuple[torch.Tensor, torch.Tensor][source]

Forward decoder.

Parameters:
  • hs_pad – encoded memory, float32 (batch, maxlen_in, feat)

  • hlens – (batch)

  • ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed” input tensor (batch, maxlen_out, #mels) in the other cases

  • ys_in_lens – (batch)

  • return_hs – whether to return the decoder hidden state corresponding to ys (used for searchable hidden intermediates)

Returns:

tuple containing:

x: decoded token score before softmax (batch, maxlen_out, token)

if use_output_layer is True,

olens: (batch, )

Return type:

(tuple)

forward_one_step(tgt: torch.Tensor, tgt_mask: torch.Tensor, memory: torch.Tensor, memory_mask: torch.Tensor = None, *, speech: torch.Tensor = None, speech_mask: torch.Tensor = None, cache: List[torch.Tensor] = None, return_hs: bool = False) → Tuple[torch.Tensor, List[torch.Tensor]][source]

Forward one step.

Parameters:
  • tgt – input token ids, int64 (batch, maxlen_out)

  • tgt_mask – input token mask, (batch, maxlen_out); dtype=torch.uint8 before PyTorch 1.2, dtype=torch.bool in PyTorch 1.2 and later

  • memory – encoded memory, float32 (batch, maxlen_in, feat)

  • memory_mask – encoded memory mask (batch, 1, maxlen_in)

  • speech – encoded speech, float32 (batch, maxlen_in, feat)

  • speech_mask – encoded memory mask (batch, 1, maxlen_in)

  • cache – cached output list of (batch, max_time_out-1, size)

  • return_hs – whether to return the decoder hidden state corresponding to ys (used for searchable hidden intermediates)

Returns:

NN output value and cache per self.decoders. y.shape is (batch, maxlen_out, token).

Return type:

y, cache

score(ys, state, x, speech=None)[source]

Score.

espnet2.asr.decoder.abs_decoder

class espnet2.asr.decoder.abs_decoder.AbsDecoder(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, espnet.nets.scorer_interface.ScorerInterface, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.asr.decoder.transducer_decoder

(RNN-)Transducer decoder definition.

class espnet2.asr.decoder.transducer_decoder.TransducerDecoder(vocab_size: int, rnn_type: str = 'lstm', num_layers: int = 1, hidden_size: int = 320, dropout: float = 0.0, dropout_embed: float = 0.0, embed_pad: int = 0)[source]

Bases: espnet2.asr.decoder.abs_decoder.AbsDecoder

(RNN-)Transducer decoder module.

Parameters:
  • vocab_size – Output dimension.

  • layers_type – (RNN-)Decoder layers type.

  • num_layers – Number of decoder layers.

  • hidden_size – Number of decoder units per layer.

  • dropout – Dropout rate for decoder layers.

  • dropout_embed – Dropout rate for embedding layer.

  • embed_pad – Embed/Blank symbol ID.

batch_score(hyps: Union[List[espnet2.asr.transducer.beam_search_transducer.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis]], dec_states: Tuple[torch.Tensor, Optional[torch.Tensor]], cache: Dict[str, Any], use_lm: bool) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor], torch.Tensor][source]

One-step forward hypotheses.

Parameters:
  • hyps – Hypotheses.

  • states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))

  • cache – Pairs of (dec_out, dec_states) for each label sequence. (keys)

  • use_lm – Whether to compute label ID sequences for LM.

Returns:

Decoder output sequences. (B, D_dec) dec_states: Decoder hidden states. ((N, B, D_dec), (N, B, D_dec)) lm_labels: Label ID sequences for LM. (B,)

Return type:

dec_out

create_batch_states(states: Tuple[torch.Tensor, Optional[torch.Tensor]], new_states: List[Tuple[torch.Tensor, Optional[torch.Tensor]]], check_list: Optional[List] = None) → List[Tuple[torch.Tensor, Optional[torch.Tensor]]][source]

Create decoder hidden states.

Parameters:
  • states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))

  • new_states – Decoder hidden states. [N x ((1, D_dec), (1, D_dec))]

Returns:

Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))

Return type:

states

forward(labels: torch.Tensor) → torch.Tensor[source]

Encode source label sequences.

Parameters:

labels – Label ID sequences. (B, L)

Returns:

Decoder output sequences. (B, T, U, D_dec)

Return type:

dec_out

init_state(batch_size: int) → Tuple[torch.Tensor, Optional[torch._VariableFunctionsClass.tensor]][source]

Initialize decoder states.

Parameters:

batch_size – Batch size.

Returns:

Initial decoder hidden states. ((N, B, D_dec), (N, B, D_dec))

rnn_forward(sequence: torch.Tensor, state: Tuple[torch.Tensor, Optional[torch.Tensor]]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]]][source]

Encode source label sequences.

Parameters:
  • sequence – RNN input sequences. (B, D_emb)

  • state – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))

Returns:

RNN output sequences. (B, D_dec) (h_next, c_next): Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))

Return type:

sequence

score(hyp: espnet2.asr.transducer.beam_search_transducer.Hypothesis, cache: Dict[str, Any]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]], torch.Tensor][source]

One-step forward hypothesis.

Parameters:
  • hyp – Hypothesis.

  • cache – Pairs of (dec_out, state) for each label sequence. (key)

Returns:

Decoder output sequence. (1, D_dec) new_state: Decoder hidden states. ((N, 1, D_dec), (N, 1, D_dec)) label: Label ID for LM. (1,)

Return type:

dec_out

select_state(states: Tuple[torch.Tensor, Optional[torch.Tensor]], idx: int) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Get specified ID state from decoder hidden states.

Parameters:
  • states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))

  • idx – State ID to extract.

Returns:

Decoder hidden state for given ID.

((N, 1, D_dec), (N, 1, D_dec))

set_device(device: torch.device)[source]

Set GPU device to use.

Parameters:

device – Device ID.

espnet2.asr.decoder.s4_decoder

Decoder definition.

class espnet2.asr.decoder.s4_decoder.S4Decoder(vocab_size: int, encoder_output_size: int, input_layer: str = 'embed', dropinp: float = 0.0, dropout: float = 0.25, prenorm: bool = True, n_layers: int = 16, transposed: bool = False, tie_dropout: bool = False, n_repeat=1, layer=None, residual=None, norm=None, pool=None, track_norms=True, drop_path: float = 0.0)[source]

Bases: espnet2.asr.decoder.abs_decoder.AbsDecoder, espnet.nets.scorer_interface.BatchScorerInterface

S4 decoder module.

Parameters:
  • vocab_size – output dim

  • encoder_output_size – dimension of hidden vector

  • input_layer – input layer type

  • dropinp – input dropout

  • dropout – dropout parameter applied on every residual and every layer

  • prenorm – pre-norm vs. post-norm

  • n_layers – number of layers

  • transposed – transpose inputs so each layer receives (batch, dim, length)

  • tie_dropout – tie dropout mask across sequence like nn.Dropout1d/nn.Dropout2d

  • n_repeat – each layer is repeated n times per stage before applying pooling

  • layer – layer config, must be specified

  • residual – residual config

  • norm – normalization config (e.g. layer vs batch)

  • pool – config for pooling layer per stage

  • track_norms – log norms of each layer output

  • drop_path – drop rate for stochastic depth

batch_score(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]

Score new token batch.

Parameters:
  • ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).

  • states (List[Any]) – Scorer states for prefix tokens.

  • xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).

Returns:

Tuple of

batchfied scores for next token with shape of (n_batch, n_vocab) and next state list for ys.

Return type:

tuple[torch.Tensor, List[Any]]

forward(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor, state=None) → Tuple[torch.Tensor, torch.Tensor][source]

Forward decoder.

Parameters:
  • hs_pad – encoded memory, float32 (batch, maxlen_in, feat)

  • hlens – (batch)

  • ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed” input tensor (batch, maxlen_out, #mels) in the other cases

  • ys_in_lens – (batch)

Returns:

tuple containing:

x: decoded token score before softmax (batch, maxlen_out, token)

if use_output_layer is True,

olens: (batch, )

Return type:

(tuple)

init_state(x: torch.Tensor)[source]

Initialize state.

score(ys, state, x)[source]

Score new token (required).

Parameters:
  • y (torch.Tensor) – 1D torch.int64 prefix tokens.

  • state – Scorer state for prefix tokens

  • x (torch.Tensor) – The encoder feature that generates ys.

Returns:

Tuple of

scores for next token that has a shape of (n_vocab) and next state for ys

Return type:

tuple[torch.Tensor, Any]

espnet2.asr.decoder.hugging_face_transformers_decoder

Hugging Face Transformers Decoder.

class espnet2.asr.decoder.hugging_face_transformers_decoder.HuggingFaceTransformersDecoder(vocab_size: int, encoder_output_size: int, model_name_or_path: str, causal_lm: bool = False, prefix: str = '', postfix: str = '')[source]

Bases: espnet2.asr.decoder.abs_decoder.AbsDecoder, espnet.nets.scorer_interface.BatchScorerInterface

Hugging Face Transformers Decoder.

Parameters:
  • encoder_output_size – dimension of encoder attention

  • model_name_or_path – Hugging Face Transformers model name

add_prefix_postfix(enc_out, hlens, ys_in_pad, ys_in_lens)[source]
batch_score(ys: torch.Tensor, states: List[Any], xs: torch.Tensor, speech: torch.Tensor = None) → Tuple[torch.Tensor, List[Any]][source]

Score new token batch (required).

Parameters:
  • ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).

  • states (List[Any]) – Scorer states for prefix tokens.

  • xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).

Returns:

Tuple of

batchfied scores for next token with shape of (n_batch, n_vocab) and next state list for ys.

Return type:

tuple[torch.Tensor, List[Any]]

forward(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward decoder.

Parameters:
  • hs_pad – encoded memory, float32 (batch, maxlen_in, feat)

  • hlens – (batch)

  • ys_in_pad – input tensor (batch, maxlen_out, #mels)

  • ys_in_lens – (batch)

Returns:

tuple containing:

x: decoded token score before softmax (batch, maxlen_out, token)

if use_output_layer is True,

olens: (batch, )

Return type:

(tuple)

reload_pretrained_parameters()[source]
score(ys, state, x, speech=None)[source]

Score new token (required).

Parameters:
  • y (torch.Tensor) – 1D torch.int64 prefix tokens.

  • state – Scorer state for prefix tokens

  • x (torch.Tensor) – The encoder feature that generates ys.

Returns:

Tuple of

scores for next token that has a shape of (n_vocab) and next state for ys

Return type:

tuple[torch.Tensor, Any]

espnet2.asr.decoder.hugging_face_transformers_decoder.get_hugging_face_model_lm_head(model)[source]
espnet2.asr.decoder.hugging_face_transformers_decoder.get_hugging_face_model_network(model)[source]

espnet2.asr.decoder.rnn_decoder

class espnet2.asr.decoder.rnn_decoder.RNNDecoder(vocab_size: int, encoder_output_size: int, rnn_type: str = 'lstm', num_layers: int = 1, hidden_size: int = 320, sampling_probability: float = 0.0, dropout: float = 0.0, context_residual: bool = False, replace_sos: bool = False, num_encs: int = 1, att_conf: dict = {'aconv_chans': 10, 'aconv_filts': 100, 'adim': 320, 'aheads': 4, 'atype': 'location', 'awin': 5, 'han_conv_chans': -1, 'han_conv_filts': 100, 'han_dim': 320, 'han_heads': 4, 'han_mode': False, 'han_type': None, 'han_win': 5, 'num_att': 1, 'num_encs': 1})[source]

Bases: espnet2.asr.decoder.abs_decoder.AbsDecoder

forward(hs_pad, hlens, ys_in_pad, ys_in_lens, strm_idx=0)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

init_state(x)[source]

Get an initial state for decoding (optional).

Parameters:

x (torch.Tensor) – The encoded feature tensor

Returns: initial state

rnn_forward(ey, z_list, c_list, z_prev, c_prev)[source]
score(yseq, state, x)[source]

Score new token (required).

Parameters:
  • y (torch.Tensor) – 1D torch.int64 prefix tokens.

  • state – Scorer state for prefix tokens

  • x (torch.Tensor) – The encoder feature that generates ys.

Returns:

Tuple of

scores for next token that has a shape of (n_vocab) and next state for ys

Return type:

tuple[torch.Tensor, Any]

zero_state(hs_pad)[source]
espnet2.asr.decoder.rnn_decoder.build_attention_list(eprojs: int, dunits: int, atype: str = 'location', num_att: int = 1, num_encs: int = 1, aheads: int = 4, adim: int = 320, awin: int = 5, aconv_chans: int = 10, aconv_filts: int = 100, han_mode: bool = False, han_type=None, han_heads: int = 4, han_dim: int = 320, han_conv_chans: int = -1, han_conv_filts: int = 100, han_win: int = 5)[source]

espnet2.asr.decoder.__init__

espnet2.asr.decoder.mlm_decoder

Masked LM Decoder definition.

class espnet2.asr.decoder.mlm_decoder.MLMDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False)[source]

Bases: espnet2.asr.decoder.abs_decoder.AbsDecoder

forward(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward decoder.

Parameters:
  • hs_pad – encoded memory, float32 (batch, maxlen_in, feat)

  • hlens – (batch)

  • ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed” input tensor (batch, maxlen_out, #mels) in the other cases

  • ys_in_lens – (batch)

Returns:

tuple containing: x: decoded token score before softmax (batch, maxlen_out, token)

if use_output_layer is True,

olens: (batch, )

Return type:

(tuple)

espnet2.asr.decoder.whisper_decoder

class espnet2.asr.decoder.whisper_decoder.ExpandedTokenEmbedding(ori_emebedding, additional_size)[source]

Bases: torch.nn.modules.module.Module

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

property weight
class espnet2.asr.decoder.whisper_decoder.OpenAIWhisperDecoder(vocab_size: int, encoder_output_size: int, dropout_rate: float = 0.0, whisper_model: str = 'small', download_dir: Optional[str] = None, load_origin_token_embedding=False)[source]

Bases: espnet2.asr.decoder.abs_decoder.AbsDecoder, espnet.nets.scorer_interface.BatchScorerInterface

Transformer-based Speech-to-Text Decoder from OpenAI’s Whisper Model:

URL: https://github.com/openai/whisper

batch_score(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]

Score new token batch.

Parameters:
  • ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).

  • states (List[Any]) – Scorer states for prefix tokens.

  • xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).

Returns:

Tuple of

batchfied scores for next token with shape of (n_batch, n_vocab) and next state list for ys.

Return type:

tuple[torch.Tensor, List[Any]]

forward(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward decoder.

Parameters:
  • hs_pad – encoded memory, float32 (batch, maxlen_in, feat)

  • hlens – (batch)

  • ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed” input tensor (batch, maxlen_out, #mels) in the other cases

  • ys_in_lens – (batch)

Returns:

tuple containing:

x: decoded token score before softmax (batch, maxlen_out, token)

if use_output_layer is True,

olens: (batch, )

Return type:

(tuple)

forward_one_step(tgt: torch.Tensor, tgt_mask: torch.Tensor, memory: torch.Tensor, *, cache: List[torch.Tensor] = None) → Tuple[torch.Tensor, List[torch.Tensor]][source]

Forward one step.

Parameters:
  • tgt – input token ids, int64 (batch, maxlen_out)

  • tgt_mask – input token mask, (batch, maxlen_out); dtype=torch.uint8 before PyTorch 1.2, dtype=torch.bool in PyTorch 1.2 and later

  • memory – encoded memory, float32 (batch, maxlen_in, feat)

  • cache – cached output list of (batch, max_time_out-1, size)

Returns:

NN output value and cache per self.decoders. y.shape is (batch, maxlen_out, token).

Return type:

y, cache

NOTE (Shih-Lun):

cache implementation is ignored for now for simplicity & correctness

score(ys, state, x)[source]

Score.

espnet2.asr.transducer.beam_search_transducer_streaming

Search algorithms for Transducer models.

class espnet2.asr.transducer.beam_search_transducer_streaming.BeamSearchTransducerStreaming(decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, joint_network: espnet2.asr_transducer.joint_network.JointNetwork, beam_size: int, lm: torch.nn.modules.module.Module = None, lm_weight: float = 0.1, search_type: str = 'default', max_sym_exp: int = 2, u_max: int = 50, nstep: int = 1, prefix_alpha: int = 1, expansion_gamma: int = 2.3, expansion_beta: int = 2, score_norm: bool = True, score_norm_during: bool = False, nbest: int = 1, penalty: float = 0.0, token_list: Optional[List[str]] = None, hold_n: int = 0)[source]

Bases: object

Beam search implementation for Transducer.

Initialize Transducer search module.

Parameters:
  • decoder – Decoder module.

  • joint_network – Joint network module.

  • beam_size – Beam size.

  • lm – LM class.

  • lm_weight – LM weight for soft fusion.

  • search_type – Search algorithm to use during inference.

  • max_sym_exp – Number of maximum symbol expansions at each time step. (TSD)

  • u_max – Maximum output sequence length. (ALSD)

  • nstep – Number of maximum expansion steps at each time step. (NSC/mAES)

  • prefix_alpha – Maximum prefix length in prefix search. (NSC/mAES)

  • expansion_beta – Number of additional candidates for expanded hypotheses selection. (mAES)

  • expansion_gamma – Allowed logp difference for prune-by-value method. (mAES)

  • score_norm – Normalize final scores by length. (“default”)

  • score_norm_during – Normalize scores by length during search. (default, TSD, ALSD)

  • nbest – Number of final hypotheses.

align_length_sync_decoding(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer_streaming.Hypothesis][source]

Alignment-length synchronous beam search implementation.

Based on https://ieeexplore.ieee.org/document/9053040

Parameters:

h – Encoder output sequences. (T, D)

Returns:

N-best hypothesis.

Return type:

nbest_hyps

Beam search implementation.

Modified from https://arxiv.org/pdf/1211.3711.pdf

Parameters:

enc_out – Encoder output sequence. (T, D)

Returns:

N-best hypothesis.

Return type:

nbest_hyps

Greedy search implementation.

Parameters:

enc_out – Encoder output sequence. (T, D_enc)

Returns:

1-best hypotheses.

Return type:

hyp

Modified Adaptive Expansion Search (mAES) implementation.

Based on/modified from https://ieeexplore.ieee.org/document/9250505 and NSC.

Parameters:

enc_out – Encoder output sequence. (T, D_enc)

Returns:

N-best hypothesis.

Return type:

nbest_hyps

N-step constrained beam search implementation.

Based on/Modified from https://arxiv.org/pdf/2002.03577.pdf. Please reference ESPnet (b-flo, PR #2444) for any usage outside ESPnet until further modifications.

Parameters:

enc_out – Encoder output sequence. (T, D_enc)

Returns:

N-best hypothesis.

Return type:

nbest_hyps

Prefix search for NSC and mAES strategies.

Based on https://arxiv.org/pdf/1211.3711.pdf

reset()[source]
sort_nbest(hyps: Union[List[espnet2.asr.transducer.beam_search_transducer_streaming.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer_streaming.ExtendedHypothesis]]) → Union[List[espnet2.asr.transducer.beam_search_transducer_streaming.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer_streaming.ExtendedHypothesis]][source]

Sort hypotheses by score or score given sequence length.

Parameters:

hyps – Hypothesis.

Returns:

Sorted hypothesis.

Return type:

hyps

time_sync_decoding(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer_streaming.Hypothesis][source]

Time synchronous beam search implementation.

Based on https://ieeexplore.ieee.org/document/9053040

Parameters:

enc_out – Encoder output sequence. (T, D)

Returns:

N-best hypothesis.

Return type:

nbest_hyps

class espnet2.asr.transducer.beam_search_transducer_streaming.ExtendedHypothesis(score: float, yseq: List[int], dec_state: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]], torch.Tensor], lm_state: Union[Dict[str, Any], List[Any]] = None, dec_out: List[torch.Tensor] = None, lm_scores: torch.Tensor = None)[source]

Bases: espnet2.asr.transducer.beam_search_transducer_streaming.Hypothesis

Extended hypothesis definition for NSC beam search and mAES.

dec_out = None
lm_scores = None
class espnet2.asr.transducer.beam_search_transducer_streaming.Hypothesis(score: float, yseq: List[int], dec_state: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]], torch.Tensor], lm_state: Union[Dict[str, Any], List[Any]] = None)[source]

Bases: object

Default hypothesis definition for Transducer search algorithms.

lm_state = None

espnet2.asr.transducer.__init__

espnet2.asr.transducer.beam_search_transducer

Search algorithms for Transducer models.

class espnet2.asr.transducer.beam_search_transducer.BeamSearchTransducer(decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, joint_network: espnet2.asr_transducer.joint_network.JointNetwork, beam_size: int, lm: torch.nn.modules.module.Module = None, lm_weight: float = 0.1, search_type: str = 'default', max_sym_exp: int = 2, u_max: int = 50, nstep: int = 1, prefix_alpha: int = 1, expansion_gamma: int = 2.3, expansion_beta: int = 2, multi_blank_durations: List[int] = [], multi_blank_indices: List[int] = [], score_norm: bool = True, score_norm_during: bool = False, nbest: int = 1, token_list: Optional[List[str]] = None)[source]

Bases: object

Beam search implementation for Transducer.

Initialize Transducer search module.

Parameters:
  • decoder – Decoder module.

  • joint_network – Joint network module.

  • beam_size – Beam size.

  • lm – LM class.

  • lm_weight – LM weight for soft fusion.

  • search_type – Search algorithm to use during inference.

  • max_sym_exp – Number of maximum symbol expansions at each time step. (TSD)

  • u_max – Maximum output sequence length. (ALSD)

  • nstep – Number of maximum expansion steps at each time step. (NSC/mAES)

  • prefix_alpha – Maximum prefix length in prefix search. (NSC/mAES)

  • expansion_beta – Number of additional candidates for expanded hypotheses selection. (mAES)

  • expansion_gamma – Allowed logp difference for prune-by-value method. (mAES)

  • multi_blank_durations – The duration of each blank token. (MBG)

  • multi_blank_indices – The index of each blank token in token_list. (MBG)

  • score_norm – Normalize final scores by length. (“default”)

  • score_norm_during – Normalize scores by length during search. (default, TSD, ALSD)

  • nbest – Number of final hypotheses.
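
A hedged usage sketch. It assumes a trained transducer-style ESPnetASRModel whose decoder and joint_network attributes are reused, a single encoder output enc_out of shape (T, D_enc), and that the search object is invoked directly on the encoder output, as in espnet2 inference:

    from espnet2.asr.transducer.beam_search_transducer import BeamSearchTransducer

    # `model` (trained transducer ESPnetASRModel) and `enc_out` (T, D_enc) are assumed to exist.
    searcher = BeamSearchTransducer(
        decoder=model.decoder,
        joint_network=model.joint_network,
        beam_size=5,
        search_type="tsd",   # time-synchronous decoding (see time_sync_decoding below)
        nbest=3,
    )
    nbest_hyps = searcher(enc_out)   # list of Hypothesis, best first after sort_nbest
    best = nbest_hyps[0]
    print(best.score, best.yseq)     # fields of the Hypothesis class documented below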

align_length_sync_decoding(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]

Alignment-length synchronous beam search implementation.

Based on https://ieeexplore.ieee.org/document/9053040

Parameters:

h – Encoder output sequences. (T, D)

Returns:

N-best hypothesis.

Return type:

nbest_hyps

Beam search implementation.

Modified from https://arxiv.org/pdf/1211.3711.pdf

Parameters:

enc_out – Encoder output sequence. (T, D)

Returns:

N-best hypothesis.

Return type:

nbest_hyps

Greedy search implementation.

Parameters:

enc_out – Encoder output sequence. (T, D_enc)

Returns:

1-best hypotheses.

Return type:

hyp

Modified Adaptive Expansion Search (mAES) implementation.

Based on/modified from https://ieeexplore.ieee.org/document/9250505 and NSC.

Parameters:

enc_out – Encoder output sequence. (T, D_enc)

Returns:

N-best hypothesis.

Return type:

nbest_hyps

Greedy Search for Multi-Blank Transducer (Multi-Blank Greedy, MBG).

In this implementation, we assume:

  1. the index of the standard blank is the last entry of self.multi_blank_indices rather than self.blank_id (to avoid too much change to the original transducer)

  2. other entries in self.multi_blank_indices are big blanks that account for multiple frames.

Based on https://arxiv.org/abs/2211.03541

Parameters:

enc_out – Encoder output sequence. (T, D_enc)

Returns:

1-best hypothesis.

Return type:

hyp

N-step constrained beam search implementation.

Based on/Modified from https://arxiv.org/pdf/2002.03577.pdf. Please reference ESPnet (b-flo, PR #2444) for any usage outside ESPnet until further modifications.

Parameters:

enc_out – Encoder output sequence. (T, D_enc)

Returns:

N-best hypothesis.

Return type:

nbest_hyps

Prefix search for NSC and mAES strategies.

Based on https://arxiv.org/pdf/1211.3711.pdf

sort_nbest(hyps: Union[List[espnet2.asr.transducer.beam_search_transducer.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis]]) → Union[List[espnet2.asr.transducer.beam_search_transducer.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis]][source]

Sort hypotheses by score or score given sequence length.

Parameters:

hyps – Hypothesis.

Returns:

Sorted hypothesis.

Return type:

hyps

time_sync_decoding(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]

Time synchronous beam search implementation.

Based on https://ieeexplore.ieee.org/document/9053040

Parameters:

enc_out – Encoder output sequence. (T, D)

Returns:

N-best hypothesis.

Return type:

nbest_hyps

class espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis(score: float, yseq: List[int], dec_state: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]], torch.Tensor], lm_state: Union[Dict[str, Any], List[Any]] = None, dec_out: List[torch.Tensor] = None, lm_scores: torch.Tensor = None)[source]

Bases: espnet2.asr.transducer.beam_search_transducer.Hypothesis

Extended hypothesis definition for NSC beam search and mAES.

dec_out = None
lm_scores = None
class espnet2.asr.transducer.beam_search_transducer.Hypothesis(score: float, yseq: List[int], dec_state: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]], torch.Tensor], lm_state: Union[Dict[str, Any], List[Any]] = None)[source]

Bases: object

Default hypothesis definition for Transducer search algorithms.

lm_state = None

espnet2.asr.transducer.error_calculator

Error Calculator module for Transducer.

class espnet2.asr.transducer.error_calculator.ErrorCalculatorTransducer(decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, joint_network: torch.nn.modules.module.Module, token_list: List[int], sym_space: str, sym_blank: str, report_cer: bool = False, report_wer: bool = False)[source]

Bases: object

Calculate CER and WER for transducer models.

Parameters:
  • decoder – Decoder module.

  • token_list – List of tokens.

  • sym_space – Space symbol.

  • sym_blank – Blank symbol.

  • report_cer – Whether to compute CER.

  • report_wer – Whether to compute WER.

Construct an ErrorCalculatorTransducer.
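
A hedged construction sketch. model.decoder, model.joint_network, and token_list are assumed to come from a trained transducer model; the final call on batched encoder outputs and padded target ids mirrors how the calculator is used inside the ESPnet ASR model (its call signature is not shown above, so treat it as an assumption):

    from espnet2.asr.transducer.error_calculator import ErrorCalculatorTransducer

    error_calc = ErrorCalculatorTransducer(
        decoder=model.decoder,
        joint_network=model.joint_network,
        token_list=token_list,
        sym_space="<space>",
        sym_blank="<blank>",
        report_cer=True,
        report_wer=True,
    )
    cer, wer = error_calc(encoder_out, target)  # assumed call: (B, T, D_enc) outputs, (B, L) labels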

calculate_cer(char_pred: torch.Tensor, char_target: torch.Tensor) → float[source]

Calculate sentence-level CER score.

Parameters:
  • char_pred – Prediction character sequences. (B, ?)

  • char_target – Target character sequences. (B, ?)

Returns:

Average sentence-level CER score.

calculate_wer(char_pred: torch.Tensor, char_target: torch.Tensor) → float[source]

Calculate sentence-level WER score.

Parameters:
  • char_pred – Prediction character sequences. (B, ?)

  • char_target – Target character sequences. (B, ?)

Returns:

Average sentence-level WER score

convert_to_char(pred: torch.Tensor, target: torch.Tensor) → Tuple[List, List][source]

Convert label ID sequences to character sequences.

Parameters:
  • pred – Prediction label ID sequences. (B, U)

  • target – Target label ID sequences. (B, L)

Returns:

Prediction character sequences. (B, ?) char_target: Target character sequences. (B, ?)

Return type:

char_pred

espnet2.asr.transducer.rnnt_multi_blank.rnnt

espnet2.asr.transducer.rnnt_multi_blank.rnnt.multiblank_rnnt_loss_gpu(acts: torch.Tensor, labels: torch.Tensor, input_lengths: torch.Tensor, label_lengths: torch.Tensor, costs: torch.Tensor, grads: torch.Tensor, blank_label: int, big_blank_durations: list, fastemit_lambda: float, clamp: float, num_threads: int, sigma: float)[source]

Wrapper method for accessing GPU Multi-blank RNNT loss

(https://arxiv.org/pdf/2211.03541.pdf).

CUDA implementation ported from [HawkAaron/warp-transducer]

(https://github.com/HawkAaron/warp-transducer).

Parameters:
  • acts – Activation tensor of shape [B, T, U, V + num_big_blanks + 1].

  • labels – Ground truth labels of shape [B, U].

  • input_lengths – Lengths of the acoustic sequence as a vector of ints [B].

  • label_lengths – Lengths of the target sequence as a vector of ints [B].

  • costs – Zero vector of length [B] in which costs will be set.

  • grads – Zero tensor of shape [B, T, U, V + num_big_blanks + 1] where the gradient will be set.

  • blank_label – Index of the standard blank token in the vocabulary.

  • big_blank_durations – A list of supported durations for big blank symbols in the model, e.g. [2, 4, 8]. Note that only durations for “big blanks” are included here; the list should not include 1 for the standard blank. Those big blanks have vocabulary indices after the standard blank index.

  • fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.

  • clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].

  • num_threads – Number of threads for OpenMP.

  • sigma – logit-undernormalization weight used in the multi-blank model. Refer to the multi-blank paper https://arxiv.org/pdf/2211.03541 for detailed explanations.

espnet2.asr.transducer.rnnt_multi_blank.rnnt.rnnt_loss_cpu(acts: torch.Tensor, labels: torch.Tensor, input_lengths: torch.Tensor, label_lengths: torch.Tensor, costs: torch.Tensor, grads: torch.Tensor, blank_label: int, fastemit_lambda: float, clamp: float, num_threads: int)[source]

Wrapper method for accessing CPU RNNT loss.

CPU implementation ported from [HawkAaron/warp-transducer]

(https://github.com/HawkAaron/warp-transducer).

Parameters:
  • acts – Activation tensor of shape [B, T, U, V+1].

  • labels – Ground truth labels of shape [B, U].

  • input_lengths – Lengths of the acoustic sequence as a vector of ints [B].

  • label_lengths – Lengths of the target sequence as a vector of ints [B].

  • costs – Zero vector of length [B] in which costs will be set.

  • grads – Zero tensor of shape [B, T, U, V+1] where the gradient will be set.

  • blank_label – Index of the blank token in the vocabulary.

  • fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.

  • clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].

  • num_threads – Number of threads for OpenMP.

espnet2.asr.transducer.rnnt_multi_blank.rnnt.rnnt_loss_gpu(acts: torch.Tensor, labels: torch.Tensor, input_lengths: torch.Tensor, label_lengths: torch.Tensor, costs: torch.Tensor, grads: torch.Tensor, blank_label: int, fastemit_lambda: float, clamp: float, num_threads: int)[source]

Wrapper method for accessing GPU RNNT loss.

CUDA implementation ported from [HawkAaron/warp-transducer]

(https://github.com/HawkAaron/warp-transducer).

Parameters:
  • acts – Activation tensor of shape [B, T, U, V+1].

  • labels – Ground truth labels of shape [B, U].

  • input_lengths – Lengths of the acoustic sequence as a vector of ints [B].

  • label_lengths – Lengths of the target sequence as a vector of ints [B].

  • costs – Zero vector of length [B] in which costs will be set.

  • grads – Zero tensor of shape [B, T, U, V+1] where the gradient will be set.

  • blank_label – Index of the blank token in the vocabulary.

  • fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.

  • clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].

  • num_threads – Number of threads for OpenMP.

espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank

espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank.rnnt_loss(acts, labels, act_lens, label_lens, blank=0, reduction='mean', fastemit_lambda: float = 0.0, clamp: float = 0.0)[source]

RNN Transducer Loss (functional form)

Parameters:
  • acts – Tensor of (batch x seqLength x labelLength x outputDim) containing output from network

  • labels – 2 dimensional Tensor containing all the targets of the batch with zero padded

  • act_lens – Tensor of size (batch) containing size of each output sequence from the network

  • label_lens – Tensor of (batch) containing label length of each example

  • blank (int, optional) – blank label. Default: 0.

  • reduction (string, optional) – Specifies the reduction to apply to the output: ‘none’ | ‘mean’ | ‘sum’. ‘none’: no reduction will be applied, ‘mean’: the output losses will be divided by the target lengths and then the mean over the batch is taken. Default: ‘mean’
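
A minimal sketch with the documented shapes. It assumes raw (pre-softmax) activations and int32 labels/lengths, following the upstream warp-transducer convention; all sizes below are illustrative:

    import torch
    from espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank import rnnt_loss

    B, T, U, V = 2, 10, 5, 20                   # batch, frames, 1 + max label length, vocab size
    acts = torch.randn(B, T, U, V + 1, requires_grad=True)          # network output incl. blank
    labels = torch.randint(1, V + 1, (B, U - 1), dtype=torch.int32)  # zero is reserved for blank
    act_lens = torch.full((B,), T, dtype=torch.int32)
    label_lens = torch.full((B,), U - 1, dtype=torch.int32)

    loss = rnnt_loss(acts, labels, act_lens, label_lens, blank=0, reduction="mean")
    loss.backward()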

class espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank.RNNTLossNumba(blank=0, reduction='mean', fastemit_lambda: float = 0.0, clamp: float = -1)[source]

Bases: torch.nn.modules.module.Module

RNNT Loss Numba

Parameters:
  • blank (int, optional) – blank label. Default: 0.

  • reduction (string, optional) – Specifies the reduction to apply to the output: ‘none’ | ‘mean’ | ‘sum’. ‘none’: no reduction will be applied, ‘mean’: the output losses will be divided by the target lengths and then the mean over the batch is taken. Default: ‘mean’

  • fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.

  • clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].

forward(acts, labels, act_lens, label_lens)[source]

Forward RNNTLossNumba.

log_probs: Tensor of (batch x seqLength x labelLength x outputDim) containing output from network

labels: 2-dimensional Tensor containing all the targets of the batch, zero padded

act_lens: Tensor of size (batch) containing the size of each output sequence from the network

label_lens: Tensor of (batch) containing the label length of each example
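
The module form mirrors the functional call; a brief sketch reusing the same shapes and assumptions as the rnnt_loss() example above:

    import torch
    from espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank import RNNTLossNumba

    criterion = RNNTLossNumba(blank=0, reduction="mean")
    B, T, U, V = 2, 10, 5, 20
    acts = torch.randn(B, T, U, V + 1, requires_grad=True)
    labels = torch.randint(1, V + 1, (B, U - 1), dtype=torch.int32)
    act_lens = torch.full((B,), T, dtype=torch.int32)
    label_lens = torch.full((B,), U - 1, dtype=torch.int32)
    loss = criterion(acts, labels, act_lens, label_lens)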

class espnet2.asr.transducer.rnnt_multi_blank.rnnt_multi_blank.MultiblankRNNTLossNumba(blank, big_blank_durations, reduction='mean', fastemit_lambda: float = 0.0, clamp: float = -1, sigma: float = 0.0)[source]

Bases: torch.nn.modules.module.Module

Multiblank RNNT Loss Numba

Parameters:
  • blank (int) – standard blank label.

  • big_blank_durations – list of durations for multi-blank transducer, e.g. [2, 4, 8].

  • sigma – hyper-parameter for logit under-normalization method for training multi-blank transducers. Recommended value 0.05.

  • Refer to https://arxiv.org/pdf/2211.03541 for detailed explanations of the above parameters.

  • reduction (string, optional) – Specifies the reduction to apply to the output: ‘none’ | ‘mean’ | ‘sum’. ‘none’: no reduction will be applied, ‘mean’: the output losses will be divided by the target lengths and then the mean over the batch is taken. Default: ‘mean’

  • fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.

  • clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].

forward(acts, labels, act_lens, label_lens)[source]

MultiblankRNNTLossNumba Forward.

log_probs: Tensor of (batch x seqLength x labelLength x outputDim) containing output from network

labels: 2-dimensional Tensor containing all the targets of the batch, zero padded

act_lens: Tensor of size (batch) containing the size of each output sequence from the network

label_lens: Tensor of (batch) containing the label length of each example

espnet2.asr.transducer.rnnt_multi_blank.__init__

espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants

class espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.RNNTStatus[source]

Bases: enum.Enum

An enumeration.

RNNT_STATUS_INVALID_VALUE = 1
RNNT_STATUS_SUCCESS = 0
espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.THRESHOLD = 0.1

Getters

espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.dtype()[source]
espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.threads_per_block()[source]
espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.warp_size()[source]

espnet2.asr.transducer.rnnt_multi_blank.utils.__init__

espnet2.asr.transducer.rnnt_multi_blank.utils.rnnt_helper

espnet2.asr.transducer.rnnt_multi_blank.utils.rnnt_helper.add[source]
espnet2.asr.transducer.rnnt_multi_blank.utils.rnnt_helper.compute_costs_data[source]
espnet2.asr.transducer.rnnt_multi_blank.utils.rnnt_helper.copy_data_1d[source]
espnet2.asr.transducer.rnnt_multi_blank.utils.rnnt_helper.div_up[source]
espnet2.asr.transducer.rnnt_multi_blank.utils.rnnt_helper.exponential[source]
espnet2.asr.transducer.rnnt_multi_blank.utils.rnnt_helper.flatten_tensor(x: torch.Tensor)[source]
espnet2.asr.transducer.rnnt_multi_blank.utils.rnnt_helper.get_workspace_size(maxT: int, maxU: int, minibatch: int, gpu: bool) → Tuple[Optional[int], espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.RNNTStatus][source]
espnet2.asr.transducer.rnnt_multi_blank.utils.rnnt_helper.identity[source]
espnet2.asr.transducer.rnnt_multi_blank.utils.rnnt_helper.log_plus[source]
espnet2.asr.transducer.rnnt_multi_blank.utils.rnnt_helper.log_sum_exp[source]
espnet2.asr.transducer.rnnt_multi_blank.utils.rnnt_helper.maximum[source]
espnet2.asr.transducer.rnnt_multi_blank.utils.rnnt_helper.negate[source]
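
Most of the helpers above appear to be low-level device functions used inside the CUDA kernels, while get_workspace_size and flatten_tensor can be called from host code. A small sketch of querying the required workspace size (the problem dimensions are hypothetical):

from espnet2.asr.transducer.rnnt_multi_blank.utils import global_constants, rnnt_helper

# maxT: longest acoustic sequence, maxU: longest target sequence + 1, minibatch: batch size.
size, status = rnnt_helper.get_workspace_size(maxT=100, maxU=20, minibatch=4, gpu=True)
if status == global_constants.RNNTStatus.RNNT_STATUS_SUCCESS:
    print("required workspace size:", size)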

espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.__init__

espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt

class espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt.CPURNNT(minibatch: int, maxT: int, maxU: int, alphabet_size: int, workspace: torch.Tensor, blank: int, fastemit_lambda: float, clamp: float, num_threads: int, batch_first: bool)[source]

Bases: object

Helper class to compute the Transducer Loss on CPU.

Parameters:
  • minibatch – Size of the minibatch b.

  • maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.

  • maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.

  • alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).

  • workspace – An allocated chunk of memory that will be sliced off and reshaped into required blocks used as working memory.

  • blank – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.

  • fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.

  • clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].

  • num_threads – Number of OMP threads to launch.

  • batch_first – Bool that decides if batch dimension is first or third.

compute_alphas(log_probs: torch.Tensor, T: int, U: int, alphas: torch.Tensor)[source]

Compute the probability of the forward variable alpha.

Parameters:
  • log_probs – Flattened tensor [B, T, U, V+1]

  • T – Length of the acoustic sequence T (not padded).

  • U – Length of the target sequence U (not padded).

  • alphas – Working space memory for alpha of shape [B, T, U].

Returns:

Loglikelihood of the forward variable alpha.

compute_betas_and_grads(grad: torch.Tensor, log_probs: torch.Tensor, T: int, U: int, alphas: torch.Tensor, betas: torch.Tensor, labels: torch.Tensor, logll: torch.Tensor)[source]

Compute the backward variable beta as well as the gradients of the activation matrix with respect to the log-likelihood of the forward variable.

Parameters:
  • grad – Working space memory of flattened shape [B, T, U, V+1]

  • log_probs – Activation tensor of flattened shape [B, T, U, V+1]

  • T – Length of the acoustic sequence T (not padded).

  • U – Length of the target sequence U (not padded).

  • alphas – Working space memory for alpha of shape [B, T, U].

  • betas – Working space memory for beta of shape [B, T, U].

  • labels – Ground truth label of shape [B, U]

  • logll – Loglikelihood of the forward variable.

Returns:

Loglikelihood of the forward variable and inplace updates the grad tensor.

cost_and_grad(log_probs: torch.Tensor, grads: torch.Tensor, costs: torch.Tensor, flat_labels: torch.Tensor, label_lengths: torch.Tensor, input_lengths: torch.Tensor) → espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.RNNTStatus[source]
cost_and_grad_kernel(log_probs: torch.Tensor, grad: torch.Tensor, labels: torch.Tensor, mb: int, T: int, U: int, bytes_used: int)[source]
score_forward(log_probs: torch.Tensor, costs: torch.Tensor, flat_labels: torch.Tensor, label_lengths: torch.Tensor, input_lengths: torch.Tensor)[source]
class espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt.CpuRNNT_index(U: int, maxU: int, minibatch: int, alphabet_size: int, batch_first: bool)[source]

Bases: object

A placeholder index computation class that emits the resolved index in a flattened tensor, mimicking pointer indexing in CUDA kernels on the CPU.

Parameters:
  • U – Length of the current target sample (without padding).

  • maxU – Max Length of the padded target samples.

  • minibatch – Minibatch index

  • alphabet_size – Size of the vocabulary including RNNT blank - V+1.

  • batch_first – Bool flag determining if batch index is first or third.

class espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt.CpuRNNT_metadata(T: int, U: int, workspace: torch.Tensor, bytes_used: int, blank: int, labels: torch.Tensor, log_probs: torch.Tensor, idx: espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt.CpuRNNT_index)[source]

Bases: object

Metadata for CPU based RNNT loss calculation. Holds the working space memory.

Parameters:
  • T – Length of the acoustic sequence (without padding).

  • U – Length of the target sequence (without padding).

  • workspace – Working space memory for the CPU.

  • bytes_used – Number of bytes currently used for indexing the working space memory. Generally 0.

  • blank – Index of the blank token in the vocabulary.

  • labels – Ground truth padded labels matrix of shape [B, U]

  • log_probs – Log probs / activation matrix of flattened shape [B, T, U, V+1]

  • idx – CpuRNNT_index instance used to resolve indices into the flattened tensors.

setup_probs(T: int, U: int, labels: torch.Tensor, blank: int, log_probs: torch.Tensor, idx: espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt.CpuRNNT_index)[source]
class espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt.LogSoftmaxGradModification(*args, **kwargs)[source]

Bases: torch.autograd.function.Function

static backward(ctx, grad_output)[source]

Defines a formula for differentiating the operation with backward mode automatic differentiation (alias to the vjp function).

This function is to be overridden by all subclasses.

It must accept a context ctx as the first argument, followed by as many outputs as the forward() returned (None will be passed in for non tensor outputs of the forward function), and it should return as many tensors, as there were inputs to forward(). Each argument is the gradient w.r.t the given output, and each returned value should be the gradient w.r.t. the corresponding input. If an input is not a Tensor or is a Tensor not requiring grads, you can just pass None as a gradient for that input.

The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.

static forward(ctx, acts, clamp)[source]

This function is to be overridden by all subclasses. There are two ways to define forward:

Usage 1 (Combined forward and ctx):

@staticmethod
def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
    pass
  • It must accept a context ctx as the first argument, followed by any number of arguments (tensors or other types).

  • See combining-forward-context for more details

Usage 2 (Separate forward and ctx):

@staticmethod
def forward(*args: Any, **kwargs: Any) -> Any:
    pass

@staticmethod
def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
    pass
  • The forward no longer accepts a ctx argument.

  • Instead, you must also override the torch.autograd.Function.setup_context() staticmethod to handle setting up the ctx object. output is the output of the forward, inputs are a Tuple of inputs to the forward.

  • See extending-autograd for more details

The context can be used to store arbitrary data that can be then retrieved during the backward pass. Tensors should not be stored directly on ctx (though this is not currently enforced for backward compatibility). Instead, tensors should be saved either with ctx.save_for_backward() if they are intended to be used in backward (equivalently, vjp) or ctx.save_for_forward() if they are intended to be used for in jvp.
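
Concretely, this Function appears to follow "Usage 1" above: the forward pass passes the activations through essentially unchanged (as a copy) while the backward pass clamps the incoming gradient to [-clamp, clamp]. A small sketch (the shapes are arbitrary):

import torch
from espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt import (
    LogSoftmaxGradModification,
)

acts = torch.randn(2, 10, 5, 17, requires_grad=True)
clamp = 0.1
clamped_acts = LogSoftmaxGradModification.apply(acts, clamp)  # apply() is the Function entry point
clamped_acts.sum().backward()
print(acts.grad.abs().max() <= clamp)  # gradients are expected to lie within [-clamp, clamp]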

espnet2.asr.transducer.rnnt_multi_blank.utils.cpu_utils.cpu_rnnt.log_sum_exp(a: torch.Tensor, b: torch.Tensor)[source]

Logsumexp with safety checks for infs.

espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt

class espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt.GPURNNT(minibatch: int, maxT: int, maxU: int, alphabet_size: int, workspace, blank: int, fastemit_lambda: float, clamp: float, num_threads: int, stream)[source]

Bases: object

Helper class to launch the CUDA Kernels to compute the Transducer Loss.

Parameters:
  • minibatch – Int representing the batch size.

  • maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.

  • maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.

  • alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).

  • workspace – An allocated chunk of memory that will be sliced off and reshaped into required blocks used as working memory.

  • blank – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.

  • fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.

  • clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].

  • num_threads – Number of OMP threads to launch.

  • stream – Numba Cuda Stream.

compute_cost_and_score(acts: torch.Tensor, grads: Optional[torch.Tensor], costs: torch.Tensor, labels: torch.Tensor, label_lengths: torch.Tensor, input_lengths: torch.Tensor) → espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.RNNTStatus[source]

Compute both the loss and the gradients.

Parameters:
  • acts – A flattened tensor of shape [B, T, U, V+1] representing the activation matrix.

  • grads – A flattened zero tensor of the same shape as acts.

  • costs – A zero vector of length B which will be updated inplace with the log probability costs.

  • labels – A flattened matrix of labels of shape [B, U].

  • label_lengths – A vector of length B that contains the original lengths of the target sequences.

  • input_lengths – A vector of length B that contains the original lengths of the acoustic sequences.

Updates:

This will launch kernels that will update inline the following variables: - grads: Gradients of the activation matrix wrt the costs vector. - costs: Negative log likelihood of the forward variable.

Returns:

An enum that either represents a successful RNNT operation or failure.

cost_and_grad(acts: torch.Tensor, grads: torch.Tensor, costs: torch.Tensor, pad_labels: torch.Tensor, label_lengths: torch.Tensor, input_lengths: torch.Tensor)[source]
log_softmax(acts: torch.Tensor, denom: torch.Tensor)[source]

Computes the log softmax denominator of the input activation tensor and stores the result in denom.

Parameters:
  • acts – Activation tensor of shape [B, T, U, V+1]. The input must be represented as a flat tensor of shape [B * T * U * (V+1)] to allow pointer indexing.

  • denom – A zero tensor of same shape as acts.

Updates:

This kernel inplace updates the denom tensor

score_forward(acts: torch.Tensor, costs: torch.Tensor, pad_labels: torch.Tensor, label_lengths: torch.Tensor, input_lengths: torch.Tensor)[source]
class espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt.MultiblankGPURNNT(sigma: float, num_big_blanks: int, minibatch: int, maxT: int, maxU: int, alphabet_size: int, workspace, big_blank_workspace, blank: int, fastemit_lambda: float, clamp: float, num_threads: int, stream)[source]

Bases: espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt.GPURNNT

Helper class to launch the CUDA kernels to compute the Multi-blank Transducer Loss (https://arxiv.org/pdf/2211.03541).

Parameters:
  • sigma – Hyper-parameter related to the logit-normalization method in training multi-blank transducers.

  • num_big_blanks – Number of big blank symbols the model has. This should not include the standard blank symbol.

  • minibatch – Int representing the batch size.

  • maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.

  • maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.

  • alphabet_size – The vocabulary dimension V + 1 + num-big-blanks

  • workspace – An allocated chunk of memory that will be sliced off and reshaped into required blocks used as working memory.

  • big_blank_workspace – An allocated chunk of memory that will be sliced off and reshaped into required blocks used as working memory specifically for the multi-blank related computations.

  • blank – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.

  • fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.

  • clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].

  • num_threads – Number of OMP threads to launch.

  • stream – Numba Cuda Stream.

compute_cost_and_score(acts: torch.Tensor, grads: Optional[torch.Tensor], costs: torch.Tensor, labels: torch.Tensor, label_lengths: torch.Tensor, input_lengths: torch.Tensor) → espnet2.asr.transducer.rnnt_multi_blank.utils.global_constants.RNNTStatus[source]

Compute both the loss and the gradients.

Parameters:
  • acts – A flattened tensor of shape [B, T, U, V + 1 + num_big_blanks] representing the activation matrix.

  • grads – A flattened zero tensor of the same shape as acts.

  • costs – A zero vector of length B which will be updated inplace with the log probability costs.

  • labels – A flattened matrix of labels of shape [B, U].

  • label_lengths – A vector of length B that contains the original lengths of the target sequences.

  • input_lengths – A vector of length B that contains the original lengths of the acoustic sequences.

Updates:

This will launch kernels that will update inline the following variables: - grads: Gradients of the activation matrix wrt the costs vector. - costs: Negative log likelihood of the forward variable.

Returns:

An enum that either represents a successful RNNT operation or failure.

cost_and_grad(acts: torch.Tensor, grads: torch.Tensor, costs: torch.Tensor, pad_labels: torch.Tensor, label_lengths: torch.Tensor, input_lengths: torch.Tensor)[source]
score_forward(acts: torch.Tensor, costs: torch.Tensor, pad_labels: torch.Tensor, label_lengths: torch.Tensor, input_lengths: torch.Tensor)[source]

espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.__init__

espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce

espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.CTAReduce[source]

CUDA Warp reduction kernel.

It is a device kernel to be called by other kernels.

The data will be read from the right segment recursively and reduced (via the reduction operator R_Op) onto the left half. The operation continues while the warp size is larger than a given offset. Beyond this offset, warp reduction is performed via shfl_down_sync, which halves the reduction space and sums the two halves at each call.

Note

Efficient warp reduction occurs when the input shape is a power of two (2^K).

References

Parameters:
  • tid – CUDA thread index

  • x – activation. Single float.

  • storage – shared memory of size CTA_REDUCE_SIZE used for reduction in parallel threads.

  • count – equivalent to num_rows, which is equivalent to alphabet_size (V+1)

  • R_opid – Operator ID for reduction. See R_Op for more information.

class espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.I_Op[source]

Bases: enum.Enum

Represents an operation that is performed on the input tensor

EXPONENTIAL = 0
IDENTITY = 1
class espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.R_Op[source]

Bases: enum.Enum

Represents a reduction operation performed on the input tensor

ADD = 0
MAXIMUM = 1
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.ReduceHelper(I_opid: int, R_opid: int, acts: torch.Tensor, output: torch.Tensor, num_rows: int, num_cols: int, minus: bool, stream)[source]

CUDA warp reduction kernel helper which reduces via R_Op.Add and writes the result to output according to the I_Op id.

The result is stored in the blockIdx.

Note

Efficient warp reduction occurs when the input shape is a power of two (2^K).

References

Parameters:
  • I_opid – Operator ID for input. See I_Op for more information.

  • R_opid – Operator ID for reduction. See R_Op for more information.

  • acts – Flattened activation matrix of shape [B * T * U * (V+1)].

  • output – Flattened output matrix of shape [B * T * U * (V+1)]. Data will be overwritten.

  • num_rows – Vocabulary size (including blank token) - V+1. Represents the number of threads per block.

  • num_cols – Flattened shape of activation matrix, without vocabulary dimension (B * T * U). Represents number of blocks per grid.

  • minus – Bool flag whether to add or subtract as reduction. If minus is set, calls _reduce_minus; otherwise calls the _reduce_rows kernel.

  • stream – CUDA Stream.

espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.reduce_exp(acts: torch.Tensor, denom, rows: int, cols: int, minus: bool, stream)[source]

Helper method to call the Warp Reduction Kernel to perform exp reduction.

Note

Efficient warp reduction occurs when the input shape is a power of two (2^K).

References

Parameters:
  • acts – Flattened activation matrix of shape [B * T * U * (V+1)].

  • denom – Flattened output matrix of shape [B * T * U * (V+1)]. Data will be overwritten.

  • rows – Vocabulary size (including blank token) - V+1. Represents the number of threads per block.

  • cols – Flattened shape of activation matrix, without vocabulary dimension (B * T * U). Represents number of blocks per grid.

  • minus – Bool flag whether to add or subtract as reduction. If minus is set, calls _reduce_minus; otherwise calls the _reduce_rows kernel.

  • stream – CUDA Stream.

espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.reduce_max(acts: torch.Tensor, denom, rows: int, cols: int, minus: bool, stream)[source]

Helper method to call the Warp Reduction Kernel to perform max reduction.

Note

Efficient warp reduction occurs when the input shape is a power of two (2^K).

References

Parameters:
  • acts – Flattened activation matrix of shape [B * T * U * (V+1)].

  • denom – Flattened output matrix of shape [B * T * U * (V+1)]. Data will be overwritten.

  • rows – Vocabulary size (including blank token) - V+1. Represents the number of threads per block.

  • cols – Flattened shape of activation matrix, without vocabulary dimension (B * T * U). Represents number of blocks per grid.

  • minus – Bool flag whether to add or subtract as reduction. If minus is set, calls _reduce_minus; otherwise calls the _reduce_rows kernel.

  • stream – CUDA Stream.

espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel

espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.compute_alphas_kernel[source]

Compute alpha (forward variable) probabilities over the transduction step.

Parameters:
  • acts – Tensor of shape [B, T, U, V+1] flattened. Represents the logprobs activation tensor.

  • denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.

  • alphas – Zero tensor of shape [B, T, U]. Will be updated inside the kernel with the forward variable probabilities.

  • llForward – Zero tensor of shape [B]. Represents the log-likelihood of the forward pass. Returned as the forward pass loss that is reduced by the optimizer.

  • xlen – Vector of length B which contains the actual acoustic sequence lengths in the padded activation tensor.

  • ylen – Vector of length B which contains the actual target sequence lengths in the padded activation tensor.

  • mlabels – Matrix of shape [B, U+1] (+1 here is due to <SOS> token - usually the RNNT blank). The matrix contains the padded target transcription that must be predicted.

  • minibatch – Int representing the batch size.

  • maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.

  • maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.

  • alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).

  • blank_ – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.

Updates:

Kernel inplace updates the following inputs: - alphas: forward variable scores. - llForward: log-likelihood of forward variable.

espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.compute_betas_kernel[source]

Compute beta (backward variable) probabilities over the transduction step.

Parameters:
  • acts – Tensor of shape [B, T, U, V+1] flattened. Represents the logprobs activation tensor.

  • denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.

  • betas – Zero tensor of shape [B, T, U]. Will be updated inside the kernel with the backward variable probabilities.

  • llBackward – Zero tensor of shape [B]. Represents the log-likelihood of the backward pass. Returned as the backward pass loss that is reduced by the optimizer.

  • xlen – Vector of length B which contains the actual acoustic sequence lengths in the padded activation tensor.

  • ylen – Vector of length B which contains the actual target sequence lengths in the padded activation tensor.

  • mlabels – Matrix of shape [B, U+1] (+1 here is due to <SOS> token - usually the RNNT blank). The matrix contains the padded target transcription that must be predicted.

  • minibatch – Int representing the batch size.

  • maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.

  • maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.

  • alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).

  • blank_ – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.

Updates:

Kernel inplace updates the following inputs: - betas: backward variable scores. - llBackward: log-likelihood of backward variable.

espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.compute_grad_kernel[source]

Compute gradients over the transduction step.

Parameters:
  • grads – Zero Tensor of shape [B, T, U, V+1]. Is updated by this kernel to contain the gradients of this batch of samples.

  • acts – Tensor of shape [B, T, U, V+1] flattened. Represents the logprobs activation tensor.

  • denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.

  • alphas – Alpha variable, contains forward probabilities. A tensor of shape [B, T, U].

  • betas – Beta variable, contains backward probabilities. A tensor of shape [B, T, U].

  • logll – Log-likelihood of the forward variable, represented as a vector of shape [B]. Represents the log-likelihood of the forward pass.

  • xlen – Vector of length B which contains the actual acoustic sequence lengths in the padded activation tensor.

  • ylen – Vector of length B which contains the actual target sequence lengths in the padded activation tensor.

  • mlabels – Matrix of shape [B, U+1] (+1 here is due to <SOS> token - usually the RNNT blank). The matrix contains the padded target transcription that must be predicted.

  • minibatch – Int representing the batch size.

  • maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.

  • maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.

  • alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).

  • blank_ – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.

  • fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.

  • clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].

Updates:

Kernel inplace updates the following inputs: - grads: Gradients with respect to the log likelihood (logll).

espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.compute_multiblank_alphas_kernel[source]

Compute alpha (forward variable) probabilities for the multi-blank transducer loss (https://arxiv.org/pdf/2211.03541).

Parameters:
  • acts – Tensor of shape [B, T, U, V + 1 + num_big_blanks] flattened. Represents the logprobs activation tensor.

  • denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.

  • sigma – Hyper-parameter for logit-undernormalization technique for training multi-blank transducers.

  • alphas – Zero tensor of shape [B, T, U]. Will be updated inside the kernel with the forward variable probabilities.

  • llForward – Zero tensor of shape [B]. Represents the log-likelihood of the forward pass. Returned as the forward pass loss that is reduced by the optimizer.

  • xlen – Vector of length B which contains the actual acoustic sequence lengths in the padded activation tensor.

  • ylen – Vector of length B which contains the actual target sequence lengths in the padded activation tensor.

  • mlabels – Matrix of shape [B, U+1] (+1 here is due to <SOS> token - usually the RNNT blank). The matrix contains the padded target transcription that must be predicted.

  • minibatch – Int representing the batch size.

  • maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.

  • maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.

  • alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).

  • blank_ – Index of the RNNT standard blank token in the vocabulary.

  • big_blank_durations – Vector of supported big blank durations of the model.

  • num_big_blanks – Number of big blanks of the model.

Updates:

Kernel inplace updates the following inputs: - alphas: forward variable scores. - llForward: log-likelihood of forward variable.

espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.compute_multiblank_betas_kernel[source]

Compute beta (backward variable) probabilities for the multi-blank transducer loss (https://arxiv.org/pdf/2211.03541).

Parameters:
  • acts – Tensor of shape [B, T, U, V + 1 + num-big-blanks] flattened. Represents the logprobs activation tensor.

  • denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.

  • sigma – Hyper-parameter for logit-undernormalization technique for training multi-blank transducers.

  • betas – Zero tensor of shape [B, T, U]. Will be updated inside the kernel with the backward variable probabilities.

  • llBackward – Zero tensor of shape [B]. Represents the log-likelihood of the backward pass. Returned as the backward pass loss that is reduced by the optimizer.

  • xlen – Vector of length B which contains the actual acoustic sequence lengths in the padded activation tensor.

  • ylen – Vector of length B which contains the actual target sequence lengths in the padded activation tensor.

  • mlabels – Matrix of shape [B, U+1] (+1 here is due to <SOS> token - usually the RNNT blank). The matrix contains the padded target transcription that must be predicted.

  • minibatch – Int representing the batch size.

  • maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.

  • maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.

  • alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).

  • blank_ – Index of the RNNT standard blank token in the vocabulary.

  • big_blank_durations – Vector of supported big blank durations of the model.

  • num_big_blanks – Number of big blanks of the model.

Updates:

Kernel inplace updates the following inputs: - betas: backward variable scores. - llBackward: log-likelihood of backward variable.

espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.compute_multiblank_grad_kernel[source]

Compute gradients for the multi-blank transducer loss (https://arxiv.org/pdf/2211.03541).

Parameters:
  • grads – Zero Tensor of shape [B, T, U, V + 1 + num_big_blanks]. Is updated by this kernel to contain the gradients of this batch of samples.

  • acts – Tensor of shape [B, T, U, V + 1 + num_big_blanks] flattened. Represents the logprobs activation tensor.

  • denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.

  • sigma – Hyper-parameter for logit-undernormalization technique for training multi-blank transducers.

  • alphas – Alpha variable, contains forward probabilities. A tensor of shape [B, T, U].

  • betas – Beta variable, contains backward probabilities. A tensor of shape [B, T, U].

  • logll – Log-likelihood of the forward variable, represented as a vector of shape [B]. Represents the log-likelihood of the forward pass.

  • xlen – Vector of length B which contains the actual acoustic sequence lengths in the padded activation tensor.

  • ylen – Vector of length B which contains the actual target sequence lengths in the padded activation tensor.

  • mlabels – Matrix of shape [B, U+1] (+1 here is due to <SOS> token - usually the RNNT blank). The matrix contains the padded target transcription that must be predicted.

  • minibatch – Int representing the batch size.

  • maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.

  • maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.

  • alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).

  • blank_ – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.

  • fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.

  • clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].

  • big_blank_durations – Vector of supported big blank durations of the model.

  • num_big_blanks – Number of big blanks of the model.

Updates:

Kernel inplace updates the following inputs: - grads: Gradients with respect to the log likelihood (logll).

espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt_kernel.logp[source]

Compute the sum of the log probability from the activation tensor and its denominator.

Parameters:
  • denom – Tensor of shape [B, T, U] flattened. Represents the denominator of the logprobs activation tensor across entire vocabulary.

  • acts – Tensor of shape [B, T, U, V+1] flattened. Represents the logprobs activation tensor.

  • maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.

  • maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.

  • alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).

  • mb – Batch indexer.

  • t – Acoustic sequence timestep indexer.

  • u – Target sequence timestep indexer.

  • v – Vocabulary token indexer.

Returns:

The sum of logprobs[mb, t, u, v] + denom[mb, t, u]

espnet2.asr.layers.fastformer

Fastformer attention definition.

Reference:

Wu et al., “Fastformer: Additive Attention Can Be All You Need” https://arxiv.org/abs/2108.09084 https://github.com/wuch15/Fastformer

class espnet2.asr.layers.fastformer.FastSelfAttention(size, attention_heads, dropout_rate)[source]

Bases: torch.nn.modules.module.Module

Fast self-attention used in Fastformer.

espnet_initialization_fn()[source]
forward(xs_pad, mask)[source]

Forward method.

Parameters:
  • xs_pad – (batch, time, size = n_heads * attn_dim)

  • mask – (batch, 1, time), nonpadding is 1, padding is 0

Returns:

(batch, time, size)

Return type:

torch.Tensor

init_weights(module)[source]
transpose_for_scores(x)[source]

Reshape and transpose to compute scores.

Parameters:

x – (batch, time, size = n_heads * attn_dim)

Returns:

(batch, n_heads, time, attn_dim)
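
A minimal usage sketch for FastSelfAttention. The mask here is a float tensor following the 1/0 convention documented above; adjust the dtype if your version expects a boolean mask.

import torch
from espnet2.asr.layers.fastformer import FastSelfAttention

batch, time, n_heads, attn_dim = 2, 50, 4, 64
attn = FastSelfAttention(size=n_heads * attn_dim, attention_heads=n_heads, dropout_rate=0.1)

xs_pad = torch.randn(batch, time, n_heads * attn_dim)
mask = torch.ones(batch, 1, time)  # 1 = non-padding, 0 = padding
mask[1, :, 40:] = 0                # pretend the second utterance is shorter

out = attn(xs_pad, mask)
print(out.shape)                   # expected (batch, time, size)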

espnet2.asr.layers.__init__

espnet2.asr.layers.cgmlp

MLP with convolutional gating (cgMLP) definition.

References

https://openreview.net/forum?id=RA-zVvZLYIy https://arxiv.org/abs/2105.08050

class espnet2.asr.layers.cgmlp.ConvolutionalGatingMLP(size: int, linear_units: int, kernel_size: int, dropout_rate: float, use_linear_after_conv: bool, gate_activation: str)[source]

Bases: torch.nn.modules.module.Module

Convolutional Gating MLP (cgMLP).

forward(x, mask)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.asr.layers.cgmlp.ConvolutionalSpatialGatingUnit(size: int, kernel_size: int, dropout_rate: float, use_linear_after_conv: bool, gate_activation: str)[source]

Bases: torch.nn.modules.module.Module

Convolutional Spatial Gating Unit (CSGU).

espnet_initialization_fn()[source]
forward(x, gate_add=None)[source]

Forward method

Parameters:
  • x (torch.Tensor) – (N, T, D)

  • gate_add (torch.Tensor) – (N, T, D/2)

Returns:

(N, T, D/2)

Return type:

out (torch.Tensor)
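
To make the documented shapes concrete, a small sketch of the cgMLP block. It assumes the module maps (N, T, size) back to (N, T, size) by projecting to linear_units, halving that dimension inside the CSGU, and projecting back, and that the mask argument is not used internally; pass a real mask if your version requires one.

import torch
from espnet2.asr.layers.cgmlp import ConvolutionalGatingMLP

N, T, size = 2, 50, 256
cgmlp = ConvolutionalGatingMLP(
    size=size,
    linear_units=1024,
    kernel_size=31,
    dropout_rate=0.1,
    use_linear_after_conv=False,
    gate_activation="identity",
)

x = torch.randn(N, T, size)
out = cgmlp(x, None)   # mask assumed unused; see the note above
print(out.shape)       # expected (N, T, size)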

espnet2.asr.encoder.avhubert_encoder

Encoder definition.

class espnet2.asr.encoder.avhubert_encoder.AVHubertConfig(sample_rate: int = 16000, label_rate: int = -1, encoder_layers: int = 12, encoder_embed_dim: int = 768, encoder_ffn_embed_dim: int = 3072, encoder_attention_heads: int = 12, activation_fn: str = 'gelu', dropout: float = 0.1, attention_dropout: float = 0.1, activation_dropout: float = 0.0, encoder_layerdrop: float = 0.0, dropout_input: float = 0.0, dropout_features: float = 0.0, final_dim: int = 0, untie_final_proj: bool = False, layer_norm_first: bool = False, conv_feature_layers: str = '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512, 2, 2)] * 2', conv_bias: bool = False, logit_temp: float = 0.1, target_glu: bool = False, feature_grad_mult: float = 1.0, mask_length_audio: int = 10, mask_prob_audio: float = 0.65, mask_length_image: int = 10, mask_prob_image: float = 0.65, mask_selection: str = 'static', mask_other: float = 0, no_mask_overlap: bool = False, mask_min_space: int = 1, mask_channel_length: int = 10, mask_channel_prob: float = 0.0, mask_channel_selection: str = 'static', mask_channel_other: float = 0, no_mask_channel_overlap: bool = False, mask_channel_min_space: int = 1, conv_pos: int = 128, conv_pos_groups: int = 16, latent_temp: Tuple[float, float, float] = (2, 0.5, 0.999995), skip_masked: bool = False, skip_nomask: bool = False, resnet_relu_type: str = 'prelu', resnet_weights: Optional[str] = None, sim_type: str = 'cosine', sub_encoder_layers: int = 0, audio_feat_dim: int = -1, modality_dropout: float = 0, audio_dropout: float = 0, modality_fuse: str = 'concat', selection_type: str = 'same_other_seq', masking_type: str = 'input', decoder_embed_dim: int = 768, decoder_ffn_embed_dim: int = 3072, decoder_layers: int = 6, decoder_layerdrop: float = 0.0, decoder_attention_heads: int = 4, decoder_learned_pos: bool = False, decoder_normalize_before: bool = False, no_token_positional_embeddings: bool = False, decoder_dropout: float = 0.1, decoder_attention_dropout: float = 0.1, decoder_activation_dropout: float = 0.0, max_target_positions: int = 2048, share_decoder_input_output_embed: bool = False, audio_only: bool = False, no_scale_embedding: bool = True)[source]

Bases: object

Configuration from the original AVHubert GitHub repository.

activation_dropout = 0.0
activation_fn = 'gelu'
attention_dropout = 0.1
audio_dropout = 0
audio_feat_dim = -1
audio_only = False
conv_bias = False
conv_feature_layers = '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2'
conv_pos = 128
conv_pos_groups = 16
decoder_activation_dropout = 0.0
decoder_attention_dropout = 0.1
decoder_attention_heads = 4
decoder_dropout = 0.1
decoder_embed_dim = 768
decoder_ffn_embed_dim = 3072
decoder_layerdrop = 0.0
decoder_layers = 6
decoder_learned_pos = False
decoder_normalize_before = False
dropout = 0.1
dropout_features = 0.0
dropout_input = 0.0
encoder_attention_heads = 12
encoder_embed_dim = 768
encoder_ffn_embed_dim = 3072
encoder_layerdrop = 0.0
encoder_layers = 12
feature_grad_mult = 1.0
final_dim = 0
label_rate = -1
latent_temp = (2, 0.5, 0.999995)
layer_norm_first = False
logit_temp = 0.1
mask_channel_length = 10
mask_channel_min_space = 1
mask_channel_other = 0
mask_channel_prob = 0.0
mask_channel_selection = 'static'
mask_length_audio = 10
mask_length_image = 10
mask_min_space = 1
mask_other = 0
mask_prob_audio = 0.65
mask_prob_image = 0.65
mask_selection = 'static'
masking_type = 'input'
max_target_positions = 2048
modality_dropout = 0
modality_fuse = 'concat'
no_mask_channel_overlap = False
no_mask_overlap = False
no_scale_embedding = True
no_token_positional_embeddings = False
resnet_relu_type = 'prelu'
resnet_weights = None
sample_rate = 16000
selection_type = 'same_other_seq'
share_decoder_input_output_embed = False
sim_type = 'cosine'
skip_masked = False
skip_nomask = False
sub_encoder_layers = 0
target_glu = False
untie_final_proj = False
class espnet2.asr.encoder.avhubert_encoder.AVHubertModel(cfg: espnet2.asr.encoder.avhubert_encoder.AVHubertConfig, **kwargs)[source]

Bases: torch.nn.modules.module.Module

classmethod build_model(cfg: espnet2.asr.encoder.avhubert_encoder.AVHubertConfig)[source]

Build a new model instance.

extract_finetune(source, padding_mask=None, mask=False, ret_conv=False, output_layer=None)[source]

Forward AVHubert Pretrain Encoder.

Parameters:
  • source['video'] – input tensor (B, 1, L, H, W)

  • source['audio'] – input tensor (B, F, L)

  • padding_mask – input tensor (B, L)

Returns:

encoded tensor and mask

forward_audio(source_audio)[source]
forward_features(source: torch.Tensor, modality: str) → torch.Tensor[source]
forward_padding_mask(features: torch.Tensor, padding_mask: torch.Tensor) → torch.Tensor[source]
forward_transformer(source, padding_mask=None, output_layer=None)[source]

Forward AVHubert Pretrain Encoder (without frontend).

Assume the source is already a fused feature.

Parameters:
  • source – input tensor (B, L, D*2)

  • padding_mask – input tensor (B, L)

Returns:

encoded tensor and mask

forward_video(source_video)[source]
modality_fusion(features_audio, features_video)[source]
class espnet2.asr.encoder.avhubert_encoder.BasicBlock(inplanes, planes, stride=1, downsample=None, relu_type='relu')[source]

Bases: torch.nn.modules.module.Module

expansion = 1
forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.asr.encoder.avhubert_encoder.FairseqAVHubertEncoder(input_size: int = 1, avhubert_url: str = './', avhubert_dir_path: str = './', freeze_finetune_updates: int = 0, encoder_embed_dim: int = 1024, encoder_layerdrop: float = 0.05, dropout_input: float = 0.1, dropout_features: float = 0.1, dropout: float = 0.1, attention_dropout: float = 0.1, feature_grad_mult: float = 0.1, activation_dropout: float = 0.0, wav_input: bool = False, layer_norm_first: bool = True, audio_feat_dim: int = 104, encoder_layers: int = 24, encoder_ffn_embed_dim: int = 4096, encoder_attention_heads: int = 16, extracted: bool = False, pretrain: bool = True, modality_dropout: float = 0.0, audio_dropout: float = 0.0, noise_augmentation: bool = False, noise_path: str = './data/babble_noise.pt', max_noise_weight: float = 0.5, audio_only: bool = False)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

FairSeq AVHubert pretrained encoder module

Parameters:
  • input_size – input dim

  • avhubert_url – download link for pre-trained avhubert model

  • avhubert_dir_path – dir_path for downloading pre-trained avhubert model

forward(xs_pad: Dict[str, torch.Tensor], ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Forward AVHubert Encoder.

Parameters:
  • xs_pad[video] – input tensor (B, 1, L, H, W)

  • xs_pad[audio] – input tensor (B, D, L)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns:

position embedded tensor and mask

forward_fusion(xs_pad: Dict[str, torch.Tensor]) → torch.Tensor[source]
output_size() → int[source]
reload_pretrained_parameters()[source]
class espnet2.asr.encoder.avhubert_encoder.GradMultiply(*args, **kwargs)[source]

Bases: torch.autograd.function.Function

static backward(ctx, grad)[source]

Defines a formula for differentiating the operation with backward mode automatic differentiation (alias to the vjp function).

This function is to be overridden by all subclasses.

It must accept a context ctx as the first argument, followed by as many outputs as the forward() returned (None will be passed in for non tensor outputs of the forward function), and it should return as many tensors, as there were inputs to forward(). Each argument is the gradient w.r.t the given output, and each returned value should be the gradient w.r.t. the corresponding input. If an input is not a Tensor or is a Tensor not requiring grads, you can just pass None as a gradient for that input.

The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.

static forward(ctx, x, scale)[source]

This function is to be overridden by all subclasses. There are two ways to define forward:

Usage 1 (Combined forward and ctx):

@staticmethod
def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
    pass
  • It must accept a context ctx as the first argument, followed by any number of arguments (tensors or other types).

  • See combining-forward-context for more details

Usage 2 (Separate forward and ctx):

@staticmethod
def forward(*args: Any, **kwargs: Any) -> Any:
    pass

@staticmethod
def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
    pass
  • The forward no longer accepts a ctx argument.

  • Instead, you must also override the torch.autograd.Function.setup_context() staticmethod to handle setting up the ctx object. output is the output of the forward, inputs are a Tuple of inputs to the forward.

  • See extending-autograd for more details

The context can be used to store arbitrary data that can be then retrieved during the backward pass. Tensors should not be stored directly on ctx (though this is not currently enforced for backward compatibility). Instead, tensors should be saved either with ctx.save_for_backward() if they are intended to be used in backward (equivalently, vjp) or ctx.save_for_forward() if they are intended to be used for in jvp.
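
GradMultiply is the usual helper behind options such as feature_grad_mult: the forward pass appears to return x unchanged (as a copy) while the backward pass scales the incoming gradient by the given factor. A small sketch, assuming the module imports cleanly without optional AVHubert dependencies:

import torch
from espnet2.asr.encoder.avhubert_encoder import GradMultiply

features = torch.randn(2, 100, 512, requires_grad=True)
scaled = GradMultiply.apply(features, 0.1)  # identity-like forward, gradient * 0.1 in backward
scaled.sum().backward()
print(features.grad.unique())               # expected to be 0.1 everywhere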

class espnet2.asr.encoder.avhubert_encoder.ResEncoder(relu_type, weights)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

threeD_to_2D_tensor(x)[source]
class espnet2.asr.encoder.avhubert_encoder.ResNet(block, layers, num_classes=1000, relu_type='relu', gamma_zero=False, avg_pool_downsample=False)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.asr.encoder.avhubert_encoder.SamePad(kernel_size, causal=False)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.asr.encoder.avhubert_encoder.SubModel(resnet=None, input_dim=None, cfg=None)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.asr.encoder.avhubert_encoder.TransformerEncoder(args)[source]

Bases: torch.nn.modules.module.Module

From AVHubert github

extract_features(x, padding_mask=None, tgt_layer=None)[source]
forward(x, padding_mask=None, layer=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

max_positions()[source]

Maximum output length supported by the encoder.

upgrade_state_dict_named(state_dict, name)[source]

Upgrade a (possibly old) state dict for new versions of fairseq.

espnet2.asr.encoder.avhubert_encoder.conv3x3(in_planes, out_planes, stride=1)[source]
espnet2.asr.encoder.avhubert_encoder.download_avhubert(model_url, dir_path)[source]
espnet2.asr.encoder.avhubert_encoder.downsample_basic_block(inplanes, outplanes, stride)[source]
espnet2.asr.encoder.avhubert_encoder.downsample_basic_block_v2(inplanes, outplanes, stride)[source]
espnet2.asr.encoder.avhubert_encoder.index_put(tensor, indices, value)[source]
espnet2.asr.encoder.avhubert_encoder.is_xla_tensor(tensor)[source]
espnet2.asr.encoder.avhubert_encoder.time_masking(xs_pad, min_T=5, max_T=20)[source]

Mask contiguous frames with a random length in [min_T, max_T].

espnet2.asr.encoder.e_branchformer_encoder

E-Branchformer encoder definition. Reference:

Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe, “E-Branchformer: Branchformer with Enhanced merging for speech recognition,” in SLT 2022.

class espnet2.asr.encoder.e_branchformer_encoder.EBranchformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, attention_layer_type: str = 'rel_selfattn', pos_enc_layer_type: str = 'rel_pos', rel_pos_type: str = 'latest', cgmlp_linear_units: int = 2048, cgmlp_conv_kernel: int = 31, use_linear_after_conv: bool = False, gate_activation: str = 'identity', num_blocks: int = 12, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', zero_triu: bool = False, padding_idx: int = -1, layer_drop_rate: float = 0.0, max_pos_emb_len: int = 5000, use_ffn: bool = False, macaron_ffn: bool = False, ffn_activation_type: str = 'swish', linear_units: int = 2048, positionwise_layer_type: str = 'linear', merge_conv_kernel: int = 3, interctc_layer_idx=None, interctc_use_conditioning: bool = False)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

E-Branchformer encoder module.

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None, max_layer: int = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Calculate forward propagation.

Parameters:
  • xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).

  • ilens (torch.Tensor) – Input length (#batch).

  • prev_states (torch.Tensor) – Not to be used now.

  • ctc (CTC) – Intermediate CTC module.

  • max_layer (int) – Layer depth below which InterCTC is applied.

Returns:

Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.

Return type:

torch.Tensor

output_size() → int[source]
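
A minimal end-to-end sketch of the encoder. The hyperparameters are illustrative, and the default conv2d input layer subsamples along time, so the output length is shorter than ilens.

import torch
from espnet2.asr.encoder.e_branchformer_encoder import EBranchformerEncoder

encoder = EBranchformerEncoder(input_size=80, output_size=256, num_blocks=2)

xs_pad = torch.randn(2, 200, 80)   # (#batch, L, input_size), e.g. log-mel features
ilens = torch.tensor([200, 180])   # (#batch,)
out, out_lens, _ = encoder(xs_pad, ilens)
print(out.shape, out_lens)         # (#batch, L', 256) with subsampled lengths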
class espnet2.asr.encoder.e_branchformer_encoder.EBranchformerEncoderLayer(size: int, attn: torch.nn.modules.module.Module, cgmlp: torch.nn.modules.module.Module, feed_forward: Optional[torch.nn.modules.module.Module], feed_forward_macaron: Optional[torch.nn.modules.module.Module], dropout_rate: float, merge_conv_kernel: int = 3)[source]

Bases: torch.nn.modules.module.Module

E-Branchformer encoder layer module.

Parameters:
  • size (int) – model dimension

  • attn – standard self-attention or efficient attention

  • cgmlp – ConvolutionalGatingMLP

  • feed_forward – feed-forward module, optional

  • feed_forward_macaron – macaron-style feed-forward module, optional

  • dropout_rate (float) – dropout probability

  • merge_conv_kernel (int) – kernel size of the depth-wise conv in merge module

forward(x_input, mask, cache=None)[source]

Compute encoded features.

Parameters:
  • x_input (Union[Tuple, torch.Tensor]) – Input tensor w/ or w/o pos emb. - w/ pos emb: Tuple of tensors [(#batch, time, size), (1, time, size)]. - w/o pos emb: Tensor (#batch, time, size).

  • mask (torch.Tensor) – Mask tensor for the input (#batch, 1, time).

  • cache (torch.Tensor) – Cache tensor of the input (#batch, time - 1, size).

Returns:

Output tensor (#batch, time, size). torch.Tensor: Mask tensor (#batch, time).

Return type:

torch.Tensor

espnet2.asr.encoder.longformer_encoder

Conformer encoder definition.

class espnet2.asr.encoder.longformer_encoder.LongformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'abs_pos', selfattention_layer_type: str = 'lf_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False, attention_windows: list = [100, 100, 100, 100, 100, 100], attention_dilation: list = [1, 1, 1, 1, 1, 1], attention_mode: str = 'sliding_chunks')[source]

Bases: espnet2.asr.encoder.conformer_encoder.ConformerEncoder

Longformer SA Conformer encoder module.

Parameters:
  • input_size (int) – Input dimension.

  • output_size (int) – Dimension of attention.

  • attention_heads (int) – The number of heads of multi head attention.

  • linear_units (int) – The number of units of position-wise feed forward.

  • num_blocks (int) – The number of decoder blocks.

  • dropout_rate (float) – Dropout rate.

  • attention_dropout_rate (float) – Dropout rate in attention.

  • positional_dropout_rate (float) – Dropout rate after adding positional encoding.

  • input_layer (Union[str, torch.nn.Module]) – Input layer type.

  • normalize_before (bool) – Whether to use layer_norm before the first block.

  • concat_after (bool) – Whether to concat attention layer’s input and output. If True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) If False, no additional linear will be applied. i.e. x -> x + att(x)

  • positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.

  • positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.

  • rel_pos_type (str) – Whether to use the latest relative positional encoding or the legacy one. The legacy relative positional encoding will be deprecated in the future. More Details can be found in https://github.com/espnet/espnet/pull/2816.

  • encoder_pos_enc_layer_type (str) – Encoder positional encoding layer type.

  • encoder_attn_layer_type (str) – Encoder attention layer type.

  • activation_type (str) – Encoder activation function type.

  • macaron_style (bool) – Whether to use macaron style for positionwise layer.

  • use_cnn_module (bool) – Whether to use convolution module.

  • zero_triu (bool) – Whether to zero the upper triangular part of attention matrix.

  • cnn_module_kernel (int) – Kernel size of the convolution module.

  • padding_idx (int) – Padding idx for input_layer=embed.

  • attention_windows (list) – Layer-wise attention window sizes for longformer self-attn

  • attention_dilation (list) – Layer-wise attention dilation sizes for longformer self-attn

  • attention_mode (str) – Implementation for longformer self-attn. Default: “sliding_chunks”. Choose ‘n2’, ‘tvm’ or ‘sliding_chunks’. More details in https://github.com/allenai/longformer

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None, return_all_hs: bool = False) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Calculate forward propagation.

Parameters:
  • xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).

  • ilens (torch.Tensor) – Input length (#batch).

  • prev_states (torch.Tensor) – Not to be used now.

  • ctc (CTC) – ctc module for intermediate CTC loss

  • return_all_hs (bool) – whether to return all hidden states

Returns:

Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.

Return type:

torch.Tensor

output_size() → int[source]

espnet2.asr.encoder.contextual_block_transformer_encoder

Encoder definition.

class espnet2.asr.encoder.contextual_block_transformer_encoder.ContextualBlockTransformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.StreamPositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1, block_size: int = 40, hop_size: int = 16, look_ahead: int = 16, init_average: bool = True, ctx_pos_enc: bool = True)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

Contextual Block Transformer encoder module.

Details in Tsunoo et al. “Transformer ASR with contextual block processing” (https://arxiv.org/abs/1910.07204)

Parameters:
  • input_size – input dim

  • output_size – dimension of attention

  • attention_heads – the number of heads of multi head attention

  • linear_units – the number of units of position-wise feed forward

  • num_blocks – the number of encoder blocks

  • dropout_rate – dropout rate

  • attention_dropout_rate – dropout rate in attention

  • positional_dropout_rate – dropout rate after adding positional encoding

  • input_layer – input layer type

  • pos_enc_class – PositionalEncoding or ScaledPositionalEncoding

  • normalize_before – whether to use layer_norm before the first block

  • concat_after – whether to concat attention layer’s input and output if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)

  • positionwise_layer_type – linear or conv1d

  • positionwise_conv_kernel_size – kernel size of positionwise conv1d layer

  • padding_idx – padding_idx for input_layer=embed

  • block_size – block size for contextual block processing

  • hop_size – hop size for block processing

  • look_ahead – look-ahead size for block_processing

  • init_average – whether to use average as initial context (otherwise max values)

  • ctx_pos_enc – whether to use positional encoding to the context vectors

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final=True, infer_mode=False) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters:
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

  • infer_mode – whether to be used for inference. This is used to distinguish between forward_train (train and validate) and forward_infer (decode).

Returns:

position embedded tensor and mask

forward_infer(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final: bool = True) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters:
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns:

position embedded tensor and mask

forward_train(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters:
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns:

position embedded tensor and mask

output_size() → int[source]
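
Example: a minimal usage sketch of the contextual block Transformer encoder. The feature dimension, batch, and block/hop/look-ahead values are illustrative only, and the streaming call pattern is indicated as comments since the exact state handling depends on the decoding frontend.

    import torch
    from espnet2.asr.encoder.contextual_block_transformer_encoder import (
        ContextualBlockTransformerEncoder,
    )

    # 80-dim features, small model for illustration
    encoder = ContextualBlockTransformerEncoder(
        input_size=80, output_size=256, num_blocks=2,
        block_size=40, hop_size=16, look_ahead=16,
    )
    xs_pad = torch.randn(2, 200, 80)      # (batch, frames, feature dim)
    ilens = torch.tensor([200, 160])

    # offline (training / validation) path
    out, olens, _ = encoder(xs_pad, ilens)

    # For streaming decoding, chunks are fed with infer_mode=True and the
    # returned states are passed back via prev_states, with is_final=True
    # on the last chunk; see forward_infer above.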

espnet2.asr.encoder.branchformer_encoder

Branchformer encoder definition.

Reference:

Yifan Peng, Siddharth Dalmia, Ian Lane, and Shinji Watanabe, “Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding,” in Proceedings of ICML, 2022.

class espnet2.asr.encoder.branchformer_encoder.BranchformerEncoder(input_size: int, output_size: int = 256, use_attn: bool = True, attention_heads: int = 4, attention_layer_type: str = 'rel_selfattn', pos_enc_layer_type: str = 'rel_pos', rel_pos_type: str = 'latest', use_cgmlp: bool = True, cgmlp_linear_units: int = 2048, cgmlp_conv_kernel: int = 31, use_linear_after_conv: bool = False, gate_activation: str = 'identity', merge_method: str = 'concat', cgmlp_weight: Union[float, List[float]] = 0.5, attn_branch_drop_rate: Union[float, List[float]] = 0.0, num_blocks: int = 12, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', zero_triu: bool = False, padding_idx: int = -1, stochastic_depth_rate: Union[float, List[float]] = 0.0)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

Branchformer encoder module.

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Calculate forward propagation.

Parameters:
  • xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).

  • ilens (torch.Tensor) – Input length (#batch).

  • prev_states (torch.Tensor) – Not to be used now.

Returns:

Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.

Return type:

torch.Tensor

output_size() → int[source]
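
Example: the sketch below instantiates a small Branchformer encoder and runs a padded batch through it; the sizes are illustrative, not tuned recipe values.

    import torch
    from espnet2.asr.encoder.branchformer_encoder import BranchformerEncoder

    # small configuration for illustration; the defaults use both the
    # attention branch and the cgmlp branch, merged by concatenation
    encoder = BranchformerEncoder(input_size=80, output_size=256, num_blocks=2)
    xs_pad = torch.randn(2, 120, 80)       # (batch, frames, feature dim)
    ilens = torch.tensor([120, 90])
    out, olens, _ = encoder(xs_pad, ilens)
    print(out.shape, encoder.output_size())  # encoded features and model dim
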
class espnet2.asr.encoder.branchformer_encoder.BranchformerEncoderLayer(size: int, attn: Optional[torch.nn.modules.module.Module], cgmlp: Optional[torch.nn.modules.module.Module], dropout_rate: float, merge_method: str, cgmlp_weight: float = 0.5, attn_branch_drop_rate: float = 0.0, stochastic_depth_rate: float = 0.0)[source]

Bases: torch.nn.modules.module.Module

Branchformer encoder layer module.

Parameters:
  • size (int) – model dimension

  • attn – standard self-attention or efficient attention, optional

  • cgmlp – ConvolutionalGatingMLP, optional

  • dropout_rate (float) – dropout probability

  • merge_method (str) – concat, learned_ave, fixed_ave

  • cgmlp_weight (float) – weight of the cgmlp branch, between 0 and 1, used if merge_method is fixed_ave

  • attn_branch_drop_rate (float) – probability of dropping the attn branch, used if merge_method is learned_ave

  • stochastic_depth_rate (float) – stochastic depth probability

forward(x_input, mask, cache=None)[source]

Compute encoded features.

Parameters:
  • x_input (Union[Tuple, torch.Tensor]) – Input tensor w/ or w/o pos emb. - w/ pos emb: Tuple of tensors [(#batch, time, size), (1, time, size)]. - w/o pos emb: Tensor (#batch, time, size).

  • mask (torch.Tensor) – Mask tensor for the input (#batch, 1, time).

  • cache (torch.Tensor) – Cache tensor of the input (#batch, time - 1, size).

Returns:

Output tensor (#batch, time, size). torch.Tensor: Mask tensor (#batch, time).

Return type:

torch.Tensor

espnet2.asr.encoder.transformer_encoder

Transformer encoder definition.

class espnet2.asr.encoder.transformer_encoder.TransformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False, layer_drop_rate: float = 0.0)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

Transformer encoder module.

Parameters:
  • input_size – input dim

  • output_size – dimension of attention

  • attention_heads – the number of heads of multi head attention

  • linear_units – the number of units of position-wise feed forward

  • num_blocks – the number of encoder blocks

  • dropout_rate – dropout rate

  • attention_dropout_rate – dropout rate in attention

  • positional_dropout_rate – dropout rate after adding positional encoding

  • input_layer – input layer type

  • pos_enc_class – PositionalEncoding or ScaledPositionalEncoding

  • normalize_before – whether to use layer_norm before the first block

  • concat_after – whether to concat attention layer's input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x)

  • positionwise_layer_type – linear or conv1d

  • positionwise_conv_kernel_size – kernel size of positionwise conv1d layer

  • padding_idx – padding_idx for input_layer=embed

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None, return_all_hs: bool = False) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters:
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

  • ctc (CTC) – ctc module for intermediate CTC loss

  • return_all_hs (bool) – whether to return all hidden states

Returns:

position embedded tensor and mask

output_size() → int[source]
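
Example: a minimal sketch of calling the Transformer encoder; the default conv2d input layer subsamples the time axis, so the returned lengths are shorter than ilens. All values are illustrative.

    import torch
    from espnet2.asr.encoder.transformer_encoder import TransformerEncoder

    encoder = TransformerEncoder(input_size=80, output_size=256, num_blocks=2)
    xs_pad = torch.randn(2, 100, 80)       # (batch, frames, feature dim)
    ilens = torch.tensor([100, 80])
    out, olens, _ = encoder(xs_pad, ilens)
    # out: (2, T', 256) with T' < 100 due to conv2d subsampling; olens: (2,)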

espnet2.asr.encoder.whisper_encoder

class espnet2.asr.encoder.whisper_encoder.OpenAIWhisperEncoder(input_size: int = 1, dropout_rate: float = 0.0, whisper_model: str = 'small', download_dir: Optional[str] = None, use_specaug: bool = False, specaug_conf: Optional[dict] = None, do_pad_trim: bool = False)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

Transformer-based Speech Encoder from OpenAI’s Whisper Model:

URL: https://github.com/openai/whisper

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

log_mel_spectrogram(audio: torch.Tensor, ilens: torch.Tensor = None) → torch.Tensor[source]

Use log-mel spectrogram computation native to Whisper training

output_size() → int[source]
pad_or_trim(array: torch.Tensor, length: int, axis: int = -1) → torch.Tensor[source]

Pad or trim the audio array to N_SAMPLES.

Used in zero-shot inference cases.

whisper_encode(input: torch.Tensor, ilens: torch.Tensor = None) → torch.Tensor[source]
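
Example: the Whisper encoder consumes raw 16 kHz waveforms rather than filterbank features and computes log-mel spectrograms internally. The sketch below is an assumption-laden illustration: it presumes the openai-whisper package is installed, that the "tiny" checkpoint (an illustrative choice) can be downloaded, and it enables do_pad_trim so the audio is padded to Whisper's fixed 30-second window.

    import torch
    from espnet2.asr.encoder.whisper_encoder import OpenAIWhisperEncoder

    encoder = OpenAIWhisperEncoder(whisper_model="tiny", do_pad_trim=True)
    speech = torch.randn(2, 16000)         # raw waveforms, 1 second each
    lengths = torch.tensor([16000, 12000])
    out, olens, _ = encoder(speech, lengths)
    print(out.shape, encoder.output_size())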

espnet2.asr.encoder.rnn_encoder

class espnet2.asr.encoder.rnn_encoder.RNNEncoder(input_size: int, rnn_type: str = 'lstm', bidirectional: bool = True, use_projection: bool = True, num_layers: int = 4, hidden_size: int = 320, output_size: int = 320, dropout: float = 0.0, subsample: Optional[Sequence[int]] = (2, 2, 1, 1))[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

RNNEncoder class.

Parameters:
  • input_size – The number of expected features in the input

  • output_size – The number of output features

  • hidden_size – The number of hidden features

  • bidirectional – If True becomes a bidirectional LSTM

  • use_projection – Use projection layer or not

  • num_layers – Number of recurrent layers

  • dropout – dropout probability

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size() → int[source]
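
Example: a minimal sketch of the RNN encoder; with the default subsample of (2, 2, 1, 1) the frame rate is reduced by a factor of four across the first two layers. Sizes are illustrative.

    import torch
    from espnet2.asr.encoder.rnn_encoder import RNNEncoder

    encoder = RNNEncoder(input_size=80, hidden_size=320, output_size=320,
                         num_layers=4)
    xs_pad = torch.randn(2, 100, 80)       # (batch, frames, feature dim)
    ilens = torch.tensor([100, 70])
    out, olens, states = encoder(xs_pad, ilens)
    # out: (2, T', 320); olens reflects the subsampled lengths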

espnet2.asr.encoder.contextual_block_conformer_encoder

Created on Sat Aug 21 17:27:16 2021.

@author: Keqi Deng (UCAS)

class espnet2.asr.encoder.contextual_block_conformer_encoder.ContextualBlockConformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.StreamPositionalEncoding'>, selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, cnn_module_kernel: int = 31, padding_idx: int = -1, block_size: int = 40, hop_size: int = 16, look_ahead: int = 16, init_average: bool = True, ctx_pos_enc: bool = True)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

Contextual Block Conformer encoder module.

Parameters:
  • input_size – input dim

  • output_size – dimension of attention

  • attention_heads – the number of heads of multi head attention

  • linear_units – the number of units of position-wise feed forward

  • num_blocks – the number of encoder blocks

  • dropout_rate – dropout rate

  • attention_dropout_rate – dropout rate in attention

  • positional_dropout_rate – dropout rate after adding positional encoding

  • input_layer – input layer type

  • pos_enc_class – PositionalEncoding or ScaledPositionalEncoding

  • normalize_before – whether to use layer_norm before the first block

  • concat_after – whether to concat attention layer's input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x)

  • positionwise_layer_type – linear or conv1d

  • positionwise_conv_kernel_size – kernel size of positionwise conv1d layer

  • padding_idx – padding_idx for input_layer=embed

  • block_size – block size for contextual block processing

  • hop_size – hop size for block processing

  • look_ahead – look-ahead size for block processing

  • init_average – whether to use average as initial context (otherwise max values)

  • ctx_pos_enc – whether to apply positional encoding to the context vectors

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final=True, infer_mode=False) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters:
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

  • infer_mode – whether to be used for inference. This is used to distinguish between forward_train (train and validate) and forward_infer (decode).

Returns:

position embedded tensor and mask

forward_infer(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final: bool = True) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters:
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns:

position embedded tensor and mask

forward_train(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters:
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns:

position embedded tensor and mask

output_size() → int[source]

espnet2.asr.encoder.transformer_encoder_multispkr

Encoder definition.

class espnet2.asr.encoder.transformer_encoder_multispkr.TransformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, num_blocks_sd: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1, num_inf: int = 1)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

Transformer encoder module.

Parameters:
  • input_size – input dim

  • output_size – dimension of attention

  • attention_heads – the number of heads of multi head attention

  • linear_units – the number of units of position-wise feed forward

  • num_blocks – the number of recognition encoder blocks

  • num_blocks_sd – the number of speaker dependent encoder blocks

  • dropout_rate – dropout rate

  • attention_dropout_rate – dropout rate in attention

  • positional_dropout_rate – dropout rate after adding positional encoding

  • input_layer – input layer type

  • pos_enc_class – PositionalEncoding or ScaledPositionalEncoding

  • normalize_before – whether to use layer_norm before the first block

  • concat_after – whether to concat attention layer's input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x)

  • positionwise_layer_type – linear or conv1d

  • positionwise_conv_kernel_size – kernel size of positionwise conv1d layer

  • padding_idx – padding_idx for input_layer=embed

  • num_inf – number of inference output

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters:
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns:

position embedded tensor and mask

output_size() → int[source]
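
Example: a minimal sketch of the multi-speaker Transformer encoder; num_inf=2 builds two speaker-dependent branches on top of the shared recognition blocks. Sizes are illustrative, and the exact output layout is best checked from the printed shape.

    import torch
    from espnet2.asr.encoder.transformer_encoder_multispkr import TransformerEncoder

    encoder = TransformerEncoder(
        input_size=80, output_size=256, num_blocks=2, num_blocks_sd=2, num_inf=2
    )
    xs_pad = torch.randn(2, 100, 80)       # (batch, frames, feature dim)
    ilens = torch.tensor([100, 80])
    out, olens, _ = encoder(xs_pad, ilens)
    print(out.shape, olens.shape)          # one encoded stream per num_inf branch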

espnet2.asr.encoder.vgg_rnn_encoder

class espnet2.asr.encoder.vgg_rnn_encoder.VGGRNNEncoder(input_size: int, rnn_type: str = 'lstm', bidirectional: bool = True, use_projection: bool = True, num_layers: int = 4, hidden_size: int = 320, output_size: int = 320, dropout: float = 0.0, in_channel: int = 1)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

VGGRNNEncoder class.

Parameters:
  • input_size – The number of expected features in the input

  • bidirectional – If True becomes a bidirectional LSTM

  • use_projection – Use projection layer or not

  • num_layers – Number of recurrent layers

  • hidden_size – The number of hidden features

  • output_size – The number of output features

  • dropout – dropout probability

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size() → int[source]

espnet2.asr.encoder.__init__

espnet2.asr.encoder.hugging_face_transformers_encoder

Hugging Face Transformers Encoder.

class espnet2.asr.encoder.hugging_face_transformers_encoder.HuggingFaceTransformersEncoder(input_size: int, model_name_or_path: str, lang_token_id: int = -1)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

Hugging Face Transformers Encoder.

Initialize the module.

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward.

output_size() → int[source]

Get the output size.

reload_pretrained_parameters()[source]

espnet2.asr.encoder.hubert_encoder

Encoder definition.

class espnet2.asr.encoder.hubert_encoder.FairseqHubertEncoder(input_size: int, hubert_url: str = './', hubert_dir_path: str = './', output_size: int = 256, normalize_before: bool = False, freeze_finetune_updates: int = 0, dropout_rate: float = 0.0, activation_dropout: float = 0.1, attention_dropout: float = 0.0, mask_length: int = 10, mask_prob: float = 0.75, mask_selection: str = 'static', mask_other: int = 0, apply_mask: bool = True, mask_channel_length: int = 64, mask_channel_prob: float = 0.5, mask_channel_other: int = 0, mask_channel_selection: str = 'static', layerdrop: float = 0.1, feature_grad_mult: float = 0.0)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

FairSeq Hubert encoder module, used for loading pretrained weights and fine-tuning

Parameters:
  • input_size – input dim

  • hubert_url – url to Hubert pretrained model

  • hubert_dir_path – directory to download the Hubert pretrained model.

  • output_size – dimension of attention

  • normalize_before – whether to use layer_norm before the first block

  • freeze_finetune_updates – number of update steps during which all layers except the output layer are frozen before fine-tuning the whole model (necessary to prevent overfitting).

  • dropout_rate – dropout rate

  • activation_dropout – dropout rate in activation function

  • attention_dropout – dropout rate in attention

Hubert specific Args:

Please refer to: https://github.com/pytorch/fairseq/blob/master/fairseq/models/hubert/hubert.py

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Forward Hubert ASR Encoder.

Parameters:
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns:

position embedded tensor and mask

output_size() → int[source]
reload_pretrained_parameters()[source]
class espnet2.asr.encoder.hubert_encoder.FairseqHubertPretrainEncoder(input_size: int = 1, output_size: int = 1024, linear_units: int = 1024, attention_heads: int = 12, num_blocks: int = 12, dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0, activation_dropout_rate: float = 0.0, hubert_dict: str = './dict.txt', label_rate: int = 100, checkpoint_activations: bool = False, sample_rate: int = 16000, use_amp: bool = False, **kwargs)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

FairSeq Hubert pretrain encoder module, used only for the pretraining stage

Parameters:
  • input_size – input dim

  • output_size – dimension of attention

  • linear_units – dimension of feedforward layers

  • attention_heads – the number of heads of multi head attention

  • num_blocks – the number of encoder blocks

  • dropout_rate – dropout rate

  • attention_dropout_rate – dropout rate in attention

  • hubert_dict – target dictionary for Hubert pretraining

  • label_rate – label frame rate. -1 for sequence label

  • sample_rate – target sample rate.

  • use_amp – whether to use automatic mixed precision

  • normalize_before – whether to use layer_norm before the first block

cast_mask_emb()[source]
forward(xs_pad: torch.Tensor, ilens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_length: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Forward Hubert Pretrain Encoder.

Parameters:
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns:

position embedded tensor and mask

output_size() → int[source]
reload_pretrained_parameters()[source]
class espnet2.asr.encoder.hubert_encoder.TorchAudioHuBERTPretrainEncoder(input_size: int = None, extractor_mode: str = 'group_norm', extractor_conv_layer_config: Optional[List[List[int]]] = [[512, 10, 5], [512, 3, 2], [512, 3, 2], [512, 3, 2], [512, 3, 2], [512, 2, 2], [512, 2, 2]], extractor_conv_bias: bool = False, encoder_embed_dim: int = 768, encoder_projection_dropout: float = 0.1, encoder_pos_conv_kernel: int = 128, encoder_pos_conv_groups: int = 16, encoder_num_layers: int = 12, encoder_num_heads: int = 12, encoder_attention_dropout: float = 0.1, encoder_ff_interm_features: int = 3072, encoder_ff_interm_dropout: float = 0.0, encoder_dropout: float = 0.1, encoder_layer_norm_first: bool = False, encoder_layer_drop: float = 0.05, mask_prob: float = 0.8, mask_selection: str = 'static', mask_other: float = 0.0, mask_length: int = 10, no_mask_overlap: bool = False, mask_min_space: int = 1, mask_channel_prob: float = 0.0, mask_channel_selection: str = 'static', mask_channel_other: float = 0.0, mask_channel_length: int = 10, no_mask_channel_overlap: bool = False, mask_channel_min_space: int = 1, skip_masked: bool = False, skip_nomask: bool = False, num_classes: int = 100, final_dim: int = 256, feature_grad_mult: Optional[float] = 0.1, finetuning: bool = False, freeze_encoder_updates: int = 0)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

Torch Audio Hubert encoder module.

Parameters:
  • extractor_mode – Operation mode of feature extractor. Valid values are “group_norm” or “layer_norm”.

  • extractor_conv_layer_config – Configuration of convolution layers in feature extractor. List of convolution configuration, i.e. [[output_channel, kernel_size, stride], …]

  • extractor_conv_bias – Whether to include bias term to each convolution operation.

  • encoder_embed_dim – The dimension of embedding in encoder.

  • encoder_projection_dropout – The dropout probability applied after the input feature is projected to “encoder_embed_dim”.

  • encoder_pos_conv_kernel – Kernel size of convolutional positional embeddings.

  • encoder_pos_conv_groups – Number of groups of convolutional positional embeddings.

  • encoder_num_layers – Number of self attention layers in transformer block.

  • encoder_num_heads – Number of heads in self attention layers.

  • encoder_attention_dropout – Dropout probability applied after softmax in self-attention layer.

  • encoder_ff_interm_features – Dimension of hidden features in feed forward layer.

  • encoder_ff_interm_dropout – Dropout probability applied in feedforward layer.

  • encoder_dropout – Dropout probability applied at the end of feed forward layer.

  • encoder_layer_norm_first – Control the order of layer norm in transformer layer and each encoder layer. If True, in transformer layer, layer norm is applied before features are fed to encoder layers.

  • encoder_layer_drop – Probability to drop each encoder layer during training.

  • mask_prob – Probability for each token to be chosen as start of the span to be masked.

  • mask_selection – How to choose the mask length. Options: [static, uniform, normal, poisson].

  • mask_other – Secondary mask argument (used for more complex distributions).

  • mask_length – The lengths of the mask.

  • no_mask_overlap – Whether to allow masks to overlap.

  • mask_min_space – Minimum space between spans (if no overlap is enabled).

  • mask_channel_prob – (float): The probability of replacing a feature with 0.

  • mask_channel_selection – How to choose the mask length for channel masking. Options: [static, uniform, normal, poisson].

  • mask_channel_other – Secondary mask argument for channel masking (used for more complex distributions).

  • mask_channel_length – The length of the mask for channel masking.

  • no_mask_channel_overlap – Whether to allow channel masks to overlap.

  • mask_channel_min_space – Minimum space between spans for channel masking(if no overlap is enabled).

  • skip_masked – If True, skip computing losses over masked frames.

  • skip_nomask – If True, skip computing losses over unmasked frames.

  • num_classes – The number of classes in the labels.

  • final_dim – Project final representations and targets to final_dim.

  • feature_grad_mult – The factor to scale the convolutional feature extraction layer gradients by. The scale factor will not affect the forward pass.

  • finetuning – Whether to fine-tune the model with ASR or other tasks.

  • freeze_encoder_updates – The number of steps to freeze the encoder parameters in ASR finetuning.

Hubert specific Args:

Please refer to: https://pytorch.org/audio/stable/generated/torchaudio.models.hubert_pretrain_model.html#torchaudio.models.hubert_pretrain_model

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, ys_pad: torch.Tensor = None, ys_pad_length: torch.Tensor = None, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Forward Hubert Pretrain Encoder.

Parameters:
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns:

position embedded tensor and mask

output_size() → int[source]
reload_pretrained_parameters()[source]
espnet2.asr.encoder.hubert_encoder.download_hubert(model_url, dir_path)[source]

espnet2.asr.encoder.conformer_encoder

Conformer encoder definition.

class espnet2.asr.encoder.conformer_encoder.ConformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'rel_pos', selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False, ctc_trim: bool = False, stochastic_depth_rate: Union[float, List[float]] = 0.0, layer_drop_rate: float = 0.0, max_pos_emb_len: int = 5000)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

Conformer encoder module.

Parameters:
  • input_size (int) – Input dimension.

  • output_size (int) – Dimension of attention.

  • attention_heads (int) – The number of heads of multi head attention.

  • linear_units (int) – The number of units of position-wise feed forward.

  • num_blocks (int) – The number of encoder blocks.

  • dropout_rate (float) – Dropout rate.

  • attention_dropout_rate (float) – Dropout rate in attention.

  • positional_dropout_rate (float) – Dropout rate after adding positional encoding.

  • input_layer (Union[str, torch.nn.Module]) – Input layer type.

  • normalize_before (bool) – Whether to use layer_norm before the first block.

  • concat_after (bool) – Whether to concat attention layer’s input and output. If True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) If False, no additional linear will be applied. i.e. x -> x + att(x)

  • positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.

  • positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.

  • rel_pos_type (str) – Whether to use the latest relative positional encoding or the legacy one. The legacy relative positional encoding will be deprecated in the future. More Details can be found in https://github.com/espnet/espnet/pull/2816.

  • pos_enc_layer_type (str) – Encoder positional encoding layer type.

  • selfattention_layer_type (str) – Encoder attention layer type.

  • activation_type (str) – Encoder activation function type.

  • macaron_style (bool) – Whether to use macaron style for positionwise layer.

  • use_cnn_module (bool) – Whether to use convolution module.

  • zero_triu (bool) – Whether to zero the upper triangular part of attention matrix.

  • cnn_module_kernel (int) – Kernel size of convolution module.

  • padding_idx (int) – Padding idx for input_layer=embed.

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None, return_all_hs: bool = False) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Calculate forward propagation.

Parameters:
  • xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).

  • ilens (torch.Tensor) – Input length (#batch).

  • prev_states (torch.Tensor) – Not to be used now.

  • ctc (CTC) – ctc module for intermediate CTC loss

  • return_all_hs (bool) – whether to return all hidden states

Returns:

Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.

Return type:

torch.Tensor

output_size() → int[source]
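
Example: the Conformer encoder is called in the same way as the Transformer encoder; the sketch below uses a small, illustrative configuration. Passing interctc_layer_idx together with a ctc module enables intermediate CTC losses (see the ctc and return_all_hs arguments above).

    import torch
    from espnet2.asr.encoder.conformer_encoder import ConformerEncoder

    encoder = ConformerEncoder(input_size=80, output_size=256, num_blocks=2,
                               use_cnn_module=True, cnn_module_kernel=31)
    xs_pad = torch.randn(2, 100, 80)       # (batch, frames, feature dim)
    ilens = torch.tensor([100, 60])
    out, olens, _ = encoder(xs_pad, ilens)
    # out: (2, T', 256) after conv2d subsampling; olens: (2,)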

espnet2.asr.encoder.wav2vec2_encoder

Encoder definition.

class espnet2.asr.encoder.wav2vec2_encoder.FairSeqWav2Vec2Encoder(input_size: int, w2v_url: str, w2v_dir_path: str = './', output_size: int = 256, normalize_before: bool = False, freeze_finetune_updates: int = 0)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

FairSeq Wav2Vec2 encoder module.

Parameters:
  • input_size – input dim

  • output_size – dimension of attention

  • w2v_url – url to Wav2Vec2.0 pretrained model

  • w2v_dir_path – directory to download the Wav2Vec2.0 pretrained model.

  • normalize_before – whether to use layer_norm before the first block

  • finetune_last_n_layers – the last n layers to be fine-tuned in Wav2Vec2.0; 0 means fine-tuning every layer if freeze_w2v=False.

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Forward FairSeqWav2Vec2 Encoder.

Parameters:
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns:

position embedded tensor and mask

output_size() → int[source]
reload_pretrained_parameters()[source]
espnet2.asr.encoder.wav2vec2_encoder.download_w2v(model_url, dir_path)[source]

espnet2.asr.encoder.linear_encoder

Linear encoder definition.

class espnet2.asr.encoder.linear_encoder.LinearEncoder(input_size: int, output_size: int = 256, dropout_rate: float = 0.1, input_layer: Optional[str] = 'conv2d', normalize_before: bool = True, padding_idx: int = -1)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

Linear encoder module.

Parameters:
  • input_size – input dim

  • output_size – dimension of attention

  • linear_units – the number of units of position-wise feed forward

  • dropout_rate – dropout rate

  • input_layer – input layer type

  • normalize_before – whether to use layer_norm before the first block

  • padding_idx – padding_idx for input_layer=embed

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters:
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns:

position embedded tensor and mask

output_size() → int[source]
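
Example: a minimal sketch of the linear encoder, a lightweight, attention-free alternative to the Transformer-style encoders; sizes are illustrative.

    import torch
    from espnet2.asr.encoder.linear_encoder import LinearEncoder

    encoder = LinearEncoder(input_size=80, output_size=256)  # conv2d input layer
    xs_pad = torch.randn(2, 100, 80)       # (batch, frames, feature dim)
    ilens = torch.tensor([100, 50])
    out, olens, _ = encoder(xs_pad, ilens)
    print(out.shape, olens)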

espnet2.asr.encoder.abs_encoder

class espnet2.asr.encoder.abs_encoder.AbsEncoder(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract output_size() → int[source]
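
Example: custom encoders plug into the ASR task by subclassing AbsEncoder and implementing output_size() and forward(). The sketch below is a hypothetical minimal implementation; the class name and the simple linear projection are invented for illustration only.

    from typing import Optional, Tuple

    import torch
    from espnet2.asr.encoder.abs_encoder import AbsEncoder


    class ProjectionEncoder(AbsEncoder):
        """Hypothetical encoder: project features, pass lengths through."""

        def __init__(self, input_size: int, output_size: int = 256):
            super().__init__()
            self._output_size = output_size
            self.proj = torch.nn.Linear(input_size, output_size)

        def output_size(self) -> int:
            return self._output_size

        def forward(
            self,
            xs_pad: torch.Tensor,
            ilens: torch.Tensor,
            prev_states: torch.Tensor = None,
        ) -> Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]:
            # no subsampling, so the input lengths are returned unchanged
            return self.proj(xs_pad), ilens, None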

espnet2.asr.specaug.specaug

SpecAugment module.

class espnet2.asr.specaug.specaug.SpecAug(apply_time_warp: bool = True, time_warp_window: int = 5, time_warp_mode: str = 'bicubic', apply_freq_mask: bool = True, freq_mask_width_range: Union[int, Sequence[int]] = (0, 20), num_freq_mask: int = 2, apply_time_mask: bool = True, time_mask_width_range: Union[int, Sequence[int], None] = None, time_mask_width_ratio_range: Union[float, Sequence[float], None] = None, num_time_mask: int = 2)[source]

Bases: espnet2.asr.specaug.abs_specaug.AbsSpecAug

Implementation of SpecAug.

Reference:

Daniel S. Park et al., "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition"

Warning

When using CUDA, time_warp is not reproducible due to torch.nn.functional.interpolate.

forward(x, x_lengths=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
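
Example: a minimal sketch of applying SpecAug to a padded batch of log-mel features. The mask widths and counts are illustrative, not recommended recipe values; note that one of time_mask_width_range or time_mask_width_ratio_range should be given when apply_time_mask=True.

    import torch
    from espnet2.asr.specaug.specaug import SpecAug

    specaug = SpecAug(
        apply_time_warp=True, time_warp_window=5,
        apply_freq_mask=True, freq_mask_width_range=(0, 20), num_freq_mask=2,
        apply_time_mask=True, time_mask_width_range=(0, 40), num_time_mask=2,
    )
    feats = torch.randn(2, 200, 80)            # (batch, frames, mel bins)
    feat_lengths = torch.tensor([200, 150])
    feats_aug, feat_lengths = specaug(feats, feat_lengths)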

espnet2.asr.specaug.__init__

espnet2.asr.specaug.abs_specaug

class espnet2.asr.specaug.abs_specaug.AbsSpecAug(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module

Abstract class for the augmentation of spectrogram

The process-flow:

Frontend -> SpecAug -> Normalization -> Encoder -> Decoder

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: torch.Tensor, x_lengths: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.asr.postencoder.length_adaptor_postencoder

Length adaptor PostEncoder.

class espnet2.asr.postencoder.length_adaptor_postencoder.LengthAdaptorPostEncoder(input_size: int, length_adaptor_n_layers: int = 0, input_layer: Optional[str] = None, output_size: Optional[int] = None, dropout_rate: float = 0.1, return_int_enc: bool = False)[source]

Bases: espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder

Length Adaptor PostEncoder.

Initialize the module.

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward.

output_size() → int[source]

Get the output size.
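
Example: a minimal sketch of the length adaptor. The assumption here (not stated above) is that each adaptor layer roughly halves the time resolution, so two layers shorten the sequence by about a factor of four; sizes are illustrative.

    import torch
    from espnet2.asr.postencoder.length_adaptor_postencoder import (
        LengthAdaptorPostEncoder,
    )

    postencoder = LengthAdaptorPostEncoder(input_size=256, length_adaptor_n_layers=2)
    enc_out = torch.randn(2, 100, 256)         # encoder output (batch, T, D)
    enc_out_lens = torch.tensor([100, 64])
    out, out_lens = postencoder(enc_out, enc_out_lens)
    print(out.shape, out_lens)                 # time axis reduced (assumed ~1/4)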

espnet2.asr.postencoder.abs_postencoder

class espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract output_size() → int[source]

espnet2.asr.postencoder.__init__

espnet2.asr.postencoder.hugging_face_transformers_postencoder

Hugging Face Transformers PostEncoder.

class espnet2.asr.postencoder.hugging_face_transformers_postencoder.HuggingFaceTransformersPostEncoder(input_size: int, model_name_or_path: str, length_adaptor_n_layers: int = 0, lang_token_id: int = -1)[source]

Bases: espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder

Hugging Face Transformers PostEncoder.

Initialize the module.

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward.

output_size() → int[source]

Get the output size.

reload_pretrained_parameters()[source]
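
Example: the sketch below runs the Hugging Face post-encoder on top of dummy acoustic-encoder output. It assumes the transformers package is installed and that the named checkpoint (purely illustrative here) can be downloaded.

    import torch
    from espnet2.asr.postencoder.hugging_face_transformers_postencoder import (
        HuggingFaceTransformersPostEncoder,
    )

    postencoder = HuggingFaceTransformersPostEncoder(
        input_size=256, model_name_or_path="bert-base-uncased"  # illustrative
    )
    enc_out = torch.randn(2, 50, 256)          # (batch, T, acoustic encoder dim)
    enc_out_lens = torch.tensor([50, 30])
    out, out_lens = postencoder(enc_out, enc_out_lens)
    print(out.shape, postencoder.output_size())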