espnet2.asr package

espnet2.asr.espnet_joint_model

class espnet2.asr.espnet_joint_model.ESPnetEnhASRModel(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], enh: Optional[espnet2.enh.abs_enh.AbsEnhancement], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, ctc: espnet2.asr.ctc.CTC, rnnt_decoder: None, ctc_weight: float = 0.5, ignore_id: int = -1, lsm_weight: float = 0.0, enh_weight: float = 0.5, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>')[source]

Bases: espnet2.train.abs_espnet_model.AbsESPnetModel

Joint speech enhancement and CTC-attention hybrid Encoder-Decoder model

collect_feats(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]
encode(speech: torch.Tensor, speech_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Frontend + Encoder. Note that this method is used by asr_inference.py

Parameters
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch, )

forward(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor, speech_ref1: torch.Tensor, speech_ref2: torch.Tensor, text_ref1: torch.Tensor, text_ref2: torch.Tensor, text_ref1_lengths: torch.Tensor, text_ref2_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Enhancement + Frontend + Encoder + Decoder + Calc loss

Parameters
  • speech_mix – (Batch, Length, …)

  • speech_mix_lengths – (Batch, )

  • speech_ref1, speech_ref2 – (Batch, Length, …)

  • text_ref1, text_ref2 – (Batch, Length)

  • text_ref1_lengths, text_ref2_lengths – (Batch,)

forward_enh(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor = None, resort_pre: bool = True, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Enhancement + Calc enhancement loss

Parameters
  • speech_mix – (Batch, samples) or (Batch, samples, channels)

  • speech_ref – (Batch, num_speaker, samples) or (Batch, num_speaker, samples, channels)

  • speech_mix_lengths – (Batch,), default None for the chunk iterator, because the chunk iterator does not return speech_lengths; see espnet2/iterators/chunk_iter_factory.py

static si_snr_loss(ref, inf)[source]

SI-SNR loss.

Parameters
  • ref – (Batch, samples)

  • inf – (Batch, samples)

Returns

(Batch)

static si_snr_loss_zeromean(ref, inf)[source]

SI-SNR loss with zero-mean pre-processing.

Parameters
  • ref – (Batch, samples)

  • inf – (Batch, samples)

Returns

(Batch)
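
Example (a minimal sketch, not part of the original API reference): the SI-SNR helpers are static methods, so they can be called directly on the class with (Batch, samples) waveform tensors; the tensors below are random placeholders.

    import torch

    from espnet2.asr.espnet_joint_model import ESPnetEnhASRModel

    ref = torch.randn(4, 16000)  # (Batch, samples) reference waveforms
    inf = torch.randn(4, 16000)  # (Batch, samples) enhanced waveforms

    # Both helpers return one loss value per utterance: (Batch,)
    loss = ESPnetEnhASRModel.si_snr_loss(ref, inf)
    loss_zm = ESPnetEnhASRModel.si_snr_loss_zeromean(ref, inf)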

static tf_l1_loss(ref, inf)[source]

Time-frequency L1 loss.

Parameters
  • ref – (Batch, T, F) or (Batch, T, C, F)

  • inf – (Batch, T, F) or (Batch, T, C, F)

Returns

(Batch)

static tf_mse_loss(ref, inf)[source]

Time-frequency MSE loss.

Parameters
  • ref – (Batch, T, F)

  • inf – (Batch, T, F)

Returns

(Batch)

espnet2.asr.espnet_model

class espnet2.asr.espnet_model.ESPnetASRModel(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, ctc: espnet2.asr.ctc.CTC, rnnt_decoder: None, ctc_weight: float = 0.5, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>')[source]

Bases: espnet2.train.abs_espnet_model.AbsESPnetModel

CTC-attention hybrid Encoder-Decoder model

collect_feats(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor) → Dict[str, torch.Tensor][source]
encode(speech: torch.Tensor, speech_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Frontend + Encoder. Note that this method is used by asr_inference.py

Parameters
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch, )

forward(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Frontend + Encoder + Decoder + Calc loss

Parameters
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch, )

  • text – (Batch, Length)

  • text_lengths – (Batch,)
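
Example (a hedged sketch, assuming model is an already-built ESPnetASRModel instance, e.g. restored from a trained ESPnet2 checkpoint; the tensors are random placeholders and the token ids must lie inside the model's vocabulary):

    import torch

    speech = torch.randn(2, 16000)                 # (Batch, Length) raw waveform
    speech_lengths = torch.tensor([16000, 12000])  # (Batch, )
    text = torch.randint(1, 100, (2, 10))          # (Batch, Length) token ids
    text_lengths = torch.tensor([10, 7])           # (Batch,)

    # Frontend + Encoder + Decoder + loss calculation
    loss, stats, weight = model(speech, speech_lengths, text, text_lengths)

    # Frontend + Encoder only (what asr_inference.py uses)
    encoder_out, encoder_out_lens = model.encode(speech, speech_lengths)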

espnet2.asr.ctc

class espnet2.asr.ctc.CTC(odim: int, encoder_output_sizse: int, dropout_rate: float = 0.0, ctc_type: str = 'builtin', reduce: bool = True, ignore_nan_grad: bool = False)[source]

Bases: torch.nn.modules.module.Module

CTC module.

Parameters
  • odim – dimension of outputs

  • encoder_output_sizse – number of encoder projection units

  • dropout_rate – dropout rate (0.0 ~ 1.0)

  • ctc_type – builtin or warpctc

  • reduce – reduce the CTC loss into a scalar

argmax(hs_pad)[source]

Argmax of frame activations.

Parameters

hs_pad (torch.Tensor) – 3d tensor (B, Tmax, eprojs)

Returns

2D tensor of frame-wise argmax indices (B, Tmax)

Return type

torch.Tensor

forward(hs_pad, hlens, ys_pad, ys_lens)[source]

Calculate CTC loss.

Parameters
  • hs_pad – batch of padded hidden state sequences (B, Tmax, D)

  • hlens – batch of lengths of hidden state sequences (B)

  • ys_pad – batch of padded character id sequence tensor (B, Lmax)

  • ys_lens – batch of lengths of character sequence (B)

log_softmax(hs_pad)[source]

Log-softmax of frame activations.

Parameters

hs_pad (Tensor) – 3d tensor (B, Tmax, eprojs)

Returns

3D tensor after log-softmax (B, Tmax, odim)

Return type

torch.Tensor

loss_fn(th_pred, th_target, th_ilen, th_olen) → torch.Tensor[source]
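
Example (a minimal sketch with dummy tensors; the second positional argument is the encoder projection size documented above):

    import torch

    from espnet2.asr.ctc import CTC

    ctc = CTC(50, 256)                      # odim=50, encoder projection size=256

    hs_pad = torch.randn(2, 120, 256)       # (B, Tmax, D) encoder outputs
    hlens = torch.tensor([120, 90])         # (B,)
    ys_pad = torch.randint(1, 50, (2, 15))  # (B, Lmax) target token ids (0 is the blank)
    ys_lens = torch.tensor([15, 11])        # (B,)

    loss = ctc(hs_pad, hlens, ys_pad, ys_lens)  # scalar when reduce=True
    hyp = ctc.argmax(hs_pad)                    # (B, Tmax) frame-wise argmax ids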

espnet2.asr.__init__

espnet2.asr.encoder.conformer_encoder

Conformer encoder definition.

class espnet2.asr.encoder.conformer_encoder.ConformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, pos_enc_layer_type: str = 'rel_pos', selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, cnn_module_kernel: int = 31, padding_idx: int = -1)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

Conformer encoder module.

Parameters
  • input_size (int) – Input dimension.

  • output_size (int) – Dimension of attention.

  • attention_heads (int) – The number of heads of multi head attention.

  • linear_units (int) – The number of units of position-wise feed forward.

  • num_blocks (int) – The number of encoder blocks.

  • dropout_rate (float) – Dropout rate.

  • attention_dropout_rate (float) – Dropout rate in attention.

  • positional_dropout_rate (float) – Dropout rate after adding positional encoding.

  • input_layer (Union[str, torch.nn.Module]) – Input layer type.

  • normalize_before (bool) – Whether to use layer_norm before the first block.

  • concat_after (bool) – Whether to concatenate the attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x).

  • positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.

  • positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.

  • pos_enc_layer_type (str) – Encoder positional encoding layer type.

  • selfattention_layer_type (str) – Encoder self-attention layer type.

  • activation_type (str) – Encoder activation function type.

  • macaron_style (bool) – Whether to use macaron style for positionwise layer.

  • use_cnn_module (bool) – Whether to use convolution module.

  • cnn_module_kernel (int) – Kernel size of convolution module.

  • padding_idx (int) – Padding idx for input_layer=embed.

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Calculate forward propagation.

Parameters
  • xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).

  • ilens (torch.Tensor) – Input length (#batch).

  • prev_states (torch.Tensor) – Not to be used now.

Returns

  • torch.Tensor: Output tensor (#batch, L, output_size).

  • torch.Tensor: Output length (#batch).

  • torch.Tensor: Not to be used now.

Return type

torch.Tensor

output_size() → int[source]
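
Example (a minimal sketch on random filterbank features; input_size should match the frontend's feature dimension, 80 mel bins here, and the remaining arguments keep their defaults):

    import torch

    from espnet2.asr.encoder.conformer_encoder import ConformerEncoder

    encoder = ConformerEncoder(input_size=80, output_size=256)

    xs_pad = torch.randn(2, 200, 80)  # (#batch, L, input_size)
    ilens = torch.tensor([200, 150])  # (#batch,)

    # The default input_layer="conv2d" subsamples the time axis by 4.
    out, olens, _ = encoder(xs_pad, ilens)  # out: (#batch, L', output_size)
    assert encoder.output_size() == 256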

espnet2.asr.encoder.rnn_encoder

class espnet2.asr.encoder.rnn_encoder.RNNEncoder(input_size: int, rnn_type: str = 'lstm', bidirectional: bool = True, use_projection: bool = True, num_layers: int = 4, hidden_size: int = 320, output_size: int = 320, dropout: float = 0.0, subsample: Optional[Sequence[int]] = (2, 2, 1, 1))[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

RNNEncoder class.

Parameters
  • input_size – The number of expected features in the input

  • output_size – The number of output features

  • hidden_size – The number of hidden features

  • bidirectional – If True becomes a bidirectional LSTM

  • use_projection – Use projection layer or not

  • num_layers – Number of recurrent layers

  • dropout – dropout probability

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size() → int[source]
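
Example (a minimal sketch with random features; the default subsample=(2, 2, 1, 1) halves the frame rate in each of the first two layers):

    import torch

    from espnet2.asr.encoder.rnn_encoder import RNNEncoder

    encoder = RNNEncoder(input_size=80)  # projected BLSTM, 4 layers by default

    xs_pad = torch.randn(2, 200, 80)     # (Batch, L, input_size)
    ilens = torch.tensor([200, 150])     # (Batch,)

    out, olens, states = encoder(xs_pad, ilens)  # time axis subsampled by 4
    assert encoder.output_size() == 320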

espnet2.asr.encoder.transformer_encoder

Encoder definition.

class espnet2.asr.encoder.transformer_encoder.TransformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

Transformer encoder module.

Parameters
  • input_size – input dim

  • output_size – dimension of attention

  • attention_heads – the number of heads of multi head attention

  • linear_units – the number of units of position-wise feed forward

  • num_blocks – the number of encoder blocks

  • dropout_rate – dropout rate

  • attention_dropout_rate – dropout rate in attention

  • positional_dropout_rate – dropout rate after adding positional encoding

  • input_layer – input layer type

  • pos_enc_class – PositionalEncoding or ScaledPositionalEncoding

  • normalize_before – whether to use layer_norm before the first block

  • concat_after – whether to concatenate the attention layer’s input and output; if True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x)

  • positionwise_layer_type – “linear” or “conv1d”

  • positionwise_conv_kernel_size – kernel size of positionwise conv1d layer

  • padding_idx – padding_idx for input_layer=embed

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns

position embedded tensor and mask

output_size() → int[source]
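
Example (a minimal sketch analogous to the Conformer encoder above; the default conv2d input layer subsamples the time axis by 4):

    import torch

    from espnet2.asr.encoder.transformer_encoder import TransformerEncoder

    encoder = TransformerEncoder(input_size=80, output_size=256, num_blocks=6)

    xs_pad = torch.randn(2, 200, 80)  # (B, L, D)
    ilens = torch.tensor([200, 150])  # (B,)

    out, olens, _ = encoder(xs_pad, ilens)  # out: (B, L', output_size)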

espnet2.asr.encoder.vgg_rnn_encoder

class espnet2.asr.encoder.vgg_rnn_encoder.VGGRNNEncoder(input_size: int, rnn_type: str = 'lstm', bidirectional: bool = True, use_projection: bool = True, num_layers: int = 4, hidden_size: int = 320, output_size: int = 320, dropout: float = 0.0, in_channel: int = 1)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

VGGRNNEncoder class.

Parameters
  • input_size – The number of expected features in the input

  • bidirectional – If True becomes a bidirectional LSTM

  • use_projection – Use projection layer or not

  • num_layers – Number of recurrent layers

  • hidden_size – The number of hidden features

  • output_size – The number of output features

  • dropout – dropout probability

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size() → int[source]

espnet2.asr.encoder.__init__

espnet2.asr.encoder.abs_encoder

class espnet2.asr.encoder.abs_encoder.AbsEncoder[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract output_size() → int[source]
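
Example (a hypothetical subclass, not part of ESPnet, showing the two abstract methods a custom encoder has to implement):

    from typing import Optional, Tuple

    import torch

    from espnet2.asr.encoder.abs_encoder import AbsEncoder


    class IdentityEncoder(AbsEncoder):
        """A do-nothing encoder that passes the features through unchanged."""

        def __init__(self, input_size: int):
            super().__init__()
            self._output_size = input_size

        def output_size(self) -> int:
            return self._output_size

        def forward(
            self,
            xs_pad: torch.Tensor,
            ilens: torch.Tensor,
            prev_states: torch.Tensor = None,
        ) -> Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]:
            return xs_pad, ilens, None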

espnet2.asr.frontend.default

class espnet2.asr.frontend.default.DefaultFrontend(fs: Union[int, str] = 16000, n_fft: int = 512, win_length: int = None, hop_length: int = 128, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True, n_mels: int = 80, fmin: int = None, fmax: int = None, htk: bool = False, frontend_conf: Optional[dict] = {'badim': 320, 'bdropout_rate': 0.0, 'blayers': 3, 'bnmask': 2, 'bprojs': 320, 'btype': 'blstmp', 'bunits': 300, 'delay': 3, 'ref_channel': -1, 'taps': 5, 'use_beamformer': False, 'use_dnn_mask_for_wpe': True, 'use_wpe': False, 'wdropout_rate': 0.0, 'wlayers': 3, 'wprojs': 320, 'wtype': 'blstmp', 'wunits': 300}, apply_stft: bool = True)[source]

Bases: espnet2.asr.frontend.abs_frontend.AbsFrontend

Conventional frontend structure for ASR.

Stft -> WPE -> MVDR-Beamformer -> Power-spec -> Mel-Fbank -> CMVN

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size() → int[source]
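
Example (a minimal sketch extracting 80-dimensional log-mel features from random waveforms; the WPE and beamformer sub-blocks stay disabled with the default frontend_conf):

    import torch

    from espnet2.asr.frontend.default import DefaultFrontend

    frontend = DefaultFrontend(fs=16000, n_mels=80)

    speech = torch.randn(2, 16000)                 # (Batch, samples) raw audio
    speech_lengths = torch.tensor([16000, 12000])  # (Batch,)

    feats, feats_lengths = frontend(speech, speech_lengths)
    # feats: (Batch, num_frames, n_mels); frontend.output_size() == 80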

espnet2.asr.frontend.__init__

espnet2.asr.frontend.abs_frontend

class espnet2.asr.frontend.abs_frontend.AbsFrontend[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract output_size() → int[source]

espnet2.asr.decoder.rnn_decoder

class espnet2.asr.decoder.rnn_decoder.RNNDecoder(vocab_size: int, encoder_output_size: int, rnn_type: str = 'lstm', num_layers: int = 1, hidden_size: int = 320, sampling_probability: float = 0.0, dropout: float = 0.0, context_residual: bool = False, replace_sos: bool = False, num_encs: int = 1, att_conf: dict = {'aconv_chans': 10, 'aconv_filts': 100, 'adim': 320, 'aheads': 4, 'atype': 'location', 'awin': 5, 'han_conv_chans': -1, 'han_conv_filts': 100, 'han_dim': 320, 'han_heads': 4, 'han_mode': False, 'han_type': None, 'han_win': 5, 'num_att': 1, 'num_encs': 1})[source]

Bases: espnet2.asr.decoder.abs_decoder.AbsDecoder

forward(hs_pad, hlens, ys_in_pad, ys_in_lens, strm_idx=0)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

init_state(x)[source]

Get an initial state for decoding (optional).

Parameters

x (torch.Tensor) – The encoded feature tensor

Returns: initial state

rnn_forward(ey, z_list, c_list, z_prev, c_prev)[source]
score(yseq, state, x)[source]

Score new token (required).

Parameters
  • yseq (torch.Tensor) – 1D torch.int64 prefix tokens.

  • state – Scorer state for prefix tokens

  • x (torch.Tensor) – The encoder feature that generates ys.

Returns

Tuple of

scores for next token that has a shape of (n_vocab) and next state for ys

Return type

tuple[torch.Tensor, Any]

zero_state(hs_pad)[source]
espnet2.asr.decoder.rnn_decoder.build_attention_list(eprojs: int, dunits: int, atype: str = 'location', num_att: int = 1, num_encs: int = 1, aheads: int = 4, adim: int = 320, awin: int = 5, aconv_chans: int = 10, aconv_filts: int = 100, han_mode: bool = False, han_type=None, han_heads: int = 4, han_dim: int = 320, han_conv_chans: int = -1, han_conv_filts: int = 100, han_win: int = 5)[source]

espnet2.asr.decoder.abs_decoder

class espnet2.asr.decoder.abs_decoder.AbsDecoder[source]

Bases: torch.nn.modules.module.Module, espnet.nets.scorer_interface.ScorerInterface, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.asr.decoder.transformer_decoder

Decoder definition.

class espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder(vocab_size: int, encoder_output_size: int, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True)[source]

Bases: espnet2.asr.decoder.abs_decoder.AbsDecoder, espnet.nets.scorer_interface.BatchScorerInterface

Base class of Transformer decoder module.

Parameters
  • vocab_size – output dim

  • encoder_output_size – dimension of attention

  • attention_heads – the number of heads of multi head attention

  • linear_units – the number of units of position-wise feed forward

  • num_blocks – the number of decoder blocks

  • dropout_rate – dropout rate

  • self_attention_dropout_rate – dropout rate for attention

  • input_layer – input layer type

  • use_output_layer – whether to use output layer

  • pos_enc_class – PositionalEncoding or ScaledPositionalEncoding

  • normalize_before – whether to use layer_norm before the first block

  • concat_after – whether to concatenate the attention layer’s input and output; if True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x)

batch_score(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]

Score new token batch.

Parameters
  • ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).

  • states (List[Any]) – Scorer states for prefix tokens.

  • xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).

Returns

Tuple of

batchified scores for the next token with shape (n_batch, n_vocab) and the next state list for ys.

Return type

tuple[torch.Tensor, List[Any]]

forward(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward decoder.

Parameters
  • hs_pad – encoded memory, float32 (batch, maxlen_in, feat)

  • hlens – (batch)

  • ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed”; otherwise an input tensor (batch, maxlen_out, #mels)

  • ys_in_lens – (batch)

Returns

tuple containing:

x: decoded token scores before softmax (batch, maxlen_out, token); the output layer is applied only if use_output_layer is True

olens: (batch, )

Return type

(tuple)

forward_one_step(tgt: torch.Tensor, tgt_mask: torch.Tensor, memory: torch.Tensor, cache: List[torch.Tensor] = None) → Tuple[torch.Tensor, List[torch.Tensor]][source]

Forward one step.

Parameters
  • tgt – input token ids, int64 (batch, maxlen_out)

  • tgt_mask – input token mask, (batch, maxlen_out); dtype=torch.uint8 before PyTorch 1.2, dtype=torch.bool from PyTorch 1.2 onward

  • memory – encoded memory, float32 (batch, maxlen_in, feat)

  • cache – cached output list of (batch, max_time_out-1, size)

Returns

NN output value and cache per self.decoders. y.shape is (batch, maxlen_out, token)

Return type

y, cache

score(ys, state, x)[source]

Score.

class espnet2.asr.decoder.transformer_decoder.DynamicConvolution2DTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]

Bases: espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder

class espnet2.asr.decoder.transformer_decoder.DynamicConvolutionTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]

Bases: espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder

class espnet2.asr.decoder.transformer_decoder.LightweightConvolution2DTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]

Bases: espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder

class espnet2.asr.decoder.transformer_decoder.LightweightConvolutionTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]

Bases: espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder

class espnet2.asr.decoder.transformer_decoder.TransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False)[source]

Bases: espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
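
Example (a minimal sketch scoring random target tokens against random encoder states; in practice ys_in_pad is the target sequence prefixed with the start-of-sequence token):

    import torch

    from espnet2.asr.decoder.transformer_decoder import TransformerDecoder

    decoder = TransformerDecoder(vocab_size=50, encoder_output_size=256)

    hs_pad = torch.randn(2, 120, 256)          # encoded memory (batch, maxlen_in, feat)
    hlens = torch.tensor([120, 90])            # (batch,)
    ys_in_pad = torch.randint(1, 50, (2, 12))  # int64 token ids (batch, maxlen_out)
    ys_in_lens = torch.tensor([12, 9])         # (batch,)

    x, olens = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)
    # x: token scores before softmax (batch, maxlen_out, vocab_size)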

espnet2.asr.decoder.__init__

espnet2.asr.specaug.specaug

class espnet2.asr.specaug.specaug.SpecAug(apply_time_warp: bool = True, time_warp_window: int = 5, time_warp_mode: str = 'bicubic', apply_freq_mask: bool = True, freq_mask_width_range: Union[int, Sequence[int]] = (0, 20), num_freq_mask: int = 2, apply_time_mask: bool = True, time_mask_width_range: Union[int, Sequence[int]] = (0, 100), num_time_mask: int = 2)[source]

Bases: espnet2.asr.specaug.abs_specaug.AbsSpecAug

Implementation of SpecAug.

Reference:

Daniel S. Park et al., “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition”

Warning

When running on CUDA, time_warp is not reproducible due to torch.nn.functional.interpolate.

forward(x, x_lengths=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
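
Example (a minimal sketch applying SpecAug to random padded log-mel features; SpecAug sits between the frontend and the normalization layer in the process flow):

    import torch

    from espnet2.asr.specaug.specaug import SpecAug

    specaug = SpecAug(
        time_warp_window=5,
        freq_mask_width_range=(0, 20),
        num_freq_mask=2,
        time_mask_width_range=(0, 40),
        num_time_mask=2,
    )

    x = torch.randn(2, 200, 80)           # (Batch, Length, Freq) log-mel features
    x_lengths = torch.tensor([200, 160])  # (Batch,)

    x_aug, x_lengths = specaug(x, x_lengths)  # same shapes as the input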

espnet2.asr.specaug.abs_specaug

class espnet2.asr.specaug.abs_specaug.AbsSpecAug[source]

Bases: torch.nn.modules.module.Module

Abstract class for spectrogram augmentation.

The process-flow:

Frontend -> SpecAug -> Normalization -> Encoder -> Decoder

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: torch.Tensor, x_lengths: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.asr.specaug.__init__