espnet2.slu package

espnet2.slu.espnet_model

class espnet2.slu.espnet_model.ESPnetSLUModel(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, ctc: espnet2.asr.ctc.CTC, joint_network: Optional[torch.nn.modules.module.Module], postdecoder: Optional[espnet2.slu.postdecoder.abs_postdecoder.AbsPostDecoder] = None, deliberationencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder] = None, transcript_token_list: Union[Tuple[str, ...], List[str], None] = None, ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', extract_feats_in_collect_stats: bool = True, two_pass: bool = False, pre_postencoder_norm: bool = False)[source]

Bases: espnet2.asr.espnet_model.ESPnetASRModel

CTC-attention hybrid Encoder-Decoder model

collect_feats(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, transcript: torch.Tensor = None, transcript_lengths: torch.Tensor = None, **kwargs) → Dict[str, torch.Tensor][source]
encode(speech: torch.Tensor, speech_lengths: torch.Tensor, transcript_pad: torch.Tensor = None, transcript_pad_lens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Frontend + Encoder. Note that this method is used by asr_inference.py

Parameters:
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch, )

forward(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, transcript: torch.Tensor = None, transcript_lengths: torch.Tensor = None, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Frontend + Encoder + Decoder + Calc loss

Parameters:
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch, )

  • text – (Batch, Length)

  • text_lengths – (Batch,)

  • kwargs – “utt_id” is among the inputs.
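
Example (a minimal sketch, not part of the API): the component choices, sizes, and token list below are illustrative assumptions. With frontend, specaug, and normalize set to None, speech is consumed as precomputed features of shape (Batch, Length, Dim).

    import torch

    from espnet2.asr.ctc import CTC
    from espnet2.asr.decoder.transformer_decoder import TransformerDecoder
    from espnet2.asr.encoder.transformer_encoder import TransformerEncoder
    from espnet2.slu.espnet_model import ESPnetSLUModel

    # Toy vocabulary: <blank> at index 0, <sos/eos> at the last index.
    vocab_size = 10
    token_list = ["<blank>", "<unk>"] + [f"tok{i}" for i in range(7)] + ["<sos/eos>"]

    encoder = TransformerEncoder(
        input_size=40, output_size=32, attention_heads=2,
        linear_units=64, num_blocks=2, input_layer="linear",
    )
    decoder = TransformerDecoder(
        vocab_size=vocab_size, encoder_output_size=32,
        attention_heads=2, linear_units=64, num_blocks=1,
    )
    ctc = CTC(odim=vocab_size, encoder_output_size=32)

    model = ESPnetSLUModel(
        vocab_size=vocab_size, token_list=token_list,
        frontend=None, specaug=None, normalize=None, preencoder=None,
        encoder=encoder, postencoder=None, decoder=decoder, ctc=ctc,
        joint_network=None,
    )

    speech = torch.randn(2, 100, 40)          # (Batch, Length, Dim) features
    speech_lengths = torch.tensor([100, 80])  # (Batch,)
    text = torch.randint(2, 9, (2, 5))        # (Batch, Length) SLU label tokens
    text_lengths = torch.tensor([5, 3])       # (Batch,)

    # encode(): frontend + encoder only, as used at inference time.
    enc_out, enc_out_lens = model.encode(speech, speech_lengths)
    print(enc_out.shape)  # torch.Size([2, 100, 32])

    # forward(): frontend + encoder + decoder + loss (CTC and attention).
    loss, stats, weight = model(speech, speech_lengths, text, text_lengths)
    print(loss.item(), sorted(stats))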

espnet2.slu.__init__

espnet2.slu.postdecoder.abs_postdecoder

class espnet2.slu.postdecoder.abs_postdecoder.AbsPostDecoder(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract convert_examples_to_features(data: list, max_seq_length: int, output_size: int)[source]
abstract forward(transcript_input_ids: torch.LongTensor, transcript_attention_mask: torch.LongTensor, transcript_token_type_ids: torch.LongTensor, transcript_position_ids: torch.LongTensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract output_size() → int[source]
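
Example (a minimal sketch): a concrete subclass implements the three abstract methods above. The embedding body and the toy featurization are invented for illustration; a real implementation such as HuggingFaceTransformersPostDecoder below wraps a pretrained language model.

    import torch

    from espnet2.slu.postdecoder.abs_postdecoder import AbsPostDecoder

    class ToyPostDecoder(AbsPostDecoder):
        """Embed integer transcript ids and mask out padded positions."""

        def __init__(self, vocab_size: int = 100, hidden_size: int = 32):
            super().__init__()
            self.embed = torch.nn.Embedding(vocab_size, hidden_size)
            self.hidden_size = hidden_size

        def convert_examples_to_features(
            self, data: list, max_seq_length: int, output_size: int
        ):
            # Toy featurization: pad/truncate each integer id list.
            ids = [(seq + [0] * max_seq_length)[:max_seq_length] for seq in data]
            mask = [[1 if t != 0 else 0 for t in seq] for seq in ids]
            return ids, mask

        def forward(
            self,
            transcript_input_ids: torch.LongTensor,
            transcript_attention_mask: torch.LongTensor,
            transcript_token_type_ids: torch.LongTensor,
            transcript_position_ids: torch.LongTensor,
        ) -> torch.Tensor:
            # Embed the ids and zero out the padded positions.
            out = self.embed(transcript_input_ids)
            return out * transcript_attention_mask.unsqueeze(-1)

        def output_size(self) -> int:
            return self.hidden_size

    postdecoder = ToyPostDecoder()
    ids = torch.randint(1, 100, (2, 8))
    mask = torch.ones(2, 8, dtype=torch.long)
    # Call the module instance (not .forward()) so registered hooks run.
    out = postdecoder(ids, mask, torch.zeros_like(ids), torch.arange(8).expand(2, 8))
    print(out.shape, postdecoder.output_size())  # torch.Size([2, 8, 32]) 32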

espnet2.slu.postdecoder.__init__

espnet2.slu.postdecoder.hugging_face_transformers_postdecoder

Hugging Face Transformers PostDecoder.

class espnet2.slu.postdecoder.hugging_face_transformers_postdecoder.HuggingFaceTransformersPostDecoder(model_name_or_path: str, output_size=256)[source]

Bases: espnet2.slu.postdecoder.abs_postdecoder.AbsPostDecoder

Hugging Face Transformers PostDecoder.

Initialize the module.

convert_examples_to_features(data, max_seq_length)[source]
forward(transcript_input_ids: torch.LongTensor, transcript_attention_mask: torch.LongTensor, transcript_token_type_ids: torch.LongTensor, transcript_position_ids: torch.LongTensor) → torch.Tensor[source]

Forward.

output_size() → int[source]

Get the output size.
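
Example (a minimal sketch): assumes the transformers package is installed; "bert-base-uncased" is an illustrative checkpoint choice, and the four input tensors are built directly with the Hugging Face tokenizer (rather than via convert_examples_to_features) to keep the example self-contained.

    import torch
    from transformers import AutoTokenizer

    from espnet2.slu.postdecoder.hugging_face_transformers_postdecoder import (
        HuggingFaceTransformersPostDecoder,
    )

    postdecoder = HuggingFaceTransformersPostDecoder(
        "bert-base-uncased", output_size=256
    )

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    enc = tokenizer(
        ["turn on the kitchen lights"],
        padding="max_length", max_length=16, truncation=True, return_tensors="pt",
    )
    position_ids = torch.arange(enc["input_ids"].shape[1]).unsqueeze(0)

    out = postdecoder(
        enc["input_ids"], enc["attention_mask"],
        enc["token_type_ids"], position_ids,
    )
    print(out.shape, postdecoder.output_size())  # output_size() -> 256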

espnet2.slu.postencoder.conformer_postencoder

Conformer PostEncoder.

class espnet2.slu.postencoder.conformer_postencoder.ConformerPostEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str = 'linear', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'rel_pos', selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, padding_idx: int = -1)[source]

Bases: espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder

Conformer PostEncoder.

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward.

output_size() → int[source]

Get the output size.
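
Example (a minimal sketch with illustrative sizes): the default linear input layer does not subsample, so the output sequence length matches the input.

    import torch

    from espnet2.slu.postencoder.conformer_postencoder import ConformerPostEncoder

    postencoder = ConformerPostEncoder(
        input_size=80, output_size=64, attention_heads=2,
        linear_units=128, num_blocks=2,
    )
    x = torch.randn(2, 50, 80)        # (batch, length, input_size)
    lengths = torch.tensor([50, 30])
    out, out_lengths = postencoder(x, lengths)
    print(out.shape, postencoder.output_size())  # torch.Size([2, 50, 64]) 64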

espnet2.slu.postencoder.transformer_postencoder

Transformer PostEncoder definition.

class espnet2.slu.postencoder.transformer_postencoder.TransformerPostEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'linear', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1)[source]

Bases: espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder

Transformer encoder module.

Parameters:
  • input_size – input dim

  • output_size – dimension of attention

  • attention_heads – the number of heads of multi head attention

  • linear_units – the number of units of position-wise feed forward

  • num_blocks – the number of encoder blocks

  • dropout_rate – dropout rate

  • attention_dropout_rate – dropout rate in attention

  • positional_dropout_rate – dropout rate after adding positional encoding

  • input_layer – input layer type

  • pos_enc_class – PositionalEncoding or ScaledPositionalEncoding

  • normalize_before – whether to use layer_norm before the first block

  • concat_after – whether to concatenate the attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x)

  • positionwise_layer_type – linear or conv1d

  • positionwise_conv_kernel_size – kernel size of positionwise conv1d layer

  • padding_idx – padding_idx for input_layer=embed

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters:
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – not used currently

Returns:

position-embedded tensor, output lengths, and unused states (None)

output_size() → int[source]
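
Example (a minimal sketch with illustrative sizes): forward returns the position-embedded, encoded tensor, the output lengths, and an unused states value.

    import torch

    from espnet2.slu.postencoder.transformer_postencoder import TransformerPostEncoder

    postencoder = TransformerPostEncoder(
        input_size=80, output_size=64, attention_heads=2,
        linear_units=128, num_blocks=2,
    )
    xs_pad = torch.randn(2, 50, 80)  # (B, L, D)
    ilens = torch.tensor([50, 30])   # (B,)
    out, olens, _ = postencoder(xs_pad, ilens)
    print(out.shape, postencoder.output_size())  # torch.Size([2, 50, 64]) 64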

espnet2.slu.postencoder.__init__