espnet2.slu package

espnet2.slu.espnet_model

class espnet2.slu.espnet_model.ESPnetSLUModel(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, ctc: espnet2.asr.ctc.CTC, joint_network: Optional[torch.nn.modules.module.Module], postdecoder: Optional[espnet2.slu.postdecoder.abs_postdecoder.AbsPostDecoder] = None, deliberationencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder] = None, transcript_token_list: Union[Tuple[str, ...], List[str], None] = None, ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', extract_feats_in_collect_stats: bool = True, two_pass: bool = False, pre_postencoder_norm: bool = False)[source]

Bases: espnet2.asr.espnet_model.ESPnetASRModel

CTC-attention hybrid Encoder-Decoder model

collect_feats(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, transcript: torch.Tensor = None, transcript_lengths: torch.Tensor = None, **kwargs) → Dict[str, torch.Tensor][source]
encode(speech: torch.Tensor, speech_lengths: torch.Tensor, transcript_pad: torch.Tensor = None, transcript_pad_lens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Frontend + Encoder. Note that this method is used by asr_inference.py

Parameters:
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch, )

forward(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, transcript: torch.Tensor = None, transcript_lengths: torch.Tensor = None, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Frontend + Encoder + Decoder + Calc loss

Parameters:
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch, )

  • text – (Batch, Length)

  • text_lengths – (Batch,)

  • kwargs – “utt_id” is among the inputs.
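
Example (a minimal sketch, not part of the API): the component choices, sizes, and token list below are illustrative assumptions. With frontend, specaug, and normalize set to None, speech is consumed as precomputed features of shape (Batch, Length, Dim).

    import torch

    from espnet2.asr.ctc import CTC
    from espnet2.asr.decoder.transformer_decoder import TransformerDecoder
    from espnet2.asr.encoder.transformer_encoder import TransformerEncoder
    from espnet2.slu.espnet_model import ESPnetSLUModel

    # Toy vocabulary: <blank> at index 0, <sos/eos> at the last index.
    vocab_size = 10
    token_list = ["<blank>", "<unk>"] + [f"tok{i}" for i in range(7)] + ["<sos/eos>"]

    encoder = TransformerEncoder(
        input_size=40, output_size=32, attention_heads=2,
        linear_units=64, num_blocks=2, input_layer="linear",
    )
    decoder = TransformerDecoder(
        vocab_size=vocab_size, encoder_output_size=32,
        attention_heads=2, linear_units=64, num_blocks=1,
    )
    ctc = CTC(odim=vocab_size, encoder_output_size=32)

    model = ESPnetSLUModel(
        vocab_size=vocab_size, token_list=token_list,
        frontend=None, specaug=None, normalize=None, preencoder=None,
        encoder=encoder, postencoder=None, decoder=decoder, ctc=ctc,
        joint_network=None,
    )

    speech = torch.randn(2, 100, 40)          # (Batch, Length, Dim) features
    speech_lengths = torch.tensor([100, 80])  # (Batch,)
    text = torch.randint(2, 9, (2, 5))        # (Batch, Length) SLU label tokens
    text_lengths = torch.tensor([5, 3])       # (Batch,)

    # encode(): frontend + encoder only, as used at inference time.
    enc_out, enc_out_lens = model.encode(speech, speech_lengths)
    print(enc_out.shape)  # torch.Size([2, 100, 32])

    # forward(): frontend + encoder + decoder + loss (CTC and attention).
    loss, stats, weight = model(speech, speech_lengths, text, text_lengths)
    print(loss.item(), sorted(stats))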

espnet2.slu.__init__

espnet2.slu.postdecoder.abs_postdecoder

class espnet2.slu.postdecoder.abs_postdecoder.AbsPostDecoder(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract convert_examples_to_features(data: list, max_seq_length: int, output_size: int)[source]
abstract forward(transcript_input_ids: torch.LongTensor, transcript_attention_mask: torch.LongTensor, transcript_token_type_ids: torch.LongTensor, transcript_position_ids: torch.LongTensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract output_size() → int[source]
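
Example (a minimal sketch): a concrete subclass implements the three abstract methods above. The embedding body and the toy featurization are invented for illustration; a real implementation such as HuggingFaceTransformersPostDecoder below wraps a pretrained language model.

    import torch

    from espnet2.slu.postdecoder.abs_postdecoder import AbsPostDecoder

    class ToyPostDecoder(AbsPostDecoder):
        """Embed integer transcript ids and mask out padded positions."""

        def __init__(self, vocab_size: int = 100, hidden_size: int = 32):
            super().__init__()
            self.embed = torch.nn.Embedding(vocab_size, hidden_size)
            self.hidden_size = hidden_size

        def convert_examples_to_features(
            self, data: list, max_seq_length: int, output_size: int
        ):
            # Toy featurization: pad/truncate each integer id list.
            ids = [(seq + [0] * max_seq_length)[:max_seq_length] for seq in data]
            mask = [[1 if t != 0 else 0 for t in seq] for seq in ids]
            return ids, mask

        def forward(
            self,
            transcript_input_ids: torch.LongTensor,
            transcript_attention_mask: torch.LongTensor,
            transcript_token_type_ids: torch.LongTensor,
            transcript_position_ids: torch.LongTensor,
        ) -> torch.Tensor:
            # Embed the ids and zero out the padded positions.
            out = self.embed(transcript_input_ids)
            return out * transcript_attention_mask.unsqueeze(-1)

        def output_size(self) -> int:
            return self.hidden_size

    postdecoder = ToyPostDecoder()
    ids = torch.randint(1, 100, (2, 8))
    mask = torch.ones(2, 8, dtype=torch.long)
    # Call the module instance (not .forward()) so registered hooks run.
    out = postdecoder(ids, mask, torch.zeros_like(ids), torch.arange(8).expand(2, 8))
    print(out.shape, postdecoder.output_size())  # torch.Size([2, 8, 32]) 32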

espnet2.slu.postdecoder.__init__

espnet2.slu.postdecoder.hugging_face_transformers_postdecoder

Hugging Face Transformers PostDecoder.

class espnet2.slu.postdecoder.hugging_face_transformers_postdecoder.HuggingFaceTransformersPostDecoder(model_name_or_path: str, output_size=256)[source]

Bases: espnet2.slu.postdecoder.abs_postdecoder.AbsPostDecoder

Hugging Face Transformers PostDecoder.

Initialize the module.

convert_examples_to_features(data, max_seq_length)[source]
forward(transcript_input_ids: torch.LongTensor, transcript_attention_mask: torch.LongTensor, transcript_token_type_ids: torch.LongTensor, transcript_position_ids: torch.LongTensor) → torch.Tensor[source]

Forward.

output_size() → int[source]

Get the output size.
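
Example (a minimal sketch): assumes the transformers package is installed; "bert-base-uncased" is an illustrative checkpoint choice, and the four input tensors are built directly with the Hugging Face tokenizer (rather than via convert_examples_to_features) to keep the example self-contained.

    import torch
    from transformers import AutoTokenizer

    from espnet2.slu.postdecoder.hugging_face_transformers_postdecoder import (
        HuggingFaceTransformersPostDecoder,
    )

    postdecoder = HuggingFaceTransformersPostDecoder(
        "bert-base-uncased", output_size=256
    )

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    enc = tokenizer(
        ["turn on the kitchen lights"],
        padding="max_length", max_length=16, truncation=True, return_tensors="pt",
    )
    position_ids = torch.arange(enc["input_ids"].shape[1]).unsqueeze(0)

    out = postdecoder(
        enc["input_ids"], enc["attention_mask"],
        enc["token_type_ids"], position_ids,
    )
    print(out.shape, postdecoder.output_size())  # output_size() -> 256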

espnet2.slu.postencoder.conformer_postencoder

Conformer PostEncoder.

class espnet2.slu.postencoder.conformer_postencoder.ConformerPostEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str = 'linear', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'rel_pos', selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, padding_idx: int = -1)[source]

Bases: espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder

Conformer PostEncoder.

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward.

output_size() → int[source]

Get the output size.
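
Example (a minimal sketch with illustrative sizes): the default linear input layer does not subsample, so the output sequence length matches the input.

    import torch

    from espnet2.slu.postencoder.conformer_postencoder import ConformerPostEncoder

    postencoder = ConformerPostEncoder(
        input_size=80, output_size=64, attention_heads=2,
        linear_units=128, num_blocks=2,
    )
    x = torch.randn(2, 50, 80)        # (batch, length, input_size)
    lengths = torch.tensor([50, 30])
    out, out_lengths = postencoder(x, lengths)
    print(out.shape, postencoder.output_size())  # torch.Size([2, 50, 64]) 64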

espnet2.slu.postencoder.transformer_postencoder

Transformer PostEncoder definition.

class espnet2.slu.postencoder.transformer_postencoder.TransformerPostEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'linear', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1)[source]

Bases: espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder

Transformer encoder module.

Parameters:
  • input_size – input dim

  • output_size – dimension of attention

  • attention_heads – the number of heads of multi head attention

  • linear_units – the number of units of position-wise feed forward

  • num_blocks – the number of encoder blocks

  • dropout_rate – dropout rate

  • attention_dropout_rate – dropout rate in attention

  • positional_dropout_rate – dropout rate after adding positional encoding

  • input_layer – input layer type

  • pos_enc_class – PositionalEncoding or ScaledPositionalEncoding

  • normalize_before – whether to use layer_norm before the first block

  • concat_after – whether to concatenate the attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x)

  • positionwise_layer_type – linear or conv1d

  • positionwise_conv_kernel_size – kernel size of positionwise conv1d layer

  • padding_idx – padding_idx for input_layer=embed

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters:
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – not used currently

Returns:

position-embedded tensor, output lengths, and unused states (None)

output_size() → int[source]
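
Example (a minimal sketch with illustrative sizes): forward returns the position-embedded, encoded tensor, the output lengths, and an unused states value.

    import torch

    from espnet2.slu.postencoder.transformer_postencoder import TransformerPostEncoder

    postencoder = TransformerPostEncoder(
        input_size=80, output_size=64, attention_heads=2,
        linear_units=128, num_blocks=2,
    )
    xs_pad = torch.randn(2, 50, 80)  # (B, L, D)
    ilens = torch.tensor([50, 30])   # (B,)
    out, olens, _ = postencoder(xs_pad, ilens)
    print(out.shape, postencoder.output_size())  # torch.Size([2, 50, 64]) 64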

espnet2.slu.postencoder.__init__