espnet2.asr.decoder.hugging_face_transformers_decoder.HuggingFaceTransformersDecoder
class espnet2.asr.decoder.hugging_face_transformers_decoder.HuggingFaceTransformersDecoder(vocab_size: int, encoder_output_size: int, model_name_or_path: str, causal_lm: bool = False, prefix: str = '', postfix: str = '', overriding_architecture_config: str | dict | None = {}, load_pretrained_weights: bool = True, separate_lm_head: bool = False)
Bases: AbsDecoder, BatchScorerInterface
Hugging Face Transformers Decoder.
- Parameters:
- encoder_output_size – dimension of encoder attention
- model_name_or_path – Hugging Face Transformers model name
Initializes the HuggingFaceTransformersDecoder.
- Parameters:
- vocab_size (int) – The size of the vocabulary.
- encoder_output_size (int) – The size of the encoder output.
- model_name_or_path (str) – The name or path of the pre-trained Transformers model.
- causal_lm (bool, optional) – Whether to load the model given by model_name_or_path as a causal (decoder-only) language model. Defaults to False.
- prefix (str, optional) – Prefix to be added to the input tokens. Defaults to "".
- postfix (str, optional) – Postfix to be added to the input tokens. Defaults to "".
- overriding_architecture_config (str or dict, optional) – Path to a configuration JSON file, or the configuration dictionary itself. If set, it overrides the default decoder configuration. Defaults to an empty dict.
- load_pretrained_weights (bool) – Whether to load the pre-trained weights. Defaults to True.
- separate_lm_head (bool) – If True, ensures that the language model head is not shared with the input token embeddings. If False, the original structure is kept, i.e., if the original Transformers implementation ties the weights, the tying is retained. Defaults to False.
- Raises:
- ImportError – If the transformers library is not available.
- Exception – If the word embeddings attribute cannot be found in the model.
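Since overriding_architecture_config accepts either a JSON file path or a dictionary, the two forms can be handled with a small dispatch. The helper below is a hypothetical pure-Python sketch of that str-or-dict contract, not the actual ESPnet implementation:

```python
import json
import tempfile

def resolve_config(overriding_architecture_config):
    """Illustrative helper (hypothetical name): accept a JSON file path,
    a dict, or None, and return a plain dict of config overrides."""
    if isinstance(overriding_architecture_config, dict):
        return overriding_architecture_config
    if isinstance(overriding_architecture_config, str):
        # Treat a string as a path to a JSON configuration file.
        with open(overriding_architecture_config) as f:
            return json.load(f)
    return {}  # None / unset: no overrides

# Dict input is returned as-is.
print(resolve_config({"num_hidden_layers": 2}))

# A JSON file path is loaded and parsed.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"hidden_size": 256}, f)
    path = f.name
print(resolve_config(path))
```

Passing a dict is convenient for programmatic construction, while a path is convenient from YAML training configs; both reduce to the same override dict here.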
add_prefix_postfix(enc_out, hlens, ys_in_pad, ys_in_lens)
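The prefix/postfix arguments from the constructor wrap extra tokenized text around a sequence. The snippet below is only an illustrative sketch of that wrapping idea, using plain Python lists of token ids in place of embedded tensors; the function name is hypothetical:

```python
def with_prefix_postfix(prefix_ids, seq_ids, postfix_ids):
    """Hypothetical sketch: concatenate tokenized prefix/postfix ids
    around a sequence (plain lists stand in for embedded tensors)."""
    return prefix_ids + seq_ids + postfix_ids

# e.g. a prompt-style prefix token and an end marker around the sequence
print(with_prefix_postfix([101], [7, 8, 9], [102]))  # -> [101, 7, 8, 9, 102]
```

In the real method, lengths (hlens, ys_in_lens) must also be adjusted to account for the added tokens.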
batch_score(ys: Tensor, states: List[Any], xs: Tensor, speech: Tensor | None = None) → Tuple[Tensor, List[Any]]
Score new token batch (required).
- Parameters:
- ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).
- states (List[Any]) – Scorer states for prefix tokens.
- xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).
- Returns: Tuple of batchified scores for the next token, with shape (n_batch, n_vocab), and the next state list for ys.
- Return type: tuple[torch.Tensor, List[Any]]
forward(hs_pad: Tensor, hlens: Tensor, ys_in_pad: Tensor, ys_in_lens: Tensor) → Tuple[Tensor, Tensor]
Forward decoder.
Parameters:
- hs_pad – encoded memory, float32 (batch, maxlen_in, feat)
- hlens – (batch)
- ys_in_pad – input token ids, int64 (batch, maxlen_out)
- ys_in_lens – (batch)
Returns: tuple containing:
- x: decoded token scores before softmax (batch, maxlen_out, vocab_size)
- olens: (batch,)
Return type: tuple
reload_pretrained_parameters()
score(ys, state, x, speech=None)
Score new token (required).
- Parameters:
- ys (torch.Tensor) – 1D torch.int64 prefix tokens.
- state – Scorer state for prefix tokens
- x (torch.Tensor) – The encoder feature that generates ys.
- Returns: Tuple of scores for the next token, with shape (n_vocab,), and the next state for ys.
- Return type: tuple[torch.Tensor, Any]
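The score(ys, state, x) contract threads a scorer state through an incremental decoding loop. The toy below is a stand-in scorer (not the real HuggingFaceTransformersDecoder) that shows only the shape of that loop: each call returns per-vocabulary scores plus an updated state, and the caller appends the argmax token:

```python
class ToyScorer:
    """Stand-in scorer with the same score(ys, state, x) call shape
    (hypothetical; not the real HuggingFaceTransformersDecoder)."""
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    def score(self, ys, state, x):
        # Deterministically favor the token after the last one (mod vocab).
        nxt = (ys[-1] + 1) % self.vocab_size
        scores = [0.0] * self.vocab_size
        scores[nxt] = 1.0
        new_state = (state or 0) + 1  # e.g. a decoding-step counter
        return scores, new_state

def greedy_decode(scorer, sos, steps, x=None):
    """Greedy loop: thread the state, append the best-scoring token."""
    ys, state = [sos], None
    for _ in range(steps):
        scores, state = scorer.score(ys, state, x)
        ys.append(max(range(len(scores)), key=scores.__getitem__))
    return ys

print(greedy_decode(ToyScorer(5), sos=0, steps=4))  # -> [0, 1, 2, 3, 4]
```

batch_score follows the same pattern, but scores a whole batch of prefixes at once and returns a (n_batch, n_vocab) score tensor with one state per hypothesis.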