espnet.nets.batch_beam_search_online.BatchBeamSearchOnline
class espnet.nets.batch_beam_search_online.BatchBeamSearchOnline(*args, block_size=40, hop_size=16, look_ahead=16, disable_repetition_detection=False, encoded_feat_length_limit=0, decoder_text_length_limit=0, incremental_decode=False, time_sync=False, ctc=None, hold_n=0, transducer_conf=None, joint_network=None, **kwargs)
Bases: BatchBeamSearch
Online beam search implementation.
This simulates streaming decoding: it requires the encoded features of the entire utterance and extracts them from it block by block, as would be done in true streaming processing. This is based on Tsunoo et al., “STREAMING TRANSFORMER ASR WITH BLOCKWISE SYNCHRONOUS BEAM SEARCH” (https://arxiv.org/abs/2006.14941).
Initialize beam search.
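The constructor's `block_size` and `hop_size` arguments control how the full encoder output is cut into overlapping blocks. The following is a minimal, self-contained sketch of that blockwise extraction idea (the helper name `extract_blocks` is hypothetical, not part of ESPnet):

```python
def extract_blocks(enc_out, block_size=40, hop_size=16):
    """Split an encoded feature sequence (a list of frames) into
    overlapping blocks: each block spans up to `block_size` frames and
    successive blocks advance by `hop_size` frames."""
    blocks = []
    offset = 0
    while offset < len(enc_out):
        blocks.append(enc_out[offset:offset + block_size])
        if offset + block_size >= len(enc_out):
            break  # this block already reaches the end of the utterance
        offset += hop_size
    return blocks

frames = list(range(100))  # stand-in for a (T, D) encoder output
blocks = extract_blocks(frames)
print(len(blocks), len(blocks[0]), blocks[1][0])  # 5 40 16
```

In the real class the blocks are not materialized up front; `forward` advances over the stored encoder output one block at a time, which simulates features arriving incrementally.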
assemble_hyps(ended_hyps)
Assemble the hypotheses.
extend(x: Tensor, hyps: Hypothesis) → List[Hypothesis]
Extend probabilities and states with more encoded chunks.
- Parameters:
- x (torch.Tensor) – The extended encoder output feature
- hyps (Hypothesis) – Current list of hypotheses
- Returns: The extended hypotheses
- Return type: List[Hypothesis]
forward(x: Tensor, maxlenratio: float = 0.0, minlenratio: float = 0.0, is_final: bool = True) → List[Hypothesis]
Perform beam search.
- Parameters:
- x (torch.Tensor) – Encoded speech feature (T, D)
- maxlenratio (float) – Input length ratio used to obtain the max output length. If maxlenratio=0.0 (default), an end-detect function is used to automatically find the maximum hypothesis length.
- minlenratio (float) – Input length ratio used to obtain the min output length.
- Returns: N-best decoding results
- Return type: list[Hypothesis]
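As a sketch of how the two ratios bound the output length: both are multiplied by the number of encoded frames, with a ratio of 0.0 deferring to end detection so the input length itself becomes the hard cap. The helper below (`output_length_bounds` is a hypothetical name, not an ESPnet function) illustrates that relationship:

```python
def output_length_bounds(n_frames, maxlenratio=0.0, minlenratio=0.0):
    """Illustrate how length ratios translate into output-length bounds.

    With maxlenratio == 0.0 the search relies on an end-detect function,
    so the input length itself serves as the hard upper bound.
    """
    if maxlenratio == 0.0:
        maxlen = n_frames
    else:
        maxlen = max(1, int(maxlenratio * n_frames))
    minlen = int(minlenratio * n_frames)
    return maxlen, minlen

print(output_length_bounds(100))            # (100, 0)
print(output_length_bounds(100, 0.5, 0.1))  # (50, 10)
```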
process_one_block(h, is_final, maxlen, minlen, maxlenratio)
Recognize one block.
process_one_block_time_sync(h, is_final, maxlen, maxlenratio)
Recognize one block with time synchronization.
reset()
Reset parameters.
score_full(hyp: BatchHypothesis, x: Tensor, pre_x: Tensor | None = None) → Tuple[Dict[str, Tensor], Dict[str, Any]]
Score new hypothesis by self.full_scorers.
- Parameters:
- hyp (Hypothesis) – Hypothesis with prefix tokens to score
- x (torch.Tensor) – Corresponding input feature
- pre_x (torch.Tensor) – Encoded speech feature for sequential attention (T, D)
- Returns: Tuple of: a score dict for hyp, with string keys from self.full_scorers and tensor score values of shape (self.n_vocab,); and a state dict with string keys and state values from self.full_scorers
- Return type: Tuple[Dict[str, torch.Tensor], Dict[str, Any]]
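The contract above can be illustrated with a toy version that stands in for the real scorer objects (the function bodies and the two example scorers here are invented for illustration; real full scorers are ESPnet `ScorerInterface` objects operating on tensors):

```python
def score_full(full_scorers, prefix, x, states):
    """Sketch of the score_full contract: each full scorer maps
    (prefix, feature, state) -> (per-vocab scores, new state); the
    search collects both into dicts keyed by scorer name."""
    scores, new_states = {}, {}
    for name, scorer in full_scorers.items():
        scores[name], new_states[name] = scorer(prefix, x, states.get(name))
    return scores, new_states

# Toy scorers over a 3-token vocabulary.
def decoder(prefix, x, state):
    step = 0 if state is None else state + 1
    return [0.1, 0.2, 0.7], step

def lm(prefix, x, state):
    return [0.3, 0.3, 0.4], None

scores, states = score_full({"decoder": decoder, "lm": lm}, [1], None, {})
print(scores["decoder"], states["decoder"])  # [0.1, 0.2, 0.7] 0
```

The per-scorer scores are later combined (weighted) into a single score over the vocabulary, while the returned states let each scorer resume from the scored prefix on the next step.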