espnet.nets.batch_beam_search_online.BatchBeamSearchOnline
class espnet.nets.batch_beam_search_online.BatchBeamSearchOnline(*args, block_size=40, hop_size=16, look_ahead=16, disable_repetition_detection=False, encoded_feat_length_limit=0, decoder_text_length_limit=0, incremental_decode=False, time_sync=False, ctc=None, hold_n=0, transducer_conf=None, joint_network=None, **kwargs)
Bases: BatchBeamSearch
Online beam search implementation.
This simulates streaming decoding: it requires the encoded features of the entire utterance and extracts them from it block by block, as would be done in true streaming processing. This is based on Tsunoo et al., “STREAMING TRANSFORMER ASR WITH BLOCKWISE SYNCHRONOUS BEAM SEARCH” (https://arxiv.org/abs/2006.14941).
Initialize beam search.
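The constructor's `block_size` and `hop_size` arguments control how the full encoder output is cut into overlapping blocks. The following is a minimal, self-contained sketch of that blockwise extraction idea (the helper name `extract_blocks` is hypothetical, not part of ESPnet):

```python
def extract_blocks(enc_out, block_size=40, hop_size=16):
    """Split an encoded feature sequence (a list of frames) into
    overlapping blocks: each block spans up to `block_size` frames and
    successive blocks advance by `hop_size` frames."""
    blocks = []
    offset = 0
    while offset < len(enc_out):
        blocks.append(enc_out[offset:offset + block_size])
        if offset + block_size >= len(enc_out):
            break  # this block already reaches the end of the utterance
        offset += hop_size
    return blocks

frames = list(range(100))  # stand-in for a (T, D) encoder output
blocks = extract_blocks(frames)
print(len(blocks), len(blocks[0]), blocks[1][0])  # 5 40 16
```

In the real class the blocks are not materialized up front; `forward` advances over the stored encoder output one block at a time, which simulates features arriving incrementally.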
assemble_hyps(ended_hyps)
Assemble the hypotheses.
extend(x: Tensor, hyps: Hypothesis) → List[Hypothesis]
Extend probabilities and states with more encoded chunks.
- Parameters:
- x (torch.Tensor) – The extended encoder output feature
- hyps (Hypothesis) – Current list of hypotheses
- Returns: The extended hypotheses
- Return type: List[Hypothesis]
forward(x: Tensor, maxlenratio: float = 0.0, minlenratio: float = 0.0, is_final: bool = True) → List[Hypothesis]
Perform beam search.
- Parameters:
- x (torch.Tensor) – Encoded speech feature (T, D)
- maxlenratio (float) – Input length ratio used to obtain the max output length. If maxlenratio=0.0 (default), an end-detect function is used to automatically find the maximum hypothesis length.
- minlenratio (float) – Input length ratio used to obtain the min output length.
- Returns: N-best decoding results
- Return type: list[Hypothesis]
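As a sketch of how the two ratios bound the output length: both are multiplied by the number of encoded frames, with a ratio of 0.0 deferring to end detection so the input length itself becomes the hard cap. The helper below (`output_length_bounds` is a hypothetical name, not an ESPnet function) illustrates that relationship:

```python
def output_length_bounds(n_frames, maxlenratio=0.0, minlenratio=0.0):
    """Illustrate how length ratios translate into output-length bounds.

    With maxlenratio == 0.0 the search relies on an end-detect function,
    so the input length itself serves as the hard upper bound.
    """
    if maxlenratio == 0.0:
        maxlen = n_frames
    else:
        maxlen = max(1, int(maxlenratio * n_frames))
    minlen = int(minlenratio * n_frames)
    return maxlen, minlen

print(output_length_bounds(100))            # (100, 0)
print(output_length_bounds(100, 0.5, 0.1))  # (50, 10)
```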
process_one_block(h, is_final, maxlen, minlen, maxlenratio)
Recognize one block.
process_one_block_time_sync(h, is_final, maxlen, maxlenratio)
Recognize one block with time synchronization.
reset()
Reset parameters.
score_full(hyp: BatchHypothesis, x: Tensor, pre_x: Tensor | None = None) → Tuple[Dict[str, Tensor], Dict[str, Any]]
Score new hypothesis by self.full_scorers.
- Parameters:
- hyp (Hypothesis) – Hypothesis with prefix tokens to score
- x (torch.Tensor) – Corresponding input feature
- pre_x (torch.Tensor) – Encoded speech feature for sequential attention (T, D)
- Returns: Tuple of: a score dict for hyp, with string keys from self.full_scorers and tensor score values of shape (self.n_vocab,); and a state dict with string keys and state values from self.full_scorers
- Return type: Tuple[Dict[str, torch.Tensor], Dict[str, Any]]
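The contract above can be illustrated with a toy version that stands in for the real scorer objects (the function bodies and the two example scorers here are invented for illustration; real full scorers are ESPnet `ScorerInterface` objects operating on tensors):

```python
def score_full(full_scorers, prefix, x, states):
    """Sketch of the score_full contract: each full scorer maps
    (prefix, feature, state) -> (per-vocab scores, new state); the
    search collects both into dicts keyed by scorer name."""
    scores, new_states = {}, {}
    for name, scorer in full_scorers.items():
        scores[name], new_states[name] = scorer(prefix, x, states.get(name))
    return scores, new_states

# Toy scorers over a 3-token vocabulary.
def decoder(prefix, x, state):
    step = 0 if state is None else state + 1
    return [0.1, 0.2, 0.7], step

def lm(prefix, x, state):
    return [0.3, 0.3, 0.4], None

scores, states = score_full({"decoder": decoder, "lm": lm}, [1], None, {})
print(scores["decoder"], states["decoder"])  # [0.1, 0.2, 0.7] 0
```

The per-scorer scores are later combined (weighted) into a single score over the vocabulary, while the returned states let each scorer resume from the scored prefix on the next step.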