espnet2.sds.espnet_model.ESPnetSDSModelInterface

class espnet2.sds.espnet_model.ESPnetSDSModelInterface(ASR_option: str, LLM_option: str, TTS_option: str, type_option: str, access_token: str)

Bases: AbsESPnetModel

Web Interface for Spoken Dialog System models

This class provides a unified interface for integrating ASR, LLM, and TTS modules into cascaded spoken dialog systems, and it also supports E2E spoken dialog systems. It supports real-time interaction, including conversation management based on VAD (Voice Activity Detection).

Initializer method.

Parameters:
- ASR_option (str) – The selected ASR model option to use for speech-to-text processing.
- LLM_option (str) – The selected LLM model option for generating text responses.
- TTS_option (str) – The selected TTS model option for text-to-speech synthesis.
- type_option (str) – The type of SDS interaction to perform (e.g., cascaded or E2E).
- access_token (str) – The access token for accessing models hosted on Hugging Face.

collect_feats()

forward(y: ndarray, sr: int, stream: ndarray, asr_output_str: str | None, text_str: str | None, audio_output: Tuple[int, ndarray] | None, audio_output1: Tuple[int, ndarray] | None, latency_ASR: float, latency_LM: float, latency_TTS: float)

Processes audio input to generate ASR, LLM, and TTS outputs while calculating latencies.

This method handles both Cascaded and End-to-End setups.

Parameters:
- y – Input audio array.
- sr – Sampling rate of the input audio.
- stream – The current audio stream buffer.
- asr_output_str – Previously generated ASR output string.
- text_str – Previously generated LLM text response.
- audio_output – Previously generated TTS audio output.
- audio_output1 – Placeholder for audio stream.
- latency_ASR (float) – Latency for ASR processing.
- latency_LM (float) – Latency for LLM processing.
- latency_TTS (float) – Latency for TTS processing.
Returns: Tuple[str, str, Optional[Tuple[int, np.ndarray]], Optional[Tuple[int, np.ndarray]], float, float, float, bool]:
- Updated ASR output string.
- Updated LLM-generated text.
- Updated TTS audio output.
- Updated user audio stream output.
- ASR latency.
- LLM latency.
- TTS latency.
- Updated audio stream.
- Change flag indicating if output was updated.
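The per-stage latency bookkeeping described above can be sketched as follows. This is a minimal illustration, not the ESPnet implementation: `fake_asr`, `fake_llm`, `fake_tts`, `timed`, and `cascaded_turn` are hypothetical names, the stages are stubs, and audio is represented as a plain list of floats rather than an ndarray so the example has no dependencies.

```python
import time
from typing import Callable, List, Tuple


def timed(fn: Callable, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical stand-ins for the three cascaded stages.
def fake_asr(audio: List[float], sr: int) -> str:
    return "hello"


def fake_llm(prompt: str) -> str:
    return f"you said: {prompt}"


def fake_tts(text: str) -> Tuple[int, List[float]]:
    sr = 16000
    return sr, [0.0] * (sr // 10)  # 100 ms of silence


def cascaded_turn(y: List[float], sr: int):
    """One cascaded turn: ASR, then LLM, then TTS, timing each stage."""
    asr_text, latency_asr = timed(fake_asr, y, sr)
    reply, latency_lm = timed(fake_llm, asr_text)
    audio_out, latency_tts = timed(fake_tts, reply)
    return asr_text, reply, audio_out, latency_asr, latency_lm, latency_tts


asr_text, reply, audio_out, lat_asr, lat_lm, lat_tts = cascaded_turn(
    [0.0] * 16000, 16000
)
```

Timing each stage separately, rather than the whole turn, is what makes it possible to report ASR, LLM, and TTS latency as distinct values.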

handle_ASR_selection(option: str)

Handles the selection and initialization of an ASR model.

This method dynamically loads the selected ASR model based on the provided option. If the selected model is already active, it avoids reloading to save resources. The method temporarily hides the Gradio outputs during initialization to indicate progress.

Parameters:
- option (str) – The name of the ASR model to load.
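The "skip reloading if already active" behaviour can be sketched as below. This is an assumed pattern for illustration only: `ModelSlot`, its `loaders` mapping, and the option names `asr_a` and `asr_b` are hypothetical and do not come from ESPnet.

```python
from typing import Callable, Dict, Optional


class ModelSlot:
    """Holds one active model and reloads only when the option changes."""

    def __init__(self, loaders: Dict[str, Callable[[], object]]):
        self._loaders = loaders
        self.active_option: Optional[str] = None
        self.model: Optional[object] = None
        self.load_count = 0  # counts real loads, for illustration

    def select(self, option: str) -> object:
        # Requested model is already active: return it without reloading.
        if option == self.active_option:
            return self.model
        if option not in self._loaders:
            raise ValueError(f"unknown option: {option}")
        self.model = self._loaders[option]()
        self.active_option = option
        self.load_count += 1
        return self.model


# Hypothetical option names; strings stand in for loaded models.
slot = ModelSlot({"asr_a": lambda: "model A", "asr_b": lambda: "model B"})
slot.select("asr_a")
slot.select("asr_a")  # already active, no reload
slot.select("asr_b")  # different option, triggers a load
```

The same shape would apply to the LLM and TTS selection handlers, each with its own slot.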

handle_E2E_selection()

Handles the selection and initialization of the E2E model Mini-Omni.

This method dynamically loads the E2E spoken dialog model. If the model is already active, it avoids reloading to save resources.

handle_LLM_selection(option: str)

Handles the selection and initialization of an LLM.

This method dynamically loads the selected LLM based on the provided option. If the selected model is already active, it avoids reloading to save resources. The method temporarily hides the Gradio outputs during initialization to indicate progress.

Parameters:
- option (str) – The name of the LLM to load.

handle_TTS_selection(option: str)

Handles the selection and initialization of a Text-to-Speech (TTS) model.

This method dynamically loads the selected TTS model based on the provided option. If the selected model is already active, it avoids reloading to save resources. The method temporarily hides the Gradio outputs during initialization to indicate progress.

Parameters:
- option (str) – The name of the TTS model to load.

handle_type_selection(option: str, TTS_radio: str, ASR_radio: str, LLM_radio: str)

Handles the selection of the spoken dialogue model type (Cascaded or E2E) and dynamically updates the interface based on the selected option.

This method manages the initialization of ASR, TTS, and LLM models for Cascaded systems or switches to an End-to-End system. The Gradio interface components are updated accordingly.

Parameters:
- option (str) – The selected spoken dialogue system type (Cascaded or E2E).
- TTS_radio (str) – The selected TTS model for the Cascaded system.
- ASR_radio (str) – The selected ASR model for the Cascaded system.
- LLM_radio (str) – The selected LLM model for the Cascaded system.
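The branching between the two system types can be sketched as a small dispatch function. This is a hypothetical illustration: `select_type`, the literal option strings `"Cascaded"` and `"E2E"`, the `"e2e"` placeholder, and the returned dictionary are assumptions, and the real method updates Gradio components rather than returning a plan.

```python
from typing import Dict


def select_type(option: str, tts: str, asr: str, llm: str) -> Dict[str, object]:
    """Decide which models to load and whether cascaded controls are shown."""
    if option == "Cascaded":
        # Cascaded: load the three selected component models.
        return {"load": [asr, llm, tts], "show_cascaded_controls": True}
    if option == "E2E":
        # End-to-end: a single model replaces the ASR, LLM, and TTS stages.
        return {"load": ["e2e"], "show_cascaded_controls": False}
    raise ValueError(f"unknown system type: {option}")


# Hypothetical model names, passed in the same order as the documented signature.
cascaded_plan = select_type("Cascaded", "tts_a", "asr_a", "llm_a")
e2e_plan = select_type("E2E", "tts_a", "asr_a", "llm_a")
```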