espnet2.sds.vad.webrtc_vad.WebrtcVADModel
class espnet2.sds.vad.webrtc_vad.WebrtcVADModel(speakup_threshold: int = 12, continue_threshold: int = 10, min_speech_ms: int = 500, max_speech_ms: float = inf, target_sr: int = 16000)
Bases: AbsVAD
WebRTC VAD Model
This class uses WebRTC VAD (voice activity detection) to detect speech in an audio stream.
- Parameters:
- speakup_threshold (int, optional) – The threshold (in consecutive speech frames) for detecting the start of speech. Defaults to 12.
- continue_threshold (int, optional) – The threshold (in consecutive non-speech frames) for deciding that a speech segment has ended. Defaults to 10.
- min_speech_ms (int, optional) – The minimum duration (in milliseconds) for a valid speech segment. Defaults to 500 ms.
- max_speech_ms (float, optional) – The maximum duration (in milliseconds) for a valid speech segment. Defaults to infinity.
- target_sr (int, optional) – The target sampling rate for resampling the input audio. Defaults to 16000 Hz.
vad_output
Stores the speech segments detected as floating-point tensors.
- Type: Optional[list]
vad_bin_output
Stores the speech segments detected as binary audio.
- Type: Optional[list]
- Raises: ImportError – If the required webrtcvad library is not installed.
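The interplay between speakup_threshold, continue_threshold, and min_speech_ms can be sketched as a simple frame-level state machine. The function below is an illustrative reading of these parameters, not the actual ESPnet implementation: it groups per-frame boolean VAD decisions (such as those produced by webrtcvad) into segments, opening a segment after speakup_threshold consecutive speech frames, closing it after continue_threshold consecutive silence frames, and discarding segments shorter than a minimum frame count.

```python
def segment_speech(frame_flags, speakup_threshold=12, continue_threshold=10,
                   min_speech_frames=17):
    """Group per-frame VAD flags into speech segments (hypothetical logic).

    frame_flags: sequence of bools, one VAD decision per audio frame.
    min_speech_frames: e.g. 500 ms / 30 ms-per-frame ~= 17 frames.
    Returns a list of (start, end) frame indices, end exclusive.
    """
    segments = []
    in_speech = False
    speech_run = 0   # consecutive speech frames while idle
    silence_run = 0  # consecutive silence frames while inside a segment
    start = 0
    for i, is_speech in enumerate(frame_flags):
        if not in_speech:
            speech_run = speech_run + 1 if is_speech else 0
            if speech_run >= speakup_threshold:
                # Enough consecutive speech frames: open a segment,
                # backdated to where the speech run began.
                in_speech = True
                start = i - speech_run + 1
                silence_run = 0
        else:
            silence_run = silence_run + 1 if not is_speech else 0
            if silence_run >= continue_threshold:
                # Enough consecutive silence frames: close the segment.
                end = i - silence_run + 1
                if end - start >= min_speech_frames:
                    segments.append((start, end))
                in_speech = False
                speech_run = 0
    # Flush a segment still open at the end of the stream.
    if in_speech and len(frame_flags) - start >= min_speech_frames:
        segments.append((start, len(frame_flags)))
    return segments
```

With the defaults above, 20 speech frames followed by 12 silence frames yield one segment covering the first 20 frames, while a 5-frame burst never crosses speakup_threshold and produces no segment.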
forward(speech: ndarray, sample_rate: int, binary: bool = False) → ndarray | None
Process an audio stream and detect speech using WebRTC VAD.
- Parameters:
- speech (np.ndarray) – The raw audio stream in 16-bit PCM format.
- sample_rate (int) – The sampling rate of the input audio.
- binary (bool, optional) – If True, returns the binary audio output instead of the resampled float array. Defaults to False.
- Returns: The detected speech segment as a NumPy array (float or binary audio), or None if no valid segment is found.
- Return type: Optional[np.ndarray]
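Since forward() consumes 16-bit PCM and can return a float array, the conversion involved looks roughly like the sketch below. The helper names are hypothetical; only the int16 range and the fixed frame sizes accepted by webrtcvad (10, 20, or 30 ms at supported sample rates) are standard.

```python
import numpy as np


def pcm16_to_float(pcm_bytes):
    """Convert raw 16-bit PCM bytes to float32 samples in [-1.0, 1.0)."""
    samples = np.frombuffer(pcm_bytes, dtype=np.int16)
    return samples.astype(np.float32) / 32768.0


def frame_generator(samples, sample_rate=16000, frame_ms=30):
    """Split samples into fixed-size frames for frame-level VAD.

    webrtcvad only accepts 10, 20, or 30 ms frames, so any trailing
    partial frame is dropped.
    """
    n = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]
```

At 16 kHz, a 30 ms frame is 480 samples, so 1600 samples yield three full frames and a discarded 160-sample remainder.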
warmup()