espnet2.sds.vad.webrtc_vad.WebrtcVADModel
class espnet2.sds.vad.webrtc_vad.WebrtcVADModel(speakup_threshold: int = 12, continue_threshold: int = 10, min_speech_ms: int = 500, max_speech_ms: float = inf, target_sr: int = 16000)
Bases: AbsVAD
WebRTC VAD Model
This class uses WebRTC VAD (voice activity detection) to detect speech in an audio stream.
- Parameters:
- speakup_threshold (int, optional) – The threshold (in consecutive speech frames) for detecting the start of speech. Defaults to 12.
- continue_threshold (int, optional) – The threshold (in consecutive non-speech frames) for deciding that a speech segment has ended. Defaults to 10.
- min_speech_ms (int, optional) – The minimum duration (in milliseconds) for a valid speech segment. Defaults to 500 ms.
- max_speech_ms (float, optional) – The maximum duration (in milliseconds) for a valid speech segment. Defaults to infinity.
- target_sr (int, optional) – The target sampling rate for resampling the input audio. Defaults to 16000 Hz.
vad_output
Stores the speech segments detected as floating-point tensors.
- Type: Optional[list]
vad_bin_output
Stores the speech segments detected as binary audio.
- Type: Optional[list]
- Raises: ImportError – If the required webrtcvad library is not installed.
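The interplay between speakup_threshold, continue_threshold, and min_speech_ms can be sketched as a simple frame-level state machine. The function below is an illustrative reading of these parameters, not the actual ESPnet implementation: it groups per-frame boolean VAD decisions (such as those produced by webrtcvad) into segments, opening a segment after speakup_threshold consecutive speech frames, closing it after continue_threshold consecutive silence frames, and discarding segments shorter than a minimum frame count.

```python
def segment_speech(frame_flags, speakup_threshold=12, continue_threshold=10,
                   min_speech_frames=17):
    """Group per-frame VAD flags into speech segments (hypothetical logic).

    frame_flags: sequence of bools, one VAD decision per audio frame.
    min_speech_frames: e.g. 500 ms / 30 ms-per-frame ~= 17 frames.
    Returns a list of (start, end) frame indices, end exclusive.
    """
    segments = []
    in_speech = False
    speech_run = 0   # consecutive speech frames while idle
    silence_run = 0  # consecutive silence frames while inside a segment
    start = 0
    for i, is_speech in enumerate(frame_flags):
        if not in_speech:
            speech_run = speech_run + 1 if is_speech else 0
            if speech_run >= speakup_threshold:
                # Enough consecutive speech frames: open a segment,
                # backdated to where the speech run began.
                in_speech = True
                start = i - speech_run + 1
                silence_run = 0
        else:
            silence_run = silence_run + 1 if not is_speech else 0
            if silence_run >= continue_threshold:
                # Enough consecutive silence frames: close the segment.
                end = i - silence_run + 1
                if end - start >= min_speech_frames:
                    segments.append((start, end))
                in_speech = False
                speech_run = 0
    # Flush a segment still open at the end of the stream.
    if in_speech and len(frame_flags) - start >= min_speech_frames:
        segments.append((start, len(frame_flags)))
    return segments
```

With the defaults above, 20 speech frames followed by 12 silence frames yield one segment covering the first 20 frames, while a 5-frame burst never crosses speakup_threshold and produces no segment.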
forward(speech: ndarray, sample_rate: int, binary: bool = False) → ndarray | None
Process an audio stream and detect speech using WebRTC VAD.
- Parameters:
- speech (np.ndarray) – The raw audio stream in 16-bit PCM format.
- sample_rate (int) – The sampling rate of the input audio.
- binary (bool, optional) – If True, returns the binary audio output instead of the resampled float array. Defaults to False.
- Returns: The detected speech segment as a NumPy array (float or binary audio), or None if no valid segment is found.
- Return type: Optional[np.ndarray]
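Since forward() consumes 16-bit PCM and can return a float array, the conversion involved looks roughly like the sketch below. The helper names are hypothetical; only the int16 range and the fixed frame sizes accepted by webrtcvad (10, 20, or 30 ms at supported sample rates) are standard.

```python
import numpy as np


def pcm16_to_float(pcm_bytes):
    """Convert raw 16-bit PCM bytes to float32 samples in [-1.0, 1.0)."""
    samples = np.frombuffer(pcm_bytes, dtype=np.int16)
    return samples.astype(np.float32) / 32768.0


def frame_generator(samples, sample_rate=16000, frame_ms=30):
    """Split samples into fixed-size frames for frame-level VAD.

    webrtcvad only accepts 10, 20, or 30 ms frames, so any trailing
    partial frame is dropped.
    """
    n = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]
```

At 16 kHz, a 30 ms frame is 480 samples, so 1600 samples yield three full frames and a discarded 160-sample remainder.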
warmup()