espnet2.speechlm.model.speechlm.multimodal_io.audio.ContinuousAudioIO
class espnet2.speechlm.model.speechlm.multimodal_io.audio.ContinuousAudioIO(encoder_choice: str = 'huggingface', encoder_hf_model_tag: str = 'Qwen/Qwen2.5-Omni-7B', attn_implementation: str | None = None, dtype: str = 'bfloat16', device: str = 'cpu')
Bases: AbsIO
Continuous audio I/O for feature extraction.
This class handles continuous audio representations using neural encoders that produce dense feature vectors instead of discrete tokens.
Initialize continuous audio encoder.
- Parameters:
- encoder_choice – Type of encoder (“huggingface”)
- encoder_hf_model_tag – HuggingFace model identifier (e.g., “Qwen/Qwen2.5-Omni-7B”)
- attn_implementation – Attention implementation type
- dtype – Model dtype (“bfloat16”, “float16”, etc.)
- device – Device for model (“cpu”, “cuda”, etc.)
copy_for_worker() → ContinuousAudioIO
Create lightweight copy for multiprocessing workers.
For continuous audio, we create a new instance without the model since preprocessing doesn’t require the encoder model itself.
- Returns: Lightweight copy suitable for workers
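The copy-for-worker pattern can be sketched as follows. This is a hypothetical illustration, not the actual ESPnet implementation: the class name, constructor flag, and model stand-in are invented for the example; the point is that the worker copy keeps the configuration but skips loading the heavy encoder model.

```python
class EncoderIOSketch:
    """Hypothetical stand-in for an I/O class holding a large encoder model."""

    def __init__(self, model_tag: str, load_model: bool = True):
        self.model_tag = model_tag
        # Stand-in for an expensive HuggingFace model load; a worker copy
        # skips this entirely.
        self.model = object() if load_model else None

    def copy_for_worker(self) -> "EncoderIOSketch":
        # New instance shares the configuration but carries no model,
        # keeping dataloader worker processes small.
        return EncoderIOSketch(self.model_tag, load_model=False)


main_io = EncoderIOSketch("Qwen/Qwen2.5-Omni-7B")
worker_io = main_io.copy_for_worker()
```

Preprocessing in the worker only needs configuration (e.g., hop length and feature settings), so dropping the model avoids duplicating gigabytes of weights per worker.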
encode_batch(batch_data: Tensor, length: Tensor) → List[Tensor]
Encode batch of audio into continuous features.
Processes audio through the encoder model to extract dense features with proper attention masking based on actual audio lengths.
- Parameters:
- batch_data – Audio tensor [batch, samples, channels]
- length – Frame lengths for each sample [batch]
- Returns: List of audio feature tensors, one per sample in batch
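A minimal NumPy sketch of length-aware batch encoding, under assumptions: the encoder here is an identity stand-in, and the mask/trim logic only illustrates how per-sample lengths turn a padded batch into a list of variable-length feature tensors.

```python
import numpy as np


def encode_batch_sketch(batch_data: np.ndarray, length: np.ndarray) -> list:
    # batch_data: [batch, frames, feat_dim] padded batch;
    # length: [batch] valid frame counts per sample.
    batch, frames, _ = batch_data.shape

    # Attention mask: True for valid frames, False for padding.
    # A real encoder would consume this mask to ignore padded positions.
    mask = np.arange(frames)[None, :] < length[:, None]
    assert mask.shape == (batch, frames)

    # Stand-in "encoder": identity. The real model maps frames to
    # dense feature vectors.
    encoded = batch_data

    # Trim each sample back to its true length: one tensor per sample.
    return [encoded[i, : length[i]] for i in range(batch)]


feats = encode_batch_sketch(np.zeros((2, 10, 4)), np.array([7, 10]))
```

The returned list (rather than a padded tensor) lets downstream code splice each sample's features into its own token sequence without carrying padding.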
feature_dim() → int
Get feature dimension for continuous representation.
- Returns: Feature dimension of encoder output
find_length(data: Tuple[ndarray, int]) → int
Calculate frame length after encoding.
The length is computed analytically rather than by calling self.processor, which is too slow to run just for a length query.
- Parameters: data – Tuple of (audio_array, sample_rate) where audio_array has shape [num_channels, num_samples]
- Returns: Frame length after encoding (number of frames)
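The analytic length computation can be sketched as below. The constants are assumptions for illustration: a 160-sample hop is typical of Whisper-style 16 kHz frontends, and a factor-2 temporal downsampling stands in for the encoder's internal striding; the real values come from the HuggingFace feature extractor config.

```python
import numpy as np

# Illustrative constants; real values depend on the encoder's
# feature-extractor configuration and are assumptions here.
HOP_LENGTH = 160   # spectrogram hop in samples (10 ms at 16 kHz)
DOWNSAMPLE = 2     # encoder-internal temporal downsampling factor


def find_length_sketch(data: tuple) -> int:
    audio, sample_rate = data  # audio: [num_channels, num_samples]
    num_samples = audio.shape[-1]
    # Spectrogram frames, then encoder downsampling.
    mel_frames = num_samples // HOP_LENGTH
    return mel_frames // DOWNSAMPLE


n = find_length_sketch((np.zeros((1, 16000)), 16000))
```

Because this is pure integer arithmetic on the sample count, it runs in constant time, unlike invoking the full processor on the waveform.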
preprocess(data: Tuple[ndarray, int]) → Tuple[ndarray, Tuple[int, ndarray], ndarray]
Preprocess audio for continuous feature extraction.
Extracts spectrogram features and prepares them for batch encoding.
- Parameters: data – Tuple of (audio_array, sample_rate) where audio_array has shape [num_channels, num_samples]
- Returns:
- seq: Zero array [after_length, 1] as placeholder
- conti_feat: Tuple of (after_length, mel_features)
- loss_mask: Zero array [after_length, 1] (no discrete tokens)
- Return type: Tuple of (seq, conti_feat, loss_mask)
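The shape of the returned triple can be sketched with NumPy. This is an illustrative reconstruction, assuming the mel features have already been extracted upstream; dtypes and the exact conti_feat layout are assumptions based on the docstring above.

```python
import numpy as np


def preprocess_sketch(mel_features: np.ndarray) -> tuple:
    # mel_features: [after_length, n_mels] spectrogram for one utterance.
    after_length = mel_features.shape[0]

    # Placeholder token sequence: continuous I/O emits no discrete tokens,
    # so seq is all zeros with one "stream" per frame.
    seq = np.zeros((after_length, 1), dtype=np.int64)

    # Continuous features travel alongside the sequence as (length, features).
    conti_feat = (after_length, mel_features)

    # All-zero loss mask: no token-level loss is applied to this modality.
    loss_mask = np.zeros((after_length, 1), dtype=np.int64)

    return seq, conti_feat, loss_mask


seq, conti_feat, loss_mask = preprocess_sketch(np.zeros((50, 128)))
```

Keeping seq and loss_mask frame-aligned with the features lets the batch collator treat continuous and discrete modalities uniformly, with the zero mask excluding these positions from the token loss.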
