espnet2.speechlm.model.speechlm.multimodal_io.audio.ContinuousAudioIO
class espnet2.speechlm.model.speechlm.multimodal_io.audio.ContinuousAudioIO(encoder_choice: str = 'huggingface', encoder_hf_model_tag: str = 'Qwen/Qwen2.5-Omni-7B', attn_implementation: str | None = None, dtype: str = 'bfloat16', device: str = 'cpu')
Bases: AbsIO
Continuous audio I/O for feature extraction.
This class handles continuous audio representations using neural encoders that produce dense feature vectors instead of discrete tokens.
Initialize continuous audio encoder.
- Parameters:
- encoder_choice – Type of encoder (“huggingface”)
- encoder_hf_model_tag – HuggingFace model identifier (e.g., “Qwen/Qwen2.5-Omni-7B”)
- attn_implementation – Attention implementation type
- dtype – Model dtype (“bfloat16”, “float16”, etc.)
- device – Device for model (“cpu”, “cuda”, etc.)
copy_for_worker() → ContinuousAudioIO
Create lightweight copy for multiprocessing workers.
For continuous audio, we create a new instance without the model since preprocessing doesn’t require the encoder model itself.
- Returns: Lightweight copy suitable for workers
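The pattern behind copy_for_worker can be sketched in plain Python. The class and attribute names below are illustrative stand-ins, not the ESPnet implementation: the idea is to shallow-copy the configuration while dropping the heavy encoder model, since workers only need the preprocessing state.

```python
import copy

class EncoderHolder:
    """Stand-in for an I/O class that holds a heavy encoder model."""

    def __init__(self):
        self.model = object()                  # placeholder for the loaded encoder
        self.config = {"dtype": "bfloat16"}    # lightweight preprocessing config

    def copy_for_worker(self):
        # Shallow-copy the instance, then drop the model so multiprocessing
        # workers carry only what preprocessing actually needs.
        clone = copy.copy(self)
        clone.model = None
        return clone

holder = EncoderHolder()
worker_copy = holder.copy_for_worker()
```

The original instance keeps its model; only the worker copy is stripped.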
dummy_forward(ref_tensor: Tensor) → Tensor
Perform a dummy forward pass to include parameters in the graph.
This ensures all encoder parameters participate in the backward pass for proper gradient synchronization in distributed training.
- Parameters: ref_tensor – Reference tensor to get device and dtype from
- Returns: Output tensor from encode_batch (raw, not summed)
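The "dummy forward" trick can be illustrated with a minimal PyTorch sketch. The linear layer below is an assumed stand-in for the audio encoder; the real method runs encode_batch on a tiny input built from ref_tensor:

```python
import torch

# Sketch: run the encoder stand-in on a dummy input so every parameter
# enters the autograd graph, then scale its contribution to zero.
lin = torch.nn.Linear(4, 4)                 # stand-in for the encoder
ref = torch.zeros(2, 4)                     # reference gives device and dtype
dummy_out = lin(ref.to(lin.weight.dtype))   # all parameters participate
loss = dummy_out.sum() * 0.0                # zero contribution to the loss
loss.backward()                             # .grad is populated (with zeros)
```

This keeps gradient buckets consistent across ranks in distributed training even when a batch contains no audio for this modality.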
encode_batch(batch_data: Tensor, length: Tensor) → List[Tensor]
Encode batch of audio into continuous features.
Processes audio through the encoder model to extract dense features with proper attention masking based on actual audio lengths.
- Parameters:
- batch_data – Audio tensor [batch, samples, channels]
- length – Frame lengths for each sample [batch]
- Returns: List of audio feature tensors, one per sample in batch
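The return shape of encode_batch can be sketched without the model. The helper below is illustrative, not the ESPnet API: it shows how a padded batch of encoder outputs is unpadded into a per-sample list using the true frame lengths.

```python
# Illustrative sketch: trim each padded feature sequence to its true length,
# yielding one variable-length tensor (here, a nested list) per sample.
def split_by_length(padded, lengths):
    return [feats[:n] for feats, n in zip(padded, lengths)]

padded = [
    [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]],   # sample 0: 2 real frames + padding
    [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]],   # sample 1: 3 real frames
]
per_sample = split_by_length(padded, [2, 3])
```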
feature_dim() → int
Get feature dimension for continuous representation.
- Returns: Feature dimension of encoder output
find_length(data: Tuple[ndarray, int], before_length: int | Tensor | None = None) → int | Tensor
Calculate frame length after encoding.
We avoid calling self.processor here, since running it just to compute the length is very slow; the output length is derived arithmetically instead.
- Parameters:
- data – Tuple of (audio_array, sample_rate) where audio_array has shape [num_channels, num_samples]. Can be None if before_length is provided.
- before_length – Pre-computed frame length before downsampling. If provided, data is ignored. Can be int or torch.Tensor for batch processing.
- Returns: Frame length after encoding. Returns int if before_length is int or data is provided, returns torch.Tensor if before_length is a tensor.
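The arithmetic that replaces the slow processor call looks roughly like the sketch below. The hop size and downsampling factor are illustrative assumptions, not necessarily the values used by the Qwen2.5-Omni encoder:

```python
# Sketch of frame-length arithmetic: samples -> mel frames -> encoder frames.
def frames_after_encoding(num_samples, hop_length=160, downsample=2):
    mel_frames = num_samples // hop_length                # STFT framing
    return (mel_frames + downsample - 1) // downsample    # encoder downsampling

n = frames_after_encoding(16000)   # 1 s of 16 kHz audio under these assumptions
```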
preprocess(data: Tuple[ndarray, int]) → Tuple[ndarray, Tuple[int, ndarray], ndarray]
Preprocess audio for continuous feature extraction.
Extracts spectrogram features and prepares them for batch encoding.
- Parameters: data – Tuple of (audio_array, sample_rate) where audio_array has shape [num_channels, num_samples]
- Returns:
- seq: Zero array [after_length, 1] as placeholder
- conti_feat: Tuple of (after_length, mel_features)
- loss_mask: Zero array [after_length, 1] (no discrete tokens)
- Return type: Tuple of (seq, conti_feat, loss_mask)
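The shape contract of the returned triple can be sketched as follows. The hop size and mel dimension are assumed for illustration; only the structure (zero placeholder seq and loss_mask, plus a length/feature tuple) mirrors the documented return type:

```python
import numpy as np

# Sketch of the preprocess output contract (constants are assumptions).
def preprocess_sketch(audio, sample_rate, hop_length=160, n_mels=80):
    after_length = audio.shape[-1] // hop_length          # frames after framing
    mel_features = np.zeros((after_length, n_mels), dtype=np.float32)  # fake mels
    seq = np.zeros((after_length, 1), dtype=np.int64)      # placeholder tokens
    conti_feat = (after_length, mel_features)              # length + features
    loss_mask = np.zeros((after_length, 1), dtype=np.int64)  # no discrete targets
    return seq, conti_feat, loss_mask

seq, conti_feat, loss_mask = preprocess_sketch(np.zeros(16000), 16000)
```

Because this modality produces no discrete tokens, both seq and loss_mask are all zeros; the continuous features travel in conti_feat for batch encoding later.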
