espnet2.speechlm.model.speechlm.multimodal_io.audio.ContinuousAudioIO
class espnet2.speechlm.model.speechlm.multimodal_io.audio.ContinuousAudioIO(encoder_choice: str = 'huggingface', encoder_hf_model_tag: str = 'Qwen/Qwen2.5-Omni-7B', attn_implementation: str | None = None, dtype: str = 'bfloat16', device: str = 'cpu')
Bases: AbsIO
Continuous audio I/O for feature extraction.
This class handles continuous audio representations using neural encoders that produce dense feature vectors instead of discrete tokens.
Initialize continuous audio encoder.
- Parameters:
- encoder_choice – Type of encoder (“huggingface”)
- encoder_hf_model_tag – HuggingFace model identifier (e.g., “Qwen/Qwen2.5-Omni-7B”)
- attn_implementation – Attention implementation type
- dtype – Model dtype (“bfloat16”, “float16”, etc.)
- device – Device for model (“cpu”, “cuda”, etc.)
copy_for_worker() → ContinuousAudioIO
Create lightweight copy for multiprocessing workers.
For continuous audio, we create a new instance without the model since preprocessing doesn’t require the encoder model itself.
- Returns: Lightweight copy suitable for workers
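The pattern behind copy_for_worker can be sketched in plain Python. The class and attribute names below are illustrative stand-ins, not the ESPnet implementation: the idea is to shallow-copy the configuration while dropping the heavy encoder model, since workers only need the preprocessing state.

```python
import copy

class EncoderHolder:
    """Stand-in for an I/O class that holds a heavy encoder model."""

    def __init__(self):
        self.model = object()                  # placeholder for the loaded encoder
        self.config = {"dtype": "bfloat16"}    # lightweight preprocessing config

    def copy_for_worker(self):
        # Shallow-copy the instance, then drop the model so multiprocessing
        # workers carry only what preprocessing actually needs.
        clone = copy.copy(self)
        clone.model = None
        return clone

holder = EncoderHolder()
worker_copy = holder.copy_for_worker()
```

The original instance keeps its model; only the worker copy is stripped.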
dummy_forward(ref_tensor: Tensor) → Tensor
Perform a dummy forward pass to include parameters in the graph.
This ensures all encoder parameters participate in the backward pass for proper gradient synchronization in distributed training.
- Parameters: ref_tensor – Reference tensor to get device and dtype from
- Returns: Output tensor from encode_batch (raw, not summed)
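The "dummy forward" trick can be illustrated with a minimal PyTorch sketch. The linear layer below is an assumed stand-in for the audio encoder; the real method runs encode_batch on a tiny input built from ref_tensor:

```python
import torch

# Sketch: run the encoder stand-in on a dummy input so every parameter
# enters the autograd graph, then scale its contribution to zero.
lin = torch.nn.Linear(4, 4)                 # stand-in for the encoder
ref = torch.zeros(2, 4)                     # reference gives device and dtype
dummy_out = lin(ref.to(lin.weight.dtype))   # all parameters participate
loss = dummy_out.sum() * 0.0                # zero contribution to the loss
loss.backward()                             # .grad is populated (with zeros)
```

This keeps gradient buckets consistent across ranks in distributed training even when a batch contains no audio for this modality.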
encode_batch(batch_data: Tensor, length: Tensor) → List[Tensor]
Encode batch of audio into continuous features.
Processes audio through the encoder model to extract dense features with proper attention masking based on actual audio lengths.
- Parameters:
- batch_data – Audio tensor [batch, samples, channels]
- length – Frame lengths for each sample [batch]
- Returns: List of audio feature tensors, one per sample in batch
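The return shape of encode_batch can be sketched without the model. The helper below is illustrative, not the ESPnet API: it shows how a padded batch of encoder outputs is unpadded into a per-sample list using the true frame lengths.

```python
# Illustrative sketch: trim each padded feature sequence to its true length,
# yielding one variable-length tensor (here, a nested list) per sample.
def split_by_length(padded, lengths):
    return [feats[:n] for feats, n in zip(padded, lengths)]

padded = [
    [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]],   # sample 0: 2 real frames + padding
    [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]],   # sample 1: 3 real frames
]
per_sample = split_by_length(padded, [2, 3])
```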
feature_dim() → int
Get feature dimension for continuous representation.
- Returns: Feature dimension of encoder output
find_length(data: Tuple[ndarray, int], before_length: int | Tensor | None = None) → int | Tensor
Calculate frame length after encoding.
We avoid calling self.processor here, since running it just to compute the length is very slow; the output length is derived arithmetically instead.
- Parameters:
- data – Tuple of (audio_array, sample_rate) where audio_array has shape [num_channels, num_samples]. Can be None if before_length is provided.
- before_length – Pre-computed frame length before downsampling. If provided, data is ignored. Can be int or torch.Tensor for batch processing.
- Returns: Frame length after encoding. Returns int if before_length is int or data is provided, returns torch.Tensor if before_length is a tensor.
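The arithmetic that replaces the slow processor call looks roughly like the sketch below. The hop size and downsampling factor are illustrative assumptions, not necessarily the values used by the Qwen2.5-Omni encoder:

```python
# Sketch of frame-length arithmetic: samples -> mel frames -> encoder frames.
def frames_after_encoding(num_samples, hop_length=160, downsample=2):
    mel_frames = num_samples // hop_length                # STFT framing
    return (mel_frames + downsample - 1) // downsample    # encoder downsampling

n = frames_after_encoding(16000)   # 1 s of 16 kHz audio under these assumptions
```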
preprocess(data: Tuple[ndarray, int]) → Tuple[ndarray, Tuple[int, ndarray], ndarray]
Preprocess audio for continuous feature extraction.
Extracts spectrogram features and prepares them for batch encoding.
- Parameters: data – Tuple of (audio_array, sample_rate) where audio_array has shape [num_channels, num_samples]
- Returns:
- seq: Zero array [after_length, 1] as placeholder
- conti_feat: Tuple of (after_length, mel_features)
- loss_mask: Zero array [after_length, 1] (no discrete tokens)
- Return type: Tuple of (seq, conti_feat, loss_mask)
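The shape contract of the returned triple can be sketched as follows. The hop size and mel dimension are assumed for illustration; only the structure (zero placeholder seq and loss_mask, plus a length/feature tuple) mirrors the documented return type:

```python
import numpy as np

# Sketch of the preprocess output contract (constants are assumptions).
def preprocess_sketch(audio, sample_rate, hop_length=160, n_mels=80):
    after_length = audio.shape[-1] // hop_length          # frames after framing
    mel_features = np.zeros((after_length, n_mels), dtype=np.float32)  # fake mels
    seq = np.zeros((after_length, 1), dtype=np.int64)      # placeholder tokens
    conti_feat = (after_length, mel_features)              # length + features
    loss_mask = np.zeros((after_length, 1), dtype=np.int64)  # no discrete targets
    return seq, conti_feat, loss_mask

seq, conti_feat, loss_mask = preprocess_sketch(np.zeros(16000), 16000)
```

Because this modality produces no discrete tokens, both seq and loss_mask are all zeros; the continuous features travel in conti_feat for batch encoding later.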
