espnet2.speechlm.model.speechlm.multimodal_io.audio.DiscreteAudioIO
class espnet2.speechlm.model.speechlm.multimodal_io.audio.DiscreteAudioIO(codec_choice: str | None = None, codec_hf_model_tag: str | None = None, codec_max_token_per_frame: int = 8, ssl_choice: str | None = None, ssl_hf_model_tag: str | None = None, stream_weights: List[float] | None = None, delay_interleave: bool = False, device: str = 'cpu')
Bases: AbsIO
Discrete audio I/O using combined codec and SSL tokenizers.
This class handles audio encoding/decoding using both:
- Codec tokens (acoustic/low-level features) from neural audio codecs
- SSL tokens (semantic/high-level features) from self-supervised models
The tokens from both tokenizers are concatenated frame-by-frame to create a multi-stream representation where semantic and acoustic information are aligned temporally.
Initialize discrete audio I/O handler with combined tokenizers.
- Parameters:
- codec_choice – Type of codec to use (“ESPnet” or None to disable)
- codec_hf_model_tag – HuggingFace model tag for codec tokenizer
- codec_max_token_per_frame – Maximum number of codec tokens per frame (default: 8)
- ssl_choice – Type of SSL model to use (“ESPnet” or None to disable)
- ssl_hf_model_tag – HuggingFace model tag for SSL model (e.g., “espnet/xeus”)
- stream_weights – Loss weights for each stream, one value per stream. Order should be [SSL streams, codec streams]. If None, all streams get equal weight (1.0).
- delay_interleave – Whether to apply delay interleaving to multi-stream tokens (default: False)
- device – Device to run models on (default: “cpu”)
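A minimal construction sketch following the documented defaults; the codec model tag below is a placeholder, while “espnet/xeus” is the SSL tag quoted above:

```python
from espnet2.speechlm.model.speechlm.multimodal_io.audio import DiscreteAudioIO

# The codec tag here is hypothetical; substitute a real ESPnet codec model tag.
audio_io = DiscreteAudioIO(
    codec_choice="ESPnet",
    codec_hf_model_tag="<your-espnet-codec-model-tag>",
    codec_max_token_per_frame=8,
    ssl_choice="ESPnet",
    ssl_hf_model_tag="espnet/xeus",
    stream_weights=None,      # None => equal weight (1.0) for every stream
    delay_interleave=False,
    device="cpu",
)
```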
copy_for_worker() → DiscreteAudioIO
Create lightweight copy for multiprocessing workers.
Creates a new instance with the same parameters (which loads the models), then removes the heavy model components to reduce memory usage in workers while keeping the necessary metadata.
- Returns: Lightweight copy suitable for workers
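A sketch of how the lightweight copy might be used when handing work off to dataloader workers, assuming the instance built above:

```python
# Heavy models stay in the main process; the worker copy only needs
# metadata such as stream counts and vocabulary information.
main_io = audio_io                      # instance built as above
worker_io = main_io.copy_for_worker()   # stripped of model weights

assert worker_io.num_stream() == main_io.num_stream()
```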
decode_batch(codes: Tensor, lengths: Tensor) → Tuple[Tensor, Tensor]
Decode a batch of encoded tokens back to audio.
Note: Only codec tokens are used for audio reconstruction. SSL tokens are discarded as they represent semantic information that cannot be directly converted back to waveforms.
- Parameters:
- codes – Encoded tokens [batch, time, n_streams]
- lengths – Frame lengths [batch]
- Returns:
- audio: Reconstructed audio [batch, num_channels, num_samples]
- audio_lengths: Sample lengths [batch]
- Return type: Tuple of (audio, audio_lengths)
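A hedged round-trip sketch, assuming `codes` was produced by encode_batch (documented below) and that all frames are valid:

```python
import torch

# codes: [batch, time, n_streams], as returned by encode_batch
frame_lens = torch.full((codes.size(0),), codes.size(1), dtype=torch.long)

# SSL streams in `codes` are ignored; only the codec streams drive synthesis.
audio, audio_lens = audio_io.decode_batch(codes, frame_lens)
# audio: [batch, num_channels, num_samples], audio_lens: [batch]
```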
encode_batch(data: Tensor, lengths: Tensor) → Tensor
Encode a batch of audio data into discrete tokens.
- Parameters:
- data – Audio tensor of shape [batch, samples, num_channel]
- lengths – Effective sample lengths [batch]
- Returns: Encoded tokens [batch, time, n_streams]
- Return type: Tensor (codes)
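A minimal encoding sketch; the 16 kHz mono input is only an assumption used to size the dummy batch:

```python
import torch

# Two utterances of mono audio at an assumed 16 kHz sample rate.
wav = torch.randn(2, 32000, 1)            # [batch, samples, num_channel]
wav_lens = torch.tensor([32000, 24000])   # effective sample lengths

codes = audio_io.encode_batch(wav, wav_lens)
# codes: [batch, time, n_streams], SSL streams first, then codec streams
```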
find_length(data: Tuple[ndarray, int]) → int
Calculate frame length after encoding.
- Parameters: data – Tuple of (audio_array, sample_rate) where audio_array has shape [num_channels, num_samples]
- Returns: Frame length after encoding (number of frames)
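A sketch of pre-computing token lengths (e.g. for batching or length bucketing), again assuming 16 kHz mono input:

```python
import numpy as np

sample_rate = 16000
audio = np.random.randn(1, sample_rate * 3).astype(np.float32)  # [num_channels, num_samples]

n_frames = audio_io.find_length((audio, sample_rate))
# n_frames matches the time dimension encode_batch would produce for this clip
```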
get_stream_interval() → List[Tuple[int, int]] | None
Get vocabulary index ranges for each stream.
SSL streams come first, followed by codec streams. Each tuple represents (start_index, end_index) for that stream.
- Returns: List of (start, end) tuples for each stream’s vocabulary range
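A sketch of inspecting the per-stream vocabulary ranges; the ordering (SSL streams first, then codec streams) follows the description above:

```python
intervals = audio_io.get_stream_interval()
if intervals is not None:
    for idx, (start, end) in enumerate(intervals):
        # SSL streams come first, followed by codec streams.
        print(f"stream {idx}: start={start}, end={end}")
```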
get_stream_weight() → List[float] | None
Get loss weights for each stream.
- Returns: List of weight values for each stream. Order is [SSL streams, codec streams].
get_vocabulary() → List[str] | None
Get the complete vocabulary list across all streams.
- Returns: List of all token symbols for SSL and codec combined
num_stream() → int | None
Get number of parallel streams (SSL + codec).
- Returns: Total number of streams combining SSL and codec
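The stream-level accessors are expected to agree with each other; a small sanity-check sketch under that assumption:

```python
n = audio_io.num_stream()
weights = audio_io.get_stream_weight()
intervals = audio_io.get_stream_interval()

if n is not None and weights is not None and intervals is not None:
    assert len(weights) == n == len(intervals)
    # With stream_weights=None at construction, every weight is 1.0.
```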
preprocess(data: Tuple[ndarray, int]) → Tuple[ndarray, Tuple[int, ndarray] | None, ndarray]
Preprocess audio for discrete tokenization.
Since tokenization happens on the GPU, this method returns placeholder sequences and passes the raw audio through as continuous features for on-the-fly encoding.
- Parameters: data – Tuple of (audio_array, sample_rate) where audio_array has shape [num_channels, num_samples]
- Returns:
- seq: Zero-filled placeholder array [length, num_stream]
- conti_feat: Tuple of (length, transposed_audio) for GPU encoding
- loss_mask: Stream weights broadcasted to [length, num_stream]
- Return type: Tuple of (seq, conti_feat, loss_mask)
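A hedged sketch of the preprocessing path used by dataloaders, assuming 16 kHz mono input; the actual encoding of conti_feat happens later on the GPU via encode_batch:

```python
import numpy as np

sample_rate = 16000
audio = np.random.randn(1, sample_rate).astype(np.float32)  # [num_channels, num_samples]

seq, conti_feat, loss_mask = audio_io.preprocess((audio, sample_rate))
# seq:        zero-filled placeholder tokens, shape [length, num_stream]
# conti_feat: (length, transposed_audio) carried along for on-the-fly GPU encoding
# loss_mask:  stream weights broadcast to [length, num_stream]
```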
