espnet2.speechlm.model.speechlm.multimodal_io.audio.DiscreteAudioIO
class espnet2.speechlm.model.speechlm.multimodal_io.audio.DiscreteAudioIO(codec_choice: str | None = None, codec_hf_model_tag: str | None = None, codec_max_token_per_frame: int = 8, ssl_choice: str | None = None, ssl_hf_model_tag: str | None = None, stream_weights: List[float] | None = None, delay_interleave: bool = False, device: str = 'cpu')
Bases: AbsIO
Discrete audio I/O using combined codec and SSL tokenizers.
This class handles audio encoding/decoding using both:
- Codec tokens (acoustic/low-level features) from neural audio codecs
- SSL tokens (semantic/high-level features) from self-supervised models
The tokens from both tokenizers are concatenated frame-by-frame to create a multi-stream representation where semantic and acoustic information are aligned temporally.
Initialize discrete audio I/O handler with combined tokenizers.
- Parameters:
- codec_choice – Type of codec to use (“ESPnet” or None to disable)
- codec_hf_model_tag – HuggingFace model tag for codec tokenizer
- codec_max_token_per_frame – Maximum number of codec tokens per frame (default: 8)
- ssl_choice – Type of SSL model to use (“ESPnet” or None to disable)
- ssl_hf_model_tag – HuggingFace model tag for SSL model (e.g., “espnet/xeus”)
- stream_weights – Loss weights for each stream, one value per stream. Order should be [SSL streams, codec streams]. If None, all streams get equal weight (1.0).
- delay_interleave – Whether to apply delay interleaving to multi-stream tokens (default: False)
- device – Device to run models on (default: “cpu”)
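A minimal construction sketch following the documented defaults; the codec model tag below is a placeholder, while “espnet/xeus” is the SSL tag quoted above:

```python
from espnet2.speechlm.model.speechlm.multimodal_io.audio import DiscreteAudioIO

# The codec tag here is hypothetical; substitute a real ESPnet codec model tag.
audio_io = DiscreteAudioIO(
    codec_choice="ESPnet",
    codec_hf_model_tag="<your-espnet-codec-model-tag>",
    codec_max_token_per_frame=8,
    ssl_choice="ESPnet",
    ssl_hf_model_tag="espnet/xeus",
    stream_weights=None,      # None => equal weight (1.0) for every stream
    delay_interleave=False,
    device="cpu",
)
```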
copy_for_worker() → DiscreteAudioIO
Create lightweight copy for multiprocessing workers.
Creates a new instance with the same parameters (which loads the models), then removes the heavy model components to reduce memory usage in workers while keeping the necessary metadata.
- Returns: Lightweight copy suitable for workers
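A sketch of how the lightweight copy might be used when handing work off to dataloader workers, assuming the instance built above:

```python
# Heavy models stay in the main process; the worker copy only needs
# metadata such as stream counts and vocabulary information.
main_io = audio_io                      # instance built as above
worker_io = main_io.copy_for_worker()   # stripped of model weights

assert worker_io.num_stream() == main_io.num_stream()
```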
decode_batch(codes: Tensor, lengths: Tensor) → Tuple[Tensor, Tensor]
Decode a batch of encoded tokens back to audio.
Note: Only codec tokens are used for audio reconstruction. SSL tokens are discarded as they represent semantic information that cannot be directly converted back to waveforms.
- Parameters:
- codes – Encoded tokens [batch, time, n_streams]
- lengths – Frame lengths [batch]
- Returns:
- audio: Reconstructed audio [batch, num_channels, num_samples]
- audio_lengths: Sample lengths [batch]
- Return type: Tuple of (audio, audio_lengths)
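A hedged round-trip sketch, assuming `codes` was produced by encode_batch (documented below) and that all frames are valid:

```python
import torch

# codes: [batch, time, n_streams], as returned by encode_batch
frame_lens = torch.full((codes.size(0),), codes.size(1), dtype=torch.long)

# SSL streams in `codes` are ignored; only the codec streams drive synthesis.
audio, audio_lens = audio_io.decode_batch(codes, frame_lens)
# audio: [batch, num_channels, num_samples], audio_lens: [batch]
```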
encode_batch(data: Tensor, lengths: Tensor) → Tensor
Encode a batch of audio data into discrete tokens.
- Parameters:
- data – Audio tensor of shape [batch, samples, num_channel]
- lengths – Effective sample lengths [batch]
- Returns: Encoded tokens [batch, time, n_streams]
- Return type: Tensor (codes)
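A minimal encoding sketch; the 16 kHz mono input is only an assumption used to size the dummy batch:

```python
import torch

# Two utterances of mono audio at an assumed 16 kHz sample rate.
wav = torch.randn(2, 32000, 1)            # [batch, samples, num_channel]
wav_lens = torch.tensor([32000, 24000])   # effective sample lengths

codes = audio_io.encode_batch(wav, wav_lens)
# codes: [batch, time, n_streams], SSL streams first, then codec streams
```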
find_length(data: Tuple[ndarray, int]) → int
Calculate frame length after encoding.
- Parameters: data – Tuple of (audio_array, sample_rate) where audio_array has shape [num_channels, num_samples]
- Returns: Frame length after encoding (number of frames)
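A sketch of pre-computing token lengths (e.g. for batching or length bucketing), again assuming 16 kHz mono input:

```python
import numpy as np

sample_rate = 16000
audio = np.random.randn(1, sample_rate * 3).astype(np.float32)  # [num_channels, num_samples]

n_frames = audio_io.find_length((audio, sample_rate))
# n_frames matches the time dimension encode_batch would produce for this clip
```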
get_stream_interval() → List[Tuple[int, int]] | None
Get vocabulary index ranges for each stream.
SSL streams come first, followed by codec streams. Each tuple represents (start_index, end_index) for that stream.
- Returns: List of (start, end) tuples for each stream’s vocabulary range
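A sketch of inspecting the per-stream vocabulary ranges; the ordering (SSL streams first, then codec streams) follows the description above:

```python
intervals = audio_io.get_stream_interval()
if intervals is not None:
    for idx, (start, end) in enumerate(intervals):
        # SSL streams come first, followed by codec streams.
        print(f"stream {idx}: start={start}, end={end}")
```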
get_stream_weight() → List[float] | None
Get loss weights for each stream.
- Returns: List of weight values for each stream. Order is [SSL streams, codec streams].
get_vocabulary() → List[str] | None
Get the complete vocabulary list across all streams.
- Returns: List of all token symbols for SSL and codec combined
num_stream() → int | None
Get number of parallel streams (SSL + codec).
- Returns: Total number of streams combining SSL and codec
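The stream-level accessors are expected to agree with each other; a small sanity-check sketch under that assumption:

```python
n = audio_io.num_stream()
weights = audio_io.get_stream_weight()
intervals = audio_io.get_stream_interval()

if n is not None and weights is not None and intervals is not None:
    assert len(weights) == n == len(intervals)
    # With stream_weights=None at construction, every weight is 1.0.
```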
preprocess(data: Tuple[ndarray, int]) → Tuple[ndarray, Tuple[int, ndarray] | None, ndarray]
Preprocess audio for discrete tokenization.
Since tokenization happens on the GPU, this method returns placeholder sequences and passes the raw audio through as continuous features for on-the-fly encoding.
- Parameters: data – Tuple of (audio_array, sample_rate) where audio_array has shape [num_channels, num_samples]
- Returns:
- seq: Zero-filled placeholder array [length, num_stream]
- conti_feat: Tuple of (length, transposed_audio) for GPU encoding
- loss_mask: Stream weights broadcasted to [length, num_stream]
- Return type: Tuple of (seq, conti_feat, loss_mask)
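A hedged sketch of the preprocessing path used by dataloaders, assuming 16 kHz mono input; the actual encoding of conti_feat happens later on the GPU via encode_batch:

```python
import numpy as np

sample_rate = 16000
audio = np.random.randn(1, sample_rate).astype(np.float32)  # [num_channels, num_samples]

seq, conti_feat, loss_mask = audio_io.preprocess((audio, sample_rate))
# seq:        zero-filled placeholder tokens, shape [length, num_stream]
# conti_feat: (length, transposed_audio) carried along for on-the-fly GPU encoding
# loss_mask:  stream weights broadcast to [length, num_stream]
```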
