espnet2.speechlm.model.speechlm.speechlm_job.SpeechLMPreprocessor
class espnet2.speechlm.model.speechlm.speechlm_job.SpeechLMPreprocessor(is_train, multimodal_io, vocab, vocab_intervals, audio_input: str = 'continuous_audio', audio_output: str = 'discrete_audio', loss_region: str = 'assistant', batchfy_method: str = 'bucket', audio_cfg: float = 0.0, batch_length: int = -1)
Bases: object
Preprocessor for SpeechLM data handling.
Converts raw data into model-ready format with tokenization, padding, and loss mask generation for multimodal sequences.
collate_fn(data_lst)
Batch multiple samples for training.
Processes each sample, pads sequences to the same length, and organizes continuous features by modality. Returns a dict ready for the model's forward pass.
Every value in the returned dict is either a tensor or a list of strings; nested structures are not allowed.
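A minimal sketch of the padding and batching behavior described above (not the actual ESPnet implementation; the `PAD` id, the `dec_seq`/`lengths` keys, and the plain-list representation are illustrative assumptions):

```python
# Illustrative collate sketch: pad multi-stream token sequences to a
# common length and record the original lengths. Real collate_fn also
# builds loss masks and groups continuous features by modality.
PAD = 0  # hypothetical padding token id

def collate(seqs):
    """seqs: list of [T_i][num_streams] integer token frames."""
    max_len = max(len(s) for s in seqs)
    n_stream = len(seqs[0][0])
    # Append PAD frames so every sequence reaches max_len.
    padded = [s + [[PAD] * n_stream] * (max_len - len(s)) for s in seqs]
    lengths = [len(s) for s in seqs]
    return {"dec_seq": padded, "lengths": lengths}

batch = collate([
    [[5, 0], [6, 0], [7, 0]],  # 3 frames, 2 streams
    [[8, 0]],                  # 1 frame, 2 streams
])
```

After collation, both sequences span three frames and the shorter one ends in padding frames.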
diagnose(data)
Print a human-readable representation of processed data for debugging.
Shows tokens, loss masks, and continuous feature info frame by frame.
find_length(key, data_dict)
Quickly compute the sequence length without running full preprocessing.
Counts tokens for BOS, role/modality markers, content, and EOS/EOT. Used for efficient batch construction.
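A hedged sketch of this length accounting, assuming one BOS, a role marker and a modality marker per message, an EOT per message, and a final EOS (the exact special-token budget in espnet2 may differ):

```python
# Illustrative length estimate: sum special-token and content counts
# without tokenizing or building the full training sequence.
def find_length(messages):
    """messages: list of (role, content_tokens) pairs."""
    total = 1  # <bos>
    for role, content_tokens in messages:
        total += 2                  # role marker + modality marker
        total += len(content_tokens)  # content
        total += 1                  # <eot> closing the message
    total += 1  # final <eos>
    return total

n = find_length([("user", [1, 2, 3]), ("assistant", [4, 5])])
```

With three user tokens and two assistant tokens this yields 13: two single-token frames (BOS/EOS) plus two markers and one EOT per message.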
preprocessing(key, data_dict)
Convert a single raw data dict into training-ready format.
Applies the chat template, tokenizes content, adds special tokens, and creates loss masks. Returns a dict with sequences and features.
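One piece of this step, the loss mask under the default `loss_region='assistant'`, can be sketched as follows (an illustrative standalone function, not the espnet2 code; `make_loss_mask` and its message format are assumptions):

```python
# Illustrative loss-mask construction: compute loss only on tokens
# produced by the assistant, masking out user/system content.
def make_loss_mask(messages, loss_region="assistant"):
    """messages: list of (role, content_tokens) pairs.

    Returns a per-token mask: 1 where loss is computed, 0 elsewhere.
    """
    mask = []
    for role, tokens in messages:
        keep = 1 if role == loss_region else 0
        mask += [keep] * len(tokens)
    return mask

mask = make_loss_mask([("user", [1, 2]), ("assistant", [3, 4, 5])])
```

Here only the three assistant tokens contribute to the loss.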
special_mask(value)
Create a loss mask for special tokens (one frame, multi-stream).
Only the first stream carries the actual value; the others are zero.
special_token(token)
Convert a special token string into a multi-stream token array.
Places the token ID in the first stream and padding tokens in the remaining streams.
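Both helpers share the same single-frame, multi-stream layout; a minimal sketch (illustrative signatures, with a hypothetical `PAD` id and explicit `num_streams` argument rather than whatever the class stores internally):

```python
# Illustrative multi-stream frame layout: the payload lives in stream 0,
# and the remaining streams are filled with padding (tokens) or zeros (mask).
PAD = 0  # hypothetical padding id for non-first streams

def special_token(token_id, num_streams):
    """One token frame: id in stream 0, PAD in the remaining streams."""
    return [token_id] + [PAD] * (num_streams - 1)

def special_mask(value, num_streams):
    """One loss-mask frame: value in stream 0, zeros elsewhere."""
    return [value] + [0] * (num_streams - 1)

frame = special_token(42, 4)
mask_frame = special_mask(1, 4)
```

For a 4-stream codec this produces `[42, PAD, PAD, PAD]` for the token frame and `[1, 0, 0, 0]` for the mask frame.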
