espnet2.speechlm.model.speechlm.speechlm_job.SpeechLMPreprocessor
class espnet2.speechlm.model.speechlm.speechlm_job.SpeechLMPreprocessor(is_train, multimodal_io, vocab, vocab_intervals, audio_input: str = 'continuous_audio', audio_output: str = 'discrete_audio', loss_region: str = 'assistant', batchfy_method: str = 'bucket', audio_cfg: float = 0.0, batch_length: int = -1)
Bases: object
Preprocessor for SpeechLM data handling.
Converts raw data into model-ready format with tokenization, padding, and loss mask generation for multimodal sequences.
collate_fn(data_lst)
Batch multiple samples for training.
Processes each sample, pads sequences to the same length, and organizes continuous features by modality. Returns a dict ready for the model's forward pass.
Every value in the returned dict is either a tensor or a list of strings; nested structures are not allowed.
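A minimal sketch of the padding and batching behavior described above (not the actual ESPnet implementation; the `PAD` id, the `dec_seq`/`lengths` keys, and the plain-list representation are illustrative assumptions):

```python
# Illustrative collate sketch: pad multi-stream token sequences to a
# common length and record the original lengths. Real collate_fn also
# builds loss masks and groups continuous features by modality.
PAD = 0  # hypothetical padding token id

def collate(seqs):
    """seqs: list of [T_i][num_streams] integer token frames."""
    max_len = max(len(s) for s in seqs)
    n_stream = len(seqs[0][0])
    # Append PAD frames so every sequence reaches max_len.
    padded = [s + [[PAD] * n_stream] * (max_len - len(s)) for s in seqs]
    lengths = [len(s) for s in seqs]
    return {"dec_seq": padded, "lengths": lengths}

batch = collate([
    [[5, 0], [6, 0], [7, 0]],  # 3 frames, 2 streams
    [[8, 0]],                  # 1 frame, 2 streams
])
```

After collation, both sequences span three frames and the shorter one ends in padding frames.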
diagnose(data)
Print a human-readable representation of processed data for debugging.
Shows tokens, loss masks, and continuous feature info frame by frame.
find_length(key, data_dict)
Quickly compute the sequence length without running full preprocessing.
Counts tokens for BOS, role/modality markers, content, and EOS/EOT. Used for efficient batch construction.
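A hedged sketch of this length accounting, assuming one BOS, a role marker and a modality marker per message, an EOT per message, and a final EOS (the exact special-token budget in espnet2 may differ):

```python
# Illustrative length estimate: sum special-token and content counts
# without tokenizing or building the full training sequence.
def find_length(messages):
    """messages: list of (role, content_tokens) pairs."""
    total = 1  # <bos>
    for role, content_tokens in messages:
        total += 2                  # role marker + modality marker
        total += len(content_tokens)  # content
        total += 1                  # <eot> closing the message
    total += 1  # final <eos>
    return total

n = find_length([("user", [1, 2, 3]), ("assistant", [4, 5])])
```

With three user tokens and two assistant tokens this yields 13: two single-token frames (BOS/EOS) plus two markers and one EOT per message.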
preprocessing(key, data_dict)
Convert a single raw data dict into training-ready format.
Applies the chat template, tokenizes content, adds special tokens, and creates loss masks. Returns a dict with sequences and features.
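One piece of this step, the loss mask under the default `loss_region='assistant'`, can be sketched as follows (an illustrative standalone function, not the espnet2 code; `make_loss_mask` and its message format are assumptions):

```python
# Illustrative loss-mask construction: compute loss only on tokens
# produced by the assistant, masking out user/system content.
def make_loss_mask(messages, loss_region="assistant"):
    """messages: list of (role, content_tokens) pairs.

    Returns a per-token mask: 1 where loss is computed, 0 elsewhere.
    """
    mask = []
    for role, tokens in messages:
        keep = 1 if role == loss_region else 0
        mask += [keep] * len(tokens)
    return mask

mask = make_loss_mask([("user", [1, 2]), ("assistant", [3, 4, 5])])
```

Here only the three assistant tokens contribute to the loss.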
special_mask(value)
Create a loss mask for special tokens (one frame, multi-stream).
Only the first stream carries the actual value; the others are zero.
special_token(token)
Convert a special token string into a multi-stream token array.
Places the token ID in the first stream and padding tokens in the remaining streams.
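Both helpers share the same single-frame, multi-stream layout; a minimal sketch (illustrative signatures, with a hypothetical `PAD` id and explicit `num_streams` argument rather than whatever the class stores internally):

```python
# Illustrative multi-stream frame layout: the payload lives in stream 0,
# and the remaining streams are filled with padding (tokens) or zeros (mask).
PAD = 0  # hypothetical padding id for non-first streams

def special_token(token_id, num_streams):
    """One token frame: id in stream 0, PAD in the remaining streams."""
    return [token_id] + [PAD] * (num_streams - 1)

def special_mask(value, num_streams):
    """One loss-mask frame: value in stream 0, zeros elsewhere."""
    return [value] + [0] * (num_streams - 1)

frame = special_token(42, 4)
mask_frame = special_mask(1, 4)
```

For a 4-stream codec this produces `[42, PAD, PAD, PAD]` for the token frame and `[1, 0, 0, 0]` for the mask frame.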
