espnet2.speechlm.model.speechlm.multimodal_io.text.HuggingFaceTextIO
class espnet2.speechlm.model.speechlm.multimodal_io.text.HuggingFaceTextIO(tokenizer_name: str)
Bases: AbsIO
Text I/O using HuggingFace tokenizers.
This class provides text tokenization using HuggingFace’s pretrained tokenizers. Text is discrete with a single stream.
Initialize HuggingFace text tokenizer.
- Parameters: tokenizer_name – HuggingFace model name or path (e.g., “bert-base-uncased”, “gpt2”)
copy_for_worker() → HuggingFaceTextIO
Create copy for multiprocessing workers.
- Returns: A new instance with the same tokenizer
decode(tokens: ndarray) → str
Decode a 1D array of token IDs to a text string.
- Parameters: tokens – 1D numpy array of token IDs [seq_len]
- Returns: Decoded text string
find_length(data: str) → int
Get token count for length statistics.
- Parameters: data – Text string
- Returns: Number of tokens after tokenization
get_stream_interval() → List[Tuple[int, int]]
Get vocabulary range for single stream.
- Returns: [(0, vocab_size)] for text’s single stream
get_stream_weight() → List[float]
Get loss weight for single stream.
- Returns: [1.0] for single text stream
get_vocabulary() → List[str]
Get tokenizer vocabulary.
- Returns: List of all tokens, padded to model vocab size
num_stream() → int
Text uses a single stream.
- Returns: 1
preprocess(data: str) → Tuple[ndarray, None, ndarray]
Tokenize single text string for data loading.
- Parameters: data – Single text string
- Returns:
- tokens: Token IDs as numpy array [seq_len, 1]
- conti_feat: None (text is discrete)
- loss_mask: Loss weights [seq_len, 1], all 1.0
- Return type: Tuple of (tokens, conti_feat, loss_mask)
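The contract described above can be sketched with a toy whitespace tokenizer standing in for a HuggingFace one. The class name `ToyTextIO` and its vocabulary are hypothetical, chosen only to show the expected shapes: `preprocess` returns discrete tokens of shape [seq_len, 1], `conti_feat` is None, and the loss mask is all ones.

```python
import numpy as np

class ToyTextIO:
    """Hypothetical minimal sketch of the text AbsIO contract; not ESPnet code."""

    def __init__(self, vocab):
        self.vocab = list(vocab)
        self.tok2id = {t: i for i, t in enumerate(self.vocab)}

    def preprocess(self, data: str):
        # tokens: [seq_len, 1]; conti_feat: None (text is discrete);
        # loss_mask: [seq_len, 1], all 1.0
        ids = [self.tok2id[t] for t in data.split()]
        tokens = np.asarray(ids, dtype=np.int64).reshape(-1, 1)
        loss_mask = np.ones_like(tokens, dtype=np.float32)
        return tokens, None, loss_mask

    def decode(self, tokens: np.ndarray) -> str:
        # Accepts a 1D array of token IDs, as in the docs above
        return " ".join(self.vocab[i] for i in tokens.reshape(-1))

    def find_length(self, data: str) -> int:
        return len(data.split())

    def num_stream(self) -> int:
        return 1

    def get_stream_interval(self):
        # Single stream covering the whole vocabulary
        return [(0, len(self.vocab))]

    def get_stream_weight(self):
        return [1.0]

io = ToyTextIO(["hello", "world"])
tokens, conti_feat, loss_mask = io.preprocess("hello world hello")
print(tokens.shape)                   # (3, 1)
print(conti_feat)                     # None
print(io.decode(tokens.squeeze(1)))   # hello world hello
```

A real HuggingFace-backed implementation would replace the whitespace split with a pretrained tokenizer's encode/decode, but the return shapes and the single-stream bookkeeping stay the same.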
