espnet2.speechlm.model.speechlm.multimodal_io.text.HuggingFaceTextIO
class espnet2.speechlm.model.speechlm.multimodal_io.text.HuggingFaceTextIO(tokenizer_name: str)
Bases: AbsIO
Text I/O using HuggingFace tokenizers.
This class provides text tokenization using HuggingFace’s pretrained tokenizers. Text is discrete with a single stream.
Initialize HuggingFace text tokenizer.
- Parameters: tokenizer_name – HuggingFace model name or path (e.g., “bert-base-uncased”, “gpt2”)
copy_for_worker() → HuggingFaceTextIO
Create copy for multiprocessing workers.
- Returns: A new instance with the same tokenizer
decode(tokens: ndarray) → str
Decode a 1D array of token IDs to a text string.
- Parameters: tokens – 1D numpy array of token IDs [seq_len]
- Returns: Decoded text string
find_length(data: str) → int
Get token count for length statistics.
- Parameters: data – Text string
- Returns: Number of tokens after tokenization
get_stream_interval() → List[Tuple[int, int]]
Get vocabulary range for single stream.
- Returns: [(0, vocab_size)] for text’s single stream
get_stream_weight() → List[float]
Get loss weight for single stream.
- Returns: [1.0] for single text stream
get_vocabulary() → List[str]
Get tokenizer vocabulary.
- Returns: List of all tokens, padded to model vocab size
num_stream() → int
Text uses a single stream.
- Returns: 1
preprocess(data: str) → Tuple[ndarray, None, ndarray]
Tokenize single text string for data loading.
- Parameters: data – Single text string
- Returns:
- tokens: Token IDs as numpy array [seq_len, 1]
- conti_feat: None (text is discrete)
- loss_mask: Loss weights [seq_len, 1], all 1.0
- Return type: Tuple of (tokens, conti_feat, loss_mask)
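The contract described above can be sketched with a toy whitespace tokenizer standing in for a HuggingFace one. The class name `ToyTextIO` and its vocabulary are hypothetical, chosen only to show the expected shapes: `preprocess` returns discrete tokens of shape [seq_len, 1], `conti_feat` is None, and the loss mask is all ones.

```python
import numpy as np

class ToyTextIO:
    """Hypothetical minimal sketch of the text AbsIO contract; not ESPnet code."""

    def __init__(self, vocab):
        self.vocab = list(vocab)
        self.tok2id = {t: i for i, t in enumerate(self.vocab)}

    def preprocess(self, data: str):
        # tokens: [seq_len, 1]; conti_feat: None (text is discrete);
        # loss_mask: [seq_len, 1], all 1.0
        ids = [self.tok2id[t] for t in data.split()]
        tokens = np.asarray(ids, dtype=np.int64).reshape(-1, 1)
        loss_mask = np.ones_like(tokens, dtype=np.float32)
        return tokens, None, loss_mask

    def decode(self, tokens: np.ndarray) -> str:
        # Accepts a 1D array of token IDs, as in the docs above
        return " ".join(self.vocab[i] for i in tokens.reshape(-1))

    def find_length(self, data: str) -> int:
        return len(data.split())

    def num_stream(self) -> int:
        return 1

    def get_stream_interval(self):
        # Single stream covering the whole vocabulary
        return [(0, len(self.vocab))]

    def get_stream_weight(self):
        return [1.0]

io = ToyTextIO(["hello", "world"])
tokens, conti_feat, loss_mask = io.preprocess("hello world hello")
print(tokens.shape)                   # (3, 1)
print(conti_feat)                     # None
print(io.decode(tokens.squeeze(1)))   # hello world hello
```

A real HuggingFace-backed implementation would replace the whitespace split with a pretrained tokenizer's encode/decode, but the return shapes and the single-stream bookkeeping stay the same.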
