espnet2.text package

espnet2.text.phoneme_tokenizer

class espnet2.text.phoneme_tokenizer.G2p_en(no_space: bool = False)[source]

Bases: object

A wrapper class for g2p_en.G2p.

g2p_en.G2p isn’t picklable, so it can’t be copied to other processes via the multiprocessing module. As a workaround, g2p_en.G2p is instantiated lazily, upon the first call of this class.
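
Example (a minimal sketch, assuming the g2p_en package is installed; the exact phonemes depend on the g2p_en model):

>>> from espnet2.text.phoneme_tokenizer import G2p_en
>>> g2p = G2p_en(no_space=True)
>>> g2p("Hello")  # ARPABET-style phonemes, e.g. ['HH', 'AH0', 'L', 'OW1']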

class espnet2.text.phoneme_tokenizer.PhonemeTokenizer(g2p_type: Union[None, str], non_linguistic_symbols: Union[pathlib.Path, str, Iterable[str]] = None, space_symbol: str = '<space>', remove_non_linguistic_symbols: bool = False)[source]

Bases: espnet2.text.abs_tokenizer.AbsTokenizer

text2tokens(line: str) → List[str][source]
tokens2text(tokens: Iterable[str]) → str[source]
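
Example (a minimal sketch, assuming the "g2p_en" backend is installed; other g2p_type values select other backends, and note that phonemization is generally not invertible):

>>> from espnet2.text.phoneme_tokenizer import PhonemeTokenizer
>>> tokenizer = PhonemeTokenizer(g2p_type="g2p_en")
>>> tokenizer.text2tokens("Hello")  # e.g. ['HH', 'AH0', 'L', 'OW1']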
espnet2.text.phoneme_tokenizer.pyopenjtalk_g2p(text) → List[str][source]
espnet2.text.phoneme_tokenizer.pyopenjtalk_g2p_kana(text) → List[str][source]
espnet2.text.phoneme_tokenizer.pypinyin_g2p(text) → List[str][source]
espnet2.text.phoneme_tokenizer.pypinyin_g2p_phone(text) → List[str][source]
espnet2.text.phoneme_tokenizer.split_by_space(text) → List[str][source]
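
The pyopenjtalk_* functions require the pyopenjtalk package and the pypinyin_* functions require pypinyin; split_by_space needs no external dependency. A minimal example of the latter:

>>> from espnet2.text.phoneme_tokenizer import split_by_space
>>> split_by_space("HH AH0 L OW1")
['HH', 'AH0', 'L', 'OW1']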

espnet2.text.token_id_converter

class espnet2.text.token_id_converter.TokenIDConverter(token_list: Union[pathlib.Path, str, Iterable[str]], unk_symbol: str = '<unk>')[source]

Bases: object

get_num_vocabulary_size() → int[source]
ids2tokens(integers: Union[numpy.ndarray, Iterable[int]]) → List[str][source]
tokens2ids(tokens: Iterable[str]) → List[int][source]
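
Example (a minimal sketch with an in-memory token list; unk_symbol is expected to appear in the list, and out-of-vocabulary tokens are mapped to its id):

>>> from espnet2.text.token_id_converter import TokenIDConverter
>>> converter = TokenIDConverter(["<blank>", "a", "b", "<unk>"])
>>> converter.tokens2ids(["a", "b", "x"])  # 'x' falls back to the <unk> id
[1, 2, 3]
>>> converter.ids2tokens([1, 2])
['a', 'b']
>>> converter.get_num_vocabulary_size()
4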

espnet2.text.cleaner

class espnet2.text.cleaner.TextCleaner(cleaner_types: Collection[str] = None)[source]

Bases: object

Text cleaner.

Examples

>>> cleaner = TextCleaner("tacotron")
>>> cleaner("(Hello-World);   &  jr. & dr.")
'HELLO WORLD, AND JUNIOR AND DOCTOR'

espnet2.text.char_tokenizer

class espnet2.text.char_tokenizer.CharTokenizer(non_linguistic_symbols: Union[pathlib.Path, str, Iterable[str]] = None, space_symbol: str = '<space>', remove_non_linguistic_symbols: bool = False)[source]

Bases: espnet2.text.abs_tokenizer.AbsTokenizer

text2tokens(line: str) → List[str][source]
tokens2text(tokens: Iterable[str]) → str[source]
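
Example (a minimal sketch with the default settings; spaces are mapped to space_symbol so that tokens2text can invert the tokenization):

>>> from espnet2.text.char_tokenizer import CharTokenizer
>>> tokenizer = CharTokenizer()
>>> tokenizer.text2tokens("Hi ho")
['H', 'i', '<space>', 'h', 'o']
>>> tokenizer.tokens2text(['H', 'i', '<space>', 'h', 'o'])
'Hi ho'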

espnet2.text.build_tokenizer

espnet2.text.build_tokenizer.build_tokenizer(token_type: str, bpemodel: Union[pathlib.Path, str, Iterable[str]] = None, non_linguistic_symbols: Union[pathlib.Path, str, Iterable[str]] = None, remove_non_linguistic_symbols: bool = False, space_symbol: str = '<space>', delimiter: str = None, g2p_type: str = None) → espnet2.text.abs_tokenizer.AbsTokenizer[source]

A helper function to instantiate a tokenizer.
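
Example (a minimal sketch; token_type selects the concrete tokenizer, e.g. "char", "word", "bpe", or "phn", and the remaining arguments are forwarded to it):

>>> from espnet2.text.build_tokenizer import build_tokenizer
>>> tokenizer = build_tokenizer("char")
>>> tokenizer.text2tokens("ab c")
['a', 'b', '<space>', 'c']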

espnet2.text.abs_tokenizer

class espnet2.text.abs_tokenizer.AbsTokenizer[source]

Bases: abc.ABC

abstract text2tokens(line: str) → List[str][source]
abstract tokens2text(tokens: Iterable[str]) → str[source]
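
Example (a minimal sketch of a hypothetical subclass, here called CommaTokenizer for illustration; a concrete tokenizer only has to implement the two abstract methods):

>>> from typing import Iterable, List
>>> from espnet2.text.abs_tokenizer import AbsTokenizer
>>> class CommaTokenizer(AbsTokenizer):  # hypothetical subclass for illustration
...     def text2tokens(self, line: str) -> List[str]:
...         return line.split(",")
...     def tokens2text(self, tokens: Iterable[str]) -> str:
...         return ",".join(tokens)
>>> CommaTokenizer().text2tokens("a,b,c")
['a', 'b', 'c']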

espnet2.text.word_tokenizer

class espnet2.text.word_tokenizer.WordTokenizer(delimiter: str = None, non_linguistic_symbols: Union[pathlib.Path, str, Iterable[str]] = None, remove_non_linguistic_symbols: bool = False)[source]

Bases: espnet2.text.abs_tokenizer.AbsTokenizer

text2tokens(line: str) → List[str][source]
tokens2text(tokens: Iterable[str]) → str[source]
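
Example (a minimal sketch, assuming the default behavior of splitting on whitespace when delimiter is None and rejoining with a single space):

>>> from espnet2.text.word_tokenizer import WordTokenizer
>>> tokenizer = WordTokenizer()
>>> tokenizer.text2tokens("Hello World!!")
['Hello', 'World!!']
>>> tokenizer.tokens2text(['Hello', 'World!!'])
'Hello World!!'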

espnet2.text.__init__

espnet2.text.sentencepiece_tokenizer

class espnet2.text.sentencepiece_tokenizer.SentencepiecesTokenizer(model: Union[pathlib.Path, str])[source]

Bases: espnet2.text.abs_tokenizer.AbsTokenizer

text2tokens(line: str) → List[str][source]
tokens2text(tokens: Iterable[str]) → str[source]
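
Example (a minimal sketch, assuming a trained SentencePiece model file; the path "bpe.model" is a placeholder, and the exact pieces depend on the trained model):

>>> from espnet2.text.sentencepiece_tokenizer import SentencepiecesTokenizer
>>> tokenizer = SentencepiecesTokenizer("bpe.model")  # placeholder path
>>> tokenizer.text2tokens("Hello World")  # pieces depend on the model, e.g. ['▁He', 'llo', '▁World']
>>> tokenizer.tokens2text(['▁He', 'llo', '▁World'])  # reconstructs 'Hello World'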