espnet2.text package

espnet2.text.abs_tokenizer

class espnet2.text.abs_tokenizer.AbsTokenizer[source]

Bases: abc.ABC

abstract text2tokens(line: str) → List[str][source]
abstract tokens2text(tokens: Iterable[str]) → str[source]
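
Concrete tokenizers implement both abstract methods. A minimal sketch of a custom subclass (the WhitespaceTokenizer name and its behavior are illustrative, not part of ESPnet):

>>> from typing import Iterable, List
>>> from espnet2.text.abs_tokenizer import AbsTokenizer
>>> class WhitespaceTokenizer(AbsTokenizer):
...     def text2tokens(self, line: str) -> List[str]:
...         return line.split()  # split on runs of whitespace
...     def tokens2text(self, tokens: Iterable[str]) -> str:
...         return " ".join(tokens)  # inverse: rejoin with single spaces
>>> WhitespaceTokenizer().text2tokens("hello world")
['hello', 'world']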

espnet2.text.cleaner

class espnet2.text.cleaner.TextCleaner(cleaner_types: Optional[Collection[str]] = None)[source]

Bases: object

Text cleaner.

Examples

>>> cleaner = TextCleaner("tacotron")
>>> cleaner("(Hello-World);   &  jr. & dr.")
'HELLO WORLD, AND JUNIOR AND DOCTOR'

espnet2.text.token_id_converter

class espnet2.text.token_id_converter.TokenIDConverter(token_list: Union[pathlib.Path, str, Iterable[str]], unk_symbol: str = '<unk>')[source]

Bases: object

get_num_vocabulary_size() → int[source]
ids2tokens(integers: Union[numpy.ndarray, Iterable[int]]) → List[str][source]
tokens2ids(tokens: Iterable[str]) → List[int][source]
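
A minimal usage sketch with an illustrative token list; out-of-vocabulary tokens are mapped to the index of unk_symbol, so '<unk>' is included in the list:

>>> from espnet2.text.token_id_converter import TokenIDConverter
>>> converter = TokenIDConverter(["<blank>", "a", "b", "<unk>"])
>>> converter.get_num_vocabulary_size()
4
>>> converter.tokens2ids(["a", "b", "oov"])  # "oov" falls back to <unk>
[1, 2, 3]
>>> converter.ids2tokens([1, 2])
['a', 'b']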

espnet2.text.sentencepiece_tokenizer

class espnet2.text.sentencepiece_tokenizer.SentencepiecesTokenizer(model: Union[pathlib.Path, str], encode_kwargs: Dict = {})[source]

Bases: espnet2.text.abs_tokenizer.AbsTokenizer

text2tokens(line: str) → List[str][source]
tokens2text(tokens: Iterable[str]) → str[source]
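
A usage sketch, assuming "bpe.model" is the path to a trained SentencePiece model (hypothetical file):

>>> from espnet2.text.sentencepiece_tokenizer import SentencepiecesTokenizer
>>> tokenizer = SentencepiecesTokenizer("bpe.model")  # hypothetical model path
>>> tokens = tokenizer.text2tokens("hello world")
>>> tokenizer.tokens2text(tokens)
'hello world'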

espnet2.text.char_tokenizer

class espnet2.text.char_tokenizer.CharTokenizer(non_linguistic_symbols: Union[pathlib.Path, str, Iterable[str], None] = None, space_symbol: str = '<space>', remove_non_linguistic_symbols: bool = False, nonsplit_symbols: Optional[Iterable[str]] = None)[source]

Bases: espnet2.text.abs_tokenizer.AbsTokenizer

text2tokens(line: str) → List[str][source]
tokens2text(tokens: Iterable[str]) → str[source]
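
A usage sketch with the defaults; spaces are mapped to space_symbol and restored by tokens2text:

>>> from espnet2.text.char_tokenizer import CharTokenizer
>>> tokenizer = CharTokenizer()
>>> tokens = tokenizer.text2tokens("hi there")
>>> tokens
['h', 'i', '<space>', 't', 'h', 'e', 'r', 'e']
>>> tokenizer.tokens2text(tokens)
'hi there'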

espnet2.text.__init__

espnet2.text.whisper_token_id_converter

class espnet2.text.whisper_token_id_converter.OpenAIWhisperTokenIDConverter(model_type: str, language: Optional[str] = 'en', task: str = 'transcribe', added_tokens_txt: Optional[str] = None, sot: bool = False, speaker_change_symbol: str = '<sc>')[source]

Bases: object

get_num_vocabulary_size() → int[source]
ids2tokens(integers: Union[numpy.ndarray, Iterable[int]]) → List[str][source]
tokens2ids(tokens: Iterable[str]) → List[int][source]
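
A hedged sketch; it assumes "whisper_multilingual" is an accepted model_type value and that the openai-whisper package is installed:

>>> from espnet2.text.whisper_token_id_converter import OpenAIWhisperTokenIDConverter
>>> converter = OpenAIWhisperTokenIDConverter("whisper_multilingual", language="en")  # model_type value assumed
>>> vocab_size = converter.get_num_vocabulary_size()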

espnet2.text.word_tokenizer

class espnet2.text.word_tokenizer.WordTokenizer(delimiter: Optional[str] = None, non_linguistic_symbols: Union[pathlib.Path, str, Iterable[str], None] = None, remove_non_linguistic_symbols: bool = False)[source]

Bases: espnet2.text.abs_tokenizer.AbsTokenizer

text2tokens(line: str) → List[str][source]
tokens2text(tokens: Iterable[str]) → str[source]
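
A usage sketch; with delimiter=None the text is split on whitespace:

>>> from espnet2.text.word_tokenizer import WordTokenizer
>>> tokenizer = WordTokenizer()
>>> tokenizer.text2tokens("hello world")
['hello', 'world']
>>> tokenizer.tokens2text(['hello', 'world'])
'hello world'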

espnet2.text.hugging_face_token_id_converter

class espnet2.text.hugging_face_token_id_converter.HuggingFaceTokenIDConverter(model_name_or_path: str)[source]

Bases: object

get_num_vocabulary_size() → int[source]
ids2tokens(integers: Union[numpy.ndarray, Iterable[int]]) → List[str][source]
tokens2ids(tokens: Iterable[str]) → List[int][source]
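
A hedged sketch; "bert-base-uncased" stands in for any Hugging Face checkpoint and requires the transformers package:

>>> from espnet2.text.hugging_face_token_id_converter import HuggingFaceTokenIDConverter
>>> converter = HuggingFaceTokenIDConverter("bert-base-uncased")  # illustrative checkpoint
>>> ids = converter.tokens2ids(["hello", "world"])
>>> converter.ids2tokens(ids)
['hello', 'world']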

espnet2.text.korean_cleaner

class espnet2.text.korean_cleaner.KoreanCleaner[source]

Bases: object

classmethod normalize_text(text)[source]
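
A hedged sketch; the expected output assumes normalize_text rewrites digits to their Korean readings and may differ by version:

>>> from espnet2.text.korean_cleaner import KoreanCleaner
>>> KoreanCleaner.normalize_text("3시")  # digit normalized to its Korean reading (assumed)
'삼시'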

espnet2.text.whisper_tokenizer

class espnet2.text.whisper_tokenizer.OpenAIWhisperTokenizer(model_type: str, language: str = 'en', task: str = 'transcribe', sot: bool = False, speaker_change_symbol: str = '<sc>', added_tokens_txt: Optional[str] = None)[source]

Bases: espnet2.text.abs_tokenizer.AbsTokenizer

text2tokens(line: str) → List[str][source]
tokens2text(tokens: Iterable[str]) → str[source]
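
A hedged round-trip sketch under the same assumption that "whisper_multilingual" is an accepted model_type value:

>>> from espnet2.text.whisper_tokenizer import OpenAIWhisperTokenizer
>>> tokenizer = OpenAIWhisperTokenizer("whisper_multilingual", language="en")  # model_type value assumed
>>> tokens = tokenizer.text2tokens("hello world")
>>> tokenizer.tokens2text(tokens)
'hello world'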

espnet2.text.hugging_face_tokenizer

class espnet2.text.hugging_face_tokenizer.HuggingFaceTokenizer(model: Union[pathlib.Path, str])[source]

Bases: espnet2.text.abs_tokenizer.AbsTokenizer

text2tokens(line: str) → List[str][source]
tokens2text(tokens: Iterable[str]) → str[source]
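
A hedged sketch; the model argument is a Hugging Face checkpoint name or local path (illustrative here), and the transformers package is required:

>>> from espnet2.text.hugging_face_tokenizer import HuggingFaceTokenizer
>>> tokenizer = HuggingFaceTokenizer("bert-base-uncased")  # illustrative checkpoint
>>> tokens = tokenizer.text2tokens("hello world")
>>> tokenizer.tokens2text(tokens)
'hello world'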

espnet2.text.build_tokenizer

espnet2.text.build_tokenizer.build_tokenizer(token_type: str, bpemodel: Union[pathlib.Path, str, Iterable[str], None] = None, non_linguistic_symbols: Union[pathlib.Path, str, Iterable[str], None] = None, remove_non_linguistic_symbols: bool = False, space_symbol: str = '<space>', delimiter: Optional[str] = None, g2p_type: Optional[str] = None, nonsplit_symbol: Optional[Iterable[str]] = None, encode_kwargs: Optional[Dict] = None, whisper_language: Optional[str] = None, whisper_task: Optional[str] = None, sot_asr: bool = False) → espnet2.text.abs_tokenizer.AbsTokenizer[source]

A helper function to instantiate a tokenizer for the given token_type.
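
A usage sketch dispatching on token_type; the "char" and "word" values mirror the tokenizer classes in this package:

>>> from espnet2.text.build_tokenizer import build_tokenizer
>>> char_tokenizer = build_tokenizer("char")
>>> char_tokenizer.text2tokens("hi there")
['h', 'i', '<space>', 't', 'h', 'e', 'r', 'e']
>>> word_tokenizer = build_tokenizer("word")
>>> word_tokenizer.text2tokens("hi there")
['hi', 'there']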

espnet2.text.phoneme_tokenizer

class espnet2.text.phoneme_tokenizer.G2p_en(no_space: bool = False)[source]

Bases: object

A picklable wrapper for g2p_en.G2p.

g2p_en.G2p isn't picklable, so it can't be copied to other processes via the multiprocessing module. As a workaround, g2p_en.G2p is instantiated when this class is first called.

class espnet2.text.phoneme_tokenizer.G2pk(descritive=False, group_vowels=False, to_syl=False, no_space=False, explicit_space=False, space_symbol='<space>')[source]

Bases: object

A picklable wrapper for g2pk.G2p.

g2pk.G2p isn't picklable, so it can't be copied to other processes via the multiprocessing module. As a workaround, g2pk.G2p is instantiated when this class is first called.

class espnet2.text.phoneme_tokenizer.IsG2p(dialect: str = 'standard', syllabify: bool = True, word_sep: str = ', ', use_dict: bool = True)[source]

Bases: object

Minimal wrapper for https://github.com/grammatek/ice-g2p

The g2p module uses a Bi-LSTM model along with a pronunciation dictionary to generate the phonemization. Unfortunately, it does not yet support multi-threaded phonemization.

class espnet2.text.phoneme_tokenizer.Jaso(space_symbol=' ', no_space=False)[source]

Bases: object

JAMO_LEADS = 'ᄀᄁᄂᄃᄄᄅᄆᄇᄈᄉᄊᄋᄌᄍᄎᄏᄐᄑᄒ'
JAMO_TAILS = 'ᆨᆩᆪᆫᆬᆭᆮᆯᆰᆱᆲᆳᆴᆵᆶᆷᆸᆹᆺᆻᆼᆽᆾᆿᇀᇁᇂ'
JAMO_VOWELS = 'ᅡᅢᅣᅤᅥᅦᅧᅨᅩᅪᅫᅬᅭᅮᅯᅰᅱᅲᅳᅴᅵ'
PUNC = "!'(),-.:;?"
SPACE = ' '
VALID_CHARS = "ᄀᄁᄂᄃᄄᄅᄆᄇᄈᄉᄊᄋᄌᄍᄎᄏᄐᄑ하ᅢᅣᅤᅥᅦᅧᅨᅩᅪᅫᅬᅭᅮᅯᅰᅱᅲᅳᅴᅵᆨᆩᆪᆫᆬᆭᆮᆯᆰᆱᆲᆳᆴᆵᆶᆷᆸᆹᆺᆻᆼᆽᆾᆿᇀᇁᇂ!'(),-.:;? "

class espnet2.text.phoneme_tokenizer.PhonemeTokenizer(g2p_type: Union[None, str], non_linguistic_symbols: Union[None, pathlib.Path, str, Iterable[str]] = None, space_symbol: str = '<space>', remove_non_linguistic_symbols: bool = False)[source]

Bases: espnet2.text.abs_tokenizer.AbsTokenizer

text2tokens(line: str) → List[str][source]
text2tokens_svs(syllable: str) → List[str][source]
tokens2text(tokens: Iterable[str]) → str[source]
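
A usage sketch with the g2p_en backend (requires the g2p_en package); the output shown follows CMUdict's entry for "hello" and may vary by version:

>>> from espnet2.text.phoneme_tokenizer import PhonemeTokenizer
>>> tokenizer = PhonemeTokenizer(g2p_type="g2p_en")
>>> tokenizer.text2tokens("hello")
['HH', 'AH0', 'L', 'OW1']
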
class espnet2.text.phoneme_tokenizer.Phonemizer(backend, word_separator: Optional[str] = None, syllable_separator: Optional[str] = None, phone_separator: Optional[str] = ' ', strip=False, split_by_single_token: bool = False, **phonemizer_kwargs)[source]

Bases: object

Phonemizer module for various languages.

This is a wrapper module for https://github.com/bootphon/phonemizer. You can define various g2p modules by specifying options for the phonemizer.

See available options:

https://github.com/bootphon/phonemizer/blob/master/phonemizer/phonemize.py#L32
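
A hedged sketch; the "espeak" backend name and the language keyword are assumptions forwarded to the phonemizer package, which must be installed along with the espeak engine:

>>> from espnet2.text.phoneme_tokenizer import Phonemizer
>>> g2p = Phonemizer(backend="espeak", language="en-us")  # extra kwargs go to the backend
>>> phones = g2p("hello")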

espnet2.text.phoneme_tokenizer.pyopenjtalk_g2p(text) → List[str][source]
espnet2.text.phoneme_tokenizer.pyopenjtalk_g2p_accent(text) → List[str][source]
espnet2.text.phoneme_tokenizer.pyopenjtalk_g2p_accent_with_pause(text) → List[str][source]
espnet2.text.phoneme_tokenizer.pyopenjtalk_g2p_kana(text) → List[str][source]
espnet2.text.phoneme_tokenizer.pyopenjtalk_g2p_prosody(text: str, drop_unvoiced_vowels: bool = True) → List[str][source]

Extract phoneme + prosody symbol sequence from input full-context labels.

The algorithm is based on "Prosodic features control by symbols as input of sequence-to-sequence acoustic modeling for neural TTS" with some tweaks by r9y9.

Parameters:
  • text (str) – Input text.

  • drop_unvoiced_vowels (bool) – Whether to drop unvoiced vowels.

Returns:

List of phoneme + prosody symbols.

Return type:

List[str]

Examples

>>> from espnet2.text.phoneme_tokenizer import pyopenjtalk_g2p_prosody
>>> pyopenjtalk_g2p_prosody("こんにちは。")
['^', 'k', 'o', '[', 'N', 'n', 'i', 'ch', 'i', 'w', 'a', '$']

espnet2.text.phoneme_tokenizer.pypinyin_g2p(text) → List[str][source]
espnet2.text.phoneme_tokenizer.pypinyin_g2p_phone(text) → List[str][source]
espnet2.text.phoneme_tokenizer.pypinyin_g2p_phone_without_prosody(text) → List[str][source]
espnet2.text.phoneme_tokenizer.split_by_space(text) → List[str][source]
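
A usage sketch for the whitespace splitter:

>>> from espnet2.text.phoneme_tokenizer import split_by_space
>>> split_by_space("ni hao")
['ni', 'hao']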