espnetez.preprocess.sentencepiece.train_sentencepiece
espnetez.preprocess.sentencepiece.train_sentencepiece(dump_text_path: str | Path, output_path: str | Path, vocab_size: int = 5000, character_coverage: float = 0.9995, model_type: str = 'bpe', user_defined_symbols: list = [])
Main function to train a SentencePiece model.
This function trains a SentencePiece model using the provided training data and saves the resulting model and vocabulary files to the specified output directory. The model can be customized through various parameters such as vocabulary size, character coverage, and model type.
- Parameters:
- dump_text_path (Union[str, Path]) – Path to the train.txt file containing the training data for the SentencePiece model.
- output_path (Union[str, Path]) – Output directory where the trained SentencePiece model and vocabulary list will be stored.
- vocab_size (int, optional) – The size of the vocabulary to be generated by the SentencePiece model. Defaults to 5000.
- character_coverage (float, optional) – The fraction of characters in the training data that the model should cover, given as a value between 0 and 1. Defaults to 0.9995.
- model_type (str, optional) – The type of model to train. Options are 'bpe' (Byte Pair Encoding), 'unigram', 'char', and 'word'. Defaults to 'bpe'.
- user_defined_symbols (list, optional) – A list of user-defined symbols that should be included in the model. Defaults to an empty list.
- Raises:
- FileNotFoundError – If the specified dump_text_path does not exist.
- Exception – If the training of the SentencePiece model fails for any reason.
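The parameters and exceptions listed above suggest a few checks that can be run before training starts. The following is a minimal sketch of such a pre-flight check. `check_train_args` and `SUPPORTED_MODEL_TYPES` are hypothetical names introduced here for illustration and are not part of espnetez; the value ranges are assumptions drawn from the parameter descriptions.

```python
from pathlib import Path

# Model types named in the parameter description (assumed to be the full set).
SUPPORTED_MODEL_TYPES = ("bpe", "unigram", "char", "word")


def check_train_args(
    dump_text_path,
    vocab_size=5000,
    character_coverage=0.9995,
    model_type="bpe",
):
    """Hypothetical helper: validate arguments before calling train_sentencepiece."""
    path = Path(dump_text_path)
    # Mirrors the documented FileNotFoundError for a missing training file.
    if not path.is_file():
        raise FileNotFoundError(f"Training text not found: {path}")
    if vocab_size <= 0:
        raise ValueError(f"vocab_size must be positive, got {vocab_size}")
    # Coverage is a fraction, so it has to lie in (0, 1].
    if not 0.0 < character_coverage <= 1.0:
        raise ValueError(
            f"character_coverage must be in (0, 1], got {character_coverage}"
        )
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError(
            f"model_type must be one of {SUPPORTED_MODEL_TYPES}, got {model_type!r}"
        )
    return path
```

Running a check like this first turns a misconfigured call into a specific error message instead of the generic Exception raised when training fails.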
Examples
>>> from espnetez.preprocess.sentencepiece import train_sentencepiece
>>> train_sentencepiece(
...     dump_text_path='path/to/train.txt',
...     output_path='path/to/output',
...     vocab_size=8000,
...     character_coverage=0.995,
...     model_type='unigram',
...     user_defined_symbols=['<user_sym1>', '<user_sym2>'],
... )
NOTE
Ensure that the train.txt file has been prepared using the prepare_sentences function before calling this function. The output directory will be created if it does not already exist.
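Once train.txt has been prepared, it can help to inspect it before choosing character_coverage and vocab_size. The sketch below counts the distinct characters in the file and how many of the most frequent ones are needed to reach a given coverage. It is a rough illustration of the idea behind character coverage and not SentencePiece's exact procedure; `chars_needed_for_coverage` is a hypothetical helper, and the file is assumed to be UTF-8 text with one sentence per line.

```python
from collections import Counter
from pathlib import Path


def chars_needed_for_coverage(text_path, coverage=0.9995):
    """Hypothetical helper: return (distinct_chars, chars_needed).

    chars_needed is the smallest number of most-frequent characters whose
    occurrences account for at least `coverage` of all characters in the file.
    """
    counts = Counter()
    with Path(text_path).open(encoding="utf-8") as f:
        for line in f:
            # Count every character except the trailing newline.
            counts.update(line.rstrip("\n"))

    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"{text_path} contains no characters")

    covered = 0
    needed = 0
    # Walk characters from most to least frequent until coverage is reached.
    for _, n in counts.most_common():
        covered += n
        needed += 1
        if covered / total >= coverage:
            break
    return len(counts), needed
```

If the second number is much smaller than the first, the corpus contains many rare characters, and a coverage slightly below 1.0 keeps them from taking up vocabulary entries. When the two numbers are close, as is typical for small alphabets, a coverage of 1.0 costs little.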