core tools

ESPnet provides several command-line tools for training and evaluating neural networks (NN) under espnet/bin:

  • Align text to audio using CTC segmentation with a pre-trained speech recognition model

  • Enhance noisy speech for speech recognition

  • Transcribe text from speech using a speech recognition model on one CPU or GPU

  • Train an automatic speech recognition (ASR) model on one CPU, one or multiple GPUs

  • Train a new language model on one CPU or one GPU

  • Train a neural machine translation (NMT) model on one CPU, one or multiple GPUs

  • Translate text from speech using a speech translation model on one CPU or GPU

  • Train a speech translation (ST) model on one CPU, one or multiple GPUs

  • Synthesize speech from text using a TTS model on one CPU

  • Train a new text-to-speech (TTS) model on one CPU, one or multiple GPUs

  • Convert speech using a VC model on one CPU

  • Train a new voice conversion (VC) model on one CPU, one or multiple GPUs

usage: [-h] [--config CONFIG] [--ngpu NGPU]
                    [--dtype {float16,float32,float64}] [--backend {pytorch}]
                    [--debugmode DEBUGMODE] [--verbose VERBOSE]
                    [--preprocess-conf PREPROCESS_CONF]
                    [--data-json DATA_JSON] [--utt-text UTT_TEXT] --model
                    MODEL [--model-conf MODEL_CONF] [--num-encs NUM_ENCS]
                    [--subsampling-factor SUBSAMPLING_FACTOR]
                    [--frame-duration FRAME_DURATION]
                    [--min-window-size MIN_WINDOW_SIZE]
                    [--max-window-size MAX_WINDOW_SIZE]
                    [--use-dict-blank USE_DICT_BLANK] [--set-blank SET_BLANK]
                    [--gratis-blank GRATIS_BLANK]
                    [--replace-spaces-with-blanks REPLACE_SPACES_WITH_BLANKS]
                    [--scoring-length SCORING_LENGTH] --output OUTPUT

Named Arguments


Decoding config file path.


Number of GPUs (max. 1 is supported)

Default: 0


Possible choices: float16, float32, float64

Float precision (only available in --api v2)

Default: “float32”


Possible choices: pytorch

Backend library

Default: “pytorch”



Default: 1

--verbose, -V

Verbose option

Default: 1


The configuration file for the pre-processing


Json of recognition data for audio and text


Text separated into utterances


Model file parameters to read


Model config file


Number of encoders in the model.

Default: 1


Subsampling factor. If the encoder sub-samples its input, the number of frames at the CTC layer is reduced by this factor. For example, a BLSTMP with subsampling 1_2_2_1_1 has a subsampling factor of 4.
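
As a rough illustration of the arithmetic above, the sketch below derives the subsampling factor from a layer-wise subsampling string such as 1_2_2_1_1 and estimates how many frames remain at the CTC layer. The function names are hypothetical, and the ceiling-based frame estimate is an assumption; the exact frame count depends on the encoder implementation.

```python
import math


def subsampling_factor(spec: str) -> int:
    """Multiply the per-layer factors of a spec such as '1_2_2_1_1'."""
    return math.prod(int(part) for part in spec.split("_"))


def ctc_frames(num_input_frames: int, factor: int) -> int:
    """Estimate the frame count at the CTC layer (assumed: rounded up)."""
    return math.ceil(num_input_frames / factor)


factor = subsampling_factor("1_2_2_1_1")
print(factor)                    # 4
print(ctc_frames(1000, factor))  # 250
```

The resulting value is what would be passed as the subsampling factor for such a model.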


Non-overlapping duration of a single frame in milliseconds.


Minimum window size considered for utterance.


Maximum window size considered for utterance.




Index of model dictionary for blank token (default: 0).


Set the transition cost of the blank token to zero. Audio sections labeled with blank tokens can then be skipped without penalty. Useful if there are unrelated audio segments between utterances.


Fill blanks in between words to better model pauses between words. Segments can be misaligned if this option is combined with --gratis-blank. May increase length of ground truth.


Changes partitioning length L for calculation of the confidence score.


Output segments file

Enhance noisy speech for speech recognition

usage: [-h] [--config CONFIG] [--config2 CONFIG2]
                      [--config3 CONFIG3] [--ngpu NGPU]
                      [--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
                      [--seed SEED] [--verbose VERBOSE]
                      [--batchsize BATCHSIZE]
                      [--preprocess-conf PREPROCESS_CONF]
                      [--recog-json RECOG_JSON] --model MODEL
                      [--model-conf MODEL_CONF]
                      [--enh-wspecifier ENH_WSPECIFIER]
                      [--enh-filetype {mat,hdf5,sound.hdf5,sound}] [--fs FS]
                      [--keep-length KEEP_LENGTH] [--image-dir IMAGE_DIR]
                      [--num-images NUM_IMAGES] [--apply-istft APPLY_ISTFT]
                      [--istft-win-length ISTFT_WIN_LENGTH]
                      [--istft-n-shift ISTFT_N_SHIFT]
                      [--istft-window ISTFT_WINDOW]

Named Arguments


config file path


second config file path that overwrites the settings in --config.


third config file path that overwrites the settings in --config and --config2.
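
The precedence between the three config files can be pictured as a left-to-right merge in which later files win. The sketch below only illustrates that ordering with plain dictionaries; the `merge_configs` helper and the example values are hypothetical and do not show how ESPnet actually loads its config files.

```python
def merge_configs(*configs: dict) -> dict:
    """Merge settings left to right; later configs overwrite earlier ones."""
    merged: dict = {}
    for config in configs:
        merged.update(config)
    return merged


# Hypothetical contents of the files given to --config, --config2, --config3.
config = {"ngpu": 0, "batchsize": 1, "fs": 16000}
config2 = {"batchsize": 4}
config3 = {"batchsize": 8, "ngpu": 1}

settings = merge_configs(config, config2, config3)
print(settings)  # {'ngpu': 1, 'batchsize': 8, 'fs': 16000}
```

Settings that appear only in the first file (here `fs`) survive unchanged, while any key repeated in a later file takes the later value.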


Number of GPUs

Default: 0


Possible choices: chainer, pytorch

Backend library

Default: “chainer”



Default: 1


Random seed

Default: 1

--verbose, -V

Verbose option

Default: 1


Batch size for beam search (0: means no batch processing)

Default: 1


The configuration file for the pre-processing


Filename of recognition data (json)


Model file parameters to read


Model config file


Specify how the enhanced speech is written out, e.g. ark,scp:outdir,wav.scp


Possible choices: mat, hdf5, sound.hdf5, sound

Specify the file format for enhanced speech. “mat” is the matrix format in Kaldi

Default: “sound”


The sample frequency

Default: 16000


Adjust the output length to match with the input for enhanced speech

Default: True


The directory saving the images.


The number of image files to be saved. If negative, all samples are saved.

Default: 20


Apply istft to the output from the network

Default: True


The window length for istft. This option is ignored if stft is found in the preprocess-conf

Default: 512


The window shift for istft. This option is ignored if stft is found in the preprocess-conf

Default: 256


The window type for istft. This option is ignored if stft is found in the preprocess-conf

Default: “hann”

Transcribe text from speech using a speech recognition model on one CPU or GPU

usage: [-h] [--config CONFIG] [--config2 CONFIG2]
                    [--config3 CONFIG3] [--ngpu NGPU]
                    [--dtype {float16,float32,float64}]
                    [--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
                    [--seed SEED] [--verbose VERBOSE] [--batchsize BATCHSIZE]
                    [--preprocess-conf PREPROCESS_CONF] [--api {v1,v2}]
                    [--recog-json RECOG_JSON] --result-label RESULT_LABEL
                    --model MODEL [--model-conf MODEL_CONF]
                    [--num-spkrs {1,2}] [--num-encs NUM_ENCS] [--nbest NBEST]
                    [--beam-size BEAM_SIZE] [--penalty PENALTY]
                    [--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
                    [--ctc-weight CTC_WEIGHT]
                    [--weights-ctc-dec WEIGHTS_CTC_DEC]
                    [--ctc-window-margin CTC_WINDOW_MARGIN]
                    [--search-type {default,nsc,tsd,alsd,maes}]
                    [--nstep NSTEP] [--prefix-alpha PREFIX_ALPHA]
                    [--max-sym-exp MAX_SYM_EXP] [--u-max U_MAX]
                    [--expansion-gamma EXPANSION_GAMMA]
                    [--expansion-beta EXPANSION_BETA]
                    [--score-norm [SCORE_NORM]]
                    [--softmax-temperature SOFTMAX_TEMPERATURE]
                    [--rnnlm RNNLM] [--rnnlm-conf RNNLM_CONF]
                    [--word-rnnlm WORD_RNNLM]
                    [--word-rnnlm-conf WORD_RNNLM_CONF]
                    [--word-dict WORD_DICT] [--lm-weight LM_WEIGHT]
                    [--ngram-model NGRAM_MODEL] [--ngram-weight NGRAM_WEIGHT]
                    [--ngram-scorer {full,part}]
                    [--streaming-mode {window,segment}]
                    [--streaming-window STREAMING_WINDOW]
                    [--streaming-min-blank-dur STREAMING_MIN_BLANK_DUR]
                    [--streaming-onset-margin STREAMING_ONSET_MARGIN]
                    [--streaming-offset-margin STREAMING_OFFSET_MARGIN]
                    [--maskctc-n-iterations MASKCTC_N_ITERATIONS]
                    [--maskctc-probability-threshold MASKCTC_PROBABILITY_THRESHOLD]
                    [--quantize-config [QUANTIZE_CONFIG [QUANTIZE_CONFIG ...]]]
                    [--quantize-dtype {float16,qint8}]
                    [--quantize-asr-model QUANTIZE_ASR_MODEL]
                    [--quantize-lm-model QUANTIZE_LM_MODEL]

Named Arguments


Config file path


Second config file path that overwrites the settings in --config


Third config file path that overwrites the settings in --config and --config2


Number of GPUs

Default: 0


Possible choices: float16, float32, float64

Float precision (only available in --api v2)

Default: “float32”


Possible choices: chainer, pytorch

Backend library

Default: “chainer”



Default: 1


Random seed

Default: 1

--verbose, -V

Verbose option

Default: 1


Batch size for beam search (0: means no batch processing)

Default: 1


The configuration file for the pre-processing


Possible choices: v1, v2

Beam search APIs. v1: Default API. It only supports the ASRInterface.recognize method and DefaultRNNLM. v2: Experimental API. It supports any model that implements ScorerInterface.

Default: “v1”


Filename of recognition data (json)


Filename of result label data (json)


Model file parameters to read


Model config file


Possible choices: 1, 2

Number of speakers in the speech

Default: 1


Number of encoders in the model.

Default: 1


Output N-best hypotheses

Default: 1


Beam size

Default: 1


Insertion penalty

Default: 0.0

Input length ratio to obtain max output length.

If maxlenratio=0.0 (default), it uses an end-detect function to automatically find maximum hypothesis lengths. If maxlenratio<0.0, its absolute value is interpreted as a constant max output length.

Default: 0.0


Input length ratio to obtain min output length

Default: 0.0
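
The three cases for the maximum length ratio, together with the minimum length ratio, can be sketched as follows. This is a minimal sketch, not the decoder's code: the function name is hypothetical, and using the input length as the upper bound when the ratio is 0.0 (with end detection stopping the search earlier) is an assumption.

```python
def output_length_limits(input_len, maxlenratio=0.0, minlenratio=0.0):
    """Return (minlen, maxlen) for an input of input_len frames."""
    if maxlenratio == 0.0:
        # Assumed upper bound; end detection is expected to stop earlier.
        maxlen = input_len
    elif maxlenratio < 0.0:
        # A negative ratio is read as a constant maximum length.
        maxlen = int(-maxlenratio)
    else:
        maxlen = max(1, int(maxlenratio * input_len))
    minlen = int(minlenratio * input_len)
    return minlen, maxlen


print(output_length_limits(200))              # (0, 200)
print(output_length_limits(200, 0.5, 0.25))   # (50, 100)
print(output_length_limits(200, -30.0))       # (0, 30)
```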


CTC weight in joint decoding

Default: 0.0


CTC weight assigned to each encoder during decoding (multi-encoder mode only).

Use a CTC window with a margin parameter to accelerate CTC/attention decoding, especially on GPU. A smaller margin makes decoding faster, but may increase search errors. If margin=0 (default), this function is disabled.

Default: 0


Possible choices: default, nsc, tsd, alsd, maes

Type of beam search implementation to use during inference.

Can be either: default beam search (“default”), N-Step Constrained beam search (“nsc”), Time-Synchronous Decoding (“tsd”), Alignment-Length Synchronous Decoding (“alsd”) or modified Adaptive Expansion Search (“maes”).

Default: “default”

Number of expansion steps allowed in NSC beam search or mAES (nstep > 0 for NSC and nstep > 1 for mAES).

Default: 1


Length prefix difference allowed in NSC beam search or mAES.

Default: 2


Number of symbol expansions allowed in TSD.

Default: 2


Length prefix difference allowed in ALSD.

Default: 400


Allowed logp difference for prune-by-value method in mAES.

Default: 2.3
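
One way to read the prune-by-value rule is that a candidate expansion survives only if its log-probability lies within the allowed difference of the best candidate. The sketch below illustrates that reading; the function name and the candidate representation as (token, logp) pairs are hypothetical.

```python
def prune_by_value(candidates, gamma=2.3):
    """Keep (token, logp) pairs whose logp is within gamma of the best one."""
    best = max(logp for _, logp in candidates)
    return [(token, logp) for token, logp in candidates if logp >= best - gamma]


candidates = [("a", -0.1), ("b", -1.5), ("c", -3.0)]
print(prune_by_value(candidates))  # [('a', -0.1), ('b', -1.5)]
```

A smaller value prunes more aggressively, trading search accuracy for speed.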

Number of additional candidates for expanded hypotheses selection in mAES.

Default: 2


Normalize final hypotheses’ score by length

Default: True


Penalization term for softmax function.

Default: 1.0
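
A softmax temperature is commonly applied by dividing the logits by the temperature before normalizing. The sketch below assumes that convention; the function name is hypothetical and the code is not taken from the decoder.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits scaled by 1/temperature (assumed convention)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


sharp = softmax_with_temperature([2.0, 1.0, 0.0], temperature=1.0)
flat = softmax_with_temperature([2.0, 1.0, 0.0], temperature=2.0)
print(sharp[0] > flat[0])  # True
```

With the default of 1.0 the distribution is unchanged; larger values flatten it and smaller values sharpen it.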


RNNLM model file to read


RNNLM model config file to read


Word RNNLM model file to read


Word RNNLM model config file to read


Word list to read


RNNLM weight

Default: 0.1


ngram model file to read


ngram weight

Default: 0.1


Possible choices: full, part

If the ngram is set as a part scorer, it scores only the top-K hypotheses, similar to the CTC scorer. If the ngram is set as a full scorer, it scores all hypotheses. Decoding with the part scorer is much faster than with the full scorer.

Default: “part”


Possible choices: window, segment

Use streaming recognizer for inference. --batchsize must be set to 0 to enable this mode.


Window size

Default: 10


Minimum blank duration threshold

Default: 10


Onset margin

Default: 1


Offset margin

Default: 1


Number of decoding iterations. For Mask CTC, set 0 to predict 1 mask/iter.

Default: 10


Threshold probability for CTC output

Default: 0.999

Config for dynamic quantization, provided as a list of modules separated by commas, e.g. --quantize-config=[Linear,LSTM,GRU]. Each specified module should be an attribute of ‘torch.nn’, e.g. torch.nn.Linear, torch.nn.LSTM, torch.nn.GRU, …


Possible choices: float16, qint8

Dtype for dynamic quantization.

Default: “qint8”


Apply dynamic quantization to ASR model.

Default: False


Apply dynamic quantization to LM.

Default: False

Train an automatic speech recognition (ASR) model on one CPU, one or multiple GPUs

usage: [-h] [--config CONFIG] [--config2 CONFIG2]
                    [--config3 CONFIG3] [--ngpu NGPU] [--use-ddp]
                    [--train-dtype {float16,float32,float64,O0,O1,O2,O3}]
                    [--backend {chainer,pytorch}] --outdir OUTDIR
                    [--debugmode DEBUGMODE] --dict DICT [--seed SEED]
                    [--debugdir DEBUGDIR] [--resume [RESUME]]
                    [--minibatches MINIBATCHES] [--verbose VERBOSE]
                    [--tensorboard-dir [TENSORBOARD_DIR]]
                    [--report-interval-iters REPORT_INTERVAL_ITERS]
                    [--save-interval-iters SAVE_INTERVAL_ITERS]
                    [--train-json TRAIN_JSON] [--valid-json VALID_JSON]
                    [--model-module MODEL_MODULE] [--num-encs NUM_ENCS]
                    [--ctc_type {builtin,gtnctc,cudnnctc}]
                    [--mtlalpha MTLALPHA] [--lsm-weight LSM_WEIGHT]
                    [--report-cer] [--report-wer] [--nbest NBEST]
                    [--beam-size BEAM_SIZE] [--penalty PENALTY]
                    [--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
                    [--ctc-weight CTC_WEIGHT] [--rnnlm RNNLM]
                    [--rnnlm-conf RNNLM_CONF] [--lm-weight LM_WEIGHT]
                    [--sym-space SYM_SPACE] [--sym-blank SYM_BLANK]
                    [--sortagrad [SORTAGRAD]]
                    [--batch-count {auto,seq,bin,frame}]
                    [--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
                    [--batch-frames-in BATCH_FRAMES_IN]
                    [--batch-frames-out BATCH_FRAMES_OUT]
                    [--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
                    [--maxlen-out ML] [--n-iter-processes N_ITER_PROCESSES]
                    [--preprocess-conf [PREPROCESS_CONF]]
                    [--opt {adadelta,adam,noam}] [--accum-grad ACCUM_GRAD]
                    [--eps EPS] [--eps-decay EPS_DECAY]
                    [--weight-decay WEIGHT_DECAY]
                    [--criterion {loss,loss_eps_decay_only,acc}]
                    [--threshold THRESHOLD] [--epochs EPOCHS]
                    [--early-stop-criterion [EARLY_STOP_CRITERION]]
                    [--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
                    [--num-save-attention NUM_SAVE_ATTENTION]
                    [--num-save-ctc NUM_SAVE_CTC] [--grad-noise GRAD_NOISE]
                    [--num-spkrs {1,2}]
                    [--context-residual [CONTEXT_RESIDUAL]]
                    [--enc-init ENC_INIT] [--enc-init-mods ENC_INIT_MODS]
                    [--dec-init DEC_INIT] [--dec-init-mods DEC_INIT_MODS]
                    [--freeze-mods FREEZE_MODS] [--use-frontend USE_FRONTEND]
                    [--use-wpe USE_WPE]
                    [--wtype {lstm,blstm,lstmp,blstmp,vgglstmp,vggblstmp,vgglstm,vggblstm,gru,bgru,grup,bgrup,vgggrup,vggbgrup,vgggru,vggbgru}]
                    [--wlayers WLAYERS] [--wunits WUNITS] [--wprojs WPROJS]
                    [--wdropout-rate WDROPOUT_RATE] [--wpe-taps WPE_TAPS]
                    [--wpe-delay WPE_DELAY]
                    [--use-dnn-mask-for-wpe USE_DNN_MASK_FOR_WPE]
                    [--use-beamformer USE_BEAMFORMER]
                    [--btype {lstm,blstm,lstmp,blstmp,vgglstmp,vggblstmp,vgglstm,vggblstm,gru,bgru,grup,bgrup,vgggrup,vggbgrup,vgggru,vggbgru}]
                    [--blayers BLAYERS] [--bunits BUNITS] [--bprojs BPROJS]
                    [--badim BADIM] [--bnmask BNMASK]
                    [--ref-channel REF_CHANNEL]
                    [--bdropout-rate BDROPOUT_RATE] [--stats-file STATS_FILE]
                    [--apply-uttmvn APPLY_UTTMVN]
                    [--uttmvn-norm-means UTTMVN_NORM_MEANS]
                    [--uttmvn-norm-vars UTTMVN_NORM_VARS]
                    [--fbank-fs FBANK_FS] [--n-mels N_MELS]
                    [--fbank-fmin FBANK_FMIN] [--fbank-fmax FBANK_FMAX]

Named Arguments


config file path


second config file path that overwrites the settings in --config.


third config file path that overwrites the settings in --config and --config2.


Number of GPUs. If not given, use all visible devices


Enable process-based data parallelism. The number of GPUs given by --ngpu is used. If --ngpu is not given, the tool tries to detect how many GPUs can be used and aborts if detection fails. Currently, only single-node multi-GPU jobs are supported.

Default: False


Possible choices: float16, float32, float64, O0, O1, O2, O3

Data type for training (only pytorch backend). O0,O1,.. flags require apex. See

Default: “float32”


Possible choices: chainer, pytorch

Backend library

Default: “chainer”


Output directory



Default: 1




Random seed

Default: 1


Output directory for debugging

--resume, -r

Resume the training from snapshot

Default: “”

--minibatches, -N

Process only N minibatches (for debug)

Default: -1

--verbose, -V

Verbose option

Default: 0


Tensorboard log dir path


Report interval iterations

Default: 100


Save snapshot interval iterations

Default: 0


Filename of train label data (json)


Filename of validation label data (json)


Module that defines the model (default: espnet.nets.xxx_backend.e2e_asr:E2E)


Number of encoders in the model.

Default: 1


Possible choices: builtin, gtnctc, cudnnctc

Type of CTC implementation to calculate loss.

Default: “builtin”


Multitask learning coefficient, alpha: alpha*ctc_loss + (1-alpha)*att_loss

Default: 0.5
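
The multitask formula above interpolates between the CTC loss and the attention loss. A minimal sketch with scalar losses follows; the function name is hypothetical, and the range check is an added assumption.

```python
def multitask_loss(ctc_loss, att_loss, mtlalpha=0.5):
    """Combine losses as alpha*ctc_loss + (1-alpha)*att_loss."""
    if not 0.0 <= mtlalpha <= 1.0:
        raise ValueError("mtlalpha is expected to lie in [0, 1]")
    return mtlalpha * ctc_loss + (1.0 - mtlalpha) * att_loss


print(multitask_loss(2.0, 4.0))                # 3.0
print(multitask_loss(2.0, 4.0, mtlalpha=0.0))  # 4.0 (attention only)
print(multitask_loss(2.0, 4.0, mtlalpha=1.0))  # 2.0 (CTC only)
```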


Label smoothing weight

Default: 0.0


Compute CER on development set

Default: False


Compute WER on development set

Default: False


Output N-best hypotheses

Default: 1


Beam size

Default: 4


Insertion penalty

Default: 0.0

Input length ratio to obtain max output length.

If maxlenratio=0.0 (default), it uses an end-detect function to automatically find maximum hypothesis lengths

Default: 0.0


Input length ratio to obtain min output length

Default: 0.0


CTC weight in joint decoding

Default: 0.3


RNNLM model file to read


RNNLM model config file to read


RNNLM weight.

Default: 0.1


Space symbol

Default: “<space>”


Blank symbol

Default: “<blank>”


How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs

Default: 0
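
The epoch count for sortagrad can be read as follows: while it is active, utterances are presented in length order, and afterwards in shuffled order. The sketch below illustrates only that switch. The function name, the shortest-first direction, and the seeding are assumptions rather than the trainer's actual behavior.

```python
import random


def epoch_order(utt_lengths, epoch, sortagrad=0, seed=1):
    """Return utterance indices for a 0-based epoch.

    sortagrad=0 disables sorting, -1 sorts in every epoch, and N > 0
    sorts during the first N epochs only.
    """
    indices = list(range(len(utt_lengths)))
    if sortagrad == -1 or epoch < sortagrad:
        # Assumed direction: shortest utterances first.
        return sorted(indices, key=lambda i: utt_lengths[i])
    random.Random(seed + epoch).shuffle(indices)
    return indices


lengths = [120, 40, 300, 75]
print(epoch_order(lengths, epoch=0, sortagrad=1))  # [1, 3, 0, 2]
```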


Possible choices: auto, seq, bin, frame

How to count batch_size. The default (auto) infers the counting method from the other arguments.

Default: “auto”

--batch-size, --batch-seqs, -b

Maximum seqs in a minibatch (0 to disable)

Default: 0


Maximum bins in a minibatch (0 to disable)

Default: 0


Maximum input frames in a minibatch (0 to disable)

Default: 0


Maximum output frames in a minibatch (0 to disable)

Default: 0


Maximum input+output frames in a minibatch (0 to disable)

Default: 0

--maxlen-in, --batch-seq-maxlen-in

When --batch-count=seq, batch size is reduced if the input sequence length > ML.

Default: 800

--maxlen-out, --batch-seq-maxlen-out

When --batch-count=seq, batch size is reduced if the output sequence length > ML

Default: 150
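
The reduction for long sequences can be sketched as an integer factor that grows with how many times the longest input or output exceeds its limit. The formula below is one plausible form, stated as an assumption; the function name and the `min_batch_size` parameter are hypothetical.

```python
def reduced_batch_size(batch_size, ilen, olen,
                       maxlen_in=800, maxlen_out=150, min_batch_size=1):
    """Shrink batch_size when ilen or olen exceeds its maximum length."""
    factor = max(int(ilen / maxlen_in), int(olen / maxlen_out))
    return max(min_batch_size, int(batch_size / (1 + factor)))


print(reduced_batch_size(32, ilen=500, olen=50))    # 32 (no reduction)
print(reduced_batch_size(32, ilen=1700, olen=50))   # 10 (factor 2)
print(reduced_batch_size(32, ilen=100, olen=450))   # 8 (factor 3)
```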


Number of processes of iterator

Default: 0


The configuration file for the pre-processing


Possible choices: adadelta, adam, noam


Default: “adadelta”


Number of gradient accumulation steps

Default: 1


Epsilon constant for optimizer

Default: 1e-08


Decaying ratio of epsilon

Default: 0.01


Weight decay ratio

Default: 0.0


Possible choices: loss, loss_eps_decay_only, acc

Criterion to perform epsilon decay

Default: “acc”


Threshold to stop iteration

Default: 0.0001

--epochs, -e

Maximum number of epochs

Default: 30


Value to monitor to trigger an early stopping of the training

Default: “validation/main/acc”


Number of epochs to wait without improvement before stopping the training

Default: 3
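
The interplay between the monitored value and the patience can be illustrated with a generic early-stopping loop. This is a sketch of the general idea only; the function name is hypothetical, and the trigger used by the trainer may count epochs differently.

```python
def stop_epoch(values, patience=3, higher_is_better=True):
    """Return the 1-based epoch at which training would stop, or None."""
    best = None
    waited = 0
    for epoch, value in enumerate(values, start=1):
        if best is None:
            improved = True
        elif higher_is_better:
            improved = value > best
        else:
            improved = value < best
        if improved:
            best = value
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


# Hypothetical validation accuracies per epoch.
print(stop_epoch([0.5, 0.6, 0.59, 0.58, 0.6, 0.7]))  # 5
```

In this example the best value is reached at epoch 2, and three epochs without improvement end training at epoch 5, before the later improvement at epoch 6 is seen.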


Gradient norm threshold to clip

Default: 5
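
Clipping by gradient norm rescales the whole gradient when its norm exceeds the threshold. The sketch below shows the usual global L2-norm form on a flat list of values; the function name is hypothetical and the L2 norm is an assumption.

```python
import math


def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale grads so their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm


clipped, norm = clip_by_global_norm([6.0, 8.0])
print(norm)     # 10.0
print(clipped)  # [3.0, 4.0]
```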


Number of samples of attention to be saved

Default: 3


Number of samples of CTC probability to be saved

Default: 3


The flag to switch to use noise injection to gradients during training

Default: False


Possible choices: 1, 2

Number of speakers in the speech.

Default: 1


The flag to switch to use context vector residual in the decoder network

Default: False


Pre-trained ASR model to initialize encoder.


List of encoder modules to initialize, separated by a comma.

Default: enc.enc.


Pre-trained ASR, MT or LM model to initialize decoder.


List of decoder modules to initialize, separated by a comma.

Default: att.,dec.


List of modules to freeze, separated by a comma.


The flag to switch to use frontend system.

Default: False


Apply Weighted Prediction Error

Default: False


Possible choices: lstm, blstm, lstmp, blstmp, vgglstmp, vggblstmp, vgglstm, vggblstm, gru, bgru, grup, bgrup, vgggrup, vggbgrup, vgggru, vggbgru

Type of encoder network architecture of the mask estimator for WPE.

Default: “blstmp”


Default: 2


Default: 300


Default: 300


Default: 0.0


Default: 5


Default: 3


Use DNN to estimate the power spectrogram. This option is experimental.

Default: False


Default: True


Possible choices: lstm, blstm, lstmp, blstmp, vgglstmp, vggblstmp, vgglstm, vggblstm, gru, bgru, grup, bgrup, vgggrup, vggbgrup, vgggru, vggbgru

Type of encoder network architecture of the mask estimator for Beamformer.

Default: “blstmp”


Default: 2


Default: 300


Default: 300


Default: 320


Number of beamforming masks, default is 2 for [speech, noise].

Default: 2


The reference channel used for beamformer. By default, the channel is estimated by DNN.

Default: -1


Default: 0.0


The stats file for the feature normalization


Apply utterance level mean variance normalization.

Default: True


Default: True


Default: False


The sample frequency used for the mel-fbank creation.

Default: 16000


The number of mel-frequency bins.

Default: 80


Default: 0.0


Train a new language model on one CPU or one GPU

usage: [-h] [--config CONFIG] [--config2 CONFIG2]
                   [--config3 CONFIG3] [--ngpu NGPU]
                   [--train-dtype {float16,float32,float64,O0,O1,O2,O3}]
                   [--backend {chainer,pytorch}] --outdir OUTDIR
                   [--debugmode DEBUGMODE] --dict DICT [--seed SEED]
                   [--resume [RESUME]] [--verbose VERBOSE]
                   [--tensorboard-dir [TENSORBOARD_DIR]]
                   [--report-interval-iters REPORT_INTERVAL_ITERS]
                   --train-label TRAIN_LABEL --valid-label VALID_LABEL
                   [--test-label TEST_LABEL] [--dump-hdf5-path DUMP_HDF5_PATH]
                   [--opt OPT] [--sortagrad [SORTAGRAD]]
                   [--batchsize BATCHSIZE] [--accum-grad ACCUM_GRAD]
                   [--epoch EPOCH]
                   [--early-stop-criterion [EARLY_STOP_CRITERION]]
                   [--patience [PATIENCE]] [--schedulers SCHEDULERS]
                   [--gradclip GRADCLIP] [--maxlen MAXLEN]
                   [--model-module MODEL_MODULE]

Named Arguments


config file path


second config file path that overwrites the settings in --config.


third config file path that overwrites the settings in --config and --config2.


Number of GPUs. If not given, use all visible devices


Possible choices: float16, float32, float64, O0, O1, O2, O3

Data type for training (only pytorch backend). O0,O1,.. flags require apex. See

Default: “float32”


Possible choices: chainer, pytorch

Backend library

Default: “chainer”


Output directory



Default: 1




Random seed

Default: 1

--resume, -r

Resume the training from snapshot

Default: “”

--verbose, -V

Verbose option

Default: 0


Tensorboard log dir path


Report interval iterations

Default: 100


Filename of train label data


Filename of validation label data


Filename of test label data


Path to dump a preprocessed dataset as hdf5



Default: “sgd”


How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs

Default: 0

--batchsize, -b

Number of examples in each mini-batch

Default: 300


Number of gradient accumulation steps

Default: 1

--epoch, -e

Number of sweeps over the dataset to train

Default: 20


Value to monitor to trigger an early stopping of the training

Default: “validation/main/loss”


Number of epochs to wait without improvement before stopping the training

Default: 3


Optimizer schedulers. Parameters are configured in the form <optimizer-param>-<scheduler-name>-<scheduler-param>, e.g., “--schedulers lr=noam --lr-noam-warmup 1000”.
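
To give a feel for what a Noam scheduler with a warmup of 1000 steps does to the learning rate, the sketch below uses a commonly cited form of the Noam schedule, normalized so that the scale peaks at 1.0 at the end of warmup. The function name and the normalization are assumptions; the scheduler shipped with the toolkit may scale differently.

```python
def noam_scale(step, warmup=1000):
    """Learning-rate multiplier: linear warmup, then 1/sqrt(step) decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the first step
    return warmup ** 0.5 * min(step ** -0.5, step * warmup ** -1.5)


for step in (10, 500, 1000, 4000):
    print(step, round(noam_scale(step), 3))
```

The multiplier rises linearly during warmup, reaches its maximum at step 1000, and then decays in proportion to the inverse square root of the step.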

--gradclip, -c

Gradient norm threshold to clip

Default: 5


Batch size is reduced if the input sequence length > ML

Default: 40


Module that defines the model (default: espnet.nets.xxx_backend.lm.default:DefaultRNNLM)

Default: “default”

Train a neural machine translation (NMT) model on one CPU, one or multiple GPUs

usage: [-h] [--config CONFIG] [--config2 CONFIG2]
                   [--config3 CONFIG3] [--ngpu NGPU]
                   [--train-dtype {float16,float32,float64,O0,O1,O2,O3}]
                   [--backend {chainer,pytorch}] --outdir OUTDIR
                   [--debugmode DEBUGMODE] --dict DICT [--seed SEED]
                   [--debugdir DEBUGDIR] [--resume [RESUME]]
                   [--minibatches MINIBATCHES] [--verbose VERBOSE]
                   [--tensorboard-dir [TENSORBOARD_DIR]]
                   [--report-interval-iters REPORT_INTERVAL_ITERS]
                   [--save-interval-iters SAVE_INTERVAL_ITERS]
                   [--train-json TRAIN_JSON] [--valid-json VALID_JSON]
                   [--model-module MODEL_MODULE] [--lsm-weight LSM_WEIGHT]
                   [--report-bleu] [--nbest NBEST] [--beam-size BEAM_SIZE]
                   [--penalty PENALTY] [--maxlenratio MAXLENRATIO]
                   [--minlenratio MINLENRATIO] [--rnnlm RNNLM]
                   [--rnnlm-conf RNNLM_CONF] [--lm-weight LM_WEIGHT]
                   [--sym-space SYM_SPACE] [--sym-blank SYM_BLANK]
                   [--sortagrad [SORTAGRAD]]
                   [--batch-count {auto,seq,bin,frame}]
                   [--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
                   [--batch-frames-in BATCH_FRAMES_IN]
                   [--batch-frames-out BATCH_FRAMES_OUT]
                   [--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
                   [--maxlen-out ML] [--n-iter-processes N_ITER_PROCESSES]
                   [--opt {adadelta,adam,noam}] [--accum-grad ACCUM_GRAD]
                   [--eps EPS] [--eps-decay EPS_DECAY] [--lr LR]
                   [--lr-decay LR_DECAY] [--weight-decay WEIGHT_DECAY]
                   [--criterion {loss,acc}] [--threshold THRESHOLD]
                   [--epochs EPOCHS]
                   [--early-stop-criterion [EARLY_STOP_CRITERION]]
                   [--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
                   [--num-save-attention NUM_SAVE_ATTENTION]
                   [--context-residual [CONTEXT_RESIDUAL]]
                   [--tie-src-tgt-embedding [TIE_SRC_TGT_EMBEDDING]]
                   [--tie-classifier [TIE_CLASSIFIER]] [--enc-init [ENC_INIT]]
                   [--enc-init-mods ENC_INIT_MODS] [--dec-init [DEC_INIT]]
                   [--dec-init-mods DEC_INIT_MODS]
                   [--multilingual MULTILINGUAL] [--replace-sos REPLACE_SOS]

Named Arguments


config file path


second config file path that overwrites the settings in --config.


third config file path that overwrites the settings in --config and --config2.


Number of GPUs. If not given, use all visible devices


Possible choices: float16, float32, float64, O0, O1, O2, O3

Data type for training (only pytorch backend). O0,O1,.. flags require apex. See

Default: “float32”


Possible choices: chainer, pytorch

Backend library

Default: “chainer”


Output directory



Default: 1


Dictionary for source/target languages


Random seed

Default: 1


Output directory for debugging

--resume, -r

Resume the training from snapshot

Default: “”

--minibatches, -N

Process only N minibatches (for debug)

Default: -1

--verbose, -V

Verbose option

Default: 0


Tensorboard log dir path


Report interval iterations

Default: 100


Save snapshot interval iterations

Default: 0


Filename of train label data (json)


Filename of validation label data (json)


Module that defines the model (default: espnet.nets.xxx_backend.e2e_mt:E2E)


Label smoothing weight

Default: 0.0


Compute BLEU on development set

Default: True


Output N-best hypotheses

Default: 1


Beam size

Default: 4


Insertion penalty

Default: 0.0

Input length ratio to obtain max output length.

If maxlenratio=0.0 (default), it uses an end-detect function to automatically find maximum hypothesis lengths

Default: 0.0


Input length ratio to obtain min output length

Default: 0.0


RNNLM model file to read


RNNLM model config file to read


RNNLM weight.

Default: 0.0


Space symbol

Default: “<space>”


Blank symbol

Default: “<blank>”


How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs

Default: 0


Possible choices: auto, seq, bin, frame

How to count batch_size. The default (auto) infers the counting method from the other arguments.

Default: “auto”

--batch-size, --batch-seqs, -b

Maximum seqs in a minibatch (0 to disable)

Default: 0


Maximum bins in a minibatch (0 to disable)

Default: 0


Maximum input frames in a minibatch (0 to disable)

Default: 0


Maximum output frames in a minibatch (0 to disable)

Default: 0


Maximum input+output frames in a minibatch (0 to disable)

Default: 0

--maxlen-in, --batch-seq-maxlen-in

When --batch-count=seq, batch size is reduced if the input sequence length > ML.

Default: 100

--maxlen-out, --batch-seq-maxlen-out

When --batch-count=seq, batch size is reduced if the output sequence length > ML

Default: 100


Number of processes of iterator

Default: 0


Possible choices: adadelta, adam, noam


Default: “adadelta”


Number of gradient accumulation steps

Default: 1


Epsilon constant for optimizer

Default: 1e-08


Decaying ratio of epsilon

Default: 0.01


Learning rate for optimizer

Default: 0.001


Decaying ratio of learning rate

Default: 1.0


Weight decay ratio

Default: 0.0


Possible choices: loss, acc

Criterion to perform epsilon decay

Default: “acc”


Threshold to stop iteration

Default: 0.0001

--epochs, -e

Maximum number of epochs

Default: 30


Value to monitor to trigger an early stopping of the training

Default: “validation/main/acc”


Number of epochs to wait without improvement before stopping the training

Default: 3


Gradient norm threshold to clip

Default: 5


Number of samples of attention to be saved

Default: 3


The flag to switch to use context vector residual in the decoder network

Default: False


Tie parameters of source embedding and target embedding.

Default: False


Tie parameters of target embedding and output projection layer.

Default: False


Pre-trained ASR model to initialize encoder.


List of encoder modules to initialize, separated by a comma.

Default: enc.enc.


Pre-trained ASR, MT or LM model to initialize decoder.


List of decoder modules to initialize, separated by a comma.

Default: att., dec.


Prepend target language ID to the source sentence. Both source and target language IDs must be prepended in the pre-processing stage.

Default: False


Replace <sos> in the decoder with a target language ID (the first token in the target sequence)

Default: False
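
The effect of replacing <sos> can be pictured on a single target sequence whose first token is the language ID. The sketch below is an illustration under that assumption; the function name and the token strings are hypothetical.

```python
def decoder_start(target_tokens, sos="<sos>", replace_sos=False):
    """Return (start_token, remaining_targets) for one target sequence."""
    if replace_sos:
        # The first target token is the language ID and takes the place of <sos>.
        return target_tokens[0], target_tokens[1:]
    return sos, list(target_tokens)


tokens = ["<de>", "guten", "morgen"]
print(decoder_start(tokens))                    # ('<sos>', ['<de>', 'guten', 'morgen'])
print(decoder_start(tokens, replace_sos=True))  # ('<de>', ['guten', 'morgen'])
```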

Translate text from speech using a speech translation model on one CPU or GPU

usage: [-h] [--config CONFIG] [--config2 CONFIG2]
                   [--config3 CONFIG3] [--ngpu NGPU]
                   [--dtype {float16,float32,float64}]
                   [--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
                   [--seed SEED] [--verbose VERBOSE] [--batchsize BATCHSIZE]
                   [--preprocess-conf PREPROCESS_CONF] [--api {v1,v2}]
                   [--trans-json TRANS_JSON] --result-label RESULT_LABEL
                   --model MODEL [--model-conf MODEL_CONF] [--nbest NBEST]
                   [--beam-size BEAM_SIZE] [--penalty PENALTY]
                   [--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
                   [--tgt-lang TGT_LANG]

Named Arguments


Config file path


Second config file path that overwrites the settings in --config


Third config file path that overwrites the settings in --config and --config2
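The override order of the three config files can be pictured as a layered dictionary merge. The sketch below assumes each file has already been parsed into a dictionary; `merge_configs` and the example keys are illustrative only and do not reproduce how the tool itself parses its files.

```python
def merge_configs(*configs):
    """Merge parsed config dictionaries; later ones overwrite earlier ones."""
    merged = {}
    for cfg in configs:
        merged.update(cfg)
    return merged


# Stand-ins for the parsed contents of --config, --config2 and --config3.
base = {"beam-size": 1, "nbest": 1, "penalty": 0.1}
second = {"beam-size": 4}
third = {"nbest": 5}
settings = merge_configs(base, second, third)
```

Here `settings` keeps `penalty` from the first file, takes `beam-size` from the second and `nbest` from the third.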


Number of GPUs

Default: 0


Possible choices: float16, float32, float64

Float precision (only available in --api v2)

Default: “float32”


Possible choices: chainer, pytorch

Backend library

Default: “chainer”



Default: 1


Random seed

Default: 1

--verbose, -V

Verbose option

Default: 1


Batch size for beam search (0 means no batch processing)

Default: 1


The configuration file for the pre-processing


Possible choices: v1, v2

Beam search API. v1: default API; it only supports the ASRInterface.recognize method and DefaultRNNLM. v2: experimental API; it supports any model that implements ScorerInterface.

Default: “v1”


Filename of translation data (json)


Filename of result label data (json)


Model file parameters to read


Model config file


Output N-best hypotheses

Default: 1


Beam size

Default: 1


Insertion penalty

Default: 0.1

Input length ratio to obtain max output length.

If maxlenratio=0.0, an end-detect function is used to automatically find the maximum hypothesis length

Default: 3.0


Input length ratio to obtain min output length

Default: 0.0
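How the two length ratios bound the output length can be sketched as below. This is an assumption-laden illustration: the function name `length_bounds` is hypothetical, and falling back to the input length as an upper bound when the ratio is 0.0 (with end detection deciding when to stop) is a guess at the behaviour rather than a statement about the implementation.

```python
def length_bounds(input_length, maxlenratio=0.0, minlenratio=0.0):
    """Return (min_len, max_len) of the hypothesis for a given input length."""
    if maxlenratio == 0.0:
        # Assumed fallback: cap at the input length and rely on end detection.
        max_len = input_length
    else:
        max_len = max(1, int(maxlenratio * input_length))
    min_len = int(minlenratio * input_length)
    return min_len, max_len
```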


Target language ID (e.g., <en>, <de>, <fr>)

Default: False

Train a speech translation (ST) model on one CPU, one or multiple GPUs

usage: [-h] [--config CONFIG] [--config2 CONFIG2]
                   [--config3 CONFIG3] [--ngpu NGPU]
                   [--train-dtype {float16,float32,float64,O0,O1,O2,O3}]
                   [--backend {chainer,pytorch}] --outdir OUTDIR
                   [--debugmode DEBUGMODE] --dict DICT [--seed SEED]
                   [--debugdir DEBUGDIR] [--resume [RESUME]]
                   [--minibatches MINIBATCHES] [--verbose VERBOSE]
                   [--tensorboard-dir [TENSORBOARD_DIR]]
                   [--report-interval-iters REPORT_INTERVAL_ITERS]
                   [--save-interval-iters SAVE_INTERVAL_ITERS]
                   [--train-json TRAIN_JSON] [--valid-json VALID_JSON]
                   [--model-module MODEL_MODULE]
                   [--ctc_type {builtin,gtnctc,cudnnctc}]
                   [--mtlalpha MTLALPHA] [--asr-weight ASR_WEIGHT]
                   [--mt-weight MT_WEIGHT] [--lsm-weight LSM_WEIGHT]
                   [--report-cer] [--report-wer] [--report-bleu]
                   [--nbest NBEST] [--beam-size BEAM_SIZE] [--penalty PENALTY]
                   [--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
                   [--rnnlm RNNLM] [--rnnlm-conf RNNLM_CONF]
                   [--lm-weight LM_WEIGHT] [--sym-space SYM_SPACE]
                   [--sym-blank SYM_BLANK] [--sortagrad [SORTAGRAD]]
                   [--batch-count {auto,seq,bin,frame}]
                   [--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
                   [--batch-frames-in BATCH_FRAMES_IN]
                   [--batch-frames-out BATCH_FRAMES_OUT]
                   [--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
                   [--maxlen-out ML] [--n-iter-processes N_ITER_PROCESSES]
                   [--preprocess-conf [PREPROCESS_CONF]]
                   [--opt {adadelta,adam,noam}] [--accum-grad ACCUM_GRAD]
                   [--eps EPS] [--eps-decay EPS_DECAY] [--lr LR]
                   [--lr-decay LR_DECAY] [--weight-decay WEIGHT_DECAY]
                   [--criterion {loss,acc}] [--threshold THRESHOLD]
                   [--epochs EPOCHS]
                   [--early-stop-criterion [EARLY_STOP_CRITERION]]
                   [--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
                   [--num-save-attention NUM_SAVE_ATTENTION]
                   [--num-save-ctc NUM_SAVE_CTC] [--grad-noise GRAD_NOISE]
                   [--context-residual [CONTEXT_RESIDUAL]]
                   [--enc-init [ENC_INIT]] [--enc-init-mods ENC_INIT_MODS]
                   [--dec-init [DEC_INIT]] [--dec-init-mods DEC_INIT_MODS]
                   [--multilingual MULTILINGUAL] [--replace-sos REPLACE_SOS]
                   [--stats-file STATS_FILE] [--apply-uttmvn APPLY_UTTMVN]
                   [--uttmvn-norm-means UTTMVN_NORM_MEANS]
                   [--uttmvn-norm-vars UTTMVN_NORM_VARS] [--fbank-fs FBANK_FS]
                   [--n-mels N_MELS] [--fbank-fmin FBANK_FMIN]
                   [--fbank-fmax FBANK_FMAX]

Named Arguments


config file path


second config file path that overwrites the settings in --config.


third config file path that overwrites the settings in --config and --config2.


Number of GPUs. If not given, use all visible devices


Possible choices: float16, float32, float64, O0, O1, O2, O3

Data type for training (only pytorch backend). O0,O1,.. flags require apex. See

Default: “float32”


Possible choices: chainer, pytorch

Backend library

Default: “chainer”


Output directory



Default: 1




Random seed

Default: 1


Output directory for debugging

--resume, -r

Resume the training from snapshot

Default: “”

--minibatches, -N

Process only N minibatches (for debug)

Default: -1

--verbose, -V

Verbose option

Default: 0


Tensorboard log dir path


Report interval iterations

Default: 100


Save snapshot interval iterations

Default: 0


Filename of train label data (json)


Filename of validation label data (json)


Module in which the model is defined (default: espnet.nets.xxx_backend.e2e_st:E2E)


Possible choices: builtin, gtnctc, cudnnctc

Type of CTC implementation to calculate loss.

Default: “builtin”


Multitask learning coefficient, alpha: alpha*ctc_loss + (1-alpha)*att_loss

Default: 0.0


Multitask learning coefficient for ASR task, weight: asr_weight*(alpha*ctc_loss + (1-alpha)*att_loss) + (1-asr_weight-mt_weight)*st_loss

Default: 0.0


Multitask learning coefficient for MT task, weight: mt_weight*mt_loss + (1-mt_weight-asr_weight)*st_loss

Default: 0.0
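The three multitask coefficients above can be combined into a single objective. The sketch below simply adds up the terms given in the option descriptions; the function name and argument names are illustrative, and reading the two weight formulas as parts of one sum is an interpretation, not a confirmed description of the training code.

```python
def st_multitask_loss(st_loss, ctc_loss, att_loss, mt_loss,
                      mtlalpha=0.0, asr_weight=0.0, mt_weight=0.0):
    """Weighted sum of ST, ASR (CTC + attention) and MT losses."""
    # ASR branch: alpha * ctc_loss + (1 - alpha) * att_loss
    asr_loss = mtlalpha * ctc_loss + (1.0 - mtlalpha) * att_loss
    # Whatever weight is not given to ASR or MT goes to the ST loss.
    st_weight = 1.0 - asr_weight - mt_weight
    return st_weight * st_loss + asr_weight * asr_loss + mt_weight * mt_loss
```

With all three coefficients at their default of 0.0, the result reduces to the plain ST loss.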


Label smoothing weight

Default: 0.0


Compute CER on development set

Default: False


Compute WER on development set

Default: False


Compute BLEU on development set

Default: True


Output N-best hypotheses

Default: 1


Beam size

Default: 4


Insertion penalty

Default: 0.0

Input length ratio to obtain max output length.

If maxlenratio=0.0 (default), an end-detect function is used to automatically find the maximum hypothesis length

Default: 0.0


Input length ratio to obtain min output length

Default: 0.0


RNNLM model file to read


RNNLM model config file to read


RNNLM weight.

Default: 0.0


Space symbol

Default: “<space>”


Blank symbol

Default: “<blank>”


How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs

Default: 0
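The meaning of the sortagrad value (0 = deactivated, -1 = all epochs, N = first N epochs) can be sketched as follows. The helper `order_batches`, the 0-based `epoch` counter and the shortest-first ordering are assumptions for illustration; the latter follows the usual SortaGrad idea of presenting short utterances first.

```python
import random


def order_batches(batches, epoch, sortagrad=0, seed=1):
    """Return minibatches sorted by length while sortagrad is active,
    shuffled otherwise. `batches` is a list of lists of utterance lengths."""
    use_sorted = sortagrad == -1 or epoch < sortagrad
    if use_sorted:
        # Shortest minibatches first (assumed ordering).
        return sorted(batches, key=max)
    shuffled = list(batches)
    random.Random(seed + epoch).shuffle(shuffled)
    return shuffled
```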


Possible choices: auto, seq, bin, frame

How to count the batch size. The default (auto) infers the counting mode from the other batch arguments.

Default: “auto”

--batch-size, --batch-seqs, -b

Maximum seqs in a minibatch (0 to disable)

Default: 0


Maximum bins in a minibatch (0 to disable)

Default: 0


Maximum input frames in a minibatch (0 to disable)

Default: 0


Maximum output frames in a minibatch (0 to disable)

Default: 0


Maximum input+output frames in a minibatch (0 to disable)

Default: 0

--maxlen-in, --batch-seq-maxlen-in

When --batch-count=seq, batch size is reduced if the input sequence length > ML.

Default: 800

--maxlen-out, --batch-seq-maxlen-out

When --batch-count=seq, batch size is reduced if the output sequence length > ML

Default: 150
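One plausible way to reduce the batch size for long sequences is sketched below. The exact rule is an assumption made for illustration, as are the function name and the `min_batch_size` parameter; only the thresholds `maxlen_in` and `maxlen_out` correspond to the options above.

```python
def reduced_batch_size(batch_size, ilen, olen,
                       maxlen_in=800, maxlen_out=150, min_batch_size=1):
    """Shrink the batch size when the longest input or output in the
    minibatch exceeds its threshold (assumed rule)."""
    # How many times the longest sequence exceeds either threshold.
    factor = max(int(ilen / maxlen_in), int(olen / maxlen_out))
    return max(min_batch_size, int(batch_size / (1 + factor)))
```

Under this rule a minibatch whose longest input has 1700 frames would be cut to a third of the nominal size, because 1700 exceeds 800 twice.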


Number of processes of iterator

Default: 0


The configuration file for the pre-processing


Possible choices: adadelta, adam, noam


Default: “adadelta”


Number of gradient accumulation steps

Default: 1
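Gradient accumulation delays the parameter update until several minibatches have contributed their gradients. The sketch below uses plain SGD on a single scalar parameter as a stand-in for the real optimizer; averaging the accumulated gradients (rather than summing them) is an assumption, and all names are illustrative.

```python
def sgd_with_accumulation(grads, lr=0.1, accum_grad=2, w=0.0):
    """Apply one SGD update per `accum_grad` minibatch gradients.

    Returns the final parameter value and the number of updates made."""
    accumulated = 0.0
    updates = 0
    for step, grad in enumerate(grads, start=1):
        # Average the gradients of the accumulated minibatches.
        accumulated += grad / accum_grad
        if step % accum_grad == 0:
            w -= lr * accumulated
            accumulated = 0.0
            updates += 1
    return w, updates
```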


Epsilon constant for optimizer

Default: 1e-08


Decaying ratio of epsilon

Default: 0.01


Learning rate for optimizer

Default: 0.001


Decaying ratio of learning rate

Default: 1.0


Weight decay ratio

Default: 0.0


Possible choices: loss, acc

Criterion to perform epsilon decay

Default: “acc”
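The epsilon decay options can be read as follows: whenever the chosen criterion gets worse on the validation set, epsilon is multiplied by the decay ratio. The sketch below assumes that "worse" means worse than the best value seen so far; that trigger rule and the function name `final_eps` are assumptions for illustration only.

```python
def final_eps(values, eps=1e-8, eps_decay=0.01, criterion="acc"):
    """Return epsilon after processing a sequence of validation values."""
    best = None
    for value in values:
        if best is None:
            best = value
            continue
        # Accuracy should go up, loss should go down.
        worse = value < best if criterion == "acc" else value > best
        if worse:
            eps *= eps_decay
        else:
            best = value
    return eps
```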


Threshold to stop iteration

Default: 0.0001

--epochs, -e

Maximum number of epochs

Default: 30


Value to monitor to trigger an early stopping of the training

Default: “validation/main/acc”


Number of epochs to wait without improvement before stopping the training

Default: 3


Gradient norm threshold to clip

Default: 5
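Clipping by gradient norm rescales all gradients when their combined L2 norm exceeds the threshold. The following is a minimal pure-Python sketch of that idea on a flat list of gradient values; it is not the training code itself, and `clip_by_global_norm` is a hypothetical name.

```python
import math


def clip_by_global_norm(grads, max_norm=5.0):
    """Return (clipped_grads, original_norm) for a flat list of gradients."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Already within the threshold: leave the gradients untouched.
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm
```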


Number of samples of attention to be saved

Default: 3


Number of samples of CTC probability to be saved

Default: 3


The flag to switch to use noise injection to gradients during training

Default: False


The flag to switch to use context vector residual in the decoder network

Default: False


Pre-trained ASR model to initialize encoder.


List of encoder modules to initialize, separated by a comma.

Default: enc.enc.


Pre-trained ASR, MT or LM model to initialize decoder.


List of decoder modules to initialize, separated by a comma.

Default: att., dec.
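A comma-separated module list such as "att., dec." can be thought of as a set of name prefixes that select which pre-trained parameters are copied. The sketch below works on a plain dictionary standing in for a model's parameters; prefix matching, the helper name and the parameter names are assumptions made for the example.

```python
def select_modules(state_dict, mods):
    """Keep only the entries whose name starts with one of the prefixes
    listed in the comma-separated string `mods`."""
    prefixes = [m.strip() for m in mods.split(",") if m.strip()]
    return {
        name: value
        for name, value in state_dict.items()
        if any(name.startswith(prefix) for prefix in prefixes)
    }


# Hypothetical parameter names of a pre-trained model.
pretrained = {
    "enc.enc.0.weight": 1,
    "att.0.mlp.weight": 2,
    "dec.embed.weight": 3,
    "ctc.lo.weight": 4,
}
decoder_part = select_modules(pretrained, "att., dec.")
```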


Prepend target language ID to the source sentence. Both source/target language IDs must be prepended in the pre-processing stage.

Default: False


Replace <sos> in the decoder with a target language ID (the first token in the target sequence)

Default: False


The stats file for the feature normalization


Apply utterance level mean variance normalization.

Default: True


Default: True


Default: False


The sample frequency used for the mel-fbank creation.

Default: 16000


The number of mel-frequency bins.

Default: 80


Default: 0.0


Translate text from speech using a speech translation model on one CPU or GPU

usage: [-h] [--config CONFIG] [--config2 CONFIG2]
                   [--config3 CONFIG3] [--ngpu NGPU]
                   [--dtype {float16,float32,float64}]
                   [--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
                   [--seed SEED] [--verbose VERBOSE] [--batchsize BATCHSIZE]
                   [--preprocess-conf PREPROCESS_CONF] [--api {v1,v2}]
                   [--trans-json TRANS_JSON] --result-label RESULT_LABEL
                   --model MODEL [--nbest NBEST] [--beam-size BEAM_SIZE]
                   [--penalty PENALTY] [--maxlenratio MAXLENRATIO]
                   [--minlenratio MINLENRATIO] [--tgt-lang TGT_LANG]

Named Arguments


Config file path


Second config file path that overwrites the settings in --config


Third config file path that overwrites the settings in --config and --config2


Number of GPUs

Default: 0


Possible choices: float16, float32, float64

Float precision (only available in --api v2)

Default: “float32”


Possible choices: chainer, pytorch

Backend library

Default: “chainer”



Default: 1


Random seed

Default: 1

--verbose, -V

Verbose option

Default: 1


Batch size for beam search (0 means no batch processing)

Default: 1


The configuration file for the pre-processing


Possible choices: v1, v2

Beam search API. v1: default API; it only supports the ASRInterface.recognize method and DefaultRNNLM. v2: experimental API; it supports any model that implements ScorerInterface.

Default: “v1”


Filename of translation data (json)


Filename of result label data (json)


Model file parameters to read


Output N-best hypotheses

Default: 1


Beam size

Default: 1


Insertion penalty

Default: 0.0

Input length ratio to obtain max output length.

If maxlenratio=0.0 (default), an end-detect function is used to automatically find the maximum hypothesis length

Default: 0.0


Input length ratio to obtain min output length

Default: 0.0


Target language ID (e.g., <en>, <de>, <fr>)

Default: False

Synthesize speech from text using a TTS model on one CPU

usage: [-h] [--config CONFIG] [--config2 CONFIG2]
                     [--config3 CONFIG3] [--ngpu NGPU]
                     [--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
                     [--seed SEED] --out OUT [--verbose VERBOSE]
                     [--preprocess-conf PREPROCESS_CONF] --json JSON --model
                     MODEL [--model-conf MODEL_CONF]
                     [--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
                     [--threshold THRESHOLD]
                     [--use-att-constraint USE_ATT_CONSTRAINT]
                     [--backward-window BACKWARD_WINDOW]
                     [--forward-window FORWARD_WINDOW]
                     [--fastspeech-alpha FASTSPEECH_ALPHA]
                     [--save-durations SAVE_DURATIONS]
                     [--save-focus-rates SAVE_FOCUS_RATES]

Named Arguments


config file path


second config file path that overwrites the settings in --config.


third config file path that overwrites the settings in --config and --config2.


Number of GPUs

Default: 0


Possible choices: chainer, pytorch

Backend library

Default: “pytorch”



Default: 1


Random seed

Default: 1


Output filename

--verbose, -V

Verbose option

Default: 0


The configuration file for the pre-processing


Filename of train label data (json)


Model file parameters to read


Model config file


Maximum length ratio in decoding

Default: 5


Minimum length ratio in decoding

Default: 0


Threshold value in decoding

Default: 0.5


Whether to use the attention constraint

Default: False


Backward window size in the attention constraint

Default: 1


Forward window size in the attention constraint

Default: 3
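The attention constraint restricts each decoding step to a window around the previously attended input position. The sketch below masks attention energies outside that window; the exact window boundaries (inclusive on the backward side, exclusive on the forward side) and the function name are assumptions for illustration.

```python
import math


def apply_attention_constraint(energies, last_attended_idx,
                               backward_window=1, forward_window=3):
    """Set energies outside the allowed window to -inf so that a
    following softmax assigns them zero weight."""
    low = last_attended_idx - backward_window
    high = last_attended_idx + forward_window
    return [
        energy if low <= index < high else -math.inf
        for index, energy in enumerate(energies)
    ]
```

With the default window sizes and a last attended position of 4, only positions 3 to 6 remain available.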


Alpha to change the speed for FastSpeech

Default: 1.0
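A speed factor of this kind is typically applied to the predicted per-token durations before they are expanded into frames. The sketch below assumes that durations are multiplied by alpha and rounded, so that values above 1.0 lengthen (slow down) the output and values below 1.0 shorten it; this direction and the helper name are assumptions, not confirmed behaviour.

```python
def scale_durations(durations, alpha=1.0):
    """Scale integer frame durations by alpha and round to whole frames."""
    return [round(duration * alpha) for duration in durations]


durations = [2, 4, 6]
slower = scale_durations(durations, alpha=2.0)   # twice as many frames
faster = scale_durations(durations, alpha=0.5)   # half as many frames
```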


Whether to save durations converted from attentions

Default: False


Whether to save focus rates of attentions

Default: False

Train a new text-to-speech (TTS) model on one CPU, one or multiple GPUs

usage: [-h] [--config CONFIG] [--config2 CONFIG2]
                    [--config3 CONFIG3] [--ngpu NGPU]
                    [--backend {chainer,pytorch}] --outdir OUTDIR
                    [--debugmode DEBUGMODE] [--seed SEED] [--resume [RESUME]]
                    [--minibatches MINIBATCHES] [--verbose VERBOSE]
                    [--tensorboard-dir [TENSORBOARD_DIR]]
                    [--eval-interval-epochs EVAL_INTERVAL_EPOCHS]
                    [--save-interval-epochs SAVE_INTERVAL_EPOCHS]
                    [--report-interval-iters REPORT_INTERVAL_ITERS]
                    --train-json TRAIN_JSON --valid-json VALID_JSON
                    [--model-module MODEL_MODULE] [--sortagrad [SORTAGRAD]]
                    [--batch-sort-key [{shuffle,output,input}]]
                    [--batch-count {auto,seq,bin,frame}]
                    [--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
                    [--batch-frames-in BATCH_FRAMES_IN]
                    [--batch-frames-out BATCH_FRAMES_OUT]
                    [--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
                    [--maxlen-out ML]
                    [--num-iter-processes NUM_ITER_PROCESSES]
                    [--preprocess-conf PREPROCESS_CONF]
                    [--use-speaker-embedding USE_SPEAKER_EMBEDDING]
                    [--use-second-target USE_SECOND_TARGET]
                    [--opt {adam,noam}] [--accum-grad ACCUM_GRAD] [--lr LR]
                    [--eps EPS] [--weight-decay WEIGHT_DECAY]
                    [--epochs EPOCHS]
                    [--early-stop-criterion [EARLY_STOP_CRITERION]]
                    [--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
                    [--num-save-attention NUM_SAVE_ATTENTION]
                    [--keep-all-data-on-mem KEEP_ALL_DATA_ON_MEM]
                    [--enc-init ENC_INIT] [--enc-init-mods ENC_INIT_MODS]
                    [--dec-init DEC_INIT] [--dec-init-mods DEC_INIT_MODS]
                    [--freeze-mods FREEZE_MODS]

Named Arguments


config file path


second config file path that overwrites the settings in --config.


third config file path that overwrites the settings in --config and --config2.


Number of GPUs. If not given, use all visible devices


Possible choices: chainer, pytorch

Backend library

Default: “pytorch”


Output directory



Default: 1


Random seed

Default: 1

--resume, -r

Resume the training from snapshot

Default: “”

--minibatches, -N

Process only N minibatches (for debug)

Default: -1

--verbose, -V

Verbose option

Default: 0


Tensorboard log directory path


Evaluation interval epochs

Default: 1


Save interval epochs

Default: 1


Report interval iterations

Default: 100


Filename of training json


Filename of validation json


Module in which the model is defined

Default: “espnet.nets.pytorch_backend.e2e_tts_tacotron2:Tacotron2”


How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs

Default: 0


Possible choices: shuffle, output, input

Batch sorting key. “shuffle” only works with --batch-count “seq”.

Default: “shuffle”


Possible choices: auto, seq, bin, frame

How to count the batch size. The default (auto) infers the counting mode from the other batch arguments.

Default: “auto”

--batch-size, --batch-seqs, -b

Maximum seqs in a minibatch (0 to disable)

Default: 0


Maximum bins in a minibatch (0 to disable)

Default: 0


Maximum input frames in a minibatch (0 to disable)

Default: 0


Maximum output frames in a minibatch (0 to disable)

Default: 0


Maximum input+output frames in a minibatch (0 to disable)

Default: 0

--maxlen-in, --batch-seq-maxlen-in

When --batch-count=seq, batch size is reduced if the input sequence length > ML.

Default: 100

--maxlen-out, --batch-seq-maxlen-out

When --batch-count=seq, batch size is reduced if the output sequence length > ML

Default: 200


Number of processes of iterator

Default: 0


The configuration file for the pre-processing


Whether to use speaker embedding

Default: False


Whether to use second target

Default: False


Possible choices: adam, noam


Default: “adam”


Number of gradient accumulation steps

Default: 1


Learning rate for optimizer

Default: 0.001


Epsilon for optimizer

Default: 1e-06


Weight decay coefficient for optimizer

Default: 1e-06

--epochs, -e

Maximum number of epochs

Default: 30


Value to monitor to trigger an early stopping of the training

Default: “validation/main/loss”


Number of epochs to wait without improvement before stopping the training

Default: 3


Gradient norm threshold to clip

Default: 1


Number of samples of attention to be saved

Default: 5


Whether to keep all data in memory

Default: False


Pre-trained TTS model path to initialize encoder.


List of encoder modules to initialize, separated by a comma.

Default: enc.


Pre-trained TTS model path to initialize decoder.


List of decoder modules to initialize, separated by a comma.

Default: dec.


List of modules to freeze (not to train), separated by a comma.

Convert speech using a VC model on one CPU

usage: [-h] [--config CONFIG] [--config2 CONFIG2]
                    [--config3 CONFIG3] [--ngpu NGPU]
                    [--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
                    [--seed SEED] --out OUT [--verbose VERBOSE]
                    [--preprocess-conf PREPROCESS_CONF] --json JSON --model
                    MODEL [--model-conf MODEL_CONF]
                    [--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
                    [--threshold THRESHOLD]
                    [--use-att-constraint USE_ATT_CONSTRAINT]
                    [--backward-window BACKWARD_WINDOW]
                    [--forward-window FORWARD_WINDOW]
                    [--save-durations SAVE_DURATIONS]
                    [--save-focus-rates SAVE_FOCUS_RATES]

Named Arguments


config file path


second config file path that overwrites the settings in --config.


third config file path that overwrites the settings in --config and --config2.


Number of GPUs

Default: 0


Possible choices: chainer, pytorch

Backend library

Default: “pytorch”



Default: 1


Random seed

Default: 1


Output filename

--verbose, -V

Verbose option

Default: 0


The configuration file for the pre-processing


Filename of train label data (json)


Model file parameters to read


Model config file


Maximum length ratio in decoding

Default: 5


Minimum length ratio in decoding

Default: 0


Threshold value in decoding

Default: 0.5


Whether to use the attention constraint

Default: False


Backward window size in the attention constraint

Default: 1


Forward window size in the attention constraint

Default: 3


Whether to save durations converted from attentions

Default: False


Whether to save focus rates of attentions

Default: False

Train a new voice conversion (VC) model on one CPU, one or multiple GPUs

usage: [-h] [--config CONFIG] [--config2 CONFIG2]
                   [--config3 CONFIG3] [--ngpu NGPU]
                   [--backend {chainer,pytorch}] --outdir OUTDIR
                   [--debugmode DEBUGMODE] [--seed SEED] [--resume [RESUME]]
                   [--minibatches MINIBATCHES] [--verbose VERBOSE]
                   [--tensorboard-dir [TENSORBOARD_DIR]]
                   [--eval-interval-epochs EVAL_INTERVAL_EPOCHS]
                   [--save-interval-epochs SAVE_INTERVAL_EPOCHS]
                   [--report-interval-iters REPORT_INTERVAL_ITERS]
                   [--srcspk SRCSPK] [--trgspk TRGSPK] --train-json TRAIN_JSON
                   --valid-json VALID_JSON [--model-module MODEL_MODULE]
                   [--sortagrad [SORTAGRAD]]
                   [--batch-sort-key [{shuffle,output,input}]]
                   [--batch-count {auto,seq,bin,frame}]
                   [--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
                   [--batch-frames-in BATCH_FRAMES_IN]
                   [--batch-frames-out BATCH_FRAMES_OUT]
                   [--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
                   [--maxlen-out ML] [--num-iter-processes NUM_ITER_PROCESSES]
                   [--preprocess-conf PREPROCESS_CONF]
                   [--use-speaker-embedding USE_SPEAKER_EMBEDDING]
                   [--use-second-target USE_SECOND_TARGET]
                   [--opt {adam,noam,lamb}] [--accum-grad ACCUM_GRAD]
                   [--lr LR] [--eps EPS] [--weight-decay WEIGHT_DECAY]
                   [--epochs EPOCHS]
                   [--early-stop-criterion [EARLY_STOP_CRITERION]]
                   [--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
                   [--num-save-attention NUM_SAVE_ATTENTION]
                   [--keep-all-data-on-mem KEEP_ALL_DATA_ON_MEM]
                   [--enc-init ENC_INIT] [--enc-init-mods ENC_INIT_MODS]
                   [--dec-init DEC_INIT] [--dec-init-mods DEC_INIT_MODS]
                   [--freeze-mods FREEZE_MODS]

Named Arguments


config file path


second config file path that overwrites the settings in --config.


third config file path that overwrites the settings in --config and --config2.


Number of GPUs. If not given, use all visible devices


Possible choices: chainer, pytorch

Backend library

Default: “pytorch”


Output directory



Default: 1


Random seed

Default: 1

--resume, -r

Resume the training from snapshot

Default: “”

--minibatches, -N

Process only N minibatches (for debug)

Default: -1

--verbose, -V

Verbose option

Default: 0


Tensorboard log directory path


Evaluation interval epochs

Default: 100


Save interval epochs

Default: 1


Report interval iterations

Default: 10


Source speaker


Target speaker


Filename of training json


Filename of validation json


Module in which the model is defined

Default: “espnet.nets.pytorch_backend.e2e_tts_tacotron2:Tacotron2”


How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs

Default: 0


Possible choices: shuffle, output, input

Batch sorting key. “shuffle” only works with --batch-count “seq”.

Default: “shuffle”


Possible choices: auto, seq, bin, frame

How to count the batch size. The default (auto) infers the counting mode from the other batch arguments.

Default: “auto”

--batch-size, --batch-seqs, -b

Maximum seqs in a minibatch (0 to disable)

Default: 0


Maximum bins in a minibatch (0 to disable)

Default: 0


Maximum input frames in a minibatch (0 to disable)

Default: 0


Maximum output frames in a minibatch (0 to disable)

Default: 0


Maximum input+output frames in a minibatch (0 to disable)

Default: 0

--maxlen-in, --batch-seq-maxlen-in

When --batch-count=seq, batch size is reduced if the input sequence length > ML.

Default: 100

--maxlen-out, --batch-seq-maxlen-out

When --batch-count=seq, batch size is reduced if the output sequence length > ML

Default: 200


Number of processes of iterator

Default: 0


The configuration file for the pre-processing


Whether to use speaker embedding

Default: False


Whether to use second target

Default: False


Possible choices: adam, noam, lamb


Default: “adam”


Number of gradient accumulation steps

Default: 1


Learning rate for optimizer

Default: 0.001


Epsilon for optimizer

Default: 1e-06


Weight decay coefficient for optimizer

Default: 1e-06

--epochs, -e

Maximum number of epochs

Default: 30


Value to monitor to trigger an early stopping of the training

Default: “validation/main/loss”


Number of epochs to wait without improvement before stopping the training

Default: 3


Gradient norm threshold to clip

Default: 1


Number of samples of attention to be saved

Default: 5


Whether to keep all data in memory

Default: False


Pre-trained model path to initialize encoder.


List of encoder modules to initialize, separated by a comma.

Default: enc.


Pre-trained model path to initialize decoder.


List of decoder modules to initialize, separated by a comma.

Default: dec.


List of modules to freeze (not to train), separated by a comma.