Core tools

ESPnet provides several command-line tools for training and evaluating neural networks (NN) under espnet/bin:

  • asr_align.py: Align text to audio using CTC segmentation with a pre-trained speech recognition model.

  • asr_enhance.py: Enhance noisy speech for speech recognition

  • asr_recog.py: Transcribe text from speech using a speech recognition model on one CPU or GPU

  • asr_train.py: Train an automatic speech recognition (ASR) model on one CPU, one or multiple GPUs

  • lm_train.py: Train a new language model on one CPU or one GPU

  • mt_train.py: Train a neural machine translation (NMT) model on one CPU, one or multiple GPUs

  • mt_trans.py: Translate text using a neural machine translation (NMT) model on one CPU or GPU

  • st_train.py: Train a speech translation (ST) model on one CPU, one or multiple GPUs

  • st_trans.py: Translate text from speech using a speech translation model on one CPU or GPU

  • tts_decode.py: Synthesize speech from text using a TTS model on one CPU

  • tts_train.py: Train a new text-to-speech (TTS) model on one CPU, one or multiple GPUs

  • vc_decode.py: Convert speech using a VC model on one CPU

  • vc_train.py: Train a new voice conversion (VC) model on one CPU, one or multiple GPUs

asr_align.py

Align text to audio using CTC segmentation with a pre-trained speech recognition model.

usage: asr_align.py [-h] [--config CONFIG] [--ngpu NGPU]
                    [--dtype {float16,float32,float64}] [--backend {pytorch}]
                    [--debugmode DEBUGMODE] [--verbose VERBOSE]
                    [--preprocess-conf PREPROCESS_CONF]
                    [--data-json DATA_JSON] [--utt-text UTT_TEXT] --model
                    MODEL [--model-conf MODEL_CONF] [--num-encs NUM_ENCS]
                    [--subsampling-factor SUBSAMPLING_FACTOR]
                    [--frame-duration FRAME_DURATION]
                    [--min-window-size MIN_WINDOW_SIZE]
                    [--max-window-size MAX_WINDOW_SIZE]
                    [--use-dict-blank USE_DICT_BLANK] [--set-blank SET_BLANK]
                    [--gratis-blank GRATIS_BLANK]
                    [--replace-spaces-with-blanks REPLACE_SPACES_WITH_BLANKS]
                    [--scoring-length SCORING_LENGTH] --output OUTPUT

Named Arguments

--config

Decoding config file path.

--ngpu

Number of GPUs (max. 1 is supported)

Default: 0

--dtype

Possible choices: float16, float32, float64

Float precision (only available in --api v2)

Default: “float32”

--backend

Possible choices: pytorch

Backend library

Default: “pytorch”

--debugmode

Debugmode

Default: 1

--verbose, -V

Verbose option

Default: 1

--preprocess-conf

The configuration file for the pre-processing

--data-json

JSON file of recognition data for audio and text

--utt-text

Text separated into utterances

--model

Model file parameters to read

--model-conf

Model config file

--num-encs

Number of encoders in the model.

Default: 1

--subsampling-factor

Subsampling factor. If the encoder sub-samples its input, the number of frames at the CTC layer is reduced by this factor. For example, a BLSTMP with subsampling 1_2_2_1_1 has a subsampling factor of 4.
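
The arithmetic behind the example above can be sketched as follows. This is only an illustration of how a spec such as 1_2_2_1_1 yields a factor of 4; the helper names are hypothetical and are not part of ESPnet.

```python
from math import prod


def subsampling_factor(spec: str) -> int:
    """Multiply the per-layer rates in a spec such as '1_2_2_1_1'."""
    return prod(int(rate) for rate in spec.split("_"))


def ctc_frames(num_input_frames: int, factor: int) -> int:
    """Approximate frame count at the CTC layer (integer division, an assumption)."""
    return num_input_frames // factor


factor = subsampling_factor("1_2_2_1_1")
print(factor)                    # 1 * 2 * 2 * 1 * 1
print(ctc_frames(1000, factor))  # frames remaining after subsampling
```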

--frame-duration

Non-overlapping duration of a single frame in milliseconds.

--min-window-size

Minimum window size considered for utterance.

--max-window-size

Maximum window size considered for utterance.

--use-dict-blank

DEPRECATED.

--set-blank

Index of model dictionary for blank token (default: 0).

--gratis-blank

Set the transition cost of the blank token to zero. Audio sections labeled with blank tokens can then be skipped without penalty. Useful if there are unrelated audio segments between utterances.

--replace-spaces-with-blanks

Fill blanks in between words to better model pauses between words. Segments can be misaligned if this option is combined with --gratis-blank. May increase length of ground truth.

--scoring-length

Changes partitioning length L for calculation of the confidence score.

--output

Output segments file
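
As a rough sketch, an asr_align.py invocation could be assembled in Python as shown here. Only flags from the usage text above are used; all file paths are hypothetical placeholders, and the helper function is illustrative rather than part of ESPnet.

```python
import shlex


def build_asr_align_command(model, output, data_json, utt_text,
                            subsampling_factor=None, ngpu=0):
    """Assemble an asr_align.py argument list from flags shown in the usage text."""
    cmd = [
        "asr_align.py",
        "--ngpu", str(ngpu),
        "--model", model,        # required
        "--data-json", data_json,
        "--utt-text", utt_text,
        "--output", output,      # required
    ]
    if subsampling_factor is not None:
        cmd += ["--subsampling-factor", str(subsampling_factor)]
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_asr_align_command(
    model="exp/model.acc.best",
    output="segments.txt",
    data_json="data.json",
    utt_text="utt_text.txt",
    subsampling_factor=4,
)
print(shlex.join(cmd))
```

The list form can be handed to subprocess.run(cmd, check=True) once the placeholder paths point at real files.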

asr_enhance.py

Enhance noisy speech for speech recognition

usage: asr_enhance.py [-h] [--config CONFIG] [--config2 CONFIG2]
                      [--config3 CONFIG3] [--ngpu NGPU]
                      [--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
                      [--seed SEED] [--verbose VERBOSE]
                      [--batchsize BATCHSIZE]
                      [--preprocess-conf PREPROCESS_CONF]
                      [--recog-json RECOG_JSON] --model MODEL
                      [--model-conf MODEL_CONF]
                      [--enh-wspecifier ENH_WSPECIFIER]
                      [--enh-filetype {mat,hdf5,sound.hdf5,sound}] [--fs FS]
                      [--keep-length KEEP_LENGTH] [--image-dir IMAGE_DIR]
                      [--num-images NUM_IMAGES] [--apply-istft APPLY_ISTFT]
                      [--istft-win-length ISTFT_WIN_LENGTH]
                      [--istft-n-shift ISTFT_N_SHIFT]
                      [--istft-window ISTFT_WINDOW]

Named Arguments

--config

config file path

--config2

second config file path that overwrites the settings in --config.

--config3

third config file path that overwrites the settings in --config and --config2.
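
The precedence between the three config files can be pictured as successive dictionary updates. This is a minimal sketch of the override order only, assuming each file reduces to a flat key/value mapping; it is not how the tool actually parses its configuration.

```python
def merge_configs(*configs):
    """Later configs overwrite keys set by earlier ones."""
    merged = {}
    for cfg in configs:
        merged.update(cfg)
    return merged


# Hypothetical contents of --config, --config2 and --config3.
config = {"backend": "chainer", "ngpu": 0, "fs": 16000}
config2 = {"backend": "pytorch"}
config3 = {"ngpu": 1}

final = merge_configs(config, config2, config3)
print(final)
```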

--ngpu

Number of GPUs

Default: 0

--backend

Possible choices: chainer, pytorch

Backend library

Default: “chainer”

--debugmode

Debugmode

Default: 1

--seed

Random seed

Default: 1

--verbose, -V

Verbose option

Default: 1

--batchsize

Batch size for beam search (0: means no batch processing)

Default: 1

--preprocess-conf

The configuration file for the pre-processing

--recog-json

Filename of recognition data (json)

--model

Model file parameters to read

--model-conf

Model config file

--enh-wspecifier

Specify how the enhanced speech is written, e.g. ark,scp:outdir,wav.scp

--enh-filetype

Possible choices: mat, hdf5, sound.hdf5, sound

Specify the file format for enhanced speech. “mat” is the matrix format in Kaldi

Default: “sound”

--fs

The sample frequency

Default: 16000

--keep-length

Adjust the output length to match with the input for enhanced speech

Default: True

--image-dir

The directory saving the images.

--num-images

The number of image files to be saved. If negative, all samples are saved.

Default: 20

--apply-istft

Apply istft to the output from the network

Default: True

--istft-win-length

The window length for istft. This option is ignored if stft is found in the preprocess-conf

Default: 512

--istft-n-shift

The shift length (hop size) for istft. This option is ignored if stft is found in the preprocess-conf

Default: 256

--istft-window

The window type for istft. This option is ignored if stft is found in the preprocess-conf

Default: “hann”

asr_recog.py

Transcribe text from speech using a speech recognition model on one CPU or GPU

usage: asr_recog.py [-h] [--config CONFIG] [--config2 CONFIG2]
                    [--config3 CONFIG3] [--ngpu NGPU]
                    [--dtype {float16,float32,float64}]
                    [--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
                    [--seed SEED] [--verbose VERBOSE] [--batchsize BATCHSIZE]
                    [--preprocess-conf PREPROCESS_CONF] [--api {v1,v2}]
                    [--recog-json RECOG_JSON] --result-label RESULT_LABEL
                    --model MODEL [--model-conf MODEL_CONF]
                    [--num-spkrs {1,2}] [--num-encs NUM_ENCS] [--nbest NBEST]
                    [--beam-size BEAM_SIZE] [--penalty PENALTY]
                    [--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
                    [--ctc-weight CTC_WEIGHT]
                    [--weights-ctc-dec WEIGHTS_CTC_DEC]
                    [--ctc-window-margin CTC_WINDOW_MARGIN]
                    [--search-type {default,nsc,tsd,alsd}] [--nstep NSTEP]
                    [--prefix-alpha PREFIX_ALPHA] [--max-sym-exp MAX_SYM_EXP]
                    [--u-max U_MAX] [--score-norm [SCORE_NORM]]
                    [--rnnlm RNNLM] [--rnnlm-conf RNNLM_CONF]
                    [--word-rnnlm WORD_RNNLM]
                    [--word-rnnlm-conf WORD_RNNLM_CONF]
                    [--word-dict WORD_DICT] [--lm-weight LM_WEIGHT]
                    [--ngram-model NGRAM_MODEL] [--ngram-weight NGRAM_WEIGHT]
                    [--ngram-scorer {full,part}]
                    [--streaming-mode {window,segment}]
                    [--streaming-window STREAMING_WINDOW]
                    [--streaming-min-blank-dur STREAMING_MIN_BLANK_DUR]
                    [--streaming-onset-margin STREAMING_ONSET_MARGIN]
                    [--streaming-offset-margin STREAMING_OFFSET_MARGIN]
                    [--maskctc-n-iterations MASKCTC_N_ITERATIONS]
                    [--maskctc-probability-threshold MASKCTC_PROBABILITY_THRESHOLD]

Named Arguments

--config

Config file path

--config2

Second config file path that overwrites the settings in --config

--config3

Third config file path that overwrites the settings in --config and --config2

--ngpu

Number of GPUs

Default: 0

--dtype

Possible choices: float16, float32, float64

Float precision (only available in --api v2)

Default: “float32”

--backend

Possible choices: chainer, pytorch

Backend library

Default: “chainer”

--debugmode

Debugmode

Default: 1

--seed

Random seed

Default: 1

--verbose, -V

Verbose option

Default: 1

--batchsize

Batch size for beam search (0: means no batch processing)

Default: 1

--preprocess-conf

The configuration file for the pre-processing

--api

Possible choices: v1, v2

Beam search API. v1: Default API. It only supports the ASRInterface.recognize method and DefaultRNNLM. v2: Experimental API. It supports any model that implements ScorerInterface.

Default: “v1”

--recog-json

Filename of recognition data (json)

--result-label

Filename of result label data (json)

--model

Model file parameters to read

--model-conf

Model config file

--num-spkrs

Possible choices: 1, 2

Number of speakers in the speech

Default: 1

--num-encs

Number of encoders in the model.

Default: 1

--nbest

Output N-best hypotheses

Default: 1

--beam-size

Beam size

Default: 1

--penalty

Insertion penalty

Default: 0.0

--maxlenratio

Input length ratio to obtain max output length. If maxlenratio=0.0 (default), an end-detect function is used to find the maximum hypothesis length automatically.

Default: 0.0

--minlenratio

Input length ratio to obtain min output length

Default: 0.0
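
The two length ratios can be read as multipliers on the number of encoder input frames. The sketch below assumes that reading, and assumes that with maxlenratio=0.0 the input length serves as a hard upper bound while end detection decides the actual stopping point. The function name and the rounding are illustrative assumptions.

```python
def length_bounds(num_input_frames, maxlenratio=0.0, minlenratio=0.0):
    """Return (min_len, max_len) for the output hypothesis length."""
    if maxlenratio == 0.0:
        # Assumption: no explicit ratio, so the input length caps the search
        # and end detection is expected to stop decoding earlier.
        max_len = num_input_frames
    else:
        max_len = max(1, int(maxlenratio * num_input_frames))
    min_len = int(minlenratio * num_input_frames)
    return min_len, max_len


print(length_bounds(200))              # defaults
print(length_bounds(200, 0.5, 0.25))   # explicit ratios
```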

--ctc-weight

CTC weight in joint decoding

Default: 0.0

--weights-ctc-dec

CTC weight assigned to each encoder during decoding (multi-encoder mode only).

--ctc-window-margin

Use a CTC window with a margin parameter to accelerate CTC/attention decoding, especially on GPU. A smaller margin makes decoding faster but may increase search errors. If margin=0 (default), this function is disabled.

Default: 0

--search-type

Possible choices: default, nsc, tsd, alsd

Type of beam search implementation to use during inference.

Can be either: default beam search, n-step constrained beam search (“nsc”), time-synchronous decoding (“tsd”) or alignment-length synchronous decoding (“alsd”). Additional associated parameters: “nstep” + “prefix-alpha” (for nsc), “max-sym-exp” (for tsd) and “u-max” (for alsd)

Default: “default”

--nstep

Number of expansion steps allowed in NSC beam search.

Default: 1

--prefix-alpha

Length prefix difference allowed in NSC beam search.

Default: 2

--max-sym-exp

Number of symbol expansions allowed in TSD decoding.

Default: 2

--u-max

Length prefix difference allowed in ALSD beam search.

Default: 400

--score-norm

Normalize transducer scores by length

Default: True

--rnnlm

RNNLM model file to read

--rnnlm-conf

RNNLM model config file to read

--word-rnnlm

Word RNNLM model file to read

--word-rnnlm-conf

Word RNNLM model config file to read

--word-dict

Word list to read

--lm-weight

RNNLM weight

Default: 0.1

--ngram-model

ngram model file to read

--ngram-weight

ngram weight

Default: 0.1

--ngram-scorer

Possible choices: full, part

If the n-gram is set as a part scorer, then, similarly to the CTC scorer, the n-gram scorer only scores the top-K hypotheses. If the n-gram is set as a full scorer, it scores all hypotheses. Decoding with the part scorer is much faster than with the full one.

Default: “part”

--streaming-mode

Possible choices: window, segment

Use streaming recognizer for inference.

--batchsize must be set to 0 to enable this mode

--streaming-window

Window size

Default: 10

--streaming-min-blank-dur

Minimum blank duration threshold

Default: 10

--streaming-onset-margin

Onset margin

Default: 1

--streaming-offset-margin

Offset margin

Default: 1

--maskctc-n-iterations

Number of decoding iterations. For Mask CTC, set 0 to predict 1 mask/iter.

Default: 10

--maskctc-probability-threshold

Threshold probability for CTC output

Default: 0.999

asr_train.py

Train an automatic speech recognition (ASR) model on one CPU, one or multiple GPUs

usage: asr_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
                    [--config3 CONFIG3] [--ngpu NGPU]
                    [--train-dtype {float16,float32,float64,O0,O1,O2,O3}]
                    [--backend {chainer,pytorch}] --outdir OUTDIR
                    [--debugmode DEBUGMODE] --dict DICT [--seed SEED]
                    [--debugdir DEBUGDIR] [--resume [RESUME]]
                    [--minibatches MINIBATCHES] [--verbose VERBOSE]
                    [--tensorboard-dir [TENSORBOARD_DIR]]
                    [--report-interval-iters REPORT_INTERVAL_ITERS]
                    [--save-interval-iters SAVE_INTERVAL_ITERS]
                    [--train-json TRAIN_JSON] [--valid-json VALID_JSON]
                    [--model-module MODEL_MODULE] [--num-encs NUM_ENCS]
                    [--ctc_type {builtin,warpctc}] [--mtlalpha MTLALPHA]
                    [--lsm-weight LSM_WEIGHT] [--report-cer] [--report-wer]
                    [--nbest NBEST] [--beam-size BEAM_SIZE]
                    [--penalty PENALTY] [--maxlenratio MAXLENRATIO]
                    [--minlenratio MINLENRATIO] [--ctc-weight CTC_WEIGHT]
                    [--rnnlm RNNLM] [--rnnlm-conf RNNLM_CONF]
                    [--lm-weight LM_WEIGHT] [--sym-space SYM_SPACE]
                    [--sym-blank SYM_BLANK] [--sortagrad [SORTAGRAD]]
                    [--batch-count {auto,seq,bin,frame}]
                    [--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
                    [--batch-frames-in BATCH_FRAMES_IN]
                    [--batch-frames-out BATCH_FRAMES_OUT]
                    [--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
                    [--maxlen-out ML] [--n-iter-processes N_ITER_PROCESSES]
                    [--preprocess-conf [PREPROCESS_CONF]]
                    [--opt {adadelta,adam,noam}] [--accum-grad ACCUM_GRAD]
                    [--eps EPS] [--eps-decay EPS_DECAY]
                    [--weight-decay WEIGHT_DECAY]
                    [--criterion {loss,loss_eps_decay_only,acc}]
                    [--threshold THRESHOLD] [--epochs EPOCHS]
                    [--early-stop-criterion [EARLY_STOP_CRITERION]]
                    [--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
                    [--num-save-attention NUM_SAVE_ATTENTION]
                    [--num-save-ctc NUM_SAVE_CTC] [--grad-noise GRAD_NOISE]
                    [--num-spkrs {1,2}]
                    [--context-residual [CONTEXT_RESIDUAL]]
                    [--enc-init ENC_INIT] [--enc-init-mods ENC_INIT_MODS]
                    [--dec-init DEC_INIT] [--dec-init-mods DEC_INIT_MODS]
                    [--freeze-mods FREEZE_MODS] [--use-frontend USE_FRONTEND]
                    [--use-wpe USE_WPE]
                    [--wtype {lstm,blstm,lstmp,blstmp,vgglstmp,vggblstmp,vgglstm,vggblstm,gru,bgru,grup,bgrup,vgggrup,vggbgrup,vgggru,vggbgru}]
                    [--wlayers WLAYERS] [--wunits WUNITS] [--wprojs WPROJS]
                    [--wdropout-rate WDROPOUT_RATE] [--wpe-taps WPE_TAPS]
                    [--wpe-delay WPE_DELAY]
                    [--use-dnn-mask-for-wpe USE_DNN_MASK_FOR_WPE]
                    [--use-beamformer USE_BEAMFORMER]
                    [--btype {lstm,blstm,lstmp,blstmp,vgglstmp,vggblstmp,vgglstm,vggblstm,gru,bgru,grup,bgrup,vgggrup,vggbgrup,vgggru,vggbgru}]
                    [--blayers BLAYERS] [--bunits BUNITS] [--bprojs BPROJS]
                    [--badim BADIM] [--bnmask BNMASK]
                    [--ref-channel REF_CHANNEL]
                    [--bdropout-rate BDROPOUT_RATE] [--stats-file STATS_FILE]
                    [--apply-uttmvn APPLY_UTTMVN]
                    [--uttmvn-norm-means UTTMVN_NORM_MEANS]
                    [--uttmvn-norm-vars UTTMVN_NORM_VARS]
                    [--fbank-fs FBANK_FS] [--n-mels N_MELS]
                    [--fbank-fmin FBANK_FMIN] [--fbank-fmax FBANK_FMAX]

Named Arguments

--config

config file path

--config2

second config file path that overwrites the settings in --config.

--config3

third config file path that overwrites the settings in --config and --config2.

--ngpu

Number of GPUs. If not given, use all visible devices

--train-dtype

Possible choices: float16, float32, float64, O0, O1, O2, O3

Data type for training (only pytorch backend). O0,O1,.. flags require apex. See https://nvidia.github.io/apex/amp.html#opt-levels

Default: “float32”

--backend

Possible choices: chainer, pytorch

Backend library

Default: “chainer”

--outdir

Output directory

--debugmode

Debugmode

Default: 1

--dict

Dictionary

--seed

Random seed

Default: 1

--debugdir

Output directory for debugging

--resume, -r

Resume the training from snapshot

Default: “”

--minibatches, -N

Process only N minibatches (for debug)

Default: -1

--verbose, -V

Verbose option

Default: 0

--tensorboard-dir

Tensorboard log dir path

--report-interval-iters

Report interval iterations

Default: 100

--save-interval-iters

Save snapshot interval iterations

Default: 0

--train-json

Filename of train label data (json)

--valid-json

Filename of validation label data (json)

--model-module

model defined module (default: espnet.nets.xxx_backend.e2e_asr:E2E)

--num-encs

Number of encoders in the model.

Default: 1

--ctc_type

Possible choices: builtin, warpctc

Type of CTC implementation to calculate loss.

Default: “warpctc”

--mtlalpha

Multitask learning coefficient, alpha: alpha*ctc_loss + (1-alpha)*att_loss

Default: 0.5
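
The interpolation stated above, alpha*ctc_loss + (1-alpha)*att_loss, can be written as a small function. The loss values are made-up numbers for illustration, and the range check is an added assumption.

```python
def multitask_loss(ctc_loss, att_loss, mtlalpha=0.5):
    """Interpolate CTC and attention losses with coefficient mtlalpha."""
    if not 0.0 <= mtlalpha <= 1.0:
        raise ValueError("mtlalpha is expected to lie in [0, 1]")
    return mtlalpha * ctc_loss + (1.0 - mtlalpha) * att_loss


print(multitask_loss(2.0, 4.0))                 # equal weighting (default)
print(multitask_loss(2.0, 4.0, mtlalpha=0.0))   # attention loss only
print(multitask_loss(2.0, 4.0, mtlalpha=1.0))   # CTC loss only
```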

--lsm-weight

Label smoothing weight

Default: 0.0

--report-cer

Compute CER on development set

Default: False

--report-wer

Compute WER on development set

Default: False

--nbest

Output N-best hypotheses

Default: 1

--beam-size

Beam size

Default: 4

--penalty

Insertion penalty

Default: 0.0

--maxlenratio

Input length ratio to obtain max output length. If maxlenratio=0.0 (default), an end-detect function is used to find the maximum hypothesis length automatically.

Default: 0.0

--minlenratio

Input length ratio to obtain min output length

Default: 0.0

--ctc-weight

CTC weight in joint decoding

Default: 0.3

--rnnlm

RNNLM model file to read

--rnnlm-conf

RNNLM model config file to read

--lm-weight

RNNLM weight.

Default: 0.1

--sym-space

Space symbol

Default: “<space>”

--sym-blank

Blank symbol

Default: “<blank>”

--sortagrad

How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs

Default: 0
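
The 0 / -1 convention for --sortagrad can be sketched as below. The code assumes 0-based epoch numbers and that the sorted order presents the shortest utterances first (the usual SortaGrad idea). Both are assumptions about behaviour rather than a copy of the implementation.

```python
import random


def use_sorted_order(epoch, sortagrad):
    """True if this (0-based) epoch should iterate in length-sorted order."""
    return sortagrad == -1 or epoch < sortagrad


def order_lengths(lengths, epoch, sortagrad, rng):
    """Return utterance lengths in the order they would be visited."""
    if use_sorted_order(epoch, sortagrad):
        return sorted(lengths)      # shortest first (assumption)
    shuffled = list(lengths)
    rng.shuffle(shuffled)           # otherwise, random order
    return shuffled


rng = random.Random(0)
lengths = [120, 40, 300, 75]
print(order_lengths(lengths, epoch=0, sortagrad=1, rng=rng))
print(order_lengths(lengths, epoch=1, sortagrad=1, rng=rng))
```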

--batch-count

Possible choices: auto, seq, bin, frame

How to count batch_size. The default (auto) will find how to count by args.

Default: “auto”

--batch-size, --batch-seqs, -b

Maximum seqs in a minibatch (0 to disable)

Default: 0

--batch-bins

Maximum bins in a minibatch (0 to disable)

Default: 0

--batch-frames-in

Maximum input frames in a minibatch (0 to disable)

Default: 0

--batch-frames-out

Maximum output frames in a minibatch (0 to disable)

Default: 0

--batch-frames-inout

Maximum input+output frames in a minibatch (0 to disable)

Default: 0

--maxlen-in, --batch-seq-maxlen-in

When --batch-count=seq, batch size is reduced if the input sequence length > ML.

Default: 800

--maxlen-out, --batch-seq-maxlen-out

When --batch-count=seq, batch size is reduced if the output sequence length > ML

Default: 150
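
One plausible way to reduce the batch size for long sequences is sketched here. The exact rule (integer division by one plus the larger of the two length ratios) and the min_batch_size parameter are assumptions made for illustration; the text above only states that the batch size shrinks once a length exceeds ML.

```python
def reduced_batch_size(batch_size, ilen, olen,
                       maxlen_in=800, maxlen_out=150, min_batch_size=1):
    """Shrink batch_size when the longest input/output exceeds the limits."""
    # factor is 0 while both lengths stay below their limits
    factor = max(ilen // maxlen_in, olen // maxlen_out)
    return max(min_batch_size, batch_size // (1 + factor))


print(reduced_batch_size(30, ilen=500, olen=100))    # within both limits
print(reduced_batch_size(30, ilen=1700, olen=100))   # long input
print(reduced_batch_size(30, ilen=100, olen=450))    # long output
```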

--n-iter-processes

Number of processes of iterator

Default: 0

--preprocess-conf

The configuration file for the pre-processing

--opt

Possible choices: adadelta, adam, noam

Optimizer

Default: “adadelta”

--accum-grad

Number of gradient accumulation steps

Default: 1
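
Gradient accumulation in general works as sketched below: gradients from several minibatches are summed before one optimizer update, which emulates a larger batch. The scalar "weight", the plain SGD update and the division by accum_grad are simplifying assumptions, not the trainer's actual code.

```python
def train_with_accumulation(grads, accum_grad, lr=0.1, weight=0.0):
    """Apply one SGD update per accum_grad minibatch gradients."""
    accumulated = 0.0
    updates = 0
    for step, grad in enumerate(grads, start=1):
        accumulated += grad / accum_grad   # average over the accumulation window
        if step % accum_grad == 0:
            weight -= lr * accumulated     # single parameter update
            accumulated = 0.0
            updates += 1
    return weight, updates


weight, updates = train_with_accumulation([1.0, 1.0, 1.0, 1.0], accum_grad=2)
print(weight, updates)
```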

--eps

Epsilon constant for optimizer

Default: 1e-08

--eps-decay

Decaying ratio of epsilon

Default: 0.01

--weight-decay

Weight decay ratio

Default: 0.0

--criterion

Possible choices: loss, loss_eps_decay_only, acc

Criterion to perform epsilon decay

Default: “acc”
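
The interplay of --eps, --eps-decay and --criterion can be illustrated with a short sketch: whenever the monitored validation value fails to improve on its best so far, epsilon is multiplied by the decay ratio. The trigger logic below is an assumption for illustration, and the validation values are invented.

```python
import math


def run_eps_decay(values, eps=1e-8, eps_decay=0.01, higher_is_better=True):
    """Return epsilon after processing one validation value per epoch."""
    best = None
    for value in values:
        if best is None:
            improved = True
        elif higher_is_better:      # e.g. criterion "acc"
            improved = value > best
        else:                       # e.g. criterion "loss"
            improved = value < best
        if improved:
            best = value
        else:
            eps *= eps_decay        # shrink epsilon on a plateau
    return eps


final_eps = run_eps_decay([0.50, 0.60, 0.55])
print(final_eps)
```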

--threshold

Threshold to stop iteration

Default: 0.0001

--epochs, -e

Maximum number of epochs

Default: 30

--early-stop-criterion

Value to monitor to trigger an early stopping of the training

Default: “validation/main/acc”

--patience

Number of epochs to wait without improvement before stopping the training

Default: 3
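
Patience-based early stopping can be sketched as follows, assuming the monitored value is one where higher is better (as with the default validation/main/acc) and that the counter resets on every improvement. The validation values are invented for the example.

```python
def stop_epoch(values, patience=3, higher_is_better=True):
    """Return the 1-based epoch at which training would stop, or None."""
    best = None
    epochs_without_improvement = 0
    for epoch, value in enumerate(values, start=1):
        if best is None:
            improved = True
        elif higher_is_better:
            improved = value > best
        else:
            improved = value < best
        if improved:
            best = value
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


print(stop_epoch([0.50, 0.60, 0.59, 0.58, 0.57, 0.70]))
print(stop_epoch([0.10, 0.20, 0.30]))
```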

--grad-clip

Gradient norm threshold to clip

Default: 5
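
Clipping by gradient norm rescales all gradients together when their global L2 norm exceeds the threshold. The sketch below uses plain Python floats and is a generic illustration of the technique, not the trainer's code.

```python
import math


def clip_by_global_norm(grads, max_norm=5.0):
    """Return (clipped_grads, original_norm) for a flat list of gradients."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm       # already within the threshold
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


print(clip_by_global_norm([3.0, 4.0]))   # norm equals the threshold, unchanged
print(clip_by_global_norm([6.0, 8.0]))   # norm above the threshold, rescaled
```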

--num-save-attention

Number of samples of attention to be saved

Default: 3

--num-save-ctc

Number of samples of CTC probability to be saved

Default: 3

--grad-noise

The flag to switch to use noise injection to gradients during training

Default: False

--num-spkrs

Possible choices: 1, 2

Number of speakers in the speech.

Default: 1

--context-residual

The flag to switch to use context vector residual in the decoder network

Default: False

--enc-init

Pre-trained ASR model to initialize encoder.

--enc-init-mods

List of encoder modules to initialize, separated by a comma.

Default: enc.enc.

--dec-init

Pre-trained ASR, MT or LM model to initialize decoder.

--dec-init-mods

List of decoder modules to initialize, separated by a comma.

Default: att., dec.
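
The comma-separated module lists can be thought of as name prefixes used to pick parameters out of a pre-trained model. The sketch below filters a dictionary by prefix; the parameter names are hypothetical, and prefix matching itself is an assumption about how the lists are interpreted.

```python
def select_by_prefix(state, modules):
    """Keep entries whose key starts with any comma-separated module prefix."""
    prefixes = [m.strip() for m in modules.split(",") if m.strip()]
    return {
        key: value
        for key, value in state.items()
        if any(key.startswith(prefix) for prefix in prefixes)
    }


# Hypothetical parameter names standing in for a pre-trained model's state.
pretrained = {
    "enc.enc.0.weight": 1,
    "att.0.mlp_enc.weight": 2,
    "dec.embed.weight": 3,
    "ctc.ctc_lo.weight": 4,
}

print(sorted(select_by_prefix(pretrained, "enc.enc.")))
print(sorted(select_by_prefix(pretrained, "att., dec.")))
```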

--freeze-mods

List of modules to freeze, separated by a comma.

--use-frontend

The flag to switch to use frontend system.

Default: False

--use-wpe

Apply Weighted Prediction Error

Default: False

--wtype

Possible choices: lstm, blstm, lstmp, blstmp, vgglstmp, vggblstmp, vgglstm, vggblstm, gru, bgru, grup, bgrup, vgggrup, vggbgrup, vgggru, vggbgru

Type of encoder network architecture of the mask estimator for WPE.

Default: “blstmp”

--wlayers

Default: 2

--wunits

Default: 300

--wprojs

Default: 300

--wdropout-rate

Default: 0.0

--wpe-taps

Default: 5

--wpe-delay

Default: 3

--use-dnn-mask-for-wpe

Use DNN to estimate the power spectrogram. This option is experimental.

Default: False

--use-beamformer

Default: True

--btype

Possible choices: lstm, blstm, lstmp, blstmp, vgglstmp, vggblstmp, vgglstm, vggblstm, gru, bgru, grup, bgrup, vgggrup, vggbgrup, vgggru, vggbgru

Type of encoder network architecture of the mask estimator for Beamformer.

Default: “blstmp”

--blayers

Default: 2

--bunits

Default: 300

--bprojs

Default: 300

--badim

Default: 320

--bnmask

Number of beamforming masks, default is 2 for [speech, noise].

Default: 2

--ref-channel

The reference channel used for beamformer. By default, the channel is estimated by DNN.

Default: -1

--bdropout-rate

Default: 0.0

--stats-file

The stats file for the feature normalization

--apply-uttmvn

Apply utterance level mean variance normalization.

Default: True

--uttmvn-norm-means

Default: True

--uttmvn-norm-vars

Default: False
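
Utterance-level mean and variance normalization can be sketched in plain Python as below, with the defaults mirroring --uttmvn-norm-means (True) and --uttmvn-norm-vars (False). The list-of-frames layout and the eps floor are assumptions made for the example.

```python
import math


def utterance_mvn(frames, norm_means=True, norm_vars=False, eps=1e-20):
    """Normalize each feature dimension over the frames of one utterance."""
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]
    centered = [[x - m for x, m in zip(frame, means)] for frame in frames]
    # Standard deviation per dimension, floored to avoid division by zero.
    stds = [
        max(math.sqrt(sum(frame[d] ** 2 for frame in centered) / num_frames), eps)
        for d in range(dim)
    ]
    result = centered if norm_means else [list(frame) for frame in frames]
    if norm_vars:
        result = [[x / s for x, s in zip(frame, stds)] for frame in result]
    return result


features = [[1.0, 2.0], [3.0, 6.0]]   # two frames, two feature dimensions
print(utterance_mvn(features))
print(utterance_mvn(features, norm_vars=True))
```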

--fbank-fs

The sample frequency used for the mel-fbank creation.

Default: 16000

--n-mels

The number of mel-frequency bins.

Default: 80

--fbank-fmin

Default: 0.0

--fbank-fmax

lm_train.py

Train a new language model on one CPU or one GPU

usage: lm_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
                   [--config3 CONFIG3] [--ngpu NGPU]
                   [--train-dtype {float16,float32,float64,O0,O1,O2,O3}]
                   [--backend {chainer,pytorch}] --outdir OUTDIR
                   [--debugmode DEBUGMODE] --dict DICT [--seed SEED]
                   [--resume [RESUME]] [--verbose VERBOSE]
                   [--tensorboard-dir [TENSORBOARD_DIR]]
                   [--report-interval-iters REPORT_INTERVAL_ITERS]
                   --train-label TRAIN_LABEL --valid-label VALID_LABEL
                   [--test-label TEST_LABEL] [--dump-hdf5-path DUMP_HDF5_PATH]
                   [--opt OPT] [--sortagrad [SORTAGRAD]]
                   [--batchsize BATCHSIZE] [--accum-grad ACCUM_GRAD]
                   [--epoch EPOCH]
                   [--early-stop-criterion [EARLY_STOP_CRITERION]]
                   [--patience [PATIENCE]] [--schedulers SCHEDULERS]
                   [--gradclip GRADCLIP] [--maxlen MAXLEN]
                   [--model-module MODEL_MODULE]

Named Arguments

--config

config file path

--config2

second config file path that overwrites the settings in --config.

--config3

third config file path that overwrites the settings in --config and --config2.

--ngpu

Number of GPUs. If not given, use all visible devices

--train-dtype

Possible choices: float16, float32, float64, O0, O1, O2, O3

Data type for training (only pytorch backend). O0,O1,.. flags require apex. See https://nvidia.github.io/apex/amp.html#opt-levels

Default: “float32”

--backend

Possible choices: chainer, pytorch

Backend library

Default: “chainer”

--outdir

Output directory

--debugmode

Debugmode

Default: 1

--dict

Dictionary

--seed

Random seed

Default: 1

--resume, -r

Resume the training from snapshot

Default: “”

--verbose, -V

Verbose option

Default: 0

--tensorboard-dir

Tensorboard log dir path

--report-interval-iters

Report interval iterations

Default: 100

--train-label

Filename of train label data

--valid-label

Filename of validation label data

--test-label

Filename of test label data

--dump-hdf5-path

Path to dump a preprocessed dataset as hdf5

--opt

Optimizer

Default: “sgd”

--sortagrad

How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs

Default: 0

--batchsize, -b

Number of examples in each mini-batch

Default: 300

--accum-grad

Number of gradient accumulation steps

Default: 1

--epoch, -e

Number of sweeps over the dataset to train

Default: 20

--early-stop-criterion

Value to monitor to trigger an early stopping of the training

Default: “validation/main/loss”

--patience

Number of epochs to wait without improvement before stopping the training

Default: 3

--schedulers

Optimizer schedulers. You can configure parameters in the form <optimizer-param>-<scheduler-name>-<scheduler-param>, e.g., “--schedulers lr=noam --lr-noam-warmup 1000”.
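
The naming pattern in the example (lr=noam together with --lr-noam-warmup) can be unpacked with a short sketch. The parsing helpers are hypothetical. The scale function is the standard Noam warmup shape, normalized here so that it reaches 1.0 at the end of warmup; that normalization is an assumption rather than a statement about the tool.

```python
def parse_scheduler(spec):
    """Split 'lr=noam' into (optimizer_param, scheduler_name)."""
    optimizer_param, scheduler_name = spec.split("=", 1)
    return optimizer_param, scheduler_name


def option_name(optimizer_param, scheduler_name, scheduler_param):
    """Build an option such as '--lr-noam-warmup' from its three parts."""
    return f"--{optimizer_param}-{scheduler_name}-{scheduler_param}"


def noam_scale(step, warmup):
    """Noam-style multiplier: linear warmup, then inverse square-root decay."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return warmup ** 0.5 * min(step ** -0.5, step * warmup ** -1.5)


param, name = parse_scheduler("lr=noam")
print(option_name(param, name, "warmup"))
for step in (250, 500, 1000, 4000):
    print(step, noam_scale(step, warmup=1000))
```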

--gradclip, -c

Gradient norm threshold to clip

Default: 5

--maxlen

Batch size is reduced if the input sequence > ML

Default: 40

--model-module

model defined module (default: espnet.nets.xxx_backend.lm.default:DefaultRNNLM)

Default: “default”

mt_train.py

Train a neural machine translation (NMT) model on one CPU, one or multiple GPUs

usage: mt_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
                   [--config3 CONFIG3] [--ngpu NGPU]
                   [--train-dtype {float16,float32,float64,O0,O1,O2,O3}]
                   [--backend {chainer,pytorch}] --outdir OUTDIR
                   [--debugmode DEBUGMODE] --dict DICT [--seed SEED]
                   [--debugdir DEBUGDIR] [--resume [RESUME]]
                   [--minibatches MINIBATCHES] [--verbose VERBOSE]
                   [--tensorboard-dir [TENSORBOARD_DIR]]
                   [--report-interval-iters REPORT_INTERVAL_ITERS]
                   [--save-interval-iters SAVE_INTERVAL_ITERS]
                   [--train-json TRAIN_JSON] [--valid-json VALID_JSON]
                   [--model-module MODEL_MODULE] [--lsm-weight LSM_WEIGHT]
                   [--report-bleu] [--nbest NBEST] [--beam-size BEAM_SIZE]
                   [--penalty PENALTY] [--maxlenratio MAXLENRATIO]
                   [--minlenratio MINLENRATIO] [--rnnlm RNNLM]
                   [--rnnlm-conf RNNLM_CONF] [--lm-weight LM_WEIGHT]
                   [--sym-space SYM_SPACE] [--sym-blank SYM_BLANK]
                   [--sortagrad [SORTAGRAD]]
                   [--batch-count {auto,seq,bin,frame}]
                   [--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
                   [--batch-frames-in BATCH_FRAMES_IN]
                   [--batch-frames-out BATCH_FRAMES_OUT]
                   [--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
                   [--maxlen-out ML] [--n-iter-processes N_ITER_PROCESSES]
                   [--opt {adadelta,adam,noam}] [--accum-grad ACCUM_GRAD]
                   [--eps EPS] [--eps-decay EPS_DECAY] [--lr LR]
                   [--lr-decay LR_DECAY] [--weight-decay WEIGHT_DECAY]
                   [--criterion {loss,acc}] [--threshold THRESHOLD]
                   [--epochs EPOCHS]
                   [--early-stop-criterion [EARLY_STOP_CRITERION]]
                   [--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
                   [--num-save-attention NUM_SAVE_ATTENTION]
                   [--context-residual [CONTEXT_RESIDUAL]]
                   [--tie-src-tgt-embedding [TIE_SRC_TGT_EMBEDDING]]
                   [--tie-classifier [TIE_CLASSIFIER]] [--enc-init [ENC_INIT]]
                   [--enc-init-mods ENC_INIT_MODS] [--dec-init [DEC_INIT]]
                   [--dec-init-mods DEC_INIT_MODS]
                   [--multilingual MULTILINGUAL] [--replace-sos REPLACE_SOS]

Named Arguments

--config

config file path

--config2

second config file path that overwrites the settings in --config.

--config3

third config file path that overwrites the settings in --config and --config2.

--ngpu

Number of GPUs. If not given, use all visible devices

--train-dtype

Possible choices: float16, float32, float64, O0, O1, O2, O3

Data type for training (only pytorch backend). O0,O1,.. flags require apex. See https://nvidia.github.io/apex/amp.html#opt-levels

Default: “float32”

--backend

Possible choices: chainer, pytorch

Backend library

Default: “chainer”

--outdir

Output directory

--debugmode

Debugmode

Default: 1

--dict

Dictionary for source/target languages

--seed

Random seed

Default: 1

--debugdir

Output directory for debugging

--resume, -r

Resume the training from snapshot

Default: “”

--minibatches, -N

Process only N minibatches (for debug)

Default: -1

--verbose, -V

Verbose option

Default: 0

--tensorboard-dir

Tensorboard log dir path

--report-interval-iters

Report interval iterations

Default: 100

--save-interval-iters

Save snapshot interval iterations

Default: 0

--train-json

Filename of train label data (json)

--valid-json

Filename of validation label data (json)

--model-module

model defined module (default: espnet.nets.xxx_backend.e2e_mt:E2E)

--lsm-weight

Label smoothing weight

Default: 0.0

--report-bleu

Compute BLEU on development set

Default: True

--nbest

Output N-best hypotheses

Default: 1

--beam-size

Beam size

Default: 4

--penalty

Insertion penalty

Default: 0.0

--maxlenratio

Input length ratio to obtain max output length. If maxlenratio=0.0 (default), an end-detect function is used to find the maximum hypothesis length automatically.

Default: 0.0

--minlenratio

Input length ratio to obtain min output length

Default: 0.0

--rnnlm

RNNLM model file to read

--rnnlm-conf

RNNLM model config file to read

--lm-weight

RNNLM weight.

Default: 0.0

--sym-space

Space symbol

Default: “<space>”

--sym-blank

Blank symbol

Default: “<blank>”

--sortagrad

How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs

Default: 0

--batch-count

Possible choices: auto, seq, bin, frame

How to count batch_size. The default (auto) will find how to count by args.

Default: “auto”

--batch-size, --batch-seqs, -b

Maximum seqs in a minibatch (0 to disable)

Default: 0

--batch-bins

Maximum bins in a minibatch (0 to disable)

Default: 0

--batch-frames-in

Maximum input frames in a minibatch (0 to disable)

Default: 0

--batch-frames-out

Maximum output frames in a minibatch (0 to disable)

Default: 0

--batch-frames-inout

Maximum input+output frames in a minibatch (0 to disable)

Default: 0

--maxlen-in, --batch-seq-maxlen-in

When --batch-count=seq, the batch size is reduced if the input sequence length > ML.

Default: 100

--maxlen-out, --batch-seq-maxlen-out

When --batch-count=seq, the batch size is reduced if the output sequence length > ML.

Default: 100
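
To make the effect of --maxlen-in and --maxlen-out concrete, here is a sketch of a length-dependent batch size rule. The exact formula is an assumption for illustration (one common form of length-based batching), and reduced_batch_size and min_batch_size are hypothetical names.

```python
def reduced_batch_size(batch_size, ilen, olen,
                       maxlen_in=100, maxlen_out=100, min_batch_size=1):
    """Shrink the batch size for long sequences (illustrative rule)."""
    # How many times the longest sequence exceeds each threshold.
    factor = max(int(ilen / maxlen_in), int(olen / maxlen_out))
    return max(min_batch_size, int(batch_size / (1 + factor)))


print(reduced_batch_size(32, ilen=50, olen=50))
print(reduced_batch_size(32, ilen=250, olen=50))
```

Under this rule, sequences below both thresholds keep the full batch size, while longer ones are batched in proportionally smaller groups.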

--n-iter-processes

Number of processes of iterator

Default: 0

--opt

Possible choices: adadelta, adam, noam

Optimizer

Default: “adadelta”

--accum-grad

Number of gradient accumulation steps

Default: 1
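
Gradient accumulation applies one parameter update per several minibatches. The toy example below uses a single scalar weight and plain SGD; averaging the accumulated gradients is an assumption, and sgd_with_accumulation is a hypothetical name.

```python
def sgd_with_accumulation(grads, accum_grad=1, lr=0.1, weight=0.0):
    """Apply one update per `accum_grad` minibatch gradients (toy model)."""
    buffer, updates = 0.0, 0
    for step, g in enumerate(grads, start=1):
        buffer += g / accum_grad  # average over the accumulated minibatches
        if step % accum_grad == 0:
            weight -= lr * buffer
            buffer = 0.0
            updates += 1
    return weight, updates


print(sgd_with_accumulation([1.0, 1.0, 1.0, 1.0], accum_grad=2, lr=0.5))
```

With accum_grad=2, four minibatches produce two updates, so the effective batch size doubles.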

--eps

Epsilon constant for optimizer

Default: 1e-08

--eps-decay

Decaying ratio of epsilon

Default: 0.01

--lr

Learning rate for optimizer

Default: 0.001

--lr-decay

Decaying ratio of learning rate

Default: 1.0

--weight-decay

Weight decay ratio

Default: 0.0

--criterion

Possible choices: loss, acc

Criterion to perform epsilon decay

Default: “acc”
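
The interplay of --eps-decay and --criterion can be pictured as follows. This sketch compares the last two validation values, which is a simplifying assumption (a real trigger may compare against a running best), and decay_eps is a hypothetical helper.

```python
def decay_eps(eps, history, criterion="acc", eps_decay=0.01):
    """Return the new epsilon after looking at the validation history."""
    if len(history) < 2:
        return eps
    prev, curr = history[-2], history[-1]
    # Accuracy should go up; loss should go down.
    got_worse = curr < prev if criterion == "acc" else curr > prev
    return eps * eps_decay if got_worse else eps


print(decay_eps(1e-8, [0.80, 0.78], criterion="acc"))
```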

--threshold

Threshold to stop iteration

Default: 0.0001

--epochs, -e

Maximum number of epochs

Default: 30

--early-stop-criterion

Value to monitor to trigger an early stopping of the training

Default: “validation/main/acc”

--patience

Number of epochs to wait without improvement before stopping the training

Default: 3
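
A small sketch of patience-based early stopping on a monitored value such as validation/main/acc. The function should_stop and the higher_is_better switch are assumptions for illustration.

```python
def should_stop(values, patience=3, higher_is_better=True):
    """True once the last `patience` epochs failed to beat the earlier best."""
    if len(values) <= patience:
        return False
    earlier, recent = values[:-patience], values[-patience:]
    if higher_is_better:
        return all(v <= max(earlier) for v in recent)
    return all(v >= min(earlier) for v in recent)


print(should_stop([0.50, 0.60, 0.59, 0.58, 0.60], patience=3))
```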

--grad-clip

Gradient norm threshold to clip

Default: 5
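
Clipping by gradient norm rescales all gradients when their combined L2 norm exceeds the threshold. The following is a framework-free sketch with a hypothetical function name; deep learning frameworks provide their own utilities for this.

```python
import math


def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale gradients so their global L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm


clipped, norm = clip_by_global_norm([6.0, 8.0], max_norm=5.0)
print(norm, clipped)
```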

--num-save-attention

Number of samples of attention to be saved

Default: 3

--context-residual

The flag to switch to use context vector residual in the decoder network

Default: False

--tie-src-tgt-embedding

Tie parameters of source embedding and target embedding.

Default: False

--tie-classifier

Tie parameters of target embedding and output projection layer.

Default: False

--enc-init

Pre-trained ASR model to initialize encoder.

--enc-init-mods

List of encoder modules to initialize, separated by a comma.

Default: enc.enc.

--dec-init

Pre-trained ASR, MT or LM model to initialize decoder.

--dec-init-mods

List of decoder modules to initialize, separated by a comma.

Default: att., dec.
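
The comma-separated module lists are prefixes of parameter names. A sketch of how such a list could select entries from a pre-trained state dictionary is shown below; select_modules and the example parameter names are hypothetical.

```python
def select_modules(state_dict, prefixes):
    """Keep entries whose name starts with any of the given prefixes."""
    mods = [p.strip() for p in prefixes.split(",") if p.strip()]
    return {name: value for name, value in state_dict.items()
            if any(name.startswith(m) for m in mods)}


# Hypothetical parameter names of a pre-trained model.
pretrained = {"enc.enc.0.weight": 1, "dec.embed.weight": 2, "att.0.w": 3}
print(sorted(select_modules(pretrained, "att., dec.")))
```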

--multilingual

Prepend target language ID to the source sentence. Both source/target language IDs must be prepended in the pre-processing stage.

Default: False

--replace-sos

Replace <sos> in the decoder with a target language ID (the first token in the target sequence)

Default: False
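
One way to picture --replace-sos: the first target token is the language ID, and it takes the place of <sos> in the decoder input. The sketch below assumes teacher forcing with separate input and output sequences; add_sos_eos and the integer IDs are illustrative.

```python
def add_sos_eos(target_ids, sos_id, eos_id, replace_sos=False):
    """Build (decoder_input, decoder_output) token sequences (sketch)."""
    if replace_sos:
        # The language ID (first token) stands in for <sos> and is not predicted.
        lang_id, body = target_ids[0], target_ids[1:]
        return [lang_id] + body, body + [eos_id]
    return [sos_id] + target_ids, target_ids + [eos_id]


# 7 plays the role of a language ID such as <de>; 1 is <sos>, 2 is <eos>.
print(add_sos_eos([7, 3, 4], sos_id=1, eos_id=2, replace_sos=True))
```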

mt_trans.py

Translate text using a neural machine translation (NMT) model on one CPU or GPU

usage: mt_trans.py [-h] [--config CONFIG] [--config2 CONFIG2]
                   [--config3 CONFIG3] [--ngpu NGPU]
                   [--dtype {float16,float32,float64}]
                   [--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
                   [--seed SEED] [--verbose VERBOSE] [--batchsize BATCHSIZE]
                   [--preprocess-conf PREPROCESS_CONF] [--api {v1,v2}]
                   [--trans-json TRANS_JSON] --result-label RESULT_LABEL
                   --model MODEL [--model-conf MODEL_CONF] [--nbest NBEST]
                   [--beam-size BEAM_SIZE] [--penalty PENALTY]
                   [--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
                   [--tgt-lang TGT_LANG]

Named Arguments

--config

Config file path

--config2

Second config file path that overwrites the settings in --config

--config3

Third config file path that overwrites the settings in --config and --config2
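
The override order of the three config files can be sketched with plain dictionaries: later files win over earlier ones. The function merge_configs and the example keys are hypothetical; how command-line values interact with the files is not covered here.

```python
def merge_configs(config, config2=None, config3=None):
    """Merge settings so that config3 overrides config2 overrides config."""
    merged = {}
    for layer in (config, config2, config3):
        if layer:
            merged.update(layer)
    return merged


base = {"beam-size": 1, "nbest": 1}
print(merge_configs(base, {"beam-size": 4}, {"nbest": 5}))
```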

--ngpu

Number of GPUs

Default: 0

--dtype

Possible choices: float16, float32, float64

Float precision (only available in --api v2)

Default: “float32”

--backend

Possible choices: chainer, pytorch

Backend library

Default: “chainer”

--debugmode

Debugmode

Default: 1

--seed

Random seed

Default: 1

--verbose, -V

Verbose option

Default: 1

--batchsize

Batch size for beam search (0 means no batch processing)

Default: 1

--preprocess-conf

The configuration file for the pre-processing

--api

Possible choices: v1, v2

Beam search API. v1: default API; it only supports the ASRInterface.recognize method and DefaultRNNLM. v2: experimental API; it supports any model that implements ScorerInterface.

Default: “v1”

--trans-json

Filename of translation data (json)

--result-label

Filename of result label data (json)

--model

Model file parameters to read

--model-conf

Model config file

--nbest

Output N-best hypotheses

Default: 1

--beam-size

Beam size

Default: 1

--penalty

Insertion penalty

Default: 0.1

--maxlenratio

Input length ratio to obtain max output length.

If maxlenratio=0.0, an end-detect function is used to automatically find maximum hypothesis lengths.

Default: 3.0

--minlenratio

Input length ratio to obtain min output length

Default: 0.0

--tgt-lang

Target language ID (e.g., <en>, <de>, or <fr>)

Default: False
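
Putting the mt_trans.py options above together, a decoding call might be assembled as in the sketch below. Only flags listed in the usage string are used; every file path is a hypothetical placeholder.

```python
import shlex

# All file paths below are hypothetical placeholders.
cmd = [
    "mt_trans.py",
    "--backend", "pytorch",
    "--ngpu", "0",
    "--trans-json", "data/test/data.json",
    "--result-label", "exp/decode/result.json",  # required
    "--model", "exp/train/results/model.acc.best",  # required
    "--beam-size", "4",
    "--nbest", "1",
]
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run, or the printed string can be pasted into a shell.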

st_train.py

Train a speech translation (ST) model on one CPU, one or multiple GPUs

usage: st_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
                   [--config3 CONFIG3] [--ngpu NGPU]
                   [--train-dtype {float16,float32,float64,O0,O1,O2,O3}]
                   [--backend {chainer,pytorch}] --outdir OUTDIR
                   [--debugmode DEBUGMODE] --dict DICT [--seed SEED]
                   [--debugdir DEBUGDIR] [--resume [RESUME]]
                   [--minibatches MINIBATCHES] [--verbose VERBOSE]
                   [--tensorboard-dir [TENSORBOARD_DIR]]
                   [--report-interval-iters REPORT_INTERVAL_ITERS]
                   [--save-interval-iters SAVE_INTERVAL_ITERS]
                   [--train-json TRAIN_JSON] [--valid-json VALID_JSON]
                   [--model-module MODEL_MODULE]
                   [--ctc_type {builtin,warpctc}] [--mtlalpha MTLALPHA]
                   [--asr-weight ASR_WEIGHT] [--mt-weight MT_WEIGHT]
                   [--lsm-weight LSM_WEIGHT] [--report-cer] [--report-wer]
                   [--report-bleu] [--nbest NBEST] [--beam-size BEAM_SIZE]
                   [--penalty PENALTY] [--maxlenratio MAXLENRATIO]
                   [--minlenratio MINLENRATIO] [--rnnlm RNNLM]
                   [--rnnlm-conf RNNLM_CONF] [--lm-weight LM_WEIGHT]
                   [--sym-space SYM_SPACE] [--sym-blank SYM_BLANK]
                   [--sortagrad [SORTAGRAD]]
                   [--batch-count {auto,seq,bin,frame}]
                   [--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
                   [--batch-frames-in BATCH_FRAMES_IN]
                   [--batch-frames-out BATCH_FRAMES_OUT]
                   [--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
                   [--maxlen-out ML] [--n-iter-processes N_ITER_PROCESSES]
                   [--preprocess-conf [PREPROCESS_CONF]]
                   [--opt {adadelta,adam,noam}] [--accum-grad ACCUM_GRAD]
                   [--eps EPS] [--eps-decay EPS_DECAY] [--lr LR]
                   [--lr-decay LR_DECAY] [--weight-decay WEIGHT_DECAY]
                   [--criterion {loss,acc}] [--threshold THRESHOLD]
                   [--epochs EPOCHS]
                   [--early-stop-criterion [EARLY_STOP_CRITERION]]
                   [--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
                   [--num-save-attention NUM_SAVE_ATTENTION]
                   [--num-save-ctc NUM_SAVE_CTC] [--grad-noise GRAD_NOISE]
                   [--context-residual [CONTEXT_RESIDUAL]]
                   [--enc-init [ENC_INIT]] [--enc-init-mods ENC_INIT_MODS]
                   [--dec-init [DEC_INIT]] [--dec-init-mods DEC_INIT_MODS]
                   [--multilingual MULTILINGUAL] [--replace-sos REPLACE_SOS]
                   [--stats-file STATS_FILE] [--apply-uttmvn APPLY_UTTMVN]
                   [--uttmvn-norm-means UTTMVN_NORM_MEANS]
                   [--uttmvn-norm-vars UTTMVN_NORM_VARS] [--fbank-fs FBANK_FS]
                   [--n-mels N_MELS] [--fbank-fmin FBANK_FMIN]
                   [--fbank-fmax FBANK_FMAX]

Named Arguments

--config

config file path

--config2

second config file path that overwrites the settings in --config.

--config3

third config file path that overwrites the settings in --config and --config2.

--ngpu

Number of GPUs. If not given, use all visible devices

--train-dtype

Possible choices: float16, float32, float64, O0, O1, O2, O3

Data type for training (only pytorch backend). O0,O1,.. flags require apex. See https://nvidia.github.io/apex/amp.html#opt-levels

Default: “float32”

--backend

Possible choices: chainer, pytorch

Backend library

Default: “chainer”

--outdir

Output directory

--debugmode

Debugmode

Default: 1

--dict

Dictionary

--seed

Random seed

Default: 1

--debugdir

Output directory for debugging

--resume, -r

Resume the training from snapshot

Default: “”

--minibatches, -N

Process only N minibatches (for debug)

Default: -1

--verbose, -V

Verbose option

Default: 0

--tensorboard-dir

Tensorboard log dir path

--report-interval-iters

Report interval iterations

Default: 100

--save-interval-iters

Save snapshot interval iterations

Default: 0

--train-json

Filename of train label data (json)

--valid-json

Filename of validation label data (json)

--model-module

Module that defines the model (default: espnet.nets.xxx_backend.e2e_st:E2E)

--ctc_type

Possible choices: builtin, warpctc

Type of CTC implementation to calculate loss.

Default: “warpctc”

--mtlalpha

Multitask learning coefficient, alpha: alpha*ctc_loss + (1-alpha)*att_loss

Default: 0.0

--asr-weight

Multitask learning coefficient for ASR task, weight: asr_weight*(alpha*ctc_loss + (1-alpha)*att_loss) + (1-asr_weight-mt_weight)*st_loss

Default: 0.0

--mt-weight

Multitask learning coefficient for MT task, weight: mt_weight*mt_loss + (1-mt_weight-asr_weight)*st_loss

Default: 0.0
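
The three coefficients above combine into a single training objective. The sketch below merges the two formulas given for --asr-weight and --mt-weight into one expression, which is an assumption about how they fit together; st_multitask_loss is a hypothetical name.

```python
def st_multitask_loss(st_loss, ctc_loss=0.0, att_loss=0.0, mt_loss=0.0,
                      mtlalpha=0.0, asr_weight=0.0, mt_weight=0.0):
    """Weighted sum of ST, ASR (CTC + attention) and MT losses (sketch)."""
    asr_loss = mtlalpha * ctc_loss + (1.0 - mtlalpha) * att_loss
    return (asr_weight * asr_loss
            + mt_weight * mt_loss
            + (1.0 - asr_weight - mt_weight) * st_loss)


print(st_multitask_loss(2.0, ctc_loss=4.0, att_loss=6.0, mt_loss=8.0,
                        mtlalpha=0.5, asr_weight=0.25, mt_weight=0.25))
```

With all weights at their default of 0.0, the objective reduces to the ST loss alone.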

--lsm-weight

Label smoothing weight

Default: 0.0

--report-cer

Compute CER on development set

Default: False

--report-wer

Compute WER on development set

Default: False

--report-bleu

Compute BLEU on development set

Default: True

--nbest

Output N-best hypotheses

Default: 1

--beam-size

Beam size

Default: 4

--penalty

Insertion penalty

Default: 0.0

--maxlenratio

Input length ratio to obtain max output length.

If maxlenratio=0.0 (default), an end-detect function is used to automatically find maximum hypothesis lengths.

Default: 0.0

--minlenratio

Input length ratio to obtain min output length

Default: 0.0

--rnnlm

RNNLM model file to read

--rnnlm-conf

RNNLM model config file to read

--lm-weight

RNNLM weight.

Default: 0.0

--sym-space

Space symbol

Default: “<space>”

--sym-blank

Blank symbol

Default: “<blank>”

--sortagrad

How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs

Default: 0

--batch-count

Possible choices: auto, seq, bin, frame

How to count the batch size. The default (auto) infers the counting method from the other batch arguments.

Default: “auto”

--batch-size, --batch-seqs, -b

Maximum seqs in a minibatch (0 to disable)

Default: 0

--batch-bins

Maximum bins in a minibatch (0 to disable)

Default: 0

--batch-frames-in

Maximum input frames in a minibatch (0 to disable)

Default: 0

--batch-frames-out

Maximum output frames in a minibatch (0 to disable)

Default: 0

--batch-frames-inout

Maximum input+output frames in a minibatch (0 to disable)

Default: 0

--maxlen-in, --batch-seq-maxlen-in

When --batch-count=seq, the batch size is reduced if the input sequence length > ML.

Default: 800

--maxlen-out, --batch-seq-maxlen-out

When --batch-count=seq, the batch size is reduced if the output sequence length > ML.

Default: 150

--n-iter-processes

Number of processes of iterator

Default: 0

--preprocess-conf

The configuration file for the pre-processing

--opt

Possible choices: adadelta, adam, noam

Optimizer

Default: “adadelta”

--accum-grad

Number of gradient accumulation steps

Default: 1

--eps

Epsilon constant for optimizer

Default: 1e-08

--eps-decay

Decaying ratio of epsilon

Default: 0.01

--lr

Learning rate for optimizer

Default: 0.001

--lr-decay

Decaying ratio of learning rate

Default: 1.0

--weight-decay

Weight decay ratio

Default: 0.0

--criterion

Possible choices: loss, acc

Criterion to perform epsilon decay

Default: “acc”

--threshold

Threshold to stop iteration

Default: 0.0001

--epochs, -e

Maximum number of epochs

Default: 30

--early-stop-criterion

Value to monitor to trigger an early stopping of the training

Default: “validation/main/acc”

--patience

Number of epochs to wait without improvement before stopping the training

Default: 3

--grad-clip

Gradient norm threshold to clip

Default: 5

--num-save-attention

Number of samples of attention to be saved

Default: 3

--num-save-ctc

Number of samples of CTC probability to be saved

Default: 3

--grad-noise

The flag to switch to use noise injection to gradients during training

Default: False

--context-residual

The flag to switch to use context vector residual in the decoder network

Default: False

--enc-init

Pre-trained ASR model to initialize encoder.

--enc-init-mods

List of encoder modules to initialize, separated by a comma.

Default: enc.enc.

--dec-init

Pre-trained ASR, MT or LM model to initialize decoder.

--dec-init-mods

List of decoder modules to initialize, separated by a comma.

Default: att., dec.

--multilingual

Prepend target language ID to the source sentence. Both source/target language IDs must be prepended in the pre-processing stage.

Default: False

--replace-sos

Replace <sos> in the decoder with a target language ID (the first token in the target sequence)

Default: False

--stats-file

The stats file for the feature normalization

--apply-uttmvn

Apply utterance level mean variance normalization.

Default: True

--uttmvn-norm-means

Default: True

--uttmvn-norm-vars

Default: False
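
Utterance-level mean variance normalization computes statistics over the frames of a single utterance. The sketch below assumes that --uttmvn-norm-means controls mean subtraction and --uttmvn-norm-vars controls scaling by the standard deviation; the function utterance_mvn and the eps floor are illustrative.

```python
import math


def utterance_mvn(frames, norm_means=True, norm_vars=False, eps=1e-20):
    """Normalize one utterance given as a list of feature vectors."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    out = [[f[d] - means[d] if norm_means else f[d] for d in range(dim)]
           for f in frames]
    if norm_vars:
        # Per-dimension standard deviation, floored to avoid division by zero.
        stds = [math.sqrt(max(sum((f[d] - means[d]) ** 2 for f in frames) / n, eps))
                for d in range(dim)]
        out = [[row[d] / stds[d] for d in range(dim)] for row in out]
    return out


print(utterance_mvn([[1.0, 2.0], [3.0, 6.0]]))
```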

--fbank-fs

The sample frequency used for the mel-fbank creation.

Default: 16000

--n-mels

The number of mel-frequency bins.

Default: 80

--fbank-fmin

Default: 0.0

--fbank-fmax

st_trans.py

Translate text from speech using a speech translation model on one CPU or GPU

usage: st_trans.py [-h] [--config CONFIG] [--config2 CONFIG2]
                   [--config3 CONFIG3] [--ngpu NGPU]
                   [--dtype {float16,float32,float64}]
                   [--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
                   [--seed SEED] [--verbose VERBOSE] [--batchsize BATCHSIZE]
                   [--preprocess-conf PREPROCESS_CONF] [--api {v1,v2}]
                   [--trans-json TRANS_JSON] --result-label RESULT_LABEL
                   --model MODEL [--nbest NBEST] [--beam-size BEAM_SIZE]
                   [--penalty PENALTY] [--maxlenratio MAXLENRATIO]
                   [--minlenratio MINLENRATIO] [--tgt-lang TGT_LANG]

Named Arguments

--config

Config file path

--config2

Second config file path that overwrites the settings in --config

--config3

Third config file path that overwrites the settings in --config and --config2

--ngpu

Number of GPUs

Default: 0

--dtype

Possible choices: float16, float32, float64

Float precision (only available in --api v2)

Default: “float32”

--backend

Possible choices: chainer, pytorch

Backend library

Default: “chainer”

--debugmode

Debugmode

Default: 1

--seed

Random seed

Default: 1

--verbose, -V

Verbose option

Default: 1

--batchsize

Batch size for beam search (0 means no batch processing)

Default: 1

--preprocess-conf

The configuration file for the pre-processing

--api

Possible choices: v1, v2

Beam search API. v1: default API; it only supports the ASRInterface.recognize method and DefaultRNNLM. v2: experimental API; it supports any model that implements ScorerInterface.

Default: “v1”

--trans-json

Filename of translation data (json)

--result-label

Filename of result label data (json)

--model

Model file parameters to read

--nbest

Output N-best hypotheses

Default: 1

--beam-size

Beam size

Default: 1

--penalty

Insertion penalty

Default: 0.0

--maxlenratio

Input length ratio to obtain max output length.

If maxlenratio=0.0 (default), an end-detect function is used to automatically find maximum hypothesis lengths.

Default: 0.0

--minlenratio

Input length ratio to obtain min output length

Default: 0.0

--tgt-lang

Target language ID (e.g., <en>, <de>, or <fr>)

Default: False

tts_decode.py

Synthesize speech from text using a TTS model on one CPU

usage: tts_decode.py [-h] [--config CONFIG] [--config2 CONFIG2]
                     [--config3 CONFIG3] [--ngpu NGPU]
                     [--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
                     [--seed SEED] --out OUT [--verbose VERBOSE]
                     [--preprocess-conf PREPROCESS_CONF] --json JSON --model
                     MODEL [--model-conf MODEL_CONF]
                     [--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
                     [--threshold THRESHOLD]
                     [--use-att-constraint USE_ATT_CONSTRAINT]
                     [--backward-window BACKWARD_WINDOW]
                     [--forward-window FORWARD_WINDOW]
                     [--fastspeech-alpha FASTSPEECH_ALPHA]
                     [--save-durations SAVE_DURATIONS]
                     [--save-focus-rates SAVE_FOCUS_RATES]

Named Arguments

--config

config file path

--config2

second config file path that overwrites the settings in --config.

--config3

third config file path that overwrites the settings in --config and --config2.

--ngpu

Number of GPUs

Default: 0

--backend

Possible choices: chainer, pytorch

Backend library

Default: “pytorch”

--debugmode

Debugmode

Default: 1

--seed

Random seed

Default: 1

--out

Output filename

--verbose, -V

Verbose option

Default: 0

--preprocess-conf

The configuration file for the pre-processing

--json

Filename of train label data (json)

--model

Model file parameters to read

--model-conf

Model config file

--maxlenratio

Maximum length ratio in decoding

Default: 5

--minlenratio

Minimum length ratio in decoding

Default: 0

--threshold

Threshold value in decoding

Default: 0.5
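
In autoregressive TTS decoding, the length ratios and the threshold jointly decide when generation ends. The sketch below assumes a Tacotron 2-style decoder that emits a stop probability per frame; decode_length is a hypothetical helper, not the tool's implementation.

```python
def decode_length(stop_probs, input_len,
                  maxlenratio=5.0, minlenratio=0.0, threshold=0.5):
    """Return the number of frames generated before decoding stops."""
    maxlen = int(input_len * maxlenratio)
    minlen = int(input_len * minlenratio)
    for t, p in enumerate(stop_probs, start=1):
        if t >= maxlen:
            return t  # hard cap reached
        if t >= minlen and p >= threshold:
            return t  # stop probability crossed the threshold
    return len(stop_probs)


print(decode_length([0.1, 0.2, 0.7, 0.9], input_len=2))
```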

--use-att-constraint

Whether to use the attention constraint

Default: False

--backward-window

Backward window size in the attention constraint

Default: 1

--forward-window

Forward window size in the attention constraint

Default: 3
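
The attention constraint restricts attention to a window around the previously attended input position. A minimal sketch, assuming the window covers indices from last_attended_idx - backward_window up to (but excluding) last_attended_idx + forward_window; constrain_attention is a hypothetical name.

```python
def constrain_attention(weights, last_attended_idx,
                        backward_window=1, forward_window=3):
    """Zero attention outside the window and renormalize (sketch)."""
    lo = max(0, last_attended_idx - backward_window)
    hi = min(len(weights), last_attended_idx + forward_window)
    masked = [w if lo <= i < hi else 0.0 for i, w in enumerate(weights)]
    total = sum(masked)
    return [w / total for w in masked] if total > 0 else masked


print(constrain_attention([0.1, 0.1, 0.2, 0.2, 0.2, 0.2], last_attended_idx=2))
```

Keeping the window narrow encourages monotonic alignments and can reduce skipped or repeated words.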

--fastspeech-alpha

Alpha to change the speed for FastSpeech

Default: 1.0
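
For a duration-based model such as FastSpeech, alpha can be read as a multiplier on the predicted token durations. The sketch below assumes that scaled durations are rounded to whole frames, so alpha above 1.0 yields slower speech; expand_tokens is a hypothetical helper.

```python
def expand_tokens(tokens, durations, alpha=1.0):
    """Repeat each token for round(duration * alpha) frames."""
    scaled = [int(round(d * alpha)) for d in durations]
    return [tok for tok, n in zip(tokens, scaled) for _ in range(n)]


print(expand_tokens(["a", "b"], [2, 1], alpha=2.0))
```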

--save-durations

Whether to save durations converted from attentions

Default: False

--save-focus-rates

Whether to save focus rates of attentions

Default: False

tts_train.py

Train a new text-to-speech (TTS) model on one CPU, one or multiple GPUs

usage: tts_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
                    [--config3 CONFIG3] [--ngpu NGPU]
                    [--backend {chainer,pytorch}] --outdir OUTDIR
                    [--debugmode DEBUGMODE] [--seed SEED] [--resume [RESUME]]
                    [--minibatches MINIBATCHES] [--verbose VERBOSE]
                    [--tensorboard-dir [TENSORBOARD_DIR]]
                    [--eval-interval-epochs EVAL_INTERVAL_EPOCHS]
                    [--save-interval-epochs SAVE_INTERVAL_EPOCHS]
                    [--report-interval-iters REPORT_INTERVAL_ITERS]
                    --train-json TRAIN_JSON --valid-json VALID_JSON
                    [--model-module MODEL_MODULE] [--sortagrad [SORTAGRAD]]
                    [--batch-sort-key [{shuffle,output,input}]]
                    [--batch-count {auto,seq,bin,frame}]
                    [--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
                    [--batch-frames-in BATCH_FRAMES_IN]
                    [--batch-frames-out BATCH_FRAMES_OUT]
                    [--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
                    [--maxlen-out ML]
                    [--num-iter-processes NUM_ITER_PROCESSES]
                    [--preprocess-conf PREPROCESS_CONF]
                    [--use-speaker-embedding USE_SPEAKER_EMBEDDING]
                    [--use-second-target USE_SECOND_TARGET]
                    [--opt {adam,noam}] [--accum-grad ACCUM_GRAD] [--lr LR]
                    [--eps EPS] [--weight-decay WEIGHT_DECAY]
                    [--epochs EPOCHS]
                    [--early-stop-criterion [EARLY_STOP_CRITERION]]
                    [--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
                    [--num-save-attention NUM_SAVE_ATTENTION]
                    [--keep-all-data-on-mem KEEP_ALL_DATA_ON_MEM]
                    [--enc-init ENC_INIT] [--enc-init-mods ENC_INIT_MODS]
                    [--dec-init DEC_INIT] [--dec-init-mods DEC_INIT_MODS]
                    [--freeze-mods FREEZE_MODS]

Named Arguments

--config

config file path

--config2

second config file path that overwrites the settings in --config.

--config3

third config file path that overwrites the settings in --config and --config2.

--ngpu

Number of GPUs. If not given, use all visible devices

--backend

Possible choices: chainer, pytorch

Backend library

Default: “pytorch”

--outdir

Output directory

--debugmode

Debugmode

Default: 1

--seed

Random seed

Default: 1

--resume, -r

Resume the training from snapshot

Default: “”

--minibatches, -N

Process only N minibatches (for debug)

Default: -1

--verbose, -V

Verbose option

Default: 0

--tensorboard-dir

Tensorboard log directory path

--eval-interval-epochs

Evaluation interval epochs

Default: 1

--save-interval-epochs

Save interval epochs

Default: 1

--report-interval-iters

Report interval iterations

Default: 100

--train-json

Filename of training json

--valid-json

Filename of validation json

--model-module

Module that defines the model

Default: “espnet.nets.pytorch_backend.e2e_tts_tacotron2:Tacotron2”

--sortagrad

How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs

Default: 0

--batch-sort-key

Possible choices: shuffle, output, input

Batch sorting key. “shuffle” only works with --batch-count “seq”.

Default: “shuffle”

--batch-count

Possible choices: auto, seq, bin, frame

How to count the batch size. The default (auto) infers the counting method from the other batch arguments.

Default: “auto”

--batch-size, --batch-seqs, -b

Maximum seqs in a minibatch (0 to disable)

Default: 0

--batch-bins

Maximum bins in a minibatch (0 to disable)

Default: 0

--batch-frames-in

Maximum input frames in a minibatch (0 to disable)

Default: 0

--batch-frames-out

Maximum output frames in a minibatch (0 to disable)

Default: 0

--batch-frames-inout

Maximum input+output frames in a minibatch (0 to disable)

Default: 0

--maxlen-in, --batch-seq-maxlen-in

When --batch-count=seq, the batch size is reduced if the input sequence length > ML.

Default: 100

--maxlen-out, --batch-seq-maxlen-out

When --batch-count=seq, the batch size is reduced if the output sequence length > ML.

Default: 200

--num-iter-processes

Number of processes of iterator

Default: 0

--preprocess-conf

The configuration file for the pre-processing

--use-speaker-embedding

Whether to use speaker embedding

Default: False

--use-second-target

Whether to use second target

Default: False

--opt

Possible choices: adam, noam

Optimizer

Default: “adam”

--accum-grad

Number of gradient accumulation steps

Default: 1

--lr

Learning rate for optimizer

Default: 0.001

--eps

Epsilon for optimizer

Default: 1e-06

--weight-decay

Weight decay coefficient for optimizer

Default: 1e-06

--epochs, -e

Maximum number of epochs

Default: 30

--early-stop-criterion

Value to monitor to trigger an early stopping of the training

Default: “validation/main/loss”

--patience

Number of epochs to wait without improvement before stopping the training

Default: 3

--grad-clip

Gradient norm threshold to clip

Default: 1

--num-save-attention

Number of samples of attention to be saved

Default: 5

--keep-all-data-on-mem

Whether to keep all data on memory

Default: False

--enc-init

Pre-trained TTS model path to initialize encoder.

--enc-init-mods

List of encoder modules to initialize, separated by a comma.

Default: enc.

--dec-init

Pre-trained TTS model path to initialize decoder.

--dec-init-mods

List of decoder modules to initialize, separated by a comma.

Default: dec.

--freeze-mods

List of modules to freeze (not to train), separated by a comma.
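
Freezing by module prefix amounts to switching off gradient computation for every parameter whose name starts with one of the listed prefixes. The following sketch only computes the flags; requires_grad_flags and the parameter names are hypothetical.

```python
def requires_grad_flags(param_names, freeze_mods=""):
    """Map each parameter name to whether it should be trained."""
    frozen = [m.strip() for m in freeze_mods.split(",") if m.strip()]
    return {name: not any(name.startswith(m) for m in frozen)
            for name in param_names}


# Hypothetical parameter names.
names = ["enc.embed.weight", "dec.lstm.0.weight"]
print(requires_grad_flags(names, freeze_mods="enc."))
```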

vc_decode.py

Convert speech using a VC model on one CPU

usage: vc_decode.py [-h] [--config CONFIG] [--config2 CONFIG2]
                    [--config3 CONFIG3] [--ngpu NGPU]
                    [--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
                    [--seed SEED] --out OUT [--verbose VERBOSE]
                    [--preprocess-conf PREPROCESS_CONF] --json JSON --model
                    MODEL [--model-conf MODEL_CONF]
                    [--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
                    [--threshold THRESHOLD]
                    [--use-att-constraint USE_ATT_CONSTRAINT]
                    [--backward-window BACKWARD_WINDOW]
                    [--forward-window FORWARD_WINDOW]
                    [--save-durations SAVE_DURATIONS]
                    [--save-focus-rates SAVE_FOCUS_RATES]

Named Arguments

--config

config file path

--config2

second config file path that overwrites the settings in --config.

--config3

third config file path that overwrites the settings in --config and --config2.

--ngpu

Number of GPUs

Default: 0

--backend

Possible choices: chainer, pytorch

Backend library

Default: “pytorch”

--debugmode

Debugmode

Default: 1

--seed

Random seed

Default: 1

--out

Output filename

--verbose, -V

Verbose option

Default: 0

--preprocess-conf

The configuration file for the pre-processing

--json

Filename of train label data (json)

--model

Model file parameters to read

--model-conf

Model config file

--maxlenratio

Maximum length ratio in decoding

Default: 5

--minlenratio

Minimum length ratio in decoding

Default: 0

--threshold

Threshold value in decoding

Default: 0.5

--use-att-constraint

Whether to use the attention constraint

Default: False

--backward-window

Backward window size in the attention constraint

Default: 1

--forward-window

Forward window size in the attention constraint

Default: 3

--save-durations

Whether to save durations converted from attentions

Default: False

--save-focus-rates

Whether to save focus rates of attentions

Default: False

vc_train.py

Train a new voice conversion (VC) model on one CPU, one or multiple GPUs

usage: vc_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
                   [--config3 CONFIG3] [--ngpu NGPU]
                   [--backend {chainer,pytorch}] --outdir OUTDIR
                   [--debugmode DEBUGMODE] [--seed SEED] [--resume [RESUME]]
                   [--minibatches MINIBATCHES] [--verbose VERBOSE]
                   [--tensorboard-dir [TENSORBOARD_DIR]]
                   [--eval-interval-epochs EVAL_INTERVAL_EPOCHS]
                   [--save-interval-epochs SAVE_INTERVAL_EPOCHS]
                   [--report-interval-iters REPORT_INTERVAL_ITERS]
                   [--srcspk SRCSPK] [--trgspk TRGSPK] --train-json TRAIN_JSON
                   --valid-json VALID_JSON [--model-module MODEL_MODULE]
                   [--sortagrad [SORTAGRAD]]
                   [--batch-sort-key [{shuffle,output,input}]]
                   [--batch-count {auto,seq,bin,frame}]
                   [--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
                   [--batch-frames-in BATCH_FRAMES_IN]
                   [--batch-frames-out BATCH_FRAMES_OUT]
                   [--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
                   [--maxlen-out ML] [--num-iter-processes NUM_ITER_PROCESSES]
                   [--preprocess-conf PREPROCESS_CONF]
                   [--use-speaker-embedding USE_SPEAKER_EMBEDDING]
                   [--use-second-target USE_SECOND_TARGET]
                   [--opt {adam,noam,lamb}] [--accum-grad ACCUM_GRAD]
                   [--lr LR] [--eps EPS] [--weight-decay WEIGHT_DECAY]
                   [--epochs EPOCHS]
                   [--early-stop-criterion [EARLY_STOP_CRITERION]]
                   [--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
                   [--num-save-attention NUM_SAVE_ATTENTION]
                   [--keep-all-data-on-mem KEEP_ALL_DATA_ON_MEM]
                   [--enc-init ENC_INIT] [--enc-init-mods ENC_INIT_MODS]
                   [--dec-init DEC_INIT] [--dec-init-mods DEC_INIT_MODS]
                   [--freeze-mods FREEZE_MODS]

Named Arguments

--config

config file path

--config2

second config file path that overwrites the settings in --config.

--config3

third config file path that overwrites the settings in --config and --config2.

--ngpu

Number of GPUs. If not given, use all visible devices

--backend

Possible choices: chainer, pytorch

Backend library

Default: “pytorch”

--outdir

Output directory

--debugmode

Debugmode

Default: 1

--seed

Random seed

Default: 1

--resume, -r

Resume the training from snapshot

Default: “”

--minibatches, -N

Process only N minibatches (for debug)

Default: -1

--verbose, -V

Verbose option

Default: 0

--tensorboard-dir

Tensorboard log directory path

--eval-interval-epochs

Evaluation interval epochs

Default: 100

--save-interval-epochs

Save interval epochs

Default: 1

--report-interval-iters

Report interval iterations

Default: 10

--srcspk

Source speaker

--trgspk

Target speaker

--train-json

Filename of training json

--valid-json

Filename of validation json

--model-module

Module that defines the model

Default: “espnet.nets.pytorch_backend.e2e_tts_tacotron2:Tacotron2”

--sortagrad

How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs

Default: 0

--batch-sort-key

Possible choices: shuffle, output, input

Batch sorting key. “shuffle” only works with --batch-count “seq”.

Default: “shuffle”

--batch-count

Possible choices: auto, seq, bin, frame

How to count the batch size. The default (auto) infers the counting method from the other batch arguments.

Default: “auto”

--batch-size, --batch-seqs, -b

Maximum seqs in a minibatch (0 to disable)

Default: 0

--batch-bins

Maximum bins in a minibatch (0 to disable)

Default: 0

--batch-frames-in

Maximum input frames in a minibatch (0 to disable)

Default: 0

--batch-frames-out

Maximum output frames in a minibatch (0 to disable)

Default: 0

--batch-frames-inout

Maximum input+output frames in a minibatch (0 to disable)

Default: 0

--maxlen-in, --batch-seq-maxlen-in

When --batch-count=seq, the batch size is reduced if the input sequence length > ML.

Default: 100

--maxlen-out, --batch-seq-maxlen-out

When --batch-count=seq, the batch size is reduced if the output sequence length > ML.

Default: 200

--num-iter-processes

Number of processes of iterator

Default: 0

--preprocess-conf

The configuration file for the pre-processing

--use-speaker-embedding

Whether to use speaker embedding

Default: False

--use-second-target

Whether to use second target

Default: False

--opt

Possible choices: adam, noam, lamb

Optimizer

Default: “adam”

--accum-grad

Number of gradient accumulation steps

Default: 1

--lr

Learning rate for optimizer

Default: 0.001

--eps

Epsilon for optimizer

Default: 1e-06

--weight-decay

Weight decay coefficient for optimizer

Default: 1e-06

--epochs, -e

Maximum number of epochs

Default: 30

--early-stop-criterion

Value to monitor to trigger an early stopping of the training

Default: “validation/main/loss”

--patience

Number of epochs to wait without improvement before stopping the training

Default: 3

--grad-clip

Gradient norm threshold to clip

Default: 1

--num-save-attention

Number of samples of attention to be saved

Default: 5

--keep-all-data-on-mem

Whether to keep all data on memory

Default: False

--enc-init

Pre-trained model path to initialize encoder.

--enc-init-mods

List of encoder modules to initialize, separated by a comma.

Default: enc.

--dec-init

Pre-trained model path to initialize decoder.

--dec-init-mods

List of decoder modules to initialize, separated by a comma.

Default: dec.

--freeze-mods

List of modules to freeze (not to train), separated by a comma.