core tools (espnet2)

ESPnet2 provides several command-line tools for training and evaluating neural networks (NN) under espnet2/bin:

aggregate_stats_dirs.py

usage: aggregate_stats_dirs.py [-h]
                               [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                               [--skip_sum_stats] [--input_dir INPUT_DIR]
                               --output_dir OUTPUT_DIR

Aggregate statistics directories into one directory

optional arguments:
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --skip_sum_stats      Skip computing the sum of statistics. (default: False)
  --input_dir INPUT_DIR
                        Input directories (default: None)
  --output_dir OUTPUT_DIR
                        Output directory (default: None)
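
The options above can be assembled into an argument list from Python. This is a minimal sketch, not part of ESPnet2: the directory names are hypothetical, running the tool as the module `espnet2.bin.aggregate_stats_dirs` is an assumption based on its location under espnet2/bin, and repeating `--input_dir` once per directory is an assumption based on the plural "Input directories" in the help text. Nothing is executed here.

```python
def build_aggregate_command(input_dirs, output_dir, skip_sum_stats=False):
    """Build an argv list for aggregate_stats_dirs.py (sketch, not executed)."""
    # Assumption: the tool is launched as a module under espnet2/bin.
    argv = ["python", "-m", "espnet2.bin.aggregate_stats_dirs"]
    # Assumption: --input_dir is repeated once per statistics directory.
    for directory in input_dirs:
        argv += ["--input_dir", directory]
    # --output_dir is the only required option in the usage line.
    argv += ["--output_dir", output_dir]
    if skip_sum_stats:
        argv.append("--skip_sum_stats")
    return argv


# Hypothetical directory names, for illustration only.
command = build_aggregate_command(["stats.1", "stats.2"], "stats_all")
```

The resulting list could be passed to `subprocess.run(command, check=True)` in an environment where ESPnet2 is installed.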

asr_align.py

usage: asr_align.py [-h] [--config CONFIG]
                    [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                    [--ngpu NGPU] [--dtype {float16,float32,float64}]
                    --asr_train_config ASR_TRAIN_CONFIG --asr_model_file
                    ASR_MODEL_FILE [--token_type {char,bpe,None}]
                    [--bpemodel BPEMODEL] [--fs FS]
                    [--min_window_size MIN_WINDOW_SIZE]
                    [--max_window_size MAX_WINDOW_SIZE]
                    [--set_blank SET_BLANK] [--gratis_blank GRATIS_BLANK]
                    [--replace_spaces_with_blanks REPLACE_SPACES_WITH_BLANKS]
                    [--scoring_length SCORING_LENGTH]
                    [--time_stamps {auto,fixed}]
                    [--text_converter {tokenize,classic}]
                    [--kaldi_style_text KALDI_STYLE_TEXT]
                    [--print_utt_text PRINT_UTT_TEXT]
                    [--print_utt_score PRINT_UTT_SCORE] -a AUDIO -t TEXT
                    [-o OUTPUT]

ASR Decoding

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)

Model configuration related:
  --asr_train_config ASR_TRAIN_CONFIG
  --asr_model_file ASR_MODEL_FILE

Text converter related:
  --token_type {char,bpe,None}
                        The token type for the ASR model. If not given, it is
                        taken from the training args (default: None)
  --bpemodel BPEMODEL   The path of the sentencepiece model. If not given, it
                        is taken from the training args (default: None)

CTC segmentation related:
  --fs FS               Sampling Frequency. The sampling frequency (in Hz) is
                        needed to correctly determine the starting and ending
                        time of aligned segments. (default: 16000)
  --min_window_size MIN_WINDOW_SIZE
                        Minimum window size considered for utterance.
                        (default: None)
  --max_window_size MAX_WINDOW_SIZE
                        Maximum window size considered for utterance.
                        (default: None)
  --set_blank SET_BLANK
                        Index of model dictionary for blank token. (default:
                        None)
  --gratis_blank GRATIS_BLANK
                        Set the transition cost of the blank token to zero.
                        Audio sections labeled with blank tokens can then be
                        skipped without penalty. Useful if there are unrelated
                        audio segments between utterances. (default: False)
  --replace_spaces_with_blanks REPLACE_SPACES_WITH_BLANKS
                        Fill blanks in between words to better model pauses
                        between words. This option is only active for
                        `--text_converter classic`. Segments can be misaligned
                        if this option is combined with --gratis_blank.
                        (default: False)
  --scoring_length SCORING_LENGTH
                        Changes partitioning length L for calculation of the
                        confidence score. (default: None)
  --time_stamps {auto,fixed}
                        Select how the CTC index duration is estimated, and
                        thus how the time stamps are calculated. (default:
                        auto)
  --text_converter {tokenize,classic}
                        How CTC segmentation handles text. (default: tokenize)

Input/output arguments:
  --kaldi_style_text KALDI_STYLE_TEXT
                        Assume that the input text file is kaldi-style
                        formatted, i.e., the utterance name is at the
                        beginning of each line. (default: True)
  --print_utt_text PRINT_UTT_TEXT
                        Include the utterance text in the segments output.
                        (default: True)
  --print_utt_score PRINT_UTT_SCORE
                        Include the confidence score in the segments output.
                        (default: True)
  -a AUDIO, --audio AUDIO
                        Input audio file. (default: None)
  -t TEXT, --text TEXT  Input text file. Each line contains the ground truth
                        of a single utterance. Kaldi-style text files include
                        the name of the utterance as the first word in the
                        line. (default: None)
  -o OUTPUT, --output OUTPUT
                        Output in the form of a `segments` file. If not given,
                        output is written to stdout. (default: -)
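
The Kaldi-style text format described for `--text` (utterance name as the first word of each line, ground truth after it) can be read as follows. This is a minimal sketch for preparing or checking the input file, not the parser used by asr_align.py; the function name and the sample lines are hypothetical.

```python
def parse_kaldi_text(lines):
    """Map each utterance name to its ground-truth text."""
    utterances = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        # The first whitespace-separated word is the utterance name.
        parts = line.split(maxsplit=1)
        name = parts[0]
        text = parts[1] if len(parts) > 1 else ""
        utterances[name] = text
    return utterances


# Hypothetical example lines in Kaldi-style format.
example = parse_kaldi_text(["utt1 hello world", "utt2 good morning"])
```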

asr_inference.py

usage: asr_inference.py [-h] [--config CONFIG]
                        [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                        --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                        [--dtype {float16,float32,float64}]
                        [--num_workers NUM_WORKERS]
                        --data_path_and_name_and_type
                        DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                        [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                        [--asr_train_config ASR_TRAIN_CONFIG]
                        [--asr_model_file ASR_MODEL_FILE]
                        [--lm_train_config LM_TRAIN_CONFIG]
                        [--lm_file LM_FILE]
                        [--word_lm_train_config WORD_LM_TRAIN_CONFIG]
                        [--word_lm_file WORD_LM_FILE]
                        [--ngram_file NGRAM_FILE] [--model_tag MODEL_TAG]
                        [--enh_s2t_task ENH_S2T_TASK]
                        [--quantize_asr_model QUANTIZE_ASR_MODEL]
                        [--quantize_lm QUANTIZE_LM]
                        [--quantize_modules [QUANTIZE_MODULES [QUANTIZE_MODULES ...]]]
                        [--quantize_dtype {float16,qint8}]
                        [--batch_size BATCH_SIZE] [--nbest NBEST]
                        [--beam_size BEAM_SIZE] [--penalty PENALTY]
                        [--maxlenratio MAXLENRATIO]
                        [--minlenratio MINLENRATIO] [--ctc_weight CTC_WEIGHT]
                        [--lm_weight LM_WEIGHT] [--ngram_weight NGRAM_WEIGHT]
                        [--streaming STREAMING]
                        [--hugging_face_decoder HUGGING_FACE_DECODER]
                        [--hugging_face_decoder_max_length HUGGING_FACE_DECODER_MAX_LENGTH]
                        [--transducer_conf TRANSDUCER_CONF]
                        [--token_type {char,bpe,None}] [--bpemodel BPEMODEL]

ASR Decoding

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --asr_train_config ASR_TRAIN_CONFIG
                        ASR training configuration (default: None)
  --asr_model_file ASR_MODEL_FILE
                        ASR model parameter file (default: None)
  --lm_train_config LM_TRAIN_CONFIG
                        LM training configuration (default: None)
  --lm_file LM_FILE     LM parameter file (default: None)
  --word_lm_train_config WORD_LM_TRAIN_CONFIG
                        Word LM training configuration (default: None)
  --word_lm_file WORD_LM_FILE
                        Word LM parameter file (default: None)
  --ngram_file NGRAM_FILE
                        N-gram parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If this option is specified,
                        *_train_config and *_file will be overwritten
                        (default: None)
  --enh_s2t_task ENH_S2T_TASK
                        enhancement and asr joint model (default: False)

Quantization related:
  --quantize_asr_model QUANTIZE_ASR_MODEL
                        Apply dynamic quantization to ASR model. (default:
                        False)
  --quantize_lm QUANTIZE_LM
                        Apply dynamic quantization to LM. (default: False)
  --quantize_modules [QUANTIZE_MODULES [QUANTIZE_MODULES ...]]
                        List of modules to be dynamically quantized. E.g.:
                        --quantize_modules=[Linear,LSTM,GRU]. Each specified
                        module should be an attribute of 'torch.nn', e.g.:
                        torch.nn.Linear, torch.nn.LSTM, torch.nn.GRU, ...
                        (default: ['Linear'])
  --quantize_dtype {float16,qint8}
                        Dtype for dynamic quantization. (default: qint8)

Beam-search related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --nbest NBEST         Output N-best hypotheses (default: 1)
  --beam_size BEAM_SIZE
                        Beam size (default: 20)
  --penalty PENALTY     Insertion penalty (default: 0.0)
  --maxlenratio MAXLENRATIO
                        Input length ratio to obtain max output length. If
                        maxlenratio=0.0 (default), it uses an end-detect
                        function to automatically find maximum hypothesis
                        lengths. If maxlenratio<0.0, its absolute value is
                        interpreted as a constant max output length (default:
                        0.0)
  --minlenratio MINLENRATIO
                        Input length ratio to obtain min output length
                        (default: 0.0)
  --ctc_weight CTC_WEIGHT
                        CTC weight in joint decoding (default: 0.5)
  --lm_weight LM_WEIGHT
                        RNNLM weight (default: 1.0)
  --ngram_weight NGRAM_WEIGHT
                        ngram weight (default: 0.9)
  --streaming STREAMING
  --hugging_face_decoder HUGGING_FACE_DECODER
  --hugging_face_decoder_max_length HUGGING_FACE_DECODER_MAX_LENGTH
  --transducer_conf TRANSDUCER_CONF
                        The keyword arguments for transducer beam search.
                        (default: None)

Text converter related:
  --token_type {char,bpe,None}
                        The token type for the ASR model. If not given, it is
                        taken from the training args (default: None)
  --bpemodel BPEMODEL   The path of the sentencepiece model. If not given, it
                        is taken from the training args (default: None)
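
The three cases described for `--maxlenratio` can be sketched as a small function. This is an illustration of the help text, not the tool's code: the function name is hypothetical, and using the input length as the upper bound in the end-detect case (`maxlenratio=0.0`) is an assumption, since the help text only says that the maximum length is found automatically.

```python
def max_output_length(input_length, maxlenratio=0.0):
    """Return the maximum hypothesis length implied by maxlenratio."""
    if maxlenratio == 0.0:
        # End detection stops the search automatically.
        # Assumption: the input length serves as the hard upper bound.
        return input_length
    if maxlenratio < 0.0:
        # A negative value is read as a constant maximum length.
        return int(abs(maxlenratio))
    # A positive value scales the input length.
    return max(1, int(maxlenratio * input_length))
```

For an input of 100 frames, a ratio of 0.5 gives a maximum of 50 output tokens, while `-25` fixes the maximum at 25 regardless of the input length.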

asr_inference_streaming.py

usage: asr_inference_streaming.py [-h] [--config CONFIG]
                                  [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                                  --output_dir OUTPUT_DIR [--ngpu NGPU]
                                  [--seed SEED]
                                  [--dtype {float16,float32,float64}]
                                  [--num_workers NUM_WORKERS]
                                  --data_path_and_name_and_type
                                  DATA_PATH_AND_NAME_AND_TYPE
                                  [--key_file KEY_FILE]
                                  [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                                  [--sim_chunk_length SIM_CHUNK_LENGTH]
                                  --asr_train_config ASR_TRAIN_CONFIG
                                  --asr_model_file ASR_MODEL_FILE
                                  [--lm_train_config LM_TRAIN_CONFIG]
                                  [--lm_file LM_FILE]
                                  [--word_lm_train_config WORD_LM_TRAIN_CONFIG]
                                  [--word_lm_file WORD_LM_FILE]
                                  [--batch_size BATCH_SIZE] [--nbest NBEST]
                                  [--beam_size BEAM_SIZE] [--penalty PENALTY]
                                  [--maxlenratio MAXLENRATIO]
                                  [--minlenratio MINLENRATIO]
                                  [--ctc_weight CTC_WEIGHT]
                                  [--lm_weight LM_WEIGHT]
                                  [--disable_repetition_detection DISABLE_REPETITION_DETECTION]
                                  [--encoded_feat_length_limit ENCODED_FEAT_LENGTH_LIMIT]
                                  [--decoder_text_length_limit DECODER_TEXT_LENGTH_LIMIT]
                                  [--token_type {char,bpe,None}]
                                  [--bpemodel BPEMODEL]

ASR Decoding

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
  --sim_chunk_length SIM_CHUNK_LENGTH
                        The length of one chunk into which speech is divided
                        for evaluation of streaming processing.
                        (default: 0)

The model configuration related:
  --asr_train_config ASR_TRAIN_CONFIG
  --asr_model_file ASR_MODEL_FILE
  --lm_train_config LM_TRAIN_CONFIG
  --lm_file LM_FILE
  --word_lm_train_config WORD_LM_TRAIN_CONFIG
  --word_lm_file WORD_LM_FILE

Beam-search related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --nbest NBEST         Output N-best hypotheses (default: 1)
  --beam_size BEAM_SIZE
                        Beam size (default: 20)
  --penalty PENALTY     Insertion penalty (default: 0.0)
  --maxlenratio MAXLENRATIO
                        Input length ratio to obtain max output length. If
                        maxlenratio=0.0 (default), it uses an end-detect
                        function to automatically find maximum hypothesis
                        lengths (default: 0.0)
  --minlenratio MINLENRATIO
                        Input length ratio to obtain min output length
                        (default: 0.0)
  --ctc_weight CTC_WEIGHT
                        CTC weight in joint decoding (default: 0.5)
  --lm_weight LM_WEIGHT
                        RNNLM weight (default: 1.0)
  --disable_repetition_detection DISABLE_REPETITION_DETECTION
  --encoded_feat_length_limit ENCODED_FEAT_LENGTH_LIMIT
                        Limit the length of the encoded features input to
                        the decoder. (default: 0)
  --decoder_text_length_limit DECODER_TEXT_LENGTH_LIMIT
                        Limit the length of the text input to the decoder.
                        (default: 0)

Text converter related:
  --token_type {char,bpe,None}
                        The token type for the ASR model. If not given, it is
                        taken from the training args (default: None)
  --bpemodel BPEMODEL   The path of the sentencepiece model. If not given, it
                        is taken from the training args (default: None)
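
The idea behind `--sim_chunk_length`, dividing speech into fixed-length chunks to simulate streaming input, can be sketched as follows. The function name is hypothetical, and two points are assumptions not stated in the help text: that the length is counted in samples, and that the default of 0 means the whole utterance is processed as a single chunk.

```python
def split_into_chunks(samples, sim_chunk_length):
    """Split a sequence of samples into consecutive fixed-length chunks."""
    if sim_chunk_length <= 0:
        # Assumption: 0 (the default) disables chunking.
        return [samples]
    return [
        samples[start:start + sim_chunk_length]
        for start in range(0, len(samples), sim_chunk_length)
    ]


# Ten dummy samples split into chunks of four; the last chunk is shorter.
chunks = split_into_chunks(list(range(10)), 4)
```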

asr_train.py

usage: asr_train.py [-h] [--config CONFIG] [--print_config]
                    [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                    [--dry_run DRY_RUN]
                    [--iterator_type {sequence,chunk,task,none}]
                    [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                    [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                    [--dist_backend DIST_BACKEND]
                    [--dist_init_method DIST_INIT_METHOD]
                    [--dist_world_size DIST_WORLD_SIZE]
                    [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                    [--dist_master_addr DIST_MASTER_ADDR]
                    [--dist_master_port DIST_MASTER_PORT]
                    [--dist_launcher {slurm,mpi,None}]
                    [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                    [--unused_parameters UNUSED_PARAMETERS]
                    [--sharded_ddp SHARDED_DDP]
                    [--cudnn_enabled CUDNN_ENABLED]
                    [--cudnn_benchmark CUDNN_BENCHMARK]
                    [--cudnn_deterministic CUDNN_DETERMINISTIC]
                    [--collect_stats COLLECT_STATS]
                    [--write_collected_feats WRITE_COLLECTED_FEATS]
                    [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                    [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                    [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                    [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                    [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                    [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                    [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                    [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                    [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                    [--train_dtype {float16,float32,float64}]
                    [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                    [--use_matplotlib USE_MATPLOTLIB]
                    [--use_tensorboard USE_TENSORBOARD]
                    [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                    [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                    [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                    [--wandb_name WANDB_NAME]
                    [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                    [--detect_anomaly DETECT_ANOMALY]
                    [--pretrain_path PRETRAIN_PATH]
                    [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                    [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                    [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                    [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                    [--batch_size BATCH_SIZE]
                    [--valid_batch_size VALID_BATCH_SIZE]
                    [--batch_bins BATCH_BINS]
                    [--valid_batch_bins VALID_BATCH_BINS]
                    [--train_shape_file TRAIN_SHAPE_FILE]
                    [--valid_shape_file VALID_SHAPE_FILE]
                    [--batch_type {unsorted,sorted,folded,length,numel}]
                    [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                    [--fold_length FOLD_LENGTH]
                    [--sort_in_batch {descending,ascending}]
                    [--sort_batch {descending,ascending}]
                    [--multiple_iterator MULTIPLE_ITERATOR]
                    [--chunk_length CHUNK_LENGTH]
                    [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                    [--num_cache_chunks NUM_CACHE_CHUNKS]
                    [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                    [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                    [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                    [--max_cache_size MAX_CACHE_SIZE]
                    [--max_cache_fd MAX_CACHE_FD]
                    [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                    [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                    [--optim_conf OPTIM_CONF]
                    [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                    [--scheduler_conf SCHEDULER_CONF]
                    [--token_list TOKEN_LIST]
                    [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                    [--input_size INPUT_SIZE] [--ctc_conf CTC_CONF]
                    [--joint_net_conf JOINT_NET_CONF]
                    [--use_preprocessor USE_PREPROCESSOR]
                    [--token_type {bpe,char,word,phn,hugging_face}]
                    [--bpemodel BPEMODEL]
                    [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                    [--cleaner {None,tacotron,jaconv,vietnamese}]
                    [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                    [--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
                    [--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
                    [--noise_scp NOISE_SCP]
                    [--noise_apply_prob NOISE_APPLY_PROB]
                    [--noise_db_range NOISE_DB_RANGE]
                    [--short_noise_thres SHORT_NOISE_THRES]
                    [--frontend {default,sliding_window,s3prl,fused}]
                    [--frontend_conf FRONTEND_CONF] [--specaug {specaug,None}]
                    [--specaug_conf SPECAUG_CONF]
                    [--normalize {global_mvn,utterance_mvn,None}]
                    [--normalize_conf NORMALIZE_CONF]
                    [--model {espnet,maskctc}] [--model_conf MODEL_CONF]
                    [--preencoder {sinc,linear,None}]
                    [--preencoder_conf PREENCODER_CONF]
                    [--encoder {conformer,transformer,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,longformer,branchformer}]
                    [--encoder_conf ENCODER_CONF]
                    [--postencoder {hugging_face_transformers,None}]
                    [--postencoder_conf POSTENCODER_CONF]
                    [--decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,transducer,mlm,hugging_face_transformers}]
                    [--decoder_conf DECODER_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images for plotting attention outputs. This option only applies to attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
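
With `--dist_init_method env://`, the four environment variables named above carry the rendezvous settings. The sketch below only shows how they could be inspected before launching a job; it is not how asr_train.py reads them, and the function name and example values are hypothetical.

```python
import os


def _to_int(value):
    """Convert an environment string to int, keeping None for unset values."""
    return int(value) if value is not None else None


def read_env_init(environ=None):
    """Collect the variables used by the env:// init method."""
    environ = os.environ if environ is None else environ
    return {
        "master_addr": environ.get("MASTER_ADDR"),
        "master_port": _to_int(environ.get("MASTER_PORT")),
        "world_size": _to_int(environ.get("WORLD_SIZE")),
        "rank": _to_int(environ.get("RANK")),
    }


# Hypothetical values for a two-process job on one machine.
settings = read_env_init(
    {"MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29500", "WORLD_SIZE": "2", "RANK": "0"}
)
```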

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair consisting of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used to decide early stopping. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used to select the best model. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, the interval is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
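
How `--patience` interacts with the mode given in `--early_stopping_criterion` can be sketched as below. This is an illustration of the description "number of epochs to wait without improvement", not the trainer's code: the function name is hypothetical, and stopping as soon as the number of epochs since the best value reaches `patience` is an assumption about the exact boundary.

```python
def should_stop_early(history, patience, mode="min"):
    """Decide whether training should stop.

    history: per-epoch values of the chosen criterion, e.g. the valid loss.
    patience: epochs to wait without improvement; None disables stopping.
    mode: "min" if lower is better, "max" if higher is better.
    """
    if patience is None or not history:
        return False
    best = min(history) if mode == "min" else max(history)
    # Index of the first epoch that reached the best value.
    best_epoch = history.index(best)
    epochs_since_best = len(history) - 1 - best_epoch
    # Assumption: stop once the wait reaches `patience` epochs.
    return epochs_since_best >= patience
```

With the default criterion `('valid', 'loss', 'min')`, `history` would hold the validation loss of each finished epoch.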

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
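
The --init_param format described above can be illustrated with a small parser. This is a minimal sketch, not ESPnet's own implementation: the names InitParamSpec and parse_init_param are hypothetical, and treating multiple exclude keys as comma-separated is an assumption.

```python
from typing import List, NamedTuple, Optional


class InitParamSpec(NamedTuple):
    """Fields of one '<file_path>:<src_key>:<dst_key>:<exclude_keys>' value."""

    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: List[str]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split an --init_param value into its (up to) four colon-separated fields."""
    parts = spec.split(":", 3)
    # Pad missing trailing fields so that 'model.pth' alone is also accepted.
    parts += [""] * (4 - len(parts))
    file_path, src_key, dst_key, exclude = parts
    return InitParamSpec(
        file_path=file_path,
        src_key=src_key or None,
        dst_key=dst_key or None,
        # Assumption: several exclude keys would be separated by commas.
        exclude_keys=[key for key in exclude.split(",") if key],
    )


if __name__ == "__main__":
    print(parse_init_param("some/where/model.pth"))
    print(parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed"))
```

Omitted fields come back as None (or an empty list), which mirrors the "load all parameters" case in the help text.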

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        Only the first column is read, so a 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in each mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is the largest length of the samples in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that have as close to the same number of 'bins' as possible, counted as the total length of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches that have as close to the same number of 'bins' as possible, but counted as the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by sample length. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
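
The difference between the "folded", "length", and "numel" batch types can be made concrete with a few lines of Python. This is a sketch of the arithmetic described in the help text only; the function names are hypothetical, and clamping the divisor to at least 1 (so that samples shorter than fold_length do not divide by zero) is an assumption rather than documented behaviour.

```python
from typing import List, Tuple


def folded_batch_size(base_batch_size: int, max_length: int, fold_length: int) -> int:
    """batch_size = base_batch_size // (L // fold_length), as in the help text."""
    # Assumption: clamp to 1 so that L < fold_length keeps the base batch size.
    folds = max(1, max_length // fold_length)
    return max(1, base_batch_size // folds)


def parse_shape_line(line: str) -> Tuple[str, List[int]]:
    """Parse a shape-file line such as 'utterance_id_a 1000,80'."""
    utt_id, shape = line.split(maxsplit=1)
    return utt_id, [int(dim) for dim in shape.split(",")]


def count_bins(shape: List[int], batch_type: str) -> int:
    """Bins contributed by one sample for the 'length' or 'numel' batch type."""
    if batch_type == "length":
        # Only the first dimension (the sequence length) is counted.
        return shape[0]
    if batch_type == "numel":
        # Every element of the feature is counted.
        total = 1
        for dim in shape:
            total *= dim
        return total
    raise ValueError(f"unsupported batch_type: {batch_type}")


if __name__ == "__main__":
    utt_id, shape = parse_shape_line("utterance_id_b 1453,80")
    print(utt_id, count_bins(shape, "length"), count_bins(shape, "numel"))
    print(folded_batch_size(20, shape[0], 500))
```

With --batch_bins, a sampler of this kind would keep adding samples to a mini-batch until the summed bins approach the configured limit.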

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it's less than 1, chunks overlap, and if it's bigger than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
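
The --chunk_length and --chunk_shift_ratio values can be read as follows. This is a rough sketch under stated assumptions: both function names are hypothetical, the '300-400' range is assumed to be inclusive at both ends, and the shift is assumed to be chunk_length multiplied by the ratio.

```python
from typing import List


def parse_chunk_length(spec: str) -> List[int]:
    """Expand '300', '300,400,500', or '300-400' into candidate chunk lengths."""
    candidates: List[int] = []
    for item in spec.split(","):
        if "-" in item:
            low, high = (int(value) for value in item.split("-"))
            # Assumption: the range includes both end points.
            candidates.extend(range(low, high + 1))
        else:
            candidates.append(int(item))
    return candidates


def chunk_starts(seq_len: int, chunk_len: int, shift_ratio: float) -> List[int]:
    """Start offsets of the chunks cut from a sequence of seq_len frames."""
    # Assumption: consecutive chunks are shifted by chunk_len * shift_ratio.
    shift = max(1, int(chunk_len * shift_ratio))
    # A sequence shorter than chunk_len yields no chunk at all.
    return list(range(0, seq_len - chunk_len + 1, shift))


if __name__ == "__main__":
    print(parse_chunk_length("300,400,500"))
    print(chunk_starts(1000, 500, 0.5))  # overlapping chunks
    print(chunk_starts(2000, 500, 2.0))  # gaps between chunks
```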

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "midi":
                        MIDI format types which are supported by sndfile: mid, midi, etc.

                           utterance_id_a a.mid
                           utterance_id_b b.mid
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integer numbers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integer numbers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader; currently supported for speaker diarization.

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys for the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
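
The 'path,name,type' triplet and two of the file types listed above can be parsed along these lines. This is an illustrative sketch and not the ESPnet data loader; all three function names are hypothetical.

```python
import re
from typing import List, Optional, Tuple


def parse_data_triplet(value: str) -> Tuple[str, str, str]:
    """Split 'some/path/a.scp,foo,sound' into (path, name, type)."""
    path, name, file_type = value.split(",")
    return path, name, file_type


def parse_text_int_line(line: str) -> Tuple[str, List[int]]:
    """Parse one 'text_int' line such as 'utterance_id_A 12 0 1 3'."""
    utt_id, *values = line.split()
    return utt_id, [int(value) for value in values]


def parse_rand_int_type(file_type: str) -> Optional[Tuple[int, int]]:
    """Return (low, high) for a type like 'rand_int_0_10', else None."""
    match = re.fullmatch(r"rand_int_(\d+)_(\d+)", file_type)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    print(parse_data_triplet("some/path/a.scp,foo,sound"))
    print(parse_text_int_line("utterance_id_A 12 0 1 3"))
    print(parse_rand_int_type("rand_int_0_10"))
```

Because the option is repeatable, a task would collect one such triplet per input feature.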

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

Task related:
  --token_list TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --input_size INPUT_SIZE
                        The number of input dimension of the feature (default: None)
  --ctc_conf CTC_CONF   The keyword arguments for CTC class. (default: {'dropout_rate': 0.0, 'ctc_type': 'builtin', 'reduce': True, 'ignore_nan_grad': None, 'zero_infinity': True})
  --joint_net_conf JOINT_NET_CONF
                        The keyword arguments for joint network class. (default: None)

Preprocess related:
  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn,hugging_face}
                        The text will be tokenized at the specified token level (default: bpe)
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)
  --speech_volume_normalize SPEECH_VOLUME_NORMALIZE
                        Scale the maximum amplitude to the given value. (default: None)
  --rir_scp RIR_SCP     The file path of rir scp file. (default: None)
  --rir_apply_prob RIR_APPLY_PROB
                        The probability of applying RIR convolution. (default: 1.0)
  --noise_scp NOISE_SCP
                        The file path of noise scp file. (default: None)
  --noise_apply_prob NOISE_APPLY_PROB
                        The probability of adding noise. (default: 1.0)
  --noise_db_range NOISE_DB_RANGE
                        The range of noise decibel level. (default: 13_15)
  --short_noise_thres SHORT_NOISE_THRES
                        If len(noise) / len(speech) is smaller than this threshold during dynamic mixing, a warning will be displayed. (default: 0.5)
  --frontend {default,sliding_window,s3prl,fused}
                        The frontend type (default: default)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: utterance_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --model {espnet,maskctc}
                        The model type (default: espnet)
  --model_conf MODEL_CONF
                        The keyword arguments for model (default: {})
  --preencoder {sinc,linear,None}
                        The preencoder type (default: None)
  --preencoder_conf PREENCODER_CONF
                        The keyword arguments for preencoder (default: {})
  --encoder {conformer,transformer,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,longformer,branchformer}
                        The encoder type (default: rnn)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --postencoder {hugging_face_transformers,None}
                        The postencoder type (default: None)
  --postencoder_conf POSTENCODER_CONF
                        The keyword arguments for postencoder (default: {})
  --decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,transducer,mlm,hugging_face_transformers}
                        The decoder type (default: rnn)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})

asr_transducer_inference.py

usage: asr_transducer_inference.py [-h] [--config CONFIG]
                                   [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                                   --output_dir OUTPUT_DIR [--ngpu NGPU]
                                   [--seed SEED]
                                   [--dtype {float16,float32,float64}]
                                   [--num_workers NUM_WORKERS]
                                   --data_path_and_name_and_type
                                   DATA_PATH_AND_NAME_AND_TYPE
                                   [--key_file KEY_FILE]
                                   [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                                   [--asr_train_config ASR_TRAIN_CONFIG]
                                   [--asr_model_file ASR_MODEL_FILE]
                                   [--lm_train_config LM_TRAIN_CONFIG]
                                   [--lm_file LM_FILE] [--model_tag MODEL_TAG]
                                   [--batch_size BATCH_SIZE] [--nbest NBEST]
                                   [--beam_size BEAM_SIZE]
                                   [--lm_weight LM_WEIGHT]
                                   [--beam_search_config BEAM_SEARCH_CONFIG]
                                   [--token_type {char,bpe,None}]
                                   [--bpemodel BPEMODEL]
                                   [--quantize_asr_model QUANTIZE_ASR_MODEL]
                                   [--quantize_modules [QUANTIZE_MODULES [QUANTIZE_MODULES ...]]]
                                   [--quantize_dtype {float16,qint8}]
                                   [--streaming STREAMING]
                                   [--chunk_size CHUNK_SIZE]
                                   [--left_context LEFT_CONTEXT]
                                   [--right_context RIGHT_CONTEXT]
                                   [--display_partial_hypotheses DISPLAY_PARTIAL_HYPOTHESES]

ASR Transducer Decoding

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --quantize_asr_model QUANTIZE_ASR_MODEL
                        Apply dynamic quantization to ASR model. (default:
                        False)
  --quantize_modules [QUANTIZE_MODULES [QUANTIZE_MODULES ...]]
                        Module names to apply dynamic quantization on. The
                        module names are provided as a list, where each name
                        is separated by a comma (e.g.:
                        --quantize-config=[Linear,LSTM,GRU]). Each specified
                        name should be an attribute of 'torch.nn', e.g.:
                        torch.nn.Linear, torch.nn.LSTM, torch.nn.GRU, ...
                        (default: None)
  --quantize_dtype {float16,qint8}
                        Dtype for dynamic quantization. (default: qint8)
  --streaming STREAMING
                        Whether to perform chunk-by-chunk inference. (default:
                        False)
  --chunk_size CHUNK_SIZE
                        Number of frames in chunk AFTER subsampling. (default:
                        16)
  --left_context LEFT_CONTEXT
                        Number of frames in left context of the chunk AFTER
                        subsampling. (default: 32)
  --right_context RIGHT_CONTEXT
                        Number of frames in right context of the chunk AFTER
                        subsampling. (default: 0)
  --display_partial_hypotheses DISPLAY_PARTIAL_HYPOTHESES
                        Whether to display partial hypotheses during chunk-by-
                        chunk inference. (default: False)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --asr_train_config ASR_TRAIN_CONFIG
                        ASR training configuration (default: None)
  --asr_model_file ASR_MODEL_FILE
                        ASR model parameter file (default: None)
  --lm_train_config LM_TRAIN_CONFIG
                        LM training configuration (default: None)
  --lm_file LM_FILE     LM parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If this option is specified,
                        *_train_config and *_file will be overwritten
                        (default: None)

Beam-search related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --nbest NBEST         Output N-best hypotheses (default: 1)
  --beam_size BEAM_SIZE
                        Beam size (default: 5)
  --lm_weight LM_WEIGHT
                        RNNLM weight (default: 1.0)
  --beam_search_config BEAM_SEARCH_CONFIG
                        The keyword arguments for transducer beam search.
                        (default: {})

Text converter related:
  --token_type {char,bpe,None}
                        The token type for ASR model. If not given, it is
                        taken from the training args (default: None)
  --bpemodel BPEMODEL   The model path of sentencepiece. If not given, it is
                        taken from the training args (default: None)
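
The relationship between --chunk_size, --left_context, and --right_context in streaming mode can be pictured with the sketch below. It only illustrates the frame arithmetic after subsampling; the function name chunk_windows is hypothetical, and how the encoder actually consumes the context frames is an assumption here.

```python
from typing import Dict, List, Tuple


def chunk_windows(
    num_frames: int,
    chunk_size: int = 16,
    left_context: int = 32,
    right_context: int = 0,
) -> List[Dict[str, Tuple[int, int]]]:
    """List each chunk and the frame span visible to it, as [start, end) pairs."""
    windows = []
    for start in range(0, num_frames, chunk_size):
        end = min(start + chunk_size, num_frames)
        windows.append(
            {
                "chunk": (start, end),
                # Assumption: context is clipped at the utterance boundaries.
                "visible": (
                    max(0, start - left_context),
                    min(num_frames, end + right_context),
                ),
            }
        )
    return windows


if __name__ == "__main__":
    for window in chunk_windows(100):
        print(window)
```

With the defaults (16, 32, 0), each chunk sees up to 32 earlier frames and no future frames, which keeps the look-ahead latency at zero.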

asr_transducer_train.py

usage: asr_transducer_train.py [-h] [--config CONFIG] [--print_config]
                               [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                               [--dry_run DRY_RUN]
                               [--iterator_type {sequence,chunk,task,none}]
                               [--output_dir OUTPUT_DIR] [--ngpu NGPU]
                               [--seed SEED] [--num_workers NUM_WORKERS]
                               [--num_att_plot NUM_ATT_PLOT]
                               [--dist_backend DIST_BACKEND]
                               [--dist_init_method DIST_INIT_METHOD]
                               [--dist_world_size DIST_WORLD_SIZE]
                               [--dist_rank DIST_RANK]
                               [--local_rank LOCAL_RANK]
                               [--dist_master_addr DIST_MASTER_ADDR]
                               [--dist_master_port DIST_MASTER_PORT]
                               [--dist_launcher {slurm,mpi,None}]
                               [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                               [--unused_parameters UNUSED_PARAMETERS]
                               [--sharded_ddp SHARDED_DDP]
                               [--cudnn_enabled CUDNN_ENABLED]
                               [--cudnn_benchmark CUDNN_BENCHMARK]
                               [--cudnn_deterministic CUDNN_DETERMINISTIC]
                               [--collect_stats COLLECT_STATS]
                               [--write_collected_feats WRITE_COLLECTED_FEATS]
                               [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                               [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                               [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                               [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                               [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                               [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                               [--grad_clip GRAD_CLIP]
                               [--grad_clip_type GRAD_CLIP_TYPE]
                               [--grad_noise GRAD_NOISE]
                               [--accum_grad ACCUM_GRAD]
                               [--no_forward_run NO_FORWARD_RUN]
                               [--resume RESUME]
                               [--train_dtype {float16,float32,float64}]
                               [--use_amp USE_AMP]
                               [--log_interval LOG_INTERVAL]
                               [--use_matplotlib USE_MATPLOTLIB]
                               [--use_tensorboard USE_TENSORBOARD]
                               [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                               [--use_wandb USE_WANDB]
                               [--wandb_project WANDB_PROJECT]
                               [--wandb_id WANDB_ID]
                               [--wandb_entity WANDB_ENTITY]
                               [--wandb_name WANDB_NAME]
                               [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                               [--detect_anomaly DETECT_ANOMALY]
                               [--pretrain_path PRETRAIN_PATH]
                               [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                               [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                               [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                               [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                               [--batch_size BATCH_SIZE]
                               [--valid_batch_size VALID_BATCH_SIZE]
                               [--batch_bins BATCH_BINS]
                               [--valid_batch_bins VALID_BATCH_BINS]
                               [--train_shape_file TRAIN_SHAPE_FILE]
                               [--valid_shape_file VALID_SHAPE_FILE]
                               [--batch_type {unsorted,sorted,folded,length,numel}]
                               [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                               [--fold_length FOLD_LENGTH]
                               [--sort_in_batch {descending,ascending}]
                               [--sort_batch {descending,ascending}]
                               [--multiple_iterator MULTIPLE_ITERATOR]
                               [--chunk_length CHUNK_LENGTH]
                               [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                               [--num_cache_chunks NUM_CACHE_CHUNKS]
                               [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                               [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                               [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                               [--max_cache_size MAX_CACHE_SIZE]
                               [--max_cache_fd MAX_CACHE_FD]
                               [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                               [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                               [--optim_conf OPTIM_CONF]
                               [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                               [--scheduler_conf SCHEDULER_CONF]
                               [--token_list TOKEN_LIST]
                               [--input_size INPUT_SIZE] [--init INIT]
                               [--model_conf MODEL_CONF]
                               [--encoder_conf ENCODER_CONF]
                               [--joint_network_conf JOINT_NETWORK_CONF]
                               [--use_preprocessor USE_PREPROCESSOR]
                               [--token_type {bpe,char,word,phn}]
                               [--bpemodel BPEMODEL]
                               [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                               [--cleaner {None,tacotron,jaconv,vietnamese}]
                               [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                               [--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
                               [--rir_scp RIR_SCP]
                               [--rir_apply_prob RIR_APPLY_PROB]
                               [--noise_scp NOISE_SCP]
                               [--noise_apply_prob NOISE_APPLY_PROB]
                               [--noise_db_range NOISE_DB_RANGE]
                               [--frontend {default,sliding_window}]
                               [--frontend_conf FRONTEND_CONF]
                               [--specaug {specaug,None}]
                               [--specaug_conf SPECAUG_CONF]
                               [--normalize {global_mvn,utterance_mvn,None}]
                               [--normalize_conf NORMALIZE_CONF]
                               [--decoder {rnn,stateless}]
                               [--decoder_conf DECODER_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images for plotting the attention outputs. This option makes sense only for attention-based models. The attention plot can also be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel  (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
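
As a rough pre-flight sketch for the env:// initialization described above, one could check which of the four environment variables are still unset before launching. The helper name `missing_dist_env` and the example values are hypothetical and not part of ESPnet; the tool may also fill some of these variables itself from the --dist_* options.

```python
import os

# Variable names taken from the --dist_init_method help text above.
REQUIRED_ENV = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def missing_dist_env(environ=None):
    """Return the names from REQUIRED_ENV that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV if not env.get(name)]


# Hypothetical environment in which RANK has not been exported yet.
example_env = {
    "MASTER_ADDR": "10.0.0.1",
    "MASTER_PORT": "29500",
    "WORLD_SIZE": "2",
}
print(missing_dist_env(example_env))
```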

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Perform on "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion whose value is passed to the lr scheduler. Give a pair consisting of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used to decide early stopping. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used to select the best model. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of the used p-norm for gradient clip. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding or training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, the interval is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
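
The (phase, criterion, mode) triples used by --early_stopping_criterion and --best_model_criterion, together with --keep_nbest_models, can be illustrated with a minimal sketch. The function `rank_epochs` and the per-epoch numbers below are made up for illustration; this is not the trainer's actual bookkeeping.

```python
def rank_epochs(stats, phase, criterion, mode, nbest):
    """Return up to `nbest` epoch numbers ordered from best to worst."""
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    scored = [(epoch, s[phase][criterion]) for epoch, s in stats.items()]
    # For "max" criteria (e.g. accuracy) larger is better, so sort descending.
    scored.sort(key=lambda item: item[1], reverse=(mode == "max"))
    return [epoch for epoch, _ in scored[:nbest]]


# Hypothetical statistics for three epochs.
stats = {
    1: {"train": {"loss": 2.0, "acc": 0.50}, "valid": {"loss": 2.2, "acc": 0.45}},
    2: {"train": {"loss": 1.5, "acc": 0.62}, "valid": {"loss": 1.9, "acc": 0.55}},
    3: {"train": {"loss": 1.1, "acc": 0.71}, "valid": {"loss": 2.0, "acc": 0.58}},
}

best_by_valid_loss = rank_epochs(stats, "valid", "loss", "min", nbest=2)
best_by_valid_acc = rank_epochs(stats, "valid", "acc", "max", nbest=2)
print(best_by_valid_loss, best_by_valid_acc)
```

Note that the two criteria can disagree on the best epoch, which is one reason several criteria can be listed at once.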

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
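
To make the '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format of --init_param concrete, here is a small parsing sketch. `InitParamSpec` and `parse_init_param` are hypothetical names, and splitting multiple exclude keys on commas is an assumption that the help text does not state.

```python
from typing import NamedTuple, Optional, Tuple


class InitParamSpec(NamedTuple):
    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: Tuple[str, ...]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split an --init_param value into its (up to) four fields."""
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':'-separated fields in {spec!r}")
    # Pad the missing trailing fields with empty strings.
    parts += [""] * (4 - len(parts))
    file_path, src_key, dst_key, excludes = parts
    # Assumption: several exclude keys would be separated by commas.
    exclude_keys = tuple(k for k in excludes.split(",") if k)
    return InitParamSpec(file_path, src_key or None, dst_key or None, exclude_keys)


print(parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed"))
```

This simple split would not handle file paths that themselves contain ':'.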

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no special features and simply creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        Only the first column is read, so a 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in each mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L refers to the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler builds mini-batches whose number of 'bins', counted as the total length of each feature in the mini-batch, is as close to equal as possible. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Like LengthBatchSampler, this sampler builds mini-batches whose number of 'bins' is as close to equal as possible, but it counts the total number of elements of each feature instead of the length. This sampler therefore requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
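
The shape-file format and the "folded" formula batch_size = base_batch_size // (L // fold_length) can be sketched as follows. The helper names and the fold_length value of 800 are hypothetical. The guard for L < fold_length, where the formula as written would divide by zero, is an assumption of this sketch rather than documented behaviour.

```python
def parse_shape_line(line):
    """Parse 'utterance_id 1000,80' into ('utterance_id', (1000, 80))."""
    key, shape = line.split(maxsplit=1)
    return key, tuple(int(v) for v in shape.split(","))


def folded_batch_size(base_batch_size, longest, fold_length, min_batch_size=1):
    """Apply batch_size = base_batch_size // (L // fold_length)."""
    folds = longest // fold_length
    if folds == 0:
        # Assumption: samples shorter than fold_length keep the base size.
        return base_batch_size
    return max(min_batch_size, base_batch_size // folds)


lines = ["utterance_id_a 1000,80", "utterance_id_b 1453,80", "utterance_id_c 1241,80"]
shapes = dict(parse_shape_line(line) for line in lines)

# "sorted"/"folded"/"length" use only the first dimension of each shape.
longest = max(shape[0] for shape in shapes.values())
# "numel" counts every element, e.g. 1000 * 80 for utterance_id_a.
numel_a = shapes["utterance_id_a"][0] * shapes["utterance_id_a"][1]

print(longest, numel_a, folded_batch_size(20, longest, fold_length=800))
```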

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, chunks overlap, and if it is bigger than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
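
The three accepted forms of --chunk_length and the effect of --chunk_shift_ratio can be illustrated with a short sketch. Both function names are hypothetical. Treating '300-400' as an inclusive range and computing the shift as chunk_length * chunk_shift_ratio are assumptions made here for illustration.

```python
def parse_chunk_lengths(spec):
    """Expand '300', '300,400,500', or '300-400' into a list of lengths."""
    lengths = []
    for item in spec.split(","):
        if "-" in item:
            low, high = (int(v) for v in item.split("-"))
            # Assumption: the range includes both end points.
            lengths.extend(range(low, high + 1))
        else:
            lengths.append(int(item))
    return lengths


def chunk_starts(seq_len, chunk_length, shift_ratio):
    """Start offsets of full-length chunks inside one sequence."""
    shift = max(1, int(chunk_length * shift_ratio))
    # An empty list means the sequence is shorter than the chunk.
    return list(range(0, seq_len - chunk_length + 1, shift))


print(parse_chunk_lengths("300,400,500"))
print(chunk_starts(seq_len=1200, chunk_length=500, shift_ratio=0.5))
```

With a ratio of 0.5 consecutive chunks overlap by half; with a ratio of 2.0 every other chunk-sized stretch of the sequence is skipped.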

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types which are supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "midi":
                        MIDI format types which are supported by sndfile: mid, midi, etc.

                           utterance_id_a a.mid
                           utterance_id_b b.mid
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integer numbers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integer numbers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader, currently supported for speaker diarization.

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys in the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
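
The 'path,name,type' triple and a few of the line-oriented file types listed above can be read with very little code. The following is a minimal sketch with hypothetical helper names; it only mirrors the formats shown in the help text and is not the loader used by the tools.

```python
import re


def parse_data_spec(spec):
    """Split 'some/path/a.scp,foo,sound' into (path, name, file type)."""
    path, name, file_type = spec.split(",")
    return path, name, file_type


def parse_text_int_line(line):
    """'utterance_id_A 12 0 1 3' -> ('utterance_id_A', [12, 0, 1, 3])."""
    key, *values = line.split()
    return key, [int(v) for v in values]


def parse_csv_float_line(line):
    """'utterance_id_A 12.,3.1' -> ('utterance_id_A', [12.0, 3.1])."""
    key, values = line.split(maxsplit=1)
    return key, [float(v) for v in values.split(",")]


RAND_INT = re.compile(r"rand_int_(\d+)_(\d+)")


def rand_int_bounds(file_type):
    """Extract the lower and upper value from e.g. 'rand_int_0_10'."""
    match = RAND_INT.fullmatch(file_type)
    if match is None:
        raise ValueError(f"not a rand_int type: {file_type!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_data_spec("some/path/a.scp,foo,sound"))
print(parse_text_int_line("utterance_id_A 12 0 1 3"))
print(parse_csv_float_line("utterance_id_B 3.,3.12,1.1"))
print(rand_int_bounds("rand_int_0_10"))
```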

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related.

  --token_list TOKEN_LIST
                        Integer-string mapper for tokens. (default: None)
  --input_size INPUT_SIZE
                        The number of dimensions for input features. (default: None)
  --init INIT           Type of model initialization to use. (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for the model class. (default: {'transducer_weight': 1.0, 'fastemit_lambda': 0.0, 'auxiliary_ctc_weight': 0.0, 'auxiliary_ctc_dropout_rate': 0.0, 'auxiliary_lm_loss_weight': 0.0, 'auxiliary_lm_loss_smoothing': 0.05, 'ignore_id': -1, 'sym_space': '<space>', 'sym_blank': '<blank>', 'report_cer': False, 'report_wer': False, 'extract_feats_in_collect_stats': True})
  --encoder_conf ENCODER_CONF
                        The keyword arguments for the encoder class. (default: {})
  --joint_network_conf JOINT_NETWORK_CONF
                        The keyword arguments for the joint network class. (default: {})

  Preprocess related.

  --use_preprocessor USE_PREPROCESSOR
                        Whether to apply preprocessing to input data. (default: True)
  --token_type {bpe,char,word,phn}
                        The type of tokens to use during tokenization. (default: bpe)
  --bpemodel BPEMODEL   The path of the sentencepiece model. (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        The 'non_linguistic_symbols' file path. (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese}
                        Text cleaner to use. (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        g2p method to use if --token_type=phn. (default: None)
  --speech_volume_normalize SPEECH_VOLUME_NORMALIZE
                        Normalization value for maximum amplitude scaling. (default: None)
  --rir_scp RIR_SCP     The RIR SCP file path. (default: None)
  --rir_apply_prob RIR_APPLY_PROB
                        The probability of the applied RIR convolution. (default: 1.0)
  --noise_scp NOISE_SCP
                        The path of noise SCP file. (default: None)
  --noise_apply_prob NOISE_APPLY_PROB
                        The probability of the applied noise addition. (default: 1.0)
  --noise_db_range NOISE_DB_RANGE
                        The range of the noise decibel level. (default: 13_15)
  --frontend {default,sliding_window}
                        The frontend type (default: default)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: utterance_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --decoder {rnn,stateless}
                        The decoder type (default: rnn)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})
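
The default value 13_15 of --noise_db_range suggests a 'low_high' encoding. The sketch below shows one way such a value could be interpreted together with --noise_apply_prob. The function names, the handling of a single number, and the uniform sampling are all assumptions for illustration, not the preprocessor's confirmed behaviour.

```python
import random


def parse_noise_db_range(spec):
    """Parse '13_15' into (13.0, 15.0); a single number gives low == high."""
    parts = [float(p) for p in spec.split("_")]
    if len(parts) == 1:
        low = high = parts[0]
    elif len(parts) == 2:
        low, high = parts
    else:
        raise ValueError(f"expected 'low_high', got {spec!r}")
    return low, high


def sample_noise_db(spec, apply_prob, rng):
    """Return a noise level in dB, or None when no noise is applied."""
    if rng.random() >= apply_prob:
        return None
    low, high = parse_noise_db_range(spec)
    # Assumption: the level is drawn uniformly from the range.
    return rng.uniform(low, high)


rng = random.Random(0)
print(parse_noise_db_range("13_15"))
print(sample_noise_db("13_15", apply_prob=1.0, rng=rng))
```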

diar_inference.py

usage: diar_inference.py [-h] [--config CONFIG]
                         [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                         --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                         [--dtype {float16,float32,float64}] [--fs FS]
                         [--num_workers NUM_WORKERS]
                         --data_path_and_name_and_type
                         DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                         [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                         [--train_config TRAIN_CONFIG]
                         [--model_file MODEL_FILE] [--model_tag MODEL_TAG]
                         [--batch_size BATCH_SIZE]
                         [--segment_size SEGMENT_SIZE] [--hop_size HOP_SIZE]
                         [--show_progressbar SHOW_PROGRESSBAR]
                         [--num_spk NUM_SPK] [--enh_s2t_task ENH_S2T_TASK]
                         [--normalize_segment_scale NORMALIZE_SEGMENT_SCALE]
                         [--normalize_output_wav NORMALIZE_OUTPUT_WAV]
                         [--multiply_diar_result MULTIPLY_DIAR_RESULT]

Speaker Diarization inference

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --fs FS               Sampling rate (default: 8000)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --train_config TRAIN_CONFIG
                        Diarization training configuration (default: None)
  --model_file MODEL_FILE
                        Diarization model parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If this option is specified,
                        train_config and model_file will be overwritten
                        (default: None)

Data loading related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)

Diarize speech related:
  --segment_size SEGMENT_SIZE
                        Segment length in seconds for segment-wise speaker
                        diarization (default: None)
  --hop_size HOP_SIZE   Hop length in seconds for segment-wise speech
                        enhancement/separation (default: None)
  --show_progressbar SHOW_PROGRESSBAR
                        Whether to show a progress bar when performing
                        segment-wise speaker diarization (default: False)
  --num_spk NUM_SPK     Predetermined number of speakers for inference
                        (default: None)

Enh + Diar related:
  --enh_s2t_task ENH_S2T_TASK
                        enhancement and diarization joint model (default:
                        False)
  --normalize_segment_scale NORMALIZE_SEGMENT_SCALE
                        Whether to normalize the energy of the separated
                        streams in each segment (default: False)
  --normalize_output_wav NORMALIZE_OUTPUT_WAV
                        Whether to normalize the predicted wav to [-1~1]
                        (default: False)
  --multiply_diar_result MULTIPLY_DIAR_RESULT
                        Whether to multiply diar results to separated waves
                        (default: False)
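
To give a feel for how --segment_size and --hop_size (both in seconds) relate to --fs, the sketch below converts them into sample offsets for segment-wise processing. The function `segment_bounds` and the windowing scheme are hypothetical; the help text does not specify how diar_inference.py actually cuts or merges segments.

```python
def segment_bounds(num_samples, fs, segment_size, hop_size):
    """Return (start, end) sample indices of successive segments."""
    seg = int(segment_size * fs)
    hop = int(hop_size * fs)
    if seg <= 0 or hop <= 0:
        raise ValueError("segment_size and hop_size must be positive")
    bounds = []
    start = 0
    while True:
        # The last segment is truncated at the end of the recording.
        end = min(start + seg, num_samples)
        bounds.append((start, end))
        if end >= num_samples:
            break
        start += hop
    return bounds


# 5 seconds of audio at the default 8000 Hz, 2 s segments, 1 s hop.
print(segment_bounds(num_samples=40000, fs=8000, segment_size=2.0, hop_size=1.0))
```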

diar_train.py

usage: diar_train.py [-h] [--config CONFIG] [--print_config]
                     [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                     [--dry_run DRY_RUN]
                     [--iterator_type {sequence,chunk,task,none}]
                     [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                     [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                     [--dist_backend DIST_BACKEND]
                     [--dist_init_method DIST_INIT_METHOD]
                     [--dist_world_size DIST_WORLD_SIZE]
                     [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                     [--dist_master_addr DIST_MASTER_ADDR]
                     [--dist_master_port DIST_MASTER_PORT]
                     [--dist_launcher {slurm,mpi,None}]
                     [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                     [--unused_parameters UNUSED_PARAMETERS]
                     [--sharded_ddp SHARDED_DDP]
                     [--cudnn_enabled CUDNN_ENABLED]
                     [--cudnn_benchmark CUDNN_BENCHMARK]
                     [--cudnn_deterministic CUDNN_DETERMINISTIC]
                     [--collect_stats COLLECT_STATS]
                     [--write_collected_feats WRITE_COLLECTED_FEATS]
                     [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                     [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                     [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                     [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                     [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                     [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                     [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                     [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                     [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                     [--train_dtype {float16,float32,float64}]
                     [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                     [--use_matplotlib USE_MATPLOTLIB]
                     [--use_tensorboard USE_TENSORBOARD]
                     [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                     [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                     [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                     [--wandb_name WANDB_NAME]
                     [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                     [--detect_anomaly DETECT_ANOMALY]
                     [--pretrain_path PRETRAIN_PATH]
                     [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                     [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                     [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                     [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                     [--batch_size BATCH_SIZE]
                     [--valid_batch_size VALID_BATCH_SIZE]
                     [--batch_bins BATCH_BINS]
                     [--valid_batch_bins VALID_BATCH_BINS]
                     [--train_shape_file TRAIN_SHAPE_FILE]
                     [--valid_shape_file VALID_SHAPE_FILE]
                     [--batch_type {unsorted,sorted,folded,length,numel}]
                     [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                     [--fold_length FOLD_LENGTH]
                     [--sort_in_batch {descending,ascending}]
                     [--sort_batch {descending,ascending}]
                     [--multiple_iterator MULTIPLE_ITERATOR]
                     [--chunk_length CHUNK_LENGTH]
                     [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                     [--num_cache_chunks NUM_CACHE_CHUNKS]
                     [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                     [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                     [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                     [--max_cache_size MAX_CACHE_SIZE]
                     [--max_cache_fd MAX_CACHE_FD]
                     [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                     [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                     [--optim_conf OPTIM_CONF]
                     [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                     [--scheduler_conf SCHEDULER_CONF] [--num_spk NUM_SPK]
                     [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                     [--input_size INPUT_SIZE] [--model_conf MODEL_CONF]
                     [--use_preprocessor USE_PREPROCESSOR]
                     [--frontend {default,sliding_window,s3prl,None}]
                     [--frontend_conf FRONTEND_CONF]
                     [--specaug {specaug,None}] [--specaug_conf SPECAUG_CONF]
                     [--normalize {global_mvn,utterance_mvn,None}]
                     [--normalize_conf NORMALIZE_CONF]
                     [--encoder {conformer,transformer,rnn}]
                     [--encoder_conf ENCODER_CONF] [--decoder {linear}]
                     [--decoder_conf DECODER_CONF]
                     [--label_aggregator {label_aggregator}]
                     [--label_aggregator_conf LABEL_AGGREGATOR_CONF]
                     [--attractor {rnn,None}]
                     [--attractor_conf ATTRACTOR_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images to plot from the attention outputs. This option only makes sense for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
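
The "env://" init method described above takes its settings from four environment variables. A minimal sketch of collecting them, using only the standard library; the helper name read_env_init_settings is hypothetical and is not part of ESPnet or PyTorch:

```python
import os

# Variable names taken from the --dist_init_method help text.
REQUIRED_ENV_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def read_env_init_settings(environ=None):
    """Collect the variables that an "env://" init method relies on.

    Hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_ENV_VARS if name not in environ]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "master_addr": environ["MASTER_ADDR"],
        "master_port": int(environ["MASTER_PORT"]),
        "world_size": int(environ["WORLD_SIZE"]),
        "rank": int(environ["RANK"]),
    }


# Example values are made up; a launcher would normally export them.
settings = read_env_init_settings(
    {
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": "29500",
        "WORLD_SIZE": "2",
        "RANK": "0",
    }
)
print(settings)
```

Checking for missing variables before starting a job tends to give a clearer error than a timeout during process-group setup.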

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair: the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used to decide early stopping. Give three values: the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used to select the best model. Give a triple: the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every this number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
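
The interaction between --patience and the mode given in --early_stopping_criterion can be sketched as follows. This is an illustration of the general technique, not ESPnet's trainer code; the function name should_stop and the exact comparison (strictly more than patience epochs since the best value) are assumptions:

```python
def should_stop(values, patience, mode="min"):
    """Return True once the best value is more than `patience` epochs old.

    values: per-epoch criterion values, e.g. the validation loss.
    mode: "min" if smaller is better, "max" if larger is better.
    Hypothetical helper; the strict ">" comparison is an assumption.
    """
    if patience is None or not values:
        return False  # no patience given: never stop early
    pick = min if mode == "min" else max
    best_epoch = values.index(pick(values))  # first epoch reaching the best
    epochs_without_improvement = len(values) - 1 - best_epoch
    return epochs_without_improvement > patience


valid_loss = [1.0, 0.8, 0.9, 0.85, 0.95]  # made-up values
print(should_stop(valid_loss, patience=2, mode="min"))
print(should_stop(valid_loss, patience=3, mode="min"))
```

With the default --patience of None, training simply runs until --max_epoch is reached.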

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states from the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
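
The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format of --init_param can be read as up to four colon-separated fields, with trailing fields optional. A minimal parsing sketch follows; parse_init_param is a hypothetical helper, not ESPnet's loader, and treating several excluded keys as comma-separated is an assumption:

```python
def parse_init_param(spec):
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into fields.

    Hypothetical helper for illustration. Paths that themselves contain
    ':' are not handled by this sketch.
    """
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':'-separated fields in {spec!r}")
    # Pad the missing trailing fields, then treat empty fields as not given.
    parts += [""] * (4 - len(parts))
    file_path, src_key, dst_key, exclude = parts
    return {
        "file_path": file_path,
        "src_key": src_key or None,
        "dst_key": dst_key or None,
        # Assumption: multiple excluded keys are separated by commas.
        "exclude_keys": exclude.split(",") if exclude else [],
    }


for spec in (
    "some/where/model.pth",
    "some/where/model.pth:decoder:decoder",
    "some/where/model.pth:decoder:decoder:decoder.embed",
):
    print(parse_init_param(spec))
```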

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The first column is used, so a 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L refers to the largest length of samples in the mini-batch. This sampler requires the same length information as SortedBatchSampler

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches whose number of 'bins' is as equal as possible, counted by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches whose number of 'bins' is as equal as possible, counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full information of the dimensions of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
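
The "folded" formula and the difference between "length" and "numel" bins can be made concrete with a small sketch. It follows the help text literally rather than ESPnet's sampler code; the function names, the handling of L < fold_length, the min_batch_size floor, and ignoring padding are all assumptions:

```python
from math import prod


def folded_batch_size(base_batch_size, longest_length, fold_length, min_batch_size=1):
    """batch_size = base_batch_size // (L // fold_length), as in the help text.

    Assumption: when L < fold_length the divisor would be 0, so the base
    size is kept; the result is also floored at min_batch_size.
    """
    folds = longest_length // fold_length
    if folds == 0:
        return base_batch_size
    return max(min_batch_size, base_batch_size // folds)


def count_bins(shapes, batch_type):
    """Bins used by a set of samples (padding is ignored in this sketch).

    "length": sum of the first dimension of each shape.
    "numel":  sum of the number of elements of each shape.
    """
    if batch_type == "length":
        return sum(shape[0] for shape in shapes)
    if batch_type == "numel":
        return sum(prod(shape) for shape in shapes)
    raise ValueError(f"unsupported batch_type: {batch_type}")


# Shapes from the shape_file example: utterance_id_a/b/c.
shapes = [(1000, 80), (1453, 80), (1241, 80)]
print(folded_batch_size(20, longest_length=1600, fold_length=800))
print(count_bins(shapes, "length"), count_bins(shapes, "numel"))
```

Because "numel" multiplies in the feature dimension, a --batch_bins value tuned for "length" is usually far too small for "numel".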

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it's less than 1, chunks overlap, and if it's bigger than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
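
The three accepted forms of --chunk_length and the effect of --chunk_shift_ratio can be sketched as below. Both helpers are hypothetical, and treating the '300-400' range as inclusive of its upper bound is an assumption, not something the help text states:

```python
def parse_chunk_lengths(spec):
    """Expand '300', '300,400,500' or '300-400' into candidate lengths."""
    lengths = []
    for item in spec.split(","):
        if "-" in item:
            low, high = (int(v) for v in item.split("-"))
            # Assumption: the range includes its upper bound.
            lengths.extend(range(low, high + 1))
        else:
            lengths.append(int(item))
    return lengths


def chunk_starts(seq_length, chunk_length, shift_ratio):
    """Start offsets of full chunks; shift = chunk_length * shift_ratio."""
    if seq_length < chunk_length:
        return []  # too short: the sample would be discarded
    shift = max(1, int(chunk_length * shift_ratio))
    return list(range(0, seq_length - chunk_length + 1, shift))


print(parse_chunk_lengths("300,400,500"))
print(chunk_starts(1200, chunk_length=500, shift_ratio=0.5))  # overlapping
print(chunk_starts(1200, chunk_length=500, shift_ratio=1.5))  # gaps
```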

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types which are supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "midi":
                        MIDI file formats: mid, midi, etc.

                           utterance_id_a a.mid
                           utterance_id_b b.mid
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file which contains a sequence of integer numbers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file which contains a sequence of integer numbers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file which contains a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file which contains a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> import h5py
                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader, currently supported for speaker diarization.

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys for mini-batches, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
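
The "path,name,type" triple and the line layouts of the "text_int" and "csv_int" file types can be illustrated with a few lines of parsing. These helpers are hypothetical and only mirror the formats shown above; they are not ESPnet's dataset readers:

```python
def parse_data_triple(value):
    """'some/path/a.scp,foo,sound' -> (path, key name, file type)."""
    path, name, file_type = value.split(",")
    return path, name, file_type


def parse_text_int_line(line):
    """'utterance_id_A 12 0 1 3' -> ('utterance_id_A', [12, 0, 1, 3])."""
    key, *values = line.split()
    return key, [int(v) for v in values]


def parse_csv_int_line(line):
    """'utterance_id_A 100,80' -> ('utterance_id_A', [100, 80])."""
    key, values = line.strip().split(maxsplit=1)
    return key, [int(v) for v in values.split(",")]


print(parse_data_triple("some/path/a.scp,foo,sound"))
print(parse_text_int_line("utterance_id_A 12 0 1 3"))
print(parse_csv_int_line("utterance_id_A 100,80"))
```

The "csv_int" layout is the same one used by the shape files passed to --train_shape_file and --valid_shape_file.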

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --num_spk NUM_SPK     The number of speakers (for each recording) used in system training (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --input_size INPUT_SIZE
                        The number of input dimensions of the feature (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'diar_weight': 1.0, 'attractor_weight': 1.0})

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --frontend {default,sliding_window,s3prl,None}
                        The frontend type (default: default)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: utterance_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --encoder {conformer,transformer,rnn}
                        The encoder type (default: transformer)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --decoder {linear}    The decoder type (default: linear)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})
  --label_aggregator {label_aggregator}
                        The label_aggregator type (default: label_aggregator)
  --label_aggregator_conf LABEL_AGGREGATOR_CONF
                        The keyword arguments for label_aggregator (default: {})
  --attractor {rnn,None}
                        The attractor type (default: None)
  --attractor_conf ATTRACTOR_CONF
                        The keyword arguments for attractor (default: {})

enh_inference.py

usage: enh_inference.py [-h] [--config CONFIG]
                        [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                        --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                        [--dtype {float16,float32,float64}] [--fs FS]
                        [--num_workers NUM_WORKERS]
                        --data_path_and_name_and_type
                        DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                        [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                        [--normalize_output_wav NORMALIZE_OUTPUT_WAV]
                        [--train_config TRAIN_CONFIG]
                        [--model_file MODEL_FILE] [--model_tag MODEL_TAG]
                        [--inference_config INFERENCE_CONFIG]
                        [--enh_s2t_task ENH_S2T_TASK]
                        [--batch_size BATCH_SIZE]
                        [--segment_size SEGMENT_SIZE] [--hop_size HOP_SIZE]
                        [--normalize_segment_scale NORMALIZE_SEGMENT_SCALE]
                        [--show_progressbar SHOW_PROGRESSBAR]
                        [--ref_channel REF_CHANNEL]

Frontend inference

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --fs FS               Sampling rate (default: 8000)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

Output data related:
  --normalize_output_wav NORMALIZE_OUTPUT_WAV
                        Whether to normalize the predicted wav to [-1, 1]
                        (default: False)

The model configuration related:
  --train_config TRAIN_CONFIG
                        Training configuration file (default: None)
  --model_file MODEL_FILE
                        Model parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If this option is specified,
                        train_config and model_file will be overwritten
                        (default: None)
  --inference_config INFERENCE_CONFIG
                        Optional configuration file for overwriting enh model
                        attributes during inference (default: None)
  --enh_s2t_task ENH_S2T_TASK
                        enhancement and asr joint model (default: False)

Data loading related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)

SeparateSpeech related:
  --segment_size SEGMENT_SIZE
                        Segment length in seconds for segment-wise speech
                        enhancement/separation (default: None)
  --hop_size HOP_SIZE   Hop length in seconds for segment-wise speech
                        enhancement/separation (default: None)
  --normalize_segment_scale NORMALIZE_SEGMENT_SCALE
                        Whether to normalize the energy of the separated
                        streams in each segment (default: False)
  --show_progressbar SHOW_PROGRESSBAR
                        Whether to show a progress bar when performing
                        segment-wise speech enhancement/separation (default:
                        False)
  --ref_channel REF_CHANNEL
                        If not None, this will overwrite the ref_channel
                        defined in the separator module (for multi-channel
                        speech processing) (default: None)
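
Since --segment_size and --hop_size are given in seconds, they have to be converted to samples using --fs before a long recording can be cut into segments. The sketch below shows one way to compute the segment boundaries; it is an illustration under stated assumptions (the helper name segment_bounds, a shorter final segment, no overlap handling), not the procedure used by enh_inference.py:

```python
def segment_bounds(num_samples, fs, segment_size, hop_size):
    """Return (start, end) sample indices of consecutive segments.

    segment_size and hop_size are in seconds, fs in Hz.
    Assumption: the last segment is allowed to be shorter than the rest.
    """
    seg = int(segment_size * fs)
    hop = int(hop_size * fs)
    if seg <= 0 or hop <= 0:
        raise ValueError("segment_size and hop_size must be positive")
    bounds = []
    start = 0
    while True:
        end = min(start + seg, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break
        start += hop
    return bounds


# A 5-second signal at the default --fs of 8000 Hz,
# with 2-second segments and a 1-second hop (made-up settings).
print(segment_bounds(5 * 8000, fs=8000, segment_size=2.0, hop_size=1.0))
```

A hop smaller than the segment length yields overlapping segments, which then need to be merged when the enhanced output is reassembled.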

enh_s2t_train.py

usage: enh_s2t_train.py [-h] [--config CONFIG] [--print_config]
                        [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                        [--dry_run DRY_RUN]
                        [--iterator_type {sequence,chunk,task,none}]
                        [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                        [--num_workers NUM_WORKERS]
                        [--num_att_plot NUM_ATT_PLOT]
                        [--dist_backend DIST_BACKEND]
                        [--dist_init_method DIST_INIT_METHOD]
                        [--dist_world_size DIST_WORLD_SIZE]
                        [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                        [--dist_master_addr DIST_MASTER_ADDR]
                        [--dist_master_port DIST_MASTER_PORT]
                        [--dist_launcher {slurm,mpi,None}]
                        [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                        [--unused_parameters UNUSED_PARAMETERS]
                        [--sharded_ddp SHARDED_DDP]
                        [--cudnn_enabled CUDNN_ENABLED]
                        [--cudnn_benchmark CUDNN_BENCHMARK]
                        [--cudnn_deterministic CUDNN_DETERMINISTIC]
                        [--collect_stats COLLECT_STATS]
                        [--write_collected_feats WRITE_COLLECTED_FEATS]
                        [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                        [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                        [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                        [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                        [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                        [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                        [--grad_clip GRAD_CLIP]
                        [--grad_clip_type GRAD_CLIP_TYPE]
                        [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                        [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                        [--train_dtype {float16,float32,float64}]
                        [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                        [--use_matplotlib USE_MATPLOTLIB]
                        [--use_tensorboard USE_TENSORBOARD]
                        [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                        [--use_wandb USE_WANDB]
                        [--wandb_project WANDB_PROJECT] [--wandb_id WANDB_ID]
                        [--wandb_entity WANDB_ENTITY]
                        [--wandb_name WANDB_NAME]
                        [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                        [--detect_anomaly DETECT_ANOMALY]
                        [--pretrain_path PRETRAIN_PATH]
                        [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                        [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                        [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                        [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                        [--batch_size BATCH_SIZE]
                        [--valid_batch_size VALID_BATCH_SIZE]
                        [--batch_bins BATCH_BINS]
                        [--valid_batch_bins VALID_BATCH_BINS]
                        [--train_shape_file TRAIN_SHAPE_FILE]
                        [--valid_shape_file VALID_SHAPE_FILE]
                        [--batch_type {unsorted,sorted,folded,length,numel}]
                        [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                        [--fold_length FOLD_LENGTH]
                        [--sort_in_batch {descending,ascending}]
                        [--sort_batch {descending,ascending}]
                        [--multiple_iterator MULTIPLE_ITERATOR]
                        [--chunk_length CHUNK_LENGTH]
                        [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                        [--num_cache_chunks NUM_CACHE_CHUNKS]
                        [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                        [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                        [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                        [--max_cache_size MAX_CACHE_SIZE]
                        [--max_cache_fd MAX_CACHE_FD]
                        [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                        [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                        [--optim_conf OPTIM_CONF]
                        [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                        [--scheduler_conf SCHEDULER_CONF]
                        [--token_list TOKEN_LIST]
                        [--src_token_list SRC_TOKEN_LIST]
                        [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                        [--input_size INPUT_SIZE] [--ctc_conf CTC_CONF]
                        [--enh_criterions ENH_CRITERIONS]
                        [--diar_num_spk DIAR_NUM_SPK]
                        [--diar_input_size DIAR_INPUT_SIZE]
                        [--enh_model_conf ENH_MODEL_CONF]
                        [--asr_model_conf ASR_MODEL_CONF]
                        [--st_model_conf ST_MODEL_CONF]
                        [--diar_model_conf DIAR_MODEL_CONF]
                        [--subtask_series {enh,asr,st,diar} [{enh,asr,st,diar} ...]]
                        [--model_conf MODEL_CONF]
                        [--use_preprocessor USE_PREPROCESSOR]
                        [--token_type {bpe,char,word,phn}]
                        [--bpemodel BPEMODEL]
                        [--src_token_type {bpe,char,word,phn}]
                        [--src_bpemodel SRC_BPEMODEL]
                        [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                        [--cleaner {None,tacotron,jaconv,vietnamese}]
                        [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                        [--text_name TEXT_NAME [TEXT_NAME ...]]
                        [--enh_encoder {stft,conv,same}]
                        [--enh_encoder_conf ENH_ENCODER_CONF]
                        [--enh_separator {asteroid,conformer,dan,dc_crn,dccrn,dpcl,dpcl_e2e,dprnn,dptnet,fasnet,rnn,skim,svoice,tcn,transformer,wpe_beamformer,tcn_nomask,ineube}]
                        [--enh_separator_conf ENH_SEPARATOR_CONF]
                        [--enh_decoder {stft,conv,same}]
                        [--enh_decoder_conf ENH_DECODER_CONF]
                        [--enh_mask_module {multi_mask}]
                        [--enh_mask_module_conf ENH_MASK_MODULE_CONF]
                        [--frontend {default,sliding_window,s3prl,fused}]
                        [--frontend_conf FRONTEND_CONF]
                        [--specaug {specaug,None}]
                        [--specaug_conf SPECAUG_CONF]
                        [--normalize {global_mvn,utterance_mvn,None}]
                        [--normalize_conf NORMALIZE_CONF]
                        [--asr_preencoder {sinc,linear,None}]
                        [--asr_preencoder_conf ASR_PREENCODER_CONF]
                        [--asr_encoder {conformer,transformer,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,longformer,branchformer}]
                        [--asr_encoder_conf ASR_ENCODER_CONF]
                        [--asr_postencoder {hugging_face_transformers,None}]
                        [--asr_postencoder_conf ASR_POSTENCODER_CONF]
                        [--asr_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,transducer,mlm,hugging_face_transformers}]
                        [--asr_decoder_conf ASR_DECODER_CONF]
                        [--st_preencoder {sinc,linear,None}]
                        [--st_preencoder_conf ST_PREENCODER_CONF]
                        [--st_encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain}]
                        [--st_encoder_conf ST_ENCODER_CONF]
                        [--st_postencoder {hugging_face_transformers,None}]
                        [--st_postencoder_conf ST_POSTENCODER_CONF]
                        [--st_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
                        [--st_decoder_conf ST_DECODER_CONF]
                        [--st_extra_asr_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
                        [--st_extra_asr_decoder_conf ST_EXTRA_ASR_DECODER_CONF]
                        [--st_extra_mt_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
                        [--st_extra_mt_decoder_conf ST_EXTRA_MT_DECODER_CONF]
                        [--diar_frontend {default,sliding_window,s3prl,None}]
                        [--diar_frontend_conf DIAR_FRONTEND_CONF]
                        [--diar_specaug {specaug,None}]
                        [--diar_specaug_conf DIAR_SPECAUG_CONF]
                        [--diar_normalize {global_mvn,utterance_mvn,None}]
                        [--diar_normalize_conf DIAR_NORMALIZE_CONF]
                        [--diar_encoder {conformer,transformer,rnn}]
                        [--diar_encoder_conf DIAR_ENCODER_CONF]
                        [--diar_decoder {linear}]
                        [--diar_decoder_conf DIAR_DECODER_CONF]
                        [--label_aggregator {label_aggregator}]
                        [--label_aggregator_conf LABEL_AGGREGATOR_CONF]
                        [--diar_attractor {rnn,None}]
                        [--diar_attractor_conf DIAR_ATTRACTOR_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images to plot from the attention outputs. This option only makes sense for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel  (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
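
The environment variables named under --dist_init_method can be illustrated with a small reader. This is a minimal sketch of collecting the four values, not ESPnet's or PyTorch's implementation; the function name read_env_rendezvous and the returned dictionary keys are hypothetical.

```python
from typing import Dict, Mapping

# The four variables named in the --dist_init_method help text.
REQUIRED_ENV_KEYS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def read_env_rendezvous(environ: Mapping[str, str]) -> Dict[str, object]:
    """Collect the env:// rendezvous settings from a mapping such as os.environ.

    Hypothetical helper for illustration only.
    """
    missing = [key for key in REQUIRED_ENV_KEYS if key not in environ]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "master_addr": environ["MASTER_ADDR"],
        "master_port": int(environ["MASTER_PORT"]),
        "world_size": int(environ["WORLD_SIZE"]),
        "rank": int(environ["RANK"]),
    }


# Example values are made up; in a real job the launcher exports them.
example_env = {
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
    "WORLD_SIZE": "2",
    "RANK": "0",
}
print(read_env_rendezvous(example_env))
```

Passing os.environ instead of example_env reads the values exported by the job launcher.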

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair: the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging early stopping. Give three values: the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give three values: the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        Whether to inject noise into gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, the interval is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
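
As a reading aid for --patience and --early_stopping_criterion, the following is a minimal sketch of patience-based early stopping. It is not ESPnet's trainer code: the function name should_stop, the per-epoch value list, and the choice of ">=" for the comparison are assumptions. Only the meaning of patience and the "min"/"max" mode come from the help text above.

```python
from typing import Optional, Sequence


def should_stop(values: Sequence[float], mode: str, patience: Optional[int]) -> bool:
    """Return True when the most recent epochs brought no improvement.

    values:   one criterion value per finished epoch, e.g. the 'valid' 'loss'
    mode:     'min' if smaller is better, 'max' if larger is better
    patience: number of epochs to wait; None disables early stopping
    """
    if patience is None or not values:
        return False
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    pick = min if mode == "min" else max
    best = pick(values)
    best_epoch = list(values).index(best)  # first epoch that reached the best value
    epochs_without_improvement = len(values) - 1 - best_epoch
    # Assumption: stop once the wait reaches patience (>=), not only when it exceeds it.
    return epochs_without_improvement >= patience


valid_loss = [2.0, 1.5, 1.6, 1.7, 1.8]  # made-up values for illustration
print(should_stop(valid_loss, "min", patience=3))
print(should_stop(valid_loss, "min", patience=None))
```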

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization, e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
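
The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format of --init_param can be illustrated by splitting a value into its four fields. This is a minimal sketch, not the ESPnet loader: the names InitParamSpec and parse_init_param are hypothetical, and treating exclude_keys as a comma-separated list is an assumption that the help text does not state.

```python
from typing import NamedTuple, Optional, Tuple


class InitParamSpec(NamedTuple):
    """Hypothetical container for the four fields of an --init_param value."""

    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: Tuple[str, ...]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into fields.

    Trailing fields may be omitted. Paths that themselves contain ':' are
    not handled by this sketch.
    """
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"expected at most 4 ':'-separated fields, got {len(parts)}")
    parts += [""] * (4 - len(parts))
    file_path, src_key, dst_key, excludes = parts
    # Assumption: several exclude keys are separated by commas.
    exclude_keys = tuple(key for key in excludes.split(",") if key)
    return InitParamSpec(file_path, src_key or None, dst_key or None, exclude_keys)


print(parse_init_param("some/where/model.pth"))
print(parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed"))
```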

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The first column is used, so a 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in each mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches whose number of 'bins' is as equal as possible, counted as the total length of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches whose number of 'bins' is as equal as possible, but counted as the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
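
The shape-file lines and the "folded" formula from the --batch_type description can be worked through in a few lines. This is a minimal sketch, not the ESPnet samplers: all function names are hypothetical, and the guards for L < fold_length (where the formula as written would divide by zero) and for a result of 0 are assumptions.

```python
from math import prod
from typing import Tuple


def parse_shape_line(line: str) -> Tuple[str, Tuple[int, ...]]:
    """Parse a shape-file line such as 'utterance_id_a 1000,80'."""
    key, shape = line.split(maxsplit=1)
    return key, tuple(int(dim) for dim in shape.split(","))


def bins_used(dims: Tuple[int, ...], batch_type: str) -> int:
    """Count the 'bins' one sample contributes for 'length' or 'numel'."""
    if batch_type == "length":
        return dims[0]  # only the first dimension (the length) counts
    if batch_type == "numel":
        return prod(dims)  # every element counts
    raise ValueError(f"unsupported batch_type: {batch_type!r}")


def folded_batch_size(base_batch_size: int, longest: int, fold_length: int) -> int:
    """batch_size = base_batch_size // (L // fold_length), as in the help text."""
    folds = longest // fold_length
    if folds == 0:
        # Assumption: samples shorter than fold_length keep the base batch size.
        return base_batch_size
    # Assumption: never go below one sample per mini-batch.
    return max(1, base_batch_size // folds)


key, dims = parse_shape_line("utterance_id_a 1000,80")
print(key, dims)
print(bins_used(dims, "length"), bins_used(dims, "numel"))
print(folded_batch_size(base_batch_size=20, longest=1600, fold_length=800))
```

With the default batch_bins of 1000000, the "numel" count shows why the full feature shape is needed: the same utterance costs 1000 bins under "length" but 80000 under "numel".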

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
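
The --chunk_length syntax and the effect of --chunk_shift_ratio can be sketched as follows. This is an illustration under stated assumptions, not the ESPnet chunk iterator: the function names are hypothetical, the '300-400' range is assumed to include both ends, and chunk starts are assumed to begin at offset 0 with a fixed stride.

```python
import random
from typing import List, Optional


def parse_chunk_length(spec: str) -> List[int]:
    """Expand '300', '300,400,500', or '300-400' into candidate lengths."""
    lengths: List[int] = []
    for item in spec.split(","):
        item = item.strip()
        if "-" in item:
            low, high = (int(x) for x in item.split("-"))
            if low > high:
                raise ValueError(f"invalid range: {item!r}")
            lengths.extend(range(low, high + 1))  # assumption: inclusive range
        else:
            lengths.append(int(item))
    return lengths


def pick_chunk_length(lengths: List[int], seq_len: int, rng: random.Random) -> Optional[int]:
    """Pick one usable length at random; None means the sample is discarded."""
    usable = [length for length in lengths if length <= seq_len]
    return rng.choice(usable) if usable else None


def chunk_starts(seq_len: int, chunk_length: int, shift_ratio: float) -> List[int]:
    """Start offsets of chunks; ratio < 1 overlaps, ratio > 1 leaves gaps."""
    if seq_len < chunk_length:
        return []
    shift = max(1, int(chunk_length * shift_ratio))
    return list(range(0, seq_len - chunk_length + 1, shift))


print(parse_chunk_length("300,400,500"))
print(chunk_starts(seq_len=1000, chunk_length=500, shift_ratio=0.5))
print(pick_chunk_length(parse_chunk_length("300-400"), seq_len=200, rng=random.Random(0)))
```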

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. This option is used for the training data, e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, is the file path; the second, foo, is the key name used for the mini-batch data; and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "midi":
                        MIDI format types: mid, midi, etc.

                           utterance_id_a a.mid
                           utterance_id_b b.mid
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integer numbers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integer numbers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']

                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader, currently supported for speaker diarization.

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys in the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept open for ark files. This feature is only valid when the data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
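
The 'path,name,type' triple and two of the plain-text file types listed above can be read with a few lines of standard-library Python. This is a minimal sketch for illustration, not the ESPnet dataset loader; all function names are hypothetical, and values are returned as plain lists rather than ndarrays.

```python
from typing import List, Tuple


def parse_data_triple(value: str) -> Tuple[str, str, str]:
    """Split 'some/path/a.scp,foo,sound' into (path, name, type)."""
    parts = value.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected 'path,name,type', got {value!r}")
    path, name, file_type = parts
    return path, name, file_type


def parse_text_int(line: str) -> Tuple[str, List[int]]:
    """Parse a "text_int" line: an id followed by space-separated integers."""
    key, *fields = line.split()
    return key, [int(field) for field in fields]


def parse_csv_float(line: str) -> Tuple[str, List[float]]:
    """Parse a "csv_float" line: an id followed by comma-separated floats."""
    key, values = line.split(maxsplit=1)
    return key, [float(value) for value in values.split(",")]


print(parse_data_triple("some/path/a.scp,foo,sound"))
print(parse_text_int("utterance_id_A 12 0 1 3"))
print(parse_csv_float("utterance_id_B 3.,3.12,1.1"))
```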

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --token_list TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --src_token_list SRC_TOKEN_LIST
                        A text mapping int-id to token (for source language) (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --input_size INPUT_SIZE
                        The number of input dimensions of the feature (default: None)
  --ctc_conf CTC_CONF   The keyword arguments for CTC class. (default: {'dropout_rate': 0.0, 'ctc_type': 'builtin', 'reduce': True, 'ignore_nan_grad': None, 'zero_infinity': True})
  --enh_criterions ENH_CRITERIONS
                        The criteria bound to the loss wrappers. (default: [{'name': 'si_snr', 'conf': {}, 'wrapper': 'fixed_order', 'wrapper_conf': {}}])
  --diar_num_spk DIAR_NUM_SPK
                        The number of speakers (for each recording) for diar submodel class (default: None)
  --diar_input_size DIAR_INPUT_SIZE
                        The number of input dimensions of the feature (default: None)
  --enh_model_conf ENH_MODEL_CONF
                        The keyword arguments for enh submodel class. (default: {'stft_consistency': False, 'loss_type': 'mask_mse', 'mask_type': None})
  --asr_model_conf ASR_MODEL_CONF
                        The keyword arguments for asr submodel class. (default: {'ctc_weight': 0.5, 'interctc_weight': 0.0, 'ignore_id': -1, 'lsm_weight': 0.0, 'length_normalized_loss': False, 'report_cer': True, 'report_wer': True, 'sym_space': '<space>', 'sym_blank': '<blank>', 'sym_sos': '<sos/eos>', 'sym_eos': '<sos/eos>', 'extract_feats_in_collect_stats': True, 'lang_token_id': -1})
  --st_model_conf ST_MODEL_CONF
                        The keyword arguments for st submodel class. (default: {'stft_consistency': False, 'loss_type': 'mask_mse', 'mask_type': None})
  --diar_model_conf DIAR_MODEL_CONF
                        The keyword arguments for diar submodel class. (default: {'diar_weight': 1.0, 'attractor_weight': 1.0})
  --subtask_series {enh,asr,st,diar} [{enh,asr,st,diar} ...]
                        The series of subtasks in the pipeline. (default: ('enh', 'asr'))
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'calc_enh_loss': True, 'bypass_enh_prob': 0})

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: False)
  --token_type {bpe,char,word,phn}
                        The text will be tokenized at the specified token level (default: bpe)
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --src_token_type {bpe,char,word,phn}
                        The source text will be tokenized at the specified token level (default: bpe)
  --src_bpemodel SRC_BPEMODEL
                        The model file of sentencepiece (for source language) (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)
  --text_name TEXT_NAME [TEXT_NAME ...]
                        Specify the text_name attribute used in the preprocessor (default: ['text'])
  --enh_encoder {stft,conv,same}
                        The enh_encoder type (default: stft)
  --enh_encoder_conf ENH_ENCODER_CONF
                        The keyword arguments for enh_encoder (default: {})
  --enh_separator {asteroid,conformer,dan,dc_crn,dccrn,dpcl,dpcl_e2e,dprnn,dptnet,fasnet,rnn,skim,svoice,tcn,transformer,wpe_beamformer,tcn_nomask,ineube}
                        The enh_separator type (default: rnn)
  --enh_separator_conf ENH_SEPARATOR_CONF
                        The keyword arguments for enh_separator (default: {})
  --enh_decoder {stft,conv,same}
                        The enh_decoder type (default: stft)
  --enh_decoder_conf ENH_DECODER_CONF
                        The keyword arguments for enh_decoder (default: {})
  --enh_mask_module {multi_mask}
                        The enh_mask_module type (default: multi_mask)
  --enh_mask_module_conf ENH_MASK_MODULE_CONF
                        The keyword arguments for enh_mask_module (default: {})
  --frontend {default,sliding_window,s3prl,fused}
                        The frontend type (default: default)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: utterance_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --asr_preencoder {sinc,linear,None}
                        The asr_preencoder type (default: None)
  --asr_preencoder_conf ASR_PREENCODER_CONF
                        The keyword arguments for asr_preencoder (default: {})
  --asr_encoder {conformer,transformer,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,longformer,branchformer}
                        The asr_encoder type (default: rnn)
  --asr_encoder_conf ASR_ENCODER_CONF
                        The keyword arguments for asr_encoder (default: {})
  --asr_postencoder {hugging_face_transformers,None}
                        The asr_postencoder type (default: None)
  --asr_postencoder_conf ASR_POSTENCODER_CONF
                        The keyword arguments for asr_postencoder (default: {})
  --asr_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,transducer,mlm,hugging_face_transformers}
                        The asr_decoder type (default: rnn)
  --asr_decoder_conf ASR_DECODER_CONF
                        The keyword arguments for asr_decoder (default: {})
  --st_preencoder {sinc,linear,None}
                        The st_preencoder type (default: None)
  --st_preencoder_conf ST_PREENCODER_CONF
                        The keyword arguments for st_preencoder (default: {})
  --st_encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain}
                        The st_encoder type (default: rnn)
  --st_encoder_conf ST_ENCODER_CONF
                        The keyword arguments for st_encoder (default: {})
  --st_postencoder {hugging_face_transformers,None}
                        The st_postencoder type (default: None)
  --st_postencoder_conf ST_POSTENCODER_CONF
                        The keyword arguments for st_postencoder (default: {})
  --st_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
                        The st_decoder type (default: rnn)
  --st_decoder_conf ST_DECODER_CONF
                        The keyword arguments for st_decoder (default: {})
  --st_extra_asr_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
                        The st_extra_asr_decoder type (default: rnn)
  --st_extra_asr_decoder_conf ST_EXTRA_ASR_DECODER_CONF
                        The keyword arguments for st_extra_asr_decoder (default: {})
  --st_extra_mt_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
                        The st_extra_mt_decoder type (default: rnn)
  --st_extra_mt_decoder_conf ST_EXTRA_MT_DECODER_CONF
                        The keyword arguments for st_extra_mt_decoder (default: {})
  --diar_frontend {default,sliding_window,s3prl,None}
                        The diar_frontend type (default: default)
  --diar_frontend_conf DIAR_FRONTEND_CONF
                        The keyword arguments for diar_frontend (default: {})
  --diar_specaug {specaug,None}
                        The diar_specaug type (default: None)
  --diar_specaug_conf DIAR_SPECAUG_CONF
                        The keyword arguments for diar_specaug (default: {})
  --diar_normalize {global_mvn,utterance_mvn,None}
                        The diar_normalize type (default: utterance_mvn)
  --diar_normalize_conf DIAR_NORMALIZE_CONF
                        The keyword arguments for diar_normalize (default: {})
  --diar_encoder {conformer,transformer,rnn}
                        The diar_encoder type (default: transformer)
  --diar_encoder_conf DIAR_ENCODER_CONF
                        The keyword arguments for diar_encoder (default: {})
  --diar_decoder {linear}
                        The diar_decoder type (default: linear)
  --diar_decoder_conf DIAR_DECODER_CONF
                        The keyword arguments for diar_decoder (default: {})
  --label_aggregator {label_aggregator}
                        The label_aggregator type (default: label_aggregator)
  --label_aggregator_conf LABEL_AGGREGATOR_CONF
                        The keyword arguments for label_aggregator (default: {})
  --diar_attractor {rnn,None}
                        The diar_attractor type (default: None)
  --diar_attractor_conf DIAR_ATTRACTOR_CONF
                        The keyword arguments for diar_attractor (default: {})

enh_scoring.py

usage: enh_scoring.py [-h] [--config CONFIG]
                      [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                      --output_dir OUTPUT_DIR
                      [--dtype {float16,float32,float64}] --ref_scp REF_SCP
                      --inf_scp INF_SCP [--key_file KEY_FILE]
                      [--ref_channel REF_CHANNEL]
                      [--flexible_numspk FLEXIBLE_NUMSPK]

Frontend inference

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --dtype {float16,float32,float64}
                        Data type (default: float32)

Input data related:
  --ref_scp REF_SCP
  --inf_scp INF_SCP
  --key_file KEY_FILE
  --ref_channel REF_CHANNEL
  --flexible_numspk FLEXIBLE_NUMSPK

enh_train.py

usage: enh_train.py [-h] [--config CONFIG] [--print_config]
                    [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                    [--dry_run DRY_RUN]
                    [--iterator_type {sequence,chunk,task,none}]
                    [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                    [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                    [--dist_backend DIST_BACKEND]
                    [--dist_init_method DIST_INIT_METHOD]
                    [--dist_world_size DIST_WORLD_SIZE]
                    [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                    [--dist_master_addr DIST_MASTER_ADDR]
                    [--dist_master_port DIST_MASTER_PORT]
                    [--dist_launcher {slurm,mpi,None}]
                    [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                    [--unused_parameters UNUSED_PARAMETERS]
                    [--sharded_ddp SHARDED_DDP]
                    [--cudnn_enabled CUDNN_ENABLED]
                    [--cudnn_benchmark CUDNN_BENCHMARK]
                    [--cudnn_deterministic CUDNN_DETERMINISTIC]
                    [--collect_stats COLLECT_STATS]
                    [--write_collected_feats WRITE_COLLECTED_FEATS]
                    [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                    [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                    [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                    [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                    [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                    [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                    [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                    [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                    [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                    [--train_dtype {float16,float32,float64}]
                    [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                    [--use_matplotlib USE_MATPLOTLIB]
                    [--use_tensorboard USE_TENSORBOARD]
                    [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                    [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                    [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                    [--wandb_name WANDB_NAME]
                    [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                    [--detect_anomaly DETECT_ANOMALY]
                    [--pretrain_path PRETRAIN_PATH]
                    [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                    [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                    [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                    [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                    [--batch_size BATCH_SIZE]
                    [--valid_batch_size VALID_BATCH_SIZE]
                    [--batch_bins BATCH_BINS]
                    [--valid_batch_bins VALID_BATCH_BINS]
                    [--train_shape_file TRAIN_SHAPE_FILE]
                    [--valid_shape_file VALID_SHAPE_FILE]
                    [--batch_type {unsorted,sorted,folded,length,numel}]
                    [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                    [--fold_length FOLD_LENGTH]
                    [--sort_in_batch {descending,ascending}]
                    [--sort_batch {descending,ascending}]
                    [--multiple_iterator MULTIPLE_ITERATOR]
                    [--chunk_length CHUNK_LENGTH]
                    [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                    [--num_cache_chunks NUM_CACHE_CHUNKS]
                    [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                    [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                    [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                    [--max_cache_size MAX_CACHE_SIZE]
                    [--max_cache_fd MAX_CACHE_FD]
                    [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                    [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                    [--optim_conf OPTIM_CONF]
                    [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                    [--scheduler_conf SCHEDULER_CONF]
                    [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                    [--model_conf MODEL_CONF] [--criterions CRITERIONS]
                    [--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
                    [--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
                    [--noise_scp NOISE_SCP]
                    [--noise_apply_prob NOISE_APPLY_PROB]
                    [--noise_db_range NOISE_DB_RANGE]
                    [--short_noise_thres SHORT_NOISE_THRES]
                    [--use_reverberant_ref USE_REVERBERANT_REF]
                    [--num_spk NUM_SPK] [--num_noise_type NUM_NOISE_TYPE]
                    [--sample_rate SAMPLE_RATE]
                    [--force_single_channel FORCE_SINGLE_CHANNEL]
                    [--dynamic_mixing DYNAMIC_MIXING] [--utt2spk UTT2SPK]
                    [--dynamic_mixing_gain_db DYNAMIC_MIXING_GAIN_DB]
                    [--encoder {stft,conv,same}] [--encoder_conf ENCODER_CONF]
                    [--separator {asteroid,conformer,dan,dc_crn,dccrn,dpcl,dpcl_e2e,dprnn,dptnet,fasnet,rnn,skim,svoice,tcn,transformer,wpe_beamformer,tcn_nomask,ineube}]
                    [--separator_conf SEPARATOR_CONF]
                    [--decoder {stft,conv,same}] [--decoder_conf DECODER_CONF]
                    [--mask_module {multi_mask}]
                    [--mask_module_conf MASK_MODULE_CONF]
                    [--preprocessor {dynamic_mixing,enh,None}]
                    [--preprocessor_conf PREPROCESSOR_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images used to plot the attention outputs. This option only makes sense for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel  (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
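
With --dist_init_method env://, the help text above names four environment variables. The following is a minimal sketch of checking them before launching a job. The helper names read_dist_env and missing_dist_env are hypothetical and are not part of ESPnet2.

```python
import os
from typing import Dict, List, Mapping, Optional

# Variable names listed in the --dist_init_method help text.
ENV_KEYS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def read_dist_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Collect the env:// variables; an unset variable maps to None.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    return {key: env.get(key) for key in ENV_KEYS}


def missing_dist_env(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of the env:// variables that are not set."""
    return [key for key, value in read_dist_env(env).items() if value is None]


# Example mapping standing in for os.environ; the values are made up.
example = {"MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29500", "WORLD_SIZE": "2"}
print(missing_dist_env(example))  # ['RANK']
```

Passing a mapping instead of reading os.environ directly keeps the check easy to try out without touching the real environment.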

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair referring to the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging early stopping. Give a triplet consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give a triplet consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        Whether to inject noise into gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
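
The interplay of --patience and --early_stopping_criterion can be illustrated with a small sketch. It assumes that training stops once the best value of the chosen criterion is more than `patience` epochs old. The function name should_stop and this exact comparison are assumptions made for illustration, not a copy of the ESPnet2 trainer.

```python
from typing import Optional, Sequence


def should_stop(values: Sequence[float], patience: Optional[int], mode: str = "min") -> bool:
    """Return True when the best value is more than `patience` epochs old.

    values: one criterion value per finished epoch (e.g. the 'valid' 'loss').
    patience: None disables early stopping, matching the default above.
    mode: "min" if lower is better, "max" if higher is better.
    """
    if patience is None or not values:
        return False
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    best = min(values) if mode == "min" else max(values)
    best_epoch = values.index(best)  # first epoch that reached the best value
    epochs_without_improvement = len(values) - 1 - best_epoch
    return epochs_without_improvement > patience


valid_loss = [1.0, 0.8, 0.9, 0.85, 0.95]  # made-up values; best is at epoch index 1
print(should_stop(valid_loss, patience=2))  # True: 3 epochs without improvement
print(should_stop(valid_loss, patience=3))  # False
```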

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
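
The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format described for --init_param can be split as in the sketch below. InitParamSpec and parse_init_param are hypothetical names, and treating exclude_keys as a comma-separated list is an assumption not stated in the help text.

```python
from typing import NamedTuple, Optional, Tuple


class InitParamSpec(NamedTuple):
    """Parsed form of one --init_param value (hypothetical container)."""
    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: Tuple[str, ...]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>'.

    Trailing fields may be omitted. Paths that themselves contain ':'
    are not handled by this sketch.
    """
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':'-separated fields in {spec!r}")
    parts += [""] * (4 - len(parts))  # pad the omitted fields
    file_path, src_key, dst_key, excludes = parts
    # Assumption: several excluded keys are separated by commas.
    exclude_keys = tuple(key for key in excludes.split(",") if key)
    return InitParamSpec(file_path, src_key or None, dst_key or None, exclude_keys)


for value in (
    "some/where/model.pth",
    "some/where/model.pth:decoder:decoder",
    "some/where/model.pth:decoder:decoder:decoder.embed",
):
    print(parse_init_param(value))
```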

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches of constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The first column is referred to, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in each mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is referred to, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L refers to the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches which have as equal a number of 'bins' as possible, counted by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is referred to, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches which have as equal a number of 'bins' as possible, counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
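
The formula given under the "folded" entry of --batch_type, batch_size = base_batch_size // (L // fold_length), can be tried out with the sketch below. Taken literally, the formula divides by zero when L is shorter than fold_length, so the sketch clamps the fold count to at least 1 and the result to a minimum batch size. Both guards and the function name folded_batch_size are assumptions added for illustration, not the FoldedBatchSampler implementation.

```python
def folded_batch_size(
    base_batch_size: int,
    longest: int,
    fold_length: int,
    min_batch_size: int = 1,
) -> int:
    """Evaluate batch_size = base_batch_size // (L // fold_length).

    longest is L, the largest sample length in the mini-batch.
    Assumption: the fold count is clamped to >= 1 so that samples shorter
    than fold_length keep the base batch size instead of dividing by zero.
    """
    if fold_length <= 0:
        raise ValueError("fold_length must be positive")
    folds = max(1, longest // fold_length)
    return max(min_batch_size, base_batch_size // folds)


# Lengths taken from the shape-file examples above; fold_length=500 is made up.
print(folded_batch_size(20, 1453, 500))  # 10
print(folded_batch_size(20, 100, 500))   # 20
```

Longer utterances therefore get proportionally smaller mini-batches, which keeps the padded batch size roughly bounded.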

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it's less than 1, chunks overlap; if it's bigger than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
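
The three accepted forms of --chunk_length and the effect of --chunk_shift_ratio can be sketched as follows. The function names are hypothetical, the '-' range is assumed to be inclusive, and the chunk positions are computed without any random offset, so this is an illustration of the options rather than the chunk iterator itself.

```python
from typing import List


def parse_chunk_lengths(spec: str) -> List[int]:
    """Expand '300', '300,400,500', or '300-400' into candidate lengths.

    Assumption: a 'low-high' range includes both end points.
    """
    lengths: List[int] = []
    for item in spec.split(","):
        item = item.strip()
        if "-" in item:
            low, high = (int(v) for v in item.split("-"))
            lengths.extend(range(low, high + 1))
        else:
            lengths.append(int(item))
    return lengths


def chunk_starts(seq_len: int, chunk_length: int, shift_ratio: float = 0.5) -> List[int]:
    """Start indices of the chunks cut from a sequence of length seq_len.

    A sequence shorter than the chunk length yields no chunk (it is discarded).
    shift_ratio < 1 makes chunks overlap; shift_ratio > 1 leaves gaps.
    """
    if seq_len < chunk_length:
        return []
    shift = max(1, int(chunk_length * shift_ratio))
    return list(range(0, seq_len - chunk_length + 1, shift))


print(parse_chunk_lengths("300,400,500"))  # [300, 400, 500]
print(chunk_starts(1000, 500, 0.5))        # [0, 250, 500]
```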

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types which are supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "midi":
                        MIDI file types: mid, midi, etc.

                           utterance_id_a a.mid
                           utterance_id_b b.mid
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integer numbers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integer numbers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        rttm file loader, currently supported for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys for mini-batches, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
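
The 'path,name,type' triple taken by --train_data_path_and_name_and_type, and the 'utterance_id value' line layout shared by the listed file types, can be read as in the sketch below. DataSpec and the three functions are hypothetical helpers written for illustration. They do not reproduce the ESPnet2 dataset classes.

```python
from typing import Dict, Iterable, List, NamedTuple


class DataSpec(NamedTuple):
    """One 'path,name,type' triple (hypothetical container)."""
    path: str
    name: str
    type: str


def parse_data_spec(value: str) -> DataSpec:
    """Split e.g. 'some/path/a.scp,foo,sound' into path, key name, and file type."""
    parts = value.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected 'path,name,type', got {value!r}")
    return DataSpec(*parts)


def parse_scp_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Map each utterance id to the rest of its line, skipping blank lines."""
    table: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, value = line.split(maxsplit=1)
        table[key] = value
    return table


def parse_csv_int(value: str) -> List[int]:
    """Decode one 'csv_int' value such as '100,80'."""
    return [int(v) for v in value.split(",")]


spec = parse_data_spec("some/path/a.scp,foo,sound")
print(spec.name, spec.type)  # foo sound

table = parse_scp_lines(["utterance_id_A 100,80", "utterance_id_B 143,80"])
print(parse_csv_int(table["utterance_id_A"]))  # [100, 80]
```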

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'stft_consistency': False, 'loss_type': 'mask_mse', 'mask_type': None})
  --criterions CRITERIONS
                        The criterions bound to the loss wrappers. (default: [{'name': 'si_snr', 'conf': {}, 'wrapper': 'fixed_order', 'wrapper_conf': {}}])

  Preprocess related

  --speech_volume_normalize SPEECH_VOLUME_NORMALIZE
                        Scale the maximum amplitude to the given value or range. e.g. --speech_volume_normalize 1.0 scales it to 1.0.
                        --speech_volume_normalize 0.5_1.0 scales it to a random number in the range [0.5, 1.0) (default: None)
  --rir_scp RIR_SCP     The file path of rir scp file. (default: None)
  --rir_apply_prob RIR_APPLY_PROB
                        The probability of applying RIR convolution. (default: 1.0)
  --noise_scp NOISE_SCP
                        The file path of noise scp file. (default: None)
  --noise_apply_prob NOISE_APPLY_PROB
                        The probability of applying noise addition. (default: 1.0)
  --noise_db_range NOISE_DB_RANGE
                        The range of signal-to-noise ratio (SNR) level in decibel. (default: 13_15)
  --short_noise_thres SHORT_NOISE_THRES
                        If len(noise) / len(speech) is smaller than this threshold during dynamic mixing, a warning will be displayed. (default: 0.5)
  --use_reverberant_ref USE_REVERBERANT_REF
                        Whether to use reverberant speech references instead of anechoic ones (default: False)
  --num_spk NUM_SPK     Number of speakers in the input signal. (default: 1)
  --num_noise_type NUM_NOISE_TYPE
                        Number of noise types. (default: 1)
  --sample_rate SAMPLE_RATE
                        Sampling rate of the data (in Hz). (default: 8000)
  --force_single_channel FORCE_SINGLE_CHANNEL
                        Whether to force all data to be single-channel. (default: False)
  --dynamic_mixing DYNAMIC_MIXING
                        Apply dynamic mixing (default: False)
  --utt2spk UTT2SPK     The file path of utt2spk file. Only used in dynamic_mixing mode. (default: None)
  --dynamic_mixing_gain_db DYNAMIC_MIXING_GAIN_DB
                        Random gain (in dB) for dynamic mixing sources (default: 0.0)
  --encoder {stft,conv,same}
                        The encoder type (default: stft)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --separator {asteroid,conformer,dan,dc_crn,dccrn,dpcl,dpcl_e2e,dprnn,dptnet,fasnet,rnn,skim,svoice,tcn,transformer,wpe_beamformer,tcn_nomask,ineube}
                        The separator type (default: rnn)
  --separator_conf SEPARATOR_CONF
                        The keyword arguments for separator (default: {})
  --decoder {stft,conv,same}
                        The decoder type (default: stft)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})
  --mask_module {multi_mask}
                        The mask_module type (default: multi_mask)
  --mask_module_conf MASK_MODULE_CONF
                        The keyword arguments for mask_module (default: {})
  --preprocessor {dynamic_mixing,enh,None}
                        The preprocessor type (default: None)
  --preprocessor_conf PREPROCESSOR_CONF
                        The keyword arguments for preprocessor (default: {})
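
Values such as '1.0' and '0.5_1.0' for --speech_volume_normalize, or '13_15' for --noise_db_range, use an underscore to separate the lower and upper bound. The sketch below parses that notation and draws a value from the range. parse_range and sample_from_range are hypothetical names, and drawing uniformly with random.Random is an assumption about how a value could be picked, not the preprocessor's actual code.

```python
import random
from typing import Tuple


def parse_range(spec: str) -> Tuple[float, float]:
    """Parse 'value' or 'low_high' into a (low, high) pair of floats."""
    parts = spec.split("_")
    if len(parts) == 1:
        value = float(parts[0])
        return value, value
    if len(parts) == 2:
        return float(parts[0]), float(parts[1])
    raise ValueError(f"expected 'value' or 'low_high', got {spec!r}")


def sample_from_range(spec: str, rng: random.Random) -> float:
    """Return the fixed value, or a uniform draw between low and high."""
    low, high = parse_range(spec)
    if low == high:
        return low
    return rng.uniform(low, high)


rng = random.Random(0)  # seeded only to make the sketch repeatable
print(parse_range("13_15"))               # (13.0, 15.0)
print(sample_from_range("1.0", rng))      # 1.0
print(sample_from_range("0.5_1.0", rng))  # some value between 0.5 and 1.0
```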

gan_tts_train.py

usage: gan_tts_train.py [-h] [--config CONFIG] [--print_config]
                        [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                        [--dry_run DRY_RUN]
                        [--iterator_type {sequence,chunk,task,none}]
                        [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                        [--num_workers NUM_WORKERS]
                        [--num_att_plot NUM_ATT_PLOT]
                        [--dist_backend DIST_BACKEND]
                        [--dist_init_method DIST_INIT_METHOD]
                        [--dist_world_size DIST_WORLD_SIZE]
                        [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                        [--dist_master_addr DIST_MASTER_ADDR]
                        [--dist_master_port DIST_MASTER_PORT]
                        [--dist_launcher {slurm,mpi,None}]
                        [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                        [--unused_parameters UNUSED_PARAMETERS]
                        [--sharded_ddp SHARDED_DDP]
                        [--cudnn_enabled CUDNN_ENABLED]
                        [--cudnn_benchmark CUDNN_BENCHMARK]
                        [--cudnn_deterministic CUDNN_DETERMINISTIC]
                        [--collect_stats COLLECT_STATS]
                        [--write_collected_feats WRITE_COLLECTED_FEATS]
                        [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                        [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                        [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                        [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                        [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                        [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                        [--grad_clip GRAD_CLIP]
                        [--grad_clip_type GRAD_CLIP_TYPE]
                        [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                        [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                        [--train_dtype {float16,float32,float64}]
                        [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                        [--use_matplotlib USE_MATPLOTLIB]
                        [--use_tensorboard USE_TENSORBOARD]
                        [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                        [--use_wandb USE_WANDB]
                        [--wandb_project WANDB_PROJECT] [--wandb_id WANDB_ID]
                        [--wandb_entity WANDB_ENTITY]
                        [--wandb_name WANDB_NAME]
                        [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                        [--detect_anomaly DETECT_ANOMALY]
                        [--pretrain_path PRETRAIN_PATH]
                        [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                        [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                        [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                        [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                        [--batch_size BATCH_SIZE]
                        [--valid_batch_size VALID_BATCH_SIZE]
                        [--batch_bins BATCH_BINS]
                        [--valid_batch_bins VALID_BATCH_BINS]
                        [--train_shape_file TRAIN_SHAPE_FILE]
                        [--valid_shape_file VALID_SHAPE_FILE]
                        [--batch_type {unsorted,sorted,folded,length,numel}]
                        [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                        [--fold_length FOLD_LENGTH]
                        [--sort_in_batch {descending,ascending}]
                        [--sort_batch {descending,ascending}]
                        [--multiple_iterator MULTIPLE_ITERATOR]
                        [--chunk_length CHUNK_LENGTH]
                        [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                        [--num_cache_chunks NUM_CACHE_CHUNKS]
                        [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                        [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                        [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                        [--max_cache_size MAX_CACHE_SIZE]
                        [--max_cache_fd MAX_CACHE_FD]
                        [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                        [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                        [--optim_conf OPTIM_CONF]
                        [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                        [--scheduler_conf SCHEDULER_CONF]
                        [--optim2 {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                        [--optim2_conf OPTIM2_CONF]
                        [--scheduler2 {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                        [--scheduler2_conf SCHEDULER2_CONF]
                        [--generator_first GENERATOR_FIRST]
                        [--token_list TOKEN_LIST] [--odim ODIM]
                        [--model_conf MODEL_CONF]
                        [--use_preprocessor USE_PREPROCESSOR]
                        [--token_type {bpe,char,word,phn}]
                        [--bpemodel BPEMODEL]
                        [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                        [--cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}]
                        [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                        [--feats_extract {fbank,log_spectrogram,linear_spectrogram}]
                        [--feats_extract_conf FEATS_EXTRACT_CONF]
                        [--normalize {global_mvn,utterance_mvn,None}]
                        [--normalize_conf NORMALIZE_CONF]
                        [--tts {vits,joint_text2wav,jets}]
                        [--tts_conf TTS_CONF] [--pitch_extract {dio,None}]
                        [--pitch_extract_conf PITCH_EXTRACT_CONF]
                        [--pitch_normalize {global_mvn,utterance_mvn,None}]
                        [--pitch_normalize_conf PITCH_NORMALIZE_CONF]
                        [--energy_extract {energy,None}]
                        [--energy_extract_conf ENERGY_EXTRACT_CONF]
                        [--energy_normalize {global_mvn,utterance_mvn,None}]
                        [--energy_normalize_conf ENERGY_NORMALIZE_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --generator_first GENERATOR_FIRST
                        Whether to update generator first. (default: False)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)
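
As a rough illustration of what --generator_first controls, the sketch below assumes that each training iteration updates the generator and the discriminator once each, in turn. The function name update_order is hypothetical and is not part of ESPnet.

```python
from typing import List


def update_order(generator_first: bool) -> List[str]:
    """Return the order of network updates within one iteration.

    Assumption: one update per network per iteration; the flag only
    decides which of the two networks goes first.
    """
    turns = ["generator", "discriminator"]
    return turns if generator_first else turns[::-1]


# With the default (False), the discriminator would be updated first.
print(update_order(False))
```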

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images for plotting attention outputs. This option is meaningful only for attention-based models. Set it to 0 to disable the attention plot (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
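
The env:// initialization described above relies on four environment variables. The sketch below shows one way to check them before launching; the helper read_dist_env and the returned dictionary keys are hypothetical, and only the four variable names come from the help text.

```python
import os
from typing import Dict, Mapping, Optional, Union

# Variable names taken from the --dist_init_method help text.
REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def read_dist_env(
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Union[str, int]]:
    """Collect the env:// settings, failing early if any is missing."""
    environ = os.environ if environ is None else environ
    missing = [key for key in REQUIRED if key not in environ]
    if missing:
        raise RuntimeError(f"missing environment variables: {missing}")
    return {
        "master_addr": environ["MASTER_ADDR"],
        "master_port": int(environ["MASTER_PORT"]),
        "world_size": int(environ["WORLD_SIZE"]),
        "rank": int(environ["RANK"]),
    }
```

Passing a plain dictionary instead of os.environ keeps the check easy to try out without touching the real environment.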

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair consisting of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging early stopping. Give a triplet consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give a triplet consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
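
To make the interaction between --patience and --early_stopping_criterion concrete, here is a minimal sketch. It assumes that training stops once the number of epochs since the best value exceeds the patience; the exact comparison used by the trainer may differ. The function should_stop and the history layout are hypothetical.

```python
from typing import Dict, List, Optional

# One entry per finished epoch: {phase: {criterion_name: value}}.
History = List[Dict[str, Dict[str, float]]]


def should_stop(
    history: History,
    phase: str,
    criterion: str,
    mode: str,
    patience: Optional[int],
) -> bool:
    """Return True if no improvement was seen for more than `patience` epochs."""
    if patience is None or not history:
        return False  # early stopping is disabled or nothing to judge yet
    values = [epoch[phase][criterion] for epoch in history]
    best = min(values) if mode == "min" else max(values)
    best_epoch = values.index(best)  # first epoch that reached the best value
    epochs_since_best = len(values) - 1 - best_epoch
    return epochs_since_best > patience


# Validation loss per epoch (made-up numbers); the best value is at the 2nd epoch.
history = [{"valid": {"loss": v}} for v in (1.0, 0.8, 0.9, 0.85, 0.95)]
print(should_stop(history, "valid", "loss", "min", patience=2))
```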

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
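
The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format can be read as four colon-separated fields with trailing fields optional. The following sketch parses such a string; the names InitParam and parse_init_param are hypothetical, and treating exclude_keys as a comma-separated list is an assumption not stated in the help text.

```python
from typing import List, NamedTuple, Optional


class InitParam(NamedTuple):
    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: List[str]


def parse_init_param(spec: str) -> InitParam:
    """Split an --init_param value into its (up to) four fields."""
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':' separated fields in {spec!r}")
    # Pad the omitted trailing fields with empty strings.
    parts += [""] * (4 - len(parts))
    path, src, dst, excludes = parts
    return InitParam(
        file_path=path,
        src_key=src or None,
        dst_key=dst or None,
        # Assumption: several keys to exclude are separated by commas.
        exclude_keys=excludes.split(",") if excludes else [],
    )


print(parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed"))
```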
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE
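
The --batch_bins budget is counted differently for batch_type='length' and 'numel' (see the --batch_type description). The sketch below assumes shape-file lines of the form 'utterance_id 1000,80' and ignores padding; parse_shape_line and bins_used are hypothetical helper names.

```python
from math import prod
from typing import List, Tuple


def parse_shape_line(line: str) -> Tuple[str, List[int]]:
    """Parse 'utterance_id 1000,80' into ('utterance_id', [1000, 80])."""
    key, shape = line.split(maxsplit=1)
    return key, [int(dim) for dim in shape.split(",")]


def bins_used(shapes: List[List[int]], batch_type: str) -> int:
    """Count the bins a group of samples would occupy (padding ignored)."""
    if batch_type == "length":
        # Only the first dimension (the sequence length) is counted.
        return sum(shape[0] for shape in shapes)
    if batch_type == "numel":
        # Every element is counted, so all dimensions are multiplied.
        return sum(prod(shape) for shape in shapes)
    raise ValueError(f"bins are not used for batch_type={batch_type!r}")


lines = ["utterance_id_a 1000,80", "utterance_id_b 1453,80", "utterance_id_c 1241,80"]
shapes = [parse_shape_line(line)[1] for line in lines]
print(bins_used(shapes, "length"), bins_used(shapes, "numel"))
```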

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches of constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The first column is used, so a 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches whose number of 'bins' is as equal as possible, counted by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches whose number of 'bins' is as equal as possible, but counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
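
The "folded" formula above can be tried out with a few numbers. This sketch follows batch_size = base_batch_size // (L // fold_length) as written, but clamps the divisor to at least 1 so that samples shorter than fold_length do not cause a division by zero. That clamp, the function name folded_batch_size, and the fold_length value of 800 are assumptions for illustration; the actual sampler may handle short samples differently.

```python
def folded_batch_size(base_batch_size: int, longest: int, fold_length: int) -> int:
    """Shrink the batch size as the longest sample in the batch grows."""
    # Assumption: clamp to 1 so that longest < fold_length keeps the base size.
    divisor = max(1, longest // fold_length)
    # Never return an empty batch.
    return max(1, base_batch_size // divisor)


for longest in (400, 1453, 2500):
    print(longest, folded_batch_size(20, longest, fold_length=800))
```

With these inputs the batch size stays at the base value until the longest sample reaches twice the fold length, then drops in integer steps.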
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by sample length. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
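
The chunk options can be illustrated with a small sketch. It assumes that a range such as '300-400' includes both end points and that the shift between chunk starts is chunk_length * chunk_shift_ratio truncated to an integer. The function names are hypothetical, and the real iterator additionally picks a chunk length at random per sample.

```python
from typing import List


def parse_chunk_lengths(spec: str) -> List[int]:
    """Expand '300', '300,400,500', or '300-400' into candidate lengths."""
    lengths: List[int] = []
    for item in spec.split(","):
        item = item.strip()
        if "-" in item:
            low, high = (int(x) for x in item.split("-"))
            # Assumption: the range includes both end points.
            lengths.extend(range(low, high + 1))
        else:
            lengths.append(int(item))
    return lengths


def chunk_starts(seq_len: int, chunk_len: int, shift_ratio: float) -> List[int]:
    """Start offsets of the chunks cut from one sequence.

    A ratio below 1 makes consecutive chunks overlap; a ratio above 1
    leaves gaps. A sequence shorter than chunk_len yields no chunks.
    """
    shift = max(1, int(chunk_len * shift_ratio))
    return list(range(0, seq_len - chunk_len + 1, shift))


print(parse_chunk_lengths("300,400,500"))
print(chunk_starts(1200, 500, 0.5))
```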

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio formats supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "midi":
                        MIDI formats: mid, midi, etc.

                           utterance_id_a a.mid
                           utterance_id_b b.mid
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the file. The lower and upper values are given by the file type name, e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader, currently supported for speaker diarization.

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
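
Because --train_data_path_and_name_and_type is repeatable and each value is a comma-separated triplet, it can be modelled with a standard argparse option. The sketch below is a standalone illustration, not ESPnet's own parser; str2triple and build_parser are hypothetical names, and the second data entry is made up.

```python
import argparse
from typing import Tuple


def str2triple(value: str) -> Tuple[str, str, str]:
    """Split 'path,name,type' into its three fields."""
    parts = value.split(",")
    if len(parts) != 3:
        raise argparse.ArgumentTypeError(
            f"expected 'path,name,type', got {value!r}"
        )
    path, name, file_type = (part.strip() for part in parts)
    return path, name, file_type


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--train_data_path_and_name_and_type",
        type=str2triple,
        action="append",  # each repetition adds one triplet to the list
        default=[],
    )
    return parser


args = build_parser().parse_args(
    [
        "--train_data_path_and_name_and_type", "some/path/a.scp,foo,sound",
        "--train_data_path_and_name_and_type", "some/path/b.txt,bar,text",
    ]
)
for path, name, file_type in args.train_data_path_and_name_and_type:
    print(f"{name}: {path} ({file_type})")
```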
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys in the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept open for ark files. This feature is only valid when the data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
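
The cache options accept human-readable sizes such as 10MB or 20GB, and the validation cache falls back to 5 percent of the training cache. The sketch below reproduces that arithmetic with a small hand-written parser. It assumes decimal units (1 MB = 1000**2 bytes), and parse_size and valid_cache_size are hypothetical helpers rather than the functions ESPnet uses.

```python
import re
from typing import Optional

# Assumption: decimal multiples; binary (1024-based) units are not handled.
_UNITS = {"B": 1, "KB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4}


def parse_size(text: str) -> float:
    """Convert strings such as '10MB' or '20GB' to a number of bytes."""
    match = re.fullmatch(
        r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)?\s*", text, flags=re.IGNORECASE
    )
    if match is None:
        raise ValueError(f"cannot parse size: {text!r}")
    number, unit = match.groups()
    return float(number) * _UNITS[(unit or "B").upper()]


def valid_cache_size(
    max_cache_size: float, valid_max_cache_size: Optional[float] = None
) -> float:
    """Use the explicit value, or 5 percent of the training cache size."""
    if valid_max_cache_size is None:
        return max_cache_size * 5 / 100
    return valid_max_cache_size


train_cache = parse_size("20GB")
print(train_cache, valid_cache_size(train_cache))
```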

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})
  --optim2 {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim2_conf OPTIM2_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler2 {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler2_conf SCHEDULER2_CONF
                        The keyword arguments for lr scheduler (default: {})

Task related:
  --token_list TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --odim ODIM           The dimension of the output feature (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {})

Preprocess related:
  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn}
                        The text will be tokenized at the specified token level (default: phn)
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --feats_extract {fbank,log_spectrogram,linear_spectrogram}
                        The feats_extract type (default: linear_spectrogram)
  --feats_extract_conf FEATS_EXTRACT_CONF
                        The keyword arguments for feats_extract (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: None)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --tts {vits,joint_text2wav,jets}
                        The tts type (default: vits)
  --tts_conf TTS_CONF   The keyword arguments for tts (default: {})
  --pitch_extract {dio,None}
                        The pitch_extract type (default: None)
  --pitch_extract_conf PITCH_EXTRACT_CONF
                        The keyword arguments for pitch_extract (default: {})
  --pitch_normalize {global_mvn,utterance_mvn,None}
                        The pitch_normalize type (default: None)
  --pitch_normalize_conf PITCH_NORMALIZE_CONF
                        The keyword arguments for pitch_normalize (default: {})
  --energy_extract {energy,None}
                        The energy_extract type (default: None)
  --energy_extract_conf ENERGY_EXTRACT_CONF
                        The keyword arguments for energy_extract (default: {})
  --energy_normalize {global_mvn,utterance_mvn,None}
                        The energy_normalize type (default: None)
  --energy_normalize_conf ENERGY_NORMALIZE_CONF
                        The keyword arguments for energy_normalize (default: {})

hubert_train.py

usage: hubert_train.py [-h] [--config CONFIG] [--print_config]
                       [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                       [--dry_run DRY_RUN]
                       [--iterator_type {sequence,chunk,task,none}]
                       [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                       [--num_workers NUM_WORKERS]
                       [--num_att_plot NUM_ATT_PLOT]
                       [--dist_backend DIST_BACKEND]
                       [--dist_init_method DIST_INIT_METHOD]
                       [--dist_world_size DIST_WORLD_SIZE]
                       [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                       [--dist_master_addr DIST_MASTER_ADDR]
                       [--dist_master_port DIST_MASTER_PORT]
                       [--dist_launcher {slurm,mpi,None}]
                       [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                       [--unused_parameters UNUSED_PARAMETERS]
                       [--sharded_ddp SHARDED_DDP]
                       [--cudnn_enabled CUDNN_ENABLED]
                       [--cudnn_benchmark CUDNN_BENCHMARK]
                       [--cudnn_deterministic CUDNN_DETERMINISTIC]
                       [--collect_stats COLLECT_STATS]
                       [--write_collected_feats WRITE_COLLECTED_FEATS]
                       [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                       [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                       [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                       [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                       [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                       [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                       [--grad_clip GRAD_CLIP]
                       [--grad_clip_type GRAD_CLIP_TYPE]
                       [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                       [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                       [--train_dtype {float16,float32,float64}]
                       [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                       [--use_matplotlib USE_MATPLOTLIB]
                       [--use_tensorboard USE_TENSORBOARD]
                       [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                       [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                       [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                       [--wandb_name WANDB_NAME]
                       [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                       [--detect_anomaly DETECT_ANOMALY]
                       [--pretrain_path PRETRAIN_PATH]
                       [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                       [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                       [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                       [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                       [--batch_size BATCH_SIZE]
                       [--valid_batch_size VALID_BATCH_SIZE]
                       [--batch_bins BATCH_BINS]
                       [--valid_batch_bins VALID_BATCH_BINS]
                       [--train_shape_file TRAIN_SHAPE_FILE]
                       [--valid_shape_file VALID_SHAPE_FILE]
                       [--batch_type {unsorted,sorted,folded,length,numel}]
                       [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                       [--fold_length FOLD_LENGTH]
                       [--sort_in_batch {descending,ascending}]
                       [--sort_batch {descending,ascending}]
                       [--multiple_iterator MULTIPLE_ITERATOR]
                       [--chunk_length CHUNK_LENGTH]
                       [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                       [--num_cache_chunks NUM_CACHE_CHUNKS]
                       [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                       [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                       [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                       [--max_cache_size MAX_CACHE_SIZE]
                       [--max_cache_fd MAX_CACHE_FD]
                       [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                       [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                       [--optim_conf OPTIM_CONF]
                       [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                       [--scheduler_conf SCHEDULER_CONF]
                       [--token_list TOKEN_LIST]
                       [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                       [--input_size INPUT_SIZE] [--model_conf MODEL_CONF]
                       [--use_preprocessor USE_PREPROCESSOR]
                       [--token_type {bpe,char,word,phn}]
                       [--bpemodel BPEMODEL]
                       [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                       [--cleaner {None,tacotron,jaconv,vietnamese}]
                       [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                       [--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
                       [--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
                       [--noise_scp NOISE_SCP]
                       [--noise_apply_prob NOISE_APPLY_PROB]
                       [--noise_db_range NOISE_DB_RANGE]
                       [--pred_masked_weight PRED_MASKED_WEIGHT]
                       [--pred_nomask_weight PRED_NOMASK_WEIGHT]
                       [--loss_weights LOSS_WEIGHTS]
                       [--hubert_dict HUBERT_DICT]
                       [--frontend {default,sliding_window}]
                       [--frontend_conf FRONTEND_CONF]
                       [--specaug {specaug,None}]
                       [--specaug_conf SPECAUG_CONF]
                       [--normalize {global_mvn,utterance_mvn,None}]
                       [--normalize_conf NORMALIZE_CONF]
                       [--preencoder {sinc,None}]
                       [--preencoder_conf PREENCODER_CONF]
                       [--encoder {hubert_pretrain}]
                       [--encoder_conf ENCODER_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --pred_masked_weight PRED_MASKED_WEIGHT
                        weight for predictive loss for masked frames (default: 1.0)
  --pred_nomask_weight PRED_NOMASK_WEIGHT
                        weight for predictive loss for unmasked frames (default: 0.0)
  --loss_weights LOSS_WEIGHTS
                        weights for additional loss terms (not first one) (default: 0.0)
  --hubert_dict HUBERT_DICT
                        word-based target dictionary for Hubert pretraining stage (default: ./dict.txt)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images to plot from the attention outputs. This option is meaningful only for attention-based models. Setting it to 0 disables the attention plot (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred to. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
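
The "env://" initialization described above reads four environment variables. A minimal sketch of collecting them, assuming a hypothetical helper name (read_env_init) that is not part of ESPnet:

```python
import os


def read_env_init(environ=None):
    """Collect the four variables that init_method="env://" refers to.

    read_env_init is a hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    names = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
    missing = [name for name in names if name not in environ]
    if missing:
        raise KeyError(f"missing environment variables: {missing}")
    return {
        "master_addr": environ["MASTER_ADDR"],
        "master_port": int(environ["MASTER_PORT"]),
        "world_size": int(environ["WORLD_SIZE"]),
        "rank": int(environ["RANK"]),
    }


# Example values are made up; a real launcher would export them.
example = read_env_init(
    {"MASTER_ADDR": "localhost", "MASTER_PORT": "29500", "WORLD_SIZE": "2", "RANK": "0"}
)
print(example)
```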

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair consisting of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging early stopping. Give a triplet consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give a triplet consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of the used p-norm for gradient clip. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        Whether to inject noise into the gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding or training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, the interval is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
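
The (phase, criterion, mode) triplets used by --early_stopping_criterion and --best_model_criterion can be read as "pick the epoch whose value is smallest or largest". A minimal sketch of that reading, assuming a hypothetical per-epoch history dictionary with made-up values; it is not the trainer's actual bookkeeping:

```python
def best_epoch(history, phase, criterion, mode):
    """Return the epoch whose history[epoch][phase][criterion] is best.

    history is a hypothetical structure used only for this illustration.
    """
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max': {mode}")
    pick = min if mode == "min" else max
    return pick(history, key=lambda epoch: history[epoch][phase][criterion])


# Made-up numbers for three epochs.
history = {
    1: {"train": {"loss": 2.1, "acc": 0.40}, "valid": {"loss": 2.3, "acc": 0.35}},
    2: {"train": {"loss": 1.6, "acc": 0.55}, "valid": {"loss": 1.9, "acc": 0.52}},
    3: {"train": {"loss": 1.2, "acc": 0.66}, "valid": {"loss": 2.0, "acc": 0.55}},
}

by_loss = best_epoch(history, "valid", "loss", "min")
by_acc = best_epoch(history, "valid", "acc", "max")
print(by_loss, by_acc)
```

With several triplets, as in the default of --best_model_criterion, each triplet can select a different epoch, which is why the two results above differ.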

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
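
The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format of --init_param can be split as sketched below. The function name parse_init_param is hypothetical, and treating exclude_keys as a comma-separated list is an assumption, not something the help text states:

```python
def parse_init_param(spec):
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into its fields.

    Trailing fields may be omitted; omitted fields come back as None.
    Assumption: several exclude keys are separated by commas.
    """
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':'-separated fields: {spec}")
    parts += [None] * (4 - len(parts))
    file_path, src_key, dst_key, exclude_keys = parts
    return {
        "file_path": file_path,
        "src_key": src_key,
        "dst_key": dst_key,
        "exclude_keys": exclude_keys.split(",") if exclude_keys else [],
    }


all_params = parse_init_param("some/where/model.pth")
decoder_only = parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed")
print(all_params)
print(decoder_only)
```

Because ':' is the separator, this sketch does not handle file paths that contain ':' themselves.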

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The first column is referred to, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in each mini-batch have similar lengths. This sampler requires a text file which describes the length for each sample

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is referred to, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L refers to the largest length of samples in the mini-batch. This sampler requires the same length information as SortedBatchSampler

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches which have as close to the same number of 'bins' as possible, counted by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length for each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is referred to, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches which have as close to the same number of 'bins' as possible, counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full information of the dimension of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
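
The "folded" formula and the difference between counting 'bins' by length and by number of elements can be sketched as follows. All function names here are hypothetical, and the handling of L < fold_length (where the documented formula would divide by zero) is an assumption rather than the sampler's confirmed behavior:

```python
import math


def folded_batch_size(base_batch_size, longest, fold_length, min_batch_size=1):
    """Apply batch_size = base_batch_size // (L // fold_length)."""
    folds = longest // fold_length
    if folds == 0:
        # Assumption: keep the base size when L is shorter than fold_length.
        return base_batch_size
    return max(min_batch_size, base_batch_size // folds)


def parse_shape_line(line):
    """Parse one shape-file line such as 'utterance_id_a 1000,80'."""
    key, shape = line.split(maxsplit=1)
    return key, [int(dim) for dim in shape.split(",")]


def count_bins(shape, batch_type):
    """'length' counts the first dimension; 'numel' counts all elements."""
    if batch_type == "length":
        return shape[0]
    if batch_type == "numel":
        return math.prod(shape)
    raise ValueError(f"unsupported batch_type for bins: {batch_type}")


key, shape = parse_shape_line("utterance_id_b 1453,80")
print(folded_batch_size(20, 4000, 800))
print(count_bins(shape, "length"), count_bins(shape, "numel"))
```

The same shape-file line therefore contributes 1453 bins under "length" but 116240 under "numel", which is why --batch_bins usually needs a different value for the two batch types.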

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is bigger than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
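
The three --chunk_length forms and the effect of --chunk_shift_ratio can be sketched as below. The function names are hypothetical, and two details are assumptions: the '300-400' range is treated as inclusive, and the shift is computed as int(chunk_length * shift_ratio):

```python
def parse_chunk_length(spec):
    """Expand '300', '300,400,500', or '300-400' into candidate lengths."""
    lengths = []
    for part in spec.split(","):
        if "-" in part:
            low, high = (int(value) for value in part.split("-"))
            # Assumption: both ends of the range are valid choices.
            lengths.extend(range(low, high + 1))
        else:
            lengths.append(int(part))
    return lengths


def chunk_starts(seq_len, chunk_len, shift_ratio):
    """Start offsets of chunks; empty when the sequence is too short."""
    shift = max(1, int(chunk_len * shift_ratio))
    return list(range(0, seq_len - chunk_len + 1, shift))


print(parse_chunk_length("300,400,500"))
print(chunk_starts(1000, 500, 0.5))  # overlapping chunks
print(chunk_starts(1000, 300, 1.5))  # gaps between chunks
print(chunk_starts(400, 500, 0.5))   # too short: no chunks
```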

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types which are supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "midi":
                        MIDI format types: mid, midi, etc.

                           utterance_id_a a.mid
                           utterance_id_b b.mid
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integer numbers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integer numbers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader, currently supported for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys for the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept open for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
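
The comma-separated triple given to --train_data_path_and_name_and_type, and the bounds embedded in a "rand_int_\d+_\d+" file type, can be unpacked as sketched below. The helper names are hypothetical and this is not the loader ESPnet itself uses:

```python
import re


def parse_data_spec(spec):
    """Split 'path,name,type' into its three fields."""
    parts = spec.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected three comma-separated words: {spec}")
    path, name, file_type = parts
    return path, name, file_type


# Matches file types of the form rand_int_<low>_<high>.
RAND_INT = re.compile(r"rand_int_(\d+)_(\d+)")


def rand_int_bounds(file_type):
    """Return (low, high) for a rand_int type, or None for other types."""
    match = RAND_INT.fullmatch(file_type)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


path, name, file_type = parse_data_spec("some/path/a.scp,foo,sound")
print(path, name, file_type)
print(rand_int_bounds("rand_int_0_10"))
print(rand_int_bounds(file_type))
```

Because the option is repeatable, each occurrence would be parsed this way and contribute one named entry to the mini-batch.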

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

Task related:
  --token_list TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --input_size INPUT_SIZE
                        The number of input dimensions of the feature (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'ignore_id': -1, 'lsm_weight': 0.0, 'length_normalized_loss': False, 'report_cer': False, 'report_wer': False, 'sym_space': '<space>', 'sym_blank': '<blank>', 'pred_masked_weight': 1.0, 'pred_nomask_weight': 0.0, 'loss_weights': 0.0})

Preprocess related:
  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn}
                        The text will be tokenized at the specified token level (default: bpe)
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)
  --speech_volume_normalize SPEECH_VOLUME_NORMALIZE
                        Scale the maximum amplitude to the given value. (default: None)
  --rir_scp RIR_SCP     The file path of rir scp file. (default: None)
  --rir_apply_prob RIR_APPLY_PROB
                        The probability of applying RIR convolution. (default: 1.0)
  --noise_scp NOISE_SCP
                        The file path of noise scp file. (default: None)
  --noise_apply_prob NOISE_APPLY_PROB
                        The probability of applying noise addition. (default: 1.0)
  --noise_db_range NOISE_DB_RANGE
                        The range of noise decibel level. (default: 13_15)
  --frontend {default,sliding_window}
                        The frontend type (default: default)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: utterance_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --preencoder {sinc,None}
                        The preencoder type (default: None)
  --preencoder_conf PREENCODER_CONF
                        The keyword arguments for preencoder (default: {})
  --encoder {hubert_pretrain}
                        The encoder type (default: hubert_pretrain)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
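
The default of --noise_db_range, 13_15, suggests a "low_high" pair joined by an underscore. A minimal sketch of parsing such a value, assuming that a single number means a fixed level; the function name is hypothetical:

```python
def parse_noise_db_range(spec):
    """Parse 'low_high' (e.g. '13_15') into a (low, high) pair of floats.

    Assumption: a single number such as '10' means low == high.
    """
    parts = spec.split("_")
    if len(parts) == 1:
        low = high = float(parts[0])
    elif len(parts) == 2:
        low, high = float(parts[0]), float(parts[1])
    else:
        raise ValueError(f"expected 'low_high' or a single number: {spec}")
    return low, high


print(parse_noise_db_range("13_15"))
print(parse_noise_db_range("10"))
```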

hugging_face_export_vocabulary.py

usage: hugging_face_export_vocabulary.py [-h]
                                         [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                                         --output OUTPUT --model_name_or_path
                                         MODEL_NAME_OR_PATH
                                         [--add_symbol ADD_SYMBOL]

Export Hugging Face vocabulary

optional arguments:
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output OUTPUT, -o OUTPUT
                        Output text. - indicates sys.stdout (default: None)
  --model_name_or_path MODEL_NAME_OR_PATH
                        Hugging Face model name or path (default: None)
  --add_symbol ADD_SYMBOL
                        Append symbol e.g. --add_symbol '<blank>:0'
                        --add_symbol '<unk>:1' (default: [])
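
The '<symbol>:<index>' form of --add_symbol can be read as "insert this symbol at this position of the vocabulary". A minimal sketch of that reading, assuming a hypothetical helper and assuming that a negative index counts from the end so that -1 appends:

```python
def apply_add_symbols(tokens, add_symbols):
    """Insert each '<symbol>:<index>' into a copy of the token list, in order."""
    tokens = list(tokens)
    for spec in add_symbols:
        # Split on the last ':' so that symbols containing ':' still work.
        symbol, _, index = spec.rpartition(":")
        idx = int(index)
        if idx < 0:
            # Assumption: -1 means "after the current last token".
            idx = len(tokens) + 1 + idx
        tokens.insert(idx, symbol)
    return tokens


vocab = apply_add_symbols(["a", "b"], ["<blank>:0", "<unk>:1"])
print(vocab)
```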

launch.py

usage: launch.py [-h] [--cmd CMD] [--log LOG]
                 [--max_num_log_files MAX_NUM_LOG_FILES] [--ngpu NGPU]
                 [--num_nodes NUM_NODES | --host HOST] [--envfile ENVFILE]
                 [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                 [--master_port MASTER_PORT] [--master_addr MASTER_ADDR]
                 [--init_file_prefix INIT_FILE_PREFIX]
                 args [args ...]

Launch distributed process with appropriate options.

positional arguments:
  args

optional arguments:
  --cmd CMD             The path of cmd script of Kaldi: run.pl, queue.pl, or
                        slurm.pl (default: utils/run.pl)
  --log LOG             The path of log file used by cmd (default: run.log)
  --max_num_log_files MAX_NUM_LOG_FILES
                        The maximum number of log-files to be kept (default:
                        1000)
  --ngpu NGPU           The number of GPUs per node (default: 1)
  --num_nodes NUM_NODES
                        The number of nodes (default: 1)
  --host HOST           Directly specify the host names. The jobs are submitted
                        via SSH. Multiple host names can be specified by
                        separating them with commas. e.g. host1,host2 You can
                        also specify the device ids after the host name with
                        ':'. e.g. host1:0:2:3,host2:0:2. If the device ids are
                        specified in this way, the value of --ngpu is ignored.
                        (default: None)
  --envfile ENVFILE     Source the shell script before executing command. This
                        option is used when --host is specified. (default:
                        path.sh)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use the distributed method when in single-node mode.
                        (default: True)
  --master_port MASTER_PORT
                        Specify the port number of the master. The master is
                        the host machine that runs the RANK0 process.
                        (default: None)
  --master_addr MASTER_ADDR
                        Specify the address of the master. The master is the
                        host machine that runs the RANK0 process. (default:
                        None)
  --init_file_prefix INIT_FILE_PREFIX
                        The file name prefix for init_file, which is used for
                        'Shared-file system initialization'. This option is
                        used when --master_port is not specified (default:
                        .dist_init_)
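
The --host value packs host names and optional device ids into one string. The following sketch shows one way such a string could be split; the function name parse_hosts and the fallback to range(ngpu) when no device ids are given are assumptions for illustration, not launch.py's actual code.

```python
def parse_hosts(host_arg, ngpu=1):
    """Map each host name to the list of device ids it should use."""
    hosts = {}
    for item in host_arg.split(","):
        name, *devices = item.split(":")
        if devices:
            # Explicit device ids take precedence over --ngpu for this host.
            hosts[name] = [int(d) for d in devices]
        else:
            # Assumption: without explicit ids, use devices 0..ngpu-1.
            hosts[name] = list(range(ngpu))
    return hosts


print(parse_hosts("host1:0:2:3,host2:0:2"))
print(parse_hosts("host1,host2", ngpu=2))
```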

lm_calc_perplexity.py

usage: lm_calc_perplexity.py [-h] [--config CONFIG]
                             [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                             --output_dir OUTPUT_DIR [--ngpu NGPU]
                             [--seed SEED] [--dtype {float16,float32,float64}]
                             [--num_workers NUM_WORKERS]
                             [--batch_size BATCH_SIZE] [--log_base LOG_BASE]
                             --data_path_and_name_and_type
                             DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                             [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                             [--train_config TRAIN_CONFIG]
                             [--model_file MODEL_FILE]

Calc perplexity

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --log_base LOG_BASE   The base of the logarithm for perplexity. If None,
                        Napier's constant (e) is used. (default: None)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --train_config TRAIN_CONFIG
  --model_file MODEL_FILE
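
To make the role of --log_base concrete, here is a minimal sketch of a perplexity computation with a selectable logarithm base. It assumes the per-token scores are natural-log probabilities; the function name perplexity is hypothetical and this is not the script's implementation.

```python
import math


def perplexity(token_log_probs, log_base=None):
    """Return (average NLL in the chosen base, perplexity).

    token_log_probs holds natural-log probabilities, one per token.
    """
    base = math.e if log_base is None else log_base
    avg_nll_nat = -sum(token_log_probs) / len(token_log_probs)
    avg_nll = avg_nll_nat / math.log(base)  # change of base
    return avg_nll, base ** avg_nll


log_probs = [math.log(0.25)] * 4
print(perplexity(log_probs))              # base e
print(perplexity(log_probs, log_base=2))  # base 2
```

In this formulation the perplexity itself does not depend on the base; only the reported average negative log-likelihood changes (nats for base e, bits for base 2).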

lm_train.py

usage: lm_train.py [-h] [--config CONFIG] [--print_config]
                   [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                   [--dry_run DRY_RUN]
                   [--iterator_type {sequence,chunk,task,none}]
                   [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                   [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                   [--dist_backend DIST_BACKEND]
                   [--dist_init_method DIST_INIT_METHOD]
                   [--dist_world_size DIST_WORLD_SIZE] [--dist_rank DIST_RANK]
                   [--local_rank LOCAL_RANK]
                   [--dist_master_addr DIST_MASTER_ADDR]
                   [--dist_master_port DIST_MASTER_PORT]
                   [--dist_launcher {slurm,mpi,None}]
                   [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                   [--unused_parameters UNUSED_PARAMETERS]
                   [--sharded_ddp SHARDED_DDP] [--cudnn_enabled CUDNN_ENABLED]
                   [--cudnn_benchmark CUDNN_BENCHMARK]
                   [--cudnn_deterministic CUDNN_DETERMINISTIC]
                   [--collect_stats COLLECT_STATS]
                   [--write_collected_feats WRITE_COLLECTED_FEATS]
                   [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                   [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                   [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                   [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                   [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                   [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                   [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                   [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                   [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                   [--train_dtype {float16,float32,float64}]
                   [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                   [--use_matplotlib USE_MATPLOTLIB]
                   [--use_tensorboard USE_TENSORBOARD]
                   [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                   [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                   [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                   [--wandb_name WANDB_NAME]
                   [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                   [--detect_anomaly DETECT_ANOMALY]
                   [--pretrain_path PRETRAIN_PATH]
                   [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                   [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                   [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                   [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                   [--batch_size BATCH_SIZE]
                   [--valid_batch_size VALID_BATCH_SIZE]
                   [--batch_bins BATCH_BINS]
                   [--valid_batch_bins VALID_BATCH_BINS]
                   [--train_shape_file TRAIN_SHAPE_FILE]
                   [--valid_shape_file VALID_SHAPE_FILE]
                   [--batch_type {unsorted,sorted,folded,length,numel}]
                   [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                   [--fold_length FOLD_LENGTH]
                   [--sort_in_batch {descending,ascending}]
                   [--sort_batch {descending,ascending}]
                   [--multiple_iterator MULTIPLE_ITERATOR]
                   [--chunk_length CHUNK_LENGTH]
                   [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                   [--num_cache_chunks NUM_CACHE_CHUNKS]
                   [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                   [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                   [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                   [--max_cache_size MAX_CACHE_SIZE]
                   [--max_cache_fd MAX_CACHE_FD]
                   [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                   [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                   [--optim_conf OPTIM_CONF]
                   [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                   [--scheduler_conf SCHEDULER_CONF] [--token_list TOKEN_LIST]
                   [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                   [--model_conf MODEL_CONF]
                   [--use_preprocessor USE_PREPROCESSOR]
                   [--token_type {bpe,char,word}] [--bpemodel BPEMODEL]
                   [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                   [--cleaner {None,tacotron,jaconv,vietnamese}]
                   [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                   [--lm {seq_rnn,transformer}] [--lm_conf LM_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images used to plot the attention outputs. This option only applies to attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referenced. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
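
When N processes are launched per node, each process needs a unique global rank. The sketch below follows the common PyTorch convention for deriving it from the node rank and the local rank; whether lm_train.py computes it exactly this way is an assumption, and both function names are hypothetical.

```python
def global_rank(node_rank, local_rank, ngpu):
    """Global process rank when every node runs ngpu processes."""
    return node_rank * ngpu + local_rank


def total_processes(num_nodes, ngpu):
    """Total number of processes across all nodes."""
    return num_nodes * ngpu


# Two nodes with four GPUs each: the third GPU on the second node.
print(global_rank(node_rank=1, local_rank=2, ngpu=4))
print(total_processes(num_nodes=2, ngpu=4))
```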

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used to decide early stopping. Give a triplet of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used to select the best model. Give a triplet of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
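
The interaction between --patience and the mode of --early_stopping_criterion can be sketched as follows. This is a minimal illustration, assuming training stops once the best epoch is more than `patience` epochs old; the function name should_stop and the per-epoch history list are hypothetical.

```python
def should_stop(history, patience, mode="min"):
    """Return True when the best value is more than `patience` epochs old.

    history is the per-epoch value of the chosen criterion, e.g. valid loss.
    """
    if patience is None or not history:
        return False
    pick = min if mode == "min" else max
    best_epoch = history.index(pick(history))  # first epoch reaching the best
    return len(history) - 1 - best_epoch > patience


valid_loss = [1.0, 0.9, 0.95, 0.96, 0.97]
print(should_stop(valid_loss, patience=2, mode="min"))
```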

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization, e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
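
The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format of --init_param can be illustrated with plain dictionaries standing in for model states. This is a sketch under stated assumptions: the helper names are hypothetical, multiple exclude keys are assumed to be comma-separated, and the real loader works on PyTorch state dicts rather than on the toy dict used here.

```python
def parse_init_param(spec):
    """Split an --init_param value into (path, src_key, dst_key, excludes)."""
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"Too many ':' separated fields: {spec}")
    parts += [None] * (4 - len(parts))
    path, src_key, dst_key, exclude = parts
    # Assumption: several exclude keys are separated by commas.
    excludes = exclude.split(",") if exclude else []
    return path, src_key, dst_key, excludes


def select_states(states, src_key, excludes):
    """Keep entries under src_key, strip that prefix, drop excluded keys."""
    selected = {}
    for name, value in states.items():
        if any(name.startswith(e) for e in excludes):
            continue
        if src_key is None:
            selected[name] = value
        elif name.startswith(src_key + "."):
            selected[name[len(src_key) + 1:]] = value
    return selected


states = {"encoder.w": 1, "decoder.embed.w": 2, "decoder.out.w": 3}
_, src, dst, exc = parse_init_param(
    "some/where/model.pth:decoder:decoder:decoder.embed"
)
print(select_states(states, src, exc))
```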

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The first column is referred to, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is referred to, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L refers to the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches whose number of 'bins' is as equal as possible, counted by the total length of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is referred to, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches whose number of 'bins' is as equal as possible, counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
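
The "folded" rule quoted under --batch_type, batch_size = base_batch_size // (L // fold_length), can be tried out with the small sketch below. The function name is hypothetical, and the two max(1, ...) guards (avoiding a zero divisor when L is shorter than fold_length, and avoiding an empty batch) are assumptions of this sketch; the actual FoldedBatchSampler may handle these cases differently.

```python
def folded_batch_size(base_batch_size, longest, fold_length):
    """Shrink the batch size as the longest sample in the batch grows."""
    # Assumption: clamp the divisor so short samples keep the base size.
    folds = max(1, longest // fold_length)
    # Assumption: never return an empty batch.
    return max(1, base_batch_size // folds)


for longest in (100, 450, 5000):
    print(longest, folded_batch_size(20, longest, fold_length=150))
```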

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle in the specified number of chunks and generate mini-batches. The larger this value, the more randomness can be obtained. (default: 1024)
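
The three accepted --chunk_length forms and the effect of --chunk_shift_ratio can be sketched as below. Both helper names are hypothetical, and treating '300-400' as an inclusive integer range is an assumption of this sketch rather than a statement about the iterator's code.

```python
def parse_chunk_lengths(spec):
    """Expand a --chunk_length string into the list of candidate lengths."""
    lengths = []
    for part in spec.split(","):
        if "-" in part:
            low, high = (int(x) for x in part.split("-"))
            # Assumption: the range is inclusive on both ends.
            lengths.extend(range(low, high + 1))
        else:
            lengths.append(int(part))
    return lengths


def chunk_starts(seq_len, chunk_len, shift_ratio):
    """Start offsets of the chunks cut from one sequence."""
    shift = max(1, int(chunk_len * shift_ratio))
    # An empty list means the sequence is shorter than the chunk length.
    return list(range(0, seq_len - chunk_len + 1, shift))


print(parse_chunk_lengths("300,400,500"))
print(chunk_starts(seq_len=1000, chunk_len=500, shift_ratio=0.5))
print(chunk_starts(seq_len=400, chunk_len=500, shift_ratio=0.5))
```

With a shift ratio of 0.5 consecutive chunks overlap by half their length; a ratio above 1 would leave gaps between them.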

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio file formats supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "midi":
                        MIDI file formats: mid, midi, etc.

                           utterance_id_a a.mid
                           utterance_id_b b.mid
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integer numbers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integer numbers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader, currently supported for speaker diarization.

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys for the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept open for ark files. This feature is only valid when the data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
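
A --train_data_path_and_name_and_type value and a "text_int" file can be read along the following lines. This is only an illustrative sketch; parse_data_spec and read_text_int are hypothetical helpers, not ESPnet2 functions, and the reader takes an iterable of lines rather than opening a real file.

```python
def parse_data_spec(spec):
    """Split 'path,name,type' into its three fields."""
    path, name, file_type = spec.split(",")
    return path, name, file_type


def read_text_int(lines):
    """Parse 'utterance_id int int ...' lines into a dict of int lists."""
    data = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, *values = line.split()
        data[key] = [int(v) for v in values]
    return data


print(parse_data_spec("some/path/a.scp,foo,sound"))
print(read_text_int(["utterance_id_A 12 0 1 3", "utterance_id_B 3 3 1"]))
```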

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

Task related:
  --token_list TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'ignore_id': 0})

Preprocess related:
  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word}
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --lm {seq_rnn,transformer}
                        The lm type (default: seq_rnn)
  --lm_conf LM_CONF     The keyword arguments for lm (default: {})

mt_inference.py

usage: mt_inference.py [-h] [--config CONFIG]
                       [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                       --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                       [--dtype {float16,float32,float64}]
                       [--num_workers NUM_WORKERS]
                       --data_path_and_name_and_type
                       DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                       [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                       [--mt_train_config MT_TRAIN_CONFIG]
                       [--mt_model_file MT_MODEL_FILE]
                       [--lm_train_config LM_TRAIN_CONFIG] [--lm_file LM_FILE]
                       [--word_lm_train_config WORD_LM_TRAIN_CONFIG]
                       [--word_lm_file WORD_LM_FILE] [--ngram_file NGRAM_FILE]
                       [--model_tag MODEL_TAG] [--batch_size BATCH_SIZE]
                       [--nbest NBEST] [--beam_size BEAM_SIZE]
                       [--penalty PENALTY] [--maxlenratio MAXLENRATIO]
                       [--minlenratio MINLENRATIO] [--lm_weight LM_WEIGHT]
                       [--ngram_weight NGRAM_WEIGHT]
                       [--token_type {char,bpe,None}] [--bpemodel BPEMODEL]

MT Decoding

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --mt_train_config MT_TRAIN_CONFIG
                        MT training configuration (default: None)
  --mt_model_file MT_MODEL_FILE
                        MT model parameter file (default: None)
  --lm_train_config LM_TRAIN_CONFIG
                        LM training configuration (default: None)
  --lm_file LM_FILE     LM parameter file (default: None)
  --word_lm_train_config WORD_LM_TRAIN_CONFIG
                        Word LM training configuration (default: None)
  --word_lm_file WORD_LM_FILE
                        Word LM parameter file (default: None)
  --ngram_file NGRAM_FILE
                        N-gram parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If this option is specified,
                        *_train_config and *_file will be overwritten
                        (default: None)

Beam-search related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --nbest NBEST         Output N-best hypotheses (default: 1)
  --beam_size BEAM_SIZE
                        Beam size (default: 20)
  --penalty PENALTY     Insertion penalty (default: 0.0)
  --maxlenratio MAXLENRATIO
                        Input length ratio to obtain max output length. If
                        maxlenratio=0.0 (default), it uses an end-detect
                        function to automatically find maximum hypothesis
                        lengths. If maxlenratio<0.0, its absolute value is
                        interpreted as a constant max output length (default:
                        0.0)
  --minlenratio MINLENRATIO
                        Input length ratio to obtain min output length
                        (default: 0.0)
  --lm_weight LM_WEIGHT
                        RNNLM weight (default: 1.0)
  --ngram_weight NGRAM_WEIGHT
                        ngram weight (default: 0.9)
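
The length options above can be read as a small piece of arithmetic. The following is a minimal sketch of that reading, not the tool's actual implementation: the function name length_limits is hypothetical, and capping the output at the input length when maxlenratio=0.0 is an assumption (the help text only says that end detection is used in that case).

```python
def length_limits(input_len, maxlenratio=0.0, minlenratio=0.0):
    """Return (minlen, maxlen, use_end_detect) for an input of input_len tokens."""
    if maxlenratio == 0.0:
        # Default: rely on end detection; the input-length cap is an assumption.
        maxlen, use_end_detect = input_len, True
    elif maxlenratio < 0.0:
        # Negative value: its absolute value is a constant max output length.
        maxlen, use_end_detect = int(-maxlenratio), False
    else:
        # Positive value: a ratio of the input length.
        maxlen, use_end_detect = max(1, int(maxlenratio * input_len)), False
    minlen = int(minlenratio * input_len)
    return minlen, maxlen, use_end_detect


print(length_limits(50))            # default: end detection
print(length_limits(50, -30))       # constant max length of 30
print(length_limits(50, 1.5, 0.5))  # ratios of the input length
```

Under these assumptions, a negative maxlenratio ignores the input length entirely, while a positive one scales with it.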

Text converter related:
  --token_type {char,bpe,None}
                        The token type for MT model. If not given, it is taken
                        from the training args (default: None)
  --bpemodel BPEMODEL   The model path of sentencepiece. If not given, it is
                        taken from the training args (default: None)

mt_train.py

usage: mt_train.py [-h] [--config CONFIG] [--print_config]
                   [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                   [--dry_run DRY_RUN]
                   [--iterator_type {sequence,chunk,task,none}]
                   [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                   [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                   [--dist_backend DIST_BACKEND]
                   [--dist_init_method DIST_INIT_METHOD]
                   [--dist_world_size DIST_WORLD_SIZE] [--dist_rank DIST_RANK]
                   [--local_rank LOCAL_RANK]
                   [--dist_master_addr DIST_MASTER_ADDR]
                   [--dist_master_port DIST_MASTER_PORT]
                   [--dist_launcher {slurm,mpi,None}]
                   [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                   [--unused_parameters UNUSED_PARAMETERS]
                   [--sharded_ddp SHARDED_DDP] [--cudnn_enabled CUDNN_ENABLED]
                   [--cudnn_benchmark CUDNN_BENCHMARK]
                   [--cudnn_deterministic CUDNN_DETERMINISTIC]
                   [--collect_stats COLLECT_STATS]
                   [--write_collected_feats WRITE_COLLECTED_FEATS]
                   [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                   [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                   [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                   [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                   [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                   [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                   [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                   [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                   [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                   [--train_dtype {float16,float32,float64}]
                   [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                   [--use_matplotlib USE_MATPLOTLIB]
                   [--use_tensorboard USE_TENSORBOARD]
                   [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                   [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                   [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                   [--wandb_name WANDB_NAME]
                   [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                   [--detect_anomaly DETECT_ANOMALY]
                   [--pretrain_path PRETRAIN_PATH]
                   [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                   [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                   [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                   [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                   [--batch_size BATCH_SIZE]
                   [--valid_batch_size VALID_BATCH_SIZE]
                   [--batch_bins BATCH_BINS]
                   [--valid_batch_bins VALID_BATCH_BINS]
                   [--train_shape_file TRAIN_SHAPE_FILE]
                   [--valid_shape_file VALID_SHAPE_FILE]
                   [--batch_type {unsorted,sorted,folded,length,numel}]
                   [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                   [--fold_length FOLD_LENGTH]
                   [--sort_in_batch {descending,ascending}]
                   [--sort_batch {descending,ascending}]
                   [--multiple_iterator MULTIPLE_ITERATOR]
                   [--chunk_length CHUNK_LENGTH]
                   [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                   [--num_cache_chunks NUM_CACHE_CHUNKS]
                   [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                   [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                   [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                   [--max_cache_size MAX_CACHE_SIZE]
                   [--max_cache_fd MAX_CACHE_FD]
                   [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                   [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                   [--optim_conf OPTIM_CONF]
                   [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                   [--scheduler_conf SCHEDULER_CONF] [--token_list TOKEN_LIST]
                   [--src_token_list SRC_TOKEN_LIST]
                   [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                   [--input_size INPUT_SIZE] [--model_conf MODEL_CONF]
                   [--use_preprocessor USE_PREPROCESSOR]
                   [--token_type {bpe,char,word,phn}]
                   [--src_token_type {bpe,char,word,phn}]
                   [--bpemodel BPEMODEL] [--src_bpemodel SRC_BPEMODEL]
                   [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                   [--cleaner {None,tacotron,jaconv,vietnamese}]
                   [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                   [--frontend {embed}] [--frontend_conf FRONTEND_CONF]
                   [--preencoder {sinc,linear,None}]
                   [--preencoder_conf PREENCODER_CONF]
                   [--encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,branchformer}]
                   [--encoder_conf ENCODER_CONF]
                   [--postencoder {hugging_face_transformers,None}]
                   [--postencoder_conf POSTENCODER_CONF]
                   [--decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
                   [--decoder_conf DECODER_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images used to plot the attention outputs. This option is meaningful only for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
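
As a rough illustration of the "env://" initialization described above, the sketch below collects the four environment variables that the help text names. The helper read_env_init and the shape of its return value are hypothetical; this is not ESPnet's or PyTorch's own code.

```python
import os

# The variable names come from the --dist_init_method help text.
REQUIRED_ENV = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def read_env_init(environ=os.environ):
    """Collect the env:// settings, failing early if any variable is missing."""
    missing = [key for key in REQUIRED_ENV if key not in environ]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "master_addr": environ["MASTER_ADDR"],
        "master_port": int(environ["MASTER_PORT"]),
        "world_size": int(environ["WORLD_SIZE"]),
        "rank": int(environ["RANK"]),
    }


example = {
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
    "WORLD_SIZE": "2",
    "RANK": "0",
}
print(read_env_init(example))
```

Checking for all four variables up front gives a single clear error instead of a failure partway through process-group setup.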

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Perform on "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair consisting of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used to judge early stopping. Give a triplet consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used to judge the best model. Give a triplet consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        Whether to inject noise into gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding or training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, the interval is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
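
The interaction between --patience and --early_stopping_criterion can be sketched as follows. This is a hedged illustration only: the function should_stop is hypothetical, and the choice to stop once the number of epochs since the best value exceeds the patience (rather than equals it) is an assumption.

```python
def should_stop(history, patience, mode="min"):
    """Decide whether to stop, given per-epoch values of the chosen criterion.

    history: one value per finished epoch, e.g. the "valid" "loss".
    patience: epochs to wait without improvement; None disables the check.
    mode: "min" if lower is better, "max" if higher is better.
    """
    if patience is None or not history:
        return False
    best = min(history) if mode == "min" else max(history)
    best_epoch = history.index(best)  # first epoch that reached the best value
    epochs_since_best = len(history) - 1 - best_epoch
    # Assumption: stop only after strictly more than `patience` epochs.
    return epochs_since_best > patience


valid_loss = [1.0, 0.8, 0.9, 0.95, 0.97]
print(should_stop(valid_loss, patience=2))       # best was 3 epochs ago
print(should_stop(valid_loss[:4], patience=2))   # best was 2 epochs ago
```

With the default patience of None the check never fires, so training runs for the full --max_epoch.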

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
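
The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format of --init_param can be split as in the sketch below. The function parse_init_param is hypothetical, and treating exclude_keys as a comma-separated list is an assumption that the help text does not state.

```python
def parse_init_param(spec):
    """Split an --init_param value into its up-to-four colon-separated fields."""
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError("expected at most 4 colon-separated fields: " + spec)
    # Pad the missing trailing fields with None.
    path, src_key, dst_key, excludes = parts + [None] * (4 - len(parts))
    return {
        "path": path,
        "src_key": src_key or None,
        "dst_key": dst_key or None,
        # Assumption: several excluded keys are separated by commas.
        "exclude_keys": excludes.split(",") if excludes else [],
    }


print(parse_init_param("some/where/model.pth"))
print(parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed"))
```

Note that this simple split would not handle a file path that itself contains a colon.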

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The first column is used, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is the largest length of the samples in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler creates mini-batches that have as nearly the same number of 'bins' as possible, counted by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler creates mini-batches that have as nearly the same number of 'bins' as possible, counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
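
The batch-size arithmetic described for the "folded", "length" and "numel" batch types can be sketched as below. All function names are hypothetical. The folded formula is the one quoted in the help text; the guard that keeps the divisor at 1 or more when L is shorter than fold_length is an assumption added so that the sketch never divides by zero.

```python
import math


def parse_shape_line(line):
    """Parse a shape-file line such as 'utterance_id_a 1000,80'."""
    key, dims = line.split()
    return key, tuple(int(d) for d in dims.split(","))


def folded_batch_size(base_batch_size, longest, fold_length):
    """batch_size = base_batch_size // (L // fold_length), with a guard."""
    factor = max(1, longest // fold_length)  # guard is an assumption
    return max(1, base_batch_size // factor)


def bins_of(shape, batch_type):
    """Count the 'bins' one sample contributes for 'length' or 'numel'."""
    if batch_type == "length":
        return shape[0]           # only the first dimension (the length)
    if batch_type == "numel":
        return math.prod(shape)   # every element of the feature
    raise ValueError("unsupported batch_type: " + batch_type)


key, shape = parse_shape_line("utterance_id_a 1000,80")
print(key, shape)
print(folded_batch_size(20, longest=300, fold_length=150))
print(bins_of(shape, "length"), bins_of(shape, "numel"))
```

The difference between "length" and "numel" is visible in the last line: the same sample counts as 1000 bins by length but 80000 bins by number of elements, which is why "numel" needs the full shape.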

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, the chunks overlap, and if it is greater than 1, there are gaps between the chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
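
The --chunk_length syntax and the effect of --chunk_shift_ratio can be sketched as follows. The helper names are hypothetical; treating '300-400' as an inclusive range and computing the shift as int(chunk_length * ratio) without any random offset are assumptions made for illustration.

```python
def parse_chunk_length(spec):
    """Expand '300', '300,400,500' or '300-400' into candidate chunk lengths."""
    lengths = []
    for part in str(spec).split(","):
        part = part.strip()
        if "-" in part:
            low, high = (int(v) for v in part.split("-"))
            lengths.extend(range(low, high + 1))  # inclusive range (assumption)
        else:
            lengths.append(int(part))
    return lengths


def chunk_starts(seq_len, chunk_length, shift_ratio=0.5):
    """Start offsets of the chunks cut from one sequence."""
    if seq_len < chunk_length:
        return []  # too short for this chunk length
    shift = max(1, int(chunk_length * shift_ratio))
    return list(range(0, seq_len - chunk_length + 1, shift))


print(parse_chunk_length("300,400,500"))
print(chunk_starts(1000, 500, 0.5))   # overlapping chunks
print(chunk_starts(1000, 250, 2.0))   # gaps between chunks
```

A ratio below 1 makes consecutive start offsets closer together than the chunk length, so neighbouring chunks share frames; a ratio above 1 skips frames between them.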

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types which are supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "midi":
                        MIDI format types: mid, midi, etc.

                           utterance_id_a a.mid
                           utterance_id_b b.mid
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of floats separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of floats separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. The lower and upper values are given by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader, currently supported for speaker diarization.

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys in the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept open for ark files. This feature is only valid when the data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
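
The 'path,name,type' triplet and a few of the plain-text file types listed above can be parsed as in the sketch below. The helper names are hypothetical and this is not ESPnet's own loader; it only mirrors the line formats shown in the help text for "text_int", "csv_int", "text_float", "csv_float", "text" and "rand_int_\d+_\d+".

```python
import re


def parse_data_triplet(spec):
    """Split 'some/path/a.scp,foo,sound' into (path, name, type)."""
    path, name, file_type = spec.split(",")
    return path, name, file_type


def parse_line(line, file_type):
    """Parse one '<utterance_id> <value>' line of a plain-text file type."""
    key, _, rest = line.strip().partition(" ")
    if file_type == "text_int":
        return key, [int(v) for v in rest.split()]
    if file_type == "csv_int":
        return key, [int(v) for v in rest.split(",")]
    if file_type == "text_float":
        return key, [float(v) for v in rest.split()]
    if file_type == "csv_float":
        return key, [float(v) for v in rest.split(",")]
    if file_type == "text":
        return key, rest  # returned as is; conversion happens in preprocessing
    raise ValueError("unsupported file type in this sketch: " + file_type)


def rand_int_bounds(file_type):
    """Extract the lower and upper values from e.g. 'rand_int_0_10'."""
    match = re.fullmatch(r"rand_int_(\d+)_(\d+)", file_type)
    if match is None:
        raise ValueError("not a rand_int type: " + file_type)
    return int(match.group(1)), int(match.group(2))


print(parse_data_triplet("some/path/a.scp,foo,sound"))
print(parse_line("utterance_id_A 12 0 1 3", "text_int"))
print(parse_line("utterance_id_A 12.,3.1,3.4,4.4", "csv_float"))
print(rand_int_bounds("rand_int_0_10"))
```

Binary types such as "sound", "kaldi_ark", "npy" and "hdf5" are left out because they need their respective reader libraries.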

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

Task related:
  --token_list TOKEN_LIST
                        A text mapping int-id to token (for target language) (default: None)
  --src_token_list SRC_TOKEN_LIST
                        A text mapping int-id to token (for source language) (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --input_size INPUT_SIZE
                        The number of input dimensions of the feature (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'src_vocab_size': 0, 'src_token_list': [], 'ignore_id': -1, 'lsm_weight': 0.0, 'length_normalized_loss': False, 'report_bleu': True, 'sym_space': '<space>', 'sym_blank': '<blank>', 'extract_feats_in_collect_stats': True, 'share_decoder_input_output_embed': False, 'share_encoder_decoder_input_embed': False})

Preprocess related:
  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn}
                        The target text will be tokenized at the specified token level (default: bpe)
  --src_token_type {bpe,char,word,phn}
                        The source text will be tokenized at the specified token level (default: bpe)
  --bpemodel BPEMODEL   The model file of sentencepiece (for target language) (default: None)
  --src_bpemodel SRC_BPEMODEL
                        The model file of sentencepiece (for source language) (default: None)
  --frontend {embed}    The frontend type (default: embed)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --preencoder {sinc,linear,None}
                        The preencoder type (default: None)
  --preencoder_conf PREENCODER_CONF
                        The keyword arguments for preencoder (default: {})
  --encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,branchformer}
                        The encoder type (default: rnn)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --postencoder {hugging_face_transformers,None}
                        The postencoder type (default: None)
  --postencoder_conf POSTENCODER_CONF
                        The keyword arguments for postencoder (default: {})
  --decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
                        The decoder type (default: rnn)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})

pack.py

usage: pack.py [-h] {asr,st,tts,enh,diar,svs,enh_s2t} ...

Pack input files to archive format

positional arguments:
  {asr,st,tts,enh,diar,svs,enh_s2t}

optional arguments:

split_scps.py

usage: split_scps.py [-h]
                     [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                     --scps SCPS [SCPS ...] [--names NAMES [NAMES ...]]
                     [--num_splits NUM_SPLITS] --output_dir OUTPUT_DIR

Split scp files

optional arguments:
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --scps SCPS [SCPS ...]
                        Input texts (default: None)
  --names NAMES [NAMES ...]
                        Output names for each file (default: None)
  --num_splits NUM_SPLITS
                        Split number (default: None)
  --output_dir OUTPUT_DIR
                        Output directory (default: None)
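
As a rough illustration of what splitting an scp file into --num_splits parts can look like, the sketch below divides the lines of one scp file into contiguous parts whose sizes differ by at most one. The helper name split_lines and the contiguous-chunk strategy are assumptions made for illustration; they are not taken from split_scps.py, which may partition the lines differently.

```python
from typing import List


def split_lines(lines: List[str], num_splits: int) -> List[List[str]]:
    """Split lines into num_splits contiguous parts of nearly equal size.

    Hypothetical helper for illustration only.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be >= 1")
    base, extra = divmod(len(lines), num_splits)
    parts = []
    start = 0
    for i in range(num_splits):
        # The first `extra` parts receive one additional line.
        size = base + (1 if i < extra else 0)
        parts.append(lines[start:start + size])
        start += size
    return parts


scp = ["utt_a a.wav", "utt_b b.wav", "utt_c c.wav", "utt_d d.wav", "utt_e e.wav"]
parts = split_lines(scp, 2)
```

With five lines and two splits, the first part holds three lines and the second holds two.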

st_inference.py

usage: st_inference.py [-h] [--config CONFIG]
                       [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                       --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                       [--dtype {float16,float32,float64}]
                       [--num_workers NUM_WORKERS]
                       --data_path_and_name_and_type
                       DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                       [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                       [--st_train_config ST_TRAIN_CONFIG]
                       [--st_model_file ST_MODEL_FILE]
                       [--lm_train_config LM_TRAIN_CONFIG] [--lm_file LM_FILE]
                       [--word_lm_train_config WORD_LM_TRAIN_CONFIG]
                       [--word_lm_file WORD_LM_FILE] [--ngram_file NGRAM_FILE]
                       [--model_tag MODEL_TAG] [--enh_s2t_task ENH_S2T_TASK]
                       [--batch_size BATCH_SIZE] [--nbest NBEST]
                       [--beam_size BEAM_SIZE] [--penalty PENALTY]
                       [--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
                       [--lm_weight LM_WEIGHT] [--ngram_weight NGRAM_WEIGHT]
                       [--token_type {char,bpe,None}] [--bpemodel BPEMODEL]

ST Decoding

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --st_train_config ST_TRAIN_CONFIG
                        ST training configuration (default: None)
  --st_model_file ST_MODEL_FILE
                        ST model parameter file (default: None)
  --lm_train_config LM_TRAIN_CONFIG
                        LM training configuration (default: None)
  --lm_file LM_FILE     LM parameter file (default: None)
  --word_lm_train_config WORD_LM_TRAIN_CONFIG
                        Word LM training configuration (default: None)
  --word_lm_file WORD_LM_FILE
                        Word LM parameter file (default: None)
  --ngram_file NGRAM_FILE
                        N-gram parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If this option is specified,
                        *_train_config and *_file will be overwritten
                        (default: None)
  --enh_s2t_task ENH_S2T_TASK
                        enhancement and asr joint model (default: False)

Beam-search related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --nbest NBEST         Output N-best hypotheses (default: 1)
  --beam_size BEAM_SIZE
                        Beam size (default: 20)
  --penalty PENALTY     Insertion penalty (default: 0.0)
  --maxlenratio MAXLENRATIO
                        Input length ratio to obtain max output length. If
                        maxlenratio=0.0 (default), it uses an end-detect
                        function to automatically find maximum hypothesis
                        lengths. If maxlenratio<0.0, its absolute value is
                        interpreted as a constant max output length (default:
                        0.0)
  --minlenratio MINLENRATIO
                        Input length ratio to obtain min output length
                        (default: 0.0)
  --lm_weight LM_WEIGHT
                        RNNLM weight (default: 1.0)
  --ngram_weight NGRAM_WEIGHT
                        ngram weight (default: 0.9)
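
The three cases described for --maxlenratio, together with --minlenratio, can be sketched as a small function. The name output_length_bounds is hypothetical. For maxlenratio=0.0 the sketch assumes the input length serves as a hard upper cap while end detection (not shown here) decides when decoding actually stops.

```python
from typing import Tuple


def output_length_bounds(
    input_length: int, maxlenratio: float, minlenratio: float
) -> Tuple[int, int]:
    """Return (minlen, maxlen) for decoding. Hypothetical helper."""
    if maxlenratio == 0.0:
        # Assumption: end detection stops decoding; the input length is only a cap.
        maxlen = input_length
    elif maxlenratio < 0.0:
        # The absolute value is a constant maximum output length.
        maxlen = int(-maxlenratio)
    else:
        # The maximum output length is a ratio of the input length.
        maxlen = max(1, int(maxlenratio * input_length))
    minlen = int(minlenratio * input_length)
    return minlen, maxlen


bounds_default = output_length_bounds(200, 0.0, 0.0)
bounds_constant = output_length_bounds(200, -50.0, 0.0)
bounds_ratio = output_length_bounds(200, 0.5, 0.25)
```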

Text converter related:
  --token_type {char,bpe,None}
                        The token type for ST model. If not given, it is taken
                        from the training args (default: None)
  --bpemodel BPEMODEL   The model path of sentencepiece. If not given, it is
                        taken from the training args (default: None)

st_train.py

usage: st_train.py [-h] [--config CONFIG] [--print_config]
                   [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                   [--dry_run DRY_RUN]
                   [--iterator_type {sequence,chunk,task,none}]
                   [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                   [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                   [--dist_backend DIST_BACKEND]
                   [--dist_init_method DIST_INIT_METHOD]
                   [--dist_world_size DIST_WORLD_SIZE] [--dist_rank DIST_RANK]
                   [--local_rank LOCAL_RANK]
                   [--dist_master_addr DIST_MASTER_ADDR]
                   [--dist_master_port DIST_MASTER_PORT]
                   [--dist_launcher {slurm,mpi,None}]
                   [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                   [--unused_parameters UNUSED_PARAMETERS]
                   [--sharded_ddp SHARDED_DDP] [--cudnn_enabled CUDNN_ENABLED]
                   [--cudnn_benchmark CUDNN_BENCHMARK]
                   [--cudnn_deterministic CUDNN_DETERMINISTIC]
                   [--collect_stats COLLECT_STATS]
                   [--write_collected_feats WRITE_COLLECTED_FEATS]
                   [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                   [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                   [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                   [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                   [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                   [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                   [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                   [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                   [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                   [--train_dtype {float16,float32,float64}]
                   [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                   [--use_matplotlib USE_MATPLOTLIB]
                   [--use_tensorboard USE_TENSORBOARD]
                   [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                   [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                   [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                   [--wandb_name WANDB_NAME]
                   [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                   [--detect_anomaly DETECT_ANOMALY]
                   [--pretrain_path PRETRAIN_PATH]
                   [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                   [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                   [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                   [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                   [--batch_size BATCH_SIZE]
                   [--valid_batch_size VALID_BATCH_SIZE]
                   [--batch_bins BATCH_BINS]
                   [--valid_batch_bins VALID_BATCH_BINS]
                   [--train_shape_file TRAIN_SHAPE_FILE]
                   [--valid_shape_file VALID_SHAPE_FILE]
                   [--batch_type {unsorted,sorted,folded,length,numel}]
                   [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                   [--fold_length FOLD_LENGTH]
                   [--sort_in_batch {descending,ascending}]
                   [--sort_batch {descending,ascending}]
                   [--multiple_iterator MULTIPLE_ITERATOR]
                   [--chunk_length CHUNK_LENGTH]
                   [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                   [--num_cache_chunks NUM_CACHE_CHUNKS]
                   [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                   [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                   [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                   [--max_cache_size MAX_CACHE_SIZE]
                   [--max_cache_fd MAX_CACHE_FD]
                   [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                   [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                   [--optim_conf OPTIM_CONF]
                   [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                   [--scheduler_conf SCHEDULER_CONF] [--token_list TOKEN_LIST]
                   [--src_token_list SRC_TOKEN_LIST]
                   [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                   [--input_size INPUT_SIZE] [--ctc_conf CTC_CONF]
                   [--model_conf MODEL_CONF]
                   [--use_preprocessor USE_PREPROCESSOR]
                   [--token_type {bpe,char,word,phn}]
                   [--src_token_type {bpe,char,word,phn,none}]
                   [--bpemodel BPEMODEL] [--src_bpemodel SRC_BPEMODEL]
                   [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                   [--cleaner {None,tacotron,jaconv,vietnamese}]
                   [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                   [--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
                   [--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
                   [--noise_scp NOISE_SCP]
                   [--noise_apply_prob NOISE_APPLY_PROB]
                   [--noise_db_range NOISE_DB_RANGE]
                   [--short_noise_thres SHORT_NOISE_THRES]
                   [--frontend {default,sliding_window,s3prl}]
                   [--frontend_conf FRONTEND_CONF] [--specaug {specaug,None}]
                   [--specaug_conf SPECAUG_CONF]
                   [--normalize {global_mvn,utterance_mvn,None}]
                   [--normalize_conf NORMALIZE_CONF]
                   [--preencoder {sinc,linear,None}]
                   [--preencoder_conf PREENCODER_CONF]
                   [--encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain}]
                   [--encoder_conf ENCODER_CONF]
                   [--postencoder {hugging_face_transformers,None}]
                   [--postencoder_conf POSTENCODER_CONF]
                   [--decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
                   [--decoder_conf DECODER_CONF]
                   [--extra_asr_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
                   [--extra_asr_decoder_conf EXTRA_ASR_DECODER_CONF]
                   [--extra_mt_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
                   [--extra_mt_decoder_conf EXTRA_MT_DECODER_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images to plot from the attention outputs. This option is meaningful only for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair referring to the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging early stopping. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The p-norm type used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        Whether to inject noise into gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show logs every given number of iterations in each epoch during the training phase. If None is given, the interval is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
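
To make the (phase, criterion, mode) triples used by --early_stopping_criterion and --best_model_criterion concrete, the sketch below picks the best epoch from a per-epoch history and applies a --patience check. The names best_epoch and should_stop, the shape of the history records, and the exact stopping rule (stop once more than `patience` epochs have passed since the best one) are assumptions made for illustration, not the trainer's actual code.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-epoch record: {phase: {criterion_name: value}}
EpochStats = Dict[str, Dict[str, float]]


def best_epoch(history: List[EpochStats], criterion: Tuple[str, str, str]) -> int:
    """Return the 1-based index of the best epoch for (phase, name, mode)."""
    phase, name, mode = criterion
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max': {mode}")
    values = [stats[phase][name] for stats in history]
    pick = min if mode == "min" else max
    return values.index(pick(values)) + 1


def should_stop(
    history: List[EpochStats],
    criterion: Tuple[str, str, str],
    patience: Optional[int],
) -> bool:
    """Assumed rule: stop when more than `patience` epochs passed since the best."""
    if patience is None:
        return False  # no early stopping without --patience
    return len(history) - best_epoch(history, criterion) > patience


history = [
    {"valid": {"loss": 1.00, "acc": 0.60}},
    {"valid": {"loss": 0.80, "acc": 0.70}},
    {"valid": {"loss": 0.90, "acc": 0.72}},
    {"valid": {"loss": 0.85, "acc": 0.71}},
]
best_by_loss = best_epoch(history, ("valid", "loss", "min"))
best_by_acc = best_epoch(history, ("valid", "acc", "max"))
```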

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization, e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
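
The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format accepted by --init_param can be decomposed as in the sketch below. The names InitParamSpec and parse_init_param are hypothetical, and treating exclude_keys as a comma-separated list is an assumption; the sketch only splits the string and does not load any model file.

```python
from typing import List, NamedTuple, Optional


class InitParamSpec(NamedTuple):
    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: List[str]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split an --init_param value into its (up to four) fields."""
    fields = spec.split(":")
    if len(fields) > 4:
        raise ValueError(f"too many ':'-separated fields: {spec}")
    # Missing trailing fields are treated as empty.
    fields += [""] * (4 - len(fields))
    file_path, src_key, dst_key, exclude = fields
    # Assumption: several exclude keys are separated by commas.
    exclude_keys = [key for key in exclude.split(",") if key]
    return InitParamSpec(file_path, src_key or None, dst_key or None, exclude_keys)


all_params = parse_init_param("some/where/model.pth")
decoder_only = parse_init_param("some/where/model.pth:decoder:decoder")
without_embed = parse_init_param(
    "some/where/model.pth:decoder:decoder:decoder.embed"
)
```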

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        Only the first column is used, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L refers to the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches which have, as far as possible, the same number of 'bins', counted as the total length of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches which have, as far as possible, the same number of 'bins', but counted as the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
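
The "folded" and "length" rules described under --batch_type can be sketched as follows. Both function names are hypothetical. Two details are assumptions rather than documented behavior: the guard for L < fold_length (where the stated formula would divide by zero) and the greedy grouping used to fill bins by summed sample length.

```python
from typing import List, Tuple


def folded_batch_size(
    base_batch_size: int, longest: int, fold_length: int, min_batch_size: int = 1
) -> int:
    """batch_size = base_batch_size // (L // fold_length), with guards.

    Assumption: if L < fold_length the base batch size is kept, and the
    result never drops below min_batch_size.
    """
    folds = longest // fold_length
    if folds == 0:
        return base_batch_size
    return max(min_batch_size, base_batch_size // folds)


def length_batches(lengths: List[Tuple[str, int]], batch_bins: int) -> List[List[str]]:
    """Greedily group length-sorted samples so summed lengths stay within batch_bins."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_bins = 0
    for utt_id, length in sorted(lengths, key=lambda item: item[1]):
        # Start a new batch when adding this sample would exceed the bins.
        if current and current_bins + length > batch_bins:
            batches.append(current)
            current, current_bins = [], 0
        current.append(utt_id)
        current_bins += length
    if current:
        batches.append(current)
    return batches


shape = [("utterance_id_a", 1000), ("utterance_id_b", 1453), ("utterance_id_c", 1241)]
batches = length_batches(shape, batch_bins=2500)
folded = folded_batch_size(base_batch_size=20, longest=1453, fold_length=500)
```

With the shape entries from the examples above and 2500 bins, the two shortest utterances share one batch and the longest one forms its own batch.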

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
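
The --chunk_length formats and the effect of --chunk_shift_ratio can be illustrated with the sketch below. The function names are hypothetical. Treating '300-400' as a range that includes both ends, and starting the first chunk at offset 0 rather than at a random offset, are simplifying assumptions.

```python
from typing import List


def parse_chunk_length(spec: str) -> List[int]:
    """Expand '300', '300,400,500', or '300-400' into candidate chunk lengths."""
    lengths: List[int] = []
    for field in spec.split(","):
        bounds = [int(value) for value in field.split("-")]
        if len(bounds) == 1:
            lengths.append(bounds[0])
        elif len(bounds) == 2:
            low, high = bounds
            if low > high:
                raise ValueError(f"invalid range: {field}")
            # Assumption: the range includes both ends.
            lengths.extend(range(low, high + 1))
        else:
            raise ValueError(f"invalid chunk length field: {field}")
    return lengths


def chunk_starts(seq_length: int, chunk_length: int, shift_ratio: float) -> List[int]:
    """Return the start offsets of the chunks cut from one sequence."""
    if seq_length < chunk_length:
        return []  # the sample is too short for this chunk length
    shift = max(1, int(chunk_length * shift_ratio))
    return list(range(0, seq_length - chunk_length + 1, shift))


candidates = parse_chunk_length("300,400,500")
overlapping = chunk_starts(1000, 500, 0.5)
with_gaps = chunk_starts(1500, 500, 2.0)
```

With a ratio of 0.5 each chunk overlaps half of the previous one, while a ratio of 2.0 leaves a gap of one chunk length between consecutive chunks.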

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types which are supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "midi":
                        MIDI format types which are supported by sndfile: mid, midi, etc.

                           utterance_id_a a.mid
                           utterance_id_b b.mid
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integer numbers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integer numbers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file in which is written a sequence of float numbers separated by space.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file in which is written a sequence of float numbers separated by comma.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate a random int-ndarray which has the shapes given in the file. The lower and upper values are given by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader, currently supported for speaker diarization.

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys in the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to keep open for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
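
The 'path,name,type' triple passed to --train_data_path_and_name_and_type, and the line layout of the "text_int" and "csv_int" file types, can be sketched as below. Both helper names are hypothetical. The parser works on in-memory lines and is not the loader ESPnet2 uses.

```python
from typing import Dict, Iterable, List, Tuple


def parse_data_triple(value: str) -> Tuple[str, str, str]:
    """Split 'some/path/a.scp,foo,sound' into (path, name, type)."""
    fields = value.split(",")
    if len(fields) != 3:
        raise ValueError(f"expected 'path,name,type': {value}")
    path, name, file_type = fields
    return path, name, file_type


def parse_int_lines(lines: Iterable[str], file_type: str) -> Dict[str, List[int]]:
    """Parse 'text_int' (space-separated) or 'csv_int' (comma-separated) lines."""
    # str.split(None) splits on any run of whitespace.
    delimiter = {"text_int": None, "csv_int": ","}[file_type]
    table: Dict[str, List[int]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            raise ValueError(f"missing values for: {line}")
        utt_id, rest = parts
        table[utt_id] = [int(value) for value in rest.split(delimiter)]
    return table


triple = parse_data_triple("some/path/a.scp,foo,sound")
tokens = parse_int_lines(["utterance_id_A 12 0 1 3", "utterance_id_B 3 3 1"], "text_int")
shapes = parse_int_lines(["utterance_id_A 100,80", "utterance_id_B 143,80"], "csv_int")
```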

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

Task related:
  --token_list TOKEN_LIST
                        A text mapping int-id to token (for target language) (default: None)
  --src_token_list SRC_TOKEN_LIST
                        A text mapping int-id to token (for source language) (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --input_size INPUT_SIZE
                        The number of input dimensions of the feature (default: None)
  --ctc_conf CTC_CONF   The keyword arguments for CTC class. (default: {'dropout_rate': 0.0, 'ctc_type': 'builtin', 'reduce': True, 'ignore_nan_grad': None, 'zero_infinity': True})
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'asr_weight': 0.0, 'mt_weight': 0.0, 'mtlalpha': 0.0, 'ignore_id': -1, 'lsm_weight': 0.0, 'length_normalized_loss': False, 'report_cer': True, 'report_wer': True, 'report_bleu': True, 'sym_space': '<space>', 'sym_blank': '<blank>', 'extract_feats_in_collect_stats': True})

Preprocess related:
  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn}
                        The target text will be tokenized at the specified token level (default: bpe)
  --src_token_type {bpe,char,word,phn,none}
                        The source text will be tokenized at the specified token level (default: bpe)
  --bpemodel BPEMODEL   The model file of sentencepiece (for target language) (default: None)
  --src_bpemodel SRC_BPEMODEL
                        The model file of sentencepiece (for source language) (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)
  --speech_volume_normalize SPEECH_VOLUME_NORMALIZE
                        Scale the maximum amplitude to the given value. (default: None)
  --rir_scp RIR_SCP     The file path of rir scp file. (default: None)
  --rir_apply_prob RIR_APPLY_PROB
                        The probability of applying RIR convolution. (default: 1.0)
  --noise_scp NOISE_SCP
                        The file path of noise scp file. (default: None)
  --noise_apply_prob NOISE_APPLY_PROB
                        The probability of adding noise. (default: 1.0)
  --noise_db_range NOISE_DB_RANGE
                        The range of noise decibel level. (default: 13_15)
  --short_noise_thres SHORT_NOISE_THRES
                        If len(noise) / len(speech) is smaller than this threshold during dynamic mixing, a warning will be displayed. (default: 0.5)
  --frontend {default,sliding_window,s3prl}
                        The frontend type (default: default)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: utterance_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --preencoder {sinc,linear,None}
                        The preencoder type (default: None)
  --preencoder_conf PREENCODER_CONF
                        The keyword arguments for preencoder (default: {})
  --encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain}
                        The encoder type (default: rnn)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --postencoder {hugging_face_transformers,None}
                        The postencoder type (default: None)
  --postencoder_conf POSTENCODER_CONF
                        The keyword arguments for postencoder (default: {})
  --decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
                        The decoder type (default: rnn)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})
  --extra_asr_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
                        The extra_asr_decoder type (default: rnn)
  --extra_asr_decoder_conf EXTRA_ASR_DECODER_CONF
                        The keyword arguments for extra_asr_decoder (default: {})
  --extra_mt_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
                        The extra_mt_decoder type (default: rnn)
  --extra_mt_decoder_conf EXTRA_MT_DECODER_CONF
                        The keyword arguments for extra_mt_decoder (default: {})

tokenize_text.py

usage: tokenize_text.py [-h]
                        [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                        --input INPUT --output OUTPUT [--field FIELD]
                        [--token_type {char,bpe,word,phn}]
                        [--delimiter DELIMITER] [--space_symbol SPACE_SYMBOL]
                        [--bpemodel BPEMODEL]
                        [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                        [--remove_non_linguistic_symbols REMOVE_NON_LINGUISTIC_SYMBOLS]
                        [--cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}]
                        [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                        [--write_vocabulary WRITE_VOCABULARY]
                        [--vocabulary_size VOCABULARY_SIZE] [--cutoff CUTOFF]
                        [--add_symbol ADD_SYMBOL]

Tokenize texts

optional arguments:
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --input INPUT, -i INPUT
                        Input text. - indicates sys.stdin (default: None)
  --output OUTPUT, -o OUTPUT
                        Output text. - indicates sys.stdout (default: None)
  --field FIELD, -f FIELD
                        The target columns of the input text as 1-based
                        integers, e.g. 2- (default: None)
  --token_type {char,bpe,word,phn}, -t {char,bpe,word,phn}
                        Token type (default: char)
  --delimiter DELIMITER, -d DELIMITER
                        The delimiter (default: None)
  --space_symbol SPACE_SYMBOL
                        The space symbol (default: <space>)
  --bpemodel BPEMODEL   The bpemodel file path (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --remove_non_linguistic_symbols REMOVE_NON_LINGUISTIC_SYMBOLS
                        Remove non-language-symbols from tokens (default:
                        False)
  --cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)

write_vocabulary mode related:
  --write_vocabulary WRITE_VOCABULARY
                        Write tokens list instead of tokenized text per line
                        (default: False)
  --vocabulary_size VOCABULARY_SIZE
                        Vocabulary size (default: 0)
  --cutoff CUTOFF       cut-off frequency used for write-vocabulary mode
                        (default: 0)
  --add_symbol ADD_SYMBOL
                        Append symbol e.g. --add_symbol '<blank>:0'
                        --add_symbol '<unk>:1' (default: [])
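
The write_vocabulary options above can be pictured with a small, self-contained sketch. It is not ESPnet's implementation: the helper names `char_tokenize` and `build_vocabulary` are hypothetical. The sketch also assumes that a token is kept only when its count is strictly greater than the cutoff, that tokens are ordered by descending frequency, and that a negative index in `--add_symbol` counts from the end of the list.

```python
from collections import Counter


def char_tokenize(line, space_symbol="<space>"):
    """Split a line into characters, writing spaces as the space symbol."""
    return [space_symbol if ch == " " else ch for ch in line.strip()]


def build_vocabulary(lines, cutoff=0, add_symbols=()):
    """Return a token list, roughly in the spirit of --write_vocabulary true.

    Assumptions (not confirmed by the help text): a token is kept when its
    count is strictly greater than `cutoff`, and tokens are sorted by
    descending count with ties broken alphabetically.
    """
    counts = Counter(tok for line in lines for tok in char_tokenize(line))
    kept = sorted(
        (tok for tok, n in counts.items() if n > cutoff),
        key=lambda tok: (-counts[tok], tok),
    )
    # Each spec looks like '<blank>:0': a symbol and the index to insert it at.
    for spec in add_symbols:
        symbol, _, index = spec.rpartition(":")
        idx = int(index)
        if idx < 0:
            # Assumed convention: -1 appends after the current last token.
            idx = len(kept) + 1 + idx
        kept.insert(idx, symbol)
    return kept


if __name__ == "__main__":
    corpus = ["ab a", "b a"]
    for token in build_vocabulary(corpus, add_symbols=["<blank>:0", "<unk>:1"]):
        print(token)
```

Applying the symbols one at a time, in the order given, mirrors how repeated `--add_symbol` flags would be read from the command line.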

tts_inference.py

usage: tts_inference.py [-h] [--config CONFIG]
                        [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                        --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                        [--dtype {float16,float32,float64}]
                        [--num_workers NUM_WORKERS] [--batch_size BATCH_SIZE]
                        --data_path_and_name_and_type
                        DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                        [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                        [--train_config TRAIN_CONFIG]
                        [--model_file MODEL_FILE] [--model_tag MODEL_TAG]
                        [--maxlenratio MAXLENRATIO]
                        [--minlenratio MINLENRATIO] [--threshold THRESHOLD]
                        [--use_att_constraint USE_ATT_CONSTRAINT]
                        [--backward_window BACKWARD_WINDOW]
                        [--forward_window FORWARD_WINDOW]
                        [--use_teacher_forcing USE_TEACHER_FORCING]
                        [--speed_control_alpha SPEED_CONTROL_ALPHA]
                        [--noise_scale NOISE_SCALE]
                        [--noise_scale_dur NOISE_SCALE_DUR]
                        [--always_fix_seed ALWAYS_FIX_SEED]
                        [--vocoder_config VOCODER_CONFIG]
                        [--vocoder_file VOCODER_FILE]
                        [--vocoder_tag VOCODER_TAG]

TTS inference

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
                        The path of output directory (default: None)
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --speed_control_alpha SPEED_CONTROL_ALPHA
                        Alpha in FastSpeech to change the speed of generated
                        speech (default: 1.0)
  --noise_scale NOISE_SCALE
                        Noise scale parameter for the flow in vits (default:
                        0.667)
  --noise_scale_dur NOISE_SCALE_DUR
                        Noise scale parameter for the stochastic duration
                        predictor in vits (default: 0.8)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --train_config TRAIN_CONFIG
                        Training configuration file (default: None)
  --model_file MODEL_FILE
                        Model parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If this option is specified,
                        train_config and model_file will be overwritten
                        (default: None)

Decoding related:
  --maxlenratio MAXLENRATIO
                        Maximum length ratio in decoding (default: 10.0)
  --minlenratio MINLENRATIO
                        Minimum length ratio in decoding (default: 0.0)
  --threshold THRESHOLD
                        Threshold value in decoding (default: 0.5)
  --use_att_constraint USE_ATT_CONSTRAINT
                        Whether to use attention constraint (default: False)
  --backward_window BACKWARD_WINDOW
                        Backward window value in attention constraint
                        (default: 1)
  --forward_window FORWARD_WINDOW
                        Forward window value in attention constraint (default:
                        3)
  --use_teacher_forcing USE_TEACHER_FORCING
                        Whether to use teacher forcing (default: False)
  --always_fix_seed ALWAYS_FIX_SEED
                        Whether to always fix seed (default: False)

Vocoder related:
  --vocoder_config VOCODER_CONFIG
                        Vocoder configuration file (default: None)
  --vocoder_file VOCODER_FILE
                        Vocoder parameter file (default: None)
  --vocoder_tag VOCODER_TAG
                        Pretrained vocoder tag. If this option is specified,
                        vocoder_config and vocoder_file will be overwritten
                        (default: None)
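
The --use_att_constraint, --backward_window and --forward_window options above describe a window around the previously attended input position. The following is only a minimal sketch of that idea, not the decoder's actual code: the function name `apply_attention_constraint` is hypothetical, and the exact window boundaries (backward side inclusive, forward side exclusive) are an assumption.

```python
import math


def apply_attention_constraint(
    scores, last_attended_idx, backward_window=1, forward_window=3
):
    """Mask attention scores outside a window around the last attended index.

    Positions before `last_attended_idx - backward_window` and positions at or
    after `last_attended_idx + forward_window` are set to -inf, so a softmax
    over the result gives them zero weight. The boundary convention is an
    assumption made for this sketch.
    """
    low = max(last_attended_idx - backward_window, 0)
    high = min(last_attended_idx + forward_window, len(scores))
    return [s if low <= i < high else -math.inf for i, s in enumerate(scores)]


if __name__ == "__main__":
    scores = [0.1 * i for i in range(10)]
    masked = apply_attention_constraint(scores, last_attended_idx=4)
    allowed = [i for i, s in enumerate(masked) if s != -math.inf]
    print(allowed)
```

With the default windows, attention can step back by at most one position and forward by at most two. This is one way to keep the alignment close to monotonic during inference.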

tts_train.py

usage: tts_train.py [-h] [--config CONFIG] [--print_config]
                    [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                    [--dry_run DRY_RUN]
                    [--iterator_type {sequence,chunk,task,none}]
                    [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                    [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                    [--dist_backend DIST_BACKEND]
                    [--dist_init_method DIST_INIT_METHOD]
                    [--dist_world_size DIST_WORLD_SIZE]
                    [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                    [--dist_master_addr DIST_MASTER_ADDR]
                    [--dist_master_port DIST_MASTER_PORT]
                    [--dist_launcher {slurm,mpi,None}]
                    [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                    [--unused_parameters UNUSED_PARAMETERS]
                    [--sharded_ddp SHARDED_DDP]
                    [--cudnn_enabled CUDNN_ENABLED]
                    [--cudnn_benchmark CUDNN_BENCHMARK]
                    [--cudnn_deterministic CUDNN_DETERMINISTIC]
                    [--collect_stats COLLECT_STATS]
                    [--write_collected_feats WRITE_COLLECTED_FEATS]
                    [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                    [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                    [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                    [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                    [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                    [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                    [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                    [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                    [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                    [--train_dtype {float16,float32,float64}]
                    [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                    [--use_matplotlib USE_MATPLOTLIB]
                    [--use_tensorboard USE_TENSORBOARD]
                    [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                    [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                    [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                    [--wandb_name WANDB_NAME]
                    [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                    [--detect_anomaly DETECT_ANOMALY]
                    [--pretrain_path PRETRAIN_PATH]
                    [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                    [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                    [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                    [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                    [--batch_size BATCH_SIZE]
                    [--valid_batch_size VALID_BATCH_SIZE]
                    [--batch_bins BATCH_BINS]
                    [--valid_batch_bins VALID_BATCH_BINS]
                    [--train_shape_file TRAIN_SHAPE_FILE]
                    [--valid_shape_file VALID_SHAPE_FILE]
                    [--batch_type {unsorted,sorted,folded,length,numel}]
                    [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                    [--fold_length FOLD_LENGTH]
                    [--sort_in_batch {descending,ascending}]
                    [--sort_batch {descending,ascending}]
                    [--multiple_iterator MULTIPLE_ITERATOR]
                    [--chunk_length CHUNK_LENGTH]
                    [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                    [--num_cache_chunks NUM_CACHE_CHUNKS]
                    [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                    [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                    [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                    [--max_cache_size MAX_CACHE_SIZE]
                    [--max_cache_fd MAX_CACHE_FD]
                    [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                    [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                    [--optim_conf OPTIM_CONF]
                    [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                    [--scheduler_conf SCHEDULER_CONF]
                    [--token_list TOKEN_LIST] [--odim ODIM]
                    [--model_conf MODEL_CONF]
                    [--use_preprocessor USE_PREPROCESSOR]
                    [--token_type {bpe,char,word,phn}] [--bpemodel BPEMODEL]
                    [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                    [--cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}]
                    [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                    [--feats_extract {fbank,spectrogram,linear_spectrogram}]
                    [--feats_extract_conf FEATS_EXTRACT_CONF]
                    [--normalize {global_mvn,None}]
                    [--normalize_conf NORMALIZE_CONF]
                    [--tts {tacotron2,transformer,fastspeech,fastspeech2,vits,joint_text2wav,jets}]
                    [--tts_conf TTS_CONF] [--pitch_extract {dio,None}]
                    [--pitch_extract_conf PITCH_EXTRACT_CONF]
                    [--pitch_normalize {global_mvn,None}]
                    [--pitch_normalize_conf PITCH_NORMALIZE_CONF]
                    [--energy_extract {energy,None}]
                    [--energy_extract_conf ENERGY_EXTRACT_CONF]
                    [--energy_normalize {global_mvn,None}]
                    [--energy_normalize_conf ENERGY_NORMALIZE_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images to plot from the attention outputs. This option only makes sense for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging early stopping. Give a triplet of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give a triplet of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of the used p-norm for gradient clip. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        Only the first column is used, so a 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file that describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is the largest length among the samples in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches whose number of 'bins' is as equal as possible, counted as the total length of each feature in the mini-batch. This sampler requires a text file that describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches whose number of 'bins' is as equal as possible, but it counts the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by sample length. To enable this, "shape_file" must contain the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
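
The "folded" rule above, batch_size = base_batch_size // (L // fold_length), can be sketched together with a reader for the shape-file lines shown in the examples. Both function names are hypothetical. Clamping the divisor to at least 1 is an assumption added here so that samples shorter than fold_length do not cause a division by zero. ESPnet's FoldedBatchSampler may handle that case differently.

```python
from typing import Tuple


def parse_shape_line(line: str) -> Tuple[str, Tuple[int, ...]]:
    """Parse 'utterance_id_a 1000,80' or 'utterance_id_a 1000'."""
    key, dims = line.split(maxsplit=1)
    return key, tuple(int(d) for d in dims.split(","))


def folded_batch_size(base_batch_size: int, longest: int, fold_length: int) -> int:
    """batch_size = base_batch_size // (L // fold_length)."""
    folds = longest // fold_length
    # Assumption: clamp folds to at least 1 so that samples shorter than
    # fold_length keep the base batch size instead of dividing by zero.
    return max(1, base_batch_size // max(1, folds))


key, shape = parse_shape_line("utterance_id_b 1453,80")
print(key, shape)
for longest in (400, 1453, 1600, 4000):
    print(longest, folded_batch_size(20, longest, fold_length=800))
```

With a base batch size of 20 and a fold length of 800, the batch size stays at 20 until the longest sample reaches 1600 frames, then drops to 10, and to 4 at 4000 frames.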

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length, e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
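
The --chunk_length forms and the effect of --chunk_shift_ratio can be illustrated with a short sketch. `parse_chunk_lengths` and `chunk_starts` are hypothetical helpers, not ESPnet functions. Treating the '-' range as inclusive and computing the shift as chunk length times ratio are assumptions.

```python
import random
from typing import List


def parse_chunk_lengths(spec: str) -> List[int]:
    """Expand '300', '300,400,500', or '300-400' into candidate lengths."""
    lengths: List[int] = []
    for field in spec.split(","):
        if "-" in field:
            low, high = (int(x) for x in field.split("-"))
            lengths.extend(range(low, high + 1))  # assumption: inclusive range
        else:
            lengths.append(int(field))
    return lengths


def chunk_starts(seq_len: int, chunk_len: int, shift_ratio: float) -> List[int]:
    """Start offsets of full chunks; shift = chunk_len * shift_ratio."""
    shift = max(1, int(chunk_len * shift_ratio))
    return list(range(0, seq_len - chunk_len + 1, shift))


candidates = parse_chunk_lengths("300,400,500")
seq_len = 1200
# Keep only chunk lengths that fit; a sequence shorter than all of them is discarded.
usable = [c for c in candidates if c <= seq_len]
if usable:
    chunk_len = random.choice(usable)
    print(chunk_len, chunk_starts(seq_len, chunk_len, shift_ratio=0.5))
```

With a ratio of 0.5, consecutive chunks overlap by half their length. A ratio above 1 leaves gaps between chunks.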

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data, e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, is the file path; the second, foo, is the key name used for the mini-batch data; and the last, sound, determines the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types that are supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "midi":
                        MIDI file formats: mid, midi, etc.

                           utterance_id_a a.mid
                           utterance_id_b b.mid
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        A HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate a random int-ndarray with the shapes given in the file. The lower and upper values are taken from the file type name, e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader; currently supported for speaker diarization.

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys for the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept open for ark files. This feature is only valid when the data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
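
The 'path,name,type' triple and the "rand_int_\d+_\d+" file type described above can be parsed as in the following sketch. `parse_data_spec` and `rand_int_bounds` are hypothetical helpers written for illustration, not ESPnet's own parsing code.

```python
import re
from typing import Optional, Tuple

RAND_INT = re.compile(r"rand_int_(\d+)_(\d+)")


def parse_data_spec(spec: str) -> Tuple[str, str, str]:
    """Split 'some/path/a.scp,foo,sound' into (path, name, file_type)."""
    parts = spec.split(",")
    if len(parts) != 3:
        raise ValueError(f"Expected 'path,name,type', got: {spec}")
    path, name, file_type = parts
    return path, name, file_type


def rand_int_bounds(file_type: str) -> Optional[Tuple[int, int]]:
    """Return (low, high) for 'rand_int_<low>_<high>', or None for other types."""
    match = RAND_INT.fullmatch(file_type)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


path, name, file_type = parse_data_spec("some/path/a.scp,foo,sound")
print(path, name, file_type)
print(rand_int_bounds("rand_int_0_10"))
```

Since the option is repeatable, each occurrence would yield one such triple, and the name becomes the key of that feature in the mini-batch.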

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmupsteplr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

Task related:
  --token_list TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --odim ODIM           The number of dimensions of the output feature (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {})

Preprocess related:
  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn}
                        The text will be tokenized at the specified token level (default: phn)
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --feats_extract {fbank,spectrogram,linear_spectrogram}
                        The feats_extract type (default: fbank)
  --feats_extract_conf FEATS_EXTRACT_CONF
                        The keyword arguments for feats_extract (default: {})
  --normalize {global_mvn,None}
                        The normalize type (default: global_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --tts {tacotron2,transformer,fastspeech,fastspeech2,vits,joint_text2wav,jets}
                        The tts type (default: tacotron2)
  --tts_conf TTS_CONF   The keyword arguments for tts (default: {})
  --pitch_extract {dio,None}
                        The pitch_extract type (default: None)
  --pitch_extract_conf PITCH_EXTRACT_CONF
                        The keyword arguments for pitch_extract (default: {})
  --pitch_normalize {global_mvn,None}
                        The pitch_normalize type (default: None)
  --pitch_normalize_conf PITCH_NORMALIZE_CONF
                        The keyword arguments for pitch_normalize (default: {})
  --energy_extract {energy,None}
                        The energy_extract type (default: None)
  --energy_extract_conf ENERGY_EXTRACT_CONF
                        The keyword arguments for energy_extract (default: {})
  --energy_normalize {global_mvn,None}
                        The energy_normalize type (default: None)
  --energy_normalize_conf ENERGY_NORMALIZE_CONF
                        The keyword arguments for energy_normalize (default: {})