core tools (espnet2)

ESPnet2 provides several command-line tools for training and evaluating neural networks (NN) under espnet2/bin:

aggregate_stats_dirs.py

usage: aggregate_stats_dirs.py [-h]
                               [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                               [--skip_sum_stats] [--input_dir INPUT_DIR]
                               --output_dir OUTPUT_DIR

Aggregate statistics directories into one directory

optional arguments:
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --skip_sum_stats      Skip computing the sum of statistics. (default: False)
  --input_dir INPUT_DIR
                        Input directories (default: None)
  --output_dir OUTPUT_DIR
                        Output directory (default: None)

asr_align.py

usage: asr_align.py [-h] [--config CONFIG]
                    [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                    [--ngpu NGPU] [--dtype {float16,float32,float64}]
                    --asr_train_config ASR_TRAIN_CONFIG --asr_model_file
                    ASR_MODEL_FILE [--token_type {char,bpe,None}]
                    [--bpemodel BPEMODEL] [--fs FS]
                    [--min_window_size MIN_WINDOW_SIZE]
                    [--max_window_size MAX_WINDOW_SIZE]
                    [--set_blank SET_BLANK] [--gratis_blank GRATIS_BLANK]
                    [--replace_spaces_with_blanks REPLACE_SPACES_WITH_BLANKS]
                    [--scoring_length SCORING_LENGTH]
                    [--time_stamps {auto,fixed}]
                    [--text_converter {tokenize,classic}]
                    [--kaldi_style_text KALDI_STYLE_TEXT]
                    [--print_utt_text PRINT_UTT_TEXT]
                    [--print_utt_score PRINT_UTT_SCORE] -a AUDIO -t TEXT
                    [-o OUTPUT]

ASR Decoding

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)

Model configuration related:
  --asr_train_config ASR_TRAIN_CONFIG
  --asr_model_file ASR_MODEL_FILE

Text converter related:
  --token_type {char,bpe,None}
                        The token type for ASR model. If not given, it is
                        taken from the training args (default: None)
  --bpemodel BPEMODEL   The model path of sentencepiece. If not given, it is
                        taken from the training args (default: None)

CTC segmentation related:
  --fs FS               Sampling Frequency. The sampling frequency (in Hz) is
                        needed to correctly determine the starting and ending
                        time of aligned segments. (default: 16000)
  --min_window_size MIN_WINDOW_SIZE
                        Minimum window size considered for utterance.
                        (default: None)
  --max_window_size MAX_WINDOW_SIZE
                        Maximum window size considered for utterance.
                        (default: None)
  --set_blank SET_BLANK
                        Index of model dictionary for blank token. (default:
                        None)
  --gratis_blank GRATIS_BLANK
                        Set the transition cost of the blank token to zero.
                        Audio sections labeled with blank tokens can then be
                        skipped without penalty. Useful if there are unrelated
                        audio segments between utterances. (default: False)
  --replace_spaces_with_blanks REPLACE_SPACES_WITH_BLANKS
                        Fill blanks in between words to better model pauses
                        between words. This option is only active for
                        `--text_converter classic`. Segments can be misaligned
                        if this option is combined with --gratis_blank.
                        (default: False)
  --scoring_length SCORING_LENGTH
                        Changes partitioning length L for calculation of the
                        confidence score. (default: None)
  --time_stamps {auto,fixed}
                        Select how the CTC index duration is estimated, and
                        thus how the time stamps are calculated. (default:
                        auto)
  --text_converter {tokenize,classic}
                        How CTC segmentation handles text. (default: tokenize)

Input/output arguments:
  --kaldi_style_text KALDI_STYLE_TEXT
                        Assume that the input text file is kaldi-style
                        formatted, i.e., the utterance name is at the
                        beginning of each line. (default: True)
  --print_utt_text PRINT_UTT_TEXT
                        Include the utterance text in the segments output.
                        (default: True)
  --print_utt_score PRINT_UTT_SCORE
                        Include the confidence score in the segments output.
                        (default: True)
  -a AUDIO, --audio AUDIO
                        Input audio file. (default: None)
  -t TEXT, --text TEXT  Input text file. Each line contains the ground truth
                        of a single utterance. Kaldi-style text files include
                        the name of the utterance as the first word in the
                        line. (default: None)
  -o OUTPUT, --output OUTPUT
                        Output in the form of a `segments` file. If not given,
                        output is written to stdout. (default: -)
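
The Kaldi-style text layout described for `--kaldi_style_text` and `--text` can be illustrated with a short Python sketch. This is not the tool's own parser: the function name `read_text_lines` and the fallback naming scheme for non-Kaldi input are hypothetical and only show how the first word of each line is split off as the utterance name.

```python
from typing import List, Tuple


def read_text_lines(lines: List[str], kaldi_style: bool = True) -> List[Tuple[str, str]]:
    """Return (utterance_name, text) pairs from the lines of a text file."""
    pairs = []
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if kaldi_style:
            # Kaldi-style: the first word is the utterance name,
            # the remainder is the ground-truth text.
            parts = line.split(maxsplit=1)
            name = parts[0]
            text = parts[1] if len(parts) > 1 else ""
        else:
            # Hypothetical naming for plain text; the real tool may differ.
            name, text = f"utt_{i:04d}", line
        pairs.append((name, text))
    return pairs


example = ["utt1 THE SALE OF THE HOTELS", "utt2 IS PART OF A STRATEGY"]
pairs = read_text_lines(example)
```

Each resulting pair corresponds to one utterance that the alignment would place in the audio file.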

asr_inference_streaming.py

usage: asr_inference_streaming.py [-h] [--config CONFIG]
                                  [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                                  --output_dir OUTPUT_DIR [--ngpu NGPU]
                                  [--seed SEED]
                                  [--dtype {float16,float32,float64}]
                                  [--num_workers NUM_WORKERS]
                                  --data_path_and_name_and_type
                                  DATA_PATH_AND_NAME_AND_TYPE
                                  [--key_file KEY_FILE]
                                  [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                                  [--sim_chunk_length SIM_CHUNK_LENGTH]
                                  --asr_train_config ASR_TRAIN_CONFIG
                                  --asr_model_file ASR_MODEL_FILE
                                  [--lm_train_config LM_TRAIN_CONFIG]
                                  [--lm_file LM_FILE]
                                  [--word_lm_train_config WORD_LM_TRAIN_CONFIG]
                                  [--word_lm_file WORD_LM_FILE]
                                  [--batch_size BATCH_SIZE] [--nbest NBEST]
                                  [--beam_size BEAM_SIZE] [--penalty PENALTY]
                                  [--maxlenratio MAXLENRATIO]
                                  [--minlenratio MINLENRATIO]
                                  [--ctc_weight CTC_WEIGHT]
                                  [--lm_weight LM_WEIGHT]
                                  [--disable_repetition_detection DISABLE_REPETITION_DETECTION]
                                  [--encoded_feat_length_limit ENCODED_FEAT_LENGTH_LIMIT]
                                  [--decoder_text_length_limit DECODER_TEXT_LENGTH_LIMIT]
                                  [--token_type {char,bpe,None}]
                                  [--bpemodel BPEMODEL]
                                  [--normalize_length NORMALIZE_LENGTH]

ASR Decoding

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
  --sim_chunk_length SIM_CHUNK_LENGTH
                        The length of one chunk into which speech is divided
                        for evaluation of streaming processing. (default: 0)

The model configuration related:
  --asr_train_config ASR_TRAIN_CONFIG
  --asr_model_file ASR_MODEL_FILE
  --lm_train_config LM_TRAIN_CONFIG
  --lm_file LM_FILE
  --word_lm_train_config WORD_LM_TRAIN_CONFIG
  --word_lm_file WORD_LM_FILE

Beam-search related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --nbest NBEST         Output N-best hypotheses (default: 1)
  --beam_size BEAM_SIZE
                        Beam size (default: 20)
  --penalty PENALTY     Insertion penalty (default: 0.0)
  --maxlenratio MAXLENRATIO
                        Input length ratio to obtain max output length. If
                        maxlenratio=0.0 (default), it uses an end-detect
                        function to automatically find maximum hypothesis
                        lengths (default: 0.0)
  --minlenratio MINLENRATIO
                        Input length ratio to obtain min output length
                        (default: 0.0)
  --ctc_weight CTC_WEIGHT
                        CTC weight in joint decoding (default: 0.5)
  --lm_weight LM_WEIGHT
                        RNNLM weight (default: 1.0)
  --disable_repetition_detection DISABLE_REPETITION_DETECTION
  --encoded_feat_length_limit ENCODED_FEAT_LENGTH_LIMIT
                        Limit the length of the encoded features passed to
                        the decoder. (default: 0)
  --decoder_text_length_limit DECODER_TEXT_LENGTH_LIMIT
                        Limit the length of the text passed to the decoder.
                        (default: 0)
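
How `--ctc_weight`, `--lm_weight`, and `--penalty` might interact can be sketched as a log-linear combination of per-hypothesis scores. This is an assumption based on the usual formulation of joint CTC/attention decoding, not something the help text states: in particular, weighting the attention decoder by `1 - ctc_weight` and multiplying the penalty by the hypothesis length are assumptions, and `joint_score` is a hypothetical name.

```python
def joint_score(
    decoder_logp: float,
    ctc_logp: float,
    lm_logp: float,
    num_tokens: int,
    ctc_weight: float = 0.5,
    lm_weight: float = 1.0,
    penalty: float = 0.0,
) -> float:
    """Combine log-probabilities of one hypothesis into a single score.

    Assumption: the attention decoder is weighted by (1 - ctc_weight),
    and the insertion penalty acts as a per-token length bonus.
    """
    return (
        (1.0 - ctc_weight) * decoder_logp
        + ctc_weight * ctc_logp
        + lm_weight * lm_logp
        + penalty * num_tokens
    )


# With the defaults shown above (ctc_weight=0.5, lm_weight=1.0, penalty=0.0).
score_default = joint_score(-2.0, -4.0, -1.0, num_tokens=5)
# Attention-only scoring with a small length bonus.
score_att_only = joint_score(-2.0, -4.0, -1.0, num_tokens=5,
                             ctc_weight=0.0, lm_weight=0.0, penalty=0.1)
```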

Text converter related:
  --token_type {char,bpe,None}
                        The token type for ASR model. If not given, it is
                        taken from the training args (default: None)
  --bpemodel BPEMODEL   The model path of sentencepiece. If not given, it is
                        taken from the training args (default: None)
  --normalize_length NORMALIZE_LENGTH
                        If true, best hypothesis is selected by length-
                        normalized scores (default: False)
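
The effect of `--normalize_length` can be pictured with a small Python sketch. It assumes each hypothesis carries a token list and an accumulated log-probability score; the `Hyp` type and `select_best` function are hypothetical names used only for illustration.

```python
from typing import List, NamedTuple


class Hyp(NamedTuple):
    tokens: List[str]
    score: float  # accumulated log-probability (hypothetical field)


def select_best(hyps: List[Hyp], normalize_length: bool = False) -> Hyp:
    """Pick the best hypothesis, optionally by score per token."""

    def key(h: Hyp) -> float:
        if normalize_length:
            # Divide by the token count so long hypotheses are not
            # penalized merely for accumulating more negative terms.
            return h.score / max(len(h.tokens), 1)
        return h.score

    return max(hyps, key=key)


hyps = [Hyp(["a"], -1.0), Hyp(["a", "b", "c", "d"], -2.0)]
best_raw = select_best(hyps)                          # raw score favors the short one
best_norm = select_best(hyps, normalize_length=True)  # -0.5 per token beats -1.0
```

Without normalization, shorter hypotheses tend to win because every additional token adds a negative log-probability.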

asr_train.py

usage: asr_train.py [-h] [--config CONFIG] [--print_config]
                    [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                    [--drop_last_iter DROP_LAST_ITER] [--dry_run DRY_RUN]
                    [--iterator_type {sequence,category,chunk,task,none}]
                    [--valid_iterator_type {sequence,category,chunk,task,none}]
                    [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                    [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                    [--dist_backend DIST_BACKEND]
                    [--dist_init_method DIST_INIT_METHOD]
                    [--dist_world_size DIST_WORLD_SIZE]
                    [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                    [--dist_master_addr DIST_MASTER_ADDR]
                    [--dist_master_port DIST_MASTER_PORT]
                    [--dist_launcher {slurm,mpi,None}]
                    [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                    [--unused_parameters UNUSED_PARAMETERS]
                    [--sharded_ddp SHARDED_DDP]
                    [--cudnn_enabled CUDNN_ENABLED]
                    [--cudnn_benchmark CUDNN_BENCHMARK]
                    [--cudnn_deterministic CUDNN_DETERMINISTIC]
                    [--collect_stats COLLECT_STATS]
                    [--write_collected_feats WRITE_COLLECTED_FEATS]
                    [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                    [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                    [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                    [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                    [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                    [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                    [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                    [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                    [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                    [--train_dtype {float16,float32,float64}]
                    [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                    [--use_matplotlib USE_MATPLOTLIB]
                    [--use_tensorboard USE_TENSORBOARD]
                    [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                    [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                    [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                    [--wandb_name WANDB_NAME]
                    [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                    [--detect_anomaly DETECT_ANOMALY]
                    [--use_adapter USE_ADAPTER] [--adapter {lora,houlsby}]
                    [--save_strategy {all,adapter_only,required_grad_only}]
                    [--adapter_conf ADAPTER_CONF]
                    [--pretrain_path PRETRAIN_PATH]
                    [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                    [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                    [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                    [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                    [--batch_size BATCH_SIZE]
                    [--valid_batch_size VALID_BATCH_SIZE]
                    [--batch_bins BATCH_BINS]
                    [--valid_batch_bins VALID_BATCH_BINS]
                    [--train_shape_file TRAIN_SHAPE_FILE]
                    [--valid_shape_file VALID_SHAPE_FILE]
                    [--batch_type {unsorted,sorted,folded,length,numel}]
                    [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                    [--fold_length FOLD_LENGTH]
                    [--sort_in_batch {descending,ascending}]
                    [--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
                    [--sort_batch {descending,ascending}]
                    [--multiple_iterator MULTIPLE_ITERATOR]
                    [--chunk_length CHUNK_LENGTH]
                    [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                    [--num_cache_chunks NUM_CACHE_CHUNKS]
                    [--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
                    [--chunk_default_fs CHUNK_DEFAULT_FS]
                    [--chunk_max_abs_length CHUNK_MAX_ABS_LENGTH]
                    [--chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES]
                    [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                    [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                    [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                    [--max_cache_size MAX_CACHE_SIZE]
                    [--max_cache_fd MAX_CACHE_FD]
                    [--allow_multi_rates ALLOW_MULTI_RATES]
                    [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                    [--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
                    [--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
                    [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                    [--optim_conf OPTIM_CONF]
                    [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                    [--scheduler_conf SCHEDULER_CONF]
                    [--token_list TOKEN_LIST]
                    [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                    [--input_size INPUT_SIZE] [--ctc_conf CTC_CONF]
                    [--joint_net_conf JOINT_NET_CONF]
                    [--use_preprocessor USE_PREPROCESSOR]
                    [--use_lang_prompt USE_LANG_PROMPT]
                    [--use_nlp_prompt USE_NLP_PROMPT]
                    [--token_type {bpe,char,word,phn,hugging_face,whisper_en,whisper_multilingual}]
                    [--bpemodel BPEMODEL]
                    [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                    [--cleaner {None,tacotron,jaconv,vietnamese,whisper_en,whisper_basic}]
                    [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                    [--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
                    [--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
                    [--noise_scp NOISE_SCP]
                    [--noise_apply_prob NOISE_APPLY_PROB]
                    [--noise_db_range NOISE_DB_RANGE]
                    [--short_noise_thres SHORT_NOISE_THRES]
                    [--aux_ctc_tasks AUX_CTC_TASKS [AUX_CTC_TASKS ...]]
                    [--frontend {default,sliding_window,s3prl,fused,whisper}]
                    [--frontend_conf FRONTEND_CONF] [--specaug {specaug,None}]
                    [--specaug_conf SPECAUG_CONF]
                    [--normalize {global_mvn,utterance_mvn,None}]
                    [--normalize_conf NORMALIZE_CONF]
                    [--model {espnet,maskctc,pit_espnet}]
                    [--model_conf MODEL_CONF]
                    [--preencoder {sinc,linear,None}]
                    [--preencoder_conf PREENCODER_CONF]
                    [--encoder {conformer,transformer,transformer_multispkr,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,torchaudiohubert,longformer,branchformer,whisper,e_branchformer,avhubert,multiconv_conformer}]
                    [--encoder_conf ENCODER_CONF]
                    [--postencoder {hugging_face_transformers,length_adaptor,None}]
                    [--postencoder_conf POSTENCODER_CONF]
                    [--decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,transducer,mlm,whisper,hugging_face_transformers,s4,None}]
                    [--decoder_conf DECODER_CONF]
                    [--preprocessor {default,multi}]
                    [--preprocessor_conf PREPROCESSOR_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --drop_last_iter DROP_LAST_ITER
                        Exclude the minibatch with leftovers. (default: False)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: sequence)
  --valid_iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: None)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images to plot from the attention outputs. This option only makes sense for attention-based models. Set it to 0 to disable the attention plots (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are read. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
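
With `--dist_init_method env://`, the four environment variables named above supply the rendezvous information. The following Python sketch only shows how such values could be collected; `read_env_init` is a hypothetical helper, not part of ESPnet or PyTorch, and it does not start any distributed process.

```python
import os
from typing import Dict, Mapping, Optional, Union


def _int_or_none(value: Optional[str]) -> Optional[int]:
    """Convert a string to int, keeping missing values as None."""
    return int(value) if value is not None else None


def read_env_init(env: Optional[Mapping[str, str]] = None) -> Dict[str, Union[str, int, None]]:
    """Collect the variables that an env:// initialization relies on."""
    env = os.environ if env is None else env
    return {
        "master_addr": env.get("MASTER_ADDR"),
        "master_port": _int_or_none(env.get("MASTER_PORT")),
        "world_size": _int_or_none(env.get("WORLD_SIZE")),
        "rank": _int_or_none(env.get("RANK")),
    }


# Example values for a two-process job; address and port are made up.
settings = read_env_init(
    {"MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29500", "WORLD_SIZE": "2", "RANK": "0"}
)
```

Any variable that is unset comes back as `None`, which mirrors the `None` defaults of `--dist_master_addr`, `--dist_master_port`, `--dist_world_size`, and `--dist_rank`.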

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair referring to the phase, "train" or "valid", and the criterion name. The mode specifying "min" or "max" can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging early stopping. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every this number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
  --use_adapter USE_ADAPTER
                        Enable efficient finetuning (see https://arxiv.org/abs/2106.09685) for large pre-trained foundation models such as Whisper and SSL models (default: False)
  --adapter {lora,houlsby}
                        Adapter Name (default: lora)
  --save_strategy {all,adapter_only,required_grad_only}
                        The strategy to save parameters. Default: 'all'
                        'all': save all parameters
                        'adapter_only': save only adapter parameters, without other parameters like downstream model
                        'required_grad_only': save only parameters with requires_grad=True
                         (default: all)
  --adapter_conf ADAPTER_CONF
                        Configuration for efficient finetuning (default: {})
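
The interplay of `--patience` and `--early_stopping_criterion` can be sketched as follows. This is a simplified illustration, not the trainer's code: `should_stop` is a hypothetical function, the per-epoch values are assumed to be those of the chosen phase and criterion (for example `valid` and `loss`), and the strict comparison against `patience` is an assumption.

```python
from typing import Optional, Sequence


def should_stop(values: Sequence[float], patience: Optional[int], mode: str = "min") -> bool:
    """Decide whether training should stop after the latest epoch.

    values:   one criterion value per finished epoch, oldest first
    patience: epochs to wait without improvement (None disables stopping)
    mode:     "min" if lower is better, "max" if higher is better
    """
    if patience is None or not values:
        return False
    pick = min if mode == "min" else max
    # Index of the first epoch that reached the best value so far.
    best_epoch = values.index(pick(values))
    epochs_without_improvement = len(values) - 1 - best_epoch
    # Assumption: stop only once the wait strictly exceeds patience.
    return epochs_without_improvement > patience


valid_loss = [1.0, 0.8, 0.9, 0.95]  # best value was reached in the second epoch
stop_with_patience_2 = should_stop(valid_loss, patience=2)
stop_with_patience_1 = should_stop(valid_loss, patience=1)
```

With the default `--patience` of `None`, the function never requests a stop and training runs until `--max_epoch`.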

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states from the initialization. E.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
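
The `--init_param` format `<file_path>:<src_key>:<dst_key>:<exclude_keys>` can be taken apart with a few lines of Python. This is a minimal sketch rather than the loader ESPnet uses: `parse_init_param` and `InitParam` are hypothetical names, and treating `exclude_keys` as a comma-separated list is an assumption.

```python
from typing import List, NamedTuple, Optional


class InitParam(NamedTuple):
    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: List[str]


def parse_init_param(spec: str) -> InitParam:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into fields."""
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"Too many ':'-separated fields in {spec!r}")
    # Trailing fields are optional; pad them with empty strings.
    parts += [""] * (4 - len(parts))
    file_path, src_key, dst_key, excludes = parts
    return InitParam(
        file_path=file_path,
        src_key=src_key or None,
        dst_key=dst_key or None,
        # Assumption: several excluded keys are separated by commas.
        exclude_keys=[key for key in excludes.split(",") if key],
    )


all_params = parse_init_param("some/where/model.pth")
decoder_only = parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed")
```

Note that this simple split would misread paths that themselves contain a colon.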

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type is 'unsorted', 'sorted', 'folded', or 'catbel'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no special features; it simply creates mini-batches with a constant batch_size. This sampler doesn't require any length information for the features. 'key_file' is just a text file that lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        Only the first column is read, so a 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that have as nearly the same number of 'bins' as possible, counted by the total length of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches that have as nearly the same number of 'bins' as possible, but counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by sample length. To enable this, "shape_file" must have the length information. (default: descending)
  --shuffle_within_batch SHUFFLE_WITHIN_BATCH
                        Shuffles whole batches sample-wise. Normally required for classification tasks. (default: False)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
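
The shape-file lines and the batch sizing rules described under `--batch_type` can be illustrated with a short sketch. It is not ESPnet's sampler code. The helper names are hypothetical, and the handling of `L < fold_length` (where the quoted formula would divide by zero) is an assumption made only to keep the example well defined.

```python
from math import prod
from typing import Tuple


def parse_shape_line(line: str) -> Tuple[str, Tuple[int, ...]]:
    """Parse 'utterance_id 1000,80' into ('utterance_id', (1000, 80))."""
    key, shape = line.split(maxsplit=1)
    return key, tuple(int(d) for d in shape.split(","))


def sample_cost(dims: Tuple[int, ...], batch_type: str) -> int:
    """Per-sample contribution to the 'bins' of a mini-batch."""
    if batch_type == "length":
        return dims[0]      # only the first dimension (the length) counts
    if batch_type == "numel":
        return prod(dims)   # every element of the feature counts
    raise ValueError(f"unsupported batch_type: {batch_type}")


def folded_batch_size(base_batch_size: int, longest: int, fold_length: int) -> int:
    """batch_size = base_batch_size // (L // fold_length), as quoted above.

    Assumption: when L < fold_length the ratio is 0, so the base size is
    kept, and the result is never allowed to drop below 1.
    """
    ratio = longest // fold_length
    if ratio == 0:
        return base_batch_size
    return max(1, base_batch_size // ratio)


key, dims = parse_shape_line("utterance_id_a 1000,80")
print(key, dims)
print(sample_cost(dims, "length"), sample_cost(dims, "numel"))
print(folded_batch_size(20, longest=1600, fold_length=800))
```

With an 80-dimensional feature, the "numel" cost of a sample is 80 times its "length" cost, so `--batch_bins` generally needs to be scaled accordingly when switching between the two.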

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
  --chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
                        List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
                          - exactly match one of the prefixes in `chunk_excluded_key_prefixes`
                          - have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
  --chunk_default_fs CHUNK_DEFAULT_FS
                        Default sampling rate used for the chunk length. Will be used to adaptively adjust the chunk length for data of different sampling rates. (If None, the chunk length will be fixed.) (default: None)
  --chunk_max_abs_length CHUNK_MAX_ABS_LENGTH
                        Maximum number of samples per chunk for all sampling rates (default: None)
  --chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES
                        Discard samples shorter than the minimum chunk length (default: True)
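
The three `--chunk_length` forms and the effect of `--chunk_shift_ratio` can be sketched as follows. This is a simplified illustration, not ESPnet's ChunkIterFactory. It assumes that a 'lo-hi' range includes both end points, that the shift is the chunk length multiplied by the ratio, and it omits any random offset. Both function names are hypothetical.

```python
from typing import List


def parse_chunk_lengths(spec: str) -> List[int]:
    """Expand '300', '300,400,500', or '300-400' into candidate lengths.

    Assumption: a 'lo-hi' range includes both end points.
    """
    lengths: List[int] = []
    for item in spec.split(","):
        bounds = [int(x) for x in item.split("-")]
        if len(bounds) == 1:
            lengths.append(bounds[0])
        elif len(bounds) == 2:
            lengths.extend(range(bounds[0], bounds[1] + 1))
        else:
            raise ValueError(f"invalid chunk_length item: {item!r}")
    return lengths


def chunk_starts(seq_len: int, chunk_len: int, shift_ratio: float) -> List[int]:
    """Start offsets of full-size chunks (simplified: no random offset)."""
    if seq_len < chunk_len:
        return []  # sequence is too short for this chunk length
    shift = max(1, int(chunk_len * shift_ratio))
    return list(range(0, seq_len - chunk_len + 1, shift))


print(parse_chunk_lengths("300,400,500"))
print(chunk_starts(1000, 500, 0.5))   # ratio < 1: consecutive chunks overlap
print(chunk_starts(2000, 500, 2.0))   # ratio > 1: gaps between chunks
```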

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types which are supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "multi_columns_sound":
                        Enable multi-column wav.scp. The following text file can be loaded as multi-channel audio data

                           utterance_id_a a.wav a2.wav
                           utterance_id_b b.wav b2.wav
                           ...

                        "variable_columns_sound":
                        Load a variable number (columns) of audio files in wav.scp. The following text file can be loaded as stacked audio data

                           utterance_id_a a1.wav a2.wav a3.wav
                           utterance_id_b b1.wav
                           utterance_id_c c1.wav c2.wav
                           ...

                        Note that audios of different lengths will be right-padded with np.nan to the longest audio in the sample.
                        A preprocessor must be used to remove these paddings.

                        "score":
                        Return text as is. The text contains tempo and note info.
                        For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.

                           utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
                           utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integer numbers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integer numbers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "random_text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           hello world
                           foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the file. The lower and upper values are given in the file type name. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader; currently supported for speaker diarization.

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys in the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --allow_multi_rates ALLOW_MULTI_RATES
                        Whether to allow audios to have different sampling rates (default: False)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
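
The "path,name,type" triple taken by `--train_data_path_and_name_and_type` can be checked before launching a job. The sketch below is a hypothetical validator, not ESPnet code. It assumes the path contains no comma and only knows the file types listed in the help text above.

```python
import re
from typing import Tuple

# File types named in the help text; "rand_int_<low>_<high>" is matched separately.
KNOWN_TYPES = {
    "sound", "multi_columns_sound", "variable_columns_sound", "score",
    "duration", "kaldi_ark", "npy", "text_int", "csv_int", "text_float",
    "csv_float", "text", "random_text", "hdf5", "rand_float", "rttm",
}
RAND_INT = re.compile(r"rand_int_(\d+)_(\d+)")


def parse_data_spec(spec: str) -> Tuple[str, str, str]:
    """Split 'some/path/a.scp,foo,sound' into (path, name, file_type)."""
    parts = spec.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected 'path,name,type', got {spec!r}")
    path, name, file_type = parts
    if file_type not in KNOWN_TYPES and not RAND_INT.fullmatch(file_type):
        raise ValueError(f"unknown file type: {file_type!r}")
    return path, name, file_type


print(parse_data_spec("some/path/a.scp,foo,sound"))
print(parse_data_spec("some/path/shape.txt,bar,rand_int_0_10"))
```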

Optimizer related:
  --exclude_weight_decay EXCLUDE_WEIGHT_DECAY
                        Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
  --exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
                        The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

Task related:
  --token_list TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --input_size INPUT_SIZE
                        The number of input dimension of the feature (default: None)
  --ctc_conf CTC_CONF   The keyword arguments for CTC class. (default: {'dropout_rate': 0.0, 'ctc_type': 'builtin', 'reduce': True, 'ignore_nan_grad': None, 'zero_infinity': True, 'brctc_risk_strategy': 'exp', 'brctc_group_strategy': 'end', 'brctc_risk_factor': 0.0})
  --joint_net_conf JOINT_NET_CONF
                        The keyword arguments for joint network class. (default: None)

Preprocess related:
  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --use_lang_prompt USE_LANG_PROMPT
                        Use language id as prompt (default: False)
  --use_nlp_prompt USE_NLP_PROMPT
                        Use natural language phrases as prompt (default: False)
  --token_type {bpe,char,word,phn,hugging_face,whisper_en,whisper_multilingual}
                        The text will be tokenized in the specified level token (default: bpe)
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese,whisper_en,whisper_basic}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)
  --speech_volume_normalize SPEECH_VOLUME_NORMALIZE
                        Scale the maximum amplitude to the given value. (default: None)
  --rir_scp RIR_SCP     The file path of rir scp file. (default: None)
  --rir_apply_prob RIR_APPLY_PROB
                        The probability of applying RIR convolution. (default: 1.0)
  --noise_scp NOISE_SCP
                        The file path of noise scp file. (default: None)
  --noise_apply_prob NOISE_APPLY_PROB
                        The probability of adding noise. (default: 1.0)
  --noise_db_range NOISE_DB_RANGE
                        The range of noise decibel level. (default: 13_15)
  --short_noise_thres SHORT_NOISE_THRES
                        If len(noise) / len(speech) is smaller than this threshold during dynamic mixing, a warning will be displayed. (default: 0.5)
  --aux_ctc_tasks AUX_CTC_TASKS [AUX_CTC_TASKS ...]
                        Auxiliary tasks to train on using CTC loss. (default: [])
  --frontend {default,sliding_window,s3prl,fused,whisper}
                        The frontend type (default: default)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: utterance_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --model {espnet,maskctc,pit_espnet}
                        The model type (default: espnet)
  --model_conf MODEL_CONF
                        The keyword arguments for model (default: {})
  --preencoder {sinc,linear,None}
                        The preencoder type (default: None)
  --preencoder_conf PREENCODER_CONF
                        The keyword arguments for preencoder (default: {})
  --encoder {conformer,transformer,transformer_multispkr,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,torchaudiohubert,longformer,branchformer,whisper,e_branchformer,avhubert,multiconv_conformer}
                        The encoder type (default: rnn)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --postencoder {hugging_face_transformers,length_adaptor,None}
                        The postencoder type (default: None)
  --postencoder_conf POSTENCODER_CONF
                        The keyword arguments for postencoder (default: {})
  --decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,transducer,mlm,whisper,hugging_face_transformers,s4,None}
                        The decoder type (default: None)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})
  --preprocessor {default,multi}
                        The preprocessor type (default: default)
  --preprocessor_conf PREPROCESSOR_CONF
                        The keyword arguments for preprocessor (default: {})

asr_transducer_inference.py¶

usage: asr_transducer_inference.py [-h] [--config CONFIG]
                                   [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                                   --output_dir OUTPUT_DIR [--ngpu NGPU]
                                   [--seed SEED]
                                   [--dtype {float16,float32,float64}]
                                   [--num_workers NUM_WORKERS]
                                   --data_path_and_name_and_type
                                   DATA_PATH_AND_NAME_AND_TYPE
                                   [--key_file KEY_FILE]
                                   [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                                   [--asr_train_config ASR_TRAIN_CONFIG]
                                   [--asr_model_file ASR_MODEL_FILE]
                                   [--lm_train_config LM_TRAIN_CONFIG]
                                   [--lm_file LM_FILE] [--model_tag MODEL_TAG]
                                   [--batch_size BATCH_SIZE] [--nbest NBEST]
                                   [--beam_size BEAM_SIZE]
                                   [--lm_weight LM_WEIGHT]
                                   [--beam_search_config BEAM_SEARCH_CONFIG]
                                   [--token_type {char,bpe,None}]
                                   [--bpemodel BPEMODEL]
                                   [--quantize_asr_model QUANTIZE_ASR_MODEL]
                                   [--quantize_modules [QUANTIZE_MODULES [QUANTIZE_MODULES ...]]]
                                   [--quantize_dtype {float16,qint8}]
                                   [--streaming STREAMING]
                                   [--decoding_window DECODING_WINDOW]
                                   [--left_context LEFT_CONTEXT]
                                   [--display_hypotheses DISPLAY_HYPOTHESES]

ASR Transducer Decoding

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --quantize_asr_model QUANTIZE_ASR_MODEL
                        Apply dynamic quantization to ASR model. (default:
                        False)
  --quantize_modules [QUANTIZE_MODULES [QUANTIZE_MODULES ...]]
                        Module names to apply dynamic quantization on. The
                        module names are provided as a list, where each name
                        is separated by a comma (e.g.:
                        --quantize-config=[Linear,LSTM,GRU]). Each specified
                        name should be an attribute of 'torch.nn', e.g.:
                        torch.nn.Linear, torch.nn.LSTM, torch.nn.GRU, ...
                        (default: None)
  --quantize_dtype {float16,qint8}
                        Dtype for dynamic quantization. (default: qint8)
  --streaming STREAMING
                        Whether to perform chunk-by-chunk inference. (default:
                        False)
  --decoding_window DECODING_WINDOW
                        Audio length (in milliseconds) to process during
                        decoding. (default: 640)
  --left_context LEFT_CONTEXT
                        Number of previous frames (AFTER subsampling) the
                        attention module can see in current chunk (used by
                        Conformer and Branchformer block). (default: 32)
  --display_hypotheses DISPLAY_HYPOTHESES
                        Whether to display hypotheses during inference. If
                        streaming=True, partial hypotheses will also be shown.
                        (default: False)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --asr_train_config ASR_TRAIN_CONFIG
                        ASR training configuration (default: None)
  --asr_model_file ASR_MODEL_FILE
                        ASR model parameter file (default: None)
  --lm_train_config LM_TRAIN_CONFIG
                        LM training configuration (default: None)
  --lm_file LM_FILE     LM parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If this option is specified,
                        *_train_config and *_file will be overwritten
                        (default: None)

Beam-search related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --nbest NBEST         Output N-best hypotheses (default: 1)
  --beam_size BEAM_SIZE
                        Beam size (default: 5)
  --lm_weight LM_WEIGHT
                        RNNLM weight (default: 1.0)
  --beam_search_config BEAM_SEARCH_CONFIG
                        The keyword arguments for transducer beam search.
                        (default: {})

Text converter related:
  --token_type {char,bpe,None}
                        The token type for ASR model. If not given, it is
                        taken from the training args (default: None)
  --bpemodel BPEMODEL   The model path of sentencepiece. If not given, it is
                        taken from the training args (default: None)
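
As a rough illustration of how the options above fit together, the sketch below assembles an argument list for asr_transducer_inference.py without running it. All paths and the data key name "speech" are hypothetical placeholders. Invoking the tool as the module `espnet2.bin.asr_transducer_inference` and passing boolean options as the string "true" are assumptions about the local setup. The streaming values shown are simply the defaults listed above.

```python
import shlex
import sys
from typing import List


def build_inference_command(
    output_dir: str,
    data_spec: str,
    train_config: str,
    model_file: str,
    beam_size: int = 5,
    streaming: bool = False,
) -> List[str]:
    """Assemble (but do not run) an asr_transducer_inference command line."""
    cmd = [
        sys.executable, "-m", "espnet2.bin.asr_transducer_inference",
        "--output_dir", output_dir,
        "--data_path_and_name_and_type", data_spec,
        "--asr_train_config", train_config,
        "--asr_model_file", model_file,
        "--beam_size", str(beam_size),
    ]
    if streaming:
        # Chunk-by-chunk decoding; window and context are the listed defaults.
        cmd += ["--streaming", "true",
                "--decoding_window", "640",
                "--left_context", "32"]
    return cmd


cmd = build_inference_command(
    output_dir="exp/decode",                      # hypothetical paths
    data_spec="dump/test/wav.scp,speech,sound",
    train_config="exp/asr/config.yaml",
    model_file="exp/asr/model.pth",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.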

asr_transducer_train.py¶

usage: asr_transducer_train.py [-h] [--config CONFIG] [--print_config]
                               [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                               [--drop_last_iter DROP_LAST_ITER]
                               [--dry_run DRY_RUN]
                               [--iterator_type {sequence,category,chunk,task,none}]
                               [--valid_iterator_type {sequence,category,chunk,task,none}]
                               [--output_dir OUTPUT_DIR] [--ngpu NGPU]
                               [--seed SEED] [--num_workers NUM_WORKERS]
                               [--num_att_plot NUM_ATT_PLOT]
                               [--dist_backend DIST_BACKEND]
                               [--dist_init_method DIST_INIT_METHOD]
                               [--dist_world_size DIST_WORLD_SIZE]
                               [--dist_rank DIST_RANK]
                               [--local_rank LOCAL_RANK]
                               [--dist_master_addr DIST_MASTER_ADDR]
                               [--dist_master_port DIST_MASTER_PORT]
                               [--dist_launcher {slurm,mpi,None}]
                               [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                               [--unused_parameters UNUSED_PARAMETERS]
                               [--sharded_ddp SHARDED_DDP]
                               [--cudnn_enabled CUDNN_ENABLED]
                               [--cudnn_benchmark CUDNN_BENCHMARK]
                               [--cudnn_deterministic CUDNN_DETERMINISTIC]
                               [--collect_stats COLLECT_STATS]
                               [--write_collected_feats WRITE_COLLECTED_FEATS]
                               [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                               [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                               [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                               [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                               [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                               [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                               [--grad_clip GRAD_CLIP]
                               [--grad_clip_type GRAD_CLIP_TYPE]
                               [--grad_noise GRAD_NOISE]
                               [--accum_grad ACCUM_GRAD]
                               [--no_forward_run NO_FORWARD_RUN]
                               [--resume RESUME]
                               [--train_dtype {float16,float32,float64}]
                               [--use_amp USE_AMP]
                               [--log_interval LOG_INTERVAL]
                               [--use_matplotlib USE_MATPLOTLIB]
                               [--use_tensorboard USE_TENSORBOARD]
                               [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                               [--use_wandb USE_WANDB]
                               [--wandb_project WANDB_PROJECT]
                               [--wandb_id WANDB_ID]
                               [--wandb_entity WANDB_ENTITY]
                               [--wandb_name WANDB_NAME]
                               [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                               [--detect_anomaly DETECT_ANOMALY]
                               [--use_adapter USE_ADAPTER]
                               [--adapter {lora,houlsby}]
                               [--save_strategy {all,adapter_only,required_grad_only}]
                               [--adapter_conf ADAPTER_CONF]
                               [--pretrain_path PRETRAIN_PATH]
                               [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                               [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                               [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                               [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                               [--batch_size BATCH_SIZE]
                               [--valid_batch_size VALID_BATCH_SIZE]
                               [--batch_bins BATCH_BINS]
                               [--valid_batch_bins VALID_BATCH_BINS]
                               [--train_shape_file TRAIN_SHAPE_FILE]
                               [--valid_shape_file VALID_SHAPE_FILE]
                               [--batch_type {unsorted,sorted,folded,length,numel}]
                               [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                               [--fold_length FOLD_LENGTH]
                               [--sort_in_batch {descending,ascending}]
                               [--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
                               [--sort_batch {descending,ascending}]
                               [--multiple_iterator MULTIPLE_ITERATOR]
                               [--chunk_length CHUNK_LENGTH]
                               [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                               [--num_cache_chunks NUM_CACHE_CHUNKS]
                               [--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
                               [--chunk_default_fs CHUNK_DEFAULT_FS]
                               [--chunk_max_abs_length CHUNK_MAX_ABS_LENGTH]
                               [--chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES]
                               [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                               [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                               [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                               [--max_cache_size MAX_CACHE_SIZE]
                               [--max_cache_fd MAX_CACHE_FD]
                               [--allow_multi_rates ALLOW_MULTI_RATES]
                               [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                               [--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
                               [--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
                               [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                               [--optim_conf OPTIM_CONF]
                               [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                               [--scheduler_conf SCHEDULER_CONF]
                               [--token_list TOKEN_LIST]
                               [--input_size INPUT_SIZE] [--init INIT]
                               [--model_conf MODEL_CONF]
                               [--encoder_conf ENCODER_CONF]
                               [--joint_network_conf JOINT_NETWORK_CONF]
                               [--use_preprocessor USE_PREPROCESSOR]
                               [--token_type {bpe,char,word,phn}]
                               [--bpemodel BPEMODEL]
                               [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                               [--cleaner {None,tacotron,jaconv,vietnamese}]
                               [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                               [--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
                               [--rir_scp RIR_SCP]
                               [--rir_apply_prob RIR_APPLY_PROB]
                               [--noise_scp NOISE_SCP]
                               [--noise_apply_prob NOISE_APPLY_PROB]
                               [--noise_db_range NOISE_DB_RANGE]
                               [--frontend {default,sliding_window}]
                               [--frontend_conf FRONTEND_CONF]
                               [--specaug {specaug,None}]
                               [--specaug_conf SPECAUG_CONF]
                               [--normalize {global_mvn,utterance_mvn,None}]
                               [--normalize_conf NORMALIZE_CONF]
                               [--decoder {mega,rnn,rwkv,stateless}]
                               [--decoder_conf DECODER_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --drop_last_iter DROP_LAST_ITER
                        Exclude the minibatch with leftovers. (default: False)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: sequence)
  --valid_iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: None)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images to plot from the attention outputs. This option only makes sense for attention-based models. Setting it to 0 disables the attention plot (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
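
As a rough illustration of the env:// initialization described for --dist_init_method, the sketch below collects the four environment variables named in that help text. The helper name read_env_rendezvous and the layout of the returned dictionary are hypothetical and are not part of ESPnet.

```python
import os
from typing import Dict, Mapping, Optional, Union


def read_env_rendezvous(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Union[str, int]]:
    """Collect the variables consulted when init_method is "env://".

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    names = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
    missing = [name for name in names if name not in env]
    if missing:
        raise KeyError(f"Missing environment variables: {missing}")
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "world_size": int(env["WORLD_SIZE"]),
        "rank": int(env["RANK"]),
    }
```

Passing a plain dictionary instead of os.environ makes it easy to check a launcher configuration before starting a job.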

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair referring to the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging early stopping. Give the phase ("train" or "valid"), the criterion name, and the mode ("min" or "max"), e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give the phase ("train" or "valid"), the criterion name, and the mode ("min" or "max"), e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        Flag to enable noise injection into gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding or training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every this number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
  --use_adapter USE_ADAPTER
                        Enable efficient finetuning, see (https://arxiv.org/abs/2106.09685) for large pre-trained foundation models, like Whisper and SSL models (default: False)
  --adapter {lora,houlsby}
                        Adapter Name (default: lora)
  --save_strategy {all,adapter_only,required_grad_only}
                        The strategy to save parameters. Default: 'all'
                        'all': save all parameters
                        'adapter_only': save only adapter parameters, without other parameters like downstream model
                        'required_grad_only': save only parameters with requires_grad=True
                         (default: all)
  --adapter_conf ADAPTER_CONF
                        Configuration for efficient finetuning (default: {})
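
To make the interplay of --patience and --early_stopping_criterion more concrete, here is a minimal sketch of a patience check driven by a (phase, criterion, mode) triplet. The function name should_stop, the layout of the history list, and the exact stopping rule (stop once the number of epochs since the best one reaches the patience) are assumptions for illustration and may differ from what the ESPnet2 trainer does.

```python
from typing import Dict, List, Optional, Tuple

# One entry per finished epoch, e.g. {"valid": {"loss": 0.8, "acc": 0.7}}.
History = List[Dict[str, Dict[str, float]]]


def should_stop(
    history: History,
    criterion: Tuple[str, str, str] = ("valid", "loss", "min"),
    patience: Optional[int] = None,
) -> bool:
    """Return True when no improvement was seen for `patience` epochs.

    Hypothetical helper; the stopping rule is an assumption.
    """
    if patience is None or not history:
        return False  # No patience given: never stop early.
    phase, name, mode = criterion
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max': {mode}")
    values = [epoch[phase][name] for epoch in history]
    pick = min if mode == "min" else max
    best_epoch = values.index(pick(values))  # First occurrence of the best value.
    epochs_since_best = len(values) - 1 - best_epoch
    return epochs_since_best >= patience
```

With validation losses 1.0, 0.8, 0.9, 0.85 the best epoch is the second one, so a patience of 2 triggers a stop after the fourth epoch while a patience of 3 does not.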

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
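
The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format accepted by --init_param can be split as in the sketch below. The names InitParamSpec and parse_init_param are hypothetical, and treating exclude_keys as a comma-separated list is an assumption; the help text above only shows a single excluded key.

```python
from typing import NamedTuple, Optional, Tuple


class InitParamSpec(NamedTuple):
    """Hypothetical container for one --init_param value."""

    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: Tuple[str, ...]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>'.

    Trailing fields may be omitted. Paths that themselves contain ':'
    are not handled by this sketch.
    """
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"Too many ':'-separated fields: {spec}")
    parts += [""] * (4 - len(parts))  # Pad the omitted fields.
    file_path, src_key, dst_key, excludes = parts
    return InitParamSpec(
        file_path=file_path,
        src_key=src_key or None,
        dst_key=dst_key or None,
        # Assumption: multiple excluded keys are separated by commas.
        exclude_keys=tuple(key for key in excludes.split(",") if key),
    )
```

For the three examples in the help text, this yields no keys at all, the pair decoder/decoder, and the pair decoder/decoder with decoder.embed excluded, respectively.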

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', 'folded', or 'catbel'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The first column is referred to, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is referred to, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L refers to the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches which have as close to the same number of 'bins' as possible, counted by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is referred to, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches which have as close to the same number of 'bins' as possible, but counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by sample length. To enable this, "shape_file" must have the length information. (default: descending)
  --shuffle_within_batch SHUFFLE_WITHIN_BATCH
                        Shuffles whole batches sample-wise. Normally required for classification tasks. (default: False)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
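
The batch-size rule quoted for the "folded" batch type, batch_size = base_batch_size // (L // fold_length), can be tried out with the small sketch below. It reads the formula literally; clamping the divisor and the result to at least 1 is an assumption added here so that short samples (L < fold_length) do not cause a division by zero, and the actual FoldedBatchSampler may behave differently. The function name folded_batch_size is hypothetical.

```python
def folded_batch_size(base_batch_size: int, max_length: int, fold_length: int) -> int:
    """Literal reading of batch_size = base_batch_size // (L // fold_length).

    max_length plays the role of L, the largest sample length in the
    mini-batch. The two max(1, ...) clamps are assumptions of this sketch.
    """
    if fold_length <= 0:
        raise ValueError("fold_length must be positive")
    divisor = max(1, max_length // fold_length)
    return max(1, base_batch_size // divisor)


# A batch whose longest sample spans two folds gets half the base size.
print(folded_batch_size(base_batch_size=20, max_length=1700, fold_length=800))  # 10
```

The effect is that mini-batches containing long utterances hold fewer samples, which keeps the memory use per mini-batch roughly bounded.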

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it's less than 1, chunks overlap, and if it's bigger than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
  --chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
                        List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
                          - exactly match one of the prefixes in `chunk_excluded_key_prefixes`
                          - have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
  --chunk_default_fs CHUNK_DEFAULT_FS
                        Default sampling rate used for the chunk length. Will be used to adaptively adjust the chunk length for data of different sampling rates. (If None, the chunk length will be fixed.) (default: None)
  --chunk_max_abs_length CHUNK_MAX_ABS_LENGTH
                        Maximum number of samples per chunk for all sampling rates (default: None)
  --chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES
                        Discard samples shorter than the minimum chunk length (default: True)
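
The --chunk_length syntax and the effect of --chunk_shift_ratio can be sketched as follows. Both function names are hypothetical, and two details are assumptions rather than documented behavior: the range form '300-400' is treated as inclusive of both ends, and the shift width is taken to be int(chunk_length * chunk_shift_ratio).

```python
from typing import List


def parse_chunk_lengths(spec: str) -> List[int]:
    """Expand e.g. '300', '300,400,500', or '300-400' into candidate lengths."""
    lengths: List[int] = []
    for item in spec.split(","):
        bounds = [int(x) for x in item.split("-")]
        if len(bounds) == 1:
            lengths.append(bounds[0])
        elif len(bounds) == 2 and bounds[0] <= bounds[1]:
            # Assumption: the range includes both end points.
            lengths.extend(range(bounds[0], bounds[1] + 1))
        else:
            raise ValueError(f"Invalid chunk_length item: {item!r}")
    return lengths


def chunk_starts(seq_len: int, chunk_len: int, shift_ratio: float) -> List[int]:
    """Start offsets of chunks of chunk_len frames within a sequence.

    A ratio below 1 gives overlapping chunks, a ratio above 1 leaves gaps.
    A sequence shorter than chunk_len yields no chunk at all.
    """
    shift = max(1, int(chunk_len * shift_ratio))
    return list(range(0, seq_len - chunk_len + 1, shift))


print(parse_chunk_lengths("300,400,500"))  # [300, 400, 500]
print(chunk_starts(seq_len=1000, chunk_len=500, shift_ratio=0.5))  # [0, 250, 500]
```

One of the candidate lengths would then be drawn at random per sample, for instance with random.choice.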

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "multi_columns_sound":
                        Enable multi-column wav.scp. The following text file can be loaded as multi-channel audio data

                           utterance_id_a a.wav a2.wav
                           utterance_id_b b.wav b2.wav
                           ...

                        "variable_columns_sound":
                        Loading variable numbers (columns) of audios in wav.scp. The following text file can be loaded as stacked audio data

                           utterance_id_a a1.wav a2.wav a3.wav
                           utterance_id_b b1.wav
                           utterance_id_c c1.wav c2.wav
                           ...

                        Note that audios of different lengths will be right-padded with np.nan to the longest audio in the sample.
                        A preprocessor must be used to remove these paddings.

                        "score":
                        Return text as is. The text contains tempo and note info.
                        For each note, 'start' 'end' 'syllable' 'midi' and 'phones' are included.

                           utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
                           utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integer numbers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integer numbers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "random_text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           hello world
                           foo bar
                           ...

                        "hdf5":
                        A HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader; currently supported for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys for mini-batches, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --allow_multi_rates ALLOW_MULTI_RATES
                        Whether to allow audios to have different sampling rates (default: False)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
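
The 'path,name,type' triplet given to --train_data_path_and_name_and_type, and the simple "text_int" layout listed among the supported file types, can be illustrated with the sketch below. Both function names are hypothetical and the readers in ESPnet2 are more elaborate; this only mirrors the formats as they are described in the help text.

```python
from typing import Dict, Iterable, List, Tuple


def parse_path_name_type(value: str) -> Tuple[str, str, str]:
    """Split 'some/path/a.scp,foo,sound' into (path, name, type)."""
    parts = value.split(",")
    if len(parts) != 3:
        raise ValueError(f"Expected three comma-separated words: {value!r}")
    path, name, file_type = parts
    return path, name, file_type


def read_text_int(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Parse 'text_int' lines: an utterance id followed by integers."""
    table: Dict[str, List[int]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip empty lines.
        key, *values = line.split()
        table[key] = [int(v) for v in values]
    return table


print(parse_path_name_type("some/path/a.scp,foo,sound"))
print(read_text_int(["utterance_id_A 12 0 1 3", "utterance_id_B 3 3 1"]))
```

Since the option is repeatable, each occurrence contributes one such triplet, and the name (here foo) becomes the key under which that feature appears in the mini-batch.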

Optimizer related:
  --exclude_weight_decay EXCLUDE_WEIGHT_DECAY
                        Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
  --exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
                        The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related.

  --token_list TOKEN_LIST
                        Integer-string mapper for tokens. (default: None)
  --input_size INPUT_SIZE
                        The number of dimensions for input features. (default: None)
  --init INIT           Type of model initialization to use. (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for the model class. (default: {'transducer_weight': 1.0, 'use_k2_pruned_loss': False, 'k2_pruned_loss_args': {}, 'warmup_steps': 25000, 'validation_nstep': 2, 'fastemit_lambda': 0.0, 'auxiliary_ctc_weight': 0.0, 'auxiliary_ctc_dropout_rate': 0.0, 'auxiliary_lm_loss_weight': 0.0, 'auxiliary_lm_loss_smoothing': 0.05, 'ignore_id': -1, 'sym_space': '<space>', 'sym_blank': '<blank>', 'report_cer': False, 'report_wer': False, 'extract_feats_in_collect_stats': True})
  --encoder_conf ENCODER_CONF
                        The keyword arguments for the encoder class. (default: {})
  --joint_network_conf JOINT_NETWORK_CONF
                        The keyword arguments for the joint network class. (default: {})

  Preprocess related.

  --use_preprocessor USE_PREPROCESSOR
                        Whether to apply preprocessing to input data. (default: True)
  --token_type {bpe,char,word,phn}
                        The type of tokens to use during tokenization. (default: bpe)
  --bpemodel BPEMODEL   The path of the sentencepiece model. (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        The 'non_linguistic_symbols' file path. (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese}
                        Text cleaner to use. (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        g2p method to use if --token_type=phn. (default: None)
  --speech_volume_normalize SPEECH_VOLUME_NORMALIZE
                        Normalization value for maximum amplitude scaling. (default: None)
  --rir_scp RIR_SCP     The RIR SCP file path. (default: None)
  --rir_apply_prob RIR_APPLY_PROB
                        The probability of the applied RIR convolution. (default: 1.0)
  --noise_scp NOISE_SCP
                        The path of noise SCP file. (default: None)
  --noise_apply_prob NOISE_APPLY_PROB
                        The probability of the applied noise addition. (default: 1.0)
  --noise_db_range NOISE_DB_RANGE
                        The range of the noise decibel level. (default: 13_15)
  --frontend {default,sliding_window}
                        The frontend type (default: default)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: utterance_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --decoder {mega,rnn,rwkv,stateless}
                        The decoder type (default: rnn)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})

diar_inference.py¶

usage: diar_inference.py [-h] [--config CONFIG]
                         [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                         --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                         [--dtype {float16,float32,float64}] [--fs FS]
                         [--num_workers NUM_WORKERS]
                         --data_path_and_name_and_type
                         DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                         [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                         [--train_config TRAIN_CONFIG]
                         [--model_file MODEL_FILE] [--model_tag MODEL_TAG]
                         [--batch_size BATCH_SIZE]
                         [--segment_size SEGMENT_SIZE] [--hop_size HOP_SIZE]
                         [--show_progressbar SHOW_PROGRESSBAR]
                         [--num_spk NUM_SPK] [--enh_s2t_task ENH_S2T_TASK]
                         [--normalize_segment_scale NORMALIZE_SEGMENT_SCALE]
                         [--normalize_output_wav NORMALIZE_OUTPUT_WAV]
                         [--multiply_diar_result MULTIPLY_DIAR_RESULT]

Speaker Diarization inference

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --fs FS               Sampling rate (default: 8000)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --train_config TRAIN_CONFIG
                        Diarization training configuration (default: None)
  --model_file MODEL_FILE
                        Diarization model parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If this option is specified,
                        train_config and model_file will be overwritten
                        (default: None)

Data loading related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)

Diarize speech related:
  --segment_size SEGMENT_SIZE
                        Segment length in seconds for segment-wise speaker
                        diarization (default: None)
  --hop_size HOP_SIZE   Hop length in seconds for segment-wise speech
                        enhancement/separation (default: None)
  --show_progressbar SHOW_PROGRESSBAR
                        Whether to show a progress bar when performing
                        segment-wise speaker diarization (default: False)
  --num_spk NUM_SPK     Predetermined number of speakers for inference
                        (default: None)

Enh + Diar related:
  --enh_s2t_task ENH_S2T_TASK
                        enhancement and diarization joint model (default:
                        False)
  --normalize_segment_scale NORMALIZE_SEGMENT_SCALE
                        Whether to normalize the energy of the separated
                        streams in each segment (default: False)
  --normalize_output_wav NORMALIZE_OUTPUT_WAV
                        Whether to normalize the predicted wav to [-1~1]
                        (default: False)
  --multiply_diar_result MULTIPLY_DIAR_RESULT
                        Whether to multiply diar results to separated waves
                        (default: False)
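
As a rough illustration of how the options above combine, the following Python sketch assembles an argument list for diar_inference.py without running it. The helper `build_diar_inference_argv`, all file paths, and the data key name `speech` are hypothetical placeholders; only the flag names come from the usage text. It also assumes the `path,name,type` form of `--data_path_and_name_and_type` matches the one described for the training options.

```python
import shlex
import sys


def build_diar_inference_argv(output_dir, wav_scp, train_config, model_file,
                              fs=8000, segment_size=None, hop_size=None,
                              num_spk=None):
    """Assemble an argv list for diar_inference.py; nothing is executed."""
    argv = [
        sys.executable, "-m", "espnet2.bin.diar_inference",
        "--output_dir", output_dir,
        "--fs", str(fs),
        # Assumed "path,name,type" triple; "speech" is a hypothetical key name.
        "--data_path_and_name_and_type", f"{wav_scp},speech,sound",
        "--train_config", train_config,
        "--model_file", model_file,
    ]
    # Optional flags are appended only when a value is given.
    optional = (
        ("--segment_size", segment_size),
        ("--hop_size", hop_size),
        ("--num_spk", num_spk),
    )
    for flag, value in optional:
        if value is not None:
            argv += [flag, str(value)]
    return argv


# Hypothetical paths, for illustration only.
argv = build_diar_inference_argv(
    output_dir="exp/diar_infer",
    wav_scp="dump/test/wav.scp",
    train_config="exp/diar_train/config.yaml",
    model_file="exp/diar_train/valid.acc.best.pth",
    segment_size=10,
    hop_size=5,
    num_spk=2,
)
print(shlex.join(argv))
```

The resulting list could be passed to `subprocess.run` once the placeholder paths are replaced with real ones.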

diar_train.py¶

usage: diar_train.py [-h] [--config CONFIG] [--print_config]
                     [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                     [--drop_last_iter DROP_LAST_ITER] [--dry_run DRY_RUN]
                     [--iterator_type {sequence,category,chunk,task,none}]
                     [--valid_iterator_type {sequence,category,chunk,task,none}]
                     [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                     [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                     [--dist_backend DIST_BACKEND]
                     [--dist_init_method DIST_INIT_METHOD]
                     [--dist_world_size DIST_WORLD_SIZE]
                     [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                     [--dist_master_addr DIST_MASTER_ADDR]
                     [--dist_master_port DIST_MASTER_PORT]
                     [--dist_launcher {slurm,mpi,None}]
                     [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                     [--unused_parameters UNUSED_PARAMETERS]
                     [--sharded_ddp SHARDED_DDP]
                     [--cudnn_enabled CUDNN_ENABLED]
                     [--cudnn_benchmark CUDNN_BENCHMARK]
                     [--cudnn_deterministic CUDNN_DETERMINISTIC]
                     [--collect_stats COLLECT_STATS]
                     [--write_collected_feats WRITE_COLLECTED_FEATS]
                     [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                     [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                     [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                     [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                     [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                     [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                     [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                     [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                     [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                     [--train_dtype {float16,float32,float64}]
                     [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                     [--use_matplotlib USE_MATPLOTLIB]
                     [--use_tensorboard USE_TENSORBOARD]
                     [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                     [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                     [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                     [--wandb_name WANDB_NAME]
                     [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                     [--detect_anomaly DETECT_ANOMALY]
                     [--use_adapter USE_ADAPTER] [--adapter {lora,houlsby}]
                     [--save_strategy {all,adapter_only,required_grad_only}]
                     [--adapter_conf ADAPTER_CONF]
                     [--pretrain_path PRETRAIN_PATH]
                     [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                     [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                     [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                     [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                     [--batch_size BATCH_SIZE]
                     [--valid_batch_size VALID_BATCH_SIZE]
                     [--batch_bins BATCH_BINS]
                     [--valid_batch_bins VALID_BATCH_BINS]
                     [--train_shape_file TRAIN_SHAPE_FILE]
                     [--valid_shape_file VALID_SHAPE_FILE]
                     [--batch_type {unsorted,sorted,folded,length,numel}]
                     [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                     [--fold_length FOLD_LENGTH]
                     [--sort_in_batch {descending,ascending}]
                     [--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
                     [--sort_batch {descending,ascending}]
                     [--multiple_iterator MULTIPLE_ITERATOR]
                     [--chunk_length CHUNK_LENGTH]
                     [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                     [--num_cache_chunks NUM_CACHE_CHUNKS]
                     [--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
                     [--chunk_default_fs CHUNK_DEFAULT_FS]
                     [--chunk_max_abs_length CHUNK_MAX_ABS_LENGTH]
                     [--chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES]
                     [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                     [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                     [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                     [--max_cache_size MAX_CACHE_SIZE]
                     [--max_cache_fd MAX_CACHE_FD]
                     [--allow_multi_rates ALLOW_MULTI_RATES]
                     [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                     [--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
                     [--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
                     [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                     [--optim_conf OPTIM_CONF]
                     [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                     [--scheduler_conf SCHEDULER_CONF] [--num_spk NUM_SPK]
                     [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                     [--input_size INPUT_SIZE] [--model_conf MODEL_CONF]
                     [--use_preprocessor USE_PREPROCESSOR]
                     [--frontend {default,sliding_window,s3prl,None}]
                     [--frontend_conf FRONTEND_CONF]
                     [--specaug {specaug,None}] [--specaug_conf SPECAUG_CONF]
                     [--normalize {global_mvn,utterance_mvn,None}]
                     [--normalize_conf NORMALIZE_CONF]
                     [--encoder {conformer,transformer,rnn}]
                     [--encoder_conf ENCODER_CONF] [--decoder {linear}]
                     [--decoder_conf DECODER_CONF]
                     [--label_aggregator {label_aggregator}]
                     [--label_aggregator_conf LABEL_AGGREGATOR_CONF]
                     [--attractor {rnn,None}]
                     [--attractor_conf ATTRACTOR_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --drop_last_iter DROP_LAST_ITER
                        Exclude the minibatch with leftovers. (default: False)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: sequence)
  --valid_iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: None)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images to plot from the attention outputs. This option is meaningful only for attention-based models. Set it to 0 to disable the attention plot (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
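
With `--dist_init_method env://`, the rendezvous information comes from the four environment variables named above. The sketch below only shows that mapping. The helper `dist_env` and the address and port values are hypothetical, and how ESPnet itself fills these variables from `--dist_master_addr`, `--dist_master_port`, `--dist_world_size`, and `--dist_rank` is not shown here.

```python
import os


def dist_env(master_addr, master_port, world_size, rank):
    """Return the environment variables consulted by the env:// init method."""
    return {
        "MASTER_ADDR": str(master_addr),
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
    }


# Work on a copy so the current process environment is left untouched.
env = dict(os.environ)
env.update(dist_env("127.0.0.1", 29500, world_size=2, rank=0))

for key in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK"):
    print(f"{key}={env[key]}")
```

A dictionary like `env` could be handed to a child process (for example through the `env` argument of `subprocess.run`) so that each node gets its own `RANK`.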

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair consisting of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used to judge early stopping. Give a triple consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used to judge the best model. Give a triple consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every this number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
  --use_adapter USE_ADAPTER
                        Enable efficient finetuning (see https://arxiv.org/abs/2106.09685) for large pre-trained foundation models, like Whisper and SSL models (default: False)
  --adapter {lora,houlsby}
                        Adapter Name (default: lora)
  --save_strategy {all,adapter_only,required_grad_only}
                        The strategy to save parameters. Default: 'all'
                        'all': save all parameters
                        'adapter_only': save only adapter parameters, without other parameters like downstream model
                        'required_grad_only': save only parameters with requires_grad=True
                         (default: all)
  --adapter_conf ADAPTER_CONF
                        Configuration for efficient finetuning (default: {})
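
The criterion options above all describe a (phase, criterion, mode) selection over per-epoch statistics. The sketch below illustrates that idea with made-up numbers: it ranks epochs for one criterion, as `--keep_nbest_models` would need, and checks a patience rule in the spirit of `--patience`. The helpers `nbest_epochs` and `should_stop`, the `stats` layout, and the exact stopping comparison are assumptions for illustration, not the Trainer's actual code.

```python
def nbest_epochs(stats, phase, criterion, mode, n):
    """Return the n best epoch numbers for one (phase, criterion, mode)."""
    values = {epoch: s[phase][criterion] for epoch, s in stats.items()}
    # mode == "max": larger is better; mode == "min": smaller is better.
    ranked = sorted(values, key=lambda e: values[e], reverse=(mode == "max"))
    return ranked[:n]


def should_stop(stats, phase, criterion, mode, patience):
    """Assumed rule: stop once the best epoch is more than `patience` epochs old."""
    if patience is None:
        return False
    best_epoch = nbest_epochs(stats, phase, criterion, mode, 1)[0]
    latest_epoch = max(stats)
    return latest_epoch - best_epoch > patience


# Hypothetical statistics for four epochs.
stats = {
    1: {"valid": {"loss": 0.90, "acc": 0.60}},
    2: {"valid": {"loss": 0.70, "acc": 0.70}},
    3: {"valid": {"loss": 0.75, "acc": 0.72}},
    4: {"valid": {"loss": 0.80, "acc": 0.71}},
}

print(nbest_epochs(stats, "valid", "loss", "min", 2))
print(nbest_epochs(stats, "valid", "acc", "max", 1))
print(should_stop(stats, "valid", "loss", "min", patience=1))
```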

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
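
The `--init_param` format can be pictured as a colon-separated record with up to four fields. The following sketch splits such a string into its parts. `parse_init_param` is a hypothetical helper written for this page, and the comma separator for several exclude keys is an assumption; the loader in ESPnet may handle edge cases differently.

```python
def parse_init_param(spec):
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into fields."""
    parts = spec.split(":", 3)
    # Trailing fields that were omitted become None.
    parts += [None] * (4 - len(parts))
    path, src_key, dst_key, exclude = parts
    # Assumption: several exclude keys would be separated by commas.
    exclude_keys = exclude.split(",") if exclude else []
    return {
        "path": path,
        "src_key": src_key or None,
        "dst_key": dst_key or None,
        "exclude_keys": exclude_keys,
    }


for spec in (
    "some/where/model.pth",
    "some/where/model.pth:decoder:decoder",
    "some/where/model.pth:decoder:decoder:decoder.embed",
):
    print(parse_init_param(spec))
```

Because the sketch splits on every colon, a path that itself contains a colon (for example a Windows drive letter) would be misread.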

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', 'folded', or 'catbel'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        Only the first column is used, so a 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is used, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L refers to the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that have as close to the same number of 'bins' as possible, counted by the total length of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is used, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches that have as close to the same number of 'bins' as possible, counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by sample length. To enable this, "shape_file" must have the length information. (default: descending)
  --shuffle_within_batch SHUFFLE_WITHIN_BATCH
                        Shuffle whole batches sample-wise. Normally required for classification tasks. (default: False)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
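
To make the batch types above more concrete, the sketch below reads shape-file lines of the form shown in the help text and computes the quantities each sampler counts: the first dimension for "length", the product of all dimensions for "numel", and the "folded" batch size formula. The helper names are hypothetical. The `max(1, ...)` clamp is an assumption added here because the formula as written divides by zero when L is shorter than fold_length; the real sampler may treat that case differently.

```python
import math


def parse_shape_line(line):
    """'utterance_id_a 1000,80' -> ('utterance_id_a', (1000, 80))."""
    key, shape = line.split(maxsplit=1)
    return key, tuple(int(x) for x in shape.split(","))


def folded_batch_size(base_batch_size, longest, fold_length):
    """batch_size = base_batch_size // (L // fold_length), from the help text."""
    # Assumption: clamp the divisor to 1 so short samples do not divide by zero.
    return base_batch_size // max(1, longest // fold_length)


lines = [
    "utterance_id_a 1000,80",
    "utterance_id_b 1453,80",
    "utterance_id_c 1241,80",
]
shapes = dict(parse_shape_line(line) for line in lines)

lengths = {key: shape[0] for key, shape in shapes.items()}         # counted by "length"
numels = {key: math.prod(shape) for key, shape in shapes.items()}  # counted by "numel"

print("total length:", sum(lengths.values()))
print("total elements:", sum(numels.values()))
print("folded batch size:", folded_batch_size(20, max(lengths.values()), 500))
```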

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it's less than 1, chunks overlap; if it's bigger than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
  --chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
                        List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
                          - exactly match one of the prefixes in `chunk_excluded_key_prefixes`
                          - have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
  --chunk_default_fs CHUNK_DEFAULT_FS
                        Default sampling rate used for the chunk length. Will be used to adaptively adjust the chunk length for data of different sampling rates. (If None, the chunk length will be fixed.) (default: None)
  --chunk_max_abs_length CHUNK_MAX_ABS_LENGTH
                        Maximum number of samples per chunk for all sampling rates (default: None)
  --chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES
                        Discard samples shorter than the minimum chunk length (default: True)
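
The `--chunk_length` syntax and the effect of `--chunk_shift_ratio` can be sketched as follows. Both helpers are hypothetical. The sketch assumes that a range such as '300-400' includes both endpoints and that the shift in frames is the chunk length times the ratio, rounded down; the chunk iterator in ESPnet may differ in these details.

```python
def parse_chunk_length(spec):
    """Expand '300', '300,400,500', or '300-400' into candidate lengths."""
    lengths = []
    for item in str(spec).split(","):
        bounds = [int(x) for x in item.split("-")]
        if len(bounds) == 1:
            lengths.append(bounds[0])
        elif len(bounds) == 2:
            # Assumption: 'a-b' covers every integer from a to b, inclusive.
            lengths.extend(range(bounds[0], bounds[1] + 1))
        else:
            raise ValueError(f"Unsupported chunk_length item: {item!r}")
    return lengths


def chunk_starts(seq_len, chunk_len, shift_ratio):
    """Start offsets of chunks; an empty list means the sample is too short."""
    if seq_len < chunk_len:
        return []
    shift = max(1, int(chunk_len * shift_ratio))
    return list(range(0, seq_len - chunk_len + 1, shift))


print(parse_chunk_length("300,400,500"))
print(chunk_starts(1000, 500, 0.5))   # ratio < 1: neighbouring chunks overlap
print(chunk_starts(2000, 500, 2.0))   # ratio > 1: gaps between chunks
```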

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio formats supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "multi_columns_sound":
                        Enable a multi-column wav.scp. The following text file can be loaded as multi-channel audio data

                           utterance_id_a a.wav a2.wav
                           utterance_id_b b.wav b2.wav
                           ...

                        "variable_columns_sound":
                        Load a variable number (columns) of audio files per line in wav.scp. The following text file can be loaded as stacked audio data

                           utterance_id_a a1.wav a2.wav a3.wav
                           utterance_id_b b1.wav
                           utterance_id_c c1.wav c2.wav
                           ...

                        Note that audios of different lengths will be right-padded with np.nan to the longest audio in the sample.
                        A preprocessor must be used to remove these paddings.

                        "score":
                        Return text as is. The text contains tempo and note info.
                        For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.

                           utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
                           utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integer numbers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integer numbers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "random_text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           hello world
                           foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader, currently supported for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys for the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to keep open for ark files. This feature is only valid when the data type is 'kaldi_ark'. (default: 32)
  --allow_multi_rates ALLOW_MULTI_RATES
                        Whether to allow audios to have different sampling rates (default: False)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
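
The "rand_int_\d+_\d+" data type described above combines a shape file with bounds encoded in the type name. The following is a minimal sketch of that idea using only the Python standard library, not the loader ESPnet itself uses. The helper names are hypothetical, nested lists stand in for an ndarray, and the upper bound is treated as exclusive, which the help text does not specify.

```python
import random
import re


def parse_shape_line(line):
    """Split a shape-file line such as 'utterance_id_A 3,4' into (id, shape)."""
    utt_id, shape_text = line.split(maxsplit=1)
    shape = tuple(int(n) for n in shape_text.split(","))
    return utt_id, shape


def parse_rand_int_type(type_name):
    """Extract (low, high) from a type name such as 'rand_int_0_10'."""
    match = re.fullmatch(r"rand_int_(\d+)_(\d+)", type_name)
    if match is None:
        raise ValueError(f"not a rand_int type: {type_name}")
    return int(match.group(1)), int(match.group(2))


def random_ints(shape, low, high, rng):
    """Build a nested list of the given shape filled with random integers.

    Assumption: the upper bound is exclusive (low <= value < high).
    """
    if len(shape) == 1:
        return [rng.randrange(low, high) for _ in range(shape[0])]
    return [random_ints(shape[1:], low, high, rng) for _ in range(shape[0])]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
low, high = parse_rand_int_type("rand_int_0_10")
utt_id, shape = parse_shape_line("utterance_id_A 3,4")
array = random_ints(shape, low, high, rng)
```

Here `array` has 3 rows of 4 integers each, matching the shape "3,4" from the example file.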

Optimizer related:
  --exclude_weight_decay EXCLUDE_WEIGHT_DECAY
                        Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
  --exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
                        The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --num_spk NUM_SPK     The number of speakers (for each recording) used in system training (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --input_size INPUT_SIZE
                        The number of input dimensions of the feature (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'diar_weight': 1.0, 'attractor_weight': 1.0})

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --frontend {default,sliding_window,s3prl,None}
                        The frontend type (default: default)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: utterance_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --encoder {conformer,transformer,rnn}
                        The encoder type (default: transformer)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --decoder {linear}    The decoder type (default: linear)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})
  --label_aggregator {label_aggregator}
                        The label_aggregator type (default: label_aggregator)
  --label_aggregator_conf LABEL_AGGREGATOR_CONF
                        The keyword arguments for label_aggregator (default: {})
  --attractor {rnn,None}
                        The attractor type (default: None)
  --attractor_conf ATTRACTOR_CONF
                        The keyword arguments for attractor (default: {})

enh_inference.py¶

usage: enh_inference.py [-h] [--config CONFIG]
                        [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                        --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                        [--dtype {float16,float32,float64}] [--fs FS]
                        [--num_workers NUM_WORKERS]
                        --data_path_and_name_and_type
                        DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                        [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                        [--output_format OUTPUT_FORMAT]
                        [--normalize_output_wav NORMALIZE_OUTPUT_WAV]
                        [--train_config TRAIN_CONFIG]
                        [--model_file MODEL_FILE] [--model_tag MODEL_TAG]
                        [--inference_config INFERENCE_CONFIG]
                        [--enh_s2t_task ENH_S2T_TASK]
                        [--batch_size BATCH_SIZE]
                        [--segment_size SEGMENT_SIZE] [--hop_size HOP_SIZE]
                        [--normalize_segment_scale NORMALIZE_SEGMENT_SCALE]
                        [--show_progressbar SHOW_PROGRESSBAR]
                        [--ref_channel REF_CHANNEL]

Frontend inference

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --fs FS               Sampling rate (default: 8000)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

Output data related:
  --output_format OUTPUT_FORMAT
                        Output format for the separated speech (default: wav)
  --normalize_output_wav NORMALIZE_OUTPUT_WAV
                        Whether to normalize the predicted wav to the range [-1, 1]
                        (default: False)

The model configuration related:
  --train_config TRAIN_CONFIG
                        Training configuration file (default: None)
  --model_file MODEL_FILE
                        Model parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If this option is specified,
                        train_config and model_file will be overwritten
                        (default: None)
  --inference_config INFERENCE_CONFIG
                        Optional configuration file for overwriting enh model
                        attributes during inference (default: None)
  --enh_s2t_task ENH_S2T_TASK
                        enhancement and asr joint model (default: False)

Data loading related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)

SeparateSpeech related:
  --segment_size SEGMENT_SIZE
                        Segment length in seconds for segment-wise speech
                        enhancement/separation (default: None)
  --hop_size HOP_SIZE   Hop length in seconds for segment-wise speech
                        enhancement/separation (default: None)
  --normalize_segment_scale NORMALIZE_SEGMENT_SCALE
                        Whether to normalize the energy of the separated
                        streams in each segment (default: True)
  --show_progressbar SHOW_PROGRESSBAR
                        Whether to show a progress bar when performing
                        segment-wise speech enhancement/separation (default:
                        False)
  --ref_channel REF_CHANNEL
                        If not None, this will overwrite the ref_channel
                        defined in the separator module (for multi-channel
                        speech processing) (default: None)
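
The --segment_size and --hop_size options above are given in seconds, while audio is indexed in samples at the rate set by --fs. The sketch below shows one way such segment boundaries could be derived. It is an illustration under stated assumptions, not the procedure enh_inference.py actually uses: the function name is hypothetical, the last segment is simply truncated at the end of the signal, and how overlapping regions are merged is not covered.

```python
def segment_bounds(num_samples, fs, segment_size, hop_size):
    """Return (start, end) sample indices for each segment.

    segment_size and hop_size are in seconds; fs is the sampling rate in Hz.
    Assumption: the final segment is truncated at the end of the signal.
    """
    seg_len = int(round(segment_size * fs))
    hop_len = int(round(hop_size * fs))
    if seg_len <= 0 or hop_len <= 0:
        raise ValueError("segment_size and hop_size must be positive")

    bounds = []
    start = 0
    while True:
        end = min(start + seg_len, num_samples)
        bounds.append((start, end))
        if end >= num_samples:
            break
        start += hop_len
    return bounds


# 5 seconds of audio at the default 8000 Hz, 2 s segments with a 1 s hop.
bounds = segment_bounds(num_samples=40000, fs=8000, segment_size=2.0, hop_size=1.0)
```

With these values the segments are 16000 samples long and start every 8000 samples, so consecutive segments overlap by half.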

enh_inference_streaming.py¶

usage: enh_inference_streaming.py [-h] [--config CONFIG]
                                  [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                                  --output_dir OUTPUT_DIR [--ngpu NGPU]
                                  [--seed SEED]
                                  [--dtype {float16,float32,float64}]
                                  [--fs FS] [--num_workers NUM_WORKERS]
                                  --data_path_and_name_and_type
                                  DATA_PATH_AND_NAME_AND_TYPE
                                  [--key_file KEY_FILE]
                                  [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                                  [--output_format OUTPUT_FORMAT]
                                  [--train_config TRAIN_CONFIG]
                                  [--model_file MODEL_FILE]
                                  [--model_tag MODEL_TAG]
                                  [--inference_config INFERENCE_CONFIG]
                                  [--enh_s2t_task ENH_S2T_TASK]
                                  [--batch_size BATCH_SIZE]
                                  [--ref_channel REF_CHANNEL]

Frontend inference

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --fs FS               Sampling rate (default: 8000)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

Output data related:
  --output_format OUTPUT_FORMAT
                        Output format for the separated speech (default: wav)

The model configuration related:
  --train_config TRAIN_CONFIG
                        Training configuration file (default: None)
  --model_file MODEL_FILE
                        Model parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If this option is specified,
                        train_config and model_file will be overwritten
                        (default: None)
  --inference_config INFERENCE_CONFIG
                        Optional configuration file for overwriting enh model
                        attributes during inference (default: None)
  --enh_s2t_task ENH_S2T_TASK
                        enhancement and asr joint model (default: False)

Data loading related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)

SeparateSpeech related:
  --ref_channel REF_CHANNEL
                        If not None, this will overwrite the ref_channel
                        defined in the separator module (for multi-channel
                        speech processing) (default: None)

enh_s2t_train.py¶

usage: enh_s2t_train.py [-h] [--config CONFIG] [--print_config]
                        [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                        [--drop_last_iter DROP_LAST_ITER] [--dry_run DRY_RUN]
                        [--iterator_type {sequence,category,chunk,task,none}]
                        [--valid_iterator_type {sequence,category,chunk,task,none}]
                        [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                        [--num_workers NUM_WORKERS]
                        [--num_att_plot NUM_ATT_PLOT]
                        [--dist_backend DIST_BACKEND]
                        [--dist_init_method DIST_INIT_METHOD]
                        [--dist_world_size DIST_WORLD_SIZE]
                        [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                        [--dist_master_addr DIST_MASTER_ADDR]
                        [--dist_master_port DIST_MASTER_PORT]
                        [--dist_launcher {slurm,mpi,None}]
                        [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                        [--unused_parameters UNUSED_PARAMETERS]
                        [--sharded_ddp SHARDED_DDP]
                        [--cudnn_enabled CUDNN_ENABLED]
                        [--cudnn_benchmark CUDNN_BENCHMARK]
                        [--cudnn_deterministic CUDNN_DETERMINISTIC]
                        [--collect_stats COLLECT_STATS]
                        [--write_collected_feats WRITE_COLLECTED_FEATS]
                        [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                        [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                        [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                        [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                        [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                        [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                        [--grad_clip GRAD_CLIP]
                        [--grad_clip_type GRAD_CLIP_TYPE]
                        [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                        [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                        [--train_dtype {float16,float32,float64}]
                        [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                        [--use_matplotlib USE_MATPLOTLIB]
                        [--use_tensorboard USE_TENSORBOARD]
                        [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                        [--use_wandb USE_WANDB]
                        [--wandb_project WANDB_PROJECT] [--wandb_id WANDB_ID]
                        [--wandb_entity WANDB_ENTITY]
                        [--wandb_name WANDB_NAME]
                        [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                        [--detect_anomaly DETECT_ANOMALY]
                        [--use_adapter USE_ADAPTER] [--adapter {lora,houlsby}]
                        [--save_strategy {all,adapter_only,required_grad_only}]
                        [--adapter_conf ADAPTER_CONF]
                        [--pretrain_path PRETRAIN_PATH]
                        [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                        [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                        [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                        [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                        [--batch_size BATCH_SIZE]
                        [--valid_batch_size VALID_BATCH_SIZE]
                        [--batch_bins BATCH_BINS]
                        [--valid_batch_bins VALID_BATCH_BINS]
                        [--train_shape_file TRAIN_SHAPE_FILE]
                        [--valid_shape_file VALID_SHAPE_FILE]
                        [--batch_type {unsorted,sorted,folded,length,numel}]
                        [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                        [--fold_length FOLD_LENGTH]
                        [--sort_in_batch {descending,ascending}]
                        [--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
                        [--sort_batch {descending,ascending}]
                        [--multiple_iterator MULTIPLE_ITERATOR]
                        [--chunk_length CHUNK_LENGTH]
                        [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                        [--num_cache_chunks NUM_CACHE_CHUNKS]
                        [--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
                        [--chunk_default_fs CHUNK_DEFAULT_FS]
                        [--chunk_max_abs_length CHUNK_MAX_ABS_LENGTH]
                        [--chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES]
                        [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                        [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                        [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                        [--max_cache_size MAX_CACHE_SIZE]
                        [--max_cache_fd MAX_CACHE_FD]
                        [--allow_multi_rates ALLOW_MULTI_RATES]
                        [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                        [--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
                        [--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
                        [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                        [--optim_conf OPTIM_CONF]
                        [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                        [--scheduler_conf SCHEDULER_CONF]
                        [--token_list TOKEN_LIST]
                        [--src_token_list SRC_TOKEN_LIST]
                        [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                        [--input_size INPUT_SIZE] [--ctc_conf CTC_CONF]
                        [--enh_criterions ENH_CRITERIONS]
                        [--diar_num_spk DIAR_NUM_SPK]
                        [--diar_input_size DIAR_INPUT_SIZE]
                        [--enh_model_conf ENH_MODEL_CONF]
                        [--asr_model_conf ASR_MODEL_CONF]
                        [--st_model_conf ST_MODEL_CONF]
                        [--diar_model_conf DIAR_MODEL_CONF]
                        [--subtask_series {enh,asr,st,diar} [{enh,asr,st,diar} ...]]
                        [--model_conf MODEL_CONF]
                        [--use_preprocessor USE_PREPROCESSOR]
                        [--token_type {bpe,char,word,phn}]
                        [--bpemodel BPEMODEL]
                        [--src_token_type {bpe,char,word,phn}]
                        [--src_bpemodel SRC_BPEMODEL]
                        [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                        [--cleaner {None,tacotron,jaconv,vietnamese}]
                        [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                        [--text_name TEXT_NAME [TEXT_NAME ...]]
                        [--enh_encoder {stft,conv,same}]
                        [--enh_encoder_conf ENH_ENCODER_CONF]
                        [--enh_separator {asteroid,bsrnn,conformer,dan,dc_crn,dccrn,dpcl,dpcl_e2e,dprnn,dptnet,fasnet,rnn,skim,svoice,tcn,transformer,wpe_beamformer,tcn_nomask,ineube,tfgridnet,tfgridnetv2,tfgridnetv3,uses}]
                        [--enh_separator_conf ENH_SEPARATOR_CONF]
                        [--enh_decoder {stft,conv,same}]
                        [--enh_decoder_conf ENH_DECODER_CONF]
                        [--enh_mask_module {multi_mask}]
                        [--enh_mask_module_conf ENH_MASK_MODULE_CONF]
                        [--frontend {default,sliding_window,s3prl,fused,whisper}]
                        [--frontend_conf FRONTEND_CONF]
                        [--specaug {specaug,None}]
                        [--specaug_conf SPECAUG_CONF]
                        [--normalize {global_mvn,utterance_mvn,None}]
                        [--normalize_conf NORMALIZE_CONF]
                        [--asr_preencoder {sinc,linear,None}]
                        [--asr_preencoder_conf ASR_PREENCODER_CONF]
                        [--asr_encoder {conformer,transformer,transformer_multispkr,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,torchaudiohubert,longformer,branchformer,whisper,e_branchformer,avhubert,multiconv_conformer}]
                        [--asr_encoder_conf ASR_ENCODER_CONF]
                        [--asr_postencoder {hugging_face_transformers,length_adaptor,None}]
                        [--asr_postencoder_conf ASR_POSTENCODER_CONF]
                        [--asr_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,transducer,mlm,whisper,hugging_face_transformers,s4,None}]
                        [--asr_decoder_conf ASR_DECODER_CONF]
                        [--st_preencoder {sinc,linear,None}]
                        [--st_preencoder_conf ST_PREENCODER_CONF]
                        [--st_encoder {conformer,transformer,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,branchformer,e_branchformer,whisper}]
                        [--st_encoder_conf ST_ENCODER_CONF]
                        [--st_postencoder {hugging_face_transformers,length_adaptor,None}]
                        [--st_postencoder_conf ST_POSTENCODER_CONF]
                        [--st_decoder {transformer,transformer_md,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,transducer,whisper,hugging_face_transformers}]
                        [--st_decoder_conf ST_DECODER_CONF]
                        [--st_extra_asr_decoder {transformer,transformer_md,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,None}]
                        [--st_extra_asr_decoder_conf ST_EXTRA_ASR_DECODER_CONF]
                        [--st_extra_mt_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,None}]
                        [--st_extra_mt_decoder_conf ST_EXTRA_MT_DECODER_CONF]
                        [--diar_frontend {default,sliding_window,s3prl,None}]
                        [--diar_frontend_conf DIAR_FRONTEND_CONF]
                        [--diar_specaug {specaug,None}]
                        [--diar_specaug_conf DIAR_SPECAUG_CONF]
                        [--diar_normalize {global_mvn,utterance_mvn,None}]
                        [--diar_normalize_conf DIAR_NORMALIZE_CONF]
                        [--diar_encoder {conformer,transformer,rnn}]
                        [--diar_encoder_conf DIAR_ENCODER_CONF]
                        [--diar_decoder {linear}]
                        [--diar_decoder_conf DIAR_DECODER_CONF]
                        [--label_aggregator {label_aggregator}]
                        [--label_aggregator_conf LABEL_AGGREGATOR_CONF]
                        [--diar_attractor {rnn,None}]
                        [--diar_attractor_conf DIAR_ATTRACTOR_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --drop_last_iter DROP_LAST_ITER
                        Exclude the minibatch with leftovers. (default: False)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: sequence)
  --valid_iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: None)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images to plot from the attention outputs. This option is meaningful only for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel  (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
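
With --dist_init_method set to "env://", the rendezvous settings come from the four environment variables named above. The following sketch only shows how those variables could be collected and validated before launching. The function name and the returned dictionary keys are hypothetical and are not part of ESPnet or PyTorch.

```python
import os


def read_env_init(environ=None):
    """Collect the env:// rendezvous settings from environment variables.

    Hypothetical helper for illustration; pass a dict to avoid touching
    the real process environment.
    """
    env = os.environ if environ is None else environ
    names = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
    missing = [name for name in names if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "world_size": int(env["WORLD_SIZE"]),
        "rank": int(env["RANK"]),
    }


settings = read_env_init(
    {
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": "29500",
        "WORLD_SIZE": "2",
        "RANK": "0",
    }
)
```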

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Perform on "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair consisting of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging early stopping. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of the used p-norm for gradient clip. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding or training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show logs every given number of iterations in each epoch during the training phase. If None is given, the interval is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
  --use_adapter USE_ADAPTER
                        Enable efficient finetuning, see (https://arxiv.org/abs/2106.09685) for large pre-trained foundation models, like Whisper and SSL models (default: False)
  --adapter {lora,houlsby}
                        Adapter Name (default: lora)
  --save_strategy {all,adapter_only,required_grad_only}
                        The strategy to save parameters. Default: 'all'
                        'all': save all parameters
                        'adapter_only': save only adapter parameters, without other parameters like downstream model
                        'required_grad_only': save only parameters with requires_grad=True
                         (default: all)
  --adapter_conf ADAPTER_CONF
                        Configuration for efficient finetuning (default: {})

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states from the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
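
The colon-separated format described for --init_param can be illustrated with a small parser. This is only a sketch of the format as documented above, not the ESPnet2 implementation: `InitParamSpec` and `parse_init_param` are hypothetical names, and treating omitted or empty fields as "not given" is an assumption.

```python
from typing import NamedTuple, Optional


class InitParamSpec(NamedTuple):
    """Hypothetical container for the four colon-separated fields."""

    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: Optional[str]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into fields.

    Assumption: trailing fields may be omitted, and an empty field
    means "not given" (represented here as None).
    """
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"Too many ':'-separated fields in {spec!r}")
    # Pad missing trailing fields so that unpacking always sees four items.
    parts += [None] * (4 - len(parts))
    file_path, src_key, dst_key, exclude_keys = (p if p else None for p in parts)
    if file_path is None:
        raise ValueError("file_path must not be empty")
    return InitParamSpec(file_path, src_key, dst_key, exclude_keys)


if __name__ == "__main__":
    print(parse_init_param("some/where/model.pth"))
    print(parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed"))
```

Note that a plain split on ':' would not handle file paths that themselves contain a colon; the sketch ignores that case.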
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', 'folded', or 'catbel'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        Only the first column is used, so a 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is used, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches which have as nearly the same number of 'bins' as possible, counting by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is used, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches which have as nearly the same number of 'bins' as possible, counting by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
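
The quantities that the batch types above count can be sketched in a few lines. This is an illustration of the help text only, not the ESPnet2 samplers themselves: the function names are hypothetical, and the guard against a zero divisor in the folded formula (when L is shorter than fold_length) is an assumption added so that the literal formula stays well defined.

```python
from math import prod
from typing import Dict, List


def parse_shape_lines(lines: List[str]) -> Dict[str, List[int]]:
    """Parse 'utterance_id 1000,80' style lines into {utterance_id: [1000, 80]}."""
    shapes = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        utt_id, dims = line.split(maxsplit=1)
        shapes[utt_id] = [int(d) for d in dims.split(",")]
    return shapes


def length_bins(shape: List[int]) -> int:
    """'length' counts only the first element of the feature dimensions."""
    return shape[0]


def numel_bins(shape: List[int]) -> int:
    """'numel' counts the total number of elements of the feature."""
    return prod(shape)


def folded_batch_size(base_batch_size: int, longest: int, fold_length: int) -> int:
    """batch_size = base_batch_size // (L // fold_length), as in the help text.

    Assumption: the divisor is clamped to at least 1 to avoid division by zero.
    """
    factor = max(1, longest // fold_length)
    return base_batch_size // factor


if __name__ == "__main__":
    shapes = parse_shape_lines([
        "utterance_id_a 1000,80",
        "utterance_id_b 1453,80",
        "utterance_id_c 1241,80",
    ])
    for utt_id, shape in shapes.items():
        print(utt_id, length_bins(shape), numel_bins(shape))
    longest = max(length_bins(s) for s in shapes.values())
    print(folded_batch_size(20, longest, 500))
```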
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by sample length. To enable this, "shape_file" must have the length information. (default: descending)
  --shuffle_within_batch SHUFFLE_WITHIN_BATCH
                        Shuffle whole batches sample-wise. Normally required for classification tasks. (default: False)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
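
How --chunk_length and --chunk_shift_ratio interact can be sketched as follows. This is a minimal illustration under stated assumptions, not the ChunkIterFactory code: `parse_chunk_lengths` and `chunk_starts` are hypothetical helpers, the range '300-400' is assumed to include both bounds, and the shift is assumed to be the chunk length multiplied by the ratio, truncated to an integer.

```python
from typing import List


def parse_chunk_lengths(spec: str) -> List[int]:
    """Expand '300', '300,400,500', or '300-400' into candidate chunk lengths.

    Assumption: a range written with '-' includes both of its bounds.
    """
    candidates: List[int] = []
    for item in spec.split(","):
        bounds = [int(x) for x in item.split("-")]
        if len(bounds) == 1:
            candidates.append(bounds[0])
        elif len(bounds) == 2:
            low, high = bounds
            candidates.extend(range(low, high + 1))
        else:
            raise ValueError(f"Invalid chunk length item: {item!r}")
    return candidates


def chunk_starts(seq_length: int, chunk_length: int, shift_ratio: float) -> List[int]:
    """Return the start offsets of the chunks cut from one sequence.

    Assumption: shift = int(chunk_length * shift_ratio), at least 1.
    A sequence shorter than chunk_length yields no chunks.
    """
    shift = max(1, int(chunk_length * shift_ratio))
    return list(range(0, seq_length - chunk_length + 1, shift))


if __name__ == "__main__":
    print(parse_chunk_lengths("300,400,500"))
    print(chunk_starts(1000, 500, 0.5))  # ratio < 1: chunks overlap
    print(chunk_starts(2000, 500, 2.0))  # ratio > 1: gaps between chunks
```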
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
  --chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
                        List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
                          - exactly match one of the prefixes in `chunk_excluded_key_prefixes`
                          - have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
  --chunk_default_fs CHUNK_DEFAULT_FS
                        Default sampling rate used for the chunk length. Will be used to adaptively adjust the chunk length for data of different sampling rates. (If None, the chunk length will be fixed.) (default: None)
  --chunk_max_abs_length CHUNK_MAX_ABS_LENGTH
                        Maximum number of samples per chunk for all sampling rates (default: None)
  --chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES
                        Discard samples shorter than the minimum chunk length (default: True)

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio file formats supported by sndfile, such as wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "multi_columns_sound":
                        Enable multi-column wav.scp. The following text file can be loaded as multi-channel audio data

                           utterance_id_a a.wav a2.wav
                           utterance_id_b b.wav b2.wav
                           ...

                        "variable_columns_sound":
                        Loading variable numbers (columns) of audios in wav.scp. The following text file can be loaded as stacked audio data

                           utterance_id_a a1.wav a2.wav a3.wav
                           utterance_id_b b1.wav
                           utterance_id_c c1.wav c2.wav
                           ...

                        Note that audios of different lengths will be right-padded with np.nan to the longest audio in the sample.
                        A preprocessor must be used to remove these paddings.

                        "score":
                        Return text as is. The text contains tempo and note info.
                        For each note, 'start' 'end' 'syllable' 'midi' and 'phones' are included.

                           utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
                           utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "random_text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           hello world
                           foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate a random int-ndarray which has the shapes given in the file. The lower and upper values are given in the file type name. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader; currently supported for speaker diarization.

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
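
The 'path,name,type' triple and a few of the line formats listed above can be illustrated with a short sketch. All function names here are hypothetical and the code only mirrors the formats as documented in this help text; it is not the ESPnet2 dataset loader.

```python
import re
from typing import List, Optional, Tuple


def split_path_name_type(arg: str) -> Tuple[str, str, str]:
    """Split 'some/path/a.scp,foo,sound' into (path, name, type)."""
    parts = arg.split(",")
    if len(parts) != 3:
        raise ValueError(f"Expected 'path,name,type' but got {arg!r}")
    path, name, file_type = parts
    return path, name, file_type


def parse_text_int(line: str) -> Tuple[str, List[int]]:
    """'text_int': an utterance id followed by space-separated integers."""
    utt_id, *values = line.split()
    return utt_id, [int(v) for v in values]


def parse_csv_float(line: str) -> Tuple[str, List[float]]:
    """'csv_float': an utterance id followed by comma-separated floats."""
    utt_id, values = line.split(maxsplit=1)
    return utt_id, [float(v) for v in values.split(",")]


def rand_int_bounds(file_type: str) -> Optional[Tuple[int, int]]:
    """Extract the two numbers from a 'rand_int_<low>_<high>' type name."""
    match = re.fullmatch(r"rand_int_(\d+)_(\d+)", file_type)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    print(split_path_name_type("some/path/a.scp,foo,sound"))
    print(parse_text_int("utterance_id_A 12 0 1 3"))
    print(parse_csv_float("utterance_id_A 12.,3.1,3.4,4.4"))
    print(rand_int_bounds("rand_int_0_10"))
```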
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys in the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --allow_multi_rates ALLOW_MULTI_RATES
                        Whether to allow audios to have different sampling rates (default: False)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)

Optimizer related:
  --exclude_weight_decay EXCLUDE_WEIGHT_DECAY
                        Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
  --exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
                        The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

Task related:
  --token_list TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --src_token_list SRC_TOKEN_LIST
                        A text mapping int-id to token (for source language) (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --input_size INPUT_SIZE
                        The number of input dimension of the feature (default: None)
  --ctc_conf CTC_CONF   The keyword arguments for CTC class. (default: {'dropout_rate': 0.0, 'ctc_type': 'builtin', 'reduce': True, 'ignore_nan_grad': None, 'zero_infinity': True, 'brctc_risk_strategy': 'exp', 'brctc_group_strategy': 'end', 'brctc_risk_factor': 0.0})
  --enh_criterions ENH_CRITERIONS
                        The criteria bound to the loss wrappers. (default: [{'name': 'si_snr', 'conf': {}, 'wrapper': 'fixed_order', 'wrapper_conf': {}}])
  --diar_num_spk DIAR_NUM_SPK
                        The number of speakers (for each recording) for diar submodel class (default: None)
  --diar_input_size DIAR_INPUT_SIZE
                        The number of input dimension of the feature (default: None)
  --enh_model_conf ENH_MODEL_CONF
                        The keyword arguments for enh submodel class. (default: {'stft_consistency': False, 'loss_type': 'mask_mse', 'mask_type': None, 'flexible_numspk': False, 'extract_feats_in_collect_stats': False, 'normalize_variance': False, 'normalize_variance_per_ch': False, 'categories': [], 'category_weights': [], 'always_forward_in_48k': False})
  --asr_model_conf ASR_MODEL_CONF
                        The keyword arguments for asr submodel class. (default: {'aux_ctc': None, 'ctc_weight': 0.5, 'interctc_weight': 0.0, 'ignore_id': -1, 'lsm_weight': 0.0, 'length_normalized_loss': False, 'report_cer': True, 'report_wer': True, 'sym_space': '<space>', 'sym_blank': '<blank>', 'transducer_multi_blank_durations': [], 'transducer_multi_blank_sigma': 0.05, 'sym_sos': '<sos/eos>', 'sym_eos': '<sos/eos>', 'extract_feats_in_collect_stats': True, 'lang_token_id': -1})
  --st_model_conf ST_MODEL_CONF
                        The keyword arguments for st submodel class. (default: {'stft_consistency': False, 'loss_type': 'mask_mse', 'mask_type': None, 'flexible_numspk': False, 'extract_feats_in_collect_stats': False, 'normalize_variance': False, 'normalize_variance_per_ch': False, 'categories': [], 'category_weights': [], 'always_forward_in_48k': False})
  --diar_model_conf DIAR_MODEL_CONF
                        The keyword arguments for diar submodel class. (default: {'diar_weight': 1.0, 'attractor_weight': 1.0})
  --subtask_series {enh,asr,st,diar} [{enh,asr,st,diar} ...]
                        The series of subtasks in the pipeline. (default: ('enh', 'asr'))
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'calc_enh_loss': True, 'bypass_enh_prob': 0})

Preprocess related:
  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: False)
  --token_type {bpe,char,word,phn}
                        The text will be tokenized at the specified token level (default: bpe)
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --src_token_type {bpe,char,word,phn}
                        The source text will be tokenized at the specified token level (default: bpe)
  --src_bpemodel SRC_BPEMODEL
                        The model file of sentencepiece (for source language) (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)
  --text_name TEXT_NAME [TEXT_NAME ...]
                        Specify the text_name attribute used in the preprocessor (default: ['text'])
  --enh_encoder {stft,conv,same}
                        The enh_encoder type (default: stft)
  --enh_encoder_conf ENH_ENCODER_CONF
                        The keyword arguments for enh_encoder (default: {})
  --enh_separator {asteroid,bsrnn,conformer,dan,dc_crn,dccrn,dpcl,dpcl_e2e,dprnn,dptnet,fasnet,rnn,skim,svoice,tcn,transformer,wpe_beamformer,tcn_nomask,ineube,tfgridnet,tfgridnetv2,tfgridnetv3,uses}
                        The enh_separator type (default: rnn)
  --enh_separator_conf ENH_SEPARATOR_CONF
                        The keyword arguments for enh_separator (default: {})
  --enh_decoder {stft,conv,same}
                        The enh_decoder type (default: stft)
  --enh_decoder_conf ENH_DECODER_CONF
                        The keyword arguments for enh_decoder (default: {})
  --enh_mask_module {multi_mask}
                        The enh_mask_module type (default: multi_mask)
  --enh_mask_module_conf ENH_MASK_MODULE_CONF
                        The keyword arguments for enh_mask_module (default: {})
  --frontend {default,sliding_window,s3prl,fused,whisper}
                        The frontend type (default: default)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: utterance_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --asr_preencoder {sinc,linear,None}
                        The asr_preencoder type (default: None)
  --asr_preencoder_conf ASR_PREENCODER_CONF
                        The keyword arguments for asr_preencoder (default: {})
  --asr_encoder {conformer,transformer,transformer_multispkr,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,torchaudiohubert,longformer,branchformer,whisper,e_branchformer,avhubert,multiconv_conformer}
                        The asr_encoder type (default: rnn)
  --asr_encoder_conf ASR_ENCODER_CONF
                        The keyword arguments for asr_encoder (default: {})
  --asr_postencoder {hugging_face_transformers,length_adaptor,None}
                        The asr_postencoder type (default: None)
  --asr_postencoder_conf ASR_POSTENCODER_CONF
                        The keyword arguments for asr_postencoder (default: {})
  --asr_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,transducer,mlm,whisper,hugging_face_transformers,s4,None}
                        The asr_decoder type (default: None)
  --asr_decoder_conf ASR_DECODER_CONF
                        The keyword arguments for asr_decoder (default: {})
  --st_preencoder {sinc,linear,None}
                        The st_preencoder type (default: None)
  --st_preencoder_conf ST_PREENCODER_CONF
                        The keyword arguments for st_preencoder (default: {})
  --st_encoder {conformer,transformer,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,branchformer,e_branchformer,whisper}
                        The st_encoder type (default: rnn)
  --st_encoder_conf ST_ENCODER_CONF
                        The keyword arguments for st_encoder (default: {})
  --st_postencoder {hugging_face_transformers,length_adaptor,None}
                        The st_postencoder type (default: None)
  --st_postencoder_conf ST_POSTENCODER_CONF
                        The keyword arguments for st_postencoder (default: {})
  --st_decoder {transformer,transformer_md,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,transducer,whisper,hugging_face_transformers}
                        The st_decoder type (default: rnn)
  --st_decoder_conf ST_DECODER_CONF
                        The keyword arguments for st_decoder (default: {})
  --st_extra_asr_decoder {transformer,transformer_md,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,None}
                        The st_extra_asr_decoder type (default: None)
  --st_extra_asr_decoder_conf ST_EXTRA_ASR_DECODER_CONF
                        The keyword arguments for st_extra_asr_decoder (default: {})
  --st_extra_mt_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,None}
                        The st_extra_mt_decoder type (default: None)
  --st_extra_mt_decoder_conf ST_EXTRA_MT_DECODER_CONF
                        The keyword arguments for st_extra_mt_decoder (default: {})
  --diar_frontend {default,sliding_window,s3prl,None}
                        The diar_frontend type (default: default)
  --diar_frontend_conf DIAR_FRONTEND_CONF
                        The keyword arguments for diar_frontend (default: {})
  --diar_specaug {specaug,None}
                        The diar_specaug type (default: None)
  --diar_specaug_conf DIAR_SPECAUG_CONF
                        The keyword arguments for diar_specaug (default: {})
  --diar_normalize {global_mvn,utterance_mvn,None}
                        The diar_normalize type (default: utterance_mvn)
  --diar_normalize_conf DIAR_NORMALIZE_CONF
                        The keyword arguments for diar_normalize (default: {})
  --diar_encoder {conformer,transformer,rnn}
                        The diar_encoder type (default: transformer)
  --diar_encoder_conf DIAR_ENCODER_CONF
                        The keyword arguments for diar_encoder (default: {})
  --diar_decoder {linear}
                        The diar_decoder type (default: linear)
  --diar_decoder_conf DIAR_DECODER_CONF
                        The keyword arguments for diar_decoder (default: {})
  --label_aggregator {label_aggregator}
                        The label_aggregator type (default: label_aggregator)
  --label_aggregator_conf LABEL_AGGREGATOR_CONF
                        The keyword arguments for label_aggregator (default: {})
  --diar_attractor {rnn,None}
                        The diar_attractor type (default: None)
  --diar_attractor_conf DIAR_ATTRACTOR_CONF
                        The keyword arguments for diar_attractor (default: {})

enh_scoring.py¶

usage: enh_scoring.py [-h] [--config CONFIG]
                      [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                      --output_dir OUTPUT_DIR
                      [--dtype {float16,float32,float64}] --ref_scp REF_SCP
                      --inf_scp INF_SCP [--key_file KEY_FILE]
                      [--ref_channel REF_CHANNEL]
                      [--flexible_numspk FLEXIBLE_NUMSPK] [--is_tse IS_TSE]
                      [--use_dnsmos USE_DNSMOS] [--dnsmos_mode {local,web}]
                      [--dnsmos_auth_key DNSMOS_AUTH_KEY]
                      [--dnsmos_use_gpu DNSMOS_USE_GPU]
                      [--dnsmos_convert_to_torch DNSMOS_CONVERT_TO_TORCH]
                      [--dnsmos_primary_model DNSMOS_PRIMARY_MODEL]
                      [--dnsmos_p808_model DNSMOS_P808_MODEL]
                      [--use_pesq USE_PESQ]

Frontend inference

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --dtype {float16,float32,float64}
                        Data type (default: float32)

Input data related:
  --ref_scp REF_SCP
  --inf_scp INF_SCP
  --key_file KEY_FILE
  --ref_channel REF_CHANNEL
  --flexible_numspk FLEXIBLE_NUMSPK
  --is_tse IS_TSE

DNSMOS related:
  --use_dnsmos USE_DNSMOS
  --dnsmos_mode {local,web}
                        Use local DNSMOS model or web API for DNSMOS
                        calculation (default: local)
  --dnsmos_auth_key DNSMOS_AUTH_KEY
                        Required if dnsmos_mode='web' (default: )
  --dnsmos_use_gpu DNSMOS_USE_GPU
                        Used when dnsmos_mode='local' (default: False)
  --dnsmos_convert_to_torch DNSMOS_CONVERT_TO_TORCH
                        Used when dnsmos_mode='local' (default: False)
  --dnsmos_primary_model DNSMOS_PRIMARY_MODEL
                        Path to the primary DNSMOS model. Required if
                        dnsmos_mode='local' (default:
                        ./DNSMOS/sig_bak_ovr.onnx)
  --dnsmos_p808_model DNSMOS_P808_MODEL
                        Path to the p808 model. Required if
                        dnsmos_mode='local' (default: ./DNSMOS/model_v8.onnx)

PESQ related:
  --use_pesq USE_PESQ   Before setting this to True, please make sure that you
                        or your institution has the license (check
                        https://www.itu.int/rec/T-REC-P.862-200511-I!Amd2/en)
                        to report PESQ (default: False)

enh_train.py¶

usage: enh_train.py [-h] [--config CONFIG] [--print_config]
                    [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                    [--drop_last_iter DROP_LAST_ITER] [--dry_run DRY_RUN]
                    [--iterator_type {sequence,category,chunk,task,none}]
                    [--valid_iterator_type {sequence,category,chunk,task,none}]
                    [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                    [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                    [--dist_backend DIST_BACKEND]
                    [--dist_init_method DIST_INIT_METHOD]
                    [--dist_world_size DIST_WORLD_SIZE]
                    [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                    [--dist_master_addr DIST_MASTER_ADDR]
                    [--dist_master_port DIST_MASTER_PORT]
                    [--dist_launcher {slurm,mpi,None}]
                    [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                    [--unused_parameters UNUSED_PARAMETERS]
                    [--sharded_ddp SHARDED_DDP]
                    [--cudnn_enabled CUDNN_ENABLED]
                    [--cudnn_benchmark CUDNN_BENCHMARK]
                    [--cudnn_deterministic CUDNN_DETERMINISTIC]
                    [--collect_stats COLLECT_STATS]
                    [--write_collected_feats WRITE_COLLECTED_FEATS]
                    [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                    [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                    [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                    [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                    [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                    [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                    [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                    [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                    [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                    [--train_dtype {float16,float32,float64}]
                    [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                    [--use_matplotlib USE_MATPLOTLIB]
                    [--use_tensorboard USE_TENSORBOARD]
                    [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                    [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                    [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                    [--wandb_name WANDB_NAME]
                    [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                    [--detect_anomaly DETECT_ANOMALY]
                    [--use_adapter USE_ADAPTER] [--adapter {lora,houlsby}]
                    [--save_strategy {all,adapter_only,required_grad_only}]
                    [--adapter_conf ADAPTER_CONF]
                    [--pretrain_path PRETRAIN_PATH]
                    [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                    [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                    [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                    [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                    [--batch_size BATCH_SIZE]
                    [--valid_batch_size VALID_BATCH_SIZE]
                    [--batch_bins BATCH_BINS]
                    [--valid_batch_bins VALID_BATCH_BINS]
                    [--train_shape_file TRAIN_SHAPE_FILE]
                    [--valid_shape_file VALID_SHAPE_FILE]
                    [--batch_type {unsorted,sorted,folded,length,numel}]
                    [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                    [--fold_length FOLD_LENGTH]
                    [--sort_in_batch {descending,ascending}]
                    [--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
                    [--sort_batch {descending,ascending}]
                    [--multiple_iterator MULTIPLE_ITERATOR]
                    [--chunk_length CHUNK_LENGTH]
                    [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                    [--num_cache_chunks NUM_CACHE_CHUNKS]
                    [--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
                    [--chunk_default_fs CHUNK_DEFAULT_FS]
                    [--chunk_max_abs_length CHUNK_MAX_ABS_LENGTH]
                    [--chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES]
                    [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                    [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                    [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                    [--max_cache_size MAX_CACHE_SIZE]
                    [--max_cache_fd MAX_CACHE_FD]
                    [--allow_multi_rates ALLOW_MULTI_RATES]
                    [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                    [--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
                    [--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
                    [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                    [--optim_conf OPTIM_CONF]
                    [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                    [--scheduler_conf SCHEDULER_CONF]
                    [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                    [--model_conf MODEL_CONF] [--criterions CRITERIONS]
                    [--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
                    [--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
                    [--noise_scp NOISE_SCP]
                    [--noise_apply_prob NOISE_APPLY_PROB]
                    [--noise_db_range NOISE_DB_RANGE]
                    [--short_noise_thres SHORT_NOISE_THRES]
                    [--use_reverberant_ref USE_REVERBERANT_REF]
                    [--num_spk NUM_SPK] [--num_noise_type NUM_NOISE_TYPE]
                    [--sample_rate SAMPLE_RATE]
                    [--force_single_channel FORCE_SINGLE_CHANNEL]
                    [--channel_reordering CHANNEL_REORDERING]
                    [--categories CATEGORIES [CATEGORIES ...]]
                    [--speech_segment SPEECH_SEGMENT]
                    [--avoid_allzero_segment AVOID_ALLZERO_SEGMENT]
                    [--flexible_numspk FLEXIBLE_NUMSPK]
                    [--dynamic_mixing DYNAMIC_MIXING] [--utt2spk UTT2SPK]
                    [--dynamic_mixing_gain_db DYNAMIC_MIXING_GAIN_DB]
                    [--encoder {stft,conv,same}] [--encoder_conf ENCODER_CONF]
                    [--separator {asteroid,bsrnn,conformer,dan,dc_crn,dccrn,dpcl,dpcl_e2e,dprnn,dptnet,fasnet,rnn,skim,svoice,tcn,transformer,wpe_beamformer,tcn_nomask,ineube,tfgridnet,tfgridnetv2,tfgridnetv3,uses}]
                    [--separator_conf SEPARATOR_CONF]
                    [--decoder {stft,conv,same}] [--decoder_conf DECODER_CONF]
                    [--mask_module {multi_mask}]
                    [--mask_module_conf MASK_MODULE_CONF]
                    [--preprocessor {dynamic_mixing,enh,None}]
                    [--preprocessor_conf PREPROCESSOR_CONF]
                    [--diffusion_model {sgmse,None}]
                    [--diffusion_model_conf DIFFUSION_MODEL_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --drop_last_iter DROP_LAST_ITER
                        Exclude the minibatch with leftovers. (default: False)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: sequence)
  --valid_iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: None)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images used to plot the attention outputs. This option only makes sense for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to enable find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
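
The --dist_init_method help above says that with "env://" the values of MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK are read from the environment. The following is a minimal sketch of that lookup, not ESPnet code; the function name read_env_init and the returned dictionary keys are hypothetical and chosen only for illustration.

```python
import os
from typing import Dict, Mapping, Union

# The four variables named in the --dist_init_method help text.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def read_env_init(env: Mapping[str, str] = os.environ) -> Dict[str, Union[str, int]]:
    """Collect the env:// rendezvous settings from an environment mapping.

    Hypothetical helper: it only validates and converts the values.
    """
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "world_size": int(env["WORLD_SIZE"]),
        "rank": int(env["RANK"]),
    }
```

Passing a plain dictionary instead of os.environ makes the helper easy to check without touching the real environment.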

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair consisting of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging early stopping. Give a triple consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give a triple consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        Whether to inject noise into gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding or training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, the interval is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
  --use_adapter USE_ADAPTER
                        Enable efficient finetuning (see https://arxiv.org/abs/2106.09685) for large pre-trained foundation models, like Whisper and SSL models (default: False)
  --adapter {lora,houlsby}
                        Adapter Name (default: lora)
  --save_strategy {all,adapter_only,required_grad_only}
                        The strategy to save parameters. Default: 'all'
                        'all': save all parameters
                        'adapter_only': save only adapter parameters, without other parameters like downstream model
                        'required_grad_only': save only parameters with requires_grad=True
                         (default: all)
  --adapter_conf ADAPTER_CONF
                        Configuration for efficient finetuning (default: {})
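
To illustrate how --patience interacts with the mode part of --early_stopping_criterion, here is a minimal sketch that replays a list of per-epoch criterion values. The function name epoch_of_early_stop is hypothetical, and the strict comparison (stop once more than `patience` epochs have passed since the best one) is an assumption; the trainer's exact off-by-one behaviour may differ.

```python
from typing import List, Optional


def epoch_of_early_stop(values: List[float], patience: Optional[int], mode: str = "min") -> int:
    """Return the 1-based epoch at which training would stop.

    values:   one criterion value per epoch, e.g. the validation loss.
    patience: epochs to wait without improvement; None disables early stopping.
    mode:     "min" if smaller is better, "max" if larger is better.
    """
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    best: Optional[float] = None
    epochs_since_best = 0
    for epoch, value in enumerate(values, start=1):
        if best is None:
            improved = True
        elif mode == "min":
            improved = value < best
        else:
            improved = value > best
        if improved:
            best, epochs_since_best = value, 0
        else:
            epochs_since_best += 1
        # Assumption: stop only when strictly more than `patience` epochs passed.
        if patience is not None and epochs_since_best > patience:
            return epoch
    return len(values)
```

With validation losses [1.0, 0.9, 0.95, 0.97, 0.99, 0.8] and a patience of 2, this sketch stops at epoch 5, so the later improvement at epoch 6 is never reached; with patience left at None all 6 epochs run.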

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
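
The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format accepted by --init_param can be split as in the sketch below. This is an illustration of the documented format rather than ESPnet's loader: the names InitParamSpec and parse_init_param are hypothetical, and treating exclude_keys as a comma-separated list is an assumption.

```python
from typing import List, NamedTuple, Optional


class InitParamSpec(NamedTuple):
    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: List[str]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split an --init_param value into its four documented fields.

    Missing trailing fields are treated as "not given".
    """
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"expected at most 4 ':'-separated fields, got {len(parts)}")
    parts += [""] * (4 - len(parts))  # pad omitted fields with empty strings
    file_path, src_key, dst_key, excludes = parts
    return InitParamSpec(
        file_path=file_path,
        src_key=src_key or None,
        dst_key=dst_key or None,
        # Assumption: several keys to exclude are separated by commas.
        exclude_keys=[key for key in excludes.split(",") if key],
    )
```

For the three examples in the help text, only the last one yields a non-empty exclude_keys list, namely ['decoder.embed'].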

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', 'folded', or 'catbel'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        Only the first column is used, so a 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in each mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L refers to the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that contain, as far as possible, the same number of 'bins', counted by the total length of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches that contain, as far as possible, the same number of 'bins', but counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --shuffle_within_batch SHUFFLE_WITHIN_BATCH
                        Shuffle whole batches sample-wise. Normally required for classification tasks. (default: False)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
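
The "folded" rule above, batch_size = base_batch_size // (L // fold_length), can be tried out with the small sketch below. Taken literally, the formula divides by zero whenever L is shorter than fold_length, so this sketch adds 1 to the ratio and enforces a lower bound; both choices, as well as the names folded_batch_size and min_batch_size, are assumptions made for illustration and not a statement about FoldedBatchSampler's code.

```python
def folded_batch_size(
    base_batch_size: int,
    max_length: int,
    fold_length: int,
    min_batch_size: int = 1,
) -> int:
    """Shrink the batch size as the longest sample in the batch grows.

    max_length corresponds to L in the help text.
    """
    if fold_length <= 0:
        raise ValueError("fold_length must be positive")
    # How many times the longest sample "folds" over fold_length.
    factor = max_length // fold_length
    # Assumption: 1 + factor avoids a zero divisor for short samples.
    return max(min_batch_size, base_batch_size // (1 + factor))
```

With base_batch_size=20 and fold_length=80, a longest sample of 50 frames keeps 20 samples per batch, 100 frames gives 10, and 2000 frames falls back to the minimum of 1.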

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it's less than 1, chunks overlap, and if it's bigger than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
  --chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
                        List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
                          - exactly match one of the prefixes in `chunk_excluded_key_prefixes`
                          - have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
  --chunk_default_fs CHUNK_DEFAULT_FS
                        Default sampling rate used for the chunk length. Will be used to adaptively adjust the chunk length for data of different sampling rates. (If None, the chunk length will be fixed.) (default: None)
  --chunk_max_abs_length CHUNK_MAX_ABS_LENGTH
                        Maximum number of samples per chunk for all sampling rates (default: None)
  --chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES
                        Discard samples shorter than the minimum chunk length (default: True)
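
The --chunk_length and --chunk_shift_ratio descriptions above can be made concrete with the following sketch. It follows the help text only: the function names parse_chunk_length and chunk_starts are hypothetical, and treating '300-400' as an inclusive integer range and rounding the shift down to an integer are assumptions.

```python
from typing import List


def parse_chunk_length(spec: str) -> List[int]:
    """Expand a spec such as '300', '300,400,500', or '300-400' into candidates."""
    candidates: List[int] = []
    for item in spec.split(","):
        bounds = [int(x) for x in item.split("-")]
        if len(bounds) == 1:
            candidates.append(bounds[0])
        elif len(bounds) == 2:
            low, high = bounds
            # Assumption: both ends of the range are included.
            candidates.extend(range(low, high + 1))
        else:
            raise ValueError(f"invalid chunk length item: {item!r}")
    return candidates


def chunk_starts(seq_length: int, chunk_length: int, shift_ratio: float) -> List[int]:
    """Return the start offsets of the chunks cut from one sequence."""
    if seq_length < chunk_length:
        return []  # too short for this chunk length
    shift = max(1, int(chunk_length * shift_ratio))
    num_chunks = (seq_length - chunk_length) // shift + 1
    return [i * shift for i in range(num_chunks)]
```

For a sequence of 1000 frames and a chunk length of 500, a shift ratio of 0.5 gives the overlapping starts [0, 250, 500]. With a chunk length of 400 and a ratio of 1.5 the starts are [0, 600], which leaves frames 400 to 599 uncovered.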

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio formats supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "multi_columns_sound":
                        Enable multi-column wav.scp. The following text file can be loaded as multi-channel audio data

                           utterance_id_a a.wav a2.wav
                           utterance_id_b b.wav b2.wav
                           ...

                        "variable_columns_sound":
                        Loading variable numbers (columns) of audios in wav.scp. The following text file can be loaded as stacked audio data

                           utterance_id_a a1.wav a2.wav a3.wav
                           utterance_id_b b1.wav
                           utterance_id_c c1.wav c2.wav
                           ...

                        Note that audios of different lengths will be right-padded with np.nan to the longest audio in the sample.
                        A preprocessor must be used to remove these paddings.

                        "score":
                        Return text as is. The text contains tempo and note info.
                        For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.

                           utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
                           utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "random_text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           hello world
                           foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader, currently supported for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys in the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to keep open for ark files. This feature is only valid when the data type is 'kaldi_ark'. (default: 32)
  --allow_multi_rates ALLOW_MULTI_RATES
                        Whether to allow audios to have different sampling rates (default: False)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
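
As a rough sketch of the 'path,name,type' triple taken by --train_data_path_and_name_and_type, and of the "text_int" layout listed among the file types, the helpers below split the option value and parse a few lines. They are not ESPnet's readers; the names parse_path_name_type and read_text_int are hypothetical, and rejecting duplicate utterance ids is an assumption.

```python
from typing import Dict, Iterable, List, Tuple


def parse_path_name_type(value: str) -> Tuple[str, str, str]:
    """Split 'some/path/a.scp,foo,sound' into (path, name, type)."""
    parts = value.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected 'path,name,type', got {value!r}")
    path, name, file_type = parts
    return path, name, file_type


def read_text_int(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Parse 'text_int' lines: an utterance id followed by space-separated integers."""
    table: Dict[str, List[int]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, *values = line.split()
        if key in table:
            # Assumption: an utterance id must not appear twice.
            raise ValueError(f"duplicated utterance id: {key}")
        table[key] = [int(v) for v in values]
    return table
```

For example, parse_path_name_type('some/path/a.scp,foo,sound') yields the path 'some/path/a.scp', the mini-batch key 'foo', and the file type 'sound'.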

Optimizer related:
  --exclude_weight_decay EXCLUDE_WEIGHT_DECAY
                        Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
  --exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
                        The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'stft_consistency': False, 'loss_type': 'mask_mse', 'mask_type': None, 'flexible_numspk': False, 'extract_feats_in_collect_stats': False, 'normalize_variance': False, 'normalize_variance_per_ch': False, 'categories': [], 'category_weights': [], 'always_forward_in_48k': False})
  --criterions CRITERIONS
                        The criterions bound to the loss wrappers. (default: [{'name': 'si_snr', 'conf': {}, 'wrapper': 'fixed_order', 'wrapper_conf': {}}])

  Preprocess related

  --speech_volume_normalize SPEECH_VOLUME_NORMALIZE
                        Scale the maximum amplitude to the given value or range. e.g. --speech_volume_normalize 1.0 scales it to 1.0.
                        --speech_volume_normalize 0.5_1.0 scales it to a random number in the range [0.5, 1.0) (default: None)
  --rir_scp RIR_SCP     The file path of rir scp file. (default: None)
  --rir_apply_prob RIR_APPLY_PROB
                        The probability of applying RIR convolution. (default: 1.0)
  --noise_scp NOISE_SCP
                        The file path of noise scp file. (default: None)
  --noise_apply_prob NOISE_APPLY_PROB
                        The probability of applying noise addition. (default: 1.0)
  --noise_db_range NOISE_DB_RANGE
                        The range of signal-to-noise ratio (SNR) level in decibels. (default: 13_15)
  --short_noise_thres SHORT_NOISE_THRES
                        If len(noise) / len(speech) is smaller than this threshold during dynamic mixing, a warning will be displayed. (default: 0.5)
  --use_reverberant_ref USE_REVERBERANT_REF
                        Whether to use reverberant speech references instead of anechoic ones (default: False)
  --num_spk NUM_SPK     Number of speakers in the input signal. (default: 1)
  --num_noise_type NUM_NOISE_TYPE
                        Number of noise types. (default: 1)
  --sample_rate SAMPLE_RATE
                        Sampling rate of the data (in Hz). (default: 8000)
  --force_single_channel FORCE_SINGLE_CHANNEL
                        Whether to force all data to be single-channel. (default: False)
  --channel_reordering CHANNEL_REORDERING
                        Whether to randomly reorder the channels of the multi-channel signals. (default: False)
  --categories CATEGORIES [CATEGORIES ...]
                        The set of all possible categories in the dataset. Used to add the category information to each sample (default: [])
  --speech_segment SPEECH_SEGMENT
                        Truncate the audios to the specified length (in samples) if not None (default: None)
  --avoid_allzero_segment AVOID_ALLZERO_SEGMENT
                        Only used when --speech_segment is specified. If True, make sure all truncated segments are not all-zero (default: True)
  --flexible_numspk FLEXIBLE_NUMSPK
                        Whether to load variable numbers of speakers in each sample. In this case, only the first-speaker files such as 'spk1.scp' and 'dereverb1.scp' are used, which are expected to have multiple columns. Other numbered files such as 'spk2.scp' and 'dereverb2.scp' are ignored. (default: False)
  --dynamic_mixing DYNAMIC_MIXING
                        Apply dynamic mixing (default: False)
  --utt2spk UTT2SPK     The file path of utt2spk file. Only used in dynamic_mixing mode. (default: None)
  --dynamic_mixing_gain_db DYNAMIC_MIXING_GAIN_DB
                        Random gain (in dB) for dynamic mixing sources (default: 0.0)
  --encoder {stft,conv,same}
                        The encoder type (default: stft)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --separator {asteroid,bsrnn,conformer,dan,dc_crn,dccrn,dpcl,dpcl_e2e,dprnn,dptnet,fasnet,rnn,skim,svoice,tcn,transformer,wpe_beamformer,tcn_nomask,ineube,tfgridnet,tfgridnetv2,tfgridnetv3,uses}
                        The separator type (default: rnn)
  --separator_conf SEPARATOR_CONF
                        The keyword arguments for separator (default: {})
  --decoder {stft,conv,same}
                        The decoder type (default: stft)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})
  --mask_module {multi_mask}
                        The mask_module type (default: multi_mask)
  --mask_module_conf MASK_MODULE_CONF
                        The keyword arguments for mask_module (default: {})
  --preprocessor {dynamic_mixing,enh,None}
                        The preprocessor type (default: None)
  --preprocessor_conf PREPROCESSOR_CONF
                        The keyword arguments for preprocessor (default: {})
  --diffusion_model {sgmse,None}
                        The diffusion_model type (default: None)
  --diffusion_model_conf DIFFUSION_MODEL_CONF
                        The keyword arguments for diffusion_model (default: {})
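
Several of the options above (for example --speech_volume_normalize 0.5_1.0 and --noise_db_range 13_15) take a single number or a "low_high" range joined by an underscore. The following is a minimal sketch of how such a string could be interpreted. The helper names parse_range and sample_from_spec are hypothetical and are not part of ESPnet; the real preprocessor may parse and sample differently.

```python
import random
from typing import Tuple


def parse_range(spec: str) -> Tuple[float, float]:
    """Parse 'a' or 'a_b' into a (low, high) pair of floats (hypothetical helper)."""
    parts = spec.split("_")
    if len(parts) == 1:
        value = float(parts[0])
        return value, value
    if len(parts) == 2:
        low, high = float(parts[0]), float(parts[1])
        if low > high:
            raise ValueError(f"low must not exceed high: {spec!r}")
        return low, high
    raise ValueError(f"expected 'a' or 'a_b', got {spec!r}")


def sample_from_spec(spec: str, rng: random.Random) -> float:
    """Return the fixed value, or draw uniformly from the given range."""
    low, high = parse_range(spec)
    return low if low == high else rng.uniform(low, high)


if __name__ == "__main__":
    rng = random.Random(0)
    print(parse_range("13_15"))            # SNR range in dB
    print(sample_from_spec("1.0", rng))    # fixed scale
    print(sample_from_spec("0.5_1.0", rng))
```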

enh_tse_inference.py¶

usage: enh_tse_inference.py [-h] [--config CONFIG]
                            [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                            --output_dir OUTPUT_DIR [--ngpu NGPU]
                            [--seed SEED] [--dtype {float16,float32,float64}]
                            [--fs FS] [--num_workers NUM_WORKERS]
                            --data_path_and_name_and_type
                            DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                            [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                            [--normalize_output_wav NORMALIZE_OUTPUT_WAV]
                            [--output_format OUTPUT_FORMAT]
                            [--train_config TRAIN_CONFIG]
                            [--model_file MODEL_FILE] [--model_tag MODEL_TAG]
                            [--inference_config INFERENCE_CONFIG]
                            [--batch_size BATCH_SIZE]
                            [--segment_size SEGMENT_SIZE]
                            [--hop_size HOP_SIZE]
                            [--normalize_segment_scale NORMALIZE_SEGMENT_SCALE]
                            [--show_progressbar SHOW_PROGRESSBAR]
                            [--ref_channel REF_CHANNEL]

Frontend inference

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --fs FS               Sampling rate (default: 8000)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

Output data related:
  --normalize_output_wav NORMALIZE_OUTPUT_WAV
                        Whether to normalize the predicted wav to [-1~1]
                        (default: False)
  --output_format OUTPUT_FORMAT
                        Output format for the separated speech (default: wav)

The model configuration related:
  --train_config TRAIN_CONFIG
                        Training configuration file (default: None)
  --model_file MODEL_FILE
                        Model parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If this option is specified,
                        train_config and model_file will be overwritten
                        (default: None)
  --inference_config INFERENCE_CONFIG
                        Optional configuration file for overwriting enh model
                        attributes during inference (default: None)

Data loading related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)

SeparateSpeech related:
  --segment_size SEGMENT_SIZE
                        Segment length in seconds for segment-wise speech
                        enhancement/separation (default: None)
  --hop_size HOP_SIZE   Hop length in seconds for segment-wise speech
                        enhancement/separation (default: None)
  --normalize_segment_scale NORMALIZE_SEGMENT_SCALE
                        Whether to normalize the energy of the separated
                        streams in each segment (default: False)
  --show_progressbar SHOW_PROGRESSBAR
                        Whether to show a progress bar when performing
                        segment-wise speech enhancement/separation (default:
                        False)
  --ref_channel REF_CHANNEL
                        If not None, this will overwrite the ref_channel
                        defined in the extractor module (for multi-channel
                        speech processing) (default: None)
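
The --segment_size and --hop_size options above are given in seconds, so they have to be converted to sample counts using the sampling rate (--fs) before a long recording can be cut into windows. The sketch below illustrates one way to compute such windows. The function segment_bounds is hypothetical, and the actual windowing and overlap handling inside the tool may differ.

```python
from typing import List, Tuple


def segment_bounds(
    num_samples: int, fs: int, segment_size: float, hop_size: float
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of windows covering the signal.

    segment_size and hop_size are in seconds, fs is in Hz.
    The last window is truncated at the end of the signal.
    """
    seg = int(segment_size * fs)
    hop = int(hop_size * fs)
    if seg <= 0 or hop <= 0:
        raise ValueError("segment_size and hop_size must be positive")

    bounds = []
    start = 0
    while True:
        end = min(start + seg, num_samples)
        bounds.append((start, end))
        if end >= num_samples:
            break
        start += hop
    return bounds


if __name__ == "__main__":
    # 2.5 s of audio at 8 kHz, 1 s windows with a 0.5 s hop
    for start, end in segment_bounds(20000, 8000, 1.0, 0.5):
        print(start, end)
```

With a hop smaller than the segment size, consecutive windows overlap, so the per-window outputs would still need to be stitched back together.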

enh_tse_train.py¶

usage: enh_tse_train.py [-h] [--config CONFIG] [--print_config]
                        [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                        [--drop_last_iter DROP_LAST_ITER] [--dry_run DRY_RUN]
                        [--iterator_type {sequence,category,chunk,task,none}]
                        [--valid_iterator_type {sequence,category,chunk,task,none}]
                        [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                        [--num_workers NUM_WORKERS]
                        [--num_att_plot NUM_ATT_PLOT]
                        [--dist_backend DIST_BACKEND]
                        [--dist_init_method DIST_INIT_METHOD]
                        [--dist_world_size DIST_WORLD_SIZE]
                        [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                        [--dist_master_addr DIST_MASTER_ADDR]
                        [--dist_master_port DIST_MASTER_PORT]
                        [--dist_launcher {slurm,mpi,None}]
                        [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                        [--unused_parameters UNUSED_PARAMETERS]
                        [--sharded_ddp SHARDED_DDP]
                        [--cudnn_enabled CUDNN_ENABLED]
                        [--cudnn_benchmark CUDNN_BENCHMARK]
                        [--cudnn_deterministic CUDNN_DETERMINISTIC]
                        [--collect_stats COLLECT_STATS]
                        [--write_collected_feats WRITE_COLLECTED_FEATS]
                        [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                        [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                        [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                        [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                        [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                        [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                        [--grad_clip GRAD_CLIP]
                        [--grad_clip_type GRAD_CLIP_TYPE]
                        [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                        [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                        [--train_dtype {float16,float32,float64}]
                        [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                        [--use_matplotlib USE_MATPLOTLIB]
                        [--use_tensorboard USE_TENSORBOARD]
                        [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                        [--use_wandb USE_WANDB]
                        [--wandb_project WANDB_PROJECT] [--wandb_id WANDB_ID]
                        [--wandb_entity WANDB_ENTITY]
                        [--wandb_name WANDB_NAME]
                        [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                        [--detect_anomaly DETECT_ANOMALY]
                        [--use_adapter USE_ADAPTER] [--adapter {lora,houlsby}]
                        [--save_strategy {all,adapter_only,required_grad_only}]
                        [--adapter_conf ADAPTER_CONF]
                        [--pretrain_path PRETRAIN_PATH]
                        [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                        [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                        [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                        [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                        [--batch_size BATCH_SIZE]
                        [--valid_batch_size VALID_BATCH_SIZE]
                        [--batch_bins BATCH_BINS]
                        [--valid_batch_bins VALID_BATCH_BINS]
                        [--train_shape_file TRAIN_SHAPE_FILE]
                        [--valid_shape_file VALID_SHAPE_FILE]
                        [--batch_type {unsorted,sorted,folded,length,numel}]
                        [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                        [--fold_length FOLD_LENGTH]
                        [--sort_in_batch {descending,ascending}]
                        [--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
                        [--sort_batch {descending,ascending}]
                        [--multiple_iterator MULTIPLE_ITERATOR]
                        [--chunk_length CHUNK_LENGTH]
                        [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                        [--num_cache_chunks NUM_CACHE_CHUNKS]
                        [--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
                        [--chunk_default_fs CHUNK_DEFAULT_FS]
                        [--chunk_max_abs_length CHUNK_MAX_ABS_LENGTH]
                        [--chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES]
                        [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                        [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                        [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                        [--max_cache_size MAX_CACHE_SIZE]
                        [--max_cache_fd MAX_CACHE_FD]
                        [--allow_multi_rates ALLOW_MULTI_RATES]
                        [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                        [--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
                        [--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
                        [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                        [--optim_conf OPTIM_CONF]
                        [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                        [--scheduler_conf SCHEDULER_CONF]
                        [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                        [--model_conf MODEL_CONF] [--criterions CRITERIONS]
                        [--train_spk2enroll TRAIN_SPK2ENROLL]
                        [--enroll_segment ENROLL_SEGMENT]
                        [--load_spk_embedding LOAD_SPK_EMBEDDING]
                        [--load_all_speakers LOAD_ALL_SPEAKERS]
                        [--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
                        [--noise_scp NOISE_SCP]
                        [--noise_apply_prob NOISE_APPLY_PROB]
                        [--noise_db_range NOISE_DB_RANGE]
                        [--short_noise_thres SHORT_NOISE_THRES]
                        [--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
                        [--use_reverberant_ref USE_REVERBERANT_REF]
                        [--num_spk NUM_SPK] [--num_noise_type NUM_NOISE_TYPE]
                        [--sample_rate SAMPLE_RATE]
                        [--force_single_channel FORCE_SINGLE_CHANNEL]
                        [--channel_reordering CHANNEL_REORDERING]
                        [--categories CATEGORIES [CATEGORIES ...]]
                        [--speech_segment SPEECH_SEGMENT]
                        [--avoid_allzero_segment AVOID_ALLZERO_SEGMENT]
                        [--flexible_numspk FLEXIBLE_NUMSPK]
                        [--encoder {stft,conv,same}]
                        [--encoder_conf ENCODER_CONF]
                        [--extractor {td_speakerbeam}]
                        [--extractor_conf EXTRACTOR_CONF]
                        [--decoder {stft,conv,same}]
                        [--decoder_conf DECODER_CONF] [--preprocessor {tse}]
                        [--preprocessor_conf PREPROCESSOR_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --drop_last_iter DROP_LAST_ITER
                        Exclude the minibatch with leftovers. (default: False)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: sequence)
  --valid_iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: None)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images used to plot the attention outputs. This option is meaningful only for attention-based models. Setting it to 0 disables the attention plot (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair consisting of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging early stopping. Give a triplet consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give a triplet consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without the model forward pass or training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
  --use_adapter USE_ADAPTER
                        Enable efficient finetuning, see (https://arxiv.org/abs/2106.09685) for large pre-trained foundation models, like Whisper and SSL models (default: False)
  --adapter {lora,houlsby}
                        Adapter Name (default: lora)
  --save_strategy {all,adapter_only,required_grad_only}
                        The strategy to save parameters. Default: 'all'
                        'all': save all parameters
                        'adapter_only': save only adapter parameters, without other parameters like downstream model
                        'required_grad_only': save only parameters with requires_grad=True
                         (default: all)
  --adapter_conf ADAPTER_CONF
                        Configuration for efficient finetuning (default: {})

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
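
The --init_param value described above packs up to four colon-separated fields into one string. The sketch below shows how such a string could be split into its parts. The names InitParamSpec and parse_init_param are hypothetical, and treating exclude_keys as a comma-separated list is an assumption that the help text does not confirm.

```python
from typing import List, NamedTuple, Optional


class InitParamSpec(NamedTuple):
    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: List[str]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' (hypothetical helper).

    Missing trailing fields are treated as unset.
    Assumption: exclude_keys is a comma-separated list.
    """
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':'-separated fields in {spec!r}")
    parts += [""] * (4 - len(parts))
    file_path, src_key, dst_key, excludes = parts
    return InitParamSpec(
        file_path=file_path,
        src_key=src_key or None,
        dst_key=dst_key or None,
        exclude_keys=[key for key in excludes.split(",") if key],
    )


if __name__ == "__main__":
    print(parse_init_param("some/where/model.pth"))
    print(parse_init_param("some/where/model.pth:decoder:decoder"))
    print(parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed"))
```

Because the fields are separated by ":", a file path that itself contains a colon would not survive this simple split.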

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type is 'unsorted', 'sorted', 'folded', or 'catbel'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file that lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        Only the first column is used, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file that describes the length of each sample

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L refers to the largest length of the samples in the mini-batch. This sampler requires the same length information as SortedBatchSampler

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that have as close to the same number of 'bins' as possible, counted by the total lengths of each feature in the mini-batch. This sampler requires a text file that describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches that have as close to the same number of 'bins' as possible, but counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --shuffle_within_batch SHUFFLE_WITHIN_BATCH
                        Shuffle whole batches sample-wise. Normally required for classification tasks. (default: False)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
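
The "folded" entry above gives the rule batch_size = base_batch_size // (L // fold_length), and the shape-file examples show lines such as "utterance_id_b 1453,80". The sketch below evaluates that rule on parsed shape lines. The helpers parse_shape_line and folded_batch_size are hypothetical. The handling of L < fold_length (where the formula as written would divide by zero) and the min_batch_size floor are assumptions, so the real FoldedBatchSampler may compute a different value.

```python
from typing import List, Tuple


def parse_shape_line(line: str) -> Tuple[str, List[int]]:
    """Parse 'utterance_id 1000,80' into ('utterance_id', [1000, 80])."""
    key, shape = line.split(maxsplit=1)
    return key, [int(dim) for dim in shape.split(",")]


def folded_batch_size(
    base_batch_size: int,
    longest_length: int,
    fold_length: int,
    min_batch_size: int = 1,
) -> int:
    """Apply batch_size = base_batch_size // (L // fold_length).

    Assumption: when L < fold_length, the base batch size is kept unchanged.
    """
    folds = longest_length // fold_length
    if folds == 0:
        return base_batch_size
    return max(min_batch_size, base_batch_size // folds)


if __name__ == "__main__":
    lines = [
        "utterance_id_a 1000,80",
        "utterance_id_b 1453,80",
        "utterance_id_c 1241,80",
    ]
    shapes = dict(parse_shape_line(line) for line in lines)
    longest = max(shape[0] for shape in shapes.values())
    print(longest, folded_batch_size(20, longest, fold_length=500))
```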

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
  --chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
                        List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
                          - exactly match one of the prefixes in `chunk_excluded_key_prefixes`
                          - have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
  --chunk_default_fs CHUNK_DEFAULT_FS
                        Default sampling rate used for the chunk length. Will be used to adaptively adjust the chunk length for data of different sampling rates. (If None, the chunk length will be fixed.) (default: None)
  --chunk_max_abs_length CHUNK_MAX_ABS_LENGTH
                        Maximum number of samples per chunk for all sampling rates (default: None)
  --chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES
                        Discard samples shorter than the minimum chunk length (default: True)
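The interplay of --chunk_length and --chunk_shift_ratio can be sketched as follows. This is an illustration under stated assumptions, not the ChunkIterFactory implementation: the function names are hypothetical, the '300-400' range is assumed to include both end points, and the shift is assumed to be the chunk length multiplied by the ratio.

```python
import random
from typing import List, Optional, Tuple


def parse_chunk_lengths(spec: str) -> List[int]:
    """Expand a --chunk_length style string into the candidate lengths."""
    lengths: List[int] = []
    for part in spec.split(","):
        if "-" in part:
            low, high = (int(v) for v in part.split("-"))
            # Assumption: the range includes both end points.
            lengths.extend(range(low, high + 1))
        else:
            lengths.append(int(part))
    return lengths


def chunk_starts(seq_len: int, chunk_len: int, shift_ratio: float) -> List[int]:
    """Start offsets of the chunks cut from a sequence of seq_len frames."""
    # Assumption: shift = chunk length * ratio, at least one frame.
    shift = max(1, int(chunk_len * shift_ratio))
    return list(range(0, seq_len - chunk_len + 1, shift))


def split_into_chunks(
    seq_len: int, spec: str, shift_ratio: float, rng: random.Random
) -> Optional[List[Tuple[int, int]]]:
    """Pick one chunk length at random and return (start, end) pairs."""
    candidates = [c for c in parse_chunk_lengths(spec) if c <= seq_len]
    if not candidates:
        return None  # shorter than every chunk length: the sample is discarded
    chunk_len = rng.choice(candidates)
    return [(s, s + chunk_len) for s in chunk_starts(seq_len, chunk_len, shift_ratio)]


rng = random.Random(0)
chunks = split_into_chunks(1000, "500", 0.5, rng)
```

With a ratio of 0.5, consecutive chunks overlap by half; with a ratio of 2.0, a gap of one chunk length is left between them.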

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio formats supported by sndfile, such as wav and flac.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "multi_columns_sound":
                        Enable a multi-column wav.scp. The following text file can be loaded as multi-channel audio data

                           utterance_id_a a.wav a2.wav
                           utterance_id_b b.wav b2.wav
                           ...

                        "variable_columns_sound":
                        Load a variable number of audio files (columns) from wav.scp. The following text file can be loaded as stacked audio data

                           utterance_id_a a1.wav a2.wav a3.wav
                           utterance_id_b b1.wav
                           utterance_id_c c1.wav c2.wav
                           ...

                        Note that audios of different lengths will be right-padded with np.nan to the longest audio in the sample.
                        A preprocessor must be used to remove these paddings.

                        "score":
                        Return text as is. The text contains tempo and note info.
                        For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.

                           utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
                           utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of floats separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of floats separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "random_text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           hello world
                           foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader; currently supported for speaker diarization.

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
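As a rough illustration of the 'path,name,type' convention and of the "text_int" file layout described above, the sketch below splits one option value into its three fields and parses a few "text_int" lines. `DataSpec`, `parse_data_spec`, and `read_text_int` are hypothetical names; ESPnet's own loaders are not reproduced here.

```python
from typing import Dict, Iterable, List, NamedTuple


class DataSpec(NamedTuple):
    path: str  # file path, e.g. some/path/a.scp
    name: str  # key name used for the mini-batch data
    type: str  # file type, e.g. sound, text_int


def parse_data_spec(value: str) -> DataSpec:
    """Split one 'path,name,type' option value into its three fields."""
    parts = value.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected 'path,name,type', got {value!r}")
    return DataSpec(*parts)


def read_text_int(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Parse 'text_int' lines: an utterance ID followed by space-separated integers."""
    data: Dict[str, List[int]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        utt_id, *values = line.split()
        data[utt_id] = [int(v) for v in values]
    return data


spec = parse_data_spec("some/path/a.scp,foo,sound")
tokens = read_text_int(["utterance_id_A 12 0 1 3", "utterance_id_B 3 3 1"])
```

Because the option is repeatable, a task with several inputs would yield one such triple per occurrence of the option.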
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys in the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --allow_multi_rates ALLOW_MULTI_RATES
                        Whether to allow audios to have different sampling rates (default: False)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)

Optimizer related:
  --exclude_weight_decay EXCLUDE_WEIGHT_DECAY
                        Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
  --exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
                        The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'num_spk': 1, 'flexible_numspk': False, 'share_encoder': True, 'extract_feats_in_collect_stats': False})
  --criterions CRITERIONS
                        The criteria bound to the loss wrappers. (default: [{'name': 'si_snr', 'conf': {}, 'wrapper': 'fixed_order', 'wrapper_conf': {}}])

  Preprocess related

  --train_spk2enroll TRAIN_SPK2ENROLL
                        The scp file containing the mapping from speakerID to enrollment
                        (This is used to sample the target-speaker enrollment signal) (default: None)
  --enroll_segment ENROLL_SEGMENT
                        Truncate the enrollment audio to the specified length if not None (default: None)
  --load_spk_embedding LOAD_SPK_EMBEDDING
                        Whether to load speaker embeddings instead of enrollments (default: False)
  --load_all_speakers LOAD_ALL_SPEAKERS
                        Whether to load target-speaker for all speakers in each sample (default: False)
  --rir_scp RIR_SCP     The file path of rir scp file. (default: None)
  --rir_apply_prob RIR_APPLY_PROB
                        The probability of applying RIR convolution. (default: 1.0)
  --noise_scp NOISE_SCP
                        The file path of noise scp file. (default: None)
  --noise_apply_prob NOISE_APPLY_PROB
                        The probability of adding noise. (default: 1.0)
  --noise_db_range NOISE_DB_RANGE
                        The range of the signal-to-noise ratio (SNR) level in decibels. (default: 13_15)
  --short_noise_thres SHORT_NOISE_THRES
                        If len(noise) / len(speech) is smaller than this threshold during dynamic mixing, a warning will be displayed. (default: 0.5)
  --speech_volume_normalize SPEECH_VOLUME_NORMALIZE
                        Scale the maximum amplitude to the given value or range. e.g. --speech_volume_normalize 1.0 scales it to 1.0.
                        --speech_volume_normalize 0.5_1.0 scales it to a random number in the range [0.5, 1.0) (default: None)
  --use_reverberant_ref USE_REVERBERANT_REF
                        Whether to use reverberant speech references instead of anechoic ones (default: False)
  --num_spk NUM_SPK     Number of speakers in the input signal. (default: 1)
  --num_noise_type NUM_NOISE_TYPE
                        Number of noise types. (default: 1)
  --sample_rate SAMPLE_RATE
                        Sampling rate of the data (in Hz). (default: 8000)
  --force_single_channel FORCE_SINGLE_CHANNEL
                        Whether to force all data to be single-channel. (default: False)
  --channel_reordering CHANNEL_REORDERING
                        Whether to randomly reorder the channels of the multi-channel signals. (default: False)
  --categories CATEGORIES [CATEGORIES ...]
                        The set of all possible categories in the dataset. Used to add the category information to each sample (default: [])
  --speech_segment SPEECH_SEGMENT
                        Truncate the audios (except for the enrollment) to the specified length if not None (default: None)
  --avoid_allzero_segment AVOID_ALLZERO_SEGMENT
                        Only used when --speech_segment is specified. If True, make sure all truncated segments are not all-zero (default: True)
  --flexible_numspk FLEXIBLE_NUMSPK
                        Whether to load variable numbers of speakers in each sample. In this case, only the first-speaker files such as 'spk1.scp' and 'dereverb1.scp' are used, which are expected to have multiple columns. Other numbered files such as 'spk2.scp' and 'dereverb2.scp' are ignored. (default: False)
  --encoder {stft,conv,same}
                        The encoder type (default: stft)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --extractor {td_speakerbeam}
                        The extractor type (default: td_speakerbeam)
  --extractor_conf EXTRACTOR_CONF
                        The keyword arguments for extractor (default: {})
  --decoder {stft,conv,same}
                        The decoder type (default: stft)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})
  --preprocessor {tse}  The preprocessor type (default: tse)
  --preprocessor_conf PREPROCESSOR_CONF
                        The keyword arguments for preprocessor (default: {})
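Options such as --noise_db_range and --speech_volume_normalize accept either a single number or a 'low_high' range joined by an underscore. The sketch below shows one way such a value could be parsed and used to rescale a signal's peak amplitude. It is an assumption-laden illustration, not ESPnet's preprocessor: `parse_range` and `scale_to_peak` are hypothetical helpers, and a plain list of floats stands in for the waveform.

```python
import random
from typing import List, Sequence, Tuple


def parse_range(value: str) -> Tuple[float, float]:
    """Parse 'low_high' (e.g. '13_15') or a single number (e.g. '1.0')."""
    parts = value.split("_")
    if len(parts) == 1:
        low = high = float(parts[0])
    elif len(parts) == 2:
        low, high = float(parts[0]), float(parts[1])
    else:
        raise ValueError(f"expected 'value' or 'low_high', got {value!r}")
    return low, high


def scale_to_peak(samples: Sequence[float], value: str, rng: random.Random) -> List[float]:
    """Scale samples so that the maximum absolute amplitude matches the target."""
    low, high = parse_range(value)
    # A single value gives a fixed target; a range gives a random target.
    # (random.uniform may return the upper end in rare rounding cases.)
    target = rng.uniform(low, high)
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # all-zero signal: nothing to scale
    return [s * target / peak for s in samples]


rng = random.Random(0)
scaled = scale_to_peak([0.1, -0.2, 0.05], "1.0", rng)
```

The same `parse_range` helper would apply to an SNR range such as '13_15', from which a level in decibels could be drawn per sample.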

gan_svs_train.py¶

usage: gan_svs_train.py [-h] [--config CONFIG] [--print_config]
                        [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                        [--drop_last_iter DROP_LAST_ITER] [--dry_run DRY_RUN]
                        [--iterator_type {sequence,category,chunk,task,none}]
                        [--valid_iterator_type {sequence,category,chunk,task,none}]
                        [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                        [--num_workers NUM_WORKERS]
                        [--num_att_plot NUM_ATT_PLOT]
                        [--dist_backend DIST_BACKEND]
                        [--dist_init_method DIST_INIT_METHOD]
                        [--dist_world_size DIST_WORLD_SIZE]
                        [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                        [--dist_master_addr DIST_MASTER_ADDR]
                        [--dist_master_port DIST_MASTER_PORT]
                        [--dist_launcher {slurm,mpi,None}]
                        [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                        [--unused_parameters UNUSED_PARAMETERS]
                        [--sharded_ddp SHARDED_DDP]
                        [--cudnn_enabled CUDNN_ENABLED]
                        [--cudnn_benchmark CUDNN_BENCHMARK]
                        [--cudnn_deterministic CUDNN_DETERMINISTIC]
                        [--collect_stats COLLECT_STATS]
                        [--write_collected_feats WRITE_COLLECTED_FEATS]
                        [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                        [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                        [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                        [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                        [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                        [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                        [--grad_clip GRAD_CLIP]
                        [--grad_clip_type GRAD_CLIP_TYPE]
                        [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                        [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                        [--train_dtype {float16,float32,float64}]
                        [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                        [--use_matplotlib USE_MATPLOTLIB]
                        [--use_tensorboard USE_TENSORBOARD]
                        [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                        [--use_wandb USE_WANDB]
                        [--wandb_project WANDB_PROJECT] [--wandb_id WANDB_ID]
                        [--wandb_entity WANDB_ENTITY]
                        [--wandb_name WANDB_NAME]
                        [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                        [--detect_anomaly DETECT_ANOMALY]
                        [--use_adapter USE_ADAPTER] [--adapter {lora,houlsby}]
                        [--save_strategy {all,adapter_only,required_grad_only}]
                        [--adapter_conf ADAPTER_CONF]
                        [--pretrain_path PRETRAIN_PATH]
                        [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                        [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                        [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                        [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                        [--batch_size BATCH_SIZE]
                        [--valid_batch_size VALID_BATCH_SIZE]
                        [--batch_bins BATCH_BINS]
                        [--valid_batch_bins VALID_BATCH_BINS]
                        [--train_shape_file TRAIN_SHAPE_FILE]
                        [--valid_shape_file VALID_SHAPE_FILE]
                        [--batch_type {unsorted,sorted,folded,length,numel}]
                        [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                        [--fold_length FOLD_LENGTH]
                        [--sort_in_batch {descending,ascending}]
                        [--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
                        [--sort_batch {descending,ascending}]
                        [--multiple_iterator MULTIPLE_ITERATOR]
                        [--chunk_length CHUNK_LENGTH]
                        [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                        [--num_cache_chunks NUM_CACHE_CHUNKS]
                        [--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
                        [--chunk_default_fs CHUNK_DEFAULT_FS]
                        [--chunk_max_abs_length CHUNK_MAX_ABS_LENGTH]
                        [--chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES]
                        [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                        [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                        [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                        [--max_cache_size MAX_CACHE_SIZE]
                        [--max_cache_fd MAX_CACHE_FD]
                        [--allow_multi_rates ALLOW_MULTI_RATES]
                        [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                        [--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
                        [--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
                        [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                        [--optim_conf OPTIM_CONF]
                        [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                        [--scheduler_conf SCHEDULER_CONF]
                        [--optim2 {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                        [--optim2_conf OPTIM2_CONF]
                        [--scheduler2 {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                        [--scheduler2_conf SCHEDULER2_CONF]
                        [--generator_first GENERATOR_FIRST]
                        [--input_size INPUT_SIZE] [--token_list TOKEN_LIST]
                        [--odim ODIM] [--model_conf MODEL_CONF]
                        [--use_preprocessor USE_PREPROCESSOR]
                        [--token_type {bpe,char,word,phn}]
                        [--bpemodel BPEMODEL]
                        [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                        [--cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}]
                        [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                        [--fs FS] [--postfrontend {s3prl,fused,None}]
                        [--postfrontend_conf POSTFRONTEND_CONF]
                        [--score_feats_extract {frame_score_feats,syllable_score_feats}]
                        [--score_feats_extract_conf SCORE_FEATS_EXTRACT_CONF]
                        [--feats_extract {fbank,log_spectrogram,linear_spectrogram}]
                        [--feats_extract_conf FEATS_EXTRACT_CONF]
                        [--normalize {global_mvn,utterance_mvn,None}]
                        [--normalize_conf NORMALIZE_CONF]
                        [--svs {vits,joint_score2wav}] [--svs_conf SVS_CONF]
                        [--pitch_extract {dio,None}]
                        [--pitch_extract_conf PITCH_EXTRACT_CONF]
                        [--pitch_normalize {global_mvn,utterance_mvn,None}]
                        [--pitch_normalize_conf PITCH_NORMALIZE_CONF]
                        [--ying_extract {ying,None}]
                        [--ying_extract_conf YING_EXTRACT_CONF]
                        [--energy_extract {energy,None}]
                        [--energy_extract_conf ENERGY_EXTRACT_CONF]
                        [--energy_normalize {global_mvn,utterance_mvn,None}]
                        [--energy_normalize_conf ENERGY_NORMALIZE_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --generator_first GENERATOR_FIRST
                        Whether to update generator first. (default: False)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)
  --fs FS               sample rate (default: 24000)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --drop_last_iter DROP_LAST_ITER
                        Exclude the minibatch with leftovers. (default: False)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: sequence)
  --valid_iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: None)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images to plot from the attention outputs. This option is meaningful only for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
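With --dist_init_method env://, the rendezvous information comes from the environment variables MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK. The sketch below shows how explicit option values and the environment could be merged and validated before initialization. That the command-line values take precedence over the environment is an assumption made for this example, and `resolve_env_init` is a hypothetical helper rather than part of ESPnet or PyTorch.

```python
from typing import Dict, Mapping, Optional


def resolve_env_init(
    env: Mapping[str, str],
    dist_master_addr: Optional[str] = None,
    dist_master_port: Optional[int] = None,
    dist_world_size: Optional[int] = None,
    dist_rank: Optional[int] = None,
) -> Dict[str, str]:
    """Merge explicit options with the variables that 'env://' relies on."""
    overrides = {
        "MASTER_ADDR": dist_master_addr,
        "MASTER_PORT": dist_master_port,
        "WORLD_SIZE": dist_world_size,
        "RANK": dist_rank,
    }
    resolved = dict(env)
    for key, value in overrides.items():
        # Assumption: an explicitly given option overrides the environment.
        if value is not None:
            resolved[key] = str(value)

    missing = [key for key in overrides if key not in resolved]
    if missing:
        raise RuntimeError(f"env:// needs these variables: {', '.join(missing)}")
    return {key: resolved[key] for key in overrides}


# A plain dict stands in for os.environ here.
example_env = {"MASTER_ADDR": "10.0.0.1", "WORLD_SIZE": "2", "RANK": "0"}
settings = resolve_env_init(example_env, dist_master_port=29500)
```

In a real job, `os.environ` would be passed instead of the example dict, and the launcher (for instance slurm or mpi) would normally have set the rank and world size already.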

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair consisting of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used to decide early stopping. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used to select the best model. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        Whether to inject noise into the gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
  --use_adapter USE_ADAPTER
                        Enable efficient finetuning, see (https://arxiv.org/abs/2106.09685) for large pre-trained foundation models, like Whisper and SSL models (default: False)
  --adapter {lora,houlsby}
                        Adapter Name (default: lora)
  --save_strategy {all,adapter_only,required_grad_only}
                        The strategy to save parameters. Default: 'all'
                        'all': save all parameters
                        'adapter_only': save only adapter parameters, without other parameters like downstream model
                        'required_grad_only': save only parameters with requires_grad=True
                         (default: all)
  --adapter_conf ADAPTER_CONF
                        Configuration for efficient finetuning (default: {})
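The trainer options --patience, --early_stopping_criterion, and --keep_nbest_models all work on a per-epoch history of one criterion, such as the validation loss. The sketch below illustrates the idea with plain lists. It is not the ESPnet trainer: `should_stop` and `nbest_epochs` are hypothetical helpers, and the exact comparison used for the patience check is an assumption.

```python
from typing import List, Optional


def should_stop(history: List[float], patience: Optional[int], mode: str = "min") -> bool:
    """Return True once more than `patience` epochs passed without a new best value."""
    if patience is None or not history:
        return False  # early stopping disabled, or nothing recorded yet
    best = min(history) if mode == "min" else max(history)
    best_index = history.index(best)  # first epoch that reached the best value
    epochs_without_improvement = len(history) - 1 - best_index
    # Assumption: stop only when the wait exceeds the patience.
    return epochs_without_improvement > patience


def nbest_epochs(history: List[float], n: int, mode: str = "min") -> List[int]:
    """1-based numbers of the n best-scoring epochs, best first."""
    order = sorted(range(len(history)), key=lambda i: history[i], reverse=(mode == "max"))
    return [i + 1 for i in order[:n]]


# Hypothetical validation losses, one value per epoch.
valid_loss = [1.0, 0.8, 0.7, 0.75, 0.72, 0.9]
stop_now = should_stop(valid_loss, patience=2, mode="min")
kept = nbest_epochs(valid_loss, n=2, mode="min")
```

Snapshots of epochs outside the n-best list are the ones that would be removed; for an accuracy-like criterion the mode would be "max" instead.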

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states from the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
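
The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format of --init_param can be illustrated with a small parser. This is only a sketch, not the ESPnet loader: the names InitParamSpec and parse_init_param are hypothetical, and it assumes that several exclude keys are separated by commas and that the file path itself contains no ':'.

```python
from typing import NamedTuple, Optional, Tuple


class InitParamSpec(NamedTuple):
    """Hypothetical container for one parsed --init_param value."""

    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: Tuple[str, ...]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into fields.

    Trailing fields may be omitted. Assumption: multiple exclude keys are
    comma-separated, and paths containing ':' are not handled.
    """
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"Too many ':'-separated fields: {spec!r}")
    # Pad the omitted trailing fields with empty strings.
    parts += [""] * (4 - len(parts))
    file_path, src_key, dst_key, exclude = parts
    exclude_keys = tuple(k for k in exclude.split(",") if k)
    return InitParamSpec(file_path, src_key or None, dst_key or None, exclude_keys)


# The three examples from the help text above.
for value in (
    "some/where/model.pth",
    "some/where/model.pth:decoder:decoder",
    "some/where/model.pth:decoder:decoder:decoder.embed",
):
    print(parse_init_param(value))
```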

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', 'folded', or 'catbel'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        Only the first column is read, so a 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in each mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L refers to the largest length of the samples in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that have as nearly the same number of 'bins' as possible, counted by the total length of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches that have as nearly the same number of 'bins' as possible, but counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --shuffle_within_batch SHUFFLE_WITHIN_BATCH
                        Shuffle whole batches sample-wise. Normally required for classification tasks. (default: False)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
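
The batch-size rules described under --batch_type can be tried out on the sample shape-file lines. The following is a sketch that follows the help text only, not the actual sampler classes: the function names are hypothetical, and clamping the divisor to 1 (so that samples shorter than fold_length keep the base batch size) is an assumption.

```python
from typing import Dict, List, Tuple


def parse_shape_lines(lines: List[str]) -> Dict[str, Tuple[int, ...]]:
    """Parse lines such as 'utterance_id_a 1000,80' into {id: (1000, 80)}."""
    shapes = {}
    for line in lines:
        if not line.strip():
            continue
        utt_id, dims = line.split(maxsplit=1)
        shapes[utt_id] = tuple(int(d) for d in dims.split(","))
    return shapes


def folded_batch_size(base_batch_size: int, longest: int, fold_length: int) -> int:
    """batch_size = base_batch_size // (L // fold_length), as in the help text.

    Assumption: the divisor is clamped to 1 so that short samples keep
    base_batch_size, and the result is never smaller than 1.
    """
    folds = max(1, longest // fold_length)
    return max(1, base_batch_size // folds)


def length_bins(shape: Tuple[int, ...]) -> int:
    """Bins counted as for 'length': the first dimension only."""
    return shape[0]


def numel_bins(shape: Tuple[int, ...]) -> int:
    """Bins counted as for 'numel': the product of all dimensions."""
    count = 1
    for dim in shape:
        count *= dim
    return count


shapes = parse_shape_lines([
    "utterance_id_a 1000,80",
    "utterance_id_b 1453,80",
    "utterance_id_c 1241,80",
])
longest = max(shape[0] for shape in shapes.values())
print(folded_batch_size(20, longest, fold_length=500))
print(sum(length_bins(s) for s in shapes.values()))
print(sum(numel_bins(s) for s in shapes.values()))
```

With 'length', a mini-batch is filled until the summed first dimensions reach --batch_bins; with 'numel', the summed element counts are used instead, which is why the full shape is needed.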

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks as a ratio of the chunk length. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
  --chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
                        List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
                          - exactly match one of the prefixes in `chunk_excluded_key_prefixes`
                          - have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
  --chunk_default_fs CHUNK_DEFAULT_FS
                        Default sampling rate used for the chunk length. Will be used to adaptively adjust the chunk length for data of different sampling rates. (If None, the chunk length will be fixed.) (default: None)
  --chunk_max_abs_length CHUNK_MAX_ABS_LENGTH
                        Maximum number of samples per chunk for all sampling rates (default: None)
  --chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES
                        Discard samples shorter than the minimum chunk length (default: True)
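
The --chunk_length syntax and the effect of --chunk_shift_ratio can be sketched as follows. This is an illustration under stated assumptions, not the ChunkIterFactory implementation: the function names are hypothetical, a range 'low-high' is assumed to include both end points, and the shift is assumed to be the chunk length multiplied by the ratio, truncated to an integer.

```python
import random
from typing import List


def parse_chunk_length(spec: str) -> List[int]:
    """Expand '300', '300,400,500', or '300-400' into candidate lengths.

    Assumption: a range 'low-high' includes both end points.
    """
    lengths: List[int] = []
    for item in spec.split(","):
        item = item.strip()
        if "-" in item:
            low, high = (int(v) for v in item.split("-"))
            if low > high:
                raise ValueError(f"Invalid range: {item!r}")
            lengths.extend(range(low, high + 1))
        else:
            lengths.append(int(item))
    return lengths


def chunk_starts(seq_len: int, chunk_len: int, shift_ratio: float) -> List[int]:
    """Start offsets of the chunks; an empty list means the sample is too short."""
    if seq_len < chunk_len:
        return []
    # A ratio below 1 gives overlapping chunks, above 1 leaves gaps.
    shift = max(1, int(chunk_len * shift_ratio))
    return list(range(0, seq_len - chunk_len + 1, shift))


lengths = parse_chunk_length("300,400,500")
chosen = random.Random(0).choice(lengths)  # one length is picked per sample
print(chosen, chunk_starts(1200, chosen, 0.5))
```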

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio formats supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "multi_columns_sound":
                        Enable a multi-column wav.scp. The following text file can be loaded as multi-channel audio data

                           utterance_id_a a.wav a2.wav
                           utterance_id_b b.wav b2.wav
                           ...

                        "variable_columns_sound":
                        Load a variable number of audio columns from wav.scp. The following text file can be loaded as stacked audio data

                           utterance_id_a a1.wav a2.wav a3.wav
                           utterance_id_b b1.wav
                           utterance_id_c c1.wav c2.wav
                           ...

                        Note that audios of different lengths will be right-padded with np.nan to the longest audio in the sample.
                        A preprocessor must be used to remove these paddings.

                        "score":
                        Return text as is. The text contains tempo and note info.
                        For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.

                           utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
                           utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integer numbers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integer numbers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "random_text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           hello world
                           foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']

                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader, currently supported for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys in the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors kept open for ark files. This feature is only valid when the data type is 'kaldi_ark'. (default: 32)
  --allow_multi_rates ALLOW_MULTI_RATES
                        Whether to allow audios to have different sampling rates (default: False)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for the validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
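
The 'path,name,type' triple and the simple 'utterance_id value' file layouts listed under --train_data_path_and_name_and_type can be parsed along these lines. This is a sketch for illustration rather than the ESPnet dataset code: all function names are hypothetical, and only the layouts shown in the help text are handled.

```python
import re
from typing import Dict, List, Optional, Tuple


def parse_path_name_type(value: str) -> Tuple[str, str, str]:
    """Split 'some/path/a.scp,foo,sound' into (path, name, type)."""
    parts = value.split(",")
    if len(parts) != 3:
        raise ValueError(f"Expected three comma-separated words: {value!r}")
    path, name, file_type = (p.strip() for p in parts)
    return path, name, file_type


def rand_int_bounds(file_type: str) -> Optional[Tuple[int, int]]:
    """Return (low, high) for a type such as 'rand_int_0_10', else None."""
    match = re.fullmatch(r"rand_int_(\d+)_(\d+)", file_type)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_scp_lines(lines: List[str]) -> Dict[str, str]:
    """Map each utterance id to the rest of its line (a path or text)."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        utt_id, _, rest = line.partition(" ")
        table[utt_id] = rest.strip()
    return table


print(parse_path_name_type("some/path/a.scp,foo,sound"))
print(rand_int_bounds("rand_int_0_10"))
print(parse_scp_lines(["utterance_id_a a.wav", "utterance_id_b b.wav"]))
```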

Optimizer related:
  --exclude_weight_decay EXCLUDE_WEIGHT_DECAY
                        Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
  --exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
                        The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})
  --optim2 {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim2_conf OPTIM2_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler2 {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler2_conf SCHEDULER2_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --input_size INPUT_SIZE
                        The number of input dimensions of the feature (default: None)
  --token_list TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --odim ODIM           The number of dimensions of the output feature (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {})

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn}
                        The text will be tokenized at the specified token level (default: phn)
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --postfrontend {s3prl,fused,None}
                        The postfrontend type (default: None)
  --postfrontend_conf POSTFRONTEND_CONF
                        The keyword arguments for postfrontend (default: {})
  --score_feats_extract {frame_score_feats,syllable_score_feats}
                        The score_feats_extract type (default: frame_score_feats)
  --score_feats_extract_conf SCORE_FEATS_EXTRACT_CONF
                        The keyword arguments for score_feats_extract (default: {})
  --feats_extract {fbank,log_spectrogram,linear_spectrogram}
                        The feats_extract type (default: linear_spectrogram)
  --feats_extract_conf FEATS_EXTRACT_CONF
                        The keyword arguments for feats_extract (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: None)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --svs {vits,joint_score2wav}
                        The svs type (default: vits)
  --svs_conf SVS_CONF   The keyword arguments for svs (default: {})
  --pitch_extract {dio,None}
                        The pitch_extract type (default: None)
  --pitch_extract_conf PITCH_EXTRACT_CONF
                        The keyword arguments for pitch_extract (default: {})
  --pitch_normalize {global_mvn,utterance_mvn,None}
                        The pitch_normalize type (default: None)
  --pitch_normalize_conf PITCH_NORMALIZE_CONF
                        The keyword arguments for pitch_normalize (default: {})
  --ying_extract {ying,None}
                        The ying_extract type (default: None)
  --ying_extract_conf YING_EXTRACT_CONF
                        The keyword arguments for ying_extract (default: {})
  --energy_extract {energy,None}
                        The energy_extract type (default: None)
  --energy_extract_conf ENERGY_EXTRACT_CONF
                        The keyword arguments for energy_extract (default: {})
  --energy_normalize {global_mvn,utterance_mvn,None}
                        The energy_normalize type (default: None)
  --energy_normalize_conf ENERGY_NORMALIZE_CONF
                        The keyword arguments for energy_normalize (default: {})

gan_tts_train.py¶

usage: gan_tts_train.py [-h] [--config CONFIG] [--print_config]
                        [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                        [--drop_last_iter DROP_LAST_ITER] [--dry_run DRY_RUN]
                        [--iterator_type {sequence,category,chunk,task,none}]
                        [--valid_iterator_type {sequence,category,chunk,task,none}]
                        [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                        [--num_workers NUM_WORKERS]
                        [--num_att_plot NUM_ATT_PLOT]
                        [--dist_backend DIST_BACKEND]
                        [--dist_init_method DIST_INIT_METHOD]
                        [--dist_world_size DIST_WORLD_SIZE]
                        [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                        [--dist_master_addr DIST_MASTER_ADDR]
                        [--dist_master_port DIST_MASTER_PORT]
                        [--dist_launcher {slurm,mpi,None}]
                        [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                        [--unused_parameters UNUSED_PARAMETERS]
                        [--sharded_ddp SHARDED_DDP]
                        [--cudnn_enabled CUDNN_ENABLED]
                        [--cudnn_benchmark CUDNN_BENCHMARK]
                        [--cudnn_deterministic CUDNN_DETERMINISTIC]
                        [--collect_stats COLLECT_STATS]
                        [--write_collected_feats WRITE_COLLECTED_FEATS]
                        [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                        [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                        [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                        [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                        [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                        [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                        [--grad_clip GRAD_CLIP]
                        [--grad_clip_type GRAD_CLIP_TYPE]
                        [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                        [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                        [--train_dtype {float16,float32,float64}]
                        [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                        [--use_matplotlib USE_MATPLOTLIB]
                        [--use_tensorboard USE_TENSORBOARD]
                        [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                        [--use_wandb USE_WANDB]
                        [--wandb_project WANDB_PROJECT] [--wandb_id WANDB_ID]
                        [--wandb_entity WANDB_ENTITY]
                        [--wandb_name WANDB_NAME]
                        [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                        [--detect_anomaly DETECT_ANOMALY]
                        [--use_adapter USE_ADAPTER] [--adapter {lora,houlsby}]
                        [--save_strategy {all,adapter_only,required_grad_only}]
                        [--adapter_conf ADAPTER_CONF]
                        [--pretrain_path PRETRAIN_PATH]
                        [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                        [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                        [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                        [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                        [--batch_size BATCH_SIZE]
                        [--valid_batch_size VALID_BATCH_SIZE]
                        [--batch_bins BATCH_BINS]
                        [--valid_batch_bins VALID_BATCH_BINS]
                        [--train_shape_file TRAIN_SHAPE_FILE]
                        [--valid_shape_file VALID_SHAPE_FILE]
                        [--batch_type {unsorted,sorted,folded,length,numel}]
                        [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                        [--fold_length FOLD_LENGTH]
                        [--sort_in_batch {descending,ascending}]
                        [--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
                        [--sort_batch {descending,ascending}]
                        [--multiple_iterator MULTIPLE_ITERATOR]
                        [--chunk_length CHUNK_LENGTH]
                        [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                        [--num_cache_chunks NUM_CACHE_CHUNKS]
                        [--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
                        [--chunk_default_fs CHUNK_DEFAULT_FS]
                        [--chunk_max_abs_length CHUNK_MAX_ABS_LENGTH]
                        [--chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES]
                        [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                        [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                        [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                        [--max_cache_size MAX_CACHE_SIZE]
                        [--max_cache_fd MAX_CACHE_FD]
                        [--allow_multi_rates ALLOW_MULTI_RATES]
                        [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                        [--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
                        [--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
                        [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                        [--optim_conf OPTIM_CONF]
                        [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                        [--scheduler_conf SCHEDULER_CONF]
                        [--optim2 {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                        [--optim2_conf OPTIM2_CONF]
                        [--scheduler2 {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                        [--scheduler2_conf SCHEDULER2_CONF]
                        [--generator_first GENERATOR_FIRST]
                        [--token_list TOKEN_LIST] [--odim ODIM]
                        [--model_conf MODEL_CONF]
                        [--use_preprocessor USE_PREPROCESSOR]
                        [--token_type {bpe,char,word,phn}]
                        [--bpemodel BPEMODEL]
                        [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                        [--cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}]
                        [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                        [--feats_extract {fbank,log_spectrogram,linear_spectrogram}]
                        [--feats_extract_conf FEATS_EXTRACT_CONF]
                        [--normalize {global_mvn,utterance_mvn,None}]
                        [--normalize_conf NORMALIZE_CONF]
                        [--tts {vits,joint_text2wav,jets}]
                        [--tts_conf TTS_CONF] [--pitch_extract {dio,None}]
                        [--pitch_extract_conf PITCH_EXTRACT_CONF]
                        [--pitch_normalize {global_mvn,utterance_mvn,None}]
                        [--pitch_normalize_conf PITCH_NORMALIZE_CONF]
                        [--energy_extract {energy,None}]
                        [--energy_extract_conf ENERGY_EXTRACT_CONF]
                        [--energy_normalize {global_mvn,utterance_mvn,None}]
                        [--energy_normalize_conf ENERGY_NORMALIZE_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --generator_first GENERATOR_FIRST
                        Whether to update generator first. (default: False)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --drop_last_iter DROP_LAST_ITER
                        Exclude the minibatch with leftovers. (default: False)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: sequence)
  --valid_iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: None)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images to plot from the attention outputs. This option only makes sense for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
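The env:// initialization described for --dist_init_method can be illustrated with a minimal sketch. The helper name `resolve_env_rendezvous` and the returned dictionary layout are hypothetical and are not part of ESPnet or PyTorch; the sketch only shows which four environment variables such a setup would need.

```python
import os


def resolve_env_rendezvous(environ=None):
    """Collect the four values that an env:// style init method reads.

    Hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    required = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
    missing = [name for name in required if name not in environ]
    if missing:
        raise KeyError(f"missing environment variables: {missing}")
    return {
        "master_addr": environ["MASTER_ADDR"],
        "master_port": int(environ["MASTER_PORT"]),
        "world_size": int(environ["WORLD_SIZE"]),
        "rank": int(environ["RANK"]),
    }


# Example values for a two-process run on one machine (made up for the sketch).
example = resolve_env_rendezvous(
    {
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": "29500",
        "WORLD_SIZE": "2",
        "RANK": "0",
    }
)
print(example)
```

Passing a plain dictionary instead of `os.environ` keeps the sketch testable without touching the real process environment.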

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion whose value is passed to the lr scheduler. Give a pair: the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging early stopping. Give a triple: the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give a triple: the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        Whether to inject noise into the gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without the model forward pass or training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, the interval is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
  --use_adapter USE_ADAPTER
                        Enable efficient finetuning, see (https://arxiv.org/abs/2106.09685) for large pre-trained foundation models, like Whisper and SSL models (default: False)
  --adapter {lora,houlsby}
                        Adapter Name (default: lora)
  --save_strategy {all,adapter_only,required_grad_only}
                        The strategy to save parameters. Default: 'all'
                        'all': save all parameters
                        'adapter_only': save only adapter parameters, without other parameters like downstream model
                        'required_grad_only': save only parameters with requires_grad=True
                         (default: all)
  --adapter_conf ADAPTER_CONF
                        Configuration for efficient finetuning (default: {})
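The interaction between --patience and a ('valid', 'loss', 'min') style early-stopping criterion can be sketched as follows. This is a minimal illustration under the assumption that training stops once the last `patience` epochs fail to improve on the best earlier value; the function name `should_stop` is hypothetical, and the rule actually applied by the trainer may differ in detail.

```python
def should_stop(history, patience, mode="min"):
    """Return True when the last `patience` epochs brought no improvement.

    history: per-epoch values of the monitored criterion (e.g. validation loss).
    patience: number of epochs to wait; None disables early stopping.
    mode: "min" if smaller is better, "max" if larger is better.
    """
    if patience is None or len(history) <= patience:
        return False
    pick = min if mode == "min" else max
    best_before = pick(history[:-patience])   # best value before the waiting window
    recent_best = pick(history[-patience:])   # best value inside the waiting window
    if mode == "min":
        return recent_best >= best_before
    return recent_best <= best_before


# Validation loss stopped improving after the second epoch.
print(should_stop([1.0, 0.8, 0.9, 0.85], patience=2, mode="min"))
# The last epoch improved, so training would continue.
print(should_stop([1.0, 0.8, 0.9, 0.7], patience=2, mode="min"))
```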

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
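The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format of --init_param can be made concrete with a small parsing sketch. The function `parse_init_param` is hypothetical and is not ESPnet's loader; it assumes the file path itself contains no ':' and leaves exclude_keys as a raw string, because the help text does not say how several keys would be separated.

```python
def parse_init_param(spec):
    """Split an --init_param value into its four colon-separated fields.

    Trailing fields that are not given are returned as None.
    """
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':' separated fields: {spec}")
    parts += [None] * (4 - len(parts))
    file_path, src_key, dst_key, exclude_keys = parts
    return {
        "file_path": file_path,
        "src_key": src_key,
        "dst_key": dst_key,
        "exclude_keys": exclude_keys,
    }


# The three examples from the help text above.
for spec in (
    "some/where/model.pth",
    "some/where/model.pth:decoder:decoder",
    "some/where/model.pth:decoder:decoder:decoder.embed",
):
    print(parse_init_param(spec))
```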
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', 'folded', or 'catbel'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE
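How a --batch_bins budget turns per-utterance lengths into variable-size mini-batches can be sketched with a simple greedy grouping. This is an illustration of the general idea only, not the sampler shipped with ESPnet; the function name `make_length_batches` and the rule of closing a batch before it would exceed the budget are assumptions.

```python
def make_length_batches(lengths, batch_bins):
    """Group utterance ids so that each batch holds at most about `batch_bins` frames.

    lengths: dict mapping utterance id -> length (first column of a shape file).
    A single utterance longer than `batch_bins` still forms its own batch.
    """
    batches, current, current_bins = [], [], 0
    # Sort by length so that utterances of similar length share a batch.
    for utt_id, length in sorted(lengths.items(), key=lambda kv: kv[1]):
        if current and current_bins + length > batch_bins:
            batches.append(current)
            current, current_bins = [], 0
        current.append(utt_id)
        current_bins += length
    if current:
        batches.append(current)
    return batches


lengths = {"utterance_id_a": 1000, "utterance_id_b": 1453, "utterance_id_c": 1241}
batches = make_length_batches(lengths, batch_bins=2500)
print(batches)
```

For a 'numel' style budget, the same loop could count length times feature dimension (for example 1000 * 80) instead of the length alone.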

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and simply creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        Only the first column is used, so a 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is used, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is the largest length of the samples in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches whose number of 'bins' is as equal as possible, counted by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is used, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches whose number of 'bins' is as equal as possible, but counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
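The "folded" rule, batch_size = base_batch_size // (L // fold_length), can be checked with a short worked example. The function name `folded_batch_size` is hypothetical. The guard for the case L < fold_length, where the quotient would be 0, is an assumption added here so the sketch runs; the help text does not say how that case is handled.

```python
def folded_batch_size(base_batch_size, max_length, fold_length):
    """Apply batch_size = base_batch_size // (L // fold_length).

    max_length plays the role of L, the largest sample length in the mini-batch.
    """
    folds = max_length // fold_length
    # Assumption: keep the base batch size when L is shorter than fold_length,
    # because the formula as written would divide by zero.
    if folds == 0:
        return base_batch_size
    return base_batch_size // folds


# With base_batch_size=20 and fold_length=500:
print(folded_batch_size(20, 1453, 500))  # 1453 // 500 = 2, so 20 // 2
print(folded_batch_size(20, 1600, 500))  # 1600 // 500 = 3, so 20 // 3
print(folded_batch_size(20, 400, 500))   # shorter than fold_length
```

Longer utterances therefore get smaller mini-batches, which keeps the padded size of each batch roughly bounded.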
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --shuffle_within_batch SHUFFLE_WITHIN_BATCH
                        Shuffle whole batches sample-wise. Normally required for classification tasks. (default: False)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
  --chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
                        List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
                          - exactly match one of the prefixes in `chunk_excluded_key_prefixes`
                          - have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
  --chunk_default_fs CHUNK_DEFAULT_FS
                        Default sampling rate used for the chunk length. Will be used to adaptively adjust the chunk length for data of different sampling rates. (If None, the chunk length will be fixed.) (default: None)
  --chunk_max_abs_length CHUNK_MAX_ABS_LENGTH
                        Maximum number of samples per chunk for all sampling rates (default: None)
  --chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES
                        Discard samples shorter than the minimum chunk length (default: True)
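The three accepted forms of --chunk_length and the effect of --chunk_shift_ratio can be sketched as below. Both function names are hypothetical, and two details are assumptions rather than facts taken from the help text: that the '300-400' range includes its upper bound, and that the shift in frames is the chunk length multiplied by the ratio and truncated to an integer.

```python
def parse_chunk_length(spec):
    """Return the candidate chunk lengths encoded by a --chunk_length value."""
    spec = str(spec)
    if "-" in spec:
        low, high = (int(v) for v in spec.split("-"))
        # Assumption: the range is inclusive on both ends.
        return list(range(low, high + 1))
    return [int(v) for v in spec.split(",")]


def chunk_starts(seq_length, chunk_length, shift_ratio):
    """Return the start offsets of the chunks cut from one sequence.

    An empty list means the sequence is shorter than the chunk length.
    """
    shift = max(1, int(chunk_length * shift_ratio))
    return list(range(0, seq_length - chunk_length + 1, shift))


print(parse_chunk_length("300"))
print(parse_chunk_length("300,400,500"))
print(parse_chunk_length("300-302"))

# Ratio below 1: consecutive chunks overlap by half a chunk.
print(chunk_starts(1000, 300, 0.5))
# Ratio above 1: frames 300..599 fall into a gap between the two chunks.
print(chunk_starts(1000, 300, 2.0))
# Sequence shorter than the chunk length: nothing is produced.
print(chunk_starts(200, 300, 0.5))
```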

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types which are supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "multi_columns_sound":
                        Enable multi-column wav.scp. The following text file can be loaded as multi-channel audio data

                           utterance_id_a a.wav a2.wav
                           utterance_id_b b.wav b2.wav
                           ...

                        "variable_columns_sound":
                        Loading variable numbers (columns) of audios in wav.scp. The following text file can be loaded as stacked audio data

                           utterance_id_a a1.wav a2.wav a3.wav
                           utterance_id_b b1.wav
                           utterance_id_c c1.wav c2.wav
                           ...

                        Note that audios of different lengths will be right-padded with np.nan to the longest audio in the sample.
                        A preprocessor must be used to remove these paddings.

                        "score":
                        Return text as is. The text contains tempo and note info.
                        For each note, 'start' 'end' 'syllable' 'midi' and 'phones' are included.

                           utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
                           utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file in which is written a sequence of integer numbers separated by space.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file in which is written a sequence of integer numbers separated by comma.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file in which is written a sequence of float numbers separated by space.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file in which is written a sequence of float numbers separated by comma.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "random_text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           hello world
                           foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader, currently supported for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
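The 'path,name,type' triple and a few of the text-based file types listed above can be illustrated with a small parsing sketch. All function names here are hypothetical and none of this is ESPnet's own reader; the sketch only assumes that each line starts with the utterance id followed by a space and the value.

```python
def parse_data_spec(spec):
    """Split 'some/path/a.scp,foo,sound' into (path, name, file_type)."""
    path, name, file_type = spec.split(",")
    return path, name, file_type


def read_scp(lines):
    """Map each utterance id (first column) to the rest of its line."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        table[key] = value.strip()
    return table


def parse_text_int(value):
    """'text_int' style value: integers separated by spaces."""
    return [int(v) for v in value.split()]


def parse_csv_int(value):
    """'csv_int' style value: integers separated by commas."""
    return [int(v) for v in value.split(",")]


print(parse_data_spec("some/path/a.scp,foo,sound"))

text_int = read_scp(["utterance_id_A 12 0 1 3", "utterance_id_B 3 3 1"])
print({k: parse_text_int(v) for k, v in text_int.items()})

csv_int = read_scp(["utterance_id_A 100,80", "utterance_id_B 143,80"])
print({k: parse_csv_int(v) for k, v in csv_int.items()})
```

In a real run the lines would come from the file named by the first field of the triple, and the second field would become the key of the corresponding entry in each mini-batch.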
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys for mini-batches, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept open for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --allow_multi_rates ALLOW_MULTI_RATES
                        Whether to allow audios to have different sampling rates (default: False)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)

Optimizer related:
  --exclude_weight_decay EXCLUDE_WEIGHT_DECAY
                        Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
  --exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
                        The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})
  --optim2 {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim2_conf OPTIM2_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler2 {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler2_conf SCHEDULER2_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --token_list TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --odim ODIM           The number of dimensions of the output feature (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {})

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn}
                        The text will be tokenized at the specified token level (default: phn)
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --feats_extract {fbank,log_spectrogram,linear_spectrogram}
                        The feats_extract type (default: linear_spectrogram)
  --feats_extract_conf FEATS_EXTRACT_CONF
                        The keyword arguments for feats_extract (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: None)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --tts {vits,joint_text2wav,jets}
                        The tts type (default: vits)
  --tts_conf TTS_CONF   The keyword arguments for tts (default: {})
  --pitch_extract {dio,None}
                        The pitch_extract type (default: None)
  --pitch_extract_conf PITCH_EXTRACT_CONF
                        The keyword arguments for pitch_extract (default: {})
  --pitch_normalize {global_mvn,utterance_mvn,None}
                        The pitch_normalize type (default: None)
  --pitch_normalize_conf PITCH_NORMALIZE_CONF
                        The keyword arguments for pitch_normalize (default: {})
  --energy_extract {energy,None}
                        The energy_extract type (default: None)
  --energy_extract_conf ENERGY_EXTRACT_CONF
                        The keyword arguments for energy_extract (default: {})
  --energy_normalize {global_mvn,utterance_mvn,None}
                        The energy_normalize type (default: None)
  --energy_normalize_conf ENERGY_NORMALIZE_CONF
                        The keyword arguments for energy_normalize (default: {})

hubert_train.py¶

usage: hubert_train.py [-h] [--config CONFIG] [--print_config]
                       [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                       [--drop_last_iter DROP_LAST_ITER] [--dry_run DRY_RUN]
                       [--iterator_type {sequence,category,chunk,task,none}]
                       [--valid_iterator_type {sequence,category,chunk,task,none}]
                       [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                       [--num_workers NUM_WORKERS]
                       [--num_att_plot NUM_ATT_PLOT]
                       [--dist_backend DIST_BACKEND]
                       [--dist_init_method DIST_INIT_METHOD]
                       [--dist_world_size DIST_WORLD_SIZE]
                       [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                       [--dist_master_addr DIST_MASTER_ADDR]
                       [--dist_master_port DIST_MASTER_PORT]
                       [--dist_launcher {slurm,mpi,None}]
                       [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                       [--unused_parameters UNUSED_PARAMETERS]
                       [--sharded_ddp SHARDED_DDP]
                       [--cudnn_enabled CUDNN_ENABLED]
                       [--cudnn_benchmark CUDNN_BENCHMARK]
                       [--cudnn_deterministic CUDNN_DETERMINISTIC]
                       [--collect_stats COLLECT_STATS]
                       [--write_collected_feats WRITE_COLLECTED_FEATS]
                       [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                       [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                       [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                       [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                       [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                       [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                       [--grad_clip GRAD_CLIP]
                       [--grad_clip_type GRAD_CLIP_TYPE]
                       [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                       [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                       [--train_dtype {float16,float32,float64}]
                       [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                       [--use_matplotlib USE_MATPLOTLIB]
                       [--use_tensorboard USE_TENSORBOARD]
                       [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                       [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                       [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                       [--wandb_name WANDB_NAME]
                       [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                       [--detect_anomaly DETECT_ANOMALY]
                       [--use_adapter USE_ADAPTER] [--adapter {lora,houlsby}]
                       [--save_strategy {all,adapter_only,required_grad_only}]
                       [--adapter_conf ADAPTER_CONF]
                       [--pretrain_path PRETRAIN_PATH]
                       [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                       [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                       [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                       [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                       [--batch_size BATCH_SIZE]
                       [--valid_batch_size VALID_BATCH_SIZE]
                       [--batch_bins BATCH_BINS]
                       [--valid_batch_bins VALID_BATCH_BINS]
                       [--train_shape_file TRAIN_SHAPE_FILE]
                       [--valid_shape_file VALID_SHAPE_FILE]
                       [--batch_type {unsorted,sorted,folded,length,numel}]
                       [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                       [--fold_length FOLD_LENGTH]
                       [--sort_in_batch {descending,ascending}]
                       [--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
                       [--sort_batch {descending,ascending}]
                       [--multiple_iterator MULTIPLE_ITERATOR]
                       [--chunk_length CHUNK_LENGTH]
                       [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                       [--num_cache_chunks NUM_CACHE_CHUNKS]
                       [--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
                       [--chunk_default_fs CHUNK_DEFAULT_FS]
                       [--chunk_max_abs_length CHUNK_MAX_ABS_LENGTH]
                       [--chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES]
                       [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                       [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                       [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                       [--max_cache_size MAX_CACHE_SIZE]
                       [--max_cache_fd MAX_CACHE_FD]
                       [--allow_multi_rates ALLOW_MULTI_RATES]
                       [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                       [--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
                       [--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
                       [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                       [--optim_conf OPTIM_CONF]
                       [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                       [--scheduler_conf SCHEDULER_CONF]
                       [--token_list TOKEN_LIST]
                       [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                       [--collate_fn_conf COLLATE_FN_CONF]
                       [--input_size INPUT_SIZE] [--num_classes NUM_CLASSES]
                       [--use_preprocessor USE_PREPROCESSOR]
                       [--token_type {bpe,char,word,phn}]
                       [--bpemodel BPEMODEL]
                       [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                       [--cleaner {None,tacotron,jaconv,vietnamese}]
                       [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                       [--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
                       [--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
                       [--noise_scp NOISE_SCP]
                       [--noise_apply_prob NOISE_APPLY_PROB]
                       [--noise_db_range NOISE_DB_RANGE]
                       [--pred_masked_weight PRED_MASKED_WEIGHT]
                       [--pred_nomask_weight PRED_NOMASK_WEIGHT]
                       [--loss_weights LOSS_WEIGHTS]
                       [--frontend {default,sliding_window}]
                       [--frontend_conf FRONTEND_CONF]
                       [--specaug {specaug,None}]
                       [--specaug_conf SPECAUG_CONF]
                       [--normalize {global_mvn,utterance_mvn,None}]
                       [--normalize_conf NORMALIZE_CONF]
                       [--preencoder {sinc,None}]
                       [--preencoder_conf PREENCODER_CONF]
                       [--encoder {hubert_pretrain,torchaudio_hubert}]
                       [--encoder_conf ENCODER_CONF]
                       [--model {fairseq,torchaudio}]
                       [--model_conf MODEL_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --pred_masked_weight PRED_MASKED_WEIGHT
                        weight for predictive loss for masked frames (default: 1.0)
  --pred_nomask_weight PRED_NOMASK_WEIGHT
                        weight for predictive loss for unmasked frames (default: 0.0)
  --loss_weights LOSS_WEIGHTS
                        weights for additional loss terms (not first one) (default: 0.0)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --drop_last_iter DROP_LAST_ITER
                        Exclude the minibatch with leftovers. (default: False)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: sequence)
  --valid_iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: None)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images to plot from the attention outputs. This option is meaningful only for attention-based models. Set it to 0 to disable the attention plot (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel  (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
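
The environment variables named under --dist_init_method can be illustrated with a small sketch. It assumes only what the help text states: with "env://", the values of MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK are read from the environment. The `DistEnv` type and `read_dist_env` helper are hypothetical names used for illustration and are not part of ESPnet.

```python
import os
from typing import Mapping, NamedTuple, Optional


class DistEnv(NamedTuple):
    """Hypothetical container for the values consulted with env://."""

    master_addr: str
    master_port: int
    world_size: int
    rank: int


def read_dist_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Read the four variables from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    names = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
    missing = [name for name in names if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {missing}")
    return DistEnv(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        world_size=int(env["WORLD_SIZE"]),
        rank=int(env["RANK"]),
    )


# Illustrative values only; a real job gets these from its launcher.
example = read_dist_env(
    {
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": "29500",
        "WORLD_SIZE": "2",
        "RANK": "0",
    }
)
print(example)
```

Checking these variables before launching a job is a cheap way to catch a misconfigured node early.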

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Perform on "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair referring to the phase, "train" or "valid", and the criterion name. The mode specifying "min" or "max" can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging early stopping. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of the used p-norm for gradient clip. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, the interval is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
  --use_adapter USE_ADAPTER
                        Enable efficient finetuning, see (https://arxiv.org/abs/2106.09685) for large pre-trained foundation models, like Whisper and SSL models (default: False)
  --adapter {lora,houlsby}
                        Adapter Name (default: lora)
  --save_strategy {all,adapter_only,required_grad_only}
                        The strategy to save parameters. Default: 'all'
                        'all': save all parameters
                        'adapter_only': save only adapter parameters, without other parameters like downstream model
                        'required_grad_only': save only parameters with requires_grad=True
                         (default: all)
  --adapter_conf ADAPTER_CONF
                        Configuration for efficient finetuning (default: {})
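
The (phase, criterion, mode) triples used by --best_model_criterion and the count given by --keep_nbest_models can be illustrated with a short sketch. It ranks epochs by one criterion and keeps the n best. The function name and the layout of the `stats` dictionary are assumptions made for illustration; they do not reproduce the trainer's internal reporter.

```python
from typing import Dict, List

# Hypothetical per-epoch statistics: epoch -> phase -> criterion -> value.
Stats = Dict[int, Dict[str, Dict[str, float]]]


def nbest_epochs(stats: Stats, phase: str, criterion: str, mode: str, n: int) -> List[int]:
    """Return the n best epochs for one (phase, criterion, mode) triple."""
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max': {mode}")
    scored = [
        (epoch, values[phase][criterion])
        for epoch, values in stats.items()
        if criterion in values.get(phase, {})
    ]
    # "min" ranks smaller values first, "max" ranks larger values first.
    scored.sort(key=lambda item: item[1], reverse=(mode == "max"))
    return [epoch for epoch, _ in scored[:n]]


stats = {
    1: {"train": {"loss": 2.0, "acc": 0.40}, "valid": {"loss": 2.2, "acc": 0.35}},
    2: {"train": {"loss": 1.5, "acc": 0.55}, "valid": {"loss": 1.8, "acc": 0.50}},
    3: {"train": {"loss": 1.2, "acc": 0.62}, "valid": {"loss": 1.9, "acc": 0.52}},
}
best_by_valid_loss = nbest_epochs(stats, "valid", "loss", "min", n=2)
best_by_valid_acc = nbest_epochs(stats, "valid", "acc", "max", n=2)
print(best_by_valid_loss, best_by_valid_acc)
```

Different criteria can rank epochs differently, which is why the default for --best_model_criterion lists several triples.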

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
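
The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format accepted by --init_param can be sketched as a small parser. This is a minimal illustration, not ESPnet's loader: `InitParamSpec` and `parse_init_param` are hypothetical names, and treating exclude_keys as a comma-separated list is an assumption.

```python
from typing import List, NamedTuple, Optional


class InitParamSpec(NamedTuple):
    """Hypothetical parsed form of one --init_param value."""

    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: List[str]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>'."""
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':'-separated fields: {spec}")
    # Trailing fields are optional; pad them with empty strings.
    parts += [""] * (4 - len(parts))
    file_path, src_key, dst_key, exclude = parts
    return InitParamSpec(
        file_path=file_path,
        src_key=src_key or None,
        dst_key=dst_key or None,
        # Assumption: several excluded keys are separated by commas.
        exclude_keys=exclude.split(",") if exclude else [],
    )


print(parse_init_param("some/where/model.pth"))
print(parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed"))
```

Omitted trailing fields become None (or an empty list), which corresponds to the "load all parameters" case in the help text.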

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', 'folded', or 'catbel'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        Only the first column is used, so a 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is used, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L refers to the largest length of the samples in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches which have, as far as possible, the same number of 'bins', counted as the total length of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is used, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches which have, as far as possible, the same number of 'bins', but counted as the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --shuffle_within_batch SHUFFLE_WITHIN_BATCH
                        Shuffle whole batches sample-wise. Normally required for classification tasks. (default: False)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
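
The formula given for the "folded" batch type, batch_size = base_batch_size // (L // fold_length), can be worked through numerically. The sketch below applies the formula as written in the help text, with two labeled assumptions: the divisor is floored at 1 so that samples shorter than fold_length do not divide by zero, and the result never drops below a hypothetical `min_batch_size`. The actual FoldedBatchSampler may differ in these details.

```python
def folded_batch_size(
    base_batch_size: int, longest: int, fold_length: int, min_batch_size: int = 1
) -> int:
    """Apply batch_size = base_batch_size // (L // fold_length)."""
    if base_batch_size < 1 or fold_length < 1 or longest < 0:
        raise ValueError("sizes must be positive and lengths non-negative")
    # Assumption: keep the full batch size when L < fold_length.
    folds = max(1, longest // fold_length)
    # Assumption: never shrink below min_batch_size.
    return max(min_batch_size, base_batch_size // folds)


# With base_batch_size=20 and fold_length=800, longer samples shrink the batch.
for longest in (100, 800, 1600, 4000):
    print(longest, folded_batch_size(20, longest, 800))
```

Under these assumptions the batch size stays at 20 up to length 1599, halves to 10 at length 1600, and drops to 4 at length 4000.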

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
  --chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
                        List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
                          - exactly match one of the prefixes in `chunk_excluded_key_prefixes`
                          - have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
  --chunk_default_fs CHUNK_DEFAULT_FS
                        Default sampling rate used for the chunk length. Will be used to adaptively adjust the chunk length for data of different sampling rates. (If None, the chunk length will be fixed.) (default: None)
  --chunk_max_abs_length CHUNK_MAX_ABS_LENGTH
                        Maximum number of samples per chunk for all sampling rates (default: None)
  --chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES
                        Discard samples shorter than the minimum chunk length (default: True)
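
The --chunk_length syntax ('300', '300,400,500', or '300-400') and the effect of --chunk_shift_ratio can be sketched as follows. This is an illustration under stated assumptions, not the ChunkIterFactory code: the range '300-400' is taken to include both ends, the shift is computed as int(chunk_length * ratio), and chunks start at offset 0. Both function names are hypothetical.

```python
from typing import List


def parse_chunk_length(spec: str) -> List[int]:
    """Expand a chunk length specification into candidate lengths."""
    lengths: List[int] = []
    for item in spec.split(","):
        bounds = [int(value) for value in item.split("-")]
        if len(bounds) == 1:
            lengths.append(bounds[0])
        elif len(bounds) == 2 and bounds[0] <= bounds[1]:
            # Assumption: the range includes both end points.
            lengths.extend(range(bounds[0], bounds[1] + 1))
        else:
            raise ValueError(f"invalid chunk length item: {item}")
    return lengths


def chunk_starts(seq_len: int, chunk_length: int, shift_ratio: float) -> List[int]:
    """Start offsets of the chunks cut from a sequence of seq_len frames."""
    shift = max(1, int(chunk_length * shift_ratio))
    # An empty list means the sequence is shorter than the chunk length.
    return list(range(0, seq_len - chunk_length + 1, shift))


print(parse_chunk_length("300,400,500"))
print(len(parse_chunk_length("300-400")))
print(chunk_starts(1200, 500, 0.5))
print(chunk_starts(300, 500, 0.5))
```

With a ratio of 0.5 consecutive chunks overlap by half; a sequence shorter than the chunk length yields no chunks, matching the note that such samples are discarded.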

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "multi_columns_sound":
                        Enable a multi-column wav.scp. The following text file can be loaded as multi-channel audio data

                           utterance_id_a a.wav a2.wav
                           utterance_id_b b.wav b2.wav
                           ...

                        "variable_columns_sound":
                        Loading variable numbers (columns) of audios in wav.scp. The following text file can be loaded as stacked audio data

                           utterance_id_a a1.wav a2.wav a3.wav
                           utterance_id_b b1.wav
                           utterance_id_c c1.wav c2.wav
                           ...

                        Note that audios of different lengths will be right-padded with np.nan to the longest audio in the sample.
                        A preprocessor must be used to remove these paddings.

                        "score":
                        Return text as is. The text contains tempo and note info.
                        For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.

                           utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
                           utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file in which is written a sequence of integer numbers separated by space.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file in which is written a sequence of integer numbers separated by comma.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file in which is written a sequence of float numbers separated by space.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file in which is written a sequence of float numbers separated by comma.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "random_text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           hello world
                           foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader, currently supported for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys in the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --allow_multi_rates ALLOW_MULTI_RATES
                        Whether to allow audios to have different sampling rates (default: False)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, the 5 percent size of --max_cache_size (default: None)
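
The 'path,name,type' triple and two of the simpler file types described under --train_data_path_and_name_and_type can be sketched with the standard library. The helpers below are hypothetical illustrations of the documented formats (the comma-separated triple, "csv_int" lines such as 'utterance_id_A 100,80', and the 'rand_int_\d+_\d+' type name); they are not ESPnet's dataset loaders.

```python
import re
from typing import Dict, Iterable, List, Tuple


def parse_data_triple(value: str) -> Tuple[str, str, str]:
    """Split 'some/path/a.scp,foo,sound' into (path, name, type)."""
    parts = value.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected 'path,name,type': {value}")
    return parts[0], parts[1], parts[2]


def read_csv_int(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Parse 'csv_int' lines: an id, then comma-separated integers."""
    table: Dict[str, List[int]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, value = line.split(maxsplit=1)
        table[key] = [int(v) for v in value.split(",")]
    return table


def rand_int_bounds(file_type: str) -> Tuple[int, int]:
    """Extract the lower and upper values from e.g. 'rand_int_0_10'."""
    match = re.fullmatch(r"rand_int_(\d+)_(\d+)", file_type)
    if match is None:
        raise ValueError(f"not a rand_int type: {file_type}")
    return int(match.group(1)), int(match.group(2))


print(parse_data_triple("some/path/a.scp,foo,sound"))
print(read_csv_int(["utterance_id_A 100,80", "utterance_id_B 143,80"]))
print(rand_int_bounds("rand_int_0_10"))
```

Because the triple is split on commas, a file path that itself contains a comma would not parse with this sketch.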

Optimizer related:
  --exclude_weight_decay EXCLUDE_WEIGHT_DECAY
                        Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
  --exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
                        The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --token_list TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --collate_fn_conf COLLATE_FN_CONF
                        The keyword arguments for collate_fn class. (default: {})
  --input_size INPUT_SIZE
                        The number of input dimensions of the feature (default: None)
  --num_classes NUM_CLASSES
                        The number of classes in hubert (default: None)

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn}
                        The text will be tokenized at the specified token level (default: bpe)
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)
  --speech_volume_normalize SPEECH_VOLUME_NORMALIZE
                        Scale the maximum amplitude to the given value. (default: None)
  --rir_scp RIR_SCP     The file path of rir scp file. (default: None)
  --rir_apply_prob RIR_APPLY_PROB
                        The probability of applying RIR convolution. (default: 1.0)
  --noise_scp NOISE_SCP
                        The file path of noise scp file. (default: None)
  --noise_apply_prob NOISE_APPLY_PROB
                        The probability of adding noise. (default: 1.0)
  --noise_db_range NOISE_DB_RANGE
                        The range of noise decibel level. (default: 13_15)
  --frontend {default,sliding_window}
                        The frontend type (default: default)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: utterance_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --preencoder {sinc,None}
                        The preencoder type (default: None)
  --preencoder_conf PREENCODER_CONF
                        The keyword arguments for preencoder (default: {})
  --encoder {hubert_pretrain,torchaudio_hubert}
                        The encoder type (default: hubert_pretrain)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --model {fairseq,torchaudio}
                        The model type (default: fairseq)
  --model_conf MODEL_CONF
                        The keyword arguments for model (default: {})
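
The default of --noise_db_range, '13_15', suggests a 'low_high' pair joined by an underscore. A minimal sketch of reading such a value is shown below. The function name is hypothetical, and accepting a single number as a degenerate range is an assumption rather than documented behavior.

```python
from typing import Tuple


def parse_noise_db_range(value: str) -> Tuple[float, float]:
    """Parse a 'low_high' string such as '13_15' into two floats."""
    parts = value.split("_")
    if len(parts) == 1:
        # Assumption: a single number means low == high.
        low = high = float(parts[0])
    elif len(parts) == 2:
        low, high = float(parts[0]), float(parts[1])
    else:
        raise ValueError(f"expected 'low_high': {value}")
    if low > high:
        raise ValueError(f"low must not exceed high: {value}")
    return low, high


print(parse_noise_db_range("13_15"))
```

Using an underscore as the separator leaves '-' free for negative values, so a string like '-5_5' still parses.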

hugging_face_export_vocabulary.py¶

usage: hugging_face_export_vocabulary.py [-h]
                                         [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                                         --output OUTPUT --model_name_or_path
                                         MODEL_NAME_OR_PATH
                                         [--add_symbol ADD_SYMBOL]

Export Hugging Face vocabulary

optional arguments:
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output OUTPUT, -o OUTPUT
                        Output text. - indicates sys.stdout (default: None)
  --model_name_or_path MODEL_NAME_OR_PATH
                        Hugging Face model name or path (default: None)
  --add_symbol ADD_SYMBOL
                        Append symbol e.g. --add_symbol '<blank>:0'
                        --add_symbol '<unk>:1' (default: [])
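
The 'symbol:position' form used by --add_symbol, e.g. '<blank>:0' and '<unk>:1', can be illustrated with a list insertion. This sketch assumes non-negative positions and applies the specifications in the order given; `add_symbols` is a hypothetical helper, not the script's own code.

```python
from typing import List


def add_symbols(vocab: List[str], specs: List[str]) -> List[str]:
    """Insert each 'symbol:position' into a copy of the vocabulary."""
    result = list(vocab)
    for spec in specs:
        # Split on the last ':' so the symbol itself may contain ':'.
        symbol, _, position = spec.rpartition(":")
        if not symbol:
            raise ValueError(f"expected 'symbol:position': {spec}")
        index = int(position)
        if index < 0:
            # Assumption: this sketch handles non-negative positions only.
            raise ValueError(f"negative position not supported here: {spec}")
        result.insert(index, symbol)
    return result


print(add_symbols(["a", "b"], ["<blank>:0", "<unk>:1"]))
```

Inserting '<blank>' at 0 and then '<unk>' at 1 shifts the original tokens to positions 2 and onward.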

launch.py¶

usage: launch.py [-h] [--cmd CMD] [--log LOG]
                 [--max_num_log_files MAX_NUM_LOG_FILES] [--ngpu NGPU]
                 [--num_nodes NUM_NODES | --host HOST] [--envfile ENVFILE]
                 [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                 [--master_port MASTER_PORT] [--master_addr MASTER_ADDR]
                 [--init_file_prefix INIT_FILE_PREFIX]
                 args [args ...]

Launch distributed process with appropriate options.

positional arguments:
  args

optional arguments:
  --cmd CMD             The path of cmd script of Kaldi: run.pl, queue.pl, or
                        slurm.pl (default: utils/run.pl)
  --log LOG             The path of log file used by cmd (default: run.log)
  --max_num_log_files MAX_NUM_LOG_FILES
                        The maximum number of log-files to be kept (default:
                        1000)
  --ngpu NGPU           The number of GPUs per node (default: 1)
  --num_nodes NUM_NODES
                        The number of nodes (default: 1)
  --host HOST           Directly specify the host names. The jobs are submitted
                        via SSH. Multiple host names can be specified by
                        separating them with commas, e.g. host1,host2. You can
                        also specify the device ids after the host name with
                        ':', e.g. host1:0:2:3,host2:0:2. If the device ids are
                        specified in this way, the value of --ngpu is ignored.
                        (default: None)
  --envfile ENVFILE     Source the shell script before executing command. This
                        option is used when --host is specified. (default:
                        path.sh)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use the distributed method in single-node mode.
                        (default: True)
  --master_port MASTER_PORT
                        Specify the port number of the master. The master is
                        the host machine that runs the RANK0 process.
                        (default: None)
  --master_addr MASTER_ADDR
                        Specify the address of the master. The master is the
                        host machine that runs the RANK0 process. (default:
                        None)
  --init_file_prefix INIT_FILE_PREFIX
                        The file name prefix for init_file, which is used for
                        'Shared-file system initialization'. This option is
                        used when --port is not specified (default:
                        .dist_init_)
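
The --host format described above ('host1:0:2:3,host2:0:2') can be read as a comma-separated list of hosts, each optionally followed by ':'-separated device ids. A minimal parsing sketch under that reading; parse_host_spec is a hypothetical name and this is not the code used by launch.py.

```python
from typing import Dict, List, Optional


def parse_host_spec(spec: str) -> Dict[str, Optional[List[int]]]:
    """Parse 'host1:0:2:3,host2:0:2' into {host: device ids or None}."""
    hosts: Dict[str, Optional[List[int]]] = {}
    for item in spec.split(","):
        name, *device_ids = item.strip().split(":")
        # No ':' part means no explicit device ids for this host.
        hosts[name] = [int(d) for d in device_ids] if device_ids else None
    return hosts


print(parse_host_spec("host1:0:2:3,host2:0:2"))
print(parse_host_spec("host1,host2"))
```

A value of None here stands for "no device ids given", the case in which --ngpu would still apply.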

lm_calc_perplexity.py¶

usage: lm_calc_perplexity.py [-h] [--config CONFIG]
                             [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                             --output_dir OUTPUT_DIR [--ngpu NGPU]
                             [--seed SEED] [--dtype {float16,float32,float64}]
                             [--num_workers NUM_WORKERS]
                             [--batch_size BATCH_SIZE] [--log_base LOG_BASE]
                             --data_path_and_name_and_type
                             DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                             [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                             [--train_config TRAIN_CONFIG]
                             [--model_file MODEL_FILE]

Calculate perplexity

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --log_base LOG_BASE   The base of the logarithm for perplexity. If None,
                        Napier's constant (e) is used. (default: None)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --train_config TRAIN_CONFIG
  --model_file MODEL_FILE
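
As background for --log_base, the following sketch computes perplexity from per-token probabilities with a configurable logarithm base. It is a textbook formulation written for illustration, not the code of lm_calc_perplexity.py; the function name and the sample probabilities are made up. When the same base is used for the logarithm and the exponent, the perplexity value itself does not depend on the base.

```python
import math
from typing import Optional, Sequence


def perplexity(token_probs: Sequence[float], log_base: Optional[float] = None) -> float:
    """Perplexity from per-token probabilities.

    log_base=None means the natural logarithm (base e).
    """
    base = math.e if log_base is None else log_base
    # Average negative log-likelihood per token, measured in the chosen base.
    avg_nll = -sum(math.log(p, base) for p in token_probs) / len(token_probs)
    # Raising the base to the average NLL undoes the logarithm.
    return base ** avg_nll


probs = [0.25, 0.5, 0.125]
print(perplexity(probs), perplexity(probs, log_base=2))
```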

lm_train.py¶

usage: lm_train.py [-h] [--config CONFIG] [--print_config]
                   [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                   [--drop_last_iter DROP_LAST_ITER] [--dry_run DRY_RUN]
                   [--iterator_type {sequence,category,chunk,task,none}]
                   [--valid_iterator_type {sequence,category,chunk,task,none}]
                   [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                   [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                   [--dist_backend DIST_BACKEND]
                   [--dist_init_method DIST_INIT_METHOD]
                   [--dist_world_size DIST_WORLD_SIZE] [--dist_rank DIST_RANK]
                   [--local_rank LOCAL_RANK]
                   [--dist_master_addr DIST_MASTER_ADDR]
                   [--dist_master_port DIST_MASTER_PORT]
                   [--dist_launcher {slurm,mpi,None}]
                   [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                   [--unused_parameters UNUSED_PARAMETERS]
                   [--sharded_ddp SHARDED_DDP] [--cudnn_enabled CUDNN_ENABLED]
                   [--cudnn_benchmark CUDNN_BENCHMARK]
                   [--cudnn_deterministic CUDNN_DETERMINISTIC]
                   [--collect_stats COLLECT_STATS]
                   [--write_collected_feats WRITE_COLLECTED_FEATS]
                   [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                   [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                   [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                   [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                   [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                   [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                   [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                   [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                   [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                   [--train_dtype {float16,float32,float64}]
                   [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                   [--use_matplotlib USE_MATPLOTLIB]
                   [--use_tensorboard USE_TENSORBOARD]
                   [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                   [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                   [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                   [--wandb_name WANDB_NAME]
                   [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                   [--detect_anomaly DETECT_ANOMALY]
                   [--use_adapter USE_ADAPTER] [--adapter {lora,houlsby}]
                   [--save_strategy {all,adapter_only,required_grad_only}]
                   [--adapter_conf ADAPTER_CONF]
                   [--pretrain_path PRETRAIN_PATH]
                   [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                   [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                   [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                   [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                   [--batch_size BATCH_SIZE]
                   [--valid_batch_size VALID_BATCH_SIZE]
                   [--batch_bins BATCH_BINS]
                   [--valid_batch_bins VALID_BATCH_BINS]
                   [--train_shape_file TRAIN_SHAPE_FILE]
                   [--valid_shape_file VALID_SHAPE_FILE]
                   [--batch_type {unsorted,sorted,folded,length,numel}]
                   [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                   [--fold_length FOLD_LENGTH]
                   [--sort_in_batch {descending,ascending}]
                   [--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
                   [--sort_batch {descending,ascending}]
                   [--multiple_iterator MULTIPLE_ITERATOR]
                   [--chunk_length CHUNK_LENGTH]
                   [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                   [--num_cache_chunks NUM_CACHE_CHUNKS]
                   [--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
                   [--chunk_default_fs CHUNK_DEFAULT_FS]
                   [--chunk_max_abs_length CHUNK_MAX_ABS_LENGTH]
                   [--chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES]
                   [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                   [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                   [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                   [--max_cache_size MAX_CACHE_SIZE]
                   [--max_cache_fd MAX_CACHE_FD]
                   [--allow_multi_rates ALLOW_MULTI_RATES]
                   [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                   [--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
                   [--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
                   [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                   [--optim_conf OPTIM_CONF]
                   [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                   [--scheduler_conf SCHEDULER_CONF] [--token_list TOKEN_LIST]
                   [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                   [--use_preprocessor USE_PREPROCESSOR]
                   [--token_type {bpe,char,word}] [--bpemodel BPEMODEL]
                   [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                   [--cleaner {None,tacotron,jaconv,vietnamese}]
                   [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                   [--lm {seq_rnn,transformer,transformer_opt}]
                   [--lm_conf LM_CONF] [--model {lm,lm_multitask}]
                   [--model_conf MODEL_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --drop_last_iter DROP_LAST_ITER
                        Exclude the minibatch with leftovers. (default: False)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: sequence)
  --valid_iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: None)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images for plotting attention outputs. This option is meaningful only for attention-based models. Setting it to 0 disables the attention plot (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
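
With init_method="env://", four environment variables are involved: MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK. The sketch below only checks that they are present and collects them. read_env_init is a hypothetical helper, and the address and port in the example are placeholder values, not defaults of lm_train.py.

```python
import os
from typing import Dict, Mapping

REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def read_env_init(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the variables referenced by init_method='env://'."""
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {missing}")
    return {name: env[name] for name in REQUIRED_VARS}


# Placeholder values for a two-process job on one machine.
example_env = {
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
    "WORLD_SIZE": "2",
    "RANK": "0",
}
print(read_env_init(example_env))
```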

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Perform on "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair consisting of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging early stopping. Give a triple consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give a triple consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of the used p-norm for gradient clip. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every this number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
  --use_adapter USE_ADAPTER
                        Enable efficient finetuning (see https://arxiv.org/abs/2106.09685) for large pre-trained foundation models, such as Whisper and SSL models (default: False)
  --adapter {lora,houlsby}
                        Adapter Name (default: lora)
  --save_strategy {all,adapter_only,required_grad_only}
                        The strategy to save parameters. Default: 'all'
                        'all': save all parameters
                        'adapter_only': save only adapter parameters, without other parameters like downstream model
                        'required_grad_only': save only parameters with requires_grad=True
                         (default: all)
  --adapter_conf ADAPTER_CONF
                        Configuration for efficient finetuning (default: {})
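
The criterion options above combine a phase, a criterion name, and a mode. A small sketch of how such a triple could select the best epoch from recorded values. The stats layout, the function name best_epoch, and the numbers are invented for illustration and do not reflect the trainer's internal data structures.

```python
from typing import Dict, Tuple

# Hypothetical per-epoch results: epoch -> {(phase, criterion): value}
Stats = Dict[int, Dict[Tuple[str, str], float]]


def best_epoch(stats: Stats, phase: str, criterion: str, mode: str) -> int:
    """Return the epoch whose (phase, criterion) value is best under mode."""
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max': {mode}")
    pick = min if mode == "min" else max
    return pick(stats, key=lambda epoch: stats[epoch][(phase, criterion)])


stats = {
    1: {("valid", "loss"): 2.1, ("valid", "acc"): 0.40},
    2: {("valid", "loss"): 1.7, ("valid", "acc"): 0.55},
    3: {("valid", "loss"): 1.9, ("valid", "acc"): 0.58},
}
print(best_epoch(stats, "valid", "loss", "min"))
print(best_epoch(stats, "valid", "acc", "max"))
```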

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
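
The --init_param format '<file_path>:<src_key>:<dst_key>:<exclude_keys>' can be split into its fields as sketched below. The InitParam type and parse_init_param are hypothetical names. Treating exclude_keys as a comma-separated list is an assumption that the help text does not confirm, and paths that themselves contain ':' are not handled.

```python
from typing import List, NamedTuple, Optional


class InitParam(NamedTuple):
    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: List[str]


def parse_init_param(spec: str) -> InitParam:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>'.

    Trailing fields are optional.
    """
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':'-separated fields: {spec}")
    # Pad missing trailing fields with empty strings.
    path, src, dst, excludes = parts + [""] * (4 - len(parts))
    # Assumption: several exclude keys are separated by commas.
    exclude_keys = [k for k in excludes.split(",") if k]
    return InitParam(path, src or None, dst or None, exclude_keys)


print(parse_init_param("some/where/model.pth"))
print(parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed"))
```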

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', 'folded', or 'catbel'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The first column is used, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in each mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is the largest length of the samples in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that have, as far as possible, the same number of 'bins', counted by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches that have, as far as possible, the same number of 'bins', counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --shuffle_within_batch SHUFFLE_WITHIN_BATCH
                        Shuffle whole batches sample-wise. Normally required for classification tasks. (default: False)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
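
The "folded" formula and the 'bins' counting used by "length" and "numel" can be illustrated with two small functions. This is a sketch of the arithmetic described in the help text, not the samplers' implementation. The function names are hypothetical, and clamping the divisor and the result to at least 1 is an assumption added here so that samples shorter than fold_length do not cause a division by zero.

```python
from typing import Sequence


def folded_batch_size(base_batch_size: int, max_length: int, fold_length: int) -> int:
    """batch_size = base_batch_size // (L // fold_length), with L = max_length."""
    # Assumption: clamp the divisor to 1 so short samples (L < fold_length)
    # keep the base batch size instead of dividing by zero.
    divisor = max(1, max_length // fold_length)
    # Assumption: never return a batch size below 1.
    return max(1, base_batch_size // divisor)


def count_bins(shape: Sequence[int], batch_type: str) -> int:
    """Bins contributed by one sample: its length, or its number of elements."""
    if batch_type == "length":
        return shape[0]
    if batch_type == "numel":
        total = 1
        for dim in shape:
            total *= dim
        return total
    raise ValueError(f"unsupported batch_type: {batch_type}")


print(folded_batch_size(20, 1453, 500))
print(count_bins((1000, 80), "length"), count_bins((1000, 80), "numel"))
```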

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
  --chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
                        List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
                          - exactly match one of the prefixes in `chunk_excluded_key_prefixes`
                          - have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
  --chunk_default_fs CHUNK_DEFAULT_FS
                        Default sampling rate used for the chunk length. Will be used to adaptively adjust the chunk length for data of different sampling rates. (If None, the chunk length will be fixed.) (default: None)
  --chunk_max_abs_length CHUNK_MAX_ABS_LENGTH
                        Maximum number of samples per chunk for all sampling rates (default: None)
  --chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES
                        Discard samples shorter than the minimum chunk length (default: True)
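
The --chunk_length syntax and the effect of --chunk_shift_ratio can be sketched as follows. The function names are hypothetical and this is not the ChunkIterFactory code. Treating a 'low-high' range as inclusive of both endpoints, and computing the shift as int(chunk_length * shift_ratio), are assumptions made for the illustration.

```python
import random
from typing import List


def parse_chunk_lengths(spec: str) -> List[int]:
    """Expand '300', '300,400,500', or '300-400' into candidate lengths."""
    lengths: List[int] = []
    for part in spec.split(","):
        if "-" in part:
            low, high = (int(x) for x in part.split("-"))
            # Assumption: a range includes both endpoints.
            lengths.extend(range(low, high + 1))
        else:
            lengths.append(int(part))
    return lengths


def chunk_starts(seq_length: int, chunk_length: int, shift_ratio: float) -> List[int]:
    """Start offsets of all full chunks that fit in the sequence."""
    shift = max(1, int(chunk_length * shift_ratio))
    return list(range(0, seq_length - chunk_length + 1, shift))


# Keep only lengths that fit, then pick one at random for this sample.
candidates = [c for c in parse_chunk_lengths("300,400,500") if c <= 1000]
length = random.choice(candidates)
print(length, chunk_starts(1000, length, 0.5))
```

With a ratio of 0.5, consecutive chunks share half of their frames; with a ratio of 2.0, every other chunk-sized span is skipped.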

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path; the second, foo, is the key name used for the mini-batch data; and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "multi_columns_sound":
                        Enable multi-column wav.scp. The following text file can be loaded as multi-channel audio data

                           utterance_id_a a.wav a2.wav
                           utterance_id_b b.wav b2.wav
                           ...

                        "variable_columns_sound":
                        Loading variable numbers (columns) of audios in wav.scp. The following text file can be loaded as stacked audio data

                           utterance_id_a a1.wav a2.wav a3.wav
                           utterance_id_b b1.wav
                           utterance_id_c c1.wav c2.wav
                           ...

                        Note that audios of different lengths will be right-padded with np.nan to the longest audio in the sample.
                        A preprocessor must be used to remove these paddings.

                        "score":
                        Return text as is. The text contains tempo and note info.
                        For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.

                           utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
                           utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integer numbers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integer numbers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "random_text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           hello world
                           foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader, currently supported for speaker diarization.

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys in the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors kept open for ark files. This feature is only valid when the data type is 'kaldi_ark'. (default: 32)
  --allow_multi_rates ALLOW_MULTI_RATES
                        Whether to allow audios to have different sampling rates (default: False)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
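
The line-oriented file types listed above ("text_int", "csv_int", "text_float", "csv_float") share one layout: an utterance ID, a space, then the values. The following is a minimal sketch of how such lines could be parsed, not ESPnet's own loader; the function name parse_line is hypothetical and chosen only for illustration.

```python
from typing import List, Tuple, Union

Number = Union[int, float]


def parse_line(line: str, file_type: str) -> Tuple[str, List[Number]]:
    """Split one line into (utterance_id, values) for the simple text types."""
    # The utterance ID is everything before the first space.
    utt_id, _, rest = line.strip().partition(" ")
    if file_type == "text_int":
        return utt_id, [int(v) for v in rest.split()]
    if file_type == "csv_int":
        return utt_id, [int(v) for v in rest.split(",")]
    if file_type == "text_float":
        return utt_id, [float(v) for v in rest.split()]
    if file_type == "csv_float":
        return utt_id, [float(v) for v in rest.split(",")]
    raise ValueError(f"unsupported file type: {file_type}")


print(parse_line("utterance_id_A 12 0 1 3", "text_int"))
print(parse_line("utterance_id_B 3.,3.12,1.1", "csv_float"))
```

Note that values such as "3." and "12." are valid float literals, so the examples from the help text parse without extra handling.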

Optimizer related:
  --exclude_weight_decay EXCLUDE_WEIGHT_DECAY
                        Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
  --exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
                        The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --token_list TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word}
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --lm {seq_rnn,transformer,transformer_opt}
                        The lm type (default: seq_rnn)
  --lm_conf LM_CONF     The keyword arguments for lm (default: {})
  --model {lm,lm_multitask}
                        The model type (default: lm)
  --model_conf MODEL_CONF
                        The keyword arguments for model (default: {})

mt_inference.py¶

usage: mt_inference.py [-h] [--config CONFIG]
                       [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                       --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                       [--dtype {float16,float32,float64}]
                       [--num_workers NUM_WORKERS]
                       --data_path_and_name_and_type
                       DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                       [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                       [--mt_train_config MT_TRAIN_CONFIG]
                       [--mt_model_file MT_MODEL_FILE]
                       [--lm_train_config LM_TRAIN_CONFIG] [--lm_file LM_FILE]
                       [--word_lm_train_config WORD_LM_TRAIN_CONFIG]
                       [--word_lm_file WORD_LM_FILE] [--ngram_file NGRAM_FILE]
                       [--model_tag MODEL_TAG] [--batch_size BATCH_SIZE]
                       [--nbest NBEST] [--beam_size BEAM_SIZE]
                       [--penalty PENALTY] [--maxlenratio MAXLENRATIO]
                       [--minlenratio MINLENRATIO] [--ctc_weight CTC_WEIGHT]
                       [--lm_weight LM_WEIGHT] [--ngram_weight NGRAM_WEIGHT]
                       [--token_type {char,bpe,None}] [--bpemodel BPEMODEL]
                       [--normalize_length NORMALIZE_LENGTH]

MT Decoding

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --mt_train_config MT_TRAIN_CONFIG
                        MT training configuration (default: None)
  --mt_model_file MT_MODEL_FILE
                        MT model parameter file (default: None)
  --lm_train_config LM_TRAIN_CONFIG
                        LM training configuration (default: None)
  --lm_file LM_FILE     LM parameter file (default: None)
  --word_lm_train_config WORD_LM_TRAIN_CONFIG
                        Word LM training configuration (default: None)
  --word_lm_file WORD_LM_FILE
                        Word LM parameter file (default: None)
  --ngram_file NGRAM_FILE
                        N-gram parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If this option is specified,
                        *_train_config and *_file will be overwritten
                        (default: None)

Beam-search related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --nbest NBEST         Output N-best hypotheses (default: 1)
  --beam_size BEAM_SIZE
                        Beam size (default: 20)
  --penalty PENALTY     Insertion penalty (default: 0.0)
  --maxlenratio MAXLENRATIO
                        Input length ratio to obtain max output length. If
                        maxlenratio=0.0 (default), it uses an end-detect
                        function to automatically find maximum hypothesis
                        lengths. If maxlenratio<0.0, its absolute value is
                        interpreted as a constant max output length (default:
                        0.0)
  --minlenratio MINLENRATIO
                        Input length ratio to obtain min output length
                        (default: 0.0)
  --ctc_weight CTC_WEIGHT
                        CTC weight in joint decoding (default: 0.0)
  --lm_weight LM_WEIGHT
                        RNNLM weight (default: 1.0)
  --ngram_weight NGRAM_WEIGHT
                        ngram weight (default: 0.9)

Text converter related:
  --token_type {char,bpe,None}
                        The token type for MT model. If not given, it is taken
                        from the training args (default: None)
  --bpemodel BPEMODEL   The model path of sentencepiece. If not given, it is
                        taken from the training args (default: None)
  --normalize_length NORMALIZE_LENGTH
                        If true, pruning is based on length-normalized scores
                        (default: False)
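
The three cases of --maxlenratio described above, together with --minlenratio, can be sketched as a small helper. This is an illustration under stated assumptions rather than the decoder's actual code: the function name output_length_limits is hypothetical, and using the input length as the hard cap when maxlenratio is 0.0 (with end detection expected to stop earlier) is an assumption of this sketch.

```python
from typing import Tuple


def output_length_limits(
    input_length: int, maxlenratio: float, minlenratio: float
) -> Tuple[int, int]:
    """Return (min_len, max_len) for the output hypothesis."""
    if maxlenratio == 0.0:
        # Assumption: cap at the input length and rely on end detection
        # to terminate the search before that cap is reached.
        max_len = input_length
    elif maxlenratio < 0.0:
        # A negative value is read as a constant maximum output length.
        max_len = int(-maxlenratio)
    else:
        # A positive value scales the input length.
        max_len = max(1, int(maxlenratio * input_length))
    min_len = int(minlenratio * input_length)
    return min_len, max_len


print(output_length_limits(100, 0.0, 0.0))
print(output_length_limits(100, -30, 0.0))
print(output_length_limits(100, 0.5, 0.25))
```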

mt_train.py¶

usage: mt_train.py [-h] [--config CONFIG] [--print_config]
                   [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                   [--drop_last_iter DROP_LAST_ITER] [--dry_run DRY_RUN]
                   [--iterator_type {sequence,category,chunk,task,none}]
                   [--valid_iterator_type {sequence,category,chunk,task,none}]
                   [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                   [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                   [--dist_backend DIST_BACKEND]
                   [--dist_init_method DIST_INIT_METHOD]
                   [--dist_world_size DIST_WORLD_SIZE] [--dist_rank DIST_RANK]
                   [--local_rank LOCAL_RANK]
                   [--dist_master_addr DIST_MASTER_ADDR]
                   [--dist_master_port DIST_MASTER_PORT]
                   [--dist_launcher {slurm,mpi,None}]
                   [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                   [--unused_parameters UNUSED_PARAMETERS]
                   [--sharded_ddp SHARDED_DDP] [--cudnn_enabled CUDNN_ENABLED]
                   [--cudnn_benchmark CUDNN_BENCHMARK]
                   [--cudnn_deterministic CUDNN_DETERMINISTIC]
                   [--collect_stats COLLECT_STATS]
                   [--write_collected_feats WRITE_COLLECTED_FEATS]
                   [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                   [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                   [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                   [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                   [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                   [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                   [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                   [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                   [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                   [--train_dtype {float16,float32,float64}]
                   [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                   [--use_matplotlib USE_MATPLOTLIB]
                   [--use_tensorboard USE_TENSORBOARD]
                   [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                   [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                   [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                   [--wandb_name WANDB_NAME]
                   [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                   [--detect_anomaly DETECT_ANOMALY]
                   [--use_adapter USE_ADAPTER] [--adapter {lora,houlsby}]
                   [--save_strategy {all,adapter_only,required_grad_only}]
                   [--adapter_conf ADAPTER_CONF]
                   [--pretrain_path PRETRAIN_PATH]
                   [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                   [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                   [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                   [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                   [--batch_size BATCH_SIZE]
                   [--valid_batch_size VALID_BATCH_SIZE]
                   [--batch_bins BATCH_BINS]
                   [--valid_batch_bins VALID_BATCH_BINS]
                   [--train_shape_file TRAIN_SHAPE_FILE]
                   [--valid_shape_file VALID_SHAPE_FILE]
                   [--batch_type {unsorted,sorted,folded,length,numel}]
                   [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                   [--fold_length FOLD_LENGTH]
                   [--sort_in_batch {descending,ascending}]
                   [--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
                   [--sort_batch {descending,ascending}]
                   [--multiple_iterator MULTIPLE_ITERATOR]
                   [--chunk_length CHUNK_LENGTH]
                   [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                   [--num_cache_chunks NUM_CACHE_CHUNKS]
                   [--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
                   [--chunk_default_fs CHUNK_DEFAULT_FS]
                   [--chunk_max_abs_length CHUNK_MAX_ABS_LENGTH]
                   [--chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES]
                   [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                   [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                   [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                   [--max_cache_size MAX_CACHE_SIZE]
                   [--max_cache_fd MAX_CACHE_FD]
                   [--allow_multi_rates ALLOW_MULTI_RATES]
                   [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                   [--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
                   [--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
                   [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                   [--optim_conf OPTIM_CONF]
                   [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                   [--scheduler_conf SCHEDULER_CONF] [--token_list TOKEN_LIST]
                   [--src_token_list SRC_TOKEN_LIST]
                   [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                   [--input_size INPUT_SIZE] [--ctc_conf CTC_CONF]
                   [--use_preprocessor USE_PREPROCESSOR]
                   [--token_type {bpe,char,word,phn}]
                   [--src_token_type {bpe,char,word,phn}]
                   [--bpemodel BPEMODEL] [--src_bpemodel SRC_BPEMODEL]
                   [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                   [--cleaner {None,tacotron,jaconv,vietnamese}]
                   [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                   [--tokenizer_encode_conf TOKENIZER_ENCODE_CONF]
                   [--src_tokenizer_encode_conf SRC_TOKENIZER_ENCODE_CONF]
                   [--frontend {embed}] [--frontend_conf FRONTEND_CONF]
                   [--specaug {specaug,None}] [--specaug_conf SPECAUG_CONF]
                   [--preencoder {sinc,linear,None}]
                   [--preencoder_conf PREENCODER_CONF]
                   [--encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,branchformer,e_branchformer}]
                   [--encoder_conf ENCODER_CONF]
                   [--postencoder {hugging_face_transformers,None}]
                   [--postencoder_conf POSTENCODER_CONF]
                   [--decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
                   [--decoder_conf DECODER_CONF] [--model {mt,discrete_asr}]
                   [--model_conf MODEL_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)
  --tokenizer_encode_conf TOKENIZER_ENCODE_CONF
                        Tokenization encoder conf, e.g. BPE dropout: enable_sampling=True, alpha=0.1, nbest_size=-1 (default: None)
  --src_tokenizer_encode_conf SRC_TOKENIZER_ENCODE_CONF
                        Src tokenization encoder conf, e.g. BPE dropout: enable_sampling=True, alpha=0.1, nbest_size=-1 (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --drop_last_iter DROP_LAST_ITER
                        Exclude the minibatch with leftovers. (default: False)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: sequence)
  --valid_iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: None)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images to plot from the attention outputs. This option makes sense only for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel  (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
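
With --dist_init_method env://, the four environment variables named above must be available to each process. The sketch below only checks and collects them; it does not initialize any distributed backend, and the helper name read_env_init_settings is hypothetical.

```python
import os
from typing import Dict, Mapping, Optional

REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def read_env_init_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect the variables that env:// initialization relies on."""
    if env is None:
        env = os.environ
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {missing}")
    return {name: env[name] for name in REQUIRED_VARS}


# Example with an explicit mapping instead of the real environment.
example_env = {
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
    "WORLD_SIZE": "2",
    "RANK": "0",
}
print(read_env_init_settings(example_env))
```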

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair referring to the phase, "train" or "valid", and the criterion name. The mode specifying "min" or "max" can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging early stopping. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
  --use_adapter USE_ADAPTER
                        Enable efficient finetuning, see (https://arxiv.org/abs/2106.09685) for large pre-trained foundation models, like Whisper and SSL models (default: False)
  --adapter {lora,houlsby}
                        Adapter Name (default: lora)
  --save_strategy {all,adapter_only,required_grad_only}
                        The strategy to save parameters. Default: 'all'
                        'all': save all parameters
                        'adapter_only': save only adapter parameters, without other parameters like downstream model
                        'required_grad_only': save only parameters with requires_grad=True
                         (default: all)
  --adapter_conf ADAPTER_CONF
                        Configuration for efficient finetuning (default: {})
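
The interaction of --patience with --early_stopping_criterion can be illustrated with a short check over the per-epoch values of one criterion. This is a sketch under assumptions, not the trainer's implementation: the function name should_stop is hypothetical, and the rule "stop when the best epoch is more than patience epochs old" is one plausible reading of the help text.

```python
from typing import List, Optional


def should_stop(values: List[float], patience: Optional[int], mode: str = "min") -> bool:
    """Return True when the best value is more than `patience` epochs old."""
    if patience is None or not values:
        # No patience given: early stopping is disabled.
        return False
    best = min(values) if mode == "min" else max(values)
    best_epoch = values.index(best)  # first epoch that reached the best value
    epochs_since_best = len(values) - 1 - best_epoch
    return epochs_since_best > patience


valid_loss = [1.00, 0.90, 0.95, 0.97, 0.99]
print(should_stop(valid_loss, patience=2, mode="min"))
print(should_stop(valid_loss, patience=None, mode="min"))
```

Here the values correspond to the default criterion ('valid', 'loss', 'min'); for an accuracy-like criterion, mode would be "max".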

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
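
The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format of --init_param can be split into its fields as sketched below. The names InitParamSpec and parse_init_param are hypothetical, and treating exclude_keys as a comma-separated list is an assumption of this sketch. Paths that themselves contain ':' are not handled.

```python
from typing import List, NamedTuple, Optional


class InitParamSpec(NamedTuple):
    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: List[str]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split an --init_param value into its (up to four) fields."""
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':' separated fields: {spec}")
    # Pad missing trailing fields with empty strings.
    parts += [""] * (4 - len(parts))
    file_path, src_key, dst_key, exclude = parts
    return InitParamSpec(
        file_path=file_path,
        src_key=src_key or None,
        dst_key=dst_key or None,
        # Assumption: several exclude keys are separated by commas.
        exclude_keys=exclude.split(",") if exclude else [],
    )


print(parse_init_param("some/where/model.pth"))
print(parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed"))
```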

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', 'folded', or 'catbel'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The first column is used, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L refers to the largest length of samples in the mini-batch. This sampler requires length information in the same way as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that have, as nearly as possible, the same number of 'bins', counted as the total length of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches that have, as nearly as possible, the same number of 'bins', counted as the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by sample length. To enable this, "shape_file" must contain the length information. (default: descending)
  --shuffle_within_batch SHUFFLE_WITHIN_BATCH
                        Shuffle whole batches sample-wise. Normally required for classification tasks. (default: False)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
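
The batch-type descriptions above come down to a little arithmetic. The following is a minimal sketch, not ESPnet's own sampler code: `folded_batch_size` applies the formula quoted in the "folded" help text, and `length_bins` / `numel_bins` show what each batch type counts for one shape-file entry. All function names are hypothetical, and clamping the divisor to at least 1 (so that L < fold_length does not divide by zero) is an assumption made for illustration.

```python
from math import prod
from typing import List, Tuple


def folded_batch_size(base_batch_size: int, longest: int, fold_length: int) -> int:
    """Apply batch_size = base_batch_size // (L // fold_length)."""
    # Assumption: clamp the divisor to 1 so that L < fold_length does not
    # divide by zero; the real sampler may handle this case differently.
    folds = max(1, longest // fold_length)
    return max(1, base_batch_size // folds)


def parse_shape_line(line: str) -> Tuple[str, List[int]]:
    """Split 'utterance_id_a 1000,80' into ('utterance_id_a', [1000, 80])."""
    utt_id, shape = line.split(maxsplit=1)
    return utt_id, [int(dim) for dim in shape.split(",")]


def length_bins(shape: List[int]) -> int:
    """The 'length' batch type counts only the first dimension."""
    return shape[0]


def numel_bins(shape: List[int]) -> int:
    """The 'numel' batch type counts every element of the feature."""
    return prod(shape)


if __name__ == "__main__":
    _, shape = parse_shape_line("utterance_id_b 1453,80")
    print(folded_batch_size(32, shape[0], 500))
    print(length_bins(shape), numel_bins(shape))
```

With a base batch size of 32 and fold_length of 500, an utterance of length 1453 spans two folds, so the sketch halves the batch size.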

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of choices. Note that if the sequence length is shorter than all chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
  --chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
                        List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
                          - exactly match one of the prefixes in `chunk_excluded_key_prefixes`
                          - have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
  --chunk_default_fs CHUNK_DEFAULT_FS
                        Default sampling rate used for the chunk length. Will be used to adaptively adjust the chunk length for data of different sampling rates. (If None, the chunk length will be fixed.) (default: None)
  --chunk_max_abs_length CHUNK_MAX_ABS_LENGTH
                        Maximum number of samples per chunk for all sampling rates (default: None)
  --chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES
                        Discard samples shorter than the minimum chunk length (default: True)
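
The `--chunk_length` and `--chunk_shift_ratio` options can be illustrated with a short sketch. This is not the ChunkIterFactory implementation: the function names are hypothetical, and treating a '300-400' range as inclusive at both ends, as well as computing the shift as `int(chunk_len * shift_ratio)`, are assumptions.

```python
from typing import List


def parse_chunk_lengths(spec: str) -> List[int]:
    """Expand '300', '300,400,500' or '300-400' into candidate lengths."""
    lengths: List[int] = []
    for part in spec.split(","):
        bounds = [int(x) for x in part.split("-")]
        if len(bounds) == 1:
            lengths.append(bounds[0])
        elif len(bounds) == 2:
            # Assumption: the range is inclusive at both ends.
            lengths.extend(range(bounds[0], bounds[1] + 1))
        else:
            raise ValueError(f"Bad chunk_length field: {part!r}")
    return lengths


def chunk_starts(seq_len: int, chunk_len: int, shift_ratio: float) -> List[int]:
    """Start offsets of the full-length chunks that fit in one sequence."""
    # Assumption: the shift is chunk_len * shift_ratio, truncated, at least 1.
    shift = max(1, int(chunk_len * shift_ratio))
    # An empty list means the sequence is shorter than the chunk length.
    return list(range(0, seq_len - chunk_len + 1, shift))


if __name__ == "__main__":
    print(parse_chunk_lengths("300,400,500"))
    print(chunk_starts(1200, 500, 0.5))   # ratio < 1: chunks overlap
    print(chunk_starts(2500, 500, 2.0))   # ratio > 1: gaps between chunks
```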

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It is used for the training data, e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, is the file path; the second, foo, is the key name used for the mini-batch data; and the last, sound, is the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio formats supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "multi_columns_sound":
                        Enable multi-column wav.scp. The following text file can be loaded as multi-channel audio data

                           utterance_id_a a.wav a2.wav
                           utterance_id_b b.wav b2.wav
                           ...

                        "variable_columns_sound":
                        Load a variable number (columns) of audio files from wav.scp. The following text file can be loaded as stacked audio data

                           utterance_id_a a1.wav a2.wav a3.wav
                           utterance_id_b b1.wav
                           utterance_id_c c1.wav c2.wav
                           ...

                        Note that audios of different lengths will be right-padded with np.nan to the longest audio in the sample.
                        A preprocessor must be used to remove these paddings.

                        "score":
                        Return text as is. The text contains tempo and note info.
                        For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.

                           utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
                           utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of floats separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of floats separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "random_text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           hello world
                           foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate a random int-ndarray with the shapes given in the file. The lower and upper values are taken from the file type name, e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader; currently supported for speaker diarization.

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys in the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to keep open for ark files. This feature is only valid when the data type is 'kaldi_ark'. (default: 32)
  --allow_multi_rates ALLOW_MULTI_RATES
                        Whether to allow audios to have different sampling rates (default: False)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
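
The `path,name,type` triple and several of the text-based file types listed above are easy to parse by hand. The following is a minimal sketch with hypothetical helper names; it is not the ESPnet dataset loader and returns plain Python lists rather than ndarrays.

```python
import re
from typing import List, Tuple


def split_data_spec(spec: str) -> Tuple[str, str, str]:
    """Split 'some/path/a.scp,foo,sound' into (path, key name, file type)."""
    path, name, file_type = spec.split(",")
    return path, name, file_type


def parse_text_int(line: str) -> Tuple[str, List[int]]:
    """Parse one 'text_int' line: space-separated integers after the id."""
    utt_id, *values = line.split()
    return utt_id, [int(v) for v in values]


def parse_csv_float(line: str) -> Tuple[str, List[float]]:
    """Parse one 'csv_float' line: comma-separated floats after the id."""
    utt_id, values = line.split(maxsplit=1)
    return utt_id, [float(v) for v in values.split(",")]


def parse_rand_int_type(file_type: str) -> Tuple[int, int]:
    """Extract the lower and upper values from e.g. 'rand_int_0_10'."""
    match = re.fullmatch(r"rand_int_(\d+)_(\d+)", file_type)
    if match is None:
        raise ValueError(f"Not a rand_int type: {file_type!r}")
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    print(split_data_spec("some/path/a.scp,foo,sound"))
    print(parse_text_int("utterance_id_A 12 0 1 3"))
    print(parse_csv_float("utterance_id_B 3.,3.12,1.1"))
    print(parse_rand_int_type("rand_int_0_10"))
```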

Optimizer related:
  --exclude_weight_decay EXCLUDE_WEIGHT_DECAY
                        Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
  --exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
                        The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})
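
The idea behind `--exclude_weight_decay` can be sketched as splitting parameters into a group that receives weight decay and a group that does not. This is only an illustration of the general technique: the function name and the selection rule (names ending in `.bias`, or tensors with at most one dimension) are assumptions, and the actual rules live in espnet2.optimizers.optim_groups.configure_optimizer and may differ.

```python
from typing import Dict, List, Tuple


def split_decay_groups(params: List[Tuple[str, int]]) -> Dict[str, List[str]]:
    """Partition (parameter name, number of dimensions) pairs by decay use."""
    groups: Dict[str, List[str]] = {"decay": [], "no_decay": []}
    for name, ndim in params:
        # Assumption: biases and 1-D tensors (e.g. normalization scales)
        # are the parameters excluded from weight decay.
        if name.endswith(".bias") or ndim <= 1:
            groups["no_decay"].append(name)
        else:
            groups["decay"].append(name)
    return groups


if __name__ == "__main__":
    example = [
        ("encoder.linear.weight", 2),
        ("encoder.linear.bias", 1),
        ("encoder.norm.weight", 1),
    ]
    print(split_decay_groups(example))
```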

  Task related

  --token_list TOKEN_LIST
                        A text mapping int-id to token (for target language) (default: None)
  --src_token_list SRC_TOKEN_LIST
                        A text mapping int-id to token (for source language) (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --input_size INPUT_SIZE
                        The number of input dimensions of the feature (default: None)
  --ctc_conf CTC_CONF   The keyword arguments for CTC class. (default: {'dropout_rate': 0.0, 'ctc_type': 'builtin', 'reduce': True, 'ignore_nan_grad': None, 'zero_infinity': True, 'brctc_risk_strategy': 'exp', 'brctc_group_strategy': 'end', 'brctc_risk_factor': 0.0})

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn}
                        The target text will be tokenized at the specified token level (default: bpe)
  --src_token_type {bpe,char,word,phn}
                        The source text will be tokenized at the specified token level (default: bpe)
  --bpemodel BPEMODEL   The model file of sentencepiece (for target language) (default: None)
  --src_bpemodel SRC_BPEMODEL
                        The model file of sentencepiece (for source language) (default: None)
  --frontend {embed}    The frontend type (default: embed)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --preencoder {sinc,linear,None}
                        The preencoder type (default: None)
  --preencoder_conf PREENCODER_CONF
                        The keyword arguments for preencoder (default: {})
  --encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,branchformer,e_branchformer}
                        The encoder type (default: rnn)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --postencoder {hugging_face_transformers,None}
                        The postencoder type (default: None)
  --postencoder_conf POSTENCODER_CONF
                        The keyword arguments for postencoder (default: {})
  --decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
                        The decoder type (default: rnn)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})
  --model {mt,discrete_asr}
                        The model type (default: mt)
  --model_conf MODEL_CONF
                        The keyword arguments for model (default: {})

pack.py¶

usage: pack.py [-h] {asr,st,tts,enh,diar,svs,enh_s2t,ssl,s2st,s2t,spk} ...

Pack input files to archive format

positional arguments:
  {asr,st,tts,enh,diar,svs,enh_s2t,ssl,s2st,s2t,spk}

optional arguments:

s2st_inference.py¶

usage: s2st_inference.py [-h] [--config CONFIG]
                         [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                         --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                         [--dtype {float16,float32,float64}]
                         [--num_workers NUM_WORKERS] [--batch_size BATCH_SIZE]
                         --data_path_and_name_and_type
                         DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                         [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                         [--train_config TRAIN_CONFIG]
                         [--model_file MODEL_FILE] [--maxlenratio MAXLENRATIO]
                         [--minlenratio MINLENRATIO]
                         [--st_subtask_maxlenratio ST_SUBTASK_MAXLENRATIO]
                         [--st_subtask_minlenratio ST_SUBTASK_MINLENRATIO]
                         [--threshold THRESHOLD]
                         [--use_att_constraint USE_ATT_CONSTRAINT]
                         [--backward_window BACKWARD_WINDOW]
                         [--forward_window FORWARD_WINDOW]
                         [--use_teacher_forcing USE_TEACHER_FORCING]
                         [--always_fix_seed ALWAYS_FIX_SEED] [--nbest NBEST]
                         [--beam_size BEAM_SIZE] [--penalty PENALTY]
                         [--st_subtask_nbest ST_SUBTASK_NBEST]
                         [--st_subtask_beam_size ST_SUBTASK_BEAM_SIZE]
                         [--st_subtask_penalty ST_SUBTASK_PENALTY]
                         [--vocoder_config VOCODER_CONFIG]
                         [--vocoder_file VOCODER_FILE]
                         [--vocoder_tag VOCODER_TAG]
                         [--st_subtask_token_type {char,bpe,None}]
                         [--st_subtask_bpemodel ST_SUBTASK_BPEMODEL]
                         [--normalize_length NORMALIZE_LENGTH]

S2ST inference

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
                        The path of output directory (default: None)
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --train_config TRAIN_CONFIG
                        Training configuration file (default: None)
  --model_file MODEL_FILE
                        Model parameter file (default: None)

Decoding related:
  --maxlenratio MAXLENRATIO
                        Maximum length ratio in decoding (default: 10.0)
  --minlenratio MINLENRATIO
                        Minimum length ratio in decoding (default: 0.0)
  --st_subtask_maxlenratio ST_SUBTASK_MAXLENRATIO
                        Maximum length ratio in decoding (default: 1.5)
  --st_subtask_minlenratio ST_SUBTASK_MINLENRATIO
                        Minimum length ratio in decoding (default: 0.1)
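
The length ratios are typically turned into absolute bounds on the output length by scaling the input length. A minimal sketch under that assumption follows; the function name is hypothetical, and the exact rounding and special cases used by the inference script are not stated in the help text.

```python
from typing import Tuple


def decode_length_bounds(
    input_len: int, maxlenratio: float, minlenratio: float
) -> Tuple[int, int]:
    """Return (minimum, maximum) output length for one input sequence."""
    # Assumption: both ratios scale the input length and are truncated.
    maxlen = max(1, int(maxlenratio * input_len))
    minlen = int(minlenratio * input_len)
    return minlen, maxlen


if __name__ == "__main__":
    # Defaults of --maxlenratio / --minlenratio from the help text above.
    print(decode_length_bounds(200, 10.0, 0.0))
```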

Spectrogram-based generation related:
  --threshold THRESHOLD
                        Threshold value in decoding (default: 0.5)
  --use_att_constraint USE_ATT_CONSTRAINT
                        Whether to use attention constraint (default: False)
  --backward_window BACKWARD_WINDOW
                        Backward window value in attention constraint
                        (default: 1)
  --forward_window FORWARD_WINDOW
                        Forward window value in attention constraint (default:
                        3)
  --use_teacher_forcing USE_TEACHER_FORCING
                        Whether to use teacher forcing (default: False)
  --always_fix_seed ALWAYS_FIX_SEED
                        Whether to always fix seed (default: False)
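
The effect of `--backward_window` and `--forward_window` can be sketched as masking attention energies outside a window around the last attended position. This is an illustration on plain Python lists, with a hypothetical function name; the exact window boundaries are an assumption rather than something the help text specifies.

```python
import math
from typing import List


def apply_attention_constraint(
    energies: List[float],
    last_attended_idx: int,
    backward_window: int = 1,
    forward_window: int = 3,
) -> List[float]:
    """Mask energies outside [last - backward, last + forward)."""
    # Assumption: the lower bound is inclusive and the upper bound exclusive.
    low = last_attended_idx - backward_window
    high = last_attended_idx + forward_window
    return [e if low <= i < high else -math.inf for i, e in enumerate(energies)]


if __name__ == "__main__":
    print(apply_attention_constraint([0.0] * 8, last_attended_idx=4))
```

Positions masked with negative infinity receive zero weight after a softmax, which keeps the attention from jumping far from the previous step.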

Beam-search (discrete unit/multi-pass) related:
  --nbest NBEST         Output N-best hypotheses (default: 1)
  --beam_size BEAM_SIZE
                        Beam size (default: 20)
  --penalty PENALTY     Insertion penalty (default: 0.0)
  --st_subtask_nbest ST_SUBTASK_NBEST
                        Output N-best hypotheses for st subtask (default: 1)
  --st_subtask_beam_size ST_SUBTASK_BEAM_SIZE
                        Beam size for st subtask (default: 5)
  --st_subtask_penalty ST_SUBTASK_PENALTY
                        Insertion penalty for st subtask (default: 0.0)

Vocoder related:
  --vocoder_config VOCODER_CONFIG
                        Vocoder configuration file (default: None)
  --vocoder_file VOCODER_FILE
                        Vocoder parameter file (default: None)
  --vocoder_tag VOCODER_TAG
                        Pretrained vocoder tag. If this option is specified,
                        vocoder_config and vocoder_file will be overwritten
                        (default: None)

Text converter related:
  --st_subtask_token_type {char,bpe,None}
                        The token type for ST model. If not given, it is taken
                        from the training args (default: None)
  --st_subtask_bpemodel ST_SUBTASK_BPEMODEL
                        The model path of sentencepiece. If not given, it is
                        taken from the training args (default: None)
  --normalize_length NORMALIZE_LENGTH
                        If true, best hypothesis is selected by length-
                        normalized scores (default: False)
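
Length normalization changes which hypothesis wins because raw log-probability scores penalize longer outputs. A minimal sketch of the selection, assuming each hypothesis is a (tokens, score) pair and that normalization divides the score by the token count; the representation and function name are hypothetical.

```python
from typing import List, Sequence, Tuple

Hypothesis = Tuple[Sequence[str], float]


def select_best(hyps: List[Hypothesis], normalize_length: bool = False) -> Hypothesis:
    """Pick the hypothesis with the highest (optionally normalized) score."""

    def key(hyp: Hypothesis) -> float:
        tokens, score = hyp
        # Assumption: normalization divides by the number of tokens.
        return score / max(1, len(tokens)) if normalize_length else score

    return max(hyps, key=key)


if __name__ == "__main__":
    hyps = [(["a", "b"], -2.0), (["a", "b", "c", "d", "e"], -3.0)]
    print(select_best(hyps))
    print(select_best(hyps, normalize_length=True))
```

In this example the short hypothesis wins on raw score, while the longer one wins once scores are divided by length.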

s2st_train.py¶

usage: s2st_train.py [-h] [--config CONFIG] [--print_config]
                     [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                     [--drop_last_iter DROP_LAST_ITER] [--dry_run DRY_RUN]
                     [--iterator_type {sequence,category,chunk,task,none}]
                     [--valid_iterator_type {sequence,category,chunk,task,none}]
                     [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                     [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                     [--dist_backend DIST_BACKEND]
                     [--dist_init_method DIST_INIT_METHOD]
                     [--dist_world_size DIST_WORLD_SIZE]
                     [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                     [--dist_master_addr DIST_MASTER_ADDR]
                     [--dist_master_port DIST_MASTER_PORT]
                     [--dist_launcher {slurm,mpi,None}]
                     [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                     [--unused_parameters UNUSED_PARAMETERS]
                     [--sharded_ddp SHARDED_DDP]
                     [--cudnn_enabled CUDNN_ENABLED]
                     [--cudnn_benchmark CUDNN_BENCHMARK]
                     [--cudnn_deterministic CUDNN_DETERMINISTIC]
                     [--collect_stats COLLECT_STATS]
                     [--write_collected_feats WRITE_COLLECTED_FEATS]
                     [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                     [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                     [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                     [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                     [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                     [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                     [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                     [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                     [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                     [--train_dtype {float16,float32,float64}]
                     [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                     [--use_matplotlib USE_MATPLOTLIB]
                     [--use_tensorboard USE_TENSORBOARD]
                     [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                     [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                     [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                     [--wandb_name WANDB_NAME]
                     [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                     [--detect_anomaly DETECT_ANOMALY]
                     [--use_adapter USE_ADAPTER] [--adapter {lora,houlsby}]
                     [--save_strategy {all,adapter_only,required_grad_only}]
                     [--adapter_conf ADAPTER_CONF]
                     [--pretrain_path PRETRAIN_PATH]
                     [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                     [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                     [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                     [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                     [--batch_size BATCH_SIZE]
                     [--valid_batch_size VALID_BATCH_SIZE]
                     [--batch_bins BATCH_BINS]
                     [--valid_batch_bins VALID_BATCH_BINS]
                     [--train_shape_file TRAIN_SHAPE_FILE]
                     [--valid_shape_file VALID_SHAPE_FILE]
                     [--batch_type {unsorted,sorted,folded,length,numel}]
                     [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                     [--fold_length FOLD_LENGTH]
                     [--sort_in_batch {descending,ascending}]
                     [--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
                     [--sort_batch {descending,ascending}]
                     [--multiple_iterator MULTIPLE_ITERATOR]
                     [--chunk_length CHUNK_LENGTH]
                     [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                     [--num_cache_chunks NUM_CACHE_CHUNKS]
                     [--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
                     [--chunk_default_fs CHUNK_DEFAULT_FS]
                     [--chunk_max_abs_length CHUNK_MAX_ABS_LENGTH]
                     [--chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES]
                     [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                     [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                     [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                     [--max_cache_size MAX_CACHE_SIZE]
                     [--max_cache_fd MAX_CACHE_FD]
                     [--allow_multi_rates ALLOW_MULTI_RATES]
                     [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                     [--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
                     [--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
                     [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                     [--optim_conf OPTIM_CONF]
                     [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                     [--scheduler_conf SCHEDULER_CONF]
                     [--s2st_type {translatotron,translatotron2,discrete_unit,unity}]
                     [--tgt_token_list TGT_TOKEN_LIST]
                     [--src_token_list SRC_TOKEN_LIST]
                     [--unit_token_list UNIT_TOKEN_LIST] [--odim ODIM]
                     [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                     [--input_size INPUT_SIZE] [--output_size OUTPUT_SIZE]
                     [--asr_ctc ASR_CTC] [--st_ctc ST_CTC]
                     [--asr_ctc_conf ASR_CTC_CONF] [--st_ctc_conf ST_CTC_CONF]
                     [--model_conf MODEL_CONF]
                     [--use_preprocessor USE_PREPROCESSOR]
                     [--tgt_token_type {bpe,char,word,phn}]
                     [--src_token_type {bpe,char,word,phn,none}]
                     [--tgt_bpemodel TGT_BPEMODEL]
                     [--src_bpemodel SRC_BPEMODEL]
                     [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                     [--cleaner {None,tacotron,jaconv,vietnamese}]
                     [--tgt_g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                     [--src_g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                     [--losses LOSSES]
                     [--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
                     [--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
                     [--noise_scp NOISE_SCP]
                     [--noise_apply_prob NOISE_APPLY_PROB]
                     [--noise_db_range NOISE_DB_RANGE]
                     [--short_noise_thres SHORT_NOISE_THRES]
                     [--frontend {default,sliding_window,s3prl,None}]
                     [--frontend_conf FRONTEND_CONF]
                     [--tgt_feats_extract {fbank,spectrogram,linear_spectrogram,None}]
                     [--tgt_feats_extract_conf TGT_FEATS_EXTRACT_CONF]
                     [--specaug {specaug,None}] [--specaug_conf SPECAUG_CONF]
                     [--src_normalize {global_mvn,utterance_mvn,None}]
                     [--src_normalize_conf SRC_NORMALIZE_CONF]
                     [--tgt_normalize {global_mvn,utterance_mvn,None}]
                     [--tgt_normalize_conf TGT_NORMALIZE_CONF]
                     [--preencoder {sinc,linear,None}]
                     [--preencoder_conf PREENCODER_CONF]
                     [--encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,linear}]
                     [--encoder_conf ENCODER_CONF]
                     [--postencoder {hugging_face_transformers,None}]
                     [--postencoder_conf POSTENCODER_CONF]
                     [--asr_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,None}]
                     [--asr_decoder_conf ASR_DECODER_CONF]
                     [--st_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,None}]
                     [--st_decoder_conf ST_DECODER_CONF]
                     [--aux_attention {multihead,None}]
                     [--aux_attention_conf AUX_ATTENTION_CONF]
                     [--unit_encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,linear,None}]
                     [--unit_encoder_conf UNIT_ENCODER_CONF]
                     [--synthesizer {translatotron,discrete_unit}]
                     [--synthesizer_conf SYNTHESIZER_CONF]
                     [--loss {tacotron,guided_attention,attention,ctc}]
                     [--loss_conf LOSS_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --drop_last_iter DROP_LAST_ITER
                        Exclude the minibatch with leftovers. (default: False)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: sequence)
  --valid_iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: None)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images to plot from the attention outputs. This option makes sense only for attention-based models. The attention plot can also be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel  (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair consisting of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used to judge early stopping. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used to select the best model. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of the used p-norm for gradient clip. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, the interval is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
  --use_adapter USE_ADAPTER
                        Enable efficient finetuning, see (https://arxiv.org/abs/2106.09685) for large pre-trained foundation models, like Whisper and SSL models (default: False)
  --adapter {lora,houlsby}
                        Adapter Name (default: lora)
  --save_strategy {all,adapter_only,required_grad_only}
                        The strategy to save parameters. Default: 'all'
                        'all': save all parameters
                        'adapter_only': save only adapter parameters, without other parameters like downstream model
                        'required_grad_only': save only parameters with requires_grad=True
                         (default: all)
  --adapter_conf ADAPTER_CONF
                        Configuration for efficient finetuning (default: {})
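
The criterion options above take a phase, a criterion name, and a mode. The following is a minimal sketch of how such a triple could select an epoch. It is not ESPnet's trainer code: the `best_epoch` helper, the `History` layout, and the numbers in `history` are all made up for illustration.

```python
from typing import Dict

# Hypothetical per-epoch statistics: epoch -> phase -> criterion -> value.
History = Dict[int, Dict[str, Dict[str, float]]]


def best_epoch(history: History, phase: str, criterion: str, mode: str) -> int:
    """Return the epoch whose (phase, criterion) value is best under mode."""
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    values = {epoch: stats[phase][criterion] for epoch, stats in history.items()}
    pick = min if mode == "min" else max
    return pick(values, key=values.get)


# Made-up values, only to exercise the helper.
history: History = {
    1: {"train": {"loss": 2.1, "acc": 0.41}, "valid": {"loss": 2.4, "acc": 0.38}},
    2: {"train": {"loss": 1.5, "acc": 0.58}, "valid": {"loss": 1.9, "acc": 0.52}},
    3: {"train": {"loss": 1.1, "acc": 0.69}, "valid": {"loss": 2.0, "acc": 0.55}},
}

print(best_epoch(history, "valid", "loss", "min"))
print(best_epoch(history, "valid", "acc", "max"))
```

With these made-up numbers, the two criteria disagree on the best epoch, which is why several triples can be passed to --best_model_criterion at once.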

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
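
As a reading aid for the --init_param format described above, here is a small sketch that splits a spec string into its four fields. `parse_init_param` and `InitParamSpec` are hypothetical names, not part of ESPnet, and the comma-separated handling of several excluded keys is an assumption.

```python
from typing import NamedTuple, Optional, Tuple


class InitParamSpec(NamedTuple):
    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: Tuple[str, ...]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into fields.

    Trailing fields may be omitted; missing ones become None or ().
    """
    parts = spec.split(":", 3)
    parts += [""] * (4 - len(parts))
    file_path, src_key, dst_key, exclude = parts
    return InitParamSpec(
        file_path=file_path,
        src_key=src_key or None,
        dst_key=dst_key or None,
        # Assumption: several excluded keys are separated by commas.
        exclude_keys=tuple(key for key in exclude.split(",") if key),
    )


print(parse_init_param("some/where/model.pth"))
print(parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed"))
```

Note that this simple split would misread file paths that themselves contain a colon.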

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', 'folded', or 'catbel'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and simply creates mini-batches of constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        Only the first column is used, so a 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in each mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is used, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L refers to the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches which contain as close to the same number of 'bins' as possible, counted by the total length of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is used, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches which contain as close to the same number of 'bins' as possible, but counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by sample length. To enable this, "shape_file" must have the length information. (default: descending)
  --shuffle_within_batch SHUFFLE_WITHIN_BATCH
                        Shuffle whole batches sample-wise. Normally required for classification tasks. (default: False)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
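
The "folded" rule quoted above, batch_size = base_batch_size // (L // fold_length), can be tried out with the sketch below. It follows the help text literally and is not taken from FoldedBatchSampler itself. Clamping the divisor and the result to at least 1 is an assumption added here so that short samples do not cause a division by zero, and `folded_batch_size` is a hypothetical name.

```python
def folded_batch_size(base_batch_size: int, max_length: int, fold_length: int) -> int:
    """Shrink the batch size as the longest sample in the batch grows."""
    # Assumption: treat samples shorter than fold_length as a single fold.
    folds = max(1, max_length // fold_length)
    # Assumption: never return an empty batch.
    return max(1, base_batch_size // folds)


for longest in (500, 1600, 2500):
    print(longest, folded_batch_size(20, longest, fold_length=800))
```

In this sketch, a batch whose longest sample spans two folds holds half as many samples as the base batch size.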

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
  --chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
                        List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
                          - exactly match one of the prefixes in `chunk_excluded_key_prefixes`
                          - have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
  --chunk_default_fs CHUNK_DEFAULT_FS
                        Default sampling rate used for the chunk length. Will be used to adaptively adjust the chunk length for data of different sampling rates. (If None, the chunk length will be fixed.) (default: None)
  --chunk_max_abs_length CHUNK_MAX_ABS_LENGTH
                        Maximum number of samples per chunk for all sampling rates (default: None)
  --chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES
                        Discard samples shorter than the minimum chunk length (default: True)
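
The three --chunk_length forms described above ('300', '300,400,500', '300-400') can be illustrated with the following sketch. It is not the ChunkIterFactory implementation: `parse_chunk_length` and `pick_chunk_length` are hypothetical helpers, and treating both ends of a range as inclusive is an assumption.

```python
import random
from typing import List, Optional


def parse_chunk_length(spec: str) -> List[int]:
    """Expand a chunk-length spec into the list of candidate lengths."""
    lengths: List[int] = []
    for item in spec.split(","):
        if "-" in item:
            low, high = (int(x) for x in item.split("-"))
            # Assumption: both ends of the range are included.
            lengths.extend(range(low, high + 1))
        else:
            lengths.append(int(item))
    return lengths


def pick_chunk_length(spec: str, seq_len: int, rng: random.Random) -> Optional[int]:
    """Pick a random candidate that fits, or None if the sample is discarded."""
    candidates = [c for c in parse_chunk_length(spec) if c <= seq_len]
    if not candidates:
        return None  # shorter than every chunk length
    return rng.choice(candidates)


rng = random.Random(0)
print(parse_chunk_length("300,400,500"))
print(pick_chunk_length("300,400,500", seq_len=450, rng=rng))
print(pick_chunk_length("300,400,500", seq_len=250, rng=rng))
```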

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio formats supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "multi_columns_sound":
                        Enable multi columns wav.scp. The following text file can be loaded as multi channels audio data

                           utterance_id_a a.wav a2.wav
                           utterance_id_b b.wav b2.wav
                           ...

                        "variable_columns_sound":
                        Loading variable numbers (columns) of audios in wav.scp. The following text file can be loaded as stacked audio data

                           utterance_id_a a1.wav a2.wav a3.wav
                           utterance_id_b b1.wav
                           utterance_id_c c1.wav c2.wav
                           ...

                        Note that audios of different lengths will be right-padded with np.nan to the longest audio in the sample.
                        A preprocessor must be used to remove these paddings.

                        "score":
                        Return text as is. The text contains tempo and note info.
                        For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.

                           utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
                           utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integer numbers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integer numbers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "random_text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           hello world
                           foo bar
                           ...

                        "hdf5":
                        A HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader, currently supported for speaker diarization.

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys in the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to keep open for ark files. This feature is only valid when the data type is 'kaldi_ark'. (default: 32)
  --allow_multi_rates ALLOW_MULTI_RATES
                        Whether to allow audios to have different sampling rates (default: False)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
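
The 'path,name,type' triple accepted by --train_data_path_and_name_and_type, and the "rand_int_\d+_\d+" file type, can be read with the sketch below. Both helpers are hypothetical illustrations rather than ESPnet functions, and the sketch assumes the file path itself contains no comma.

```python
import re
from typing import Optional, Tuple

RAND_INT_TYPE = re.compile(r"rand_int_(\d+)_(\d+)")


def split_data_spec(spec: str) -> Tuple[str, str, str]:
    """Split 'some/path/a.scp,foo,sound' into (path, name, type)."""
    parts = spec.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected 'path,name,type', got {spec!r}")
    path, name, file_type = parts
    return path, name, file_type


def rand_int_bounds(file_type: str) -> Optional[Tuple[int, int]]:
    """Return (low, high) for a 'rand_int_<low>_<high>' type, else None."""
    match = RAND_INT_TYPE.fullmatch(file_type)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(split_data_spec("some/path/a.scp,foo,sound"))
print(rand_int_bounds("rand_int_0_10"))
```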

Optimizer related:
  --exclude_weight_decay EXCLUDE_WEIGHT_DECAY
                        Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
  --exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
                        The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --s2st_type {translatotron,translatotron2,discrete_unit,unity}
                        Types of S2ST (default: discrete_unit)
  --tgt_token_list TGT_TOKEN_LIST
                        A text mapping int-id to token (for target language) (default: None)
  --src_token_list SRC_TOKEN_LIST
                        A text mapping int-id to token (for source language) (default: None)
  --unit_token_list UNIT_TOKEN_LIST
                        A text mapping int-id to token (for discrete_unit) (default: None)
  --odim ODIM           The number of dimension of output feature (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --input_size INPUT_SIZE
                        The number of input dimension of the feature (default: None)
  --output_size OUTPUT_SIZE
                        The number of output dimension of the feature (default: None)
  --asr_ctc ASR_CTC     whether to conduct CTC on ASR objectives (default: False)
  --st_ctc ST_CTC       whether to conduct CTC on ST objectives (default: False)
  --asr_ctc_conf ASR_CTC_CONF
                        The keyword arguments for ASR CTC class. (default: {'dropout_rate': 0.0, 'ctc_type': 'builtin', 'reduce': True, 'ignore_nan_grad': None, 'zero_infinity': True, 'brctc_risk_strategy': 'exp', 'brctc_group_strategy': 'end', 'brctc_risk_factor': 0.0})
  --st_ctc_conf ST_CTC_CONF
                        The keyword arguments for ST CTC class. (default: {'dropout_rate': 0.0, 'ctc_type': 'builtin', 'reduce': True, 'ignore_nan_grad': None, 'zero_infinity': True, 'brctc_risk_strategy': 'exp', 'brctc_group_strategy': 'end', 'brctc_risk_factor': 0.0})
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'ignore_id': -1, 'report_cer': True, 'report_wer': True, 'report_bleu': True, 'sym_space': '<space>', 'sym_blank': '<blank>', 'extract_feats_in_collect_stats': True})

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --tgt_token_type {bpe,char,word,phn}
                        The target text will be tokenized in the specified level token (default: bpe)
  --src_token_type {bpe,char,word,phn,none}
                        The source text will be tokenized in the specified level token (default: bpe)
  --tgt_bpemodel TGT_BPEMODEL
                        The model file of sentencepiece (for target language) (default: None)
  --src_bpemodel SRC_BPEMODEL
                        The model file of sentencepiece (for source language) (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese}
                        Apply text cleaning (default: None)
  --tgt_g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)
  --src_g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)
  --losses LOSSES       The criteria bound to the loss wrappers. (default: [{'name': 'synthesis', 'conf': {}, 'type': 'attention'}])
  --speech_volume_normalize SPEECH_VOLUME_NORMALIZE
                        Scale the maximum amplitude to the given value. (default: None)
  --rir_scp RIR_SCP     The file path of rir scp file. (default: None)
  --rir_apply_prob RIR_APPLY_PROB
                        The probability of applying RIR convolution. (default: 1.0)
  --noise_scp NOISE_SCP
                        The file path of noise scp file. (default: None)
  --noise_apply_prob NOISE_APPLY_PROB
                        The probability of adding noise. (default: 1.0)
  --noise_db_range NOISE_DB_RANGE
                        The range of noise decibel level. (default: 13_15)
  --short_noise_thres SHORT_NOISE_THRES
                        If len(noise) / len(speech) is smaller than this threshold during dynamic mixing, a warning will be displayed. (default: 0.5)
  --frontend {default,sliding_window,s3prl,None}
                        The frontend type (default: None)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --tgt_feats_extract {fbank,spectrogram,linear_spectrogram,None}
                        The tgt_feats_extract type (default: None)
  --tgt_feats_extract_conf TGT_FEATS_EXTRACT_CONF
                        The keyword arguments for tgt_feats_extract (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --src_normalize {global_mvn,utterance_mvn,None}
                        The src_normalize type (default: utterance_mvn)
  --src_normalize_conf SRC_NORMALIZE_CONF
                        The keyword arguments for src_normalize (default: {})
  --tgt_normalize {global_mvn,utterance_mvn,None}
                        The tgt_normalize type (default: utterance_mvn)
  --tgt_normalize_conf TGT_NORMALIZE_CONF
                        The keyword arguments for tgt_normalize (default: {})
  --preencoder {sinc,linear,None}
                        The preencoder type (default: None)
  --preencoder_conf PREENCODER_CONF
                        The keyword arguments for preencoder (default: {})
  --encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,linear}
                        The encoder type (default: transformer)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --postencoder {hugging_face_transformers,None}
                        The postencoder type (default: None)
  --postencoder_conf POSTENCODER_CONF
                        The keyword arguments for postencoder (default: {})
  --asr_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,None}
                        The asr_decoder type (default: None)
  --asr_decoder_conf ASR_DECODER_CONF
                        The keyword arguments for asr_decoder (default: {})
  --st_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,None}
                        The st_decoder type (default: None)
  --st_decoder_conf ST_DECODER_CONF
                        The keyword arguments for st_decoder (default: {})
  --aux_attention {multihead,None}
                        The aux_attention type (default: None)
  --aux_attention_conf AUX_ATTENTION_CONF
                        The keyword arguments for aux_attention (default: {})
  --unit_encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,linear,None}
                        The unit_encoder type (default: None)
  --unit_encoder_conf UNIT_ENCODER_CONF
                        The keyword arguments for unit_encoder (default: {})
  --synthesizer {translatotron,discrete_unit}
                        The synthesizer type (default: discrete_unit)
  --synthesizer_conf SYNTHESIZER_CONF
                        The keyword arguments for synthesizer (default: {})
  --loss {tacotron,guided_attention,attention,ctc}
                        The loss type (default: tacotron)
  --loss_conf LOSS_CONF
                        The keyword arguments for loss (default: {})

s2t_train.py¶

usage: s2t_train.py [-h] [--config CONFIG] [--print_config]
                    [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                    [--drop_last_iter DROP_LAST_ITER] [--dry_run DRY_RUN]
                    [--iterator_type {sequence,category,chunk,task,none}]
                    [--valid_iterator_type {sequence,category,chunk,task,none}]
                    [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                    [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                    [--dist_backend DIST_BACKEND]
                    [--dist_init_method DIST_INIT_METHOD]
                    [--dist_world_size DIST_WORLD_SIZE]
                    [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                    [--dist_master_addr DIST_MASTER_ADDR]
                    [--dist_master_port DIST_MASTER_PORT]
                    [--dist_launcher {slurm,mpi,None}]
                    [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                    [--unused_parameters UNUSED_PARAMETERS]
                    [--sharded_ddp SHARDED_DDP]
                    [--cudnn_enabled CUDNN_ENABLED]
                    [--cudnn_benchmark CUDNN_BENCHMARK]
                    [--cudnn_deterministic CUDNN_DETERMINISTIC]
                    [--collect_stats COLLECT_STATS]
                    [--write_collected_feats WRITE_COLLECTED_FEATS]
                    [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                    [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                    [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                    [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                    [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                    [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                    [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                    [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                    [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                    [--train_dtype {float16,float32,float64}]
                    [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                    [--use_matplotlib USE_MATPLOTLIB]
                    [--use_tensorboard USE_TENSORBOARD]
                    [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                    [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                    [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                    [--wandb_name WANDB_NAME]
                    [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                    [--detect_anomaly DETECT_ANOMALY]
                    [--use_adapter USE_ADAPTER] [--adapter {lora,houlsby}]
                    [--save_strategy {all,adapter_only,required_grad_only}]
                    [--adapter_conf ADAPTER_CONF]
                    [--pretrain_path PRETRAIN_PATH]
                    [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                    [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                    [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                    [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                    [--batch_size BATCH_SIZE]
                    [--valid_batch_size VALID_BATCH_SIZE]
                    [--batch_bins BATCH_BINS]
                    [--valid_batch_bins VALID_BATCH_BINS]
                    [--train_shape_file TRAIN_SHAPE_FILE]
                    [--valid_shape_file VALID_SHAPE_FILE]
                    [--batch_type {unsorted,sorted,folded,length,numel}]
                    [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                    [--fold_length FOLD_LENGTH]
                    [--sort_in_batch {descending,ascending}]
                    [--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
                    [--sort_batch {descending,ascending}]
                    [--multiple_iterator MULTIPLE_ITERATOR]
                    [--chunk_length CHUNK_LENGTH]
                    [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                    [--num_cache_chunks NUM_CACHE_CHUNKS]
                    [--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
                    [--chunk_default_fs CHUNK_DEFAULT_FS]
                    [--chunk_max_abs_length CHUNK_MAX_ABS_LENGTH]
                    [--chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES]
                    [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                    [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                    [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                    [--max_cache_size MAX_CACHE_SIZE]
                    [--max_cache_fd MAX_CACHE_FD]
                    [--allow_multi_rates ALLOW_MULTI_RATES]
                    [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                    [--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
                    [--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
                    [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                    [--optim_conf OPTIM_CONF]
                    [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                    [--scheduler_conf SCHEDULER_CONF]
                    [--token_list TOKEN_LIST]
                    [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                    [--input_size INPUT_SIZE] [--ctc_conf CTC_CONF]
                    [--use_preprocessor USE_PREPROCESSOR]
                    [--token_type {bpe,char,word,phn,hugging_face,whisper_en,whisper_multilingual}]
                    [--bpemodel BPEMODEL]
                    [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                    [--cleaner {None,tacotron,jaconv,vietnamese,whisper_en,whisper_basic}]
                    [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                    [--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
                    [--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
                    [--noise_scp NOISE_SCP]
                    [--noise_apply_prob NOISE_APPLY_PROB]
                    [--noise_db_range NOISE_DB_RANGE]
                    [--short_noise_thres SHORT_NOISE_THRES]
                    [--frontend {default,sliding_window,s3prl,fused,whisper}]
                    [--frontend_conf FRONTEND_CONF] [--specaug {specaug,None}]
                    [--specaug_conf SPECAUG_CONF]
                    [--normalize {global_mvn,utterance_mvn,None}]
                    [--normalize_conf NORMALIZE_CONF] [--model {espnet}]
                    [--model_conf MODEL_CONF]
                    [--preencoder {sinc,linear,None}]
                    [--preencoder_conf PREENCODER_CONF]
                    [--encoder {conformer,transformer,transformer_multispkr,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,torchaudiohubert,longformer,branchformer,whisper,e_branchformer}]
                    [--encoder_conf ENCODER_CONF]
                    [--postencoder {hugging_face_transformers,None}]
                    [--postencoder_conf POSTENCODER_CONF]
                    [--decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,mlm,whisper,hugging_face_transformers,s4,None}]
                    [--decoder_conf DECODER_CONF] [--preprocessor {s2t}]
                    [--preprocessor_conf PREPROCESSOR_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --drop_last_iter DROP_LAST_ITER
                        Exclude the minibatch with leftovers. (default: False)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: sequence)
  --valid_iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: None)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images used to plot attention outputs. This option only makes sense for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
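
The --dist_init_method help above names four environment variables that are consulted when the init method is "env://". The following is a minimal sketch of collecting them, not ESPnet's or PyTorch's own code; the helper name read_env_rendezvous and the returned dictionary layout are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

# Variable names taken from the --dist_init_method help text.
ENV_KEYS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def read_env_rendezvous(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: collect the values that "env://" relies on."""
    env = os.environ if env is None else env
    missing = [key for key in ENV_KEYS if key not in env]
    if missing:
        raise KeyError(f"missing environment variables: {missing}")
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "world_size": int(env["WORLD_SIZE"]),
        "rank": int(env["RANK"]),
    }
```

Passing a plain dictionary instead of os.environ makes it easy to check a launcher's settings before starting a job.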

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair: the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging early stopping. Give three values: the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give three values: the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The p-norm type used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        Whether to inject noise into gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every this number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
  --use_adapter USE_ADAPTER
                        Enable efficient finetuning (see https://arxiv.org/abs/2106.09685) for large pre-trained foundation models such as Whisper and SSL models (default: False)
  --adapter {lora,houlsby}
                        Adapter Name (default: lora)
  --save_strategy {all,adapter_only,required_grad_only}
                        The strategy to save parameters. Default: 'all'
                        'all': save all parameters
                        'adapter_only': save only adapter parameters, without other parameters like downstream model
                        'required_grad_only': save only parameters with requires_grad=True
                         (default: all)
  --adapter_conf ADAPTER_CONF
                        Configuration for efficient finetuning (default: {})
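
The interplay of --patience with the mode ("min" or "max") of --early_stopping_criterion can be sketched as follows. This is an illustrative reading of the help text, not the trainer's actual code: the function name early_stop_epoch and the rule "stop once more than `patience` epochs have passed since the best epoch" are assumptions.

```python
from typing import Optional, Sequence


def early_stop_epoch(values: Sequence[float], mode: str, patience: int) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None.

    `values` holds one criterion value per epoch (e.g. the validation loss).
    Assumption: training stops once more than `patience` epochs have passed
    since the best value was observed.
    """
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    best_value = None
    best_epoch = 0
    for epoch, value in enumerate(values, start=1):
        if best_value is None:
            improved = True
        elif mode == "min":
            improved = value < best_value
        else:
            improved = value > best_value
        if improved:
            best_value, best_epoch = value, epoch
        elif epoch - best_epoch > patience:
            return epoch
    return None
```

With a validation loss of 1.0, 0.9, 0.95, 0.96, 0.97 and a patience of 2, the best epoch is the second one, so this sketch stops at the fifth epoch.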

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
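
The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format of --init_param can be illustrated with a small parser. This is only a sketch: the names InitParamSpec and parse_init_param are hypothetical, trailing fields are treated as optional because the examples in the help text omit them, and paths that themselves contain ':' are not handled.

```python
from typing import NamedTuple, Optional


class InitParamSpec(NamedTuple):
    """Hypothetical container for one --init_param value."""

    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: Optional[str]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into fields."""
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':'-separated fields in {spec!r}")
    # Pad the missing trailing fields and treat empty fields as "not given".
    padded = parts + [""] * (4 - len(parts))
    file_path, src_key, dst_key, exclude_keys = (p if p else None for p in padded)
    if file_path is None:
        raise ValueError("file_path must not be empty")
    return InitParamSpec(file_path, src_key, dst_key, exclude_keys)
```

For example, parsing 'some/where/model.pth:decoder:decoder:decoder.embed' yields the model path, 'decoder' as both source and destination key, and 'decoder.embed' as the excluded key.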

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', 'folded', or 'catbel'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE
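
To give a feel for how --batch_bins bounds a mini-batch when batch_type is 'length', here is a simplified greedy sketch. It is not ESPnet's LengthBatchSampler: the function name pack_by_length is hypothetical, and the cost model (number of samples times the longest sample, i.e. the padded size) is an assumption made for illustration.

```python
from typing import Dict, List


def pack_by_length(lengths: Dict[str, int], batch_bins: int) -> List[List[str]]:
    """Greedily group utterance ids so each padded batch stays within batch_bins."""
    batches: List[List[str]] = []
    current: List[str] = []
    # Sorting by length keeps samples of similar length together.
    for key in sorted(lengths, key=lengths.get):
        candidate = current + [key]
        # Assumed cost: every sample is padded to the longest one in the batch.
        cost = len(candidate) * max(lengths[k] for k in candidate)
        if current and cost > batch_bins:
            batches.append(current)
            current = [key]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches
```

With the lengths 1000, 1453 and 1241 used in the examples of this page and batch_bins=2500, the two shortest utterances share a batch and the longest one gets its own.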

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The first column is used, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in each mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is the largest length of the samples in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that have as nearly the same number of 'bins' as possible, counted by the total length of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches that have as nearly the same number of 'bins' as possible, but counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by sample length. To enable this, "shape_file" must have the length information. (default: descending)
  --shuffle_within_batch SHUFFLE_WITHIN_BATCH
                        Shuffle whole batches sample-wise. Normally required for classification tasks. (default: False)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
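
The "folded" rule quoted in the --batch_type help, batch_size = base_batch_size // (L // fold_length), can be tried out with the sketch below. It follows the formula as printed on this page; the guard for L < fold_length (where the divisor would be zero) is an assumption, and both helper names are hypothetical. The sampler's real implementation may differ.

```python
from typing import List, Tuple


def parse_shape_line(line: str) -> Tuple[str, List[int]]:
    """Parse one shape-file line such as 'utterance_id_a 1000,80'."""
    key, shape = line.split(maxsplit=1)
    return key, [int(value) for value in shape.split(",")]


def folded_batch_size(base_batch_size: int, longest: int, fold_length: int) -> int:
    """Apply batch_size = base_batch_size // (L // fold_length).

    Assumption: when L is shorter than fold_length the divisor is clamped
    to 1, so the base batch size is kept.
    """
    folds = max(1, longest // fold_length)
    return max(1, base_batch_size // folds)
```

For instance, with a base batch size of 20 and a fold length of 500, a mini-batch whose longest sample has 1453 frames is folded twice and ends up with 10 samples.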

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
  --chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
                        List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
                          - exactly match one of the prefixes in `chunk_excluded_key_prefixes`
                          - have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
  --chunk_default_fs CHUNK_DEFAULT_FS
                        Default sampling rate used for the chunk length. Will be used to adaptively adjust the chunk length for data of different sampling rates. (If None, the chunk length will be fixed.) (default: None)
  --chunk_max_abs_length CHUNK_MAX_ABS_LENGTH
                        Maximum number of samples per chunk for all sampling rates (default: None)
  --chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES
                        Discard samples shorter than the minimum chunk length (default: True)
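
The three accepted forms of --chunk_length and the effect of --chunk_shift_ratio can be sketched as follows. The helper names are hypothetical, the '300-400' range is assumed to include both ends, and the chunk positions are computed deterministically from offset 0, whereas the real iterator may choose lengths and offsets randomly.

```python
from typing import List


def parse_chunk_length(spec: str) -> List[int]:
    """Expand '300', '300,400,500' or '300-400' into candidate chunk lengths."""
    lengths: List[int] = []
    for item in spec.split(","):
        bounds = [int(value) for value in item.split("-")]
        if len(bounds) == 1:
            lengths.append(bounds[0])
        elif len(bounds) == 2 and bounds[0] <= bounds[1]:
            # Assumption: the range is inclusive on both ends.
            lengths.extend(range(bounds[0], bounds[1] + 1))
        else:
            raise ValueError(f"invalid chunk length item: {item!r}")
    return lengths


def chunk_starts(seq_len: int, chunk_len: int, shift_ratio: float) -> List[int]:
    """Start indices of chunks; an empty list means the sample is too short."""
    if seq_len < chunk_len:
        return []
    shift = max(1, int(chunk_len * shift_ratio))
    return list(range(0, seq_len - chunk_len + 1, shift))
```

With the defaults (chunk length 500, shift ratio 0.5), a sequence of 1000 frames gives chunks starting at 0, 250 and 500, so neighbouring chunks overlap by half.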

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio formats supported by sndfile, e.g. wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "multi_columns_sound":
                        Enable multi-column wav.scp. The following text file can be loaded as multi-channel audio data

                           utterance_id_a a.wav a2.wav
                           utterance_id_b b.wav b2.wav
                           ...

                        "variable_columns_sound":
                        Load a variable number (columns) of audio files in wav.scp. The following text file can be loaded as stacked audio data

                           utterance_id_a a1.wav a2.wav a3.wav
                           utterance_id_b b1.wav
                           utterance_id_c c1.wav c2.wav
                           ...

                        Note that audios of different lengths will be right-padded with np.nan to the longest audio in the sample.
                        A preprocessor must be used to remove these paddings.

                        "score":
                        Return text as is. The text contains tempo and note info.
                        For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.

                           utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
                           utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "random_text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           hello world
                           foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader, currently supported for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys for the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept open for ark files. This feature is only valid when the data type is 'kaldi_ark'. (default: 32)
  --allow_multi_rates ALLOW_MULTI_RATES
                        Whether to allow audios to have different sampling rates (default: False)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
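
The comma-separated triple given to --train_data_path_and_name_and_type, and the line layout of the "text_int" file type described above, can be sketched like this. Both helpers are hypothetical illustrations rather than ESPnet's own loaders, and the splitting assumes that the file path contains no comma.

```python
from typing import Dict, Iterable, List, Tuple


def parse_data_spec(spec: str) -> Tuple[str, str, str]:
    """Split 'some/path/a.scp,foo,sound' into (path, name, type)."""
    parts = spec.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected 'path,name,type', got {spec!r}")
    path, name, file_type = parts
    return path, name, file_type


def parse_text_int(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Parse "text_int" lines: an utterance id followed by integers."""
    table: Dict[str, List[int]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, *values = line.split()
        table[key] = [int(value) for value in values]
    return table
```

Applied to the example from the help text, parse_data_spec returns the path some/path/a.scp, the mini-batch key foo and the file type sound.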

Optimizer related:
  --exclude_weight_decay EXCLUDE_WEIGHT_DECAY
                        Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
  --exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
                        The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

Task related:
  --token_list TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --input_size INPUT_SIZE
                        The number of input dimensions of the feature (default: None)
  --ctc_conf CTC_CONF   The keyword arguments for CTC class. (default: {'dropout_rate': 0.0, 'ctc_type': 'builtin', 'reduce': True, 'ignore_nan_grad': None, 'zero_infinity': True, 'brctc_risk_strategy': 'exp', 'brctc_group_strategy': 'end', 'brctc_risk_factor': 0.0})

Preprocess related:
  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn,hugging_face,whisper_en,whisper_multilingual}
                        The text will be tokenized at the specified token level (default: bpe)
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese,whisper_en,whisper_basic}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)
  --speech_volume_normalize SPEECH_VOLUME_NORMALIZE
                        Scale the maximum amplitude to the given value. (default: None)
  --rir_scp RIR_SCP     The file path of rir scp file. (default: None)
  --rir_apply_prob RIR_APPLY_PROB
                        The probability of applying RIR convolution. (default: 1.0)
  --noise_scp NOISE_SCP
                        The file path of noise scp file. (default: None)
  --noise_apply_prob NOISE_APPLY_PROB
                        The probability of adding noise. (default: 1.0)
  --noise_db_range NOISE_DB_RANGE
                        The range of noise decibel level. (default: 13_15)
  --short_noise_thres SHORT_NOISE_THRES
                        If len(noise) / len(speech) is smaller than this threshold during dynamic mixing, a warning will be displayed. (default: 0.5)
  --frontend {default,sliding_window,s3prl,fused,whisper}
                        The frontend type (default: default)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: utterance_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --model {espnet}      The model type (default: espnet)
  --model_conf MODEL_CONF
                        The keyword arguments for model (default: {})
  --preencoder {sinc,linear,None}
                        The preencoder type (default: None)
  --preencoder_conf PREENCODER_CONF
                        The keyword arguments for preencoder (default: {})
  --encoder {conformer,transformer,transformer_multispkr,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,torchaudiohubert,longformer,branchformer,whisper,e_branchformer}
                        The encoder type (default: rnn)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --postencoder {hugging_face_transformers,None}
                        The postencoder type (default: None)
  --postencoder_conf POSTENCODER_CONF
                        The keyword arguments for postencoder (default: {})
  --decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,mlm,whisper,hugging_face_transformers,s4,None}
                        The decoder type (default: None)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})
  --preprocessor {s2t}  The preprocessor type (default: s2t)
  --preprocessor_conf PREPROCESSOR_CONF
                        The keyword arguments for preprocessor (default: {})

spk_embed_extract.py¶

usage: spk_embed_extract.py [-h] [--config CONFIG]
                            [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                            --output_dir OUTPUT_DIR [--ngpu NGPU]
                            [--seed SEED] [--dtype {float16,float32,float64}]
                            [--num_workers NUM_WORKERS]
                            --data_path_and_name_and_type
                            DATA_PATH_AND_NAME_AND_TYPE
                            [--batch_type {unsorted,sorted,folded,length,numel}]
                            [--batch_bins BATCH_BINS]
                            [--valid_batch_bins VALID_BATCH_BINS]
                            [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                            [--max_cache_size MAX_CACHE_SIZE]
                            [--max_cache_fd MAX_CACHE_FD]
                            [--allow_multi_rates ALLOW_MULTI_RATES]
                            [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                            [--shape_file SHAPE_FILE]
                            [--input_size INPUT_SIZE]
                            [--num_cohort_spk NUM_COHORT_SPK]
                            [--num_utt_per_spk NUM_UTT_PER_SPK]
                            [--utt_select_sec UTT_SELECT_SEC]
                            [--average_spk AVERAGE_SPK]
                            [--adaptive_cohort_size ADAPTIVE_COHORT_SIZE]
                            [--qmf_dur_thresh QMF_DUR_THRESH]
                            [--qmf_num_trial_per_condition QMF_NUM_TRIAL_PER_CONDITION]
                            [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                            [--average_embd AVERAGE_EMBD]
                            [--train_dtype {float16,float32,float64}]
                            [--use_amp USE_AMP]
                            [--no_forward_run NO_FORWARD_RUN]
                            [--sort_in_batch {descending,ascending}]
                            [--sort_batch {descending,ascending}]
                            [--drop_last_iter DROP_LAST_ITER]
                            [--spk_train_config SPK_TRAIN_CONFIG]
                            [--spk_model_file SPK_MODEL_FILE]
                            [--model_tag MODEL_TAG]
                            [--dist_backend DIST_BACKEND]
                            [--dist_init_method DIST_INIT_METHOD]
                            [--dist_world_size DIST_WORLD_SIZE]
                            [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                            [--dist_master_addr DIST_MASTER_ADDR]
                            [--dist_master_port DIST_MASTER_PORT]
                            [--dist_launcher {slurm,mpi,None}]
                            [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                            [--unused_parameters UNUSED_PARAMETERS]
                            [--sharded_ddp SHARDED_DDP]
                            [--use_matplotlib USE_MATPLOTLIB]
                            [--use_tensorboard USE_TENSORBOARD]
                            [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                            [--use_wandb USE_WANDB]
                            [--wandb_project WANDB_PROJECT]
                            [--wandb_id WANDB_ID]
                            [--wandb_entity WANDB_ENTITY]
                            [--wandb_name WANDB_NAME]
                            [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                            [--detect_anomaly DETECT_ANOMALY]
                            [--use_lora USE_LORA]
                            [--save_lora_only SAVE_LORA_ONLY]
                            [--lora_conf LORA_CONF]
                            [--cudnn_enabled CUDNN_ENABLED]
                            [--cudnn_benchmark CUDNN_BENCHMARK]
                            [--cudnn_deterministic CUDNN_DETERMINISTIC]
                            [--valid_batch_size VALID_BATCH_SIZE]
                            [--target_duration TARGET_DURATION]
                            [--num_eval NUM_EVAL] [--fold_length FOLD_LENGTH]
                            [--use_preprocessor USE_PREPROCESSOR]

speaker embedding extraction

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and
                        just creates mini-batches with a constant batch_size.
                        This sampler doesn't require any length information
                        for each feature. 'key_file' is just a text file
                        which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The first column is used, so 'shape_file' can be
                        used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the
                        first input so that the samples in each mini-batch
                        have similar lengths. This sampler requires a text
                        file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is used,
                        so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The
                        batch_size is decided by

                            batch_size = base_batch_size // (L // fold_length)

                        where L is the largest length of the samples in the
                        mini-batch. This sampler requires the same length
                        information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This
                        sampler makes mini-batches which have, as far as
                        possible, the same number of 'bins', counted by the
                        total lengths of each feature in the mini-batch. This
                        sampler requires a text file which describes the
                        length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is used,
                        so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size.
                        Just like LengthBatchSampler, this sampler makes
                        mini-batches which have, as far as possible, the same
                        number of 'bins', counted by the total number of
                        elements of each feature instead of the length. Thus
                        this sampler requires the full dimension information
                        of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        (default: folded)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length'
                        or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used
                        (default: None)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used
                        (default: None)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB,
                        20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as
                        opened for ark files. This feature is only valid when
                        data type is 'kaldi_ark'. (default: 32)
  --allow_multi_rates ALLOW_MULTI_RATES
                        Whether to allow audios to have different sampling
                        rates (default: False)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader.
                        e.g. 10MB, 20GB. If None, 5 percent of
                        --max_cache_size is used (default: None)
  --shape_file SHAPE_FILE
  --input_size INPUT_SIZE
                        The number of input dimensions of the feature
                        (default: None)
  --num_cohort_spk NUM_COHORT_SPK
                        The number of cohort speakers in score norm (default:
                        5994)
  --num_utt_per_spk NUM_UTT_PER_SPK
                        The number of utterances per speaker in score norm
                        (default: 10)
  --utt_select_sec UTT_SELECT_SEC
                        Minimum duration for including the utt in cohort set
                        in score norm (default: 8)
  --average_spk AVERAGE_SPK
                        whether to average cohort embeds per speaker in score
                        norm (default: False)
  --adaptive_cohort_size ADAPTIVE_COHORT_SIZE
                        top-k cohort size in score norm (default: 400)
  --qmf_dur_thresh QMF_DUR_THRESH
                        threshold of duration to be considered as long in qmf
                        trainset (default: 6)
  --qmf_num_trial_per_condition QMF_NUM_TRIAL_PER_CONDITION
                        number of trials per condition in qmf trainset
                        (default: 5000)
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
  --average_embd AVERAGE_EMBD
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature
                        requires pytorch>=1.6 (default: False)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model
                        forwarding or training (default: False)
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by the sample
                        lengths. To enable this, "shape_file" must have the
                        length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default:
                        descending)
  --drop_last_iter DROP_LAST_ITER
                        Exclude the minibatch with leftovers. (default: False)
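
The --batch_type descriptions above can be made concrete with a small sketch. It parses shape-file lines, applies the quoted "folded" formula, and contrasts what the "length" and "numel" modes count. This is an illustration only, not ESPnet's sampler code: `parse_shape_file` and `folded_batch_size` are hypothetical names, the base batch size of 32 and fold length of 500 are made-up values, and the guards for a zero divisor and a minimum batch size of 1 are assumptions of this sketch.

```python
import math
from typing import Dict, Tuple


def parse_shape_file(text: str) -> Dict[str, Tuple[int, ...]]:
    """Parse lines such as 'utterance_id_a 1000,80' into {id: (1000, 80)}."""
    shapes: Dict[str, Tuple[int, ...]] = {}
    for line in text.splitlines():
        if line.strip():
            utt_id, dims = line.split(maxsplit=1)
            shapes[utt_id] = tuple(int(d) for d in dims.split(","))
    return shapes


def folded_batch_size(base_batch_size: int, longest: int, fold_length: int) -> int:
    """batch_size = base_batch_size // (L // fold_length), as quoted above.

    Assumption of this sketch: when L < fold_length the divisor would be 0,
    so the base size is returned, and the result never drops below 1.
    """
    folds = longest // fold_length
    if folds == 0:
        return base_batch_size
    return max(1, base_batch_size // folds)


shapes = parse_shape_file(
    "utterance_id_a 1000,80\nutterance_id_b 1453,80\nutterance_id_c 1241,80\n"
)
longest = max(shape[0] for shape in shapes.values())

# "length" counts only the first dimension; "numel" counts every element.
length_total = sum(shape[0] for shape in shapes.values())
numel_total = sum(math.prod(shape) for shape in shapes.values())

print(folded_batch_size(32, longest, 500))
print(length_total, numel_total)
```

With 80-dimensional features the "numel" count is 80 times the "length" count, which is why --batch_bins typically needs a different scale for the two modes.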

The model configuration related:
  --spk_train_config SPK_TRAIN_CONFIG
                        SPK training configuration (default: None)
  --spk_model_file SPK_MODEL_FILE
                        SPK model parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If this option is specified,
                        *_train_config and *_file will be overwritten
                        (default: None)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", the environment variables
                        "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK"
                        are used. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default:
                        None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is
                        used if --multiprocessing_distributed=false (default:
                        None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This
                        value is used when dist_init_method == 'env://'
                        (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value
                        is used when dist_init_method == 'env://' (default:
                        None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default:
                        None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N
                        processes per node, which has N GPUs. This is the
                        fastest way to use PyTorch for either single node or
                        multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in
                        torch.nn.parallel.DistributedDataParallel (default:
                        False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale
                        (default: False)

trainer initialization related:
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default:
                        False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
  --use_lora USE_LORA   Enable LoRA based finetuning, see
                        (https://arxiv.org/abs/2106.09685) for large pre-
                        trained foundation models, like Whisper (default:
                        False)
  --save_lora_only SAVE_LORA_ONLY
                        Only save LoRA parameters or save all model parameters
                        (default: True)
  --lora_conf LORA_CONF
                        Configuration for LoRA based finetuning (default: {})

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

The inference hyperparameter related:
  --valid_batch_size VALID_BATCH_SIZE
                        The batch size for inference (default: 1)
  --target_duration TARGET_DURATION
                        Duration (in seconds) of samples in a minibatch
                        (default: 3.0)
  --num_eval NUM_EVAL   Number of segments to make from one utterance in the
                        inference phase (default: 10)
  --fold_length FOLD_LENGTH
  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
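
To show how the required and model-related options above fit together, here is a minimal sketch that assembles a command line for spk_embed_extract.py using only flags listed in the usage text. Several things are assumptions rather than facts from this page: that the script can be launched as the module `espnet2.bin.spk_embed_extract`, that --data_path_and_name_and_type takes a comma-separated "path,name,type" triple, and every path and value shown (all hypothetical placeholders). The helper name `build_spk_embed_extract_cmd` is also made up for this example.

```python
import shlex
from typing import List, Optional


def build_spk_embed_extract_cmd(
    output_dir: str,
    data_path_and_name_and_type: str,
    spk_train_config: Optional[str] = None,
    spk_model_file: Optional[str] = None,
    ngpu: int = 0,
) -> List[str]:
    """Assemble an argv list from flags that appear in the usage text."""
    cmd = [
        "python", "-m", "espnet2.bin.spk_embed_extract",  # assumed module path
        "--output_dir", output_dir,  # required per the usage line
        "--data_path_and_name_and_type", data_path_and_name_and_type,  # required
        "--ngpu", str(ngpu),  # 0 indicates CPU mode
    ]
    if spk_train_config is not None:
        cmd += ["--spk_train_config", spk_train_config]
    if spk_model_file is not None:
        cmd += ["--spk_model_file", spk_model_file]
    return cmd


# All paths below are hypothetical placeholders.
cmd = build_spk_embed_extract_cmd(
    output_dir="exp/spk_embed",
    data_path_and_name_and_type="dump/raw/test/wav.scp,speech,sound",
    spk_train_config="exp/spk_train/config.yaml",
    spk_model_file="exp/spk_train/model.pth",
)
print(shlex.join(cmd))
```

Building the command as a list and passing it to `subprocess.run(cmd, check=True)` avoids shell-quoting problems; a real run may need further options (for example --shape_file) depending on the recipe.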

spk_inference.py¶

usage: spk_inference.py [-h] [--config CONFIG]
                        [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                        --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                        [--dtype {float16,float32,float64}]
                        [--num_workers NUM_WORKERS]
                        --data_path_and_name_and_type
                        DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                        [--batch_size BATCH_SIZE]
                        [--train_config TRAIN_CONFIG]
                        [--model_file MODEL_FILE] [--model_tag MODEL_TAG]

Speaker Embedding Extraction

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)

The model configuration related:
  --train_config TRAIN_CONFIG
                        Speaker model training configuration (default: None)
  --model_file MODEL_FILE
                        Speaker model parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If this option is specified,
                        *_train_config and *_file will be overwritten
                        (default: None)

spk_train.py¶

usage: spk_train.py [-h] [--config CONFIG] [--print_config]
                    [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                    [--drop_last_iter DROP_LAST_ITER] [--dry_run DRY_RUN]
                    [--iterator_type {sequence,category,chunk,task,none}]
                    [--valid_iterator_type {sequence,category,chunk,task,none}]
                    [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                    [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                    [--dist_backend DIST_BACKEND]
                    [--dist_init_method DIST_INIT_METHOD]
                    [--dist_world_size DIST_WORLD_SIZE]
                    [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                    [--dist_master_addr DIST_MASTER_ADDR]
                    [--dist_master_port DIST_MASTER_PORT]
                    [--dist_launcher {slurm,mpi,None}]
                    [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                    [--unused_parameters UNUSED_PARAMETERS]
                    [--sharded_ddp SHARDED_DDP]
                    [--cudnn_enabled CUDNN_ENABLED]
                    [--cudnn_benchmark CUDNN_BENCHMARK]
                    [--cudnn_deterministic CUDNN_DETERMINISTIC]
                    [--collect_stats COLLECT_STATS]
                    [--write_collected_feats WRITE_COLLECTED_FEATS]
                    [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                    [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                    [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                    [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                    [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                    [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                    [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                    [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                    [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                    [--train_dtype {float16,float32,float64}]
                    [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                    [--use_matplotlib USE_MATPLOTLIB]
                    [--use_tensorboard USE_TENSORBOARD]
                    [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                    [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                    [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                    [--wandb_name WANDB_NAME]
                    [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                    [--detect_anomaly DETECT_ANOMALY]
                    [--use_adapter USE_ADAPTER] [--adapter {lora,houlsby}]
                    [--save_strategy {all,adapter_only,required_grad_only}]
                    [--adapter_conf ADAPTER_CONF]
                    [--pretrain_path PRETRAIN_PATH]
                    [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                    [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                    [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                    [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                    [--batch_size BATCH_SIZE]
                    [--valid_batch_size VALID_BATCH_SIZE]
                    [--batch_bins BATCH_BINS]
                    [--valid_batch_bins VALID_BATCH_BINS]
                    [--train_shape_file TRAIN_SHAPE_FILE]
                    [--valid_shape_file VALID_SHAPE_FILE]
                    [--batch_type {unsorted,sorted,folded,length,numel}]
                    [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                    [--fold_length FOLD_LENGTH]
                    [--sort_in_batch {descending,ascending}]
                    [--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
                    [--sort_batch {descending,ascending}]
                    [--multiple_iterator MULTIPLE_ITERATOR]
                    [--chunk_length CHUNK_LENGTH]
                    [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                    [--num_cache_chunks NUM_CACHE_CHUNKS]
                    [--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
                    [--chunk_default_fs CHUNK_DEFAULT_FS]
                    [--chunk_max_abs_length CHUNK_MAX_ABS_LENGTH]
                    [--chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES]
                    [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                    [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                    [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                    [--max_cache_size MAX_CACHE_SIZE]
                    [--max_cache_fd MAX_CACHE_FD]
                    [--allow_multi_rates ALLOW_MULTI_RATES]
                    [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                    [--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
                    [--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
                    [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                    [--optim_conf OPTIM_CONF]
                    [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                    [--scheduler_conf SCHEDULER_CONF]
                    [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                    [--use_preprocessor USE_PREPROCESSOR]
                    [--input_size INPUT_SIZE]
                    [--target_duration TARGET_DURATION] [--spk2utt SPK2UTT]
                    [--spk_num SPK_NUM] [--sample_rate SAMPLE_RATE]
                    [--num_eval NUM_EVAL] [--rir_scp RIR_SCP]
                    [--model_conf MODEL_CONF]
                    [--frontend {asteroid_frontend,default,fused,melspec_torch,sliding_window,s3prl,None}]
                    [--frontend_conf FRONTEND_CONF] [--specaug {specaug,None}]
                    [--specaug_conf SPECAUG_CONF]
                    [--normalize {global_mvn,utterance_mvn,None}]
                    [--normalize_conf NORMALIZE_CONF]
                    [--encoder {ecapa_tdnn,identity,mfaconformer,rawnet3,ska_tdnn,xvector}]
                    [--encoder_conf ENCODER_CONF]
                    [--pooling {chn_attn_stat,mean,stats}]
                    [--pooling_conf POOLING_CONF]
                    [--projector {rawnet3,ska_tdnn,xvector}]
                    [--projector_conf PROJECTOR_CONF]
                    [--preprocessor {common,spk}]
                    [--preprocessor_conf PREPROCESSOR_CONF]
                    [--loss {aamsoftmax,aamsoftmax_sc_topk}]
                    [--loss_conf LOSS_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --drop_last_iter DROP_LAST_ITER
                        Exclude the minibatch with leftovers. (default: False)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: sequence)
  --valid_iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: None)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images for plotting attention outputs. This option is meaningful only for attention-based models. Set it to 0 to disable the attention plot (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to enable find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
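
The env:// init method described above reads its rendezvous settings from four environment variables. The following is a minimal sketch of collecting them, not the ESPnet implementation; the function name `read_env_init` and the returned dictionary keys are hypothetical and exist only for illustration.

```python
import os
from typing import Mapping, Optional

# The variables that the --dist_init_method help text says are referred to.
REQUIRED_KEYS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def read_env_init(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the settings an env:// init method refers to (illustrative)."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if key not in env]
    if missing:
        raise KeyError(f"missing environment variables: {missing}")
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "world_size": int(env["WORLD_SIZE"]),
        "rank": int(env["RANK"]),
    }
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching the real environment.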

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Perform on "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair consisting of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for early stopping. Give a triple consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give a triple consisting of the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of the used p-norm for gradient clip. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
  --use_adapter USE_ADAPTER
                        Enable efficient finetuning, see (https://arxiv.org/abs/2106.09685) for large pre-trained foundation models, like Whisper and SSL models (default: False)
  --adapter {lora,houlsby}
                        Adapter Name (default: lora)
  --save_strategy {all,adapter_only,required_grad_only}
                        The strategy to save parameters. Default: 'all'
                        'all': save all parameters
                        'adapter_only': save only adapter parameters, without other parameters like downstream model
                        'required_grad_only': save only parameters with requires_grad=True
                         (default: all)
  --adapter_conf ADAPTER_CONF
                        Configuration for efficient finetuning (default: {})
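
The (phase, criterion, mode) triples used by --best_model_criterion, together with --keep_nbest_models, amount to ranking epochs by one recorded value. The sketch below shows that ranking under stated assumptions: the per-epoch `stats` layout and the function name `nbest_epochs` are hypothetical, and this is not the trainer's actual code.

```python
from typing import Dict, List


def nbest_epochs(
    stats: Dict[int, Dict[str, Dict[str, float]]],
    phase: str,
    criterion: str,
    mode: str,
    n: int,
) -> List[int]:
    """Return the n best epochs for one (phase, criterion, mode) triple."""
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max': {mode}")
    # Keep only the epochs that actually recorded this criterion.
    scored = [
        (epoch, values[phase][criterion])
        for epoch, values in stats.items()
        if criterion in values.get(phase, {})
    ]
    # "max" ranks larger values first, "min" ranks smaller values first.
    scored.sort(key=lambda pair: pair[1], reverse=(mode == "max"))
    return [epoch for epoch, _ in scored[:n]]
```

With the default criteria, the same function would be called once per triple, for example `("valid", "loss", "min")` and `("valid", "acc", "max")`.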

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
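
The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format of --init_param can be split into its fields as sketched below. This is an illustration only: the function name `parse_init_param` is hypothetical, and treating multiple exclude keys as comma-separated is an assumption that the help text does not state.

```python
def parse_init_param(spec: str) -> dict:
    """Split an --init_param value into its four optional fields."""
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':'-separated fields: {spec}")
    # Fields omitted on the right are treated as "not given".
    parts += [None] * (4 - len(parts))
    path, src_key, dst_key, excludes = parts
    return {
        "path": path,
        "src_key": src_key or None,
        "dst_key": dst_key or None,
        # Assumption: several exclude keys would be separated by commas.
        "exclude_keys": excludes.split(",") if excludes else [],
    }
```

For example, `parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed")` yields the decoder as both source and destination key, with `decoder.embed` excluded.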

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', 'folded', or 'catbel'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The first column is referred to, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is referred to, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L refers to the largest length of samples in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches which have as close to the same number of 'bins' as possible, counted by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is referred to, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches which have as close to the same number of 'bins' as possible, counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --shuffle_within_batch SHUFFLE_WITHIN_BATCH
                        Shuffle whole batches sample-wise. Normally required for classification tasks. (default: False)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
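
The "folded" rule quoted above, batch_size = base_batch_size // (L // fold_length), can be sketched as a small function. As written, the formula divides by zero whenever L is shorter than fold_length, so this sketch clamps the divisor to at least 1. That guard and the `min_batch_size` parameter are assumptions made for illustration, and the real FoldedBatchSampler may handle these cases differently.

```python
def folded_batch_size(
    base_batch_size: int,
    max_length: int,
    fold_length: int,
    min_batch_size: int = 1,
) -> int:
    """Shrink the batch size as the longest sample in the batch grows."""
    # Assumption: treat fewer than one full fold as exactly one fold,
    # which avoids dividing by zero for short samples.
    folds = max(1, max_length // fold_length)
    # Assumption: never return a batch smaller than min_batch_size.
    return max(min_batch_size, base_batch_size // folds)
```

With a base batch size of 20 and a fold length of 400, a batch whose longest sample has length 1000 would get `20 // (1000 // 400)`, that is 10 samples.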

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
  --chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
                        List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
                          - exactly match one of the prefixes in `chunk_excluded_key_prefixes`
                          - have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
  --chunk_default_fs CHUNK_DEFAULT_FS
                        Default sampling rate used for the chunk length. Will be used to adaptively adjust the chunk length for data of different sampling rates. (If None, the chunk length will be fixed.) (default: None)
  --chunk_max_abs_length CHUNK_MAX_ABS_LENGTH
                        Maximum number of samples per chunk for all sampling rates (default: None)
  --chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES
                        Discard samples shorter than the minimum chunk length (default: True)
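
The --chunk_length forms ('300', '300,400,500', '300-400') and the effect of --chunk_shift_ratio can be illustrated as follows. This is a sketch under assumptions, not the ChunkIterFactory code: the function names are hypothetical, the '-' range is assumed to include both ends, and chunk starts are laid out from offset 0 without any random offset.

```python
from typing import List


def parse_chunk_lengths(spec: str) -> List[int]:
    """Expand a --chunk_length value into its candidate lengths."""
    lengths: List[int] = []
    for item in spec.split(","):
        if "-" in item:
            low, high = (int(value) for value in item.split("-"))
            # Assumption: the range is inclusive on both ends.
            lengths.extend(range(low, high + 1))
        else:
            lengths.append(int(item))
    return lengths


def chunk_starts(seq_len: int, chunk_length: int, shift_ratio: float) -> List[int]:
    """Start indices of chunks that fit entirely inside the sequence."""
    shift = max(1, int(chunk_length * shift_ratio))
    # An empty list means the sequence is shorter than the chunk length.
    return list(range(0, seq_len - chunk_length + 1, shift))
```

With the defaults (chunk length 500, shift ratio 0.5), consecutive chunks start 250 frames apart and therefore overlap by half; a ratio of 2.0 would leave a gap of one chunk length between them.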

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio formats supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "multi_columns_sound":
                        Enable multi-column wav.scp. The following text file can be loaded as multi-channel audio data

                           utterance_id_a a.wav a2.wav
                           utterance_id_b b.wav b2.wav
                           ...

                        "variable_columns_sound":
                        Loading variable numbers (columns) of audios in wav.scp. The following text file can be loaded as stacked audio data

                           utterance_id_a a1.wav a2.wav a3.wav
                           utterance_id_b b1.wav
                           utterance_id_c c1.wav c2.wav
                           ...

                        Note that audios of different lengths will be right-padded with np.nan to the longest audio in the sample.
                        A preprocessor must be used to remove these paddings.

                        "score":
                        Return text as is. The text contains tempo and note info.
                        For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.

                           utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
                           utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integer numbers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integer numbers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "random_text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           hello world
                           foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader, currently supported for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys in the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept open for ark files. This feature is only valid when the data type is 'kaldi_ark'. (default: 32)
  --allow_multi_rates ALLOW_MULTI_RATES
                        Whether to allow audios to have different sampling rates (default: False)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
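
The 'path,name,type' triple and the line-oriented file types described above can be illustrated with a short sketch. It handles only the "text_int" layout (an utterance id followed by space-separated integers); the function names are hypothetical, and this is not the loader ESPnet itself uses.

```python
from typing import Dict, Iterable, List, Tuple


def parse_data_spec(spec: str) -> Tuple[str, str, str]:
    """Split 'some/path/a.scp,foo,sound' into (path, name, type)."""
    parts = spec.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected 'path,name,type' but got: {spec}")
    path, name, file_type = parts
    return path, name, file_type


def read_text_int(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Parse 'text_int' lines: an utterance id, then integers."""
    table: Dict[str, List[int]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, *values = line.split()
        table[key] = [int(value) for value in values]
    return table
```

Here `read_text_int` accepts any iterable of lines, so an open file object can be passed directly.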

Optimizer related:
  --exclude_weight_decay EXCLUDE_WEIGHT_DECAY
                        Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
  --exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
                        The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --input_size INPUT_SIZE
                        The number of input dimensions of the feature (default: None)
  --target_duration TARGET_DURATION
                        Duration (in seconds) of samples in a minibatch (default: 3.0)
  --spk2utt SPK2UTT     Directory of spk2utt file to be used in label mapping (default: )
  --spk_num SPK_NUM     specify the number of speakers during training (default: None)
  --sample_rate SAMPLE_RATE
                        Sampling rate (default: 16000)
  --num_eval NUM_EVAL   Number of segments to make from one utterance in the inference phase (default: 10)
  --rir_scp RIR_SCP     Directory of the rir data to be augmented (default: )
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {})
  --frontend {asteroid_frontend,default,fused,melspec_torch,sliding_window,s3prl,None}
                        The frontend type (default: None)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: None)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --encoder {ecapa_tdnn,identity,mfaconformer,rawnet3,ska_tdnn,xvector}
                        The encoder type (default: rawnet3)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --pooling {chn_attn_stat,mean,stats}
                        The pooling type (default: chn_attn_stat)
  --pooling_conf POOLING_CONF
                        The keyword arguments for pooling (default: {})
  --projector {rawnet3,ska_tdnn,xvector}
                        The projector type (default: rawnet3)
  --projector_conf PROJECTOR_CONF
                        The keyword arguments for projector (default: {})
  --preprocessor {common,spk}
                        The preprocessor type (default: spk)
  --preprocessor_conf PREPROCESSOR_CONF
                        The keyword arguments for preprocessor (default: {})
  --loss {aamsoftmax,aamsoftmax_sc_topk}
                        The loss type (default: aamsoftmax)
  --loss_conf LOSS_CONF
                        The keyword arguments for loss (default: {})

split_scps.py¶

usage: split_scps.py [-h]
                     [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                     --scps SCPS [SCPS ...] [--names NAMES [NAMES ...]]
                     [--num_splits NUM_SPLITS] --output_dir OUTPUT_DIR

Split scp files

optional arguments:
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --scps SCPS [SCPS ...]
                        Input texts (default: None)
  --names NAMES [NAMES ...]
                        Output names for each file (default: None)
  --num_splits NUM_SPLITS
                        Split number (default: None)
  --output_dir OUTPUT_DIR
                        Output directory (default: None)

st_inference.py¶

usage: st_inference.py [-h] [--config CONFIG]
                       [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                       --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                       [--dtype {float16,float32,float64}]
                       [--num_workers NUM_WORKERS]
                       --data_path_and_name_and_type
                       DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                       [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                       [--st_train_config ST_TRAIN_CONFIG]
                       [--st_model_file ST_MODEL_FILE]
                       [--lm_train_config LM_TRAIN_CONFIG]
                       [--src_lm_train_config SRC_LM_TRAIN_CONFIG]
                       [--lm_file LM_FILE] [--src_lm_file SRC_LM_FILE]
                       [--word_lm_train_config WORD_LM_TRAIN_CONFIG]
                       [--src_word_lm_train_config SRC_WORD_LM_TRAIN_CONFIG]
                       [--word_lm_file WORD_LM_FILE]
                       [--src_word_lm_file SRC_WORD_LM_FILE]
                       [--ngram_file NGRAM_FILE]
                       [--src_ngram_file SRC_NGRAM_FILE]
                       [--model_tag MODEL_TAG] [--enh_s2t_task ENH_S2T_TASK]
                       [--batch_size BATCH_SIZE] [--nbest NBEST]
                       [--asr_nbest ASR_NBEST] [--beam_size BEAM_SIZE]
                       [--asr_beam_size ASR_BEAM_SIZE] [--penalty PENALTY]
                       [--asr_penalty ASR_PENALTY] [--maxlenratio MAXLENRATIO]
                       [--asr_maxlenratio ASR_MAXLENRATIO]
                       [--minlenratio MINLENRATIO]
                       [--asr_minlenratio ASR_MINLENRATIO]
                       [--lm_weight LM_WEIGHT] [--asr_lm_weight ASR_LM_WEIGHT]
                       [--ngram_weight NGRAM_WEIGHT]
                       [--asr_ngram_weight ASR_NGRAM_WEIGHT]
                       [--ctc_weight CTC_WEIGHT]
                       [--asr_ctc_weight ASR_CTC_WEIGHT]
                       [--transducer_conf TRANSDUCER_CONF]
                       [--token_type {char,bpe,None}]
                       [--src_token_type {char,bpe,None}]
                       [--bpemodel BPEMODEL] [--src_bpemodel SRC_BPEMODEL]
                       [--ctc_greedy CTC_GREEDY]
                       [--hugging_face_decoder HUGGING_FACE_DECODER]
                       [--hugging_face_decoder_max_length HUGGING_FACE_DECODER_MAX_LENGTH]
                       [--normalize_length NORMALIZE_LENGTH]

ST Decoding

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --st_train_config ST_TRAIN_CONFIG
                        ST training configuration (default: None)
  --st_model_file ST_MODEL_FILE
                        ST model parameter file (default: None)
  --lm_train_config LM_TRAIN_CONFIG
                        LM training configuration (default: None)
  --src_lm_train_config SRC_LM_TRAIN_CONFIG
                        LM training configuration (default: None)
  --lm_file LM_FILE     LM parameter file (default: None)
  --src_lm_file SRC_LM_FILE
                        LM parameter file (default: None)
  --word_lm_train_config WORD_LM_TRAIN_CONFIG
                        Word LM training configuration (default: None)
  --src_word_lm_train_config SRC_WORD_LM_TRAIN_CONFIG
                        Word LM training configuration (default: None)
  --word_lm_file WORD_LM_FILE
                        Word LM parameter file (default: None)
  --src_word_lm_file SRC_WORD_LM_FILE
                        Word LM parameter file (default: None)
  --ngram_file NGRAM_FILE
                        N-gram parameter file (default: None)
  --src_ngram_file SRC_NGRAM_FILE
                        N-gram parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If this option is specified,
                        *_train_config and *_file will be overwritten
                        (default: None)
  --enh_s2t_task ENH_S2T_TASK
                        enhancement and asr joint model (default: False)

Beam-search related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --nbest NBEST         Output N-best hypotheses (default: 1)
  --asr_nbest ASR_NBEST
                        Output N-best hypotheses (default: 1)
  --beam_size BEAM_SIZE
                        Beam size (default: 20)
  --asr_beam_size ASR_BEAM_SIZE
                        Beam size (default: 20)
  --penalty PENALTY     Insertion penalty (default: 0.0)
  --asr_penalty ASR_PENALTY
                        Insertion penalty (default: 0.0)
  --maxlenratio MAXLENRATIO
                        Input length ratio to obtain max output length. If
                        maxlenratio=0.0 (default), it uses an end-detect
                        function to automatically find maximum hypothesis
                        lengths. If maxlenratio<0.0, its absolute value is
                        interpreted as a constant max output length (default:
                        0.0)
  --asr_maxlenratio ASR_MAXLENRATIO
                        Input length ratio to obtain max output length. If
                        maxlenratio=0.0 (default), it uses an end-detect
                        function to automatically find maximum hypothesis
                        lengths. If maxlenratio<0.0, its absolute value is
                        interpreted as a constant max output length (default:
                        0.0)
  --minlenratio MINLENRATIO
                        Input length ratio to obtain min output length
                        (default: 0.0)
  --asr_minlenratio ASR_MINLENRATIO
                        Input length ratio to obtain min output length
                        (default: 0.0)
  --lm_weight LM_WEIGHT
                        RNNLM weight (default: 1.0)
  --asr_lm_weight ASR_LM_WEIGHT
                        RNNLM weight (default: 1.0)
  --ngram_weight NGRAM_WEIGHT
                        ngram weight (default: 0.9)
  --asr_ngram_weight ASR_NGRAM_WEIGHT
                        ngram weight (default: 0.9)
  --ctc_weight CTC_WEIGHT
                        ST CTC weight (default: 0.0)
  --asr_ctc_weight ASR_CTC_WEIGHT
                        ASR CTC weight (default: 0.3)
  --transducer_conf TRANSDUCER_CONF
                        The keyword arguments for transducer beam search.
                        (default: None)

Text converter related:
  --token_type {char,bpe,None}
                        The token type for ST model. If not given, it is taken
                        from the training args (default: None)
  --src_token_type {char,bpe,None}
                        The token type for ST model. If not given, it is taken
                        from the training args (default: None)
  --bpemodel BPEMODEL   The model path of sentencepiece. If not given, it is
                        taken from the training args (default: None)
  --src_bpemodel SRC_BPEMODEL
                        The model path of sentencepiece. If not given, it is
                        taken from the training args (default: None)
  --ctc_greedy CTC_GREEDY
  --hugging_face_decoder HUGGING_FACE_DECODER
  --hugging_face_decoder_max_length HUGGING_FACE_DECODER_MAX_LENGTH
  --normalize_length NORMALIZE_LENGTH
                        If true, best hypothesis is selected by length-
                        normalized scores (default: False)
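
The --maxlenratio and --minlenratio options described above can be read as a simple rule on the input length. The following is a minimal sketch of that documented behaviour, not ESPnet's own code: the function name `length_bounds` is hypothetical, a maximum of `None` stands in for "decided by the end-detect function", and the floor of 1 on a positive ratio is an assumption of this sketch.

```python
from typing import Optional, Tuple


def length_bounds(
    input_length: int, maxlenratio: float = 0.0, minlenratio: float = 0.0
) -> Tuple[int, Optional[int]]:
    """Return (min_len, max_len); max_len=None means end detection decides."""
    if maxlenratio == 0.0:
        # Default: no fixed maximum, an end-detect function stops decoding.
        max_len = None
    elif maxlenratio < 0.0:
        # Negative value: the absolute value is a constant maximum length.
        max_len = int(-maxlenratio)
    else:
        # Positive value: a ratio of the input length (floor of 1 is assumed).
        max_len = max(1, int(maxlenratio * input_length))
    min_len = int(minlenratio * input_length)
    return min_len, max_len
```

For example, `length_bounds(200, maxlenratio=-30.0)` gives a fixed maximum of 30 tokens regardless of the input length.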

st_train.py¶

usage: st_train.py [-h] [--config CONFIG] [--print_config]
                   [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                   [--drop_last_iter DROP_LAST_ITER] [--dry_run DRY_RUN]
                   [--iterator_type {sequence,category,chunk,task,none}]
                   [--valid_iterator_type {sequence,category,chunk,task,none}]
                   [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                   [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                   [--dist_backend DIST_BACKEND]
                   [--dist_init_method DIST_INIT_METHOD]
                   [--dist_world_size DIST_WORLD_SIZE] [--dist_rank DIST_RANK]
                   [--local_rank LOCAL_RANK]
                   [--dist_master_addr DIST_MASTER_ADDR]
                   [--dist_master_port DIST_MASTER_PORT]
                   [--dist_launcher {slurm,mpi,None}]
                   [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                   [--unused_parameters UNUSED_PARAMETERS]
                   [--sharded_ddp SHARDED_DDP] [--cudnn_enabled CUDNN_ENABLED]
                   [--cudnn_benchmark CUDNN_BENCHMARK]
                   [--cudnn_deterministic CUDNN_DETERMINISTIC]
                   [--collect_stats COLLECT_STATS]
                   [--write_collected_feats WRITE_COLLECTED_FEATS]
                   [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                   [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                   [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                   [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                   [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                   [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                   [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                   [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                   [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                   [--train_dtype {float16,float32,float64}]
                   [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                   [--use_matplotlib USE_MATPLOTLIB]
                   [--use_tensorboard USE_TENSORBOARD]
                   [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                   [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                   [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                   [--wandb_name WANDB_NAME]
                   [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                   [--detect_anomaly DETECT_ANOMALY]
                   [--use_adapter USE_ADAPTER] [--adapter {lora,houlsby}]
                   [--save_strategy {all,adapter_only,required_grad_only}]
                   [--adapter_conf ADAPTER_CONF]
                   [--pretrain_path PRETRAIN_PATH]
                   [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                   [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                   [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                   [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                   [--batch_size BATCH_SIZE]
                   [--valid_batch_size VALID_BATCH_SIZE]
                   [--batch_bins BATCH_BINS]
                   [--valid_batch_bins VALID_BATCH_BINS]
                   [--train_shape_file TRAIN_SHAPE_FILE]
                   [--valid_shape_file VALID_SHAPE_FILE]
                   [--batch_type {unsorted,sorted,folded,length,numel}]
                   [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                   [--fold_length FOLD_LENGTH]
                   [--sort_in_batch {descending,ascending}]
                   [--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
                   [--sort_batch {descending,ascending}]
                   [--multiple_iterator MULTIPLE_ITERATOR]
                   [--chunk_length CHUNK_LENGTH]
                   [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                   [--num_cache_chunks NUM_CACHE_CHUNKS]
                   [--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
                   [--chunk_default_fs CHUNK_DEFAULT_FS]
                   [--chunk_max_abs_length CHUNK_MAX_ABS_LENGTH]
                   [--chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES]
                   [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                   [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                   [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                   [--max_cache_size MAX_CACHE_SIZE]
                   [--max_cache_fd MAX_CACHE_FD]
                   [--allow_multi_rates ALLOW_MULTI_RATES]
                   [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                   [--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
                   [--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
                   [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                   [--optim_conf OPTIM_CONF]
                   [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                   [--scheduler_conf SCHEDULER_CONF] [--token_list TOKEN_LIST]
                   [--src_token_list SRC_TOKEN_LIST]
                   [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                   [--input_size INPUT_SIZE] [--ctc_conf CTC_CONF]
                   [--st_joint_net_conf ST_JOINT_NET_CONF]
                   [--model_conf MODEL_CONF]
                   [--use_preprocessor USE_PREPROCESSOR]
                   [--token_type {bpe,char,word,phn,hugging_face,whisper_en,whisper_multilingual}]
                   [--src_token_type {bpe,char,word,phn,none,whisper_en,whisper_multilingual}]
                   [--bpemodel BPEMODEL] [--src_bpemodel SRC_BPEMODEL]
                   [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                   [--cleaner {None,tacotron,jaconv,vietnamese,whisper_en,whisper_basic}]
                   [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                   [--src_g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                   [--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
                   [--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
                   [--noise_scp NOISE_SCP]
                   [--noise_apply_prob NOISE_APPLY_PROB]
                   [--noise_db_range NOISE_DB_RANGE]
                   [--short_noise_thres SHORT_NOISE_THRES]
                   [--ctc_sample_rate CTC_SAMPLE_RATE]
                   [--frontend {default,sliding_window,s3prl}]
                   [--frontend_conf FRONTEND_CONF] [--specaug {specaug,None}]
                   [--specaug_conf SPECAUG_CONF]
                   [--normalize {global_mvn,utterance_mvn,None}]
                   [--normalize_conf NORMALIZE_CONF]
                   [--preencoder {sinc,linear,None}]
                   [--preencoder_conf PREENCODER_CONF]
                   [--encoder {conformer,transformer,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,branchformer,e_branchformer,whisper}]
                   [--encoder_conf ENCODER_CONF]
                   [--postencoder {hugging_face_transformers,length_adaptor,None}]
                   [--postencoder_conf POSTENCODER_CONF]
                   [--decoder {transformer,transformer_md,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,transducer,whisper,hugging_face_transformers}]
                   [--decoder_conf DECODER_CONF]
                   [--extra_asr_decoder {transformer,transformer_md,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,None}]
                   [--extra_asr_decoder_conf EXTRA_ASR_DECODER_CONF]
                   [--extra_mt_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,None}]
                   [--extra_mt_decoder_conf EXTRA_MT_DECODER_CONF]
                   [--md_encoder {conformer,transformer,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,branchformer,e_branchformer,None}]
                   [--md_encoder_conf MD_ENCODER_CONF]
                   [--hier_encoder {conformer,transformer,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,branchformer,e_branchformer,None}]
                   [--hier_encoder_conf HIER_ENCODER_CONF]
                   [--extra_mt_encoder {conformer,transformer,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,branchformer,e_branchformer,hugging_face_transformers,None}]
                   [--extra_mt_encoder_conf EXTRA_MT_ENCODER_CONF]
                   [--preprocessor {default}]
                   [--preprocessor_conf PREPROCESSOR_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --drop_last_iter DROP_LAST_ITER
                        Exclude the minibatch with leftovers. (default: False)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: sequence)
  --valid_iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: None)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images to plot from the attention outputs. This option only makes sense for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Perform on "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair consisting of the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used to decide early stopping. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used to select the best model. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        Flag to enable noise injection into gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
  --use_adapter USE_ADAPTER
                        Enable efficient finetuning, see (https://arxiv.org/abs/2106.09685) for large pre-trained foundation models, like Whisper and SSL models (default: False)
  --adapter {lora,houlsby}
                        Adapter Name (default: lora)
  --save_strategy {all,adapter_only,required_grad_only}
                        The strategy to save parameters. Default: 'all'
                        'all': save all parameters
                        'adapter_only': save only adapter parameters, without other parameters like downstream model
                        'required_grad_only': save only parameters with requires_grad=True
                         (default: all)
  --adapter_conf ADAPTER_CONF
                        Configuration for efficient finetuning (default: {})

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
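
The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format accepted by --init_param can be illustrated with a small parser. This is a sketch for reading the format only, not the loader ESPnet uses: `InitParamSpec` and `parse_init_param` are hypothetical names, and treating exclude_keys as a comma-separated list is an assumption. Paths that themselves contain ':' are not handled.

```python
from typing import List, NamedTuple, Optional


class InitParamSpec(NamedTuple):
    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: List[str]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into its fields."""
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"Too many ':'-separated fields in {spec!r}")
    # Missing trailing fields are treated as empty.
    parts += [""] * (4 - len(parts))
    file_path, src_key, dst_key, excludes = parts
    return InitParamSpec(
        file_path=file_path,
        src_key=src_key or None,
        dst_key=dst_key or None,
        # Assumption: several excluded keys are separated by commas.
        exclude_keys=excludes.split(",") if excludes else [],
    )


spec = parse_init_param("some/where/model.pth:decoder:decoder:decoder.embed")
print(spec.src_key, spec.dst_key, spec.exclude_keys)
```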

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', 'folded', or 'catbel'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no special features and simply creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        Only the first column is used, so a 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is used, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L refers to the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches that have as close to the same number of 'bins' as possible, counted by the total length of each feature in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is used, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches that have as close to the same number of 'bins' as possible, but counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --shuffle_within_batch SHUFFLE_WITHIN_BATCH
                        Shuffle whole batches sample-wise. Normally required for classification tasks. (default: False)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
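
The "folded" rule above, batch_size = base_batch_size // (L // fold_length), can be tried out with a few numbers. The sketch below applies the formula as written in the help text. The function name `folded_batch_size` is hypothetical, and the two guards (keeping the base size when L is shorter than fold_length, and never returning less than 1) are assumptions of this sketch, so the actual sampler may differ in detail.

```python
def folded_batch_size(base_batch_size: int, max_length: int, fold_length: int) -> int:
    """Apply batch_size = base_batch_size // (L // fold_length)."""
    if fold_length <= 0:
        raise ValueError("fold_length must be positive")
    folds = max_length // fold_length
    if folds == 0:
        # Assumption: L < fold_length leaves the base batch size unchanged
        # (the literal formula would divide by zero here).
        return base_batch_size
    # Assumption: a mini-batch always holds at least one sample.
    return max(1, base_batch_size // folds)


# Longer samples lead to smaller mini-batches.
for longest in (300, 1453, 15000):
    print(longest, folded_batch_size(20, longest, fold_length=500))
```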

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it's less than 1, chunks overlap, and if it's greater than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
  --chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
                        List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
                          - exactly match one of the prefixes in `chunk_excluded_key_prefixes`
                          - have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
  --chunk_default_fs CHUNK_DEFAULT_FS
                        Default sampling rate used for the chunk length. Will be used to adaptively adjust the chunk length for data of different sampling rates. (If None, the chunk length will be fixed.) (default: None)
  --chunk_max_abs_length CHUNK_MAX_ABS_LENGTH
                        Maximum number of samples per chunk for all sampling rates (default: None)
  --chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES
                        Discard samples shorter than the minimum chunk length (default: True)
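
The three --chunk_length forms ('300', '300,400,500', '300-400') can be expanded into a list of candidate lengths as sketched below. This is an illustration of the documented syntax rather than the ChunkIterFactory code: `parse_chunk_length` and `pick_chunk_length` are hypothetical names, treating the '-' range as inclusive is an assumption, and so is choosing only among candidates that fit the sequence.

```python
import random
from typing import List, Optional


def parse_chunk_length(spec: str) -> List[int]:
    """Expand a --chunk_length string into the list of candidate lengths."""
    candidates: List[int] = []
    for item in spec.split(","):
        bounds = [int(v) for v in item.split("-")]
        if len(bounds) == 1:
            candidates.append(bounds[0])
        elif len(bounds) == 2 and bounds[0] <= bounds[1]:
            # Assumption: 'a-b' covers every integer from a to b inclusive.
            candidates.extend(range(bounds[0], bounds[1] + 1))
        else:
            raise ValueError(f"Invalid chunk_length item: {item!r}")
    return candidates


def pick_chunk_length(
    spec: str, sequence_length: int, rng: random.Random
) -> Optional[int]:
    """Pick one candidate at random; None means the sample is discarded."""
    usable = [c for c in parse_chunk_length(spec) if c <= sequence_length]
    return rng.choice(usable) if usable else None


print(parse_chunk_length("300,400,500"))
print(pick_chunk_length("300,400,500", 200, random.Random(0)))
```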

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types supported by sndfile, e.g. wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "multi_columns_sound":
                        Enable a multi-column wav.scp. The following text file can be loaded as multi-channel audio data:

                           utterance_id_a a.wav a2.wav
                           utterance_id_b b.wav b2.wav
                           ...

                        "variable_columns_sound":
                        Loading variable numbers (columns) of audios in wav.scp. The following text file can be loaded as stacked audio data

                           utterance_id_a a1.wav a2.wav a3.wav
                           utterance_id_b b1.wav
                           utterance_id_c c1.wav c2.wav
                           ...

                        Note that audios of different lengths will be right-padded with np.nan to the longest audio in the sample.
                        A preprocessor must be used to remove these paddings.

                        "score":
                        Return text as is. The text contains tempo and note info.
                        For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.

                           utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
                           utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "random_text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           hello world
                           foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate a random int-ndarray with the shape given in the file. The lower and upper values are given by the file type, e.g. rand_int_0_10 generates integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader, currently supported for speaker diarization.

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys in the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept open for ark files. This feature is only valid when the data type is 'kaldi_ark'. (default: 32)
  --allow_multi_rates ALLOW_MULTI_RATES
                        Whether to allow audios to have different sampling rates (default: False)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for the validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
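
The 'path,name,type' triple accepted by --train_data_path_and_name_and_type and the "text_int" line layout listed above can be illustrated with a small sketch. The names DataSpec, parse_data_spec, and parse_text_int_line are hypothetical helpers written for this example. They are not part of ESPnet2, and the real loaders may validate their input differently.

```python
from typing import List, NamedTuple, Tuple


class DataSpec(NamedTuple):
    path: str       # file path, e.g. some/path/a.scp
    name: str       # key name used for the mini-batch data
    file_type: str  # one of the supported file types, e.g. sound


def parse_data_spec(value: str) -> DataSpec:
    """Split a 'path,name,type' string into its three fields."""
    parts = value.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected 'path,name,type', got {value!r}")
    return DataSpec(*parts)


def parse_text_int_line(line: str) -> Tuple[str, List[int]]:
    """Parse one 'text_int' line: an utterance id followed by integers."""
    utt_id, *tokens = line.split()
    return utt_id, [int(t) for t in tokens]


spec = parse_data_spec("some/path/a.scp,foo,sound")
utt_id, ids = parse_text_int_line("utterance_id_A 12 0 1 3")
print(spec)
print(utt_id, ids)
```

Because the option is repeatable, a caller could apply parse_data_spec once per occurrence and collect the results in a list.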

Optimizer related:
  --exclude_weight_decay EXCLUDE_WEIGHT_DECAY
                        Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
  --exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
                        The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --token_list TOKEN_LIST
                        A text mapping int-id to token (for target language) (default: None)
  --src_token_list SRC_TOKEN_LIST
                        A text mapping int-id to token (for source language) (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --input_size INPUT_SIZE
                        The number of input dimensions of the feature (default: None)
  --ctc_conf CTC_CONF   The keyword arguments for CTC class. (default: {'dropout_rate': 0.0, 'ctc_type': 'builtin', 'reduce': True, 'ignore_nan_grad': None, 'zero_infinity': True, 'brctc_risk_strategy': 'exp', 'brctc_group_strategy': 'end', 'brctc_risk_factor': 0.0})
  --st_joint_net_conf ST_JOINT_NET_CONF
                        The keyword arguments for joint network class. (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'asr_weight': 0.0, 'mt_weight': 0.0, 'mtlalpha': 0.0, 'st_mtlalpha': 0.0, 'ignore_id': -1, 'tgt_ignore_id': -1, 'lsm_weight': 0.0, 'length_normalized_loss': False, 'report_cer': True, 'report_wer': True, 'report_bleu': True, 'sym_space': '<space>', 'sym_blank': '<blank>', 'tgt_sym_space': '<space>', 'tgt_sym_blank': '<blank>', 'extract_feats_in_collect_stats': True, 'ctc_sample_rate': 0.0, 'tgt_sym_sos': '<sos/eos>', 'tgt_sym_eos': '<sos/eos>', 'lang_token_id': -1})

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn,hugging_face,whisper_en,whisper_multilingual}
                        The target text will be tokenized at the specified token level (default: bpe)
  --src_token_type {bpe,char,word,phn,none,whisper_en,whisper_multilingual}
                        The source text will be tokenized at the specified token level (default: bpe)
  --bpemodel BPEMODEL   The model file of sentencepiece (for target language) (default: None)
  --src_bpemodel SRC_BPEMODEL
                        The model file of sentencepiece (for source language) (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese,whisper_en,whisper_basic}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)
  --src_g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)
  --speech_volume_normalize SPEECH_VOLUME_NORMALIZE
                        Scale the maximum amplitude to the given value. (default: None)
  --rir_scp RIR_SCP     The file path of rir scp file. (default: None)
  --rir_apply_prob RIR_APPLY_PROB
                        The probability of applying RIR convolution. (default: 1.0)
  --noise_scp NOISE_SCP
                        The file path of noise scp file. (default: None)
  --noise_apply_prob NOISE_APPLY_PROB
                        The probability of applying noise addition. (default: 1.0)
  --noise_db_range NOISE_DB_RANGE
                        The range of noise decibel level. (default: 13_15)
  --short_noise_thres SHORT_NOISE_THRES
                        If len(noise) / len(speech) is smaller than this threshold during dynamic mixing, a warning will be displayed. (default: 0.5)
  --ctc_sample_rate CTC_SAMPLE_RATE
                        Sample greedy CTC output as AR decoder target. (default: 0.0)
  --frontend {default,sliding_window,s3prl}
                        The frontend type (default: default)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: utterance_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --preencoder {sinc,linear,None}
                        The preencoder type (default: None)
  --preencoder_conf PREENCODER_CONF
                        The keyword arguments for preencoder (default: {})
  --encoder {conformer,transformer,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,branchformer,e_branchformer,whisper}
                        The encoder type (default: rnn)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --postencoder {hugging_face_transformers,length_adaptor,None}
                        The postencoder type (default: None)
  --postencoder_conf POSTENCODER_CONF
                        The keyword arguments for postencoder (default: {})
  --decoder {transformer,transformer_md,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,transducer,whisper,hugging_face_transformers}
                        The decoder type (default: rnn)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})
  --extra_asr_decoder {transformer,transformer_md,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,None}
                        The extra_asr_decoder type (default: None)
  --extra_asr_decoder_conf EXTRA_ASR_DECODER_CONF
                        The keyword arguments for extra_asr_decoder (default: {})
  --extra_mt_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,None}
                        The extra_mt_decoder type (default: None)
  --extra_mt_decoder_conf EXTRA_MT_DECODER_CONF
                        The keyword arguments for extra_mt_decoder (default: {})
  --md_encoder {conformer,transformer,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,branchformer,e_branchformer,None}
                        The md_encoder type (default: None)
  --md_encoder_conf MD_ENCODER_CONF
                        The keyword arguments for md_encoder (default: {})
  --hier_encoder {conformer,transformer,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,branchformer,e_branchformer,None}
                        The hier_encoder type (default: None)
  --hier_encoder_conf HIER_ENCODER_CONF
                        The keyword arguments for hier_encoder (default: {})
  --extra_mt_encoder {conformer,transformer,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,branchformer,e_branchformer,hugging_face_transformers,None}
                        The extra_mt_encoder type (default: None)
  --extra_mt_encoder_conf EXTRA_MT_ENCODER_CONF
                        The keyword arguments for extra_mt_encoder (default: {})
  --preprocessor {default}
                        The preprocessor type (default: default)
  --preprocessor_conf PREPROCESSOR_CONF
                        The keyword arguments for preprocessor (default: {})
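
The --noise_db_range value in the list above (default: 13_15) can be read as a range of decibel levels. The following is a minimal sketch under two assumptions that the help text does not state: that the underscore separates a lower and an upper bound, and that the sampled value is used as a signal-to-noise ratio when the noise is added. All helper names are hypothetical, and the signals are synthetic.

```python
import math
import random


def parse_db_range(value):
    """Split a 'low_high' string such as '13_15' into two floats."""
    low, high = (float(v) for v in value.split("_"))
    return low, high


def mean_power(signal):
    """Average squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def noise_scale(speech, noise, snr_db):
    """Gain for the noise so that the speech-to-noise power ratio is snr_db."""
    ratio = mean_power(speech) / max(mean_power(noise), 1e-10)
    return 10 ** (-snr_db / 20) * math.sqrt(ratio)


rng = random.Random(0)
low, high = parse_db_range("13_15")
snr_db = rng.uniform(low, high)  # one level drawn from the range

speech = [math.sin(0.05 * i) for i in range(2000)]     # synthetic "speech"
noise = [rng.uniform(-1.0, 1.0) for _ in range(2000)]  # synthetic noise

scale = noise_scale(speech, noise, snr_db)
scaled_noise = [scale * n for n in noise]
mixture = [s + n for s, n in zip(speech, scaled_noise)]

# Check that the scaled noise sits at the requested level below the speech.
achieved = 10 * math.log10(mean_power(speech) / mean_power(scaled_noise))
print(round(snr_db, 3), round(achieved, 3))
```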

tts2_inference.py¶

usage: tts2_inference.py [-h] [--config CONFIG]
                         [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                         --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                         [--dtype {float16,float32,float64}]
                         [--num_workers NUM_WORKERS] [--batch_size BATCH_SIZE]
                         --data_path_and_name_and_type
                         DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                         [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                         [--train_config TRAIN_CONFIG]
                         [--model_file MODEL_FILE] [--model_tag MODEL_TAG]
                         [--maxlenratio MAXLENRATIO]
                         [--minlenratio MINLENRATIO] [--threshold THRESHOLD]
                         [--use_att_constraint USE_ATT_CONSTRAINT]
                         [--backward_window BACKWARD_WINDOW]
                         [--forward_window FORWARD_WINDOW]
                         [--use_teacher_forcing USE_TEACHER_FORCING]
                         [--speed_control_alpha SPEED_CONTROL_ALPHA]
                         [--noise_scale NOISE_SCALE]
                         [--noise_scale_dur NOISE_SCALE_DUR]
                         [--always_fix_seed ALWAYS_FIX_SEED]
                         [--vocoder_config VOCODER_CONFIG]
                         [--vocoder_file VOCODER_FILE]
                         [--vocoder_tag VOCODER_TAG]

TTS inference

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
                        The path of output directory (default: None)
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --speed_control_alpha SPEED_CONTROL_ALPHA
                        Alpha in FastSpeech to change the speed of generated
                        speech (default: 1.0)
  --noise_scale NOISE_SCALE
                        Noise scale parameter for the flow in vits (default:
                        0.667)
  --noise_scale_dur NOISE_SCALE_DUR
                        Noise scale parameter for the stochastic duration
                        predictor in vits (default: 0.8)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --train_config TRAIN_CONFIG
                        Training configuration file (default: None)
  --model_file MODEL_FILE
                        Model parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If this option is specified,
                        train_config and model_file will be overwritten
                        (default: None)

Decoding related:
  --maxlenratio MAXLENRATIO
                        Maximum length ratio in decoding (default: 10.0)
  --minlenratio MINLENRATIO
                        Minimum length ratio in decoding (default: 0.0)
  --threshold THRESHOLD
                        Threshold value in decoding (default: 0.5)
  --use_att_constraint USE_ATT_CONSTRAINT
                        Whether to use attention constraint (default: False)
  --backward_window BACKWARD_WINDOW
                        Backward window value in attention constraint
                        (default: 1)
  --forward_window FORWARD_WINDOW
                        Forward window value in attention constraint (default:
                        3)
  --use_teacher_forcing USE_TEACHER_FORCING
                        Whether to use teacher forcing (default: False)
  --always_fix_seed ALWAYS_FIX_SEED
                        Whether to always fix seed (default: False)
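
One plausible reading of --backward_window and --forward_window is that, when the attention constraint is enabled, attention scores outside a window around the previously attended position are masked out. The sketch below illustrates that idea only. The function name and the exact window boundaries are assumptions for this example, not the tool's implementation.

```python
import math


def apply_attention_window(scores, last_idx, backward_window=1, forward_window=3):
    """Mask scores outside [last_idx - backward, last_idx + forward)."""
    lo = max(0, last_idx - backward_window)
    hi = min(len(scores), last_idx + forward_window)
    # Positions outside the window get -inf, so a softmax would ignore them.
    return [s if lo <= i < hi else -math.inf for i, s in enumerate(scores)]


scores = [0.1, 0.4, 0.2, 0.9, 0.3, 0.5, 0.7, 0.6]
masked = apply_attention_window(scores, last_idx=4)
allowed = [i for i, s in enumerate(masked) if s != -math.inf]
print(allowed)
```

A narrow window of this kind keeps the alignment close to monotonic, which is the usual motivation for constraining attention during synthesis.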

Vocoder related:
  --vocoder_config VOCODER_CONFIG
                        Vocoder configuration file (default: None)
  --vocoder_file VOCODER_FILE
                        Vocoder parameter file (default: None)
  --vocoder_tag VOCODER_TAG
                        Pretrained vocoder tag. If this option is specified,
                        vocoder_config and vocoder_file will be overwritten
                        (default: None)
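
As a usage illustration, the following sketch assembles an argument list for tts2_inference.py from Python, using only flags that appear in the usage text above. The script location, all file paths, and the helper name build_tts2_inference_cmd are assumptions made for this example. The command is only printed, not executed.

```python
import shlex


def build_tts2_inference_cmd(output_dir, data_spec, train_config=None,
                             model_file=None, extra_args=()):
    """Assemble an argument list for tts2_inference.py (not executed here)."""
    cmd = [
        "python", "espnet2/bin/tts2_inference.py",  # assumed invocation path
        "--output_dir", output_dir,                 # required by the usage text
        "--data_path_and_name_and_type", data_spec, # required by the usage text
    ]
    if train_config is not None:
        cmd += ["--train_config", train_config]
    if model_file is not None:
        cmd += ["--model_file", model_file]
    cmd += list(extra_args)
    return cmd


cmd = build_tts2_inference_cmd(
    output_dir="exp/decode",               # hypothetical path
    data_spec="dump/test/text,text,text",  # hypothetical 'path,name,type'
    train_config="exp/train/config.yaml",  # hypothetical path
    model_file="exp/train/model.pth",      # hypothetical path
    extra_args=["--ngpu", "0", "--speed_control_alpha", "1.0"],
)
print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run would launch the tool, provided the paths exist in the local setup.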

tts2_train.py¶

usage: tts2_train.py [-h] [--config CONFIG] [--print_config]
                     [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                     [--drop_last_iter DROP_LAST_ITER] [--dry_run DRY_RUN]
                     [--iterator_type {sequence,category,chunk,task,none}]
                     [--valid_iterator_type {sequence,category,chunk,task,none}]
                     [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                     [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                     [--dist_backend DIST_BACKEND]
                     [--dist_init_method DIST_INIT_METHOD]
                     [--dist_world_size DIST_WORLD_SIZE]
                     [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                     [--dist_master_addr DIST_MASTER_ADDR]
                     [--dist_master_port DIST_MASTER_PORT]
                     [--dist_launcher {slurm,mpi,None}]
                     [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                     [--unused_parameters UNUSED_PARAMETERS]
                     [--sharded_ddp SHARDED_DDP]
                     [--cudnn_enabled CUDNN_ENABLED]
                     [--cudnn_benchmark CUDNN_BENCHMARK]
                     [--cudnn_deterministic CUDNN_DETERMINISTIC]
                     [--collect_stats COLLECT_STATS]
                     [--write_collected_feats WRITE_COLLECTED_FEATS]
                     [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                     [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                     [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                     [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                     [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                     [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                     [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                     [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                     [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                     [--train_dtype {float16,float32,float64}]
                     [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                     [--use_matplotlib USE_MATPLOTLIB]
                     [--use_tensorboard USE_TENSORBOARD]
                     [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                     [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                     [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                     [--wandb_name WANDB_NAME]
                     [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                     [--detect_anomaly DETECT_ANOMALY]
                     [--use_adapter USE_ADAPTER] [--adapter {lora,houlsby}]
                     [--save_strategy {all,adapter_only,required_grad_only}]
                     [--adapter_conf ADAPTER_CONF]
                     [--pretrain_path PRETRAIN_PATH]
                     [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                     [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                     [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                     [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                     [--batch_size BATCH_SIZE]
                     [--valid_batch_size VALID_BATCH_SIZE]
                     [--batch_bins BATCH_BINS]
                     [--valid_batch_bins VALID_BATCH_BINS]
                     [--train_shape_file TRAIN_SHAPE_FILE]
                     [--valid_shape_file VALID_SHAPE_FILE]
                     [--batch_type {unsorted,sorted,folded,length,numel}]
                     [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                     [--fold_length FOLD_LENGTH]
                     [--sort_in_batch {descending,ascending}]
                     [--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
                     [--sort_batch {descending,ascending}]
                     [--multiple_iterator MULTIPLE_ITERATOR]
                     [--chunk_length CHUNK_LENGTH]
                     [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                     [--num_cache_chunks NUM_CACHE_CHUNKS]
                     [--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
                     [--chunk_default_fs CHUNK_DEFAULT_FS]
                     [--chunk_max_abs_length CHUNK_MAX_ABS_LENGTH]
                     [--chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES]
                     [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                     [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                     [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                     [--max_cache_size MAX_CACHE_SIZE]
                     [--max_cache_fd MAX_CACHE_FD]
                     [--allow_multi_rates ALLOW_MULTI_RATES]
                     [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                     [--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
                     [--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
                     [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                     [--optim_conf OPTIM_CONF]
                     [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                     [--scheduler_conf SCHEDULER_CONF]
                     [--src_token_list SRC_TOKEN_LIST]
                     [--tgt_token_list TGT_TOKEN_LIST]
                     [--model_conf MODEL_CONF]
                     [--use_preprocessor USE_PREPROCESSOR]
                     [--src_token_type {bpe,char,word,phn}]
                     [--bpemodel BPEMODEL]
                     [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                     [--cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}]
                     [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                     [--discrete_feats_extract {identity}]
                     [--discrete_feats_extract_conf DISCRETE_FEATS_EXTRACT_CONF]
                     [--tts {fastspeech2}] [--tts_conf TTS_CONF]
                     [--pitch_extract {dio,None}]
                     [--pitch_extract_conf PITCH_EXTRACT_CONF]
                     [--pitch_normalize {global_mvn,None}]
                     [--pitch_normalize_conf PITCH_NORMALIZE_CONF]
                     [--energy_extract {energy,None}]
                     [--energy_extract_conf ENERGY_EXTRACT_CONF]
                     [--energy_normalize {global_mvn,None}]
                     [--energy_normalize_conf ENERGY_NORMALIZE_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --drop_last_iter DROP_LAST_ITER
                        Exclude the minibatch with leftovers. (default: False)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: sequence)
  --valid_iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: None)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images to plot from the attention outputs. This option is only meaningful for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
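
The env:// initialization described above relies on four environment variables. A minimal sketch of collecting and validating them is shown below; the names DistEnv and read_dist_env are hypothetical helpers for illustration and are not part of ESPnet or PyTorch.

```python
import os
from typing import Mapping, NamedTuple, Optional


class DistEnv(NamedTuple):
    """Values referred to when dist_init_method is env:// (hypothetical container)."""

    master_addr: str
    master_port: int
    world_size: int
    rank: int


def read_dist_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Collect MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK from a mapping."""
    env = os.environ if env is None else env
    required = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
    missing = [key for key in required if key not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return DistEnv(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        world_size=int(env["WORLD_SIZE"]),
        rank=int(env["RANK"]),
    )


# Example values are made up for illustration.
example = read_dist_env(
    {"MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "29500", "WORLD_SIZE": "2", "RANK": "0"}
)
print(example)
```

Passing a plain dictionary instead of os.environ keeps the sketch testable without touching the real process environment.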

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair: the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging early stopping. Give a triplet: the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give a triplet: the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The p-norm type used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        Whether to inject noise into gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every given number of iterations in each epoch during the training phase. If None is given, the interval is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
  --use_adapter USE_ADAPTER
                        Enable efficient finetuning, see (https://arxiv.org/abs/2106.09685) for large pre-trained foundation models, like Whisper and SSL models (default: False)
  --adapter {lora,houlsby}
                        Adapter Name (default: lora)
  --save_strategy {all,adapter_only,required_grad_only}
                        The strategy to save parameters. Default: 'all'
                        'all': save all parameters
                        'adapter_only': save only adapter parameters, without other parameters like downstream model
                        'required_grad_only': save only parameters with requires_grad=True
                         (default: all)
  --adapter_conf ADAPTER_CONF
                        Configuration for efficient finetuning (default: {})
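
The interaction between --patience and the "min"/"max" mode of --early_stopping_criterion can be illustrated with a small sketch. The function should_stop is hypothetical, and the exact comparison used by the trainer is an assumption here: training stops once the best value is more than `patience` epochs old.

```python
from typing import Optional, Sequence


def should_stop(values: Sequence[float], mode: str, patience: Optional[int]) -> bool:
    """Return True when the best value is more than `patience` epochs old.

    `values` holds one criterion value per finished epoch, e.g. the
    ('valid', 'loss') history. A patience of None disables early stopping.
    """
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max': {mode}")
    if patience is None or not values:
        return False
    best = min(values) if mode == "min" else max(values)
    best_epoch = list(values).index(best)  # first epoch reaching the best value
    epochs_without_improvement = len(values) - 1 - best_epoch
    return epochs_without_improvement > patience


# Made-up validation losses: the best value (0.8) was reached at the second epoch.
valid_loss = [1.0, 0.8, 0.9, 0.85, 0.95]
print(should_stop(valid_loss, "min", patience=2))
print(should_stop(valid_loss, "min", patience=3))
```

With three epochs since the best loss, a patience of 2 triggers stopping while a patience of 3 does not.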

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
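
The '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format of --init_param can be split as sketched below. InitParamSpec and parse_init_param are hypothetical names, and treating exclude_keys as a comma-separated list is an assumption, not something the help text states.

```python
from typing import NamedTuple, Optional, Tuple


class InitParamSpec(NamedTuple):
    """Parsed form of one --init_param value (hypothetical container)."""

    file_path: str
    src_key: Optional[str]
    dst_key: Optional[str]
    exclude_keys: Tuple[str, ...]


def parse_init_param(spec: str) -> InitParamSpec:
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into its fields.

    Trailing fields may be omitted. Paths that themselves contain ':' are
    not handled by this sketch.
    """
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':'-separated fields: {spec}")
    parts += [""] * (4 - len(parts))
    file_path, src_key, dst_key, excludes = parts
    # Assumption: several exclude keys are separated by commas.
    exclude_keys = tuple(key for key in excludes.split(",") if key)
    return InitParamSpec(file_path, src_key or None, dst_key or None, exclude_keys)


for value in (
    "some/where/model.pth",
    "some/where/model.pth:decoder:decoder",
    "some/where/model.pth:decoder:decoder:decoder.embed",
):
    print(parse_init_param(value))
```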

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', 'folded', or 'catbel'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and just creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The first column is used, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        where L is the largest length of the samples in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches which have, as nearly as possible, the same number of 'bins', counted as the total length of the features in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of the feature dimensions is used, so 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches which have, as nearly as possible, the same number of 'bins', but counted as the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --shuffle_within_batch SHUFFLE_WITHIN_BATCH
                        Shuffle whole batches sample-wise. Normally required for classification tasks. (default: False)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
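
The shape-file format and the sizing rules described for the "folded", "length", and "numel" batch types can be sketched as follows. All function names are hypothetical, the max(1, ...) guards are assumptions added so the documented formula never divides by zero, and the per-sample bin counts are a simplification of what the actual samplers accumulate per mini-batch.

```python
from typing import Dict, Iterable, Tuple


def parse_shape_lines(lines: Iterable[str]) -> Dict[str, Tuple[int, ...]]:
    """Parse lines such as 'utterance_id_a 1000,80' into {id: (1000, 80)}."""
    shapes = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        utt_id, dims = line.split(maxsplit=1)
        shapes[utt_id] = tuple(int(d) for d in dims.split(","))
    return shapes


def folded_batch_size(base_batch_size: int, longest: int, fold_length: int) -> int:
    """batch_size = base_batch_size // (L // fold_length), guarded against zero."""
    folds = max(1, longest // fold_length)  # assumption: at least one fold
    return max(1, base_batch_size // folds)  # assumption: at least one sample


def sample_bins(shape: Tuple[int, ...], batch_type: str) -> int:
    """Count the 'bins' of one sample for the 'length' and 'numel' batch types."""
    if batch_type == "length":
        return shape[0]
    if batch_type == "numel":
        total = 1
        for dim in shape:
            total *= dim
        return total
    raise ValueError(f"unsupported batch_type: {batch_type}")


shapes = parse_shape_lines(
    ["utterance_id_a 1000,80", "utterance_id_b 1453,80", "utterance_id_c 1241,80"]
)
longest = max(shape[0] for shape in shapes.values())
print(folded_batch_size(20, longest, fold_length=500))
print(sample_bins(shapes["utterance_id_a"], "length"))
print(sample_bins(shapes["utterance_id_a"], "numel"))
```

With the example shapes above, the longest sample has 1453 frames, so a fold_length of 500 halves a base batch size of 20.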

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it's less than 1, chunks overlap, and if it's bigger than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
  --chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
                        List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
                          - exactly match one of the prefixes in `chunk_excluded_key_prefixes`
                          - have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
  --chunk_default_fs CHUNK_DEFAULT_FS
                        Default sampling rate used for the chunk length. Will be used to adaptively adjust the chunk length for data of different sampling rates. (If None, the chunk length will be fixed.) (default: None)
  --chunk_max_abs_length CHUNK_MAX_ABS_LENGTH
                        Maximum number of samples per chunk for all sampling rates (default: None)
  --chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES
                        Discard samples shorter than the minimum chunk length (default: True)
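
The --chunk_length syntax and the effect of --chunk_shift_ratio can be sketched as below. parse_chunk_length and chunk_starts are hypothetical helpers; treating '300-400' as an inclusive range and rounding the shift down to an integer are assumptions.

```python
from typing import List


def parse_chunk_length(spec: str) -> List[int]:
    """Expand '300', '300,400,500', or '300-400' into candidate chunk lengths."""
    lengths: List[int] = []
    for item in spec.split(","):
        item = item.strip()
        if "-" in item:
            low, high = (int(value) for value in item.split("-"))
            if low > high:
                raise ValueError(f"invalid range: {item}")
            lengths.extend(range(low, high + 1))  # assumption: inclusive range
        else:
            lengths.append(int(item))
    return lengths


def chunk_starts(seq_len: int, chunk_length: int, shift_ratio: float) -> List[int]:
    """Start offsets of chunks; an empty list means the sample is too short."""
    shift = max(1, int(chunk_length * shift_ratio))
    return list(range(0, seq_len - chunk_length + 1, shift))


print(parse_chunk_length("300,400,500"))
print(parse_chunk_length("300-302"))
# A ratio of 0.5 makes neighbouring chunks overlap by half.
print(chunk_starts(seq_len=1200, chunk_length=500, shift_ratio=0.5))
# A sequence shorter than the chunk length yields no chunks.
print(chunk_starts(seq_len=400, chunk_length=500, shift_ratio=0.5))
```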

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio formats supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "multi_columns_sound":
                        Enable multi-column wav.scp. The following text file can be loaded as multi-channel audio data

                           utterance_id_a a.wav a2.wav
                           utterance_id_b b.wav b2.wav
                           ...

                        "variable_columns_sound":
                        Load a variable number (columns) of audio files in wav.scp. The following text file can be loaded as stacked audio data

                           utterance_id_a a1.wav a2.wav a3.wav
                           utterance_id_b b1.wav
                           utterance_id_c c1.wav c2.wav
                           ...

                        Note that audios of different lengths will be right-padded with np.nan to the longest audio in the sample.
                        A preprocessor must be used to remove these paddings.

                        "score":
                        Return text as is. The text contains tempo and note info.
                        For each note, 'start' 'end' 'syllable' 'midi' and 'phones' are included.

                           utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
                           utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integer numbers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integer numbers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "random_text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           hello world
                           foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        rttm file loader, currently supported for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys for mini-batches, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --allow_multi_rates ALLOW_MULTI_RATES
                        Whether to allow audios to have different sampling rates (default: False)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
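
The 'path,name,type' triple of --train_data_path_and_name_and_type and the "text_int" file layout can be sketched as follows. DataSpec, parse_data_spec, and parse_text_int are hypothetical helpers written for illustration and do not reproduce ESPnet's own loaders.

```python
from typing import Dict, Iterable, List, NamedTuple


class DataSpec(NamedTuple):
    """One 'path,name,type' value (hypothetical container)."""

    path: str
    name: str
    type: str


def parse_data_spec(value: str) -> DataSpec:
    """Split e.g. 'some/path/a.scp,foo,sound' into path, key name and file type."""
    parts = value.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected three comma-separated words: {value}")
    return DataSpec(*parts)


def parse_text_int(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Read 'text_int' lines: an utterance id followed by space-separated integers."""
    table: Dict[str, List[int]] = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        table[fields[0]] = [int(value) for value in fields[1:]]
    return table


spec = parse_data_spec("some/path/a.scp,foo,sound")
print(spec)
print(parse_text_int(["utterance_id_A 12 0 1 3", "utterance_id_B 3 3 1"]))
```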

Optimizer related:
  --exclude_weight_decay EXCLUDE_WEIGHT_DECAY
                        Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
  --exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
                        The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

Task related:
  --src_token_list SRC_TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --tgt_token_list TGT_TOKEN_LIST
                        A text mapping int-id to target speech token (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {})

Preprocess related:
  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --src_token_type {bpe,char,word,phn}
                        The text will be tokenized at the specified token level (default: phn)
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --discrete_feats_extract {identity}
                        The discrete_feats_extract type (default: identity)
  --discrete_feats_extract_conf DISCRETE_FEATS_EXTRACT_CONF
                        The keyword arguments for discrete_feats_extract (default: {})
  --tts {fastspeech2}   The tts type (default: fastspeech2)
  --tts_conf TTS_CONF   The keyword arguments for tts (default: {})
  --pitch_extract {dio,None}
                        The pitch_extract type (default: None)
  --pitch_extract_conf PITCH_EXTRACT_CONF
                        The keyword arguments for pitch_extract (default: {})
  --pitch_normalize {global_mvn,None}
                        The pitch_normalize type (default: None)
  --pitch_normalize_conf PITCH_NORMALIZE_CONF
                        The keyword arguments for pitch_normalize (default: {})
  --energy_extract {energy,None}
                        The energy_extract type (default: None)
  --energy_extract_conf ENERGY_EXTRACT_CONF
                        The keyword arguments for energy_extract (default: {})
  --energy_normalize {global_mvn,None}
                        The energy_normalize type (default: None)
  --energy_normalize_conf ENERGY_NORMALIZE_CONF
                        The keyword arguments for energy_normalize (default: {})

tts_inference.py¶

usage: tts_inference.py [-h] [--config CONFIG]
                        [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                        --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                        [--dtype {float16,float32,float64}]
                        [--num_workers NUM_WORKERS] [--batch_size BATCH_SIZE]
                        --data_path_and_name_and_type
                        DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                        [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                        [--train_config TRAIN_CONFIG]
                        [--model_file MODEL_FILE] [--model_tag MODEL_TAG]
                        [--maxlenratio MAXLENRATIO]
                        [--minlenratio MINLENRATIO] [--threshold THRESHOLD]
                        [--use_att_constraint USE_ATT_CONSTRAINT]
                        [--backward_window BACKWARD_WINDOW]
                        [--forward_window FORWARD_WINDOW]
                        [--use_teacher_forcing USE_TEACHER_FORCING]
                        [--speed_control_alpha SPEED_CONTROL_ALPHA]
                        [--noise_scale NOISE_SCALE]
                        [--noise_scale_dur NOISE_SCALE_DUR]
                        [--always_fix_seed ALWAYS_FIX_SEED]
                        [--vocoder_config VOCODER_CONFIG]
                        [--vocoder_file VOCODER_FILE]
                        [--vocoder_tag VOCODER_TAG]

TTS inference

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
                        The path of output directory (default: None)
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --speed_control_alpha SPEED_CONTROL_ALPHA
                        Alpha in FastSpeech to change the speed of generated
                        speech (default: 1.0)
  --noise_scale NOISE_SCALE
                        Noise scale parameter for the flow in vits (default:
                        0.667)
  --noise_scale_dur NOISE_SCALE_DUR
                        Noise scale parameter for the stochastic duration
                        predictor in vits (default: 0.8)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --train_config TRAIN_CONFIG
                        Training configuration file (default: None)
  --model_file MODEL_FILE
                        Model parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If this option is specified,
                        train_config and model_file will be overwritten
                        (default: None)

Decoding related:
  --maxlenratio MAXLENRATIO
                        Maximum length ratio in decoding (default: 10.0)
  --minlenratio MINLENRATIO
                        Minimum length ratio in decoding (default: 0.0)
  --threshold THRESHOLD
                        Threshold value in decoding (default: 0.5)
  --use_att_constraint USE_ATT_CONSTRAINT
                        Whether to use attention constraint (default: False)
  --backward_window BACKWARD_WINDOW
                        Backward window value in attention constraint
                        (default: 1)
  --forward_window FORWARD_WINDOW
                        Forward window value in attention constraint (default:
                        3)
  --use_teacher_forcing USE_TEACHER_FORCING
                        Whether to use teacher forcing (default: False)
  --always_fix_seed ALWAYS_FIX_SEED
                        Whether to always fix seed (default: False)

Vocoder related:
  --vocoder_config VOCODER_CONFIG
                        Vocoder configuration file (default: None)
  --vocoder_file VOCODER_FILE
                        Vocoder parameter file (default: None)
  --vocoder_tag VOCODER_TAG
                        Pretrained vocoder tag. If this option is specified,
                        vocoder_config and vocoder_file will be overwritten
                        (default: None)
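
A minimal sketch of assembling a tts_inference.py call from the options listed above follows. It only builds and prints the argument list. The helper name, all paths, the 'text' key name, and running the tool as the module espnet2.bin.tts_inference are assumptions for illustration.

```python
import shlex
from typing import List


def build_tts_inference_command(
    output_dir: str,
    data_spec: str,
    train_config: str,
    model_file: str,
    ngpu: int = 0,
) -> List[str]:
    """Assemble an argument list using only flags from the usage text above."""
    return [
        "python", "-m", "espnet2.bin.tts_inference",  # assumption: run as a module
        "--output_dir", output_dir,
        "--ngpu", str(ngpu),
        "--data_path_and_name_and_type", data_spec,
        "--train_config", train_config,
        "--model_file", model_file,
    ]


# All paths below are hypothetical placeholders.
command = build_tts_inference_command(
    output_dir="exp/decode",
    data_spec="data/test/text,text,text",
    train_config="exp/tts/config.yaml",
    model_file="exp/tts/model.pth",
)
print(shlex.join(command))
```

Keeping the command as a list avoids shell quoting problems if it is later passed to subprocess.run.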

tts_train.py¶

usage: tts_train.py [-h] [--config CONFIG] [--print_config]
                    [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                    [--drop_last_iter DROP_LAST_ITER] [--dry_run DRY_RUN]
                    [--iterator_type {sequence,category,chunk,task,none}]
                    [--valid_iterator_type {sequence,category,chunk,task,none}]
                    [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                    [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                    [--dist_backend DIST_BACKEND]
                    [--dist_init_method DIST_INIT_METHOD]
                    [--dist_world_size DIST_WORLD_SIZE]
                    [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                    [--dist_master_addr DIST_MASTER_ADDR]
                    [--dist_master_port DIST_MASTER_PORT]
                    [--dist_launcher {slurm,mpi,None}]
                    [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                    [--unused_parameters UNUSED_PARAMETERS]
                    [--sharded_ddp SHARDED_DDP]
                    [--cudnn_enabled CUDNN_ENABLED]
                    [--cudnn_benchmark CUDNN_BENCHMARK]
                    [--cudnn_deterministic CUDNN_DETERMINISTIC]
                    [--collect_stats COLLECT_STATS]
                    [--write_collected_feats WRITE_COLLECTED_FEATS]
                    [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                    [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                    [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                    [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                    [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                    [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                    [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                    [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                    [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                    [--train_dtype {float16,float32,float64}]
                    [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                    [--use_matplotlib USE_MATPLOTLIB]
                    [--use_tensorboard USE_TENSORBOARD]
                    [--create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD]
                    [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                    [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                    [--wandb_name WANDB_NAME]
                    [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                    [--detect_anomaly DETECT_ANOMALY]
                    [--use_adapter USE_ADAPTER] [--adapter {lora,houlsby}]
                    [--save_strategy {all,adapter_only,required_grad_only}]
                    [--adapter_conf ADAPTER_CONF]
                    [--pretrain_path PRETRAIN_PATH]
                    [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                    [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                    [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                    [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                    [--batch_size BATCH_SIZE]
                    [--valid_batch_size VALID_BATCH_SIZE]
                    [--batch_bins BATCH_BINS]
                    [--valid_batch_bins VALID_BATCH_BINS]
                    [--train_shape_file TRAIN_SHAPE_FILE]
                    [--valid_shape_file VALID_SHAPE_FILE]
                    [--batch_type {unsorted,sorted,folded,length,numel}]
                    [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                    [--fold_length FOLD_LENGTH]
                    [--sort_in_batch {descending,ascending}]
                    [--shuffle_within_batch SHUFFLE_WITHIN_BATCH]
                    [--sort_batch {descending,ascending}]
                    [--multiple_iterator MULTIPLE_ITERATOR]
                    [--chunk_length CHUNK_LENGTH]
                    [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                    [--num_cache_chunks NUM_CACHE_CHUNKS]
                    [--chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]]
                    [--chunk_default_fs CHUNK_DEFAULT_FS]
                    [--chunk_max_abs_length CHUNK_MAX_ABS_LENGTH]
                    [--chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES]
                    [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                    [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                    [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                    [--max_cache_size MAX_CACHE_SIZE]
                    [--max_cache_fd MAX_CACHE_FD]
                    [--allow_multi_rates ALLOW_MULTI_RATES]
                    [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                    [--exclude_weight_decay EXCLUDE_WEIGHT_DECAY]
                    [--exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF]
                    [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}]
                    [--optim_conf OPTIM_CONF]
                    [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}]
                    [--scheduler_conf SCHEDULER_CONF]
                    [--token_list TOKEN_LIST] [--odim ODIM]
                    [--model_conf MODEL_CONF]
                    [--use_preprocessor USE_PREPROCESSOR]
                    [--token_type {bpe,char,word,phn}] [--bpemodel BPEMODEL]
                    [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                    [--cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}]
                    [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}]
                    [--feats_extract {fbank,spectrogram,linear_spectrogram}]
                    [--feats_extract_conf FEATS_EXTRACT_CONF]
                    [--normalize {global_mvn,None}]
                    [--normalize_conf NORMALIZE_CONF]
                    [--tts {tacotron2,transformer,fastspeech,fastspeech2,prodiff,vits,joint_text2wav,jets}]
                    [--tts_conf TTS_CONF] [--pitch_extract {dio,None}]
                    [--pitch_extract_conf PITCH_EXTRACT_CONF]
                    [--pitch_normalize {global_mvn,None}]
                    [--pitch_normalize_conf PITCH_NORMALIZE_CONF]
                    [--energy_extract {energy,None}]
                    [--energy_extract_conf ENERGY_EXTRACT_CONF]
                    [--energy_normalize {global_mvn,None}]
                    [--energy_normalize_conf ENERGY_NORMALIZE_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,pypinyin_g2p_phone_without_prosody,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,espeak_ng_italian,espeak_ng_ukrainian,espeak_ng_polish,g2pk,g2pk_no_space,g2pk_explicit_space,korean_jaso,korean_jaso_no_space,g2p_is}
                        Specify g2p method if --token_type=phn (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --drop_last_iter DROP_LAST_ITER
                        Exclude the minibatch with leftovers. (default: False)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: sequence)
  --valid_iterator_type {sequence,category,chunk,task,none}
                        Specify iterator type (default: None)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number of images to plot from the attention outputs. This option only makes sense for attention-based models. The attention plot can be disabled by setting it to 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        If init_method="env://", the environment variables "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are read. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use find_unused_parameters in torch.nn.parallel.DistributedDataParallel (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)
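
As a rough illustration of the env:// convention described under --dist_init_method, the sketch below collects the four environment variables named in that help text. The helper name `read_env_rendezvous` and the returned dictionary keys are hypothetical and not part of ESPnet; the sketch only shows which variables are expected to be set.

```python
import os

# Variable names taken from the --dist_init_method help text.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def read_env_rendezvous(env=None):
    """Hypothetical helper: gather env:// settings, raising if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {missing}")
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "world_size": int(env["WORLD_SIZE"]),
        "rank": int(env["RANK"]),
    }
```

Passing a plain dictionary instead of relying on os.environ makes it easy to check a launch configuration before starting a job.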

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Run in "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when in "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number of epochs to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair: the phase, "train" or "valid", and the criterion name. The mode, "min" or "max", can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging early stopping. Give three values: the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging the best model. Give the phase, "train" or "valid", the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of p-norm used for gradient clipping. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation steps (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Only iterate over data loading, without model forwarding or training (default: False)
  --resume RESUME       Enable resuming if a checkpoint exists (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every this number of iterations in each epoch during the training phase. If None is given, it is decided automatically according to the number of training samples. (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --create_graph_in_tensorboard CREATE_GRAPH_IN_TENSORBOARD
                        Whether to create graph in tensorboard (default: False)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)
  --use_adapter USE_ADAPTER
                        Enable efficient finetuning, see (https://arxiv.org/abs/2106.09685) for large pre-trained foundation models, like Whisper and SSL models (default: False)
  --adapter {lora,houlsby}
                        Adapter Name (default: lora)
  --save_strategy {all,adapter_only,required_grad_only}
                        The strategy to save parameters. Default: 'all'
                        'all': save all parameters
                        'adapter_only': save only adapter parameters, without other parameters like downstream model
                        'required_grad_only': save only parameters with requires_grad=True
                         (default: all)
  --adapter_conf ADAPTER_CONF
                        Configuration for efficient finetuning (default: {})
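
The interaction between --patience and --early_stopping_criterion can be sketched as follows. This is a minimal illustration and not the ESPnet trainer itself: the function names are hypothetical, `values` stands for the per-epoch history of the chosen criterion, and the strict comparison against `patience` is an assumption.

```python
def epochs_since_best(values, mode):
    """Number of epochs elapsed since the best value in the history."""
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    best = min(values) if mode == "min" else max(values)
    # Use the earliest epoch that reached the best value, so that a later
    # tie does not count as an improvement.
    best_epoch = values.index(best)
    return len(values) - 1 - best_epoch


def should_stop(values, mode, patience):
    """Return True when training should stop early (assumed rule)."""
    if patience is None:
        # With the default --patience of None, early stopping is disabled.
        return False
    return epochs_since_best(values, mode) > patience
```

For example, with a validation loss history of 1.0, 0.9, 0.95, 0.96, 0.97 and mode "min", the best epoch is the second one, so three epochs have passed without improvement and a patience of 2 would stop the run under this rule.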

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsolete (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states from the initialization. e.g.
                          # Load all parameters
                          --init_param some/where/model.pth
                          # Load only decoder parameters
                          --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])
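
To make the '<file_path>:<src_key>:<dst_key>:<exclude_keys>' format of --init_param concrete, here is a small parsing sketch. The function `parse_init_param` is hypothetical and not ESPnet's loader; treating empty fields as "not given" and splitting exclude_keys on commas are assumptions made for illustration. Paths that themselves contain ':' are not handled.

```python
def parse_init_param(spec):
    """Split an --init_param value into (path, src_key, dst_key, exclude_keys)."""
    parts = spec.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many ':'-separated fields in {spec!r}")
    # Pad missing trailing fields so that unpacking always works.
    parts += [None] * (4 - len(parts))
    path, src_key, dst_key, exclude = parts
    # Assumption: an empty field means the same as an omitted one.
    src_key = src_key or None
    dst_key = dst_key or None
    # Assumption: several keys to exclude are separated by commas.
    exclude_keys = exclude.split(",") if exclude else []
    return path, src_key, dst_key, exclude_keys
```

Applied to the three examples in the help text, this yields a bare path, a path with matching source and destination keys, and the same with `decoder.embed` listed for exclusion.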

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', 'folded', or 'catbel'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE
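
The difference between how --batch_bins is counted for batch_type='length' and 'numel' can be illustrated with a short sketch. It assumes the shape-file layout shown in the --batch_type help ('utterance_id 1000,80') and a single feature per sample; the helper names are hypothetical and not part of ESPnet.

```python
import math


def parse_shape_line(line):
    """Parse one shape-file line such as 'utterance_id_a 1000,80'."""
    key, shape = line.split(maxsplit=1)
    dims = tuple(int(d) for d in shape.strip().split(","))
    return key, dims


def bins_for(dims, batch_type):
    """Bins contributed by one sample under the given batch type."""
    if batch_type == "length":
        # Only the first dimension (the sequence length) is counted.
        return dims[0]
    if batch_type == "numel":
        # Every element of the feature is counted.
        return math.prod(dims)
    raise ValueError(f"unsupported batch_type: {batch_type!r}")
```

Under these assumptions a 1000 x 80 feature counts as 1000 bins for 'length' but as 80000 bins for 'numel', which is why the same --batch_bins value gives very different batch sizes for the two types.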

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has no particular features and simply creates mini-batches with a constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which lists each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        Only the first column is read, so a 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input so that the samples in a mini-batch have similar lengths. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is the largest sample length in the mini-batch. This sampler requires the same length information as SortedBatchSampler.

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches whose number of 'bins' is as equal as possible, counted by the total lengths of the features in the mini-batch. This sampler requires a text file which describes the length of each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        Only the first element of the feature dimensions is read, so a 'shape_file' can also be used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Like LengthBatchSampler, this sampler makes mini-batches whose number of 'bins' is as equal as possible, but counted by the total number of elements of each feature instead of the length. Thus this sampler requires the full dimension information of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batch by sample length. To enable this, "shape_file" must have the length information. (default: descending)
  --shuffle_within_batch SHUFFLE_WITHIN_BATCH
                        Shuffle whole batches sample-wise. Normally required for classification tasks. (default: False)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)
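
The "folded" rule quoted in the --batch_type help, batch_size = base_batch_size // (L // fold_length), can be tried out with the sketch below. It follows the formula as written in the help text; the guard for samples shorter than fold_length and the `min_batch_size` parameter are assumptions added so that the function never divides by zero or returns 0, and the actual sampler may differ in these details.

```python
def folded_batch_size(base_batch_size, longest, fold_length, min_batch_size=1):
    """Batch size for a mini-batch whose longest sample has length `longest`."""
    factor = longest // fold_length
    if factor == 0:
        # Assumption: samples shorter than fold_length keep the base size.
        return base_batch_size
    # The longer the longest sample, the smaller the mini-batch.
    return max(min_batch_size, base_batch_size // factor)
```

With a base batch size of 20 and a fold length of 800, a mini-batch whose longest sample has 1600 frames would hold 10 samples, and one with 2400 frames would hold 6.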

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'. If multiple numbers separated by commas are given, one of them is selected randomly for each sample. If two numbers are given with '-', they indicate the range of choices. Note that if the sequence length is shorter than all of the chunk_lengths, the sample is discarded. (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it is less than 1, chunks overlap; if it is greater than 1, there are gaps between chunks. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle within the specified number of chunks and generate mini-batches. The larger this value, the more randomness is obtained. (default: 1024)
  --chunk_excluded_key_prefixes CHUNK_EXCLUDED_KEY_PREFIXES [CHUNK_EXCLUDED_KEY_PREFIXES ...]
                        List of key prefixes. Keys that satisfy either condition below will be excluded from the length consistency check in ChunkIterFactory:
                          - exactly match one of the prefixes in `chunk_excluded_key_prefixes`
                          - have one of the prefixes in `chunk_excluded_key_prefixes` and end with numbers (default: [])
  --chunk_default_fs CHUNK_DEFAULT_FS
                        Default sampling rate used for the chunk length. Will be used to adaptively adjust the chunk length for data of different sampling rates. (If None, the chunk length will be fixed.) (default: None)
  --chunk_max_abs_length CHUNK_MAX_ABS_LENGTH
                        Maximum number of samples per chunk for all sampling rates (default: None)
  --chunk_discard_short_samples CHUNK_DISCARD_SHORT_SAMPLES
                        Discard samples shorter than the minimum chunk length (default: True)
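
The --chunk_length syntax and the effect of --chunk_shift_ratio can be sketched as below. This is an illustration under stated assumptions, not the ChunkIterFactory implementation: both helper names are hypothetical, a range such as '300-400' is assumed to include both endpoints, and chunk start positions are assumed to begin at offset 0 without any random offset.

```python
def parse_chunk_length(spec):
    """Expand a value such as '300', '300,400,500', or '300-400' into candidates."""
    lengths = []
    for item in spec.split(","):
        bounds = [int(x) for x in item.split("-")]
        if len(bounds) == 1:
            lengths.append(bounds[0])
        elif len(bounds) == 2:
            low, high = bounds
            # Assumption: the range is inclusive on both ends.
            lengths.extend(range(low, high + 1))
        else:
            raise ValueError(f"invalid chunk length item: {item!r}")
    return lengths


def chunk_starts(seq_len, chunk_len, shift_ratio):
    """Start indices of the chunks cut from a sequence of length `seq_len`."""
    if seq_len < chunk_len:
        # Too short for this chunk length: no chunk can be cut.
        return []
    # A ratio below 1 makes chunks overlap; above 1 it leaves gaps.
    shift = max(1, int(chunk_len * shift_ratio))
    return list(range(0, seq_len - chunk_len + 1, shift))
```

With the defaults (chunk length 500, shift ratio 0.5), a 1000-frame sequence would give three half-overlapping chunks starting at frames 0, 250, and 500 under these assumptions.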

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words separated by commas. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio formats supported by sndfile: wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "multi_columns_sound":
                        Enable multi-column wav.scp. The following text file can be loaded as multi-channel audio data

                           utterance_id_a a.wav a2.wav
                           utterance_id_b b.wav b2.wav
                           ...

                        "variable_columns_sound":
                        Load a variable number of audio files (columns) per line of wav.scp. The following text file can be loaded as stacked audio data

                           utterance_id_a a1.wav a2.wav a3.wav
                           utterance_id_b b1.wav
                           utterance_id_c c1.wav c2.wav
                           ...

                        Note that audios of different lengths will be right-padded with np.nan to the longest audio in the sample.
                        A preprocessor must be used to remove these paddings.

                        "score":
                        Return text as is. The text contains tempo and note info.
                        For each note, 'start', 'end', 'syllable', 'midi', and 'phones' are included.

                           utterance_id_A tempo_a start_1 end_1 syllable_1 midi_1 phones_1 ...
                           utterance_id_B tempo_b start_1 end_1 syllable_1 midi_1 phones_1 ...
                           ...

                        "duration":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           utterance_id_B start_1 end_1 phone_1 start_2 end_2 phone_2 ...
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file containing a sequence of integers separated by spaces.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file containing a sequence of integers separated by commas.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file containing a sequence of float numbers separated by spaces.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file containing a sequence of float numbers separated by commas.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "random_text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           hello world
                           foo bar
                           ...

                        "hdf5":
                        An HDF5 file which contains arrays at the first level or the second level.

                           >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        RTTM file loader; currently supported for speaker diarization.

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>
                            SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>
                            SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>
                            END     file1 <NA> 4023 <NA> <NA> <NA> <NA>
                            ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow arbitrary keys in the mini-batch, ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to keep open for ark files. This feature is only valid when the data type is 'kaldi_ark'. (default: 32)
  --allow_multi_rates ALLOW_MULTI_RATES
                        Whether to allow audios to have different sampling rates (default: False)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, 5 percent of --max_cache_size is used (default: None)
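
The 'path,name,type' triple accepted by --train_data_path_and_name_and_type, and the 'rand_int_\d+_\d+' file type pattern, can be unpacked as in the sketch below. Both helper names are hypothetical and not part of ESPnet, and the sketch assumes the file path itself contains no comma.

```python
import re


def parse_data_spec(spec):
    """Split 'some/path/a.scp,foo,sound' into (path, name, file_type)."""
    parts = spec.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected 'path,name,type', got {spec!r}")
    path, name, file_type = parts
    return path, name, file_type


def rand_int_bounds(file_type):
    """Return (low, high) for a type like 'rand_int_0_10', else None."""
    match = re.fullmatch(r"rand_int_(\d+)_(\d+)", file_type)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Because the option is repeatable, a task with several inputs would pass one such triple per feature, each with its own key name.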

Optimizer related:
  --exclude_weight_decay EXCLUDE_WEIGHT_DECAY
                        Exclude weight decay in optimizer for model bias, normalization, or other special parameters (default: False)
  --exclude_weight_decay_conf EXCLUDE_WEIGHT_DECAY_CONF
                        The keyword arguments for configuring weight decay in optimizer. e.g., 'bias_weight_decay': False will set zero weight decay for bias params. See also espnet2.optimizers.optim_groups.configure_optimizer. (default: {})
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam,accagd,adabound,adamod,diffgrad,lamb,novograd,pid,qhm,sgdw,yogi}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,piecewiselinearwarmuplr,warmupsteplr,warmupreducelronplateau,cycliclr,onecyclelr,cosineannealingwarmrestarts,cosineannealingwarmuprestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

Task related:
  --token_list TOKEN_LIST
                        A text file mapping integer IDs to tokens (default: None)
  --odim ODIM           The number of output feature dimensions (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {})
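
A token list of the kind passed to --token_list can be read as in the sketch below. It assumes the file holds one token per line and that a token's integer ID is its zero-based line index; the help text does not spell this layout out, so treat it as an assumption. The function name `load_token_list` is hypothetical.

```python
def load_token_list(lines):
    """Build id-to-token and token-to-id mappings from an iterable of lines."""
    # Assumption: one token per line, ID = zero-based line index.
    id_to_token = [line.rstrip("\n") for line in lines]
    token_to_id = {token: idx for idx, token in enumerate(id_to_token)}
    return id_to_token, token_to_id
```

Since the function accepts any iterable of strings, an open file object can be passed directly, for example `load_token_list(open(path, encoding="utf-8"))` for some existing `path`.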

Preprocess related:
  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn}
                        The text will be tokenized at the specified token level (default: phn)
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --feats_extract {fbank,spectrogram,linear_spectrogram}
                        The feats_extract type (default: fbank)
  --feats_extract_conf FEATS_EXTRACT_CONF
                        The keyword arguments for feats_extract (default: {})
  --normalize {global_mvn,None}
                        The normalize type (default: global_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --tts {tacotron2,transformer,fastspeech,fastspeech2,prodiff,vits,joint_text2wav,jets}
                        The tts type (default: tacotron2)
  --tts_conf TTS_CONF   The keyword arguments for tts (default: {})
  --pitch_extract {dio,None}
                        The pitch_extract type (default: None)
  --pitch_extract_conf PITCH_EXTRACT_CONF
                        The keyword arguments for pitch_extract (default: {})
  --pitch_normalize {global_mvn,None}
                        The pitch_normalize type (default: None)
  --pitch_normalize_conf PITCH_NORMALIZE_CONF
                        The keyword arguments for pitch_normalize (default: {})
  --energy_extract {energy,None}
                        The energy_extract type (default: None)
  --energy_extract_conf ENERGY_EXTRACT_CONF
                        The keyword arguments for energy_extract (default: {})
  --energy_normalize {global_mvn,None}
                        The energy_normalize type (default: None)
  --energy_normalize_conf ENERGY_NORMALIZE_CONF
                        The keyword arguments for energy_normalize (default: {})