espnet.nets package

Initialize sub package.

espnet.nets.mt_interface

MT Interface module.

class espnet.nets.mt_interface.MTInterface[source]

Bases: object

MT Interface for ESPnet model implementation.

static add_arguments(parser)[source]

Add arguments to parser.

property attention_plot_class

Get attention plot class.

classmethod build(idim: int, odim: int, **kwargs)[source]

Initialize this class with python-level args.

Parameters
  • idim (int) – The number of an input feature dim.

  • odim (int) – The number of output vocab.

Returns

A new instance of MTInterface.

Return type

MTInterface

calculate_all_attentions(xs, ilens, ys)[source]

Calculate attention.

Parameters
  • xs_pad (list) – list of padded input sequences [(T1, idim), (T2, idim), …]

  • ilens (ndarray) – batch of lengths of input sequences (B)

  • ys (list) – list of character id sequence tensor [(L1), (L2), (L3), …]

Returns

attention weights (B, Lmax, Tmax)

Return type

float ndarray

forward(xs, ilens, ys)[source]

Compute loss for training.

Parameters
  • xs – For pytorch, batch of padded source sequences torch.Tensor (B, Tmax, idim); for chainer, list of source sequences chainer.Variable

  • ilens – batch of lengths of source sequences (B); for pytorch, torch.Tensor; for chainer, list of int

  • ys – For pytorch, batch of padded target sequences torch.Tensor (B, Lmax); for chainer, list of target sequences chainer.Variable

Returns

loss value

Return type

torch.Tensor for pytorch, chainer.Variable for chainer

translate(x, trans_args, char_list=None, rnnlm=None)[source]

Translate x for evaluation.

Parameters
  • x (ndarray) – input acoustic feature (B, T, D) or (T, D)

  • trans_args (namespace) – argument namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list

translate_batch(x, trans_args, char_list=None, rnnlm=None)[source]

Beam search implementation for batch.

Parameters
  • x (torch.Tensor) – encoder hidden state sequences (B, Tmax, Henc)

  • trans_args (namespace) – argument namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list

espnet.nets.asr_interface

ASR Interface module.

class espnet.nets.asr_interface.ASRInterface[source]

Bases: object

ASR Interface for ESPnet model implementation.

static add_arguments(parser)[source]

Add arguments to parser.

property attention_plot_class

Get attention plot class.

classmethod build(idim: int, odim: int, **kwargs)[source]

Initialize this class with python-level args.

Parameters
  • idim (int) – The number of an input feature dim.

  • odim (int) – The number of output vocab.

Returns

A new instance of ASRInterface.

Return type

ASRInterface

calculate_all_attentions(xs, ilens, ys)[source]

Calculate attention.

Parameters
  • xs_pad (list) – list of padded input sequences [(T1, idim), (T2, idim), …]

  • ilens (ndarray) – batch of lengths of input sequences (B)

  • ys (list) – list of character id sequence tensor [(L1), (L2), (L3), …]

Returns

attention weights (B, Lmax, Tmax)

Return type

float ndarray

calculate_all_ctc_probs(xs, ilens, ys)[source]

Calculate CTC probability.

Parameters
  • xs_pad (list) – list of padded input sequences [(T1, idim), (T2, idim), …]

  • ilens (ndarray) – batch of lengths of input sequences (B)

  • ys (list) – list of character id sequence tensor [(L1), (L2), (L3), …]

Returns

CTC probabilities (B, Tmax, vocab)

Return type

float ndarray

property ctc_plot_class

Get CTC plot class.

encode(feat)[source]

Encode feature in beam_search (optional).

Parameters

feat (numpy.ndarray) – input feature (T, D)

Returns

encoded feature (T, D)

Return type

torch.Tensor for pytorch, chainer.Variable for chainer

forward(xs, ilens, ys)[source]

Compute loss for training.

Parameters
  • xs – For pytorch, batch of padded source sequences torch.Tensor (B, Tmax, idim); for chainer, list of source sequences chainer.Variable

  • ilens – batch of lengths of source sequences (B); for pytorch, torch.Tensor; for chainer, list of int

  • ys – For pytorch, batch of padded target sequences torch.Tensor (B, Lmax); for chainer, list of target sequences chainer.Variable

Returns

loss value

Return type

torch.Tensor for pytorch, chainer.Variable for chainer

recognize(x, recog_args, char_list=None, rnnlm=None)[source]

Recognize x for evaluation.

Parameters
  • x (ndarray) – input acoustic feature (B, T, D) or (T, D)

  • recog_args (namespace) – argument namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list

recognize_batch(x, recog_args, char_list=None, rnnlm=None)[source]

Beam search implementation for batch.

Parameters
  • x (torch.Tensor) – encoder hidden state sequences (B, Tmax, Henc)

  • recog_args (namespace) – argument namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list

scorers()[source]

Get scorers for beam_search (optional).

Returns

dict of ScorerInterface objects

Return type

dict[str, ScorerInterface]

espnet.nets.asr_interface.dynamic_import_asr(module, backend)[source]

Import ASR models dynamically.

Parameters
  • module (str) – module_name:class_name or alias in predefined_asr

  • backend (str) – NN backend. e.g., pytorch, chainer

Returns

ASR class

Return type

type
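A minimal usage sketch; the explicit module path and the alias below are illustrative, and actual aliases depend on the predefined_asr table:

>>> from espnet.nets.asr_interface import dynamic_import_asr
>>> # by explicit "module_name:class_name" path (path shown is illustrative)
>>> E2E = dynamic_import_asr("espnet.nets.pytorch_backend.e2e_asr:E2E", "pytorch")
>>> # or by a predefined alias (assuming "rnn" is registered for pytorch)
>>> E2E = dynamic_import_asr("rnn", "pytorch")

The returned value is the class itself, not an instance; instances are typically created via build() or the training script's argparse flow.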

espnet.nets.e2e_mt_common

Common functions for ST and MT.

class espnet.nets.e2e_mt_common.ErrorCalculator(char_list, sym_space, sym_pad, report_bleu=False)[source]

Bases: object

Calculate BLEU for ST and MT models during training.

Parameters
  • y_hats – numpy array with predicted text

  • y_pads – numpy array with true (target) text

  • char_list – vocabulary list

  • sym_space – space symbol

  • sym_pad – pad symbol

  • report_bleu – report BLEU score if True

Construct an ErrorCalculator object.

calculate_corpus_bleu(ys_hat, ys_pad)[source]

Calculate corpus-level BLEU score in a mini-batch.

Parameters
  • ys_hat (torch.Tensor) – prediction (batch, seqlen)

  • ys_pad (torch.Tensor) – reference (batch, seqlen)

Returns

corpus-level BLEU score

Return type

float
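A small usage sketch; the vocabulary, symbols, and id arrays below are made-up placeholders:

>>> import numpy as np
>>> from espnet.nets.e2e_mt_common import ErrorCalculator
>>> char_list = ["<unk>", "<space>", "a", "b", "c", "<pad>"]
>>> calc = ErrorCalculator(char_list, sym_space="<space>", sym_pad="<pad>", report_bleu=True)
>>> ys_hat = np.array([[2, 1, 3], [4, 1, 2]])   # predicted ids (batch, seqlen)
>>> ys_pad = np.array([[2, 1, 3], [4, 1, 2]])   # reference ids (batch, seqlen)
>>> bleu = calc.calculate_corpus_bleu(ys_hat, ys_pad)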

espnet.nets.lm_interface

Language model interface.

class espnet.nets.lm_interface.LMInterface[source]

Bases: espnet.nets.scorer_interface.ScorerInterface

LM Interface for ESPnet model implementation.

static add_arguments(parser)[source]

Add arguments to command line argument parser.

classmethod build(n_vocab: int, **kwargs)[source]

Initialize this class with python-level args.

Parameters

n_vocab (int) – Vocabulary size.

Returns

A new instance of LMInterface.

Return type

LMInterface

forward(x, t)[source]

Compute LM loss value from buffer sequences.

Parameters
  • x (torch.Tensor) – Input ids. (batch, len)

  • t (torch.Tensor) – Target ids. (batch, len)

Returns

Tuple of

loss to backward (scalar), negative log-likelihood of t: -log p(t) (scalar) and the number of elements in x (scalar)

Return type

tuple[torch.Tensor, torch.Tensor, torch.Tensor]

Notes

The last two return values are used to compute perplexity: ppl = p(t)^{-1/n} = exp(-log p(t) / n)
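The note translates directly into code; a minimal sketch, assuming lm is any LMInterface implementation and x, t are id tensors of shape (batch, len):

>>> loss, nll, count = lm(x, t)      # forward() returns (loss, -log p(t), #elements)
>>> ppl = (nll / count).exp()        # perplexity: exp(-log p(t) / n)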

espnet.nets.lm_interface.dynamic_import_lm(module, backend)[source]

Import LM class dynamically.

Parameters
  • module (str) – module_name:class_name or alias in predefined_lms

  • backend (str) – NN backend. e.g., pytorch, chainer

Returns

LM class

Return type

type

espnet.nets.tts_interface

TTS Interface related modules.

class espnet.nets.tts_interface.Reporter(**links)[source]

Bases: chainer.link.Chain

Reporter module.

report(dicts)[source]

Report values from a given dict.

class espnet.nets.tts_interface.TTSInterface[source]

Bases: object

TTS Interface for ESPnet model implementation.

Initialize TTS module.

static add_arguments(parser)[source]

Add model-specific arguments to parser.

property attention_plot_class

Plot attention weights.

property base_plot_keys

Return base key names to plot during training.

The keys should match what chainer.reporter reports. If you add the key loss, the reporter will report main/loss and validation/main/loss values. Also, loss.png will be created as a figure visualizing main/loss and validation/main/loss values.

Returns

Base keys to plot during training.

Return type

list[str]

calculate_all_attentions(*args, **kwargs)[source]

Calculate TTS attention weights.

Returns

Batch of attention weights (B, Lmax, Tmax).

Return type

Tensor

forward(*args, **kwargs)[source]

Calculate TTS forward propagation.

Returns

Loss value.

Return type

Tensor

inference(*args, **kwargs)[source]

Generate the sequence of features given the sequences of characters.

Returns

  • Tensor: The sequence of generated features (L, odim).

  • Tensor: The sequence of stop probabilities (L,).

  • Tensor: The sequence of attention weights (L, T).

Return type

Tensor

load_pretrained_model(model_path)[source]

Load pretrained model parameters.

espnet.nets.ctc_prefix_score

class espnet.nets.ctc_prefix_score.CTCPrefixScore(x, blank, eos, xp)[source]

Bases: object

Compute CTC label sequence scores

which is based on Algorithm 2 in WATANABE et al. “HYBRID CTC/ATTENTION ARCHITECTURE FOR END-TO-END SPEECH RECOGNITION,” but extended to efficiently compute the probabilities of multiple labels simultaneously.

initial_state()[source]

Obtain an initial CTC state

Returns

CTC state

class espnet.nets.ctc_prefix_score.CTCPrefixScoreTH(x, xlens, blank, eos, margin=0)[source]

Bases: object

Batch processing of CTCPrefixScore

which is based on Algorithm 2 in WATANABE et al. “HYBRID CTC/ATTENTION ARCHITECTURE FOR END-TO-END SPEECH RECOGNITION,” but extended to efficiently compute the label probabilities for multiple hypotheses simultaneously. See also Seki et al. “Vectorized Beam Search for CTC-Attention-Based Speech Recognition,” in INTERSPEECH (pp. 3825-3829), 2019.

Construct CTC prefix scorer

Parameters
  • x (torch.Tensor) – input label posterior sequences (B, T, O)

  • xlens (torch.Tensor) – input lengths (B,)

  • blank (int) – blank label id

  • eos (int) – end-of-sequence id

  • margin (int) – margin parameter for windowing (0 means no windowing)

index_select_state(state, best_ids)[source]

Select CTC states according to best ids

Parameters
  • state – CTC state

  • best_ids – index numbers selected by beam pruning (B, W)

Returns

selected_state
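A rough construction sketch under assumed shapes; ctc_logits, the blank/eos ids, and best_ids are placeholders, and the scorer is assumed to be fed log posteriors from the CTC branch:

>>> import torch
>>> from espnet.nets.ctc_prefix_score import CTCPrefixScoreTH
>>> ctc_logits = torch.randn(2, 50, 30)                  # (B, T, O) CTC outputs
>>> xlens = torch.tensor([50, 42])
>>> scorer = CTCPrefixScoreTH(torch.log_softmax(ctc_logits, dim=-1), xlens, 0, 29)
>>> # after beam pruning produced best_ids (B, W), keep only surviving CTC states:
>>> # state = scorer.index_select_state(state, best_ids)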

espnet.nets.e2e_asr_common

Common functions for ASR.

class espnet.nets.e2e_asr_common.ErrorCalculator(char_list, sym_space, sym_blank, report_cer=False, report_wer=False)[source]

Bases: object

Calculate CER and WER for E2E_ASR and CTC models during training.

Parameters
  • y_hats – numpy array with predicted text

  • y_pads – numpy array with true (target) text

  • char_list – vocabulary list

  • sym_space – space symbol

  • sym_blank – blank symbol

Construct an ErrorCalculator object.

calculate_cer(seqs_hat, seqs_true)[source]

Calculate sentence-level CER score.

Parameters
  • seqs_hat (list) – prediction

  • seqs_true (list) – reference

Returns

average sentence-level CER score

Return type

float

calculate_cer_ctc(ys_hat, ys_pad)[source]

Calculate sentence-level CER score for CTC.

Parameters
  • ys_hat (torch.Tensor) – prediction (batch, seqlen)

  • ys_pad (torch.Tensor) – reference (batch, seqlen)

Returns

average sentence-level CER score

Return type

float

calculate_wer(seqs_hat, seqs_true)[source]

Calculate sentence-level WER score.

Parameters
  • seqs_hat (list) – prediction

  • seqs_true (list) – reference

Returns

average sentence-level WER score

Return type

float

convert_to_char(ys_hat, ys_pad)[source]

Convert index to character.

Parameters
  • ys_hat (torch.Tensor) – prediction (batch, seqlen)

  • ys_pad (torch.Tensor) – reference (batch, seqlen)

Returns

token list of prediction and token list of reference

Return type

(list, list)

class espnet.nets.e2e_asr_common.ErrorCalculatorTransducer(decoder, token_list, sym_space, sym_blank, report_cer=False, report_wer=False)[source]

Bases: object

Calculate CER and WER for transducer models.

Parameters
  • decoder (AbsDecoder) – decoder module

  • token_list (list) – list of tokens

  • sym_space (str) – space symbol

  • sym_blank (str) – blank symbol

  • report_cer (boolean) – compute CER option

  • report_wer (boolean) – compute WER option

Construct an ErrorCalculator object for transducer model.

calculate_cer(seqs_hat, seqs_true)[source]

Calculate sentence-level CER score for transducer model.

Parameters
  • seqs_hat (torch.Tensor) – prediction (batch, seqlen)

  • seqs_true (torch.Tensor) – reference (batch, seqlen)

Returns

average sentence-level CER score

Return type

(float)

calculate_wer(seqs_hat, seqs_true)[source]

Calculate sentence-level WER score for transducer model.

Parameters
  • seqs_hat (torch.Tensor) – prediction (batch, seqlen)

  • seqs_true (torch.Tensor) – reference (batch, seqlen)

Returns

average sentence-level WER score

Return type

(float)

convert_to_char(ys_hat, ys_pad)[source]

Convert index to character.

Parameters
  • ys_hat (torch.Tensor) – prediction (batch, seqlen)

  • ys_pad (torch.Tensor) – reference (batch, seqlen)

Returns

token list of prediction and token list of reference

Return type

(list, list)

espnet.nets.e2e_asr_common.end_detect(ended_hyps, i, M=3, D_end=-10.0)[source]

End detection.

Described in Eq. (50) of S. Watanabe et al., “Hybrid CTC/Attention Architecture for End-to-End Speech Recognition.”

Parameters
  • ended_hyps – hypotheses that already emitted <eos>, each a dict with a “score” entry

  • i – current output position (decoding step)

  • M – number of recent positions to inspect

  • D_end – score-margin threshold used to stop the search

Returns

True if decoding can be terminated, False otherwise.

Return type

bool
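A sketch of where end detection typically sits in a decoding loop; ended_hyps collects finished hypotheses and the loop body is elided:

>>> for i in range(maxlen):
...     # ... extend hypotheses; move those that emitted <eos> into ended_hyps ...
...     if end_detect(ended_hyps, i):
...         break  # no recent hypothesis improves on the best score by more than D_end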

espnet.nets.e2e_asr_common.get_vgg2l_odim(idim, in_channel=3, out_channel=128)[source]

Return the output size of the VGG frontend.

Parameters
  • idim – input feature dimension

  • in_channel – input channel size

  • out_channel – output channel size

Returns

output size

Return type

int
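A pure-Python sketch of the equivalent arithmetic (the VGG frontend halves the frequency axis twice via max-pooling; this mirrors the documented behavior rather than the exact implementation):

import math

def vgg2l_odim(idim, in_channel=3, out_channel=128):
    freq = idim / in_channel            # frequency bins per input channel
    freq = math.ceil(freq / 2)          # first 2x max-pooling
    freq = math.ceil(freq / 2)          # second 2x max-pooling
    return freq * out_channel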

espnet.nets.e2e_asr_common.label_smoothing_dist(odim, lsm_type, transcript=None, blank=0)[source]

Obtain label distribution for loss smoothing.

Parameters
  • odim – output vocabulary size

  • lsm_type – label smoothing type (currently “unigram” is supported)

  • transcript – path to the transcript json used to estimate the unigram distribution

  • blank – blank symbol id

Returns

label distribution for loss smoothing

espnet.nets.beam_search_transducer

Search algorithms for transducer models.

class espnet.nets.beam_search_transducer.BeamSearchTransducer(decoder: Union[espnet2.asr.decoder.abs_decoder.AbsDecoder, torch.nn.modules.module.Module], beam_size: int, lm: torch.nn.modules.module.Module = None, lm_weight: float = 0.1, search_type: str = 'default', max_sym_exp: int = 2, u_max: int = 50, nstep: int = 1, prefix_alpha: int = 1, score_norm: bool = True)[source]

Bases: object

Beam search implementation for transducer.

Initialize transducer beam search.

Parameters
  • decoder – Decoder class to use

  • beam_size – Number of hypotheses kept during search

  • lm – LM class to use

  • lm_weight – lm weight for soft fusion

  • search_type – type of algorithm to use for search

  • max_sym_exp – number of maximum symbol expansions at each time step (“tsd”)

  • u_max – maximum output sequence length (“alsd”)

  • nstep – number of maximum expansion steps at each time step (“nsc”)

  • prefix_alpha – maximum prefix length in prefix search (“nsc”)

  • score_norm – normalize final scores by length (“default”)
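A minimal construction sketch; decoder and enc_out are placeholders, and invoking the search object per utterance on encoder output is assumed:

>>> beam_search = BeamSearchTransducer(
...     decoder=decoder,        # any transducer decoder module (placeholder)
...     beam_size=5,
...     search_type="default",  # or "tsd", "alsd", "nsc"
... )
>>> nbest_hyps = beam_search(enc_out)   # enc_out: (T_max, D_enc)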

align_length_sync_decoding(h: torch.Tensor) → List[espnet.nets.beam_search_transducer.Hypothesis][source]

Alignment-length synchronous beam search implementation.

Based on https://ieeexplore.ieee.org/document/9053040

Parameters

h – Encoded speech features (T_max, D_enc)

Returns

N-best decoding results

Return type

nbest_hyps

default_beam_search(h: torch.Tensor) → List[espnet.nets.beam_search_transducer.Hypothesis][source]

Beam search implementation.

Parameters

x – Encoded speech features (T_max, D_enc)

Returns

N-best decoding results

Return type

nbest_hyps

greedy_search(h: torch.Tensor) → List[espnet.nets.beam_search_transducer.Hypothesis][source]

Greedy search implementation for transformer-transducer.

Parameters

h – Encoded speech features (T_max, D_enc)

Returns

1-best decoding results

Return type

hyp

nsc_beam_search(h: torch.Tensor) → List[espnet.nets.beam_search_transducer.Hypothesis][source]

N-step constrained beam search implementation.

Based and modified from https://arxiv.org/pdf/2002.03577.pdf. Please reference ESPnet (b-flo, PR #2444) for any usage outside ESPnet until further modifications.

Note: the algorithm is not in its “complete” form but works almost as intended.

Parameters

h – Encoded speech features (T_max, D_enc)

Returns

N-best decoding results

Return type

nbest_hyps

sort_nbest(hyps: List[espnet.nets.beam_search_transducer.Hypothesis]) → List[espnet.nets.beam_search_transducer.Hypothesis][source]

Sort hypotheses by score or score given sequence length.

Parameters

hyps – list of hypotheses

Returns

sorted list of hypotheses

Return type

hyps

time_sync_decoding(h: torch.Tensor) → List[espnet.nets.beam_search_transducer.Hypothesis][source]

Time synchronous beam search implementation.

Based on https://ieeexplore.ieee.org/document/9053040

Parameters

h – Encoded speech features (T_max, D_enc)

Returns

N-best decoding results

Return type

nbest_hyps

class espnet.nets.beam_search_transducer.Hypothesis(score: float, yseq: List[int], dec_state: Union[List[List[torch.Tensor]], List[torch.Tensor]], y: List[torch.Tensor] = None, lm_state: Union[Dict[str, Any], List[Any]] = None, lm_scores: torch.Tensor = None)[source]

Bases: object

Hypothesis class for beam search algorithms.

lm_scores = None
lm_state = None
y = None

espnet.nets.scorer_interface

Scorer interface module.

class espnet.nets.scorer_interface.BatchPartialScorerInterface[source]

Bases: espnet.nets.scorer_interface.BatchScorerInterface, espnet.nets.scorer_interface.PartialScorerInterface

Batch partial scorer interface for beam search.

batch_score_partial(ys: torch.Tensor, next_tokens: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, Any][source]

Score new token (required).

Parameters
  • ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).

  • next_tokens (torch.Tensor) – torch.int64 tokens to score (n_batch, n_token).

  • states (List[Any]) – Scorer states for prefix tokens.

  • xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).

Returns

Tuple of a score tensor for ys that has a shape (n_batch, n_vocab) and next states for ys

Return type

tuple[torch.Tensor, Any]

class espnet.nets.scorer_interface.BatchScorerInterface[source]

Bases: espnet.nets.scorer_interface.ScorerInterface

Batch scorer interface.

batch_init_state(x: torch.Tensor) → Any[source]

Get an initial state for decoding (optional).

Parameters

x (torch.Tensor) – The encoded feature tensor

Returns: initial state

batch_score(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]

Score new token batch (required).

Parameters
  • ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).

  • states (List[Any]) – Scorer states for prefix tokens.

  • xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).

Returns

Tuple of

batchified scores for next token with shape of (n_batch, n_vocab) and next state list for ys.

Return type

tuple[torch.Tensor, List[Any]]
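A hedged sketch of a trivial BatchScorerInterface implementation that assigns a uniform log-probability to every token; UniformScorer and n_vocab are illustrative names, not ESPnet classes:

import math
from typing import Any, List, Tuple

import torch

from espnet.nets.scorer_interface import BatchScorerInterface


class UniformScorer(BatchScorerInterface):
    """Toy scorer: every next token gets log(1 / n_vocab)."""

    def __init__(self, n_vocab: int):
        self.n_vocab = n_vocab

    def batch_score(
        self, ys: torch.Tensor, states: List[Any], xs: torch.Tensor
    ) -> Tuple[torch.Tensor, List[Any]]:
        # Same score for every candidate token; states pass through unchanged.
        scores = torch.full((ys.size(0), self.n_vocab), -math.log(self.n_vocab))
        return scores, states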

class espnet.nets.scorer_interface.PartialScorerInterface[source]

Bases: espnet.nets.scorer_interface.ScorerInterface

Partial scorer interface for beam search.

The partial scorer performs scoring after the non-partial scorers have finished, and receives pre-pruned next tokens to score, because scoring all tokens would be too heavy.

score_partial(y: torch.Tensor, next_tokens: torch.Tensor, state: Any, x: torch.Tensor) → Tuple[torch.Tensor, Any][source]

Score new token (required).

Parameters
  • y (torch.Tensor) – 1D prefix token

  • next_tokens (torch.Tensor) – torch.int64 next token to score

  • state – decoder state for prefix tokens

  • x (torch.Tensor) – The encoder feature that generates ys

Returns

Tuple of a score tensor for y that has a shape (len(next_tokens),) and next state for ys

Return type

tuple[torch.Tensor, Any]

class espnet.nets.scorer_interface.ScorerInterface[source]

Bases: object

Scorer interface for beam search.

The scorer performs scoring of all tokens in the vocabulary.

final_score(state: Any) → float[source]

Score eos (optional).

Parameters

state – Scorer state for prefix tokens

Returns

final score

Return type

float

init_state(x: torch.Tensor) → Any[source]

Get an initial state for decoding (optional).

Parameters

x (torch.Tensor) – The encoded feature tensor

Returns: initial state

score(y: torch.Tensor, state: Any, x: torch.Tensor) → Tuple[torch.Tensor, Any][source]

Score new token (required).

Parameters
  • y (torch.Tensor) – 1D torch.int64 prefix tokens.

  • state – Scorer state for prefix tokens

  • x (torch.Tensor) – The encoder feature that generates ys.

Returns

Tuple of

scores for next token that has a shape of (n_vocab) and next state for ys

Return type

tuple[torch.Tensor, Any]

select_state(state: Any, i: int, new_id: int = None) → Any[source]

Select state with relative ids in the main beam search.

Parameters
  • state – Decoder state for prefix tokens

  • i (int) – Index to select a state in the main beam search

  • new_id (int) – New label index to select a state if necessary

Returns

pruned state

Return type

state
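As a concrete illustration, a minimal ScorerInterface implementation that adds a constant bonus per emitted token; this is a sketch modeled on the length-bonus idea, not a drop-in for any ESPnet-provided scorer:

from typing import Any, Tuple

import torch

from espnet.nets.scorer_interface import ScorerInterface


class LengthBonus(ScorerInterface):
    """Stateless scorer giving every candidate token the same bonus."""

    def __init__(self, n_vocab: int, bonus: float = 1.0):
        self.n_vocab = n_vocab
        self.bonus = bonus

    def score(self, y: torch.Tensor, state: Any, x: torch.Tensor) -> Tuple[torch.Tensor, Any]:
        # Shape (n_vocab,): one score per candidate next token; no state is kept.
        return torch.full((self.n_vocab,), self.bonus), None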

espnet.nets.__init__

Initialize sub package.

espnet.nets.st_interface

ST Interface module.

class espnet.nets.st_interface.STInterface[source]

Bases: espnet.nets.asr_interface.ASRInterface

ST Interface for ESPnet model implementation.

NOTE: This class is inherited from ASRInterface to enable joint translation and recognition when performing multi-task learning with the ASR task.

translate(x, trans_args, char_list=None, rnnlm=None, ensemble_models=[])[source]

Translate x for evaluation.

Parameters
  • x (ndarray) – input acoustic feature (B, T, D) or (T, D)

  • trans_args (namespace) – argument namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list

translate_batch(x, trans_args, char_list=None, rnnlm=None)[source]

Beam search implementation for batch.

Parameters
  • x (torch.Tensor) – encoder hidden state sequences (B, Tmax, Henc)

  • trans_args (namespace) – argument namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list

espnet.nets.st_interface.dynamic_import_st(module, backend)[source]

Import ST models dynamically.

Parameters
  • module (str) – module_name:class_name or alias in predefined_st

  • backend (str) – NN backend. e.g., pytorch, chainer

Returns

ST class

Return type

type

espnet.nets.transducer_decoder_interface

Transducer decoder interface module.

class espnet.nets.transducer_decoder_interface.TransducerDecoderInterface[source]

Bases: object

Decoder interface for transducer models.

batch_score(hyps: List[espnet.nets.beam_search_transducer.Hypothesis], batch_states: Union[Tuple[Any], List[torch.Tensor]], cache: Dict[str, Any]) → Union[Tuple[Any], List[torch.Tensor], torch.Tensor][source]

Forward batch one step.

Parameters
  • hyps – batch of hypotheses

  • batch_states – batch of decoder states (and optionally attention states)

  • cache – pairs of (y, state) for each token sequence (key)

  • init_tensor – initial tensor for attention decoder

Returns

decoder outputs. batch_states: batch of decoder states (and optionally attention states). lm_tokens: batch of token ids for LM.

Return type

batch_y

create_batch_states(batch_states: Union[Tuple[Any], List[torch.Tensor]], l_states: Union[List[Tuple[Any]], List[List[torch.Tensor]]], l_tokens: List[List[int]] = None) → Union[Tuple[Any], List[torch.Tensor]][source]

Create batch of decoder states.

Parameters
  • batch_states – batch of decoder states (and optionally attention states)

  • l_states – list of decoder states (and optionally attention states)

  • l_tokens – list of token sequences for batch

Returns

batch of decoder and attention states

Return type

batch_states

init_state(init_tensor: torch.Tensor = None) → Union[Tuple[Any], torch.Tensor][source]

Initialize decoder states (and optionally attention states).

Parameters

init_tensor – input features

Returns

initial state

Return type

state

score(hyp: espnet.nets.beam_search_transducer.Hypothesis, cache: Dict[str, Any], init_tensor: torch.Tensor = None) → Union[Tuple[Any], List[torch.Tensor], torch.Tensor][source]

Forward one step.

Parameters
  • hyp – hypothesis

  • cache – pairs of (y, state) for each token sequence (key)

  • init_tensor – initial tensor for attention decoder

Returns

decoder outputs. new_state: decoder and attention states. lm_tokens: token id for LM.

Return type

y

select_state(batch_states: Union[Tuple[Any], List[torch.Tensor]], idx: int) → Union[Tuple[Any], List[torch.Tensor]][source]

Get decoder state from batch for given id.

Parameters
  • batch_states – batch of decoder states (and optionally attention states)

  • idx – index to extract state from batch of states

Returns

decoder states (and optionally attention states) for given id

Return type

state_idx

espnet.nets.chainer_backend.asr_interface

ASR Interface module.

class espnet.nets.chainer_backend.asr_interface.ChainerASRInterface(**links)[source]

Bases: espnet.nets.asr_interface.ASRInterface, chainer.link.Chain

ASR Interface for ESPnet model implementation.

static custom_converter(*args, **kw)[source]

Get custom_converter of the model (Chainer only).

static custom_parallel_updater(*args, **kw)[source]

Get custom_parallel_updater of the model (Chainer only).

static custom_updater(*args, **kw)[source]

Get custom_updater of the model (Chainer only).

espnet.nets.chainer_backend.e2e_asr_transformer

Transformer-based model for End-to-end ASR.

class espnet.nets.chainer_backend.e2e_asr_transformer.E2E(idim, odim, args, ignore_id=-1, flag_return=True)[source]

Bases: espnet.nets.chainer_backend.asr_interface.ChainerASRInterface

E2E module.

Parameters
  • idim (int) – Input dimensions.

  • odim (int) – Output dimensions.

  • args (Namespace) – Training config.

  • ignore_id (int, optional) – Id for ignoring a character.

  • flag_return (bool, optional) – If True, forward() returns a tuple (loss, loss_ctc, loss_att, acc); otherwise, it returns only the loss.

Initialize the transformer.

static add_arguments(parser)[source]

Customize flags for transformer setup.

Parameters

parser (Namespace) – Training config.

property attention_plot_class

Attention plot function.

Redirects to PlotAttentionReport

Returns

PlotAttentionReport

calculate_all_attentions(xs, ilens, ys)[source]

E2E attention calculation.

Parameters
  • xs_pad (List[tuple()]) – List of padded input sequences. [(T1, idim), (T2, idim), …]

  • ilens (ndarray) – Batch of lengths of input sequences. (B)

  • ys (List) – List of character id sequence tensor. [(L1), (L2), (L3), …]

Returns

Attention weights. (B, Lmax, Tmax)

Return type

float ndarray

calculate_attentions(xs, x_mask, ys_pad)[source]

Calculate Attentions.

static custom_converter(subsampling_factor=0)[source]

Get custom_converter of the model.

static custom_parallel_updater(iters, optimizer, converter, devices, accum_grad=1)[source]

Get custom_parallel_updater of the model.

static custom_updater(iters, optimizer, converter, device=-1, accum_grad=1)[source]

Get custom_updater of the model.

forward(xs, ilens, ys_pad, calculate_attentions=False)[source]

E2E forward propagation.

Parameters
  • xs (chainer.Variable) – Batch of padded input features. (B, Tmax, idim)

  • ilens (chainer.Variable) – Batch of length of each input batch. (B,)

  • ys (chainer.Variable) – Batch of padded character ids. (B, Lmax)

  • calculate_attentions (bool) – If true, return value is the output of encoder.

Returns

Training loss. float (optional): Training loss for CTC. float (optional): Training loss for attention. float (optional): Accuracy. chainer.Variable (optional): Output of the encoder.

Return type

float

recognize(x_block, recog_args, char_list=None, rnnlm=None)[source]

E2E recognition function.

Parameters
  • x (ndarray) – Input acoustic feature (B, T, D) or (T, D).

  • recog_args (Namespace) – Argument namespace containing options.

  • char_list (List[str]) – List of characters.

  • rnnlm (chainer.Chain) – Language model module defined at espnet.lm.chainer_backend.lm.

Returns

N-best decoding results.

Return type

List

recognize_beam(h, lpz, recog_args, char_list=None, rnnlm=None)[source]

E2E beam search.

Parameters
  • h (ndarray) – Encoder output features (B, T, D) or (T, D).

  • lpz (ndarray) – Log probabilities from CTC.

  • recog_args (Namespace) – Argument namespace containing options.

  • char_list (List[str]) – List of characters.

  • rnnlm (chainer.Chain) – Language model module defined at espnet.lm.chainer_backend.lm.

Returns

N-best decoding results.

Return type

List

reset_parameters(args)[source]

Initialize the weights according to the given initialization type.

Parameters

args (Namespace) – Transformer config.

espnet.nets.chainer_backend.deterministic_embed_id

class espnet.nets.chainer_backend.deterministic_embed_id.EmbedID(in_size, out_size, initialW=None, ignore_label=None)[source]

Bases: chainer.link.Link

Efficient linear layer for one-hot input.

This is a link that wraps the embed_id() function. This link holds the ID (word) embedding matrix W as a parameter.

Parameters
  • in_size (int) – Number of different identifiers (a.k.a. vocabulary size).

  • out_size (int) – Output dimension.

  • initialW (Initializer) – Initializer to initialize the weight.

  • ignore_label (int) – If ignore_label is an int value, rows of the return value whose corresponding ID is ignore_label are filled with 0.

See also: embed_id()

W

Embedding parameter matrix.

Type

Variable

Examples

>>> W = np.array([[0, 0, 0],
...               [1, 1, 1],
...               [2, 2, 2]]).astype('f')
>>> W
array([[ 0.,  0.,  0.],
       [ 1.,  1.,  1.],
       [ 2.,  2.,  2.]], dtype=float32)
>>> l = L.EmbedID(W.shape[0], W.shape[1], initialW=W)
>>> x = np.array([2, 1]).astype('i')
>>> x
array([2, 1], dtype=int32)
>>> y = l(x)
>>> y.data
array([[ 2.,  2.,  2.],
       [ 1.,  1.,  1.]], dtype=float32)
ignore_label = None
class espnet.nets.chainer_backend.deterministic_embed_id.EmbedIDFunction(ignore_label=None)[source]

Bases: chainer.function_node.FunctionNode

backward(indexes, grad_outputs)[source]

Computes gradients w.r.t. specified inputs given output gradients.

This method is used to compute one step of the backpropagation corresponding to the forward computation of this function node. Given the gradients w.r.t. output variables, this method computes the gradients w.r.t. specified input variables. Note that this method does not need to compute any input gradients not specified by target_input_indexes.

Unlike Function.backward(), gradients are given as Variable objects and this method itself has to return input gradients as Variable objects. It enables the function node to return the input gradients with the full computational history, in which case it supports differentiable backpropagation or higher-order differentiation.

The default implementation returns None s, which means the function is not differentiable.

Parameters
  • target_input_indexes (tuple of int) – Sorted indices of the input variables w.r.t. which the gradients are required. It is guaranteed that this tuple contains at least one element.

  • grad_outputs (tuple of Variables) – Gradients w.r.t. the output variables. If the gradient w.r.t. an output variable is not given, the corresponding element is None.

Returns

Tuple of variables that represent the gradients w.r.t. specified input variables. The length of the tuple can be same as either len(target_input_indexes) or the number of inputs. In the latter case, the elements not specified by target_input_indexes will be discarded.

See also

backward_accumulate() provides an alternative interface that allows you to implement the backward computation fused with the gradient accumulation.

check_type_forward(in_types)[source]

Checks types of input data before forward propagation.

This method is called before forward() and validates the types of input variables using the type checking utilities.

Parameters

in_types (TypeInfoTuple) – The type information of input variables for forward().

forward(inputs)[source]

Computes the output arrays from the input arrays.

It delegates the procedure to forward_cpu() or forward_gpu() by default. Which of them this method selects is determined by the type of input arrays. Implementations of FunctionNode must implement either CPU/GPU methods or this method.

Parameters

inputs – Tuple of input array(s).

Returns

Tuple of output array(s).

Warning

Implementations of FunctionNode must take care that the return value must be a tuple even if it returns only one array.

class espnet.nets.chainer_backend.deterministic_embed_id.EmbedIDGrad(w_shape, ignore_label=None)[source]

Bases: chainer.function_node.FunctionNode

backward(indexes, grads)[source]

Computes gradients w.r.t. specified inputs given output gradients.

This method is used to compute one step of the backpropagation corresponding to the forward computation of this function node. Given the gradients w.r.t. output variables, this method computes the gradients w.r.t. specified input variables. Note that this method does not need to compute any input gradients not specified by target_input_indexes.

Unlike Function.backward(), gradients are given as Variable objects and this method itself has to return input gradients as Variable objects. It enables the function node to return the input gradients with the full computational history, in which case it supports differentiable backpropagation or higher-order differentiation.

The default implementation returns None s, which means the function is not differentiable.

Parameters
  • target_input_indexes (tuple of int) – Sorted indices of the input variables w.r.t. which the gradients are required. It is guaranteed that this tuple contains at least one element.

  • grad_outputs (tuple of Variables) – Gradients w.r.t. the output variables. If the gradient w.r.t. an output variable is not given, the corresponding element is None.

Returns

Tuple of variables that represent the gradients w.r.t. specified input variables. The length of the tuple can be same as either len(target_input_indexes) or the number of inputs. In the latter case, the elements not specified by target_input_indexes will be discarded.

See also

backward_accumulate() provides an alternative interface that allows you to implement the backward computation fused with the gradient accumulation.

forward(inputs)[source]

Computes the output arrays from the input arrays.

It delegates the procedure to forward_cpu() or forward_gpu() by default. Which of them this method selects is determined by the type of input arrays. Implementations of FunctionNode must implement either CPU/GPU methods or this method.

Parameters

inputs – Tuple of input array(s).

Returns

Tuple of output array(s).

Warning

Implementations of FunctionNode must take care that the return value must be a tuple even if it returns only one array.

espnet.nets.chainer_backend.deterministic_embed_id.embed_id(x, W, ignore_label=None)[source]

Efficient linear function for one-hot input.

This function implements so-called word embeddings. It takes two arguments: a set of IDs (words) x in a \(B\)-dimensional integer vector, and a set of all ID (word) embeddings W in a \(V \times d\) float32 matrix. It outputs a \(B \times d\) matrix whose i-th row is the x[i]-th row of W. This function is only differentiable with respect to the input W.

Parameters
  • x (chainer.Variable | np.ndarray) – Batch vectors of IDs. Each element must be signed integer.

  • W (chainer.Variable | np.ndarray) – Distributed representation of each ID (a.k.a. word embeddings).

  • ignore_label (int) – If ignore_label is an int value, rows of the return value whose corresponding ID is ignore_label are filled with 0.

Returns

Embedded variable.

Return type

chainer.Variable

See also: EmbedID

Examples

>>> x = np.array([2, 1]).astype('i')
>>> x
array([2, 1], dtype=int32)
>>> W = np.array([[0, 0, 0],
...               [1, 1, 1],
...               [2, 2, 2]]).astype('f')
>>> W
array([[ 0.,  0.,  0.],
       [ 1.,  1.,  1.],
       [ 2.,  2.,  2.]], dtype=float32)
>>> F.embed_id(x, W).data
array([[ 2.,  2.,  2.],
       [ 1.,  1.,  1.]], dtype=float32)
>>> F.embed_id(x, W, ignore_label=1).data
array([[ 2.,  2.,  2.],
       [ 0.,  0.,  0.]], dtype=float32)

espnet.nets.chainer_backend.e2e_asr

RNN sequence-to-sequence speech recognition model (chainer).

class espnet.nets.chainer_backend.e2e_asr.E2E(idim, odim, args, flag_return=True)[source]

Bases: espnet.nets.chainer_backend.asr_interface.ChainerASRInterface

E2E module for chainer backend.

Parameters
  • idim (int) – Dimension of the inputs.

  • odim (int) – Dimension of the outputs.

  • args (parser.args) – Training config.

  • flag_return (bool) – If True, forward() returns additional metrics in addition to the training loss.

Construct an E2E object.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

static add_arguments(parser)[source]

Add arguments.

calculate_all_attentions(xs, ilens, ys)[source]

E2E attention calculation.

Parameters
  • xs (List) – List of padded input sequences. [(T1, idim), (T2, idim), …]

  • ilens (np.ndarray) – Batch of lengths of input sequences. (B)

  • ys (List) – List of character id sequence tensor. [(L1), (L2), (L3), …]

Returns

Attention weights. (B, Lmax, Tmax)

Return type

float np.ndarray

static custom_converter(subsampling_factor=0)[source]

Get custom_converter of the model.

static custom_parallel_updater(iters, optimizer, converter, devices, accum_grad=1)[source]

Get custom_parallel_updater of the model.

static custom_updater(iters, optimizer, converter, device=-1, accum_grad=1)[source]

Get custom_updater of the model.

forward(xs, ilens, ys)[source]

E2E forward propagation.

Parameters
  • xs (chainer.Variable) – Batch of padded input features. (B, Tmax, idim)

  • ilens (chainer.Variable) – Batch of length of each input batch. (B,)

  • ys (chainer.Variable) – Batch of padded character ids. (B, Lmax)

Returns

Loss calculated from the attention and CTC losses. float (optional): CTC loss. float (optional): Attention loss. float (optional): Accuracy.

Return type

float

recognize(x, recog_args, char_list, rnnlm=None)[source]

E2E greedy/beam search.

Parameters
  • x (chainer.Variable) – Input tensor for recognition.

  • recog_args (parser.args) – Arguments of config file.

  • char_list (List[str]) – List of characters.

  • rnnlm (Module) – RNNLM module defined at espnet.lm.chainer_backend.lm.

Returns

Result of recognition.

Return type

List[Dict[str, Any]]

espnet.nets.chainer_backend.nets_utils

espnet.nets.chainer_backend.ctc

class espnet.nets.chainer_backend.ctc.CTC(odim, eprojs, dropout_rate)[source]

Bases: chainer.link.Chain

Chainer implementation of the CTC layer.

Parameters
  • odim (int) – The output dimension.

  • eprojs (int | None) – Dimension of input vectors from encoder.

  • dropout_rate (float) – Dropout rate.

log_softmax(hs)[source]

Log_softmax of frame activations.

Parameters

hs (list of chainer.Variable | N-dimension array) – Input variable from encoder.

Returns

An n-dimension float array.

Return type

chainer.Variable

class espnet.nets.chainer_backend.ctc.WarpCTC(odim, eprojs, dropout_rate)[source]

Bases: chainer.link.Chain

Chainer implementation of warp-ctc layer.

Parameters
  • odim (int) – The output dimension.

  • eprojs (int | None) – Dimension of input vectors from encoder.

  • dropout_rate (float) – Dropout rate.

argmax(hs_pad)[source]

Argmax of frame activations.

Parameters

hs_pad (chainer.Variable) – 3d tensor (B, Tmax, eprojs)

Returns

argmax applied 2d tensor (B, Tmax)

Return type

chainer.Variable

log_softmax(hs)[source]

Log_softmax of frame activations.

Parameters

hs (list of chainer.Variable | N-dimension array) – Input variable from encoder.

Returns

An n-dimension float array.

Return type

chainer.Variable

espnet.nets.chainer_backend.ctc.ctc_for(args, odim)[source]

Return the CTC layer corresponding to the args.

Parameters
  • args (Namespace) – The program arguments.

  • odim (int) – The output dimension.

Returns

The CTC module.

espnet.nets.chainer_backend.__init__

Initialize sub package.

espnet.nets.chainer_backend.transformer.training

Class Declaration of Transformer’s Training Subprocess.

class espnet.nets.chainer_backend.transformer.training.CustomConverter[source]

Bases: object

Custom Converter.

Parameters

subsampling_factor (int) – The subsampling factor.

Initialize subsampling.

class espnet.nets.chainer_backend.transformer.training.CustomParallelUpdater(train_iters, optimizer, converter, devices, accum_grad=1)[source]

Bases: chainer.training.updaters.multiprocess_parallel_updater.MultiprocessParallelUpdater

Custom Parallel Updater for chainer.

Defines the main update routine.

Parameters
  • train_iter (iterator | dict[str, iterator]) – Dataset iterator for the training dataset. It can also be a dictionary that maps strings to iterators. If this is just an iterator, then the iterator is registered by the name 'main'.

  • optimizer (optimizer | dict[str, optimizer]) – Optimizer to update parameters. It can also be a dictionary that maps strings to optimizers. If this is just an optimizer, then the optimizer is registered by the name 'main'.

  • converter (espnet.asr.chainer_backend.asr.CustomConverter) – Converter function to build input arrays. Each batch extracted by the main iterator and the device option are passed to this function. chainer.dataset.concat_examples() is used by default.

  • device (torch.device) – Device to which the training data is sent. Negative value indicates the host memory (CPU).

  • accum_grad (int) – The number of gradient accumulation steps. If set to 2, the network parameters are updated once every two iterations, i.e., the effective batch size is doubled.

Initialize custom parallel updater.

update()[source]

Update step for Custom Parallel Updater.

update_core()[source]

Process main update routine for Custom Parallel Updater.

class espnet.nets.chainer_backend.transformer.training.CustomUpdater(train_iter, optimizer, converter, device, accum_grad=1)[source]

Bases: chainer.training.updaters.standard_updater.StandardUpdater

Custom updater for chainer.

Parameters
  • train_iter (iterator | dict[str, iterator]) – Dataset iterator for the training dataset. It can also be a dictionary that maps strings to iterators. If this is just an iterator, then the iterator is registered by the name 'main'.

  • optimizer (optimizer | dict[str, optimizer]) – Optimizer to update parameters. It can also be a dictionary that maps strings to optimizers. If this is just an optimizer, then the optimizer is registered by the name 'main'.

  • converter (espnet.asr.chainer_backend.asr.CustomConverter) – Converter function to build input arrays. Each batch extracted by the main iterator and the device option are passed to this function. chainer.dataset.concat_examples() is used by default.

  • device (int or dict) – The destination device info to send variables. In the case of cpu or single gpu, device=-1 or 0, respectively. In the case of multi-gpu, device={“main”:0, “sub_1”: 1, …}.

  • accum_grad (int) – The number of gradient accumulation steps. If set to 2, the network parameters are updated once every two iterations, i.e., the effective batch size is doubled.

Initialize Custom Updater.

update()[source]

Update step for Custom Updater.

update_core()[source]

Process main update routine for Custom Updater.

class espnet.nets.chainer_backend.transformer.training.VaswaniRule(attr, d, warmup_steps=4000, init=None, target=None, optimizer=None, scale=1.0)[source]

Bases: chainer.training.extension.Extension

Trainer extension to shift an optimizer attribute magically by Vaswani.

Parameters
  • attr (str) – Name of the attribute to shift.

  • rate (float) – Rate of the exponential shift. This value is multiplied to the attribute at each call.

  • init (float) – Initial value of the attribute. If it is None, the extension extracts the attribute at the first call and uses it as the initial value.

  • target (float) – Target value of the attribute. If the attribute reaches this value, the shift stops.

  • optimizer (Optimizer) – Target optimizer to adjust the attribute. If it is None, the main optimizer of the updater is used.

Initialize Vaswani rule extension.

initialize(trainer)[source]

Initialize Optimizer values.

serialize(serializer)[source]

Serialize extension.

espnet.nets.chainer_backend.transformer.training.sum_sqnorm(arr)[source]

Calculate the sum of squared norms of the array.

Parameters

arr (numpy.ndarray) –

Returns

Sum of the squared norms computed from the given array.

Return type

Float

espnet.nets.chainer_backend.transformer.attention

Class Declaration of Transformer’s Attention.

class espnet.nets.chainer_backend.transformer.attention.MultiHeadAttention(n_units, h=8, dropout=0.1, initialW=None, initial_bias=None)[source]

Bases: chainer.link.Chain

Multi Head Attention Layer.

Parameters
  • n_units (int) – Number of input units.

  • h (int) – Number of attention heads.

  • dropout (float) – Dropout rate.

  • initialW – Initializer to initialize the weight.

  • initial_bias – Initializer to initialize the bias.

  • h – the number of heads

  • n_units – the number of features

  • dropout_rate (float) – dropout rate

Initialize MultiHeadAttention.

forward(e_var, s_var=None, mask=None, batch=1)[source]

Core function of the Multi-head attention layer.

Parameters
  • e_var (chainer.Variable) – Variable of input array.

  • s_var (chainer.Variable) – Variable of source array from encoder.

  • mask (chainer.Variable) – Attention mask.

  • batch (int) – Batch size.

Returns

Output of multi-head attention layer.

Return type

chainer.Variable

espnet.nets.chainer_backend.transformer.embedding

Class Declaration of Transformer’s Positional Encoding.

class espnet.nets.chainer_backend.transformer.embedding.PositionalEncoding(n_units, dropout=0.1, length=5000)[source]

Bases: chainer.link.Chain

Positional encoding module.

Parameters
  • n_units (int) – embedding dim

  • dropout (float) – dropout rate

  • length (int) – maximum input length

Initialize Positional Encoding.

forward(e)[source]

Forward Positional Encoding.
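A numpy sketch of the underlying sinusoidal table, assuming this module follows the standard Vaswani et al. formulation (even n_units assumed):

import numpy as np

def sinusoid_table(length, n_units):
    pos = np.arange(length)[:, None]                  # (length, 1)
    i = np.arange(0, n_units, 2)[None, :]             # even feature indices
    angle = pos / np.power(10000.0, i / n_units)      # (length, n_units // 2)
    table = np.zeros((length, n_units), dtype=np.float32)
    table[:, 0::2] = np.sin(angle)                    # sin on even dimensions
    table[:, 1::2] = np.cos(angle)                    # cos on odd dimensions
    return table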

espnet.nets.chainer_backend.transformer.layer_norm

Class Declaration of Transformer’s Layer Normalization.

class espnet.nets.chainer_backend.transformer.layer_norm.LayerNorm(dims, eps=1e-12)[source]

Bases: chainer.links.normalization.layer_normalization.LayerNormalization

Redirect to L.LayerNormalization.

Initialize LayerNorm.

espnet.nets.chainer_backend.transformer.label_smoothing_loss

Class Declaration of Transformer’s Label Smoothing loss.

class espnet.nets.chainer_backend.transformer.label_smoothing_loss.LabelSmoothingLoss(smoothing, n_target_vocab, normalize_length=False, ignore_id=-1)[source]

Bases: chainer.link.Chain

Label Smoothing Loss.

Parameters
  • smoothing (float) – smoothing rate (0.0 means the conventional CE).

  • n_target_vocab (int) – number of classes.

  • normalize_length (bool) – normalize loss by sequence length if True.

Initialize Loss.

forward(ys_block, ys_pad)[source]

Forward Loss.

Parameters
  • ys_block (chainer.Variable) – Predicted labels.

  • ys_pad (chainer.Variable) – Target (true) labels.

Returns

Training loss.

Return type

float
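A numpy sketch of the smoothed target distribution such a loss compares against, assuming the smoothing mass is spread uniformly over the non-target classes:

import numpy as np

def smoothed_targets(labels, n_vocab, smoothing):
    # Each row: 1 - smoothing on the true class, smoothing spread over the rest.
    dist = np.full((len(labels), n_vocab), smoothing / (n_vocab - 1), dtype=np.float32)
    dist[np.arange(len(labels)), labels] = 1.0 - smoothing
    return dist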

espnet.nets.chainer_backend.transformer.decoder

Class Declaration of Transformer’s Decoder.

class espnet.nets.chainer_backend.transformer.decoder.Decoder(odim, args, initialW=None, initial_bias=None)[source]

Bases: chainer.link.Chain

Decoder layer.

Parameters
  • odim (int) – The output dimension.

  • n_layers (int) – Number of decoder layers.

  • n_units (int) – Number of attention units.

  • d_units (int) – Dimension of input vector of decoder.

  • h (int) – Number of attention heads.

  • dropout (float) – Dropout rate.

  • initialW (Initializer) – Initializer to initialize the weight.

  • initial_bias (Initializer) – Initializer to initialize the bias.

Initialize Decoder.

forward(ys_pad, source, x_mask)[source]

Forward decoder.

Parameters
  • e (xp.array) – input token ids, int64 (batch, maxlen_out)

  • yy_mask (xp.array) – input token mask, uint8 (batch, maxlen_out)

  • source (xp.array) – encoded memory, float32 (batch, maxlen_in, feat)

  • xy_mask (xp.array) – encoded memory mask, uint8 (batch, maxlen_in)

Returns

decoded token score before softmax (batch, maxlen_out, token)

Return type

chainer.Variable

make_attention_mask(source_block, target_block)[source]

Prepare the attention mask.

Parameters
  • source_block (ndarray) – Source block with dimensions: (B x S).

  • target_block (ndarray) – Target block with dimensions: (B x T).

Returns

Mask with dimensions (B, S, T).

Return type

ndarray

recognize(e, yy_mask, source)[source]

Process recognition function.

espnet.nets.chainer_backend.transformer.decoder_layer

Class Declaration of Transformer’s Decoder Block.

class espnet.nets.chainer_backend.transformer.decoder_layer.DecoderLayer(n_units, d_units=0, h=8, dropout=0.1, initialW=None, initial_bias=None)[source]

Bases: chainer.link.Chain

Single decoder layer module.

Parameters
  • n_units (int) – Number of input/output dimension of a FeedForward layer.

  • d_units (int) – Number of units of hidden layer in a FeedForward layer.

  • h (int) – Number of attention heads.

  • dropout (float) – Dropout rate

Initialize DecoderLayer.

forward(e, s, xy_mask, yy_mask, batch)[source]

Compute decoder layer.

Parameters
  • e (chainer.Variable) – Batch of padded features. (B, Lmax)

  • s (chainer.Variable) – Batch of padded characters. (B, Tmax)

Returns

Computed variable of decoder.

Return type

chainer.Variable

espnet.nets.chainer_backend.transformer.plot

Class Declaration of Transformer’s Attention Plot.

class espnet.nets.chainer_backend.transformer.plot.PlotAttentionReport(att_vis_fn, data, outdir, converter, transform, device, reverse=False, ikey='input', iaxis=0, okey='output', oaxis=0)[source]

Bases: espnet.asr.asr_utils.PlotAttentionReport

Plot an attention reporter.

Parameters
  • att_vis_fn (espnet.nets.*_backend.e2e_asr.E2E.calculate_all_attentions) – Function of attention visualization.

  • data (list[tuple(str, dict[str, list[Any]])]) – List json utt key items.

  • outdir (str) – Directory to save figures.

  • converter (espnet.asr.*_backend.asr.CustomConverter) – Function to convert data.

  • device (int | torch.device) – Device.

  • reverse (bool) – If True, input and output length are reversed.

  • ikey (str) – Key to access input (for ASR ikey=”input”, for MT ikey=”output”.)

  • iaxis (int) – Dimension to access input (for ASR iaxis=0, for MT iaxis=1.)

  • okey (str) – Key to access output (for ASR okey=”input”, for MT okey=”output”.)

get_attention_weights()[source]

Return attention weights.

Returns

attention weights (float ndarray). Its shape differs by backend:
  • pytorch: (B, H, Lmax, Tmax) in the multi-head case, otherwise (B, Lmax, Tmax)

  • chainer: (B, Lmax, Tmax)

Return type

numpy.ndarray

log_attentions(logger, step)[source]

Add image files of att_ws matrix to the tensorboard.

espnet.nets.chainer_backend.transformer.plot.plot_multi_head_attention(data, attn_dict, outdir, suffix='png', savefn=<function savefig>)[source]

Plot multi head attentions.

Parameters
  • data (dict) – utts info from json file

  • attn_dict (dict[str, torch.Tensor]) – multi head attention dict. Values should be torch.Tensor (head, input_length, output_length)

  • outdir (str) – dir to save fig

  • suffix (str) – filename suffix including image type (e.g., png)

  • savefn – function to save

espnet.nets.chainer_backend.transformer.plot.savefig(plot, filename)[source]

Save a figure.

espnet.nets.chainer_backend.transformer.ctc

Class Declaration of Transformer’s CTC.

class espnet.nets.chainer_backend.transformer.ctc.CTC(odim, eprojs, dropout_rate)[source]

Bases: chainer.link.Chain

Chainer implementation of the CTC layer.

Parameters
  • odim (int) – The output dimension.

  • eprojs (int | None) – Dimension of input vectors from encoder.

  • dropout_rate (float) – Dropout rate.

Initialize CTC.

log_softmax(hs)[source]

Log_softmax of frame activations.

Parameters

hs (list of chainer.Variable | N-dimension array) – Input variable from encoder.

Returns

An n-dimension float array.

Return type

chainer.Variable

class espnet.nets.chainer_backend.transformer.ctc.WarpCTC(odim, eprojs, dropout_rate)[source]

Bases: chainer.link.Chain

Chainer implementation of warp-ctc layer.

Parameters
  • odim (int) – The output dimension.

  • eprojs (int | None) – Dimension of input vectors from encoder.

  • dropout_rate (float) – Dropout rate.

Initialize WarpCTC.

argmax(hs_pad)[source]

Argmax of frame activations.

Parameters

hs_pad (chainer.Variable) – 3d tensor (B, Tmax, eprojs)

Returns

argmax applied 2d tensor (B, Tmax)

Return type

chainer.Variable.

forward(hs, ys)[source]

Core function of the Warp-CTC layer.

Parameters
  • hs (iterable of chainer.Variable | N-dimension array) – Input variable from encoder.

  • ys (iterable of N-dimension array) – Input variable of decoder.

Returns

A variable holding a scalar value of the CTC loss.

Return type

chainer.Variable

log_softmax(hs)[source]

Log_softmax of frame activations.

Parameters

hs (list of chainer.Variable | N-dimension array) – Input variable from encoder.

Returns

An n-dimension float array.

Return type

chainer.Variable

espnet.nets.chainer_backend.transformer.subsampling

Class Declaration of Transformer’s Input layers.

class espnet.nets.chainer_backend.transformer.subsampling.Conv2dSubsampling(channels, idim, dims, dropout=0.1, initialW=None, initial_bias=None)[source]

Bases: chainer.link.Chain

Convolutional 2D subsampling (to 1/4 length).

Parameters
  • idim (int) – input dim

  • odim (int) – output dim

  • dropout (float) – dropout rate

Initialize Conv2dSubsampling.

forward(xs, ilens)[source]

Subsample x.

Parameters

x (chainer.Variable) – input tensor

Returns

subsampled x and mask

class espnet.nets.chainer_backend.transformer.subsampling.LinearSampling(idim, dims, dropout=0.1, initialW=None, initial_bias=None)[source]

Bases: chainer.link.Chain

Linear 1D subsampling.

Parameters
  • idim (int) – input dim

  • odim (int) – output dim

  • dropout (float) – dropout rate

Initialize LinearSampling.

forward(xs, ilens)[source]

Subsample x.

Parameters

x (chainer.Variable) – input tensor

Returns

subsampled x and mask

espnet.nets.chainer_backend.transformer.positionwise_feed_forward

Class Declaration of Transformer’s Positionwise Feedforward.

class espnet.nets.chainer_backend.transformer.positionwise_feed_forward.PositionwiseFeedForward(n_units, d_units=0, dropout=0.1, initialW=None, initial_bias=None)[source]

Bases: chainer.link.Chain

Positionwise feed forward.

Parameters
  • idim (int) – input dimension

  • hidden_units (int) – number of hidden units

  • dropout_rate (float) – dropout rate

Initialize PositionwiseFeedForward.

Parameters
  • n_units (int) – Input dimension.

  • d_units (int, optional) – Output dimension of hidden layer.

  • dropout (float, optional) – Dropout ratio.

  • initialW (int, optional) – Initializer to initialize the weight.

  • initial_bias (bool, optional) – Initializer to initialize the bias.

espnet.nets.chainer_backend.transformer.__init__

Initialize sub package.

espnet.nets.chainer_backend.transformer.encoder_layer

Class Declaration of Transformer’s Encoder Block.

class espnet.nets.chainer_backend.transformer.encoder_layer.EncoderLayer(n_units, d_units=0, h=8, dropout=0.1, initialW=None, initial_bias=None)[source]

Bases: chainer.link.Chain

Single encoder layer module.

Parameters
  • n_units (int) – Number of input/output dimension of a FeedForward layer.

  • d_units (int) – Number of units of hidden layer in a FeedForward layer.

  • h (int) – Number of attention heads.

  • dropout (float) – Dropout rate

Initialize EncoderLayer.

forward(e, xx_mask, batch)[source]

Forward Positional Encoding.

espnet.nets.chainer_backend.transformer.mask

Create mask for subsequent steps.

espnet.nets.chainer_backend.transformer.mask.make_history_mask(xp, block)[source]

Prepare the history mask.

Parameters

block (ndarray) – Block with dimensions: (B x S).

Returns

History mask with dimensions (B, S, S).

Return type

np.ndarray
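
The returned (B, S, S) mask is the usual lower-triangular causal mask broadcast over the batch: step t may attend only to steps <= t. A NumPy sketch of the shape semantics (an illustration, not the module's exact code):

    import numpy as np

    def history_mask_sketch(block: np.ndarray) -> np.ndarray:
        """Build a (B, S, S) mask where step t attends to steps <= t."""
        batch, length = block.shape
        tri = np.tril(np.ones((length, length), dtype=bool))  # (S, S)
        return np.broadcast_to(tri, (batch, length, length))

    mask = history_mask_sketch(np.zeros((2, 4), dtype=np.int64))
    assert mask.shape == (2, 4, 4) and not mask[0, 0, 1]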

espnet.nets.chainer_backend.transformer.encoder

Class Declaration of Transformer’s Encoder.

class espnet.nets.chainer_backend.transformer.encoder.Encoder(idim, attention_dim=256, attention_heads=4, linear_units=2048, num_blocks=6, dropout_rate=0.1, positional_dropout_rate=0.1, attention_dropout_rate=0.0, input_layer='conv2d', pos_enc_class=<class 'espnet.nets.chainer_backend.transformer.embedding.PositionalEncoding'>, initialW=None, initial_bias=None)[source]

Bases: chainer.link.Chain

Encoder.

Parameters
  • input_type (str) – Sampling type. input_type must be ‘conv2d’ or ‘linear’ currently.

  • idim (int) – Dimension of inputs.

  • n_layers (int) – Number of encoder layers.

  • n_units (int) – Number of input/output dimension of a FeedForward layer.

  • d_units (int) – Number of units of hidden layer in a FeedForward layer.

  • h (int) – Number of attention heads.

  • dropout (float) – Dropout rate

Initialize Encoder.

Parameters
  • idim (int) – Input dimension.

  • args (Namespace) – Training config.

  • initialW (int, optional) – Initializer to initialize the weight.

  • initial_bias (bool, optional) – Initializer to initialize the bias.

forward(e, ilens)[source]

Compute Encoder layer.

Parameters
  • e (chainer.Variable) – Batch of padded characters. (B, Tmax)

  • ilens (chainer.Variable) – Batch of length of each input batch. (B,)

Returns

Computed variable of the encoder. numpy.array: Mask. chainer.Variable: Batch of lengths of each encoder output.

Return type

chainer.Variable
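
A minimal usage sketch with the defaults from the signature above; the feature shapes are illustrative assumptions, and the three unpacked values follow the forward() returns documented above:

    import numpy as np

    from espnet.nets.chainer_backend.transformer.encoder import Encoder

    enc = Encoder(idim=83)  # conv2d input layer, 6 blocks, 256-dim attention
    xs = np.random.randn(2, 100, 83).astype(np.float32)  # (B, Tmax, idim)
    ilens = np.array([100, 80])
    # Calling the chain dispatches to forward() in recent Chainer versions.
    hs, mask, hlens = enc(xs, ilens)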

espnet.nets.chainer_backend.rnn.training

class espnet.nets.chainer_backend.rnn.training.CustomConverter(subsampling_factor=1)[source]

Bases: object

Custom Converter.

Parameters

subsampling_factor (int) – The subsampling factor.

class espnet.nets.chainer_backend.rnn.training.CustomParallelUpdater(train_iters, optimizer, converter, devices, accum_grad=1)[source]

Bases: chainer.training.updaters.multiprocess_parallel_updater.MultiprocessParallelUpdater

Custom Parallel Updater for chainer.

Defines the main update routine.

Parameters
  • train_iter (iterator | dict[str, iterator]) – Dataset iterator for the training dataset. It can also be a dictionary that maps strings to iterators. If this is just an iterator, then the iterator is registered by the name 'main'.

  • optimizer (optimizer | dict[str, optimizer]) – Optimizer to update parameters. It can also be a dictionary that maps strings to optimizers. If this is just an optimizer, then the optimizer is registered by the name 'main'.

  • converter (espnet.asr.chainer_backend.asr.CustomConverter) – Converter function to build input arrays. Each batch extracted by the main iterator and the device option are passed to this function. chainer.dataset.concat_examples() is used by default.

  • devices (dict | list) – Devices to which the training data are sent. A negative value indicates the host memory (CPU).

  • accum_grad (int) – Number of gradient accumulation steps. If set to 2, the network parameters are updated once every two iterations, i.e. the effective batch size is doubled.

update()[source]

Updates the parameters of the target model.

This method implements an update formula for the training task, including data loading, forward/backward computations, and actual updates of parameters.

This method is called once at each iteration of the training loop.

update_core()[source]

Main Update routine of the custom parallel updater.

class espnet.nets.chainer_backend.rnn.training.CustomUpdater(train_iter, optimizer, converter, device, accum_grad=1)[source]

Bases: chainer.training.updaters.standard_updater.StandardUpdater

Custom updater for chainer.

Parameters
  • train_iter (iterator | dict[str, iterator]) – Dataset iterator for the training dataset. It can also be a dictionary that maps strings to iterators. If this is just an iterator, then the iterator is registered by the name 'main'.

  • optimizer (optimizer | dict[str, optimizer]) – Optimizer to update parameters. It can also be a dictionary that maps strings to optimizers. If this is just an optimizer, then the optimizer is registered by the name 'main'.

  • converter (espnet.asr.chainer_backend.asr.CustomConverter) – Converter function to build input arrays. Each batch extracted by the main iterator and the device option are passed to this function. chainer.dataset.concat_examples() is used by default.

  • device (int or dict) – The destination device info to send variables. In the case of cpu or single gpu, device=-1 or 0, respectively. In the case of multi-gpu, device={“main”:0, “sub_1”: 1, …}.

  • accum_grad (int) – Number of gradient accumulation steps. If set to 2, the network parameters are updated once every two iterations, i.e. the effective batch size is doubled.

update()[source]

Updates the parameters of the target model.

This method implements an update formula for the training task, including data loading, forward/backward computations, and actual updates of parameters.

This method is called once at each iteration of the training loop.

update_core()[source]

Main update routine for Custom Updater.

espnet.nets.chainer_backend.rnn.training.sum_sqnorm(arr)[source]

Calculate the squared norm of the array.

Parameters

arr (numpy.ndarray) – Input array.

Returns

Sum of the squared elements of the given array.

Return type

float
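
Functionally this is the sum of squared elements, the quantity needed when computing a global gradient norm for clipping. A NumPy sketch of the computation (not the backend-aware implementation):

    import numpy as np

    def sum_sqnorm_sketch(arr: np.ndarray) -> float:
        """Sum of squared elements of arr; sqrt of this is the global norm."""
        return float((arr ** 2).sum())

    assert sum_sqnorm_sketch(np.array([3.0, 4.0])) == 25.0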

espnet.nets.chainer_backend.rnn.encoders

class espnet.nets.chainer_backend.rnn.encoders.Encoder(etype, idim, elayers, eunits, eprojs, subsample, dropout, in_channel=1)[source]

Bases: chainer.link.Chain

Encoder network class.

Parameters
  • etype (str) – Type of encoder network.

  • idim (int) – Number of dimensions of encoder network.

  • elayers (int) – Number of layers of encoder network.

  • eunits (int) – Number of lstm units of encoder network.

  • eprojs (int) – Number of projection units of encoder network.

  • subsample (np.array) – Subsampling number. e.g. 1_2_2_2_1

  • dropout (float) – Dropout rate.

class espnet.nets.chainer_backend.rnn.encoders.RNN(idim, elayers, cdim, hdim, dropout, typ='lstm')[source]

Bases: chainer.link.Chain

RNN Module.

Parameters
  • idim (int) – Dimension of the input.

  • elayers (int) – Number of encoder layers.

  • cdim (int) – Number of rnn units.

  • hdim (int) – Number of projection units.

  • dropout (float) – Dropout rate.

  • typ (str) – Rnn type.

class espnet.nets.chainer_backend.rnn.encoders.RNNP(idim, elayers, cdim, hdim, subsample, dropout, typ='blstm')[source]

Bases: chainer.link.Chain

RNN with projection layer module.

Parameters
  • idim (int) – Dimension of inputs.

  • elayers (int) – Number of encoder layers.

  • cdim (int) – Number of rnn units. (resulting in cdim * 2 if bidirectional)

  • hdim (int) – Number of projection units.

  • subsample (np.ndarray) – List used to subsample the input array.

  • dropout (float) – Dropout rate.

  • typ (str) – The RNN type.

class espnet.nets.chainer_backend.rnn.encoders.VGG2L(in_channel=1)[source]

Bases: chainer.link.Chain

VGG-motivated CNN layers.

Parameters

in_channel (int) – Number of channels.

espnet.nets.chainer_backend.rnn.encoders.encoder_for(args, idim, subsample)[source]

Return the Encoder module.

Parameters
  • idim (int) – Dimension of input array.

  • subsample (numpy.array) – Subsampling numbers, e.g. 1_2_2_2_1.

Returns

Encoder module.

Return type

chainer.Chain
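
A minimal call sketch; the attribute names expected on args are assumptions drawn from typical espnet training configurations, not from this page:

    import numpy as np
    from argparse import Namespace

    from espnet.nets.chainer_backend.rnn.encoders import encoder_for

    # Hypothetical argument namespace; real runs pass the parsed training args.
    args = Namespace(etype="vggblstmp", elayers=4, eunits=320,
                     eprojs=320, dropout_rate=0.0)
    enc = encoder_for(args, idim=83, subsample=np.array([1, 2, 2, 2, 1]))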

espnet.nets.chainer_backend.rnn.decoders

class espnet.nets.chainer_backend.rnn.decoders.Decoder(eprojs, odim, dtype, dlayers, dunits, sos, eos, att, verbose=0, char_list=None, labeldist=None, lsm_weight=0.0, sampling_probability=0.0)[source]

Bases: chainer.link.Chain

Decoder layer.

Parameters
  • eprojs (int) – Dimension of input variables from encoder.

  • odim (int) – The output dimension.

  • dtype (str) – Decoder type.

  • dlayers (int) – Number of layers for decoder.

  • dunits (int) – Dimension of input vector of decoder.

  • sos (int) – Number to indicate the start of sequences.

  • eos (int) – Number to indicate the end of sequences.

  • att (Module) – Attention module defined at espnet.nets.chainer_backend.rnn.attentions.

  • verbose (int) – Verbosity level.

  • char_list (List[str]) – List of all characters.

  • labeldist (numpy.array) – Label distribution array estimated from transcript label counts.

  • lsm_weight (float) – Weight to use when calculating the training loss.

  • sampling_probability (float) – Threshold for scheduled sampling.

calculate_all_attentions(hs, ys)[source]

Calculate all of the attentions.

Parameters
  • hs (list of chainer.Variable | N-dimensional array) – Input variable from encoder.

  • ys (list of chainer.Variable | N-dimensional array) – Input variable of decoder.

Returns

List of attention weights.

Return type

chainer.Variable

recognize_beam(h, lpz, recog_args, char_list, rnnlm=None)[source]

Beam search implementation.

Parameters
  • h (chainer.Variable) – One of the outputs from the encoder.

  • lpz (chainer.Variable | None) – Result of net propagation.

  • recog_args (Namespace) – The program arguments.

  • char_list (List[str]) – List of all characters.

  • rnnlm (Module) – RNNLM module. Defined at espnet.lm.chainer_backend.lm

Returns

Result of recognition.

Return type

List[Dict[str,Any]]

rnn_forward(ey, z_list, c_list, z_prev, c_prev)[source]

espnet.nets.chainer_backend.rnn.decoders.decoder_for(args, odim, sos, eos, att, labeldist)[source]

Return the decoding layer corresponding to the args.

Parameters
  • args (Namespace) – The program arguments.

  • odim (int) – The output dimension.

  • sos (int) – Number to indicate the start of sequences.

  • eos (int) – Number to indicate the end of sequences.

  • att (Module) – Attention module defined at espnet.nets.chainer_backend.rnn.attentions.

  • labeldist (numpy.array) – Label distribution array estimated from transcript label counts.

Returns

The decoder module.

Return type

chainer.Chain

espnet.nets.chainer_backend.rnn.attentions

class espnet.nets.chainer_backend.rnn.attentions.AttDot(eprojs, dunits, att_dim)[source]

Bases: chainer.link.Chain

Compute attention based on dot product.

Parameters
  • eprojs (int | None) – Dimension of input vectors from encoder.

  • dunits (int | None) – Dimension of input vectors for decoder.

  • att_dim (int) – Dimension of input vectors for attention.

reset()[source]

Reset states.

class espnet.nets.chainer_backend.rnn.attentions.AttLoc(eprojs, dunits, att_dim, aconv_chans, aconv_filts)[source]

Bases: chainer.link.Chain

Compute location-based attention.

Parameters
  • eprojs (int | None) – Dimension of input vectors from encoder.

  • dunits (int | None) – Dimension of input vectors for decoder.

  • att_dim (int) – Dimension of input vectors for attention.

  • aconv_chans (int) – Number of channels of output arrays from convolutional layer.

  • aconv_filts (int) – Size of filters of convolutional layer.

reset()[source]

Reset states.
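
A minimal construction sketch for the location-based attention above; the dimensions follow common espnet RNN configurations and are illustrative assumptions:

    from espnet.nets.chainer_backend.rnn.attentions import AttLoc

    att = AttLoc(eprojs=320, dunits=300, att_dim=320,
                 aconv_chans=10, aconv_filts=100)
    att.reset()  # clear cached encoder states before scoring a new utterance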

class espnet.nets.chainer_backend.rnn.attentions.NoAtt[source]

Bases: chainer.link.Chain

Compute non-attention layer.

This layer is a dummy attention layer to be compatible with other attention-based models.

reset()[source]

Reset states.

espnet.nets.chainer_backend.rnn.attentions.att_for(args)[source]

Returns an attention layer given the program arguments.

Parameters

args (Namespace) – The arguments.

Returns

The corresponding attention module.

Return type

chainer.Chain

espnet.nets.chainer_backend.rnn.__init__

Initialize sub package.

espnet.nets.scorers.ngram

N-gram LM implementation.

class espnet.nets.scorers.ngram.NgramFullScorer(ngram_model, token_list)[source]

Bases: espnet.nets.scorers.ngram.Ngrambase, espnet.nets.scorer_interface.BatchScorerInterface

Full scorer for n-gram LM.

Initialize Ngrambase.

Parameters
  • ngram_model – ngram model path

  • token_list – token list from dict or model.json

score(y, state, x)[source]

Score interface for both full and partial scorer.

Parameters
  • y – previous char

  • state – previous state

  • x – encoded feature

Returns

Tuple of

batchified scores for next token with shape of (n_batch, n_vocab) and next state list for ys.

Return type

tuple[torch.Tensor, List[Any]]

class espnet.nets.scorers.ngram.NgramPartScorer(ngram_model, token_list)[source]

Bases: espnet.nets.scorers.ngram.Ngrambase, espnet.nets.scorer_interface.PartialScorerInterface

Partial scorer for n-gram LM.

Initialize Ngrambase.

Parameters
  • ngram_model – ngram model path

  • token_list – token list from dict or model.json

score_partial(y, next_token, state, x)[source]

Score interface for both full and partial scorer.

Parameters
  • y – previous char

  • next_token – next token to be scored

  • state – previous state

  • x – encoded feature

Returns

Tuple of

batchified scores for next token with shape of (n_batch, n_vocab) and next state list for ys.

Return type

tuple[torch.Tensor, List[Any]]

select_state(state, i)[source]

Empty select state for scorer interface.

class espnet.nets.scorers.ngram.Ngrambase(ngram_model, token_list)[source]

Bases: abc.ABC

N-gram base implemented through ScorerInterface.

Initialize Ngrambase.

Parameters
  • ngram_model – ngram model path

  • token_list – token list from dict or model.json

init_state(x)[source]

Initialize temporary state.

score_partial_(y, next_token, state, x)[source]

Score interface for both full and partial scorer.

Parameters
  • y – previous char

  • next_token – next token to be scored

  • state – previous state

  • x – encoded feature

Returns

Tuple of

batchified scores for next token with shape of (n_batch, n_vocab) and next state list for ys.

Return type

tuple[torch.Tensor, List[Any]]
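
A minimal construction sketch for the scorers above; the model path and token list are placeholders, and the expected n-gram file format depends on the backing LM library (an assumption here):

    from espnet.nets.scorers.ngram import NgramFullScorer

    token_list = ["<blank>", "<unk>", "a", "b", "<eos>"]  # placeholder vocab
    scorer = NgramFullScorer("lm.arpa", token_list)       # placeholder path
    # During beam search, scorer.score(y, state, x) returns scores over the
    # vocabulary together with the next LM state, as documented above.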

espnet.nets.scorers.ctc

ScorerInterface implementation for CTC.

class espnet.nets.scorers.ctc.CTCPrefixScorer(ctc: torch.nn.modules.module.Module, eos: int)[source]

Bases: espnet.nets.scorer_interface.BatchPartialScorerInterface

Decoder interface wrapper for CTCPrefixScore.

Initialize class.

Parameters
  • ctc (torch.nn.Module) – The CTC module to wrap.

  • eos (int) – The end-of-sequence token id.

batch_init_state(x: torch.Tensor)[source]

Get an initial state for decoding.

Parameters

x (torch.Tensor) – The encoded feature tensor

Returns: initial state

batch_score_partial(y, ids, state, x)[source]

Score new token.

Parameters
  • y (torch.Tensor) – 1D prefix token

  • ids (torch.Tensor) – torch.int64 next token to score

  • state – decoder state for prefix tokens

  • x (torch.Tensor) – 2D encoder feature that generates ys

Returns

Tuple of a score tensor for y that has a shape (len(next_tokens),) and next state for ys

Return type

tuple[torch.Tensor, Any]

init_state(x: torch.Tensor)[source]

Get an initial state for decoding.

Parameters

x (torch.Tensor) – The encoded feature tensor

Returns: initial state

score_partial(y, ids, state, x)[source]

Score new token.

Parameters
  • y (torch.Tensor) – 1D prefix token

  • ids (torch.Tensor) – torch.int64 next token to score

  • state – decoder state for prefix tokens

  • x (torch.Tensor) – 2D encoder feature that generates ys

Returns

Tuple of a score tensor for y that has a shape (len(next_tokens),) and next state for ys

Return type

tuple[torch.Tensor, Any]

select_state(state, i, new_id=None)[source]

Select state with relative ids in the main beam search.

Parameters
  • state – Decoder state for prefix tokens

  • i (int) – Index to select a state in the main beam search

  • new_id (int) – New label id to select a state if necessary

Returns

pruned state

Return type

state
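
How these pieces typically fit together during decoding; every name below (model, speech_feature, eos_id) is a placeholder for illustration, not an API defined on this page:

    import torch

    from espnet.nets.scorers.ctc import CTCPrefixScorer

    scorer = CTCPrefixScorer(ctc=model.ctc, eos=eos_id)  # wrap a trained CTC head
    x = model.encode(speech_feature)                     # 2D encoder feature
    state = scorer.init_state(x)
    y = torch.tensor([eos_id])                           # 1D prefix tokens
    ids = torch.arange(10)                               # candidate next tokens
    scores, state = scorer.score_partial(y, ids, state, x)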

espnet.nets.scorers.length_bonus

Length bonus module.

class espnet.nets.scorers.length_bonus.LengthBonus(n_vocab: int)[source]

Bases: espnet.nets.scorer_interface.BatchScorerInterface

Length bonus in beam search.

Initialize class.

Parameters

n_vocab (int) – The number of tokens in vocabulary for beam search

batch_score(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]

Score new token batch.

Parameters
  • ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).

  • states (List[Any]) – Scorer states for prefix tokens.

  • xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).

Returns

Tuple of

batchified scores for next token with shape of (n_batch, n_vocab) and next state list for ys.

Return type

tuple[torch.Tensor, List[Any]]

score(y, state, x)[source]

Score new token.

Parameters
  • y (torch.Tensor) – 1D torch.int64 prefix tokens.

  • state – Scorer state for prefix tokens

  • x (torch.Tensor) – 2D encoder feature that generates ys.

Returns

Tuple of

torch.float32 scores for next token (n_vocab) and None

Return type

tuple[torch.Tensor, Any]
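
A small usage sketch; the vocabulary size and tensors are illustrative:

    import torch

    from espnet.nets.scorers.length_bonus import LengthBonus

    bonus = LengthBonus(n_vocab=500)
    y = torch.tensor([7, 42])   # 1D int64 prefix tokens
    x = torch.zeros(10, 80)     # 2D encoder feature that generates ys
    scores, state = bonus.score(y, None, x)
    assert scores.shape == (500,)  # one score per vocabulary entry, per the docs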

espnet.nets.scorers.__init__

Initialize sub package.

espnet.nets.pytorch_backend.e2e_st_transformer

Transformer speech translation model (pytorch).

class espnet.nets.pytorch_backend.e2e_st_transformer.E2E(idim, odim, args, ignore_id=-1)[source]

Bases: espnet.nets.st_interface.STInterface, torch.nn.modules.module.Module

E2E module.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

Construct an E2E object.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

static add_arguments(parser)[source]

Add arguments.

property attention_plot_class

Return PlotAttentionReport.

calculate_all_attentions(xs_pad, ilens, ys_pad, ys_pad_src)[source]

E2E attention calculation.

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)

  • ys_pad_src (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)

Returns

attention weights (B, H, Lmax, Tmax)

Return type

float ndarray

calculate_all_ctc_probs(xs_pad, ilens, ys_pad, ys_pad_src)[source]

E2E CTC probability calculation.

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)

  • ys_pad_src (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)

Returns

CTC probability (B, Tmax, vocab)

Return type

float ndarray

encode(x)[source]

Encode source acoustic features.

Parameters

x (ndarray) – source acoustic feature (T, D)

Returns

encoder outputs

Return type

torch.Tensor

forward(xs_pad, ilens, ys_pad, ys_pad_src)[source]

E2E forward.

Parameters
  • xs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax, idim)

  • ilens (torch.Tensor) – batch of lengths of source sequences (B)

  • ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)

  • ys_pad_src (torch.Tensor) – batch of padded source token id sequences (B, Lmax)

Returns

ctc loss value

Return type

torch.Tensor

Returns

attention loss value

Return type

torch.Tensor

Returns

accuracy in attention decoder

Return type

float

forward_asr(hs_pad, hs_mask, ys_pad)[source]

Forward pass in the auxiliary ASR task.

Parameters
  • hs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax, idim)

  • hs_mask (torch.Tensor) – batch of input token mask (B, Lmax)

  • ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)

Returns

ASR attention loss value

Return type

torch.Tensor

Returns

accuracy in ASR attention decoder

Return type

float

Returns

ASR CTC loss value

Return type

torch.Tensor

Returns

character error rate from CTC prediction

Return type

float

Returns

character error rate from attention decoder prediction

Return type

float

Returns

word error rate from attention decoder prediction

Return type

float

forward_mt(xs_pad, ys_in_pad, ys_out_pad, ys_mask)[source]

Forward pass in the auxiliary MT task.

Parameters
  • xs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax, idim)

  • ys_in_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)

  • ys_out_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)

  • ys_mask (torch.Tensor) – batch of input token mask (B, Lmax)

Returns

MT loss value

Return type

torch.Tensor

Returns

accuracy in MT decoder

Return type

float

reset_parameters(args)[source]

Initialize parameters.

scorers()[source]

Scorers.

translate(x, trans_args, char_list=None)[source]

Translate input speech.

Parameters
  • x (ndarray) – input acoustic feature (B, T, D) or (T, D)

  • trans_args (Namespace) – argument Namespace containing options

  • char_list (list) – list of characters

Returns

N-best decoding results

Return type

list

espnet.nets.pytorch_backend.e2e_asr_transducer

Transducer speech recognition model (pytorch).

class espnet.nets.pytorch_backend.e2e_asr_transducer.E2E(idim, odim, args, ignore_id=-1, blank_id=0)[source]

Bases: espnet.nets.asr_interface.ASRInterface, torch.nn.modules.module.Module

E2E module for transducer models.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

  • ignore_id (int) – padding symbol id

  • blank_id (int) – blank symbol id

Construct an E2E object for transducer model.

static add_arguments(parser)[source]

Extend arguments for transducer models.

Both Transformer and RNN modules are supported. General options encapsulate both modules' options.

property attention_plot_class

Get attention plot class.

calculate_all_attentions(xs_pad, ilens, ys_pad)[source]

E2E attention calculation.

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)

Returns

attention weights with the following shape,
  1. multi-head case => attention weights (B, H, Lmax, Tmax),

  2. other case => attention weights (B, Lmax, Tmax).

Return type

ret (ndarray)

default_parameters(args)[source]

Initialize/reset parameters for transducer.

Parameters

args (Namespace) – argument Namespace containing options

encode_rnn(x)[source]

Encode acoustic features.

Parameters

x (ndarray) – input acoustic feature (T, D)

Returns

encoded features (T, attention_dim)

Return type

x (torch.Tensor)

encode_transformer(x)[source]

Encode acoustic features.

Parameters

x (ndarray) – input acoustic feature (T, D)

Returns

encoded features (T, attention_dim)

Return type

x (torch.Tensor)

forward(xs_pad, ilens, ys_pad)[source]

E2E forward.

Parameters
  • xs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax, idim)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)

Returns

transducer loss value

Return type

loss (torch.Tensor)

recognize(x, beam_search)[source]

Recognize input features.

Parameters
  • x (ndarray) – input acoustic feature (T, D)

  • beam_search (class) – beam search class

Returns

n-best decoding results

Return type

nbest_hyps (list)

class espnet.nets.pytorch_backend.e2e_asr_transducer.Reporter(**links)[source]

Bases: chainer.link.Chain

A chainer reporter wrapper for transducer models.

report(loss, cer, wer)[source]

Instantiate reporter attributes.

espnet.nets.pytorch_backend.e2e_tts_fastspeech

FastSpeech related modules.

class espnet.nets.pytorch_backend.e2e_tts_fastspeech.FeedForwardTransformer(idim, odim, args=None)[source]

Bases: espnet.nets.tts_interface.TTSInterface, torch.nn.modules.module.Module

Feed Forward Transformer for TTS a.k.a. FastSpeech.

This is a module of FastSpeech, feed-forward Transformer with duration predictor described in FastSpeech: Fast, Robust and Controllable Text to Speech, which does not require any auto-regressive processing during inference, resulting in fast decoding compared with auto-regressive Transformer.

Initialize feed-forward Transformer module.

Parameters
  • idim (int) – Dimension of the inputs.

  • odim (int) – Dimension of the outputs.

  • args (Namespace, optional) –

    • elayers (int): Number of encoder layers.

    • eunits (int): Number of encoder hidden units.

    • adim (int): Number of attention transformation dimensions.

    • aheads (int): Number of heads for multi head attention.

    • dlayers (int): Number of decoder layers.

    • dunits (int): Number of decoder hidden units.

    • use_scaled_pos_enc (bool):

      Whether to use trainable scaled positional encoding.

    • encoder_normalize_before (bool):

      Whether to perform layer normalization before encoder block.

    • decoder_normalize_before (bool):

      Whether to perform layer normalization before decoder block.

    • encoder_concat_after (bool): Whether to concatenate attention

      layer’s input and output in encoder.

    • decoder_concat_after (bool): Whether to concatenate attention

      layer’s input and output in decoder.

    • duration_predictor_layers (int): Number of duration predictor layers.

    • duration_predictor_chans (int): Number of duration predictor channels.

    • duration_predictor_kernel_size (int):

      Kernel size of duration predictor.

    • spk_embed_dim (int): Number of speaker embedding dimensions.

    • spk_embed_integration_type: How to integrate speaker embedding.

    • teacher_model (str): Teacher auto-regressive transformer model path.

    • reduction_factor (int): Reduction factor.

    • transformer_init (float): How to initialize transformer parameters.

    • transformer_lr (float): Initial value of learning rate.

    • transformer_warmup_steps (int): Optimizer warmup steps.

    • transformer_enc_dropout_rate (float):

      Dropout rate in encoder except attention & positional encoding.

    • transformer_enc_positional_dropout_rate (float):

      Dropout rate after encoder positional encoding.

    • transformer_enc_attn_dropout_rate (float):

      Dropout rate in encoder self-attention module.

    • transformer_dec_dropout_rate (float):

      Dropout rate in decoder except attention & positional encoding.

    • transformer_dec_positional_dropout_rate (float):

      Dropout rate after decoder positional encoding.

    • transformer_dec_attn_dropout_rate (float):

Dropout rate in decoder self-attention module.

    • transformer_enc_dec_attn_dropout_rate (float):

Dropout rate in encoder-decoder attention module.

    • use_masking (bool):

      Whether to apply masking for padded part in loss calculation.

    • use_weighted_masking (bool):

      Whether to apply weighted masking in loss calculation.

    • transfer_encoder_from_teacher:

      Whether to transfer encoder using teacher encoder parameters.

    • transferred_encoder_module:

      Encoder module to be initialized using teacher parameters.

static add_arguments(parser)[source]

Add model-specific arguments to the parser.

property attention_plot_class

Return plot class for attention weight plot.

property base_plot_keys

Return base key names to plot during training.

Keys should match what chainer.reporter reports. If you add the key loss, the reporter will report main/loss and validation/main/loss values. Also, loss.png will be created as a figure visualizing main/loss and validation/main/loss values.

Returns

List of strings which are base keys to plot during training.

Return type

list

calculate_all_attentions(xs, ilens, ys, olens, spembs=None, extras=None, *args, **kwargs)[source]

Calculate all of the attention weights.

Parameters
  • xs (Tensor) – Batch of padded character ids (B, Tmax).

  • ilens (LongTensor) – Batch of lengths of each input batch (B,).

  • ys (Tensor) – Batch of padded target features (B, Lmax, odim).

  • olens (LongTensor) – Batch of the lengths of each target (B,).

  • spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).

  • extras (Tensor, optional) – Batch of precalculated durations (B, Tmax, 1).

Returns

Dict of attention weights and outputs.

Return type

dict

forward(xs, ilens, ys, olens, spembs=None, extras=None, *args, **kwargs)[source]

Calculate forward propagation.

Parameters
  • xs (Tensor) – Batch of padded character ids (B, Tmax).

  • ilens (LongTensor) – Batch of lengths of each input batch (B,).

  • ys (Tensor) – Batch of padded target features (B, Lmax, odim).

  • olens (LongTensor) – Batch of the lengths of each target (B,).

  • spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).

  • extras (Tensor, optional) – Batch of precalculated durations (B, Tmax, 1).

Returns

Loss value.

Return type

Tensor

inference(x, inference_args, spemb=None, *args, **kwargs)[source]

Generate the sequence of features given the sequences of characters.

Parameters
  • x (Tensor) – Input sequence of characters (T,).

  • inference_args (Namespace) – Dummy for compatibility.

  • spemb (Tensor, optional) – Speaker embedding vector (spk_embed_dim).

Returns

Output sequence of features (L, odim). None: Dummy for compatibility. None: Dummy for compatibility.

Return type

Tensor
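
A decoding sketch; model stands in for a trained FeedForwardTransformer, and passing None for inference_args leans on the "dummy for compatibility" note above (an assumption):

    import torch

    x = torch.LongTensor([12, 5, 33, 7])   # input character id sequence (T,)
    outs, _, _ = model.inference(x, None)  # (L, odim) features plus two dummies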

class espnet.nets.pytorch_backend.e2e_tts_fastspeech.FeedForwardTransformerLoss(use_masking=True, use_weighted_masking=False)[source]

Bases: torch.nn.modules.module.Module

Loss function module for feed-forward Transformer.

Initialize feed-forward Transformer loss module.

Parameters
  • use_masking (bool) – Whether to apply masking for padded part in loss calculation.

  • use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.

forward(after_outs, before_outs, d_outs, ys, ds, ilens, olens)[source]

Calculate forward propagation.

Parameters
  • after_outs (Tensor) – Batch of outputs after postnets (B, Lmax, odim).

  • before_outs (Tensor) – Batch of outputs before postnets (B, Lmax, odim).

  • d_outs (Tensor) – Batch of outputs of duration predictor (B, Tmax).

  • ys (Tensor) – Batch of target features (B, Lmax, odim).

  • ds (Tensor) – Batch of durations (B, Tmax).

  • ilens (LongTensor) – Batch of the lengths of each input (B,).

  • olens (LongTensor) – Batch of the lengths of each target (B,).

Returns

L1 loss value. Tensor: Duration predictor loss value.

Return type

Tensor

espnet.nets.pytorch_backend.e2e_tts_tacotron2

Tacotron 2 related modules.

class espnet.nets.pytorch_backend.e2e_tts_tacotron2.GuidedAttentionLoss(sigma=0.4, alpha=1.0, reset_always=True)[source]

Bases: torch.nn.modules.module.Module

Guided attention loss function module.

This module calculates the guided attention loss described in Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention, which forces the attention to be diagonal.

Initialize guided attention loss module.

Parameters
  • sigma (float, optional) – Standard deviation controlling how close the attention is to a diagonal.

  • alpha (float, optional) – Scaling coefficient (lambda).

  • reset_always (bool, optional) – Whether to always reset masks.

forward(att_ws, ilens, olens)[source]

Calculate forward propagation.

Parameters
  • att_ws (Tensor) – Batch of attention weights (B, T_max_out, T_max_in).

  • ilens (LongTensor) – Batch of input lengths (B,).

  • olens (LongTensor) – Batch of output lengths (B,).

Returns

Guided attention loss value.

Return type

Tensor
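
The diagonal prior behind this loss penalizes attention mass far from the input-output diagonal. A sketch of the per-utterance weight matrix, assuming the formulation from the referenced paper, W[t, n] = 1 - exp(-((n/N - t/T)^2) / (2 * sigma^2)):

    import torch

    def guided_attention_weight(t_in: int, t_out: int, sigma: float = 0.4):
        """Soft diagonal penalty matrix of shape (t_out, t_in)."""
        grid_out = torch.arange(t_out).float() / t_out  # t / T
        grid_in = torch.arange(t_in).float() / t_in     # n / N
        diff = grid_out.unsqueeze(1) - grid_in.unsqueeze(0)
        return 1.0 - torch.exp(-diff ** 2 / (2 * sigma ** 2))

    w = guided_attention_weight(t_in=50, t_out=60)  # (T_max_out, T_max_in)
    # The loss is then (roughly) the mean of w * att_ws over valid positions.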

class espnet.nets.pytorch_backend.e2e_tts_tacotron2.Tacotron2(idim, odim, args=None)[source]

Bases: espnet.nets.tts_interface.TTSInterface, torch.nn.modules.module.Module

Tacotron2 module for end-to-end text-to-speech (E2E-TTS).

This is a module of Spectrogram prediction network in Tacotron2 described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, which converts the sequence of characters into the sequence of Mel-filterbanks.

Initialize Tacotron2 module.

Parameters
  • idim (int) – Dimension of the inputs.

  • odim (int) – Dimension of the outputs.

  • args (Namespace, optional) –

    • spk_embed_dim (int): Dimension of the speaker embedding.

    • embed_dim (int): Dimension of character embedding.

    • elayers (int): The number of encoder blstm layers.

    • eunits (int): The number of encoder blstm units.

    • econv_layers (int): The number of encoder conv layers.

    • econv_filts (int): The encoder conv filter size.

    • econv_chans (int): The number of encoder conv filter channels.

    • dlayers (int): The number of decoder lstm layers.

    • dunits (int): The number of decoder lstm units.

    • prenet_layers (int): The number of prenet layers.

    • prenet_units (int): The number of prenet units.

    • postnet_layers (int): The number of postnet layers.

    • postnet_filts (int): The postnet filter size.

    • postnet_chans (int): The number of postnet filter channels.

    • output_activation (str): The name of the activation function for outputs.

    • adim (int): The number of dimensions of the MLP in attention.

    • aconv_chans (int): The number of attention conv filter channels.

    • aconv_filts (int): The attention conv filter size.

    • cumulate_att_w (bool): Whether to cumulate previous attention weight.

    • use_batch_norm (bool): Whether to use batch normalization.

    • use_concate (bool): Whether to concatenate encoder embedding

      with decoder lstm outputs.

    • dropout_rate (float): Dropout rate.

    • zoneout_rate (float): Zoneout rate.

    • reduction_factor (int): Reduction factor.

    • spk_embed_dim (int): Number of speaker embedding dimensions.

    • spc_dim (int): Number of spectrogram embedding dimensions

      (only for use_cbhg=True).

    • use_cbhg (bool): Whether to use CBHG module.

    • cbhg_conv_bank_layers (int): The number of convolutional banks in CBHG.

    • cbhg_conv_bank_chans (int): The number of channels of

      convolutional bank in CBHG.

    • cbhg_proj_filts (int):

      The filter size of the projection layer in CBHG.

    • cbhg_proj_chans (int):

      The number of channels of projection layer in CBHG.

    • cbhg_highway_layers (int):

      The number of layers of highway network in CBHG.

    • cbhg_highway_units (int):

      The number of units of highway network in CBHG.

    • cbhg_gru_units (int): The number of units of GRU in CBHG.

    • use_masking (bool):

      Whether to apply masking for padded part in loss calculation.

    • use_weighted_masking (bool):

      Whether to apply weighted masking in loss calculation.

    • bce_pos_weight (float):

      Weight of positive sample of stop token (only for use_masking=True).

    • use-guided-attn-loss (bool): Whether to use guided attention loss.

    • guided-attn-loss-sigma (float): Sigma in guided attention loss.

    • guided-attn-loss-lambda (float): Lambda in guided attention loss.

static add_arguments(parser)[source]

Add model-specific arguments to the parser.

property base_plot_keys

Return base key names to plot during training.

Keys should match what chainer.reporter reports. If you add the key loss, the reporter will report main/loss and validation/main/loss values. Also, loss.png will be created as a figure visualizing main/loss and validation/main/loss values.

Returns

List of strings which are base keys to plot during training.

Return type

list

calculate_all_attentions(xs, ilens, ys, spembs=None, keep_tensor=False, *args, **kwargs)[source]

Calculate all of the attention weights.

Parameters
  • xs (Tensor) – Batch of padded character ids (B, Tmax).

  • ilens (LongTensor) – Batch of lengths of each input batch (B,).

  • ys (Tensor) – Batch of padded target features (B, Lmax, odim).

  • olens (LongTensor) – Batch of the lengths of each target (B,).

  • spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).

  • keep_tensor (bool, optional) – Whether to keep original tensor.

Returns

Batch of attention weights (B, Lmax, Tmax).

Return type

Union[ndarray, Tensor]

forward(xs, ilens, ys, labels, olens, spembs=None, extras=None, *args, **kwargs)[source]

Calculate forward propagation.

Parameters
  • xs (Tensor) – Batch of padded character ids (B, Tmax).

  • ilens (LongTensor) – Batch of lengths of each input batch (B,).

  • ys (Tensor) – Batch of padded target features (B, Lmax, odim).

  • olens (LongTensor) – Batch of the lengths of each target (B,).

  • spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).

  • extras (Tensor, optional) – Batch of groundtruth spectrograms (B, Lmax, spc_dim).

Returns

Loss value.

Return type

Tensor

inference(x, inference_args, spemb=None, *args, **kwargs)[source]

Generate the sequence of features given the sequences of characters.

Parameters
  • x (Tensor) – Input sequence of characters (T,).

  • inference_args (Namespace) –

    • threshold (float): Threshold in inference.

    • minlenratio (float): Minimum length ratio in inference.

    • maxlenratio (float): Maximum length ratio in inference.

  • spemb (Tensor, optional) – Speaker embedding vector (spk_embed_dim).

Returns

Output sequence of features (L, odim). Tensor: Output sequence of stop probabilities (L,). Tensor: Attention weights (L, T).

Return type

Tensor
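
A decoding sketch built from the inference_args keys documented above; model stands in for a trained Tacotron2 instance, and the threshold/ratio values are illustrative:

    import torch
    from argparse import Namespace

    inference_args = Namespace(threshold=0.5, minlenratio=0.0, maxlenratio=10.0)
    x = torch.LongTensor([12, 5, 33, 7])  # input character ids (T,)
    outs, probs, att_ws = model.inference(x, inference_args)
    # outs: (L, odim) features; probs: (L,) stop probabilities;
    # att_ws: (L, T) attention weights, per the returns documented above.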

class espnet.nets.pytorch_backend.e2e_tts_tacotron2.Tacotron2Loss(use_masking=True, use_weighted_masking=False, bce_pos_weight=20.0)[source]

Bases: torch.nn.modules.module.Module

Loss function module for Tacotron2.

Initialize Tacotron2 loss module.

Parameters
  • use_masking (bool) – Whether to apply masking for padded part in loss calculation.

  • use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.

  • bce_pos_weight (float) – Weight of positive sample of stop token.

forward(after_outs, before_outs, logits, ys, labels, olens)[source]

Calculate forward propagation.

Parameters
  • after_outs (Tensor) – Batch of outputs after postnets (B, Lmax, odim).

  • before_outs (Tensor) – Batch of outputs before postnets (B, Lmax, odim).

  • logits (Tensor) – Batch of stop logits (B, Lmax).

  • ys (Tensor) – Batch of padded target features (B, Lmax, odim).

  • labels (LongTensor) – Batch of the sequences of stop token labels (B, Lmax).

  • olens (LongTensor) – Batch of the lengths of each target (B,).

Returns

L1 loss value. Tensor: Mean square error loss value. Tensor: Binary cross entropy loss value.

Return type

Tensor

espnet.nets.pytorch_backend.e2e_st_conformer

Conformer speech translation model (pytorch).

It is a fusion of e2e_st_transformer.py. Refer to: https://arxiv.org/abs/2005.08100

class espnet.nets.pytorch_backend.e2e_st_conformer.E2E(idim, odim, args, ignore_id=-1)[source]

Bases: espnet.nets.pytorch_backend.e2e_st_transformer.E2E

E2E module.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

Construct an E2E object.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

static add_arguments(parser)[source]

Add arguments.

static add_conformer_arguments(parser)[source]

Add arguments for conformer model.

espnet.nets.pytorch_backend.e2e_asr_transformer

Transformer speech recognition model (pytorch).

class espnet.nets.pytorch_backend.e2e_asr_transformer.E2E(idim, odim, args, ignore_id=-1)[source]

Bases: espnet.nets.asr_interface.ASRInterface, torch.nn.modules.module.Module

E2E module.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

Construct an E2E object.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

static add_arguments(parser)[source]

Add arguments.

property attention_plot_class

Return PlotAttentionReport.

calculate_all_attentions(xs_pad, ilens, ys_pad)[source]

E2E attention calculation.

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)

Returns

attention weights (B, H, Lmax, Tmax)

Return type

float ndarray

calculate_all_ctc_probs(xs_pad, ilens, ys_pad)[source]

E2E CTC probability calculation.

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)

Returns

CTC probability (B, Tmax, vocab)

Return type

float ndarray

encode(x)[source]

Encode acoustic features.

Parameters

x (ndarray) – source acoustic feature (T, D)

Returns

encoder outputs

Return type

torch.Tensor

forward(xs_pad, ilens, ys_pad)[source]

E2E forward.

Parameters
  • xs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax, idim)

  • ilens (torch.Tensor) – batch of lengths of source sequences (B)

  • ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)

Returns

ctc loss value

Return type

torch.Tensor

Returns

attention loss value

Return type

torch.Tensor

Returns

accuracy in attention decoder

Return type

float

recognize(x, recog_args, char_list=None, rnnlm=None, use_jit=False)[source]

Recognize input speech.

Parameters
  • x (ndnarray) – input acoustic feature (B, T, D) or (T, D)

  • recog_args (Namespace) – argment Namespace contraining options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list
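
A decoding sketch; model and char_list are placeholders, and the recog_args attribute names are assumptions drawn from typical espnet decoding configurations rather than from this page:

    import numpy as np
    from argparse import Namespace

    recog_args = Namespace(beam_size=10, penalty=0.0, ctc_weight=0.3,
                           maxlenratio=0.0, minlenratio=0.0, nbest=1)
    feat = np.random.randn(100, 83).astype(np.float32)  # (T, D)
    nbest_hyps = model.recognize(feat, recog_args, char_list=char_list)
    # Each hypothesis typically carries a token id sequence and its score.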

reset_parameters(args)[source]

Initialize parameters.

scorers()[source]

Scorers.

espnet.nets.pytorch_backend.e2e_vc_tacotron2

Tacotron2-VC related modules.

class espnet.nets.pytorch_backend.e2e_vc_tacotron2.Tacotron2(idim, odim, args=None)[source]

Bases: espnet.nets.tts_interface.TTSInterface, torch.nn.modules.module.Module

VC Tacotron2 module for VC.

This is a module of the Tacotron2-based VC model, which converts a sequence of acoustic features into another sequence of acoustic features.

Initialize Tacotron2 module.

Parameters
  • idim (int) – Dimension of the inputs.

  • odim (int) – Dimension of the outputs.

  • args (Namespace, optional) –

    • spk_embed_dim (int): Dimension of the speaker embedding.

    • elayers (int): The number of encoder blstm layers.

    • eunits (int): The number of encoder blstm units.

    • econv_layers (int): The number of encoder conv layers.

    • econv_filts (int): The encoder conv filter size.

    • econv_chans (int): The number of encoder conv filter channels.

    • dlayers (int): The number of decoder lstm layers.

    • dunits (int): The number of decoder lstm units.

    • prenet_layers (int): The number of prenet layers.

    • prenet_units (int): The number of prenet units.

    • postnet_layers (int): The number of postnet layers.

    • postnet_filts (int): The postnet filter size.

    • postnet_chans (int): The number of postnet filter channels.

    • output_activation (str): The name of the activation function for outputs.

    • adim (int): The number of dimensions of the MLP in attention.

    • aconv_chans (int): The number of attention conv filter channels.

    • aconv_filts (int): The attention conv filter size.

    • cumulate_att_w (bool): Whether to cumulate previous attention weight.

    • use_batch_norm (bool): Whether to use batch normalization.

    • use_concate (bool):

      Whether to concatenate encoder embedding with decoder lstm outputs.

    • dropout_rate (float): Dropout rate.

    • zoneout_rate (float): Zoneout rate.

    • reduction_factor (int): Reduction factor.

    • spk_embed_dim (int): Number of speaker embedding dimensions.

    • spc_dim (int): Number of spectrogram embedding dimensions

      (only for use_cbhg=True).

    • use_cbhg (bool): Whether to use CBHG module.

    • cbhg_conv_bank_layers (int):

      The number of convolutional banks in CBHG.

    • cbhg_conv_bank_chans (int):

      The number of channels of convolutional bank in CBHG.

    • cbhg_proj_filts (int):

      The filter size of the projection layer in CBHG.

    • cbhg_proj_chans (int):

      The number of channels of projection layer in CBHG.

    • cbhg_highway_layers (int):

      The number of layers of highway network in CBHG.

    • cbhg_highway_units (int):

      The number of units of highway network in CBHG.

    • cbhg_gru_units (int): The number of units of GRU in CBHG.

    • use_masking (bool): Whether to mask padded part in loss calculation.

    • bce_pos_weight (float): Weight of positive sample of stop token

      (only for use_masking=True).

    • use-guided-attn-loss (bool): Whether to use guided attention loss.

    • guided-attn-loss-sigma (float): Sigma in guided attention loss.

    • guided-attn-loss-lambda (float): Lambda in guided attention loss.

static add_arguments(parser)[source]

Add model-specific arguments to the parser.

property base_plot_keys

Return base key names to plot during training.

Keys should match what chainer.reporter reports. If you add the key loss, the reporter will report main/loss and validation/main/loss values. Also, loss.png will be created as a figure visualizing main/loss and validation/main/loss values.

Returns

List of strings which are base keys to plot during training.

Return type

list

calculate_all_attentions(xs, ilens, ys, spembs=None, *args, **kwargs)[source]

Calculate all of the attention weights.

Parameters
  • xs (Tensor) – Batch of padded acoustic features (B, Tmax, idim).

  • ilens (LongTensor) – Batch of lengths of each input batch (B,).

  • ys (Tensor) – Batch of padded target features (B, Lmax, odim).

  • olens (LongTensor) – Batch of the lengths of each target (B,).

  • spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).

Returns

Batch of attention weights (B, Lmax, Tmax).

Return type

numpy.ndarray

forward(xs, ilens, ys, labels, olens, spembs=None, spcs=None, *args, **kwargs)[source]

Calculate forward propagation.

Parameters
  • xs (Tensor) – Batch of padded acoustic features (B, Tmax, idim).

  • ilens (LongTensor) – Batch of lengths of each input batch (B,).

  • ys (Tensor) – Batch of padded target features (B, Lmax, odim).

  • olens (LongTensor) – Batch of the lengths of each target (B,).

  • spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).

  • spcs (Tensor, optional) – Batch of groundtruth spectrograms (B, Lmax, spc_dim).

Returns

Loss value.

Return type

Tensor

inference(x, inference_args, spemb=None, *args, **kwargs)[source]

Generate the sequence of features given the sequences of characters.

Parameters
  • x (Tensor) – Input sequence of acoustic features (T, idim).

  • inference_args (Namespace) –

    • threshold (float): Threshold in inference.

    • minlenratio (float): Minimum length ratio in inference.

    • maxlenratio (float): Maximum length ratio in inference.

  • spemb (Tensor, optional) – Speaker embedding vector (spk_embed_dim).

Returns

Output sequence of features (L, odim). Tensor: Output sequence of stop probabilities (L,). Tensor: Attention weights (L, T).

Return type

Tensor

espnet.nets.pytorch_backend.e2e_asr_mulenc

Define e2e module for multi-encoder network. https://arxiv.org/pdf/1811.04903.pdf.

class espnet.nets.pytorch_backend.e2e_asr_mulenc.E2E(idims, odim, args)[source]

Bases: espnet.nets.asr_interface.ASRInterface, torch.nn.modules.module.Module

E2E module.

Parameters
  • idims (List) – List of dimensions of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

Initialize this class with python-level args.

Parameters
  • idims (list) – list of input feature dimensions.

  • odim (int) – The number of output vocab.

  • args (Namespace) – arguments

static add_arguments(parser)[source]

Add arguments for multi-encoder setting.

static attention_add_arguments(parser)[source]

Add arguments for attentions in multi-encoder setting.

calculate_all_attentions(xs_pad_list, ilens_list, ys_pad)[source]

E2E attention calculation.

Parameters
  • xs_pad_list (List) – list of batch (torch.Tensor) of padded input sequences [(B, Tmax_1, idim), (B, Tmax_2, idim),..]

  • ilens_list (List) – list of batch (torch.Tensor) of lengths of input sequences [(B), (B), ..]

  • ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)

Returns

attention weights with the following shape,
  1. multi-head case => attention weights (B, H, Lmax, Tmax),

  2. multi-encoder case => [(B, Lmax, Tmax1), (B, Lmax, Tmax2), …, (B, Lmax, NumEncs)],

  3. other case => attention weights (B, Lmax, Tmax).

Return type

float ndarray or list

calculate_all_ctc_probs(xs_pad_list, ilens_list, ys_pad)[source]

E2E CTC probability calculation.

Parameters
  • xs_pad_list (List) – list of batch (torch.Tensor) of padded input sequences [(B, Tmax_1, idim), (B, Tmax_2, idim),..]

  • ilens_list (List) – list of batch (torch.Tensor) of lengths of input sequences [(B), (B), ..]

  • ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)

Returns

CTC probability (B, Tmax, vocab)

Return type

float ndarray or list

static ctc_add_arguments(parser)[source]

Add arguments for ctc in multi-encoder setting.

static decoder_add_arguments(parser)[source]

Add arguments for decoder in multi-encoder setting.

encode(x_list)[source]

Encode feature.

Parameters

x_list (list) – input feature [(T1, D), (T2, D), … ]

Returns

encoded feature [(T1, D), (T2, D), …]

Return type

list

static encoder_add_arguments(parser)[source]

Add arguments for encoders in multi-encoder setting.

forward(xs_pad_list, ilens_list, ys_pad)[source]

E2E forward.

Parameters
  • xs_pad_list (List) – list of batch (torch.Tensor) of padded input sequences [(B, Tmax_1, idim), (B, Tmax_2, idim),..]

  • ilens_list (List) – list of batch (torch.Tensor) of lengths of input sequences [(B), (B), ..]

  • ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)

Returns

loss value

Return type

torch.Tensor

init_like_chainer()[source]

Initialize weight like chainer.

chainer basically uses the LeCun way: W ~ Normal(0, fan_in ** -0.5), b = 0. pytorch basically uses W, b ~ Uniform(-fan_in ** -0.5, fan_in ** -0.5).

However, there are two exceptions as far as I know: EmbedID.W ~ Normal(0, 1), and LSTM.upward.b[forget_gate_range] = 1 (but not used in NStepLSTM).
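
A NumPy sketch of the two initialization rules contrasted above (illustration only; fan_in is taken as the second dimension of an (out, in) weight matrix):

    import numpy as np

    def lecun_normal(shape):
        """chainer-style: W ~ Normal(0, fan_in ** -0.5); biases are set to 0."""
        fan_in = shape[1]
        return np.random.normal(0.0, fan_in ** -0.5, size=shape)

    def torch_default(shape):
        """pytorch-style: W, b ~ Uniform(-fan_in ** -0.5, fan_in ** -0.5)."""
        bound = shape[1] ** -0.5
        return np.random.uniform(-bound, bound, size=shape)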

recognize(x_list, recog_args, char_list, rnnlm=None)[source]

E2E beam search.

Parameters
  • x_list (list of ndarray) – list of input acoustic features [(T1, D), (T2, D), …]

  • recog_args (Namespace) – argument Namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list

recognize_batch(xs_list, recog_args, char_list, rnnlm=None)[source]

E2E beam search.

Parameters
  • xs_list (list) – list of list of input acoustic feature arrays [[(T1_1, D), (T1_2, D), …],[(T2_1, D), (T2_2, D), …], …]

  • recog_args (Namespace) – argument Namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list

scorers()[source]

Get scorers for beam_search (optional).

Returns

dict of ScorerInterface objects

Return type

dict[str, ScorerInterface]

class espnet.nets.pytorch_backend.e2e_asr_mulenc.Reporter(**links)[source]

Bases: chainer.link.Chain

Define a chainer reporter wrapper.

report(loss_ctc_list, loss_att, acc, cer_ctc_list, cer, wer, mtl_loss)[source]

Define a chainer reporter function.

espnet.nets.pytorch_backend.e2e_st

RNN sequence-to-sequence speech translation model (pytorch).

class espnet.nets.pytorch_backend.e2e_st.E2E(idim, odim, args)[source]

Bases: espnet.nets.st_interface.STInterface, torch.nn.modules.module.Module

E2E module.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

Construct an E2E object.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

static add_arguments(parser)[source]

Add arguments.

static attention_add_arguments(parser)[source]

Add arguments for the attention.

calculate_all_attentions(xs_pad, ilens, ys_pad, ys_pad_src)[source]

E2E attention calculation.

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)

  • ys_pad_src (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)

Returns

attention weights with the following shape, 1) multi-head case => attention weights (B, H, Lmax, Tmax), 2) other case => attention weights (B, Lmax, Tmax).

Return type

float ndarray

calculate_all_ctc_probs(xs_pad, ilens, ys_pad, ys_pad_src)[source]

E2E CTC probability calculation.

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)

  • ys_pad_src (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)

Returns

CTC probability (B, Tmax, vocab)

Return type

float ndarray

static decoder_add_arguments(parser)[source]

Add arguments for the decoder.

encode(x)[source]

Encode acoustic features.

Parameters

x (ndarray) – input acoustic feature (T, D)

Returns

encoder outputs

Return type

torch.Tensor

static encoder_add_arguments(parser)[source]

Add arguments for the encoder.

forward(xs_pad, ilens, ys_pad, ys_pad_src)[source]

E2E forward.

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)

  • ys_pad_src (torch.Tensor) – batch of padded source token id sequence tensor (B, Lmax)

Returns

loss value

Return type

torch.Tensor

forward_asr(hs_pad, hlens, ys_pad)[source]

Forward pass in the auxiliary ASR task.

Parameters
  • hs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax, idim)

  • hlens (torch.Tensor) – batch of lengths of input sequences (B)

  • ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)

Returns

ASR attention loss value

Return type

torch.Tensor

Returns

accuracy in ASR attention decoder

Return type

float

Returns

ASR CTC loss value

Return type

torch.Tensor

Returns

character error rate from CTC prediction

Return type

float

Returns

character error rate from attention decoder prediction

Return type

float

Returns

word error rate from attention decoder prediction

Return type

float

forward_mt(xs_pad, ys_pad)[source]

Forward pass in the auxiliary MT task.

Parameters
  • xs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax, idim)

  • ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)

Returns

MT loss value

Return type

torch.Tensor

Returns

accuracy in MT decoder

Return type

float

init_like_chainer()[source]

Initialize weight like chainer.

chainer basically uses the LeCun way: W ~ Normal(0, fan_in ** -0.5), b = 0. pytorch basically uses W, b ~ Uniform(-fan_in ** -0.5, fan_in ** -0.5). However, there are two exceptions as far as I know: EmbedID.W ~ Normal(0, 1), and LSTM.upward.b[forget_gate_range] = 1 (but not used in NStepLSTM).

scorers()[source]

Scorers.

subsample_frames(x)[source]

Subsample speech frames in the encoder.

translate(x, trans_args, char_list, rnnlm=None)[source]

E2E beam search.

Parameters
  • x (ndarray) – input acoustic feature (T, D)

  • trans_args (Namespace) – argument Namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list

translate_batch(xs, trans_args, char_list, rnnlm=None)[source]

E2E batch beam search.

Parameters
  • xs (list) – list of input acoustic feature arrays [(T_1, D), (T_2, D), …]

  • trans_args (Namespace) – argument Namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list

class espnet.nets.pytorch_backend.e2e_st.Reporter(**links)[source]

Bases: chainer.link.Chain

A chainer reporter wrapper.

report(loss_asr, loss_mt, loss_st, acc_asr, acc_mt, acc, cer_ctc, cer, wer, bleu, mtl_loss)[source]

Report at every step.

espnet.nets.pytorch_backend.e2e_asr_conformer

Conformer speech recognition model (pytorch).

It is a fusion of e2e_asr_transformer.py. Refer to: https://arxiv.org/abs/2005.08100

class espnet.nets.pytorch_backend.e2e_asr_conformer.E2E(idim, odim, args, ignore_id=-1)[source]

Bases: espnet.nets.pytorch_backend.e2e_asr_transformer.E2E

E2E module.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

Construct an E2E object.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

static add_arguments(parser)[source]

Add arguments.

static add_conformer_arguments(parser)[source]

Add arguments for conformer model.

espnet.nets.pytorch_backend.e2e_tts_transformer

TTS-Transformer related modules.

class espnet.nets.pytorch_backend.e2e_tts_transformer.GuidedMultiHeadAttentionLoss(sigma=0.4, alpha=1.0, reset_always=True)[source]

Bases: espnet.nets.pytorch_backend.e2e_tts_tacotron2.GuidedAttentionLoss

Guided attention loss function module for multi head attention.

Parameters
  • sigma (float, optional) – Standard deviation to control how close attention to a diagonal.

  • alpha (float, optional) – Scaling coefficient (lambda).

  • reset_always (bool, optional) – Whether to always reset masks.

Initialize guided attention loss module.

Parameters
  • sigma (float, optional) – Standard deviation to control how close attention to a diagonal.

  • alpha (float, optional) – Scaling coefficient (lambda).

  • reset_always (bool, optional) – Whether to always reset masks.

forward(att_ws, ilens, olens)[source]

Calculate forward propagation.

Parameters
  • att_ws (Tensor) – Batch of multi head attention weights (B, H, T_max_out, T_max_in).

  • ilens (LongTensor) – Batch of input lengths (B,).

  • olens (LongTensor) – Batch of output lengths (B,).

Returns

Guided attention loss value.

Return type

Tensor
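
The loss penalizes attention mass far from the diagonal. A minimal NumPy sketch of the per-utterance penalty matrix, assuming the Tacotron 2-style formulation W[t, n] = 1 - exp(-(n/N - t/T)^2 / (2 * sigma^2)); the module applies such a matrix to every selected head:

import numpy as np

def guided_attention_weight(ilen, olen, sigma=0.4):
    # penalty matrix (olen, ilen); entries near the diagonal are close to 0
    grid_out, grid_in = np.meshgrid(np.arange(olen), np.arange(ilen), indexing="ij")
    return 1.0 - np.exp(-((grid_in / ilen - grid_out / olen) ** 2) / (2 * sigma ** 2))

# loss for one head with attention matrix att_w of shape (olen, ilen):
# loss = (guided_attention_weight(ilen, olen) * att_w).mean()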

class espnet.nets.pytorch_backend.e2e_tts_transformer.TTSPlot(att_vis_fn, data, outdir, converter, transform, device, reverse=False, ikey='input', iaxis=0, okey='output', oaxis=0)[source]

Bases: espnet.nets.pytorch_backend.transformer.plot.PlotAttentionReport

Attention plot module for TTS-Transformer.

plotfn(data, attn_dict, outdir, suffix='png', savefn=None)[source]

Plot multi head attentions.

Parameters
  • data (dict) – Utts info from json file.

  • attn_dict (dict) – Multi head attention dict. Values should be numpy.ndarray (H, L, T)

  • outdir (str) – Directory name to save figures.

  • suffix (str) – Filename suffix including image type (e.g., png).

  • savefn (function) – Function to save figures.

class espnet.nets.pytorch_backend.e2e_tts_transformer.Transformer(idim, odim, args=None)[source]

Bases: espnet.nets.tts_interface.TTSInterface, torch.nn.modules.module.Module

Text-to-Speech Transformer module.

This is a module of text-to-speech Transformer described in Neural Speech Synthesis with Transformer Network, which converts a sequence of characters or phonemes into a sequence of Mel-filterbanks.

Initialize TTS-Transformer module.

Parameters
  • idim (int) – Dimension of the inputs.

  • odim (int) – Dimension of the outputs.

  • args (Namespace, optional) –

    • embed_dim (int): Dimension of character embedding.

    • eprenet_conv_layers (int):

      Number of encoder prenet convolution layers.

    • eprenet_conv_chans (int):

      Number of encoder prenet convolution channels.

    • eprenet_conv_filts (int): Filter size of encoder prenet convolution.

    • dprenet_layers (int): Number of decoder prenet layers.

    • dprenet_units (int): Number of decoder prenet hidden units.

    • elayers (int): Number of encoder layers.

    • eunits (int): Number of encoder hidden units.

    • adim (int): Number of attention transformation dimensions.

    • aheads (int): Number of heads for multi head attention.

    • dlayers (int): Number of decoder layers.

    • dunits (int): Number of decoder hidden units.

    • postnet_layers (int): Number of postnet layers.

    • postnet_chans (int): Number of postnet channels.

    • postnet_filts (int): Filter size of postnet.

    • use_scaled_pos_enc (bool):

      Whether to use trainable scaled positional encoding.

    • use_batch_norm (bool):

      Whether to use batch normalization in encoder prenet.

    • encoder_normalize_before (bool):

      Whether to perform layer normalization before encoder block.

    • decoder_normalize_before (bool):

      Whether to perform layer normalization before decoder block.

    • encoder_concat_after (bool): Whether to concatenate attention

      layer’s input and output in encoder.

    • decoder_concat_after (bool): Whether to concatenate attention

      layer’s input and output in decoder.

    • reduction_factor (int): Reduction factor.

    • spk_embed_dim (int): Number of speaker embedding dimensions.

    • spk_embed_integration_type: How to integrate speaker embedding.

    • transformer_init (float): How to initialize transformer parameters.

    • transformer_lr (float): Initial value of learning rate.

    • transformer_warmup_steps (int): Optimizer warmup steps.

    • transformer_enc_dropout_rate (float):

      Dropout rate in encoder except attention & positional encoding.

    • transformer_enc_positional_dropout_rate (float):

      Dropout rate after encoder positional encoding.

    • transformer_enc_attn_dropout_rate (float):

      Dropout rate in encoder self-attention module.

    • transformer_dec_dropout_rate (float):

      Dropout rate in decoder except attention & positional encoding.

    • transformer_dec_positional_dropout_rate (float):

      Dropout rate after decoder positional encoding.

    • transformer_dec_attn_dropout_rate (float):

      Dropout rate in decoder self-attention module.

    • transformer_enc_dec_attn_dropout_rate (float):

      Dropout rate in encoder-decoder attention module.

    • eprenet_dropout_rate (float): Dropout rate in encoder prenet.

    • dprenet_dropout_rate (float): Dropout rate in decoder prenet.

    • postnet_dropout_rate (float): Dropout rate in postnet.

    • use_masking (bool):

      Whether to apply masking for padded part in loss calculation.

    • use_weighted_masking (bool):

      Whether to apply weighted masking in loss calculation.

    • bce_pos_weight (float): Positive sample weight in bce calculation

      (only for use_masking=true).

    • loss_type (str): How to calculate loss.

    • use_guided_attn_loss (bool): Whether to use guided attention loss.

    • num_heads_applied_guided_attn (int):

      Number of heads in each layer to apply guided attention loss.

    • num_layers_applied_guided_attn (int):

      Number of layers to apply guided attention loss.

    • modules_applied_guided_attn (list):

      List of module names to apply guided attention loss.

    • guided-attn-loss-sigma (float): Sigma in guided attention loss.

    • guided-attn-loss-lambda (float): Lambda in guided attention loss.

static add_arguments(parser)[source]

Add model-specific arguments to the parser.

property attention_plot_class

Return plot class for attention weight plot.

property base_plot_keys

Return base key names to plot during training.

Keys should match what chainer.reporter reports. If you add the key loss, the reporter will report main/loss and validation/main/loss values. Also, loss.png will be created as a figure visualizing main/loss and validation/main/loss values.

Returns

List of strings which are base keys to plot during training.

Return type

list

calculate_all_attentions(xs, ilens, ys, olens, spembs=None, skip_output=False, keep_tensor=False, *args, **kwargs)[source]

Calculate all of the attention weights.

Parameters
  • xs (Tensor) – Batch of padded character ids (B, Tmax).

  • ilens (LongTensor) – Batch of lengths of each input batch (B,).

  • ys (Tensor) – Batch of padded target features (B, Lmax, odim).

  • olens (LongTensor) – Batch of the lengths of each target (B,).

  • spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).

  • skip_output (bool, optional) – Whether to skip calculating the final output.

  • keep_tensor (bool, optional) – Whether to keep original tensor.

Returns

Dict of attention weights and outputs.

Return type

dict

forward(xs, ilens, ys, labels, olens, spembs=None, *args, **kwargs)[source]

Calculate forward propagation.

Parameters
  • xs (Tensor) – Batch of padded character ids (B, Tmax).

  • ilens (LongTensor) – Batch of lengths of each input batch (B,).

  • ys (Tensor) – Batch of padded target features (B, Lmax, odim).

  • olens (LongTensor) – Batch of the lengths of each target (B,).

  • spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).

Returns

Loss value.

Return type

Tensor

inference(x, inference_args, spemb=None, *args, **kwargs)[source]

Generate the sequence of features given the sequences of characters.

Parameters
  • x (Tensor) – Input sequence of characters (T,).

  • inference_args (Namespace) –

    • threshold (float): Threshold in inference.

    • minlenratio (float): Minimum length ratio in inference.

    • maxlenratio (float): Maximum length ratio in inference.

  • spemb (Tensor, optional) – Speaker embedding vector (spk_embed_dim).

Returns

Output sequence of features (L, odim). Tensor: Output sequence of stop probabilities (L,). Tensor: Encoder-decoder (source) attention weights (#layers, #heads, L, T).

Return type

Tensor

espnet.nets.pytorch_backend.e2e_asr

RNN sequence-to-sequence speech recognition model (pytorch).

class espnet.nets.pytorch_backend.e2e_asr.E2E(idim, odim, args)[source]

Bases: espnet.nets.asr_interface.ASRInterface, torch.nn.modules.module.Module

E2E module.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

Construct an E2E object.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

static add_arguments(parser)[source]

Add arguments.

static attention_add_arguments(parser)[source]

Add arguments for the attention.

calculate_all_attentions(xs_pad, ilens, ys_pad)[source]

E2E attention calculation.

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)

Returns

attention weights with the following shape, 1) multi-head case => attention weights (B, H, Lmax, Tmax), 2) other case => attention weights (B, Lmax, Tmax).

Return type

float ndarray

calculate_all_ctc_probs(xs_pad, ilens, ys_pad)[source]

E2E CTC probability calculation.

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)

Returns

CTC probability (B, Tmax, vocab)

Return type

float ndarray

static decoder_add_arguments(parser)[source]

Add arguments for the decoder.

encode(x)[source]

Encode acoustic features.

Parameters

x (ndarray) – input acoustic feature (T, D)

Returns

encoder outputs

Return type

torch.Tensor

static encoder_add_arguments(parser)[source]

Add arguments for the encoder.

enhance(xs)[source]

Forward only in the frontend stage.

Parameters

xs (ndarray) – input acoustic feature (T, C, F)

Returns

enhanced feature

Return type

torch.Tensor

forward(xs_pad, ilens, ys_pad)[source]

E2E forward.

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)

Returns

loss value

Return type

torch.Tensor

init_like_chainer()[source]

Initialize weight like chainer.

chainer basically uses the LeCun way: W ~ Normal(0, fan_in ** -0.5), b = 0. pytorch basically uses W, b ~ Uniform(-fan_in ** -0.5, fan_in ** -0.5). However, there are two exceptions as far as I know: EmbedID.W ~ Normal(0, 1), and LSTM.upward.b[forget_gate_range] = 1 (but not used in NStepLSTM).

recognize(x, recog_args, char_list, rnnlm=None)[source]

E2E beam search.

Parameters
  • x (ndarray) – input acoustic feature (T, D)

  • recog_args (Namespace) – argument Namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list

recognize_batch(xs, recog_args, char_list, rnnlm=None)[source]

E2E batch beam search.

Parameters
  • xs (list) – list of input acoustic feature arrays [(T_1, D), (T_2, D), …]

  • recog_args (Namespace) – argument Namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list

scorers()[source]

Scorers.

subsample_frames(x)[source]

Subsample speech frames in the encoder.

class espnet.nets.pytorch_backend.e2e_asr.Reporter(**links)[source]

Bases: chainer.link.Chain

A chainer reporter wrapper.

report(loss_ctc, loss_att, acc, cer_ctc, cer, wer, mtl_loss)[source]

Report at every step.

espnet.nets.pytorch_backend.nets_utils

Network related utility tools.

espnet.nets.pytorch_backend.nets_utils.get_activation(act)[source]

Return activation function.

espnet.nets.pytorch_backend.nets_utils.get_subsample(train_args, mode, arch)[source]

Parse the subsampling factors from the args for the specified mode and arch.

Parameters
  • train_args – argument Namespace containing options.

  • mode – one of (‘asr’, ‘mt’, ‘st’)

  • arch – one of (‘rnn’, ‘rnn-t’, ‘rnn_mix’, ‘rnn_mulenc’, ‘transformer’)

Returns

subsampling factors.

Return type

np.ndarray / List[np.ndarray]
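
A hypothetical invocation sketch for the RNN/ASR case, assuming the training args carry an underscore-separated subsample string; the exact Namespace fields consulted depend on mode and arch:

>>> from argparse import Namespace
>>> from espnet.nets.pytorch_backend.nets_utils import get_subsample
>>> args = Namespace(subsample="1_2_2_1_1", etype="blstmp", elayers=4)
>>> get_subsample(args, mode="asr", arch="rnn")
array([1, 2, 2, 1, 1])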

espnet.nets.pytorch_backend.nets_utils.make_non_pad_mask(lengths, xs=None, length_dim=-1)[source]

Make mask tensor containing indices of non-padded part.

Parameters
  • lengths (LongTensor or List) – Batch of lengths (B,).

  • xs (Tensor, optional) – The reference tensor. If set, masks will be the same shape as this tensor.

  • length_dim (int, optional) – Dimension indicator of the above tensor. See the example.

Returns

mask tensor containing indices of non-padded part.

dtype=torch.uint8 in PyTorch < 1.2, dtype=torch.bool in PyTorch >= 1.2.

Return type

ByteTensor

Examples

With only lengths.

>>> lengths = [5, 3, 2]
>>> make_non_pad_mask(lengths)
masks = [[1, 1, 1, 1, 1],
         [1, 1, 1, 0, 0],
         [1, 1, 0, 0, 0]]

With the reference tensor.

>>> xs = torch.zeros((3, 2, 4))
>>> make_non_pad_mask(lengths, xs)
tensor([[[1, 1, 1, 1],
         [1, 1, 1, 1]],
        [[1, 1, 1, 0],
         [1, 1, 1, 0]],
        [[1, 1, 0, 0],
         [1, 1, 0, 0]]], dtype=torch.uint8)
>>> xs = torch.zeros((3, 2, 6))
>>> make_non_pad_mask(lengths, xs)
tensor([[[1, 1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1, 0]],
        [[1, 1, 1, 0, 0, 0],
         [1, 1, 1, 0, 0, 0]],
        [[1, 1, 0, 0, 0, 0],
         [1, 1, 0, 0, 0, 0]]], dtype=torch.uint8)

With the reference tensor and dimension indicator.

>>> xs = torch.zeros((3, 6, 6))
>>> make_non_pad_mask(lengths, xs, 1)
tensor([[[1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 0]],
        [[1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0]],
        [[1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0]]], dtype=torch.uint8)
>>> make_non_pad_mask(lengths, xs, 2)
tensor([[[1, 1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1, 0]],
        [[1, 1, 1, 0, 0, 0],
         [1, 1, 1, 0, 0, 0],
         [1, 1, 1, 0, 0, 0],
         [1, 1, 1, 0, 0, 0],
         [1, 1, 1, 0, 0, 0],
         [1, 1, 1, 0, 0, 0]],
        [[1, 1, 0, 0, 0, 0],
         [1, 1, 0, 0, 0, 0],
         [1, 1, 0, 0, 0, 0],
         [1, 1, 0, 0, 0, 0],
         [1, 1, 0, 0, 0, 0],
         [1, 1, 0, 0, 0, 0]]], dtype=torch.uint8)
espnet.nets.pytorch_backend.nets_utils.make_pad_mask(lengths, xs=None, length_dim=-1)[source]

Make mask tensor containing indices of padded part.

Parameters
  • lengths (LongTensor or List) – Batch of lengths (B,).

  • xs (Tensor, optional) – The reference tensor. If set, masks will be the same shape as this tensor.

  • length_dim (int, optional) – Dimension indicator of the above tensor. See the example.

Returns

Mask tensor containing indices of padded part.

dtype=torch.uint8 in PyTorch < 1.2, dtype=torch.bool in PyTorch >= 1.2.

Return type

Tensor

Examples

With only lengths.

>>> lengths = [5, 3, 2]
>>> make_pad_mask(lengths)
masks = [[0, 0, 0, 0, 0],
         [0, 0, 0, 1, 1],
         [0, 0, 1, 1, 1]]

With the reference tensor.

>>> xs = torch.zeros((3, 2, 4))
>>> make_pad_mask(lengths, xs)
tensor([[[0, 0, 0, 0],
         [0, 0, 0, 0]],
        [[0, 0, 0, 1],
         [0, 0, 0, 1]],
        [[0, 0, 1, 1],
         [0, 0, 1, 1]]], dtype=torch.uint8)
>>> xs = torch.zeros((3, 2, 6))
>>> make_pad_mask(lengths, xs)
tensor([[[0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 1]],
        [[0, 0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1, 1]],
        [[0, 0, 1, 1, 1, 1],
         [0, 0, 1, 1, 1, 1]]], dtype=torch.uint8)

With the reference tensor and dimension indicator.

>>> xs = torch.zeros((3, 6, 6))
>>> make_pad_mask(lengths, xs, 1)
tensor([[[0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1]],
        [[0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1]],
        [[0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1]]], dtype=torch.uint8)
>>> make_pad_mask(lengths, xs, 2)
tensor([[[0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 1]],
        [[0, 0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1, 1]],
        [[0, 0, 1, 1, 1, 1],
         [0, 0, 1, 1, 1, 1],
         [0, 0, 1, 1, 1, 1],
         [0, 0, 1, 1, 1, 1],
         [0, 0, 1, 1, 1, 1],
         [0, 0, 1, 1, 1, 1]]], dtype=torch.uint8)
espnet.nets.pytorch_backend.nets_utils.mask_by_length(xs, lengths, fill=0)[source]

Mask tensor according to length.

Parameters
  • xs (Tensor) – Batch of input tensor (B, *).

  • lengths (LongTensor or List) – Batch of lengths (B,).

  • fill (int or float) – Value to fill masked part.

Returns

Batch of masked input tensor (B, *).

Return type

Tensor

Examples

>>> x = torch.arange(5).repeat(3, 1) + 1
>>> x
tensor([[1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5]])
>>> lengths = [5, 3, 2]
>>> mask_by_length(x, lengths)
tensor([[1, 2, 3, 4, 5],
        [1, 2, 3, 0, 0],
        [1, 2, 0, 0, 0]])
espnet.nets.pytorch_backend.nets_utils.pad_list(xs, pad_value)[source]

Perform padding for the list of tensors.

Parameters
  • xs (List) – List of Tensors [(T_1, *), (T_2, *), …, (T_B, *)].

  • pad_value (float) – Value for padding.

Returns

Padded tensor (B, Tmax, *).

Return type

Tensor

Examples

>>> x = [torch.ones(4), torch.ones(2), torch.ones(1)]
>>> x
[tensor([1., 1., 1., 1.]), tensor([1., 1.]), tensor([1.])]
>>> pad_list(x, 0)
tensor([[1., 1., 1., 1.],
        [1., 1., 0., 0.],
        [1., 0., 0., 0.]])
espnet.nets.pytorch_backend.nets_utils.rename_state_dict(old_prefix: str, new_prefix: str, state_dict: Dict[str, torch.Tensor])[source]

Replace keys of old prefix with new prefix in state dict.
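
A minimal usage sketch, assuming the rename happens in place on the given dict (the key names are hypothetical):

>>> import torch
>>> state = {"encoder.norm.weight": torch.ones(4)}
>>> rename_state_dict("encoder.", "enc.", state)
>>> sorted(state)
['enc.norm.weight']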

espnet.nets.pytorch_backend.nets_utils.th_accuracy(pad_outputs, pad_targets, ignore_label)[source]

Calculate accuracy.

Parameters
  • pad_outputs (Tensor) – Prediction tensors (B * Lmax, D).

  • pad_targets (LongTensor) – Target label tensors (B, Lmax, D).

  • ignore_label (int) – Ignore label id.

Returns

Accuracy value (0.0 - 1.0).

Return type

float
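
A sketch of the masking logic behind this helper, assuming targets are label ids of shape (B, Lmax) aligned with predictions of shape (B * Lmax, D); positions equal to ignore_label are excluded from both numerator and denominator:

import torch

def accuracy_sketch(pad_outputs, pad_targets, ignore_label=-1):
    # pad_outputs: (B * Lmax, D) logits, pad_targets: (B, Lmax) label ids
    pred = pad_outputs.argmax(dim=-1).view(pad_targets.size(0), -1)
    mask = pad_targets != ignore_label
    correct = (pred[mask] == pad_targets[mask]).sum()
    return float(correct) / float(mask.sum())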

espnet.nets.pytorch_backend.nets_utils.to_device(m, x)[source]

Send tensor into the device of the module.

Parameters
  • m (torch.nn.Module) – Torch module.

  • x (Tensor) – Torch tensor.

Returns

Torch tensor located in the same place as torch module.

Return type

Tensor

espnet.nets.pytorch_backend.nets_utils.to_torch_tensor(x)[source]

Convert a numpy.ndarray (or dict of arrays) to torch.Tensor or ComplexTensor.

Parameters

x – Inputs. It should be one of numpy.ndarray, Tensor, ComplexTensor, and dict.

Returns

Type converted inputs.

Return type

Tensor or ComplexTensor

Examples

>>> xs = np.ones(3, dtype=np.float32)
>>> to_torch_tensor(xs)
tensor([1., 1., 1.])
>>> xs = torch.ones(3, 4, 5)
>>> assert to_torch_tensor(xs) is xs
>>> xs = {'real': xs, 'imag': xs}
>>> to_torch_tensor(xs)
ComplexTensor(
Real:
tensor([1., 1., 1.])
Imag:
tensor([1., 1., 1.])
)

espnet.nets.pytorch_backend.ctc

class espnet.nets.pytorch_backend.ctc.CTC(odim, eprojs, dropout_rate, ctc_type='warpctc', reduce=True)[source]

Bases: torch.nn.modules.module.Module

CTC module

Parameters
  • odim (int) – dimension of outputs

  • eprojs (int) – number of encoder projection units

  • dropout_rate (float) – dropout rate (0.0 ~ 1.0)

  • ctc_type (str) – builtin or warpctc

  • reduce (bool) – reduce the CTC loss into a scalar

argmax(hs_pad)[source]

argmax of frame activations

Parameters

hs_pad (torch.Tensor) – 3d tensor (B, Tmax, eprojs)

Returns

argmax applied 2d tensor (B, Tmax)

Return type

torch.Tensor

forced_align(h, y, blank_id=0)[source]

forced alignment.

Parameters
  • h (torch.Tensor) – hidden state sequence, 2d tensor (T, D)

  • y (torch.Tensor) – id sequence tensor, 1d tensor (L)

  • blank_id (int) – blank symbol index

Returns

best alignment results

Return type

list

forward(hs_pad, hlens, ys_pad)[source]

CTC forward

Parameters
  • hs_pad (torch.Tensor) – batch of padded hidden state sequences (B, Tmax, D)

  • hlens (torch.Tensor) – batch of lengths of hidden state sequences (B)

  • ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)

Returns

ctc loss value

Return type

torch.Tensor
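
A minimal usage sketch, assuming the builtin CTC type is available and -1 marks padded positions in ys_pad (shapes are illustrative):

>>> import torch
>>> ctc = CTC(odim=5, eprojs=8, dropout_rate=0.0, ctc_type='builtin')
>>> hs_pad = torch.randn(2, 10, 8)                   # (B, Tmax, eprojs)
>>> hlens = torch.tensor([10, 7])                    # (B,)
>>> ys_pad = torch.tensor([[1, 2, 3], [1, 2, -1]])   # (B, Lmax), -1 = pad
>>> loss = ctc(hs_pad, hlens, ys_pad)                # scalar since reduce=True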

log_softmax(hs_pad)[source]

log_softmax of frame activations

Parameters

hs_pad (torch.Tensor) – 3d tensor (B, Tmax, eprojs)

Returns

log softmax applied 3d tensor (B, Tmax, odim)

Return type

torch.Tensor

loss_fn(th_pred, th_target, th_ilen, th_olen)[source]

softmax(hs_pad)[source]

softmax of frame activations

Parameters

hs_pad (torch.Tensor) – 3d tensor (B, Tmax, eprojs)

Returns

softmax applied 3d tensor (B, Tmax, odim)

Return type

torch.Tensor

espnet.nets.pytorch_backend.ctc.ctc_for(args, odim, reduce=True)[source]

Returns the CTC module for the given args and output dimension

Parameters
  • args (Namespace) – the program args

  • odim (int) – the output dimension

  • reduce (bool) – return the CTC loss in a scalar

Returns

the corresponding CTC module

espnet.nets.pytorch_backend.e2e_vc_transformer

Voice Transformer Network (Transformer-VC) related modules.

class espnet.nets.pytorch_backend.e2e_vc_transformer.Transformer(idim, odim, args=None)[source]

Bases: espnet.nets.tts_interface.TTSInterface, torch.nn.modules.module.Module

VC Transformer module.

This is a module of the Voice Transformer Network (a.k.a. VTN or Transformer-VC) described in Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining, which converts a source sequence of acoustic features into a target sequence of acoustic features.

Initialize Transformer-VC module.

Parameters
  • idim (int) – Dimension of the inputs.

  • odim (int) – Dimension of the outputs.

  • args (Namespace, optional) –

    • eprenet_conv_layers (int):

      Number of encoder prenet convolution layers.

    • eprenet_conv_chans (int):

      Number of encoder prenet convolution channels.

    • eprenet_conv_filts (int):

      Filter size of encoder prenet convolution.

    • transformer_input_layer (str): Input layer before the encoder.

    • dprenet_layers (int): Number of decoder prenet layers.

    • dprenet_units (int): Number of decoder prenet hidden units.

    • elayers (int): Number of encoder layers.

    • eunits (int): Number of encoder hidden units.

    • adim (int): Number of attention transformation dimensions.

    • aheads (int): Number of heads for multi head attention.

    • dlayers (int): Number of decoder layers.

    • dunits (int): Number of decoder hidden units.

    • postnet_layers (int): Number of postnet layers.

    • postnet_chans (int): Number of postnet channels.

    • postnet_filts (int): Filter size of postnet.

    • use_scaled_pos_enc (bool):

      Whether to use trainable scaled positional encoding.

    • use_batch_norm (bool):

      Whether to use batch normalization in encoder prenet.

    • encoder_normalize_before (bool):

      Whether to perform layer normalization before encoder block.

    • decoder_normalize_before (bool):

      Whether to perform layer normalization before decoder block.

    • encoder_concat_after (bool): Whether to concatenate

      attention layer’s input and output in encoder.

    • decoder_concat_after (bool): Whether to concatenate

      attention layer’s input and output in decoder.

    • reduction_factor (int): Reduction factor (for decoder).

    • encoder_reduction_factor (int): Reduction factor (for encoder).

    • spk_embed_dim (int): Number of speaker embedding dimensions.

    • spk_embed_integration_type: How to integrate speaker embedding.

    • transformer_init (float): How to initialize transformer parameters.

    • transformer_lr (float): Initial value of learning rate.

    • transformer_warmup_steps (int): Optimizer warmup steps.

    • transformer_enc_dropout_rate (float):

      Dropout rate in encoder except attention & positional encoding.

    • transformer_enc_positional_dropout_rate (float):

      Dropout rate after encoder positional encoding.

    • transformer_enc_attn_dropout_rate (float):

      Dropout rate in encoder self-attention module.

    • transformer_dec_dropout_rate (float):

      Dropout rate in decoder except attention & positional encoding.

    • transformer_dec_positional_dropout_rate (float):

      Dropout rate after decoder positional encoding.

    • transformer_dec_attn_dropout_rate (float):

      Dropout rate in decoder self-attention module.

    • transformer_enc_dec_attn_dropout_rate (float):

      Dropout rate in encoder-decoder attention module.

    • eprenet_dropout_rate (float): Dropout rate in encoder prenet.

    • dprenet_dropout_rate (float): Dropout rate in decoder prenet.

    • postnet_dropout_rate (float): Dropout rate in postnet.

    • use_masking (bool):

      Whether to apply masking for padded part in loss calculation.

    • use_weighted_masking (bool):

      Whether to apply weighted masking in loss calculation.

    • bce_pos_weight (float): Positive sample weight in bce calculation

      (only for use_masking=true).

    • loss_type (str): How to calculate loss.

    • use_guided_attn_loss (bool): Whether to use guided attention loss.

    • num_heads_applied_guided_attn (int):

      Number of heads in each layer to apply guided attention loss.

    • num_layers_applied_guided_attn (int):

      Number of layers to apply guided attention loss.

    • modules_applied_guided_attn (list):

      List of module names to apply guided attention loss.

    • guided-attn-loss-sigma (float): Sigma in guided attention loss.

    • guided-attn-loss-lambda (float): Lambda in guided attention loss.

static add_arguments(parser)[source]

Add model-specific arguments to the parser.

property attention_plot_class

Return plot class for attention weight plot.

property base_plot_keys

Return base key names to plot during training.

Keys should match what chainer.reporter reports. If you add the key loss, the reporter will report main/loss and validation/main/loss values. Also, loss.png will be created as a figure visualizing main/loss and validation/main/loss values.

Returns

List of strings which are base keys to plot during training.

Return type

list

calculate_all_attentions(xs, ilens, ys, olens, spembs=None, skip_output=False, keep_tensor=False, *args, **kwargs)[source]

Calculate all of the attention weights.

Parameters
  • xs (Tensor) – Batch of padded acoustic features (B, Tmax, idim).

  • ilens (LongTensor) – Batch of lengths of each input batch (B,).

  • ys (Tensor) – Batch of padded target features (B, Lmax, odim).

  • olens (LongTensor) – Batch of the lengths of each target (B,).

  • spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).

  • skip_output (bool, optional) – Whether to skip calculating the final output.

  • keep_tensor (bool, optional) – Whether to keep original tensor.

Returns

Dict of attention weights and outputs.

Return type

dict

forward(xs, ilens, ys, labels, olens, spembs=None, *args, **kwargs)[source]

Calculate forward propagation.

Parameters
  • xs (Tensor) – Batch of padded acoustic features (B, Tmax, idim).

  • ilens (LongTensor) – Batch of lengths of each input batch (B,).

  • ys (Tensor) – Batch of padded target features (B, Lmax, odim).

  • olens (LongTensor) – Batch of the lengths of each target (B,).

  • spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).

Returns

Loss value.

Return type

Tensor

inference(x, inference_args, spemb=None, *args, **kwargs)[source]

Generate the sequence of features given the sequences of acoustic features.

Parameters
  • x (Tensor) – Input sequence of acoustic features (T, idim).

  • inference_args (Namespace) –

    • threshold (float): Threshold in inference.

    • minlenratio (float): Minimum length ratio in inference.

    • maxlenratio (float): Maximum length ratio in inference.

  • spemb (Tensor, optional) – Speaker embedding vector (spk_embed_dim).

Returns

Output sequence of features (L, odim). Tensor: Output sequence of stop probabilities (L,). Tensor: Encoder-decoder (source) attention weights (#layers, #heads, L, T).

Return type

Tensor

espnet.nets.pytorch_backend.wavenet

This code is based on https://github.com/kan-bayashi/PytorchWaveNetVocoder.

class espnet.nets.pytorch_backend.wavenet.CausalConv1d(in_channels, out_channels, kernel_size, dilation=1, bias=True)[source]

Bases: torch.nn.modules.module.Module

1D dilated causal convolution.

forward(x)[source]

Calculate forward propagation.

Parameters

x (Tensor) – Input tensor with the shape (B, in_channels, T).

Returns

Tensor with the shape (B, out_channels, T)

Return type

Tensor

class espnet.nets.pytorch_backend.wavenet.OneHot(depth)[source]

Bases: torch.nn.modules.module.Module

Convert to one-hot vector.

Parameters

depth (int) – Dimension of one-hot vector.

forward(x)[source]

Calculate forward propagation.

Parameters

x (LongTensor) – long tensor variable with the shape (B, T)

Returns

float tensor variable with the shape (B, depth, T)

Return type

Tensor

class espnet.nets.pytorch_backend.wavenet.UpSampling(upsampling_factor, bias=True)[source]

Bases: torch.nn.modules.module.Module

Upsampling layer with deconvolution.

Parameters

upsampling_factor (int) – Upsampling factor.

forward(x)[source]

Calculate forward propagation.

Parameters

x (Tensor) – Input tensor with the shape (B, C, T)

Returns

Tensor with the shape (B, C, T’) where T’ = T * upsampling_factor.

Return type

Tensor

class espnet.nets.pytorch_backend.wavenet.WaveNet(n_quantize=256, n_aux=28, n_resch=512, n_skipch=256, dilation_depth=10, dilation_repeat=3, kernel_size=2, upsampling_factor=0)[source]

Bases: torch.nn.modules.module.Module

Conditional wavenet.

Parameters
  • n_quantize (int) – Number of quantization.

  • n_aux (int) – Number of aux feature dimension.

  • n_resch (int) – Number of filter channels for residual block.

  • n_skipch (int) – Number of filter channels for skip connection.

  • dilation_depth (int) – Depth of dilation (e.g., if set to 10, the maximum dilation is 2^(10-1)).

  • dilation_repeat (int) – Number of dilation repeat.

  • kernel_size (int) – Filter size of dilated causal convolution.

  • upsampling_factor (int) – Upsampling factor.

forward(x, h)[source]

Calculate forward propagation.

Parameters
  • x (LongTensor) – Quantized input waveform tensor with the shape (B, T).

  • h (Tensor) – Auxiliary feature tensor with the shape (B, n_aux, T).

Returns

Logits with the shape (B, T, n_quantize).

Return type

Tensor

generate(x, h, n_samples, interval=None, mode='sampling')[source]

Generate a waveform with the fast generation algorithm.

This generation is based on the Fast WaveNet Generation Algorithm.

Parameters
  • x (LongTensor) – Initial waveform tensor with the shape (T,).

  • h (Tensor) – Auxiliary feature tensor with the shape (n_samples + T, n_aux).

  • n_samples (int) – Number of samples to be generated.

  • interval (int, optional) – Log interval.

  • mode (str, optional) – “sampling” or “argmax”.

Returns

Generated quantized waveform (n_samples).

Return type

ndarray

espnet.nets.pytorch_backend.wavenet.decode_mu_law(y, mu=256)[source]

Perform mu-law decoding.

Parameters
  • y (ndarray) – Quantized audio signal with the range from 0 to mu - 1.

  • mu (int) – Quantized level.

Returns

Audio signal with the range from -1 to 1.

Return type

ndarray

espnet.nets.pytorch_backend.wavenet.encode_mu_law(x, mu=256)[source]

Perform mu-law encoding.

Parameters
  • x (ndarray) – Audio signal with the range from -1 to 1.

  • mu (int) – Quantized level.

Returns

Quantized audio signal with the range from 0 to mu - 1.

Return type

ndarray
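
A NumPy sketch of the standard mu-law companding equations behind these two helpers; the exact rounding and the mapping to 0 .. mu - 1 are assumptions and may differ in detail from the module:

import numpy as np

def encode_mu_law_sketch(x, mu=256):
    mu = mu - 1
    fx = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # compress to [-1, 1]
    return np.floor((fx + 1) / 2 * mu + 0.5).astype(np.int64)  # quantize to [0, mu]

def decode_mu_law_sketch(y, mu=256):
    mu = mu - 1
    fx = 2 * y / mu - 1                                        # back to [-1, 1]
    return np.sign(fx) / mu * ((1 + mu) ** np.abs(fx) - 1)     # expand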

espnet.nets.pytorch_backend.wavenet.initialize(m)[source]

Initialize conv layers with Xavier initialization.

Parameters

m (torch.nn.Module) – Torch module.

espnet.nets.pytorch_backend.e2e_asr_maskctc

Mask CTC based non-autoregressive speech recognition model (pytorch).

See https://arxiv.org/abs/2005.08700 for details.

class espnet.nets.pytorch_backend.e2e_asr_maskctc.E2E(idim, odim, args, ignore_id=-1)[source]

Bases: espnet.nets.pytorch_backend.e2e_asr_transformer.E2E

E2E module.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

Construct an E2E object.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

static add_arguments(parser)[source]

Add arguments.

static add_maskctc_arguments(parser)[source]

Add arguments for maskctc model.

forward(xs_pad, ilens, ys_pad)[source]

E2E forward.

Parameters
  • xs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax, idim)

  • ilens (torch.Tensor) – batch of lengths of source sequences (B)

  • ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)

Returns

ctc loss value

Return type

torch.Tensor

Returns

attention loss value

Return type

torch.Tensor

Returns

accuracy in attention decoder

Return type

float

recognize(x, recog_args, char_list=None, rnnlm=None)[source]

Recognize input speech.

Parameters
  • x (ndarray) – input acoustic feature (B, T, D) or (T, D)

  • recog_args (Namespace) – argument Namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

decoding result

Return type

list

espnet.nets.pytorch_backend.e2e_asr_mix_transformer

Transformer speech recognition model for single-channel multi-speaker mixture speech.

It is a fusion of e2e_asr_mix.py and e2e_asr_transformer.py. Refer to:

https://arxiv.org/pdf/2002.03921.pdf

  1. The Transformer-based Encoder now consists of three stages:

    (a) Enc_mix: encoding input mixture speech; (b) Enc_SD: separating mixed speech representations; (c) Enc_rec: transforming each separated speech representation.

  2. PIT is used in CTC to determine the permutation with minimum loss.

class espnet.nets.pytorch_backend.e2e_asr_mix_transformer.E2E(idim, odim, args, ignore_id=-1)[source]

Bases: espnet.nets.pytorch_backend.e2e_asr_transformer.E2E, espnet.nets.asr_interface.ASRInterface, torch.nn.modules.module.Module

E2E module.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

Construct an E2E object.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

static add_arguments(parser)[source]

Add arguments.

decoder_and_attention(hs_pad, hs_mask, ys_pad, batch_size)[source]

Forward decoder and attention loss.

encode(x)[source]

Encode acoustic features.

Parameters

x (ndarray) – source acoustic feature (T, D)

Returns

encoder outputs

Return type

torch.Tensor

forward(xs_pad, ilens, ys_pad)[source]

E2E forward.

Parameters
  • xs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax, idim)

  • ilens (torch.Tensor) – batch of lengths of source sequences (B)

  • ys_pad (torch.Tensor) – batch of padded target sequences (B, num_spkrs, Lmax)

Returns

ctc loss value

Return type

torch.Tensor

Returns

attention loss value

Return type

torch.Tensor

Returns

accuracy in attention decoder

Return type

float

recog(enc_output, recog_args, char_list=None, rnnlm=None, use_jit=False)[source]

Recognize input speech of each speaker.

Parameters
  • enc_output (ndarray) – encoder outputs (B, T, D) or (T, D)

  • recog_args (Namespace) – argument Namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list

recognize(x, recog_args, char_list=None, rnnlm=None, use_jit=False)[source]

Recognize input speech of each speaker.

Parameters
  • x (ndarray) – input acoustic feature (B, T, D) or (T, D)

  • recog_args (Namespace) – argument Namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list

espnet.nets.pytorch_backend.__init__

Initialize sub package.

espnet.nets.pytorch_backend.e2e_mt_transformer

Transformer text translation model (pytorch).

class espnet.nets.pytorch_backend.e2e_mt_transformer.E2E(idim, odim, args, ignore_id=-1)[source]

Bases: espnet.nets.mt_interface.MTInterface, torch.nn.modules.module.Module

E2E module.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

Construct an E2E object.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

static add_arguments(parser)[source]

Add arguments.

property attention_plot_class

Return PlotAttentionReport.

calculate_all_attentions(xs_pad, ilens, ys_pad)[source]

E2E attention calculation.

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)

Returns

attention weights (B, H, Lmax, Tmax)

Return type

float ndarray

encode(xs)[source]

Encode source sentences.

forward(xs_pad, ilens, ys_pad)[source]

E2E forward.

Parameters
  • xs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax)

  • ilens (torch.Tensor) – batch of lengths of source sequences (B)

  • ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)

Returns

loss value

Return type

torch.Tensor

Returns

attention loss value

Return type

torch.Tensor

Returns

accuracy in attention decoder

Return type

float

reset_parameters(args)[source]

Initialize parameters.

scorers()[source]

Scorers.

target_forcing(xs_pad, ys_pad=None, tgt_lang=None)[source]

Prepend target language IDs to source sentences for multilingual MT.

These tags are prepended in source/target sentences as pre-processing.

Parameters

xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax)

Returns

source text without language IDs

Return type

torch.Tensor

Returns

target text without language IDs

Return type

torch.Tensor

Returns

target language IDs

Return type

torch.Tensor (B, 1)

translate(x, trans_args, char_list=None)[source]

Translate source text.

Parameters
  • x (list) – input source text feature (T,)

  • trans_args (Namespace) – argument Namespace containing options

  • char_list (list) – list of characters

Returns

N-best decoding results

Return type

list

espnet.nets.pytorch_backend.e2e_mt

RNN sequence-to-sequence text translation model (pytorch).

class espnet.nets.pytorch_backend.e2e_mt.E2E(idim, odim, args)[source]

Bases: espnet.nets.mt_interface.MTInterface, torch.nn.modules.module.Module

E2E module.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

Construct an E2E object.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

static add_arguments(parser)[source]

Add arguments.

static attention_add_arguments(parser)[source]

Add arguments for the attention.

calculate_all_attentions(xs_pad, ilens, ys_pad)[source]

E2E attention calculation.

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)

Returns

attention weights with the following shape, 1) multi-head case => attention weights (B, H, Lmax, Tmax), 2) other case => attention weights (B, Lmax, Tmax).

Return type

float ndarray

static decoder_add_arguments(parser)[source]

Add arguments for the decoder.

static encoder_add_arguments(parser)[source]

Add arguments for the encoder.

forward(xs_pad, ilens, ys_pad)[source]

E2E forward.

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • ys_pad (torch.Tensor) – batch of padded token id sequence tensor (B, Lmax)

Returns

loss value

Return type

torch.Tensor

init_like_fairseq()[source]

Initialize weight like Fairseq.

Fairseq basically uses W, b, EmbedID.W ~ Uniform(-0.1, 0.1).

target_language_biasing(xs_pad, ilens, ys_pad)[source]

Prepend target language IDs to source sentences for multilingual MT.

These tags are prepended in source/target sentences as pre-processing.

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

Returns

source text without language IDs

Return type

torch.Tensor

Returns

target text without language IDs

Return type

torch.Tensor

Returns

target language IDs

Return type

torch.Tensor (B, 1)

translate(x, trans_args, char_list, rnnlm=None)[source]

E2E beam search.

Parameters
  • x (ndarray) – input source text feature (B, T, D)

  • trans_args (Namespace) – argument Namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list

translate_batch(xs, trans_args, char_list, rnnlm=None)[source]

E2E batch beam search.

Parameters
  • xs (list) – list of input source text feature arrays [(T_1, D), (T_2, D), …]

  • trans_args (Namespace) – argument Namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list

class espnet.nets.pytorch_backend.e2e_mt.Reporter(**links)[source]

Bases: chainer.link.Chain

A chainer reporter wrapper.

report(loss, acc, ppl, bleu)[source]

Report at every step.

espnet.nets.pytorch_backend.e2e_asr_mix

This script is used to construct End-to-End models of multi-speaker ASR.

Copyright 2017 Johns Hopkins University (Shinji Watanabe)

Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)

class espnet.nets.pytorch_backend.e2e_asr_mix.E2E(idim, odim, args)[source]

Bases: espnet.nets.asr_interface.ASRInterface, torch.nn.modules.module.Module

E2E module.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • args (Namespace) – argument Namespace containing options

Initialize multi-speaker E2E module.

static add_arguments(parser)[source]

Add arguments.

calculate_all_attentions(xs_pad, ilens, ys_pad)[source]

E2E attention calculation.

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, num_spkrs, Lmax)

Returns

attention weights with the following shape, 1) multi-head case => attention weights (B, H, Lmax, Tmax), 2) other case => attention weights (B, Lmax, Tmax).

Return type

float ndarray

static encoder_mix_add_arguments(parser)[source]

Add arguments for multi-speaker encoder.

enhance(xs)[source]

Forward only the frontend stage.

Parameters

xs (ndarray) – input acoustic feature (T, C, F)

forward(xs_pad, ilens, ys_pad)[source]

E2E forward.

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, num_spkrs, Lmax)

Returns

ctc loss value

Return type

torch.Tensor

Returns

attention loss value

Return type

torch.Tensor

Returns

accuracy in attention decoder

Return type

float

init_like_chainer()[source]

Initialize weight like chainer.

chainer basically uses the LeCun way: W ~ Normal(0, fan_in ** -0.5), b = 0. pytorch basically uses W, b ~ Uniform(-fan_in ** -0.5, fan_in ** -0.5). However, there are two exceptions as far as I know: EmbedID.W ~ Normal(0, 1), and LSTM.upward.b[forget_gate_range] = 1 (but not used in NStepLSTM).

recognize(x, recog_args, char_list, rnnlm=None)[source]

E2E beam search.

Parameters
  • x (ndarray) – input acoustic feature (T, D)

  • recog_args (Namespace) – argument Namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list

recognize_batch(xs, recog_args, char_list, rnnlm=None)[source]

E2E batch beam search.

Parameters
  • xs (list) – list of input acoustic feature arrays [(T_1, D), (T_2, D), …]

  • recog_args (Namespace) – argument Namespace containing options

  • char_list (list) – list of characters

  • rnnlm (torch.nn.Module) – language model module

Returns

N-best decoding results

Return type

list

class espnet.nets.pytorch_backend.e2e_asr_mix.EncoderMix(etype, idim, elayers_sd, elayers_rec, eunits, eprojs, subsample, dropout, num_spkrs=2, in_channel=1)[source]

Bases: torch.nn.modules.module.Module

Encoder module for the case of multi-speaker mixture speech.

Parameters
  • etype (str) – type of encoder network

  • idim (int) – number of dimensions of encoder network

  • elayers_sd (int) – number of layers of speaker differentiate part in encoder network

  • elayers_rec (int) – number of layers of shared recognition part in encoder network

  • eunits (int) – number of lstm units of encoder network

  • eprojs (int) – number of projection units of encoder network

  • subsample (np.ndarray) – list of subsampling numbers

  • dropout (float) – dropout rate

  • in_channel (int) – number of input channels

  • num_spkrs (int) – number of speakers

Initialize the encoder of single-channel multi-speaker ASR.

forward(xs_pad, ilens)[source]

Encodermix forward.

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, D)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

Returns

batch of hidden state sequences [num_spkrs x (B, Tmax, eprojs)]

Return type

list of torch.Tensor

class espnet.nets.pytorch_backend.e2e_asr_mix.PIT(num_spkrs)[source]

Bases: object

Permutation Invariant Training (PIT) module.

Parameters

num_spkrs (int) – number of speakers for PIT process (2 or 3)

Initialize PIT module.

min_pit_sample(loss)[source]

Compute the PIT loss for each sample.

Parameters

loss (torch.Tensor) – 1-D list of losses for one sample, including [h1r1, h1r2, h2r1, h2r2] or [h1r1, h1r2, h1r3, h2r1, h2r2, h2r3, h3r1, h3r2, h3r3]

Returns

minimum loss of the best permutation

Return type

torch.Tensor (1)

Returns

the best permutation

Return type

List: len=2

permutationDFS(source, start)[source]

Get permutations with DFS.

The final result is all permutations of the ‘source’ sequence, e.g. [[1, 2], [2, 1]] or [[1, 2, 3], [1, 3, 2], [2, 1, 3], [2, 3, 1], [3, 2, 1], [3, 1, 2]].

Parameters
  • source (np.ndarray) – (num_spkrs, 1), e.g. [1, 2, …, N]

  • start (int) – the start point to permute

pit_process(losses)[source]

Compute the PIT loss for a batch.

Parameters

losses (torch.Tensor) – losses (B, 1|4|9)

Returns

minimum losses of a batch with the best permutation

Return type

torch.Tensor (B)

Returns

the best permutation

Return type

torch.LongTensor (B, 1|2|3)
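
A minimal sketch of the permutation search for the two-speaker case, assuming the 1-D loss layout [h1r1, h1r2, h2r1, h2r2] described above (the helper name is hypothetical, not the module's API):

import torch
from itertools import permutations

def min_pit_loss_sketch(loss, num_spkrs=2):
    # loss[i * num_spkrs + j] holds the loss of hypothesis i against reference j
    perms = list(permutations(range(num_spkrs)))
    totals = torch.stack([
        sum(loss[i * num_spkrs + p[i]] for i in range(num_spkrs)) / num_spkrs
        for p in perms
    ])
    best = int(totals.argmin())
    return totals[best], list(perms[best])

# layout [h1r1, h1r2, h2r1, h2r2]: the identity permutation wins here
# min_pit_loss_sketch(torch.tensor([0.1, 0.9, 0.8, 0.2])) -> (tensor(0.1500), [0, 1])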

espnet.nets.pytorch_backend.e2e_asr_mix.encoder_for(args, idim, subsample)[source]

Construct the encoder.

espnet.nets.pytorch_backend.initialization

Initialization functions for RNN sequence-to-sequence models.

espnet.nets.pytorch_backend.initialization.lecun_normal_init_parameters(module)[source]

Initialize parameters in LeCun's manner.

espnet.nets.pytorch_backend.initialization.set_forget_bias_to_one(bias)[source]

Initialize a bias vector in the forget gate with one.

espnet.nets.pytorch_backend.initialization.uniform_init_parameters(module)[source]

Initialize parameters with a uniform distribution.

espnet.nets.pytorch_backend.frontends.feature_transform

class espnet.nets.pytorch_backend.frontends.feature_transform.FeatureTransform(fs: int = 16000, n_fft: int = 512, n_mels: int = 80, fmin: float = 0.0, fmax: float = None, stats_file: str = None, apply_uttmvn: bool = True, uttmvn_norm_means: bool = True, uttmvn_norm_vars: bool = False)[source]

Bases: torch.nn.modules.module.Module

forward(x: torch_complex.tensor.ComplexTensor, ilens: Union[torch.LongTensor, numpy.ndarray, List[int]]) → Tuple[torch.Tensor, torch.LongTensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet.nets.pytorch_backend.frontends.feature_transform.GlobalMVN(stats_file: str, norm_means: bool = True, norm_vars: bool = True, eps: float = 1e-20)[source]

Bases: torch.nn.modules.module.Module

Apply global mean and variance normalization

Parameters
  • stats_file (str) – npy file of a 1-dim array or text file. Elements from the first up to the {(len(array) - 1) / 2}th are treated as the sum of features, the rest excluding the last element are treated as the sum of the squared features, and the last element equals the number of samples.

  • eps (float) –

extra_repr()[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(x: torch.Tensor, ilens: torch.LongTensor) → Tuple[torch.Tensor, torch.LongTensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet.nets.pytorch_backend.frontends.feature_transform.LogMel(fs: int = 16000, n_fft: int = 512, n_mels: int = 80, fmin: float = 0.0, fmax: float = None, htk: bool = False, norm=1)[source]

Bases: torch.nn.modules.module.Module

Convert STFT to fbank feats

The arguments are the same as those of librosa.filters.mel.

Parameters
  • fs – number > 0 [scalar] sampling rate of the incoming signal

  • n_fft – int > 0 [scalar] number of FFT components

  • n_mels – int > 0 [scalar] number of Mel bands to generate

  • fmin – float >= 0 [scalar] lowest frequency (in Hz)

  • fmax – float >= 0 [scalar] highest frequency (in Hz). If None, use fmax = fs / 2.0

  • htk – use HTK formula instead of Slaney

  • norm – {None, 1, np.inf} [scalar] if 1, divide the triangular mel weights by the width of the mel band (area normalization). Otherwise, leave all the triangles aiming for a peak value of 1.0

extra_repr()[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(feat: torch.Tensor, ilens: torch.LongTensor) → Tuple[torch.Tensor, torch.LongTensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet.nets.pytorch_backend.frontends.feature_transform.UtteranceMVN(norm_means: bool = True, norm_vars: bool = False, eps: float = 1e-20)[source]

Bases: torch.nn.modules.module.Module

extra_repr()[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(x: torch.Tensor, ilens: torch.LongTensor) → Tuple[torch.Tensor, torch.LongTensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet.nets.pytorch_backend.frontends.feature_transform.feature_transform_for(args, n_fft)[source]
espnet.nets.pytorch_backend.frontends.feature_transform.utterance_mvn(x: torch.Tensor, ilens: torch.LongTensor, norm_means: bool = True, norm_vars: bool = False, eps: float = 1e-20) → Tuple[torch.Tensor, torch.LongTensor][source]

Apply utterance mean and variance normalization

Parameters
  • x – (B, T, D), assumed zero padded

  • ilens – (B,)

  • norm_means

  • norm_vars

  • eps
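
A minimal usage sketch with random features, following the documented signature:

>>> import torch
>>> x = torch.randn(2, 100, 80)         # (B, T, D), zero padded
>>> ilens = torch.tensor([100, 60])     # (B,)
>>> x_norm, ilens = utterance_mvn(x, ilens, norm_means=True, norm_vars=False)
>>> x_norm.shape
torch.Size([2, 100, 80])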

espnet.nets.pytorch_backend.frontends.dnn_beamformer

class espnet.nets.pytorch_backend.frontends.dnn_beamformer.AttentionReference(bidim, att_dim)[source]

Bases: torch.nn.modules.module.Module

forward(psd_in: torch_complex.tensor.ComplexTensor, ilens: torch.LongTensor, scaling: float = 2.0) → Tuple[torch.Tensor, torch.LongTensor][source]

The forward function

Parameters
  • psd_in (ComplexTensor) – (B, F, C, C)

  • ilens (torch.Tensor) – (B,)

  • scaling (float) –

Returns

(B, C) ilens (torch.Tensor): (B,)

Return type

u (torch.Tensor)

class espnet.nets.pytorch_backend.frontends.dnn_beamformer.DNN_Beamformer(bidim, btype='blstmp', blayers=3, bunits=300, bprojs=320, bnmask=2, dropout_rate=0.0, badim=320, ref_channel: int = -1, beamformer_type='mvdr')[source]

Bases: torch.nn.modules.module.Module

DNN mask based Beamformer

Citation:

Multichannel End-to-end Speech Recognition; T. Ochiai et al., 2017; https://arxiv.org/abs/1703.04783

forward(data: torch_complex.tensor.ComplexTensor, ilens: torch.LongTensor) → Tuple[torch_complex.tensor.ComplexTensor, torch.LongTensor, torch_complex.tensor.ComplexTensor][source]

The forward function

Notation:

B: batch, C: channel, T: time (sequence length), F: frequency

Parameters
  • data (ComplexTensor) – (B, T, C, F)

  • ilens (torch.Tensor) – (B,)

Returns

(B, T, F) ilens (torch.Tensor): (B,)

Return type

enhanced (ComplexTensor)

espnet.nets.pytorch_backend.frontends.mask_estimator

class espnet.nets.pytorch_backend.frontends.mask_estimator.MaskEstimator(type, idim, layers, units, projs, dropout, nmask=1)[source]

Bases: torch.nn.modules.module.Module

forward(xs: torch_complex.tensor.ComplexTensor, ilens: torch.LongTensor) → Tuple[Tuple[torch.Tensor, ...], torch.LongTensor][source]

The forward function

Parameters
  • xs – (B, F, C, T)

  • ilens – (B,)

Returns

The hidden vector (B, F, C, T) masks: A tuple of the masks. (B, F, C, T) ilens: (B,)

Return type

hs (torch.Tensor)

espnet.nets.pytorch_backend.frontends.beamformer

espnet.nets.pytorch_backend.frontends.beamformer.apply_beamforming_vector(beamform_vector: torch_complex.tensor.ComplexTensor, mix: torch_complex.tensor.ComplexTensor) → torch_complex.tensor.ComplexTensor[source]
espnet.nets.pytorch_backend.frontends.beamformer.get_mvdr_vector(psd_s: torch_complex.tensor.ComplexTensor, psd_n: torch_complex.tensor.ComplexTensor, reference_vector: torch.Tensor, eps: float = 1e-15) → torch_complex.tensor.ComplexTensor[source]

Return the MVDR(Minimum Variance Distortionless Response) vector:

h = (Npsd^-1 @ Spsd) / (Tr(Npsd^-1 @ Spsd)) @ u

Reference:

On optimal frequency-domain multichannel linear filtering for noise reduction; M. Souden et al., 2010; https://ieeexplore.ieee.org/document/5089420

Parameters
  • psd_s (ComplexTensor) – (…, F, C, C)

  • psd_n (ComplexTensor) – (…, F, C, C)

  • reference_vector (torch.Tensor) – (…, C)

  • eps (float) –

Returns

(…, F, C)

Return type

beamform_vector (ComplexTensor)
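
A minimal sketch of the formula above, using PyTorch's native complex tensors (1.8+) instead of torch_complex; names are illustrative and the library's solve/einsum details may differ:

import torch

def mvdr_sketch(psd_s, psd_n, u, eps=1e-15):
    # psd_s, psd_n: (..., F, C, C) complex PSD matrices; u: (..., C) one-hot reference
    C = psd_n.size(-1)
    eye = torch.eye(C, dtype=psd_n.dtype, device=psd_n.device)
    numerator = torch.linalg.solve(psd_n + eps * eye, psd_s)   # Npsd^-1 @ Spsd
    trace = numerator.diagonal(dim1=-2, dim2=-1).sum(-1)       # Tr(Npsd^-1 @ Spsd)
    ws = numerator / (trace[..., None, None] + eps)
    return torch.einsum("...fec,...c->...fe", ws, u.to(ws.dtype))  # (..., F, C)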

espnet.nets.pytorch_backend.frontends.beamformer.get_power_spectral_density_matrix(xs: torch_complex.tensor.ComplexTensor, mask: torch.Tensor, normalization=True, eps: float = 1e-15) → torch_complex.tensor.ComplexTensor[source]

Return cross-channel power spectral density (PSD) matrix

Parameters
  • xs (ComplexTensor) – (…, F, C, T)

  • mask (torch.Tensor) – (…, F, C, T)

  • normalization (bool) –

  • eps (float) –

Returns

psd (ComplexTensor): (…, F, C, C)
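
A minimal sketch of the computation, psd(f) = sum_t m(t, f) x(t, f) x(t, f)^H, optionally divided by sum_t m(t, f); it again uses native complex tensors, and the channel-averaging of the mask is an assumption based on typical implementations:

import torch

def psd_sketch(xs, mask, normalization=True, eps=1e-15):
    # xs: (..., F, C, T) complex STFT; mask: (..., F, C, T) real mask in [0, 1]
    mask = mask.mean(dim=-2)                                   # one weight per (F, T) bin
    psd = torch.einsum("...ct,...et->...ce", xs * mask[..., None, :], xs.conj())
    if normalization:
        psd = psd / (mask.sum(dim=-1)[..., None, None] + eps)  # divide by the mask power
    return psd                                                 # (..., F, C, C)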

espnet.nets.pytorch_backend.frontends.dnn_wpe

class espnet.nets.pytorch_backend.frontends.dnn_wpe.DNN_WPE(wtype: str = 'blstmp', widim: int = 257, wlayers: int = 3, wunits: int = 300, wprojs: int = 320, dropout_rate: float = 0.0, taps: int = 5, delay: int = 3, use_dnn_mask: bool = True, iterations: int = 1, normalization: bool = False)[source]

Bases: torch.nn.modules.module.Module

forward(data: torch_complex.tensor.ComplexTensor, ilens: torch.LongTensor) → Tuple[torch_complex.tensor.ComplexTensor, torch.LongTensor, torch_complex.tensor.ComplexTensor][source]

The forward function

Notation:

B: batch, C: channel, T: time (sequence length), F: frequency (or some dimension of the feature vector)

Parameters
  • data – (B, C, T, F)

  • ilens – (B,)

Returns

(B, C, T, F) ilens: (B,)

Return type

data

espnet.nets.pytorch_backend.frontends.__init__

Initialize sub package.

espnet.nets.pytorch_backend.frontends.frontend

class espnet.nets.pytorch_backend.frontends.frontend.Frontend(idim: int, use_wpe: bool = False, wtype: str = 'blstmp', wlayers: int = 3, wunits: int = 300, wprojs: int = 320, wdropout_rate: float = 0.0, taps: int = 5, delay: int = 3, use_dnn_mask_for_wpe: bool = True, use_beamformer: bool = False, btype: str = 'blstmp', blayers: int = 3, bunits: int = 300, bprojs: int = 320, bnmask: int = 2, badim: int = 320, ref_channel: int = -1, bdropout_rate=0.0)[source]

Bases: torch.nn.modules.module.Module

forward(x: torch_complex.tensor.ComplexTensor, ilens: Union[torch.LongTensor, numpy.ndarray, List[int]]) → Tuple[torch_complex.tensor.ComplexTensor, torch.LongTensor, Optional[torch_complex.tensor.ComplexTensor]][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet.nets.pytorch_backend.frontends.frontend.frontend_for(args, idim)[source]

espnet.nets.pytorch_backend.lm.seq_rnn

Sequential implementation of Recurrent Neural Network Language Model.

class espnet.nets.pytorch_backend.lm.seq_rnn.SequentialRNNLM(n_vocab, args)[source]

Bases: espnet.nets.lm_interface.LMInterface, torch.nn.modules.module.Module

Sequential RNNLM.

Initialize class.

Parameters
  • n_vocab (int) – The size of the vocabulary

  • args (argparse.Namespace) – Configurations; see add_arguments()

static add_arguments(parser)[source]

Add arguments to command line argument parser.

forward(x, t)[source]

Compute LM loss value from buffer sequences.

Parameters
  • x (torch.Tensor) – Input ids. (batch, len)

  • t (torch.Tensor) – Target ids. (batch, len)

Returns

Tuple of

loss to backward (scalar), negative log-likelihood of t: -log p(t) (scalar) and the number of elements in x (scalar)

Return type

tuple[torch.Tensor, torch.Tensor, torch.Tensor]

Notes

The last two return values are used to compute perplexity: p(t)^{-1/n} = exp(-log p(t) / n)
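
For example, perplexity over a dataset can be accumulated from these values (a sketch; nll and count denote the summed second and third return values):

import math

def perplexity(nll: float, count: int) -> float:
    # ppl = p(t)^{-1/n} = exp(-log p(t) / n)
    return math.exp(nll / count)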

init_state(x)[source]

Get an initial state for decoding.

Parameters

x (torch.Tensor) – The encoded feature tensor

Returns: initial state

score(y, state, x)[source]

Score new token.

Parameters
  • y (torch.Tensor) – 1D torch.int64 prefix tokens.

  • state – Scorer state for prefix tokens

  • x (torch.Tensor) – 2D encoder feature that generates ys.

Returns

Tuple of

torch.float32 scores for next token (n_vocab) and next state for ys

Return type

tuple[torch.Tensor, Any]

espnet.nets.pytorch_backend.lm.transformer

Transformer language model.

class espnet.nets.pytorch_backend.lm.transformer.TransformerLM(n_vocab, args)[source]

Bases: torch.nn.modules.module.Module, espnet.nets.lm_interface.LMInterface, espnet.nets.scorer_interface.BatchScorerInterface

Transformer language model.

Initialize class.

Parameters
  • n_vocab (int) – The size of the vocabulary

  • args (argparse.Namespace) – Configurations; see add_arguments()

static add_arguments(parser)[source]

Add arguments to command line argument parser.

batch_score(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]

Score new token batch (required).

Parameters
  • ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).

  • states (List[Any]) – Scorer states for prefix tokens.

  • xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).

Returns

Tuple of

batched scores for the next token with shape of (n_batch, n_vocab) and the next state list for ys.

Return type

tuple[torch.Tensor, List[Any]]

forward(x: torch.Tensor, t: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Compute LM loss value from buffer sequences.

Parameters
  • x (torch.Tensor) – Input ids. (batch, len)

  • t (torch.Tensor) – Target ids. (batch, len)

Returns

Tuple of

loss to backward (scalar), negative log-likelihood of t: -log p(t) (scalar) and the number of elements in x (scalar)

Return type

tuple[torch.Tensor, torch.Tensor, torch.Tensor]

Notes

The last two return values are used in perplexity: p(t)^{-n} = exp(-log p(t) / n)

score(y: torch.Tensor, state: Any, x: torch.Tensor) → Tuple[torch.Tensor, Any][source]

Score new token.

Parameters
  • y (torch.Tensor) – 1D torch.int64 prefix tokens.

  • state – Scorer state for prefix tokens

  • x (torch.Tensor) – encoder feature that generates ys.

Returns

Tuple of

torch.float32 scores for next token (n_vocab) and next state for ys

Return type

tuple[torch.Tensor, Any]

espnet.nets.pytorch_backend.lm.default

Default Recurrent Neural Network Language Model in lm_train.py.

class espnet.nets.pytorch_backend.lm.default.ClassifierWithState(predictor, lossfun=CrossEntropyLoss(), label_key=-1)[source]

Bases: torch.nn.modules.module.Module

A wrapper for pytorch RNNLM.

Initialize class.

Parameters
  • predictor (torch.nn.Module) – The RNNLM

  • lossfun (function) – The loss function to use

  • label_key (int or str) – Key to access the ground-truth label in args/kwargs

buff_predict(state, x, n)[source]

Predict new tokens from buffered inputs.

final(state, index=None)[source]

Predict final log probabilities for given state using the predictor.

Parameters

state – The state

Returns

The final log probabilities

Return type

torch.Tensor

forward(state, *args, **kwargs)[source]

Compute the loss value for an input and label pair.

Notes

It also computes accuracy and stores it in an attribute. When label_key is an int, the corresponding element in args is treated as the ground-truth labels; when it is a str, the element in kwargs is used. All elements of args and kwargs except the ground-truth labels are features; they are fed to the predictor and the result is compared with the ground-truth labels.

Parameters
  • state (torch.Tensor) – The LM state

  • args (list[torch.Tensor]) – Input minibatch

  • kwargs (dict[torch.Tensor]) – Input minibatch

Returns

loss value

Return type

torch.Tensor

predict(state, x)[source]

Predict log probabilities for given state and input x using the predictor.

Parameters
  • state (torch.Tensor) – The current state

  • x (torch.Tensor) – The input

Returns

a tuple (new state, log prob vector)

Return type

(torch.Tensor, torch.Tensor)

class espnet.nets.pytorch_backend.lm.default.DefaultRNNLM(n_vocab, args)[source]

Bases: espnet.nets.scorer_interface.BatchScorerInterface, espnet.nets.lm_interface.LMInterface, torch.nn.modules.module.Module

Default RNNLM for LMInterface Implementation.

Note

PyTorch seems to have a memory leak when a single GPU computes this after data-parallel training; if parallel GPUs compute it, it seems to be fine. See also https://github.com/espnet/espnet/issues/1075

Initialize class.

Parameters
  • n_vocab (int) – The size of the vocabulary

  • args (argparse.Namespace) – Configurations; see add_arguments()

static add_arguments(parser)[source]

Add arguments to command line argument parser.

batch_score(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]

Score new token batch.

Parameters
  • ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).

  • states (List[Any]) – Scorer states for prefix tokens.

  • xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).

Returns

Tuple of

batched scores for the next token with shape of (n_batch, n_vocab) and the next state list for ys.

Return type

tuple[torch.Tensor, List[Any]]

final_score(state)[source]

Score eos.

Parameters

state – Scorer state for prefix tokens

Returns

final score

Return type

float

forward(x, t)[source]

Compute LM loss value from buffer sequences.

Parameters
  • x (torch.Tensor) – Input ids. (batch, len)

  • t (torch.Tensor) – Target ids. (batch, len)

Returns

Tuple of

loss to backward (scalar), negative log-likelihood of t: -log p(t) (scalar) and the number of elements in x (scalar)

Return type

tuple[torch.Tensor, torch.Tensor, torch.Tensor]

Notes

The last two return values are used to compute perplexity: p(t)^{-1/n} = exp(-log p(t) / n)

load_state_dict(d)[source]

Load state dict.

score(y, state, x)[source]

Score new token.

Parameters
  • y (torch.Tensor) – 1D torch.int64 prefix tokens.

  • state – Scorer state for prefix tokens

  • x (torch.Tensor) – 2D encoder feature that generates ys.

Returns

Tuple of

torch.float32 scores for next token (n_vocab) and next state for ys

Return type

tuple[torch.Tensor, Any]

state_dict()[source]

Dump state dict.

class espnet.nets.pytorch_backend.lm.default.RNNLM(n_vocab, n_layers, n_units, n_embed=None, typ='lstm', dropout_rate=0.5)[source]

Bases: torch.nn.modules.module.Module

A pytorch RNNLM.

Initialize class.

Parameters
  • n_vocab (int) – The size of the vocabulary

  • n_layers (int) – The number of layers to create

  • n_units (int) – The number of units per layer

  • n_embed (int, optional) – The number of embedding units

  • typ (str) – The RNN type

  • dropout_rate (float, optional) – Dropout rate

forward(state, x)[source]

Forward neural networks.

zero_state(batchsize)[source]

Initialize state.

espnet.nets.pytorch_backend.lm.__init__

Initialize sub package.

espnet.nets.pytorch_backend.maskctc.add_mask_token

Token masking module for Masked LM.

espnet.nets.pytorch_backend.maskctc.add_mask_token.mask_uniform(ys_pad, mask_token, eos, ignore_id)[source]

Replace random tokens with the <mask> label and add an <eos> label.

The number of <mask> tokens is sampled from a uniform distribution between one and the target sequence’s length.

Parameters
  • ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)

  • mask_token (int) – index of <mask>

  • eos (int) – index of <eos>

  • ignore_id (int) – index of padding

Returns

padded masked input tensor (B, Lmax) and padded target tensor (B, Lmax)

Return type

tuple[torch.Tensor, torch.Tensor]
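
A minimal sketch of this masking scheme (the helper below is illustrative; the implementation's exact sampling and padding handling may differ):

import torch
from torch.nn.utils.rnn import pad_sequence

def mask_uniform_sketch(ys_pad, mask_token, eos, ignore_id):
    ys = [y[y != ignore_id] for y in ys_pad]                 # strip padding
    ys_in, ys_out = [], []
    for y in ys:
        n_mask = torch.randint(1, len(y) + 1, (1,)).item()   # uniform in [1, len(y)]
        idx = torch.randperm(len(y))[:n_mask]                # positions to mask
        y_in = y.clone()
        y_in[idx] = mask_token
        y_out = torch.full_like(y, ignore_id)                # loss only on masked slots
        y_out[idx] = y[idx]
        ys_in.append(y_in)
        ys_out.append(y_out)
    return (pad_sequence(ys_in, batch_first=True, padding_value=eos),
            pad_sequence(ys_out, batch_first=True, padding_value=ignore_id))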

espnet.nets.pytorch_backend.maskctc.__init__

Initialize sub package.

espnet.nets.pytorch_backend.maskctc.mask

Attention masking module for Masked LM.

espnet.nets.pytorch_backend.maskctc.mask.square_mask(ys_in_pad, ignore_id)[source]

Create attention mask to avoid attending on padding tokens.

Parameters
  • ys_in_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)

  • ignore_id (int) – index of padding

Return type

torch.Tensor (B, Lmax, Lmax)
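
A minimal sketch of the mask construction (illustrative names):

import torch

def square_mask_sketch(ys_in_pad, ignore_id):
    non_pad = ys_in_pad != ignore_id                        # (B, Lmax), True on real tokens
    # outer product of the padding mask: (i, j) is attendable only if
    # both token i and token j are real (non-padding) tokens
    return non_pad.unsqueeze(-2) & non_pad.unsqueeze(-1)    # (B, Lmax, Lmax)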

espnet.nets.pytorch_backend.tacotron2.decoder

Tacotron2 decoder related modules.

class espnet.nets.pytorch_backend.tacotron2.decoder.Decoder(idim, odim, att, dlayers=2, dunits=1024, prenet_layers=2, prenet_units=256, postnet_layers=5, postnet_chans=512, postnet_filts=5, output_activation_fn=None, cumulate_att_w=True, use_batch_norm=True, use_concate=True, dropout_rate=0.5, zoneout_rate=0.1, reduction_factor=1)[source]

Bases: torch.nn.modules.module.Module

Decoder module of Spectrogram prediction network.

This is a module of the decoder of the Spectrogram prediction network in Tacotron2, which is described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. The decoder generates the sequence of features from the sequence of hidden states.

Initialize Tacotron2 decoder module.

Parameters
  • idim (int) – Dimension of the inputs.

  • odim (int) – Dimension of the outputs.

  • att (torch.nn.Module) – Instance of attention class.

  • dlayers (int, optional) – The number of decoder lstm layers.

  • dunits (int, optional) – The number of decoder lstm units.

  • prenet_layers (int, optional) – The number of prenet layers.

  • prenet_units (int, optional) – The number of prenet units.

  • postnet_layers (int, optional) – The number of postnet layers.

  • postnet_filts (int, optional) – Kernel size of postnet filters.

  • postnet_chans (int, optional) – The number of postnet filter channels.

  • output_activation_fn (torch.nn.Module, optional) – Activation function for outputs.

  • cumulate_att_w (bool, optional) – Whether to cumulate previous attention weight.

  • use_batch_norm (bool, optional) – Whether to use batch normalization.

  • use_concate (bool, optional) – Whether to concatenate encoder embedding with decoder lstm outputs.

  • dropout_rate (float, optional) – Dropout rate.

  • zoneout_rate (float, optional) – Zoneout rate.

  • reduction_factor (int, optional) – Reduction factor.

calculate_all_attentions(hs, hlens, ys)[source]

Calculate all of the attention weights.

Parameters
  • hs (Tensor) – Batch of the sequences of padded hidden states (B, Tmax, idim).

  • hlens (LongTensor) – Batch of lengths of each input batch (B,).

  • ys (Tensor) – Batch of the sequences of padded target features (B, Lmax, odim).

Returns

Batch of attention weights (B, Lmax, Tmax).

Return type

numpy.ndarray

Note

This computation is performed in teacher-forcing manner.

forward(hs, hlens, ys)[source]

Calculate forward propagation.

Parameters
  • hs (Tensor) – Batch of the sequences of padded hidden states (B, Tmax, idim).

  • hlens (LongTensor) – Batch of lengths of each input batch (B,).

  • ys (Tensor) – Batch of the sequences of padded target features (B, Lmax, odim).

Returns

Batch of output tensors after postnet (B, Lmax, odim). Tensor: Batch of output tensors before postnet (B, Lmax, odim). Tensor: Batch of logits of stop prediction (B, Lmax). Tensor: Batch of attention weights (B, Lmax, Tmax).

Return type

Tensor

Note

This computation is performed in teacher-forcing manner.

inference(h, threshold=0.5, minlenratio=0.0, maxlenratio=10.0, use_att_constraint=False, backward_window=None, forward_window=None)[source]

Generate the sequence of features given the sequences of characters.

Parameters
  • h (Tensor) – Input sequence of encoder hidden states (T, C).

  • threshold (float, optional) – Threshold to stop generation.

  • minlenratio (float, optional) – Minimum length ratio. If set to 1.0 and the length of input is 10, the minimum length of outputs will be 10 * 1 = 10.

  • maxlenratio (float, optional) – Maximum length ratio. If set to 10 and the length of input is 10, the maximum length of outputs will be 10 * 10 = 100.

  • use_att_constraint (bool) – Whether to apply attention constraint introduced in Deep Voice 3.

  • backward_window (int) – Backward window size in attention constraint.

  • forward_window (int) – Forward window size in attention constraint.

Returns

Output sequence of features (L, odim). Tensor: Output sequence of stop probabilities (L,). Tensor: Attention weights (L, T).

Return type

Tensor

Note

This computation is performed in auto-regressive manner.

class espnet.nets.pytorch_backend.tacotron2.decoder.Postnet(idim, odim, n_layers=5, n_chans=512, n_filts=5, dropout_rate=0.5, use_batch_norm=True)[source]

Bases: torch.nn.modules.module.Module

Postnet module for Spectrogram prediction network.

This is a module of the Postnet in the Spectrogram prediction network, which is described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. The Postnet refines the Mel-filterbank predicted by the decoder, which helps to compensate for the fine structure of the spectrogram.

Initialize postnet module.

Parameters
  • idim (int) – Dimension of the inputs.

  • odim (int) – Dimension of the outputs.

  • n_layers (int, optional) – The number of layers.

  • n_filts (int, optional) – Kernel size of the filters.

  • n_chans (int, optional) – The number of filter channels.

  • use_batch_norm (bool, optional) – Whether to use batch normalization.

  • dropout_rate (float, optional) – Dropout rate.

forward(xs)[source]

Calculate forward propagation.

Parameters

xs (Tensor) – Batch of the sequences of padded input tensors (B, idim, Tmax).

Returns

Batch of padded output tensor. (B, odim, Tmax).

Return type

Tensor

class espnet.nets.pytorch_backend.tacotron2.decoder.Prenet(idim, n_layers=2, n_units=256, dropout_rate=0.5)[source]

Bases: torch.nn.modules.module.Module

Prenet module for decoder of Spectrogram prediction network.

This is a module of the Prenet in the decoder of the Spectrogram prediction network, which is described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. The Prenet performs a nonlinear conversion of the inputs before they are fed to the autoregressive LSTM, which helps to learn diagonal attentions.

Note

This module always applies dropout, even in evaluation mode. See the details in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.

Initialize prenet module.

Parameters
  • idim (int) – Dimension of the inputs.

  • n_layers (int, optional) – The number of prenet layers.

  • n_units (int, optional) – The number of prenet units.

forward(x)[source]

Calculate forward propagation.

Parameters

x (Tensor) – Batch of input tensors (B, …, idim).

Returns

Batch of output tensors (B, …, odim).

Return type

Tensor

class espnet.nets.pytorch_backend.tacotron2.decoder.ZoneOutCell(cell, zoneout_rate=0.1)[source]

Bases: torch.nn.modules.module.Module

ZoneOut Cell module.

This is a module of zoneout described in Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. This code is modified from eladhoffer/seq2seq.pytorch.

Examples

>>> lstm = torch.nn.LSTMCell(16, 32)
>>> lstm = ZoneOutCell(lstm, 0.5)

Initialize zone out cell module.

Parameters
  • cell (torch.nn.Module) – PyTorch recurrent cell module, e.g., torch.nn.LSTMCell.

  • zoneout_rate (float, optional) – Probability of zoneout from 0.0 to 1.0.

forward(inputs, hidden)[source]

Calculate forward propagation.

Parameters
  • inputs (Tensor) – Batch of input tensor (B, input_size).

  • hidden (tuple) –

    • Tensor: Batch of initial hidden states (B, hidden_size).

    • Tensor: Batch of initial cell states (B, hidden_size).

Returns

  • Tensor: Batch of next hidden states (B, hidden_size).

  • Tensor: Batch of next cell states (B, hidden_size).

Return type

tuple
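
Conceptually, zoneout keeps each state unit from the previous step with probability zoneout_rate instead of updating it. A minimal sketch of the per-state update (the wrapped cell produces the candidate next state; names are illustrative):

import torch

def zoneout_update(h_prev, h_next, rate, training=True):
    # h_prev, h_next: (B, hidden_size); applied to hidden and cell states alike
    if training:
        keep = torch.bernoulli(torch.full_like(h_next, rate))  # 1 -> keep the old value
        return keep * h_prev + (1 - keep) * h_next
    # at eval time, interpolate with the expected keep probability
    return rate * h_prev + (1 - rate) * h_next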

espnet.nets.pytorch_backend.tacotron2.decoder.decoder_init(m)[source]

Initialize decoder parameters.

espnet.nets.pytorch_backend.tacotron2.cbhg

CBHG related modules.

class espnet.nets.pytorch_backend.tacotron2.cbhg.CBHG(idim, odim, conv_bank_layers=8, conv_bank_chans=128, conv_proj_filts=3, conv_proj_chans=256, highway_layers=4, highway_units=128, gru_units=256)[source]

Bases: torch.nn.modules.module.Module

CBHG module to convert log Mel-filterbanks to linear spectrogram.

This is a module of CBHG introduced in Tacotron: Towards End-to-End Speech Synthesis. The CBHG converts the sequence of log Mel-filterbanks into linear spectrogram.

Initialize CBHG module.

Parameters
  • idim (int) – Dimension of the inputs.

  • odim (int) – Dimension of the outputs.

  • conv_bank_layers (int, optional) – The number of convolution bank layers.

  • conv_bank_chans (int, optional) – The number of channels in convolution bank.

  • conv_proj_filts (int, optional) – Kernel size of convolutional projection layer.

  • conv_proj_chans (int, optional) – The number of channels in convolutional projection layer.

  • highway_layers (int, optional) – The number of highway network layers.

  • highway_units (int, optional) – The number of highway network units.

  • gru_units (int, optional) – The number of GRU units (for both directions).

forward(xs, ilens)[source]

Calculate forward propagation.

Parameters
  • xs (Tensor) – Batch of the padded sequences of inputs (B, Tmax, idim).

  • ilens (LongTensor) – Batch of lengths of each input sequence (B,).

Returns

Batch of the padded sequence of outputs (B, Tmax, odim). LongTensor: Batch of lengths of each output sequence (B,).

Return type

Tensor

inference(x)[source]

Inference.

Parameters

x (Tensor) – The sequences of inputs (T, idim).

Returns

The sequence of outputs (T, odim).

Return type

Tensor

class espnet.nets.pytorch_backend.tacotron2.cbhg.CBHGLoss(use_masking=True)[source]

Bases: torch.nn.modules.module.Module

Loss function module for CBHG.

Initialize CBHG loss module.

Parameters

use_masking (bool) – Whether to mask padded part in loss calculation.

forward(cbhg_outs, spcs, olens)[source]

Calculate forward propagation.

Parameters
  • cbhg_outs (Tensor) – Batch of CBHG outputs (B, Lmax, spc_dim).

  • spcs (Tensor) – Batch of groundtruth of spectrogram (B, Lmax, spc_dim).

  • olens (LongTensor) – Batch of the lengths of each sequence (B,).

Returns

L1 loss value Tensor: Mean square error loss value.

Return type

Tensor

class espnet.nets.pytorch_backend.tacotron2.cbhg.HighwayNet(idim)[source]

Bases: torch.nn.modules.module.Module

Highway Network module.

This is a module of Highway Network introduced in Highway Networks.

Initialize Highway Network module.

Parameters

idim (int) – Dimension of the inputs.

forward(x)[source]

Calculate forward propagation.

Parameters

x (Tensor) – Batch of inputs (B, …, idim).

Returns

Batch of outputs, which are the same shape as inputs (B, …, idim).

Return type

Tensor

espnet.nets.pytorch_backend.tacotron2.__init__

Initialize sub package.

espnet.nets.pytorch_backend.tacotron2.encoder

Tacotron2 encoder related modules.

class espnet.nets.pytorch_backend.tacotron2.encoder.Encoder(idim, input_layer='embed', embed_dim=512, elayers=1, eunits=512, econv_layers=3, econv_chans=512, econv_filts=5, use_batch_norm=True, use_residual=False, dropout_rate=0.5, padding_idx=0)[source]

Bases: torch.nn.modules.module.Module

Encoder module of Spectrogram prediction network.

This is a module of the encoder of the Spectrogram prediction network in Tacotron2, which is described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. The encoder converts either a sequence of characters or acoustic features into a sequence of hidden states.

Initialize Tacotron2 encoder module.

Parameters
  • idim (int) – Dimension of the inputs.

  • input_layer (str) – Input layer type.

  • embed_dim (int, optional) – Dimension of character embedding.

  • elayers (int, optional) – The number of encoder BLSTM layers.

  • eunits (int, optional) – The number of encoder BLSTM units.

  • econv_layers (int, optional) – The number of encoder conv layers.

  • econv_filts (int, optional) – Kernel size of encoder conv filters.

  • econv_chans (int, optional) – The number of encoder conv channels.

  • use_batch_norm (bool, optional) – Whether to use batch normalization.

  • use_residual (bool, optional) – Whether to use residual connections.

  • dropout_rate (float, optional) – Dropout rate.

forward(xs, ilens=None)[source]

Calculate forward propagation.

Parameters
  • xs (Tensor) – Batch of the padded sequence. Either character ids (B, Tmax) or acoustic feature (B, Tmax, idim * encoder_reduction_factor). Padded value should be 0.

  • ilens (LongTensor) – Batch of lengths of each input batch (B,).

Returns

Batch of the sequences of encoder states (B, Tmax, eunits). LongTensor: Batch of lengths of each sequence (B,)

Return type

Tensor

inference(x)[source]

Inference.

Parameters

x (Tensor) – The sequence of character ids (T,) or acoustic features (T, idim * encoder_reduction_factor).

Returns

The sequence of encoder states (T, eunits).

Return type

Tensor

espnet.nets.pytorch_backend.tacotron2.encoder.encoder_init(m)[source]

Initialize encoder parameters.

espnet.nets.pytorch_backend.transformer.attention

Multi-Head Attention layer definition.

class espnet.nets.pytorch_backend.transformer.attention.MultiHeadedAttention(n_head, n_feat, dropout_rate)[source]

Bases: torch.nn.modules.module.Module

Multi-Head Attention layer.

Parameters
  • n_head (int) – The number of heads.

  • n_feat (int) – The number of features.

  • dropout_rate (float) – Dropout rate.

Construct a MultiHeadedAttention object.

forward(query, key, value, mask)[source]

Compute scaled dot product attention.

Parameters
  • query (torch.Tensor) – Query tensor (#batch, time1, size).

  • key (torch.Tensor) – Key tensor (#batch, time2, size).

  • value (torch.Tensor) – Value tensor (#batch, time2, size).

  • mask (torch.Tensor) – Mask tensor (#batch, 1, time2) or (#batch, time1, time2).

Returns

Output tensor (#batch, time1, d_model).

Return type

torch.Tensor
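
The computation is the standard scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, applied per head; a minimal single-head sketch (the module itself splits q/k/v into n_head heads of size d_k = n_feat / n_head):

import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q: (B, T1, d_k); k, v: (B, T2, d_k)
    scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # drop masked positions
    return torch.matmul(torch.softmax(scores, dim=-1), v)      # (B, T1, d_k)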

forward_attention(value, scores, mask)[source]

Compute attention context vector.

Parameters
  • value (torch.Tensor) – Transformed value (#batch, n_head, time2, d_k).

  • scores (torch.Tensor) – Attention score (#batch, n_head, time1, time2).

  • mask (torch.Tensor) – Mask (#batch, 1, time2) or (#batch, time1, time2).

Returns

Transformed value (#batch, time1, d_model)

weighted by the attention score (#batch, time1, time2).

Return type

torch.Tensor

forward_qkv(query, key, value)[source]

Transform query, key and value.

Parameters
  • query (torch.Tensor) – Query tensor (#batch, time1, size).

  • key (torch.Tensor) – Key tensor (#batch, time2, size).

  • value (torch.Tensor) – Value tensor (#batch, time2, size).

Returns

Transformed query tensor (#batch, n_head, time1, d_k). torch.Tensor: Transformed key tensor (#batch, n_head, time2, d_k). torch.Tensor: Transformed value tensor (#batch, n_head, time2, d_k).

Return type

torch.Tensor

class espnet.nets.pytorch_backend.transformer.attention.RelPositionMultiHeadedAttention(n_head, n_feat, dropout_rate)[source]

Bases: espnet.nets.pytorch_backend.transformer.attention.MultiHeadedAttention

Multi-Head Attention layer with relative position encoding.

Paper: https://arxiv.org/abs/1901.02860

Parameters
  • n_head (int) – The number of heads.

  • n_feat (int) – The number of features.

  • dropout_rate (float) – Dropout rate.

Construct a RelPositionMultiHeadedAttention object.

forward(query, key, value, pos_emb, mask)[source]

Compute ‘Scaled Dot Product Attention’ with rel. positional encoding.

Parameters
  • query (torch.Tensor) – Query tensor (#batch, time1, size).

  • key (torch.Tensor) – Key tensor (#batch, time2, size).

  • value (torch.Tensor) – Value tensor (#batch, time2, size).

  • pos_emb (torch.Tensor) – Positional embedding tensor (#batch, time2, size).

  • mask (torch.Tensor) – Mask tensor (#batch, 1, time2) or (#batch, time1, time2).

Returns

Output tensor (#batch, time1, d_model).

Return type

torch.Tensor

rel_shift(x, zero_triu=False)[source]

Compute relative positional encoding.

Parameters
  • x (torch.Tensor) – Input tensor (batch, time, size).

  • zero_triu (bool) – If true, return the lower triangular part of the matrix.

Returns

Output tensor.

Return type

torch.Tensor

espnet.nets.pytorch_backend.transformer.embedding

Positional Encoding Module.

class espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding(d_model, dropout_rate, max_len=5000, reverse=False)[source]

Bases: torch.nn.modules.module.Module

Positional encoding.

Parameters
  • d_model (int) – Embedding dimension.

  • dropout_rate (float) – Dropout rate.

  • max_len (int) – Maximum input length.

  • reverse (bool) – Whether to reverse the input position.

Construct a PositionalEncoding object.

extend_pe(x)[source]

Reset the positional encodings.

forward(x: torch.Tensor)[source]

Add positional encoding.

Parameters

x (torch.Tensor) – Input tensor (batch, time, *).

Returns

Encoded tensor (batch, time, *).

Return type

torch.Tensor
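
The encoding table follows the standard sinusoidal definition, PE(pos, 2i) = sin(pos / 10000^{2i/d_model}) and PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model}); a minimal sketch of building it (assuming an even d_model; the module also scales the input before adding the table):

import math
import torch

def sinusoidal_pe(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)        # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                    # (d_model / 2,)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe.unsqueeze(0)               # (1, max_len, d_model)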

class espnet.nets.pytorch_backend.transformer.embedding.RelPositionalEncoding(d_model, dropout_rate, max_len=5000)[source]

Bases: espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding

Relative positional encoding module.

See : Appendix B in https://arxiv.org/abs/1901.02860

Parameters
  • d_model (int) – Embedding dimension.

  • dropout_rate (float) – Dropout rate.

  • max_len (int) – Maximum input length.

Initialize class.

forward(x)[source]

Compute positional encoding.

Parameters

x (torch.Tensor) – Input tensor (batch, time, *).

Returns

Encoded tensor (batch, time, *). torch.Tensor: Positional embedding tensor (1, time, *).

Return type

torch.Tensor

class espnet.nets.pytorch_backend.transformer.embedding.ScaledPositionalEncoding(d_model, dropout_rate, max_len=5000)[source]

Bases: espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding

Scaled positional encoding module.

See Sec. 3.2 https://arxiv.org/abs/1809.08895

Parameters
  • d_model (int) – Embedding dimension.

  • dropout_rate (float) – Dropout rate.

  • max_len (int) – Maximum input length.

Initialize class.

forward(x)[source]

Add positional encoding.

Parameters

x (torch.Tensor) – Input tensor (batch, time, *).

Returns

Encoded tensor (batch, time, *).

Return type

torch.Tensor

reset_parameters()[source]

Reset parameters.

espnet.nets.pytorch_backend.transformer.lightconv2d

Lightweight 2-Dimensional Convolution module.

class espnet.nets.pytorch_backend.transformer.lightconv2d.LightweightConvolution2D(wshare, n_feat, dropout_rate, kernel_size_str, lnum, use_kernel_mask=False, use_bias=False)[source]

Bases: torch.nn.modules.module.Module

Lightweight 2-Dimensional Convolution layer.

This implementation is based on https://github.com/pytorch/fairseq/tree/master/fairseq

Parameters
  • wshare (int) – the number of kernels of the convolution

  • n_feat (int) – the number of features

  • dropout_rate (float) – dropout rate

  • kernel_size_str (str) – kernel size (length)

  • lnum (int) – index of the layer

  • use_kernel_mask (bool) – whether to use a causal mask for the convolution kernel

  • use_bias (bool) – whether to use a bias term

Construct a Lightweight 2-Dimensional Convolution layer.

forward(query, key, value, mask)[source]

Forward of ‘Lightweight 2-Dimensional Convolution’.

This function takes query, key and value but uses only the query. It exists for compatibility with the self-attention layer (attention.py).

Parameters
  • query (torch.Tensor) – (batch, time1, d_model) input tensor

  • key (torch.Tensor) – (batch, time2, d_model) NOT USED

  • value (torch.Tensor) – (batch, time2, d_model) NOT USED

  • mask (torch.Tensor) – (batch, time1, time2) mask

Returns

(batch, time1, d_model) output

Return type

x (torch.Tensor)

espnet.nets.pytorch_backend.transformer.layer_norm

Layer normalization module.

class espnet.nets.pytorch_backend.transformer.layer_norm.LayerNorm(nout, dim=-1)[source]

Bases: torch.nn.modules.normalization.LayerNorm

Layer normalization module.

Parameters
  • nout (int) – Output dim size.

  • dim (int) – Dimension to be normalized.

Construct a LayerNorm object.

forward(x)[source]

Apply layer normalization.

Parameters

x (torch.Tensor) – Input tensor.

Returns

Normalized tensor.

Return type

torch.Tensor

espnet.nets.pytorch_backend.transformer.label_smoothing_loss

Label smoothing module.

class espnet.nets.pytorch_backend.transformer.label_smoothing_loss.LabelSmoothingLoss(size, padding_idx, smoothing, normalize_length=False, criterion=KLDivLoss())[source]

Bases: torch.nn.modules.module.Module

Label-smoothing loss.

Parameters
  • size (int) – the number of classes

  • padding_idx (int) – ignored class id

  • smoothing (float) – smoothing rate (0.0 means the conventional CE)

  • normalize_length (bool) – normalize loss by sequence length if True

  • criterion (torch.nn.Module) – loss function to be smoothed

Construct a LabelSmoothingLoss object.

forward(x, target)[source]

Compute loss between x and target.

Parameters
  • x (torch.Tensor) – prediction (batch, seqlen, class)

  • target (torch.Tensor) – target signal masked with self.padding_idx (batch, seqlen)

Returns

scalar float value

Return type

torch.Tensor
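
A minimal sketch of the smoothed target distribution and the KL loss, assuming token-level normalization (normalize_length=True); names are illustrative:

import torch

def label_smoothing_loss_sketch(x, target, size, padding_idx, smoothing):
    # x: (B, L, size) logits; target: (B, L) with padding_idx on ignored slots
    logp = torch.log_softmax(x, dim=-1).view(-1, size)
    target = target.view(-1)
    ignore = target == padding_idx
    true_dist = torch.full_like(logp, smoothing / (size - 1))   # off-target mass
    true_dist.scatter_(1, target.masked_fill(ignore, 0).unsqueeze(1), 1.0 - smoothing)
    kl = torch.nn.functional.kl_div(logp, true_dist, reduction="none").sum(dim=1)
    return kl.masked_fill(ignore, 0.0).sum() / (~ignore).sum()  # average over real tokens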

espnet.nets.pytorch_backend.transformer.decoder

Decoder definition.

class espnet.nets.pytorch_backend.transformer.decoder.Decoder(odim, selfattention_layer_type='selfattn', attention_dim=256, attention_heads=4, conv_wshare=4, conv_kernel_length=11, conv_usebias=False, linear_units=2048, num_blocks=6, dropout_rate=0.1, positional_dropout_rate=0.1, self_attention_dropout_rate=0.0, src_attention_dropout_rate=0.0, input_layer='embed', use_output_layer=True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before=True, concat_after=False)[source]

Bases: espnet.nets.scorer_interface.BatchScorerInterface, torch.nn.modules.module.Module

Transformer decoder module.

Parameters
  • odim (int) – Output dimension.

  • selfattention_layer_type (str) – Self-attention layer type.

  • attention_dim (int) – Dimension of attention.

  • attention_heads (int) – The number of heads of multi head attention.

  • conv_wshare (int) – The number of kernels of the convolution. Only used in selfattention_layer_type == “lightconv*” or “dynamicconv*”.

  • conv_kernel_length (Union[int, str]) – Kernel size str of convolution (e.g. 71_71_71_71_71_71). Only used in selfattention_layer_type == “lightconv*” or “dynamicconv*”.

  • conv_usebias (bool) – Whether to use bias in convolution. Only used in selfattention_layer_type == “lightconv*” or “dynamicconv*”.

  • linear_units (int) – The number of units of position-wise feed forward.

  • num_blocks (int) – The number of decoder blocks.

  • dropout_rate (float) – Dropout rate.

  • positional_dropout_rate (float) – Dropout rate after adding positional encoding.

  • self_attention_dropout_rate (float) – Dropout rate in self-attention.

  • src_attention_dropout_rate (float) – Dropout rate in source-attention.

  • input_layer (Union[str, torch.nn.Module]) – Input layer type.

  • use_output_layer (bool) – Whether to use output layer.

  • pos_enc_class (torch.nn.Module) – Positional encoding module class. PositionalEncoding or ScaledPositionalEncoding.

  • normalize_before (bool) – Whether to use layer_norm before the first block.

  • concat_after (bool) – Whether to concat attention layer’s input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)

Construct a Decoder object.

batch_score(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]

Score new token batch (required).

Parameters
  • ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).

  • states (List[Any]) – Scorer states for prefix tokens.

  • xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).

Returns

Tuple of

batched scores for the next token with shape of (n_batch, n_vocab) and the next state list for ys.

Return type

tuple[torch.Tensor, List[Any]]

forward(tgt, tgt_mask, memory, memory_mask)[source]

Forward decoder.

Parameters
  • tgt (torch.Tensor) – Input token ids, int64 (#batch, maxlen_out) if input_layer == “embed”. In the other case, input tensor (#batch, maxlen_out, odim).

  • tgt_mask (torch.Tensor) – Input token mask (#batch, maxlen_out). dtype is torch.uint8 before PyTorch 1.2 and torch.bool in PyTorch 1.2 and later.

  • memory (torch.Tensor) – Encoded memory, float32 (#batch, maxlen_in, feat).

  • memory_mask (torch.Tensor) – Encoded memory mask (#batch, maxlen_in). dtype is torch.uint8 before PyTorch 1.2 and torch.bool in PyTorch 1.2 and later.

Returns

Decoded token score before softmax (#batch, maxlen_out, odim)

if use_output_layer is True; otherwise, the final block’s output (#batch, maxlen_out, attention_dim).

torch.Tensor: Score mask before softmax (#batch, maxlen_out).

Return type

torch.Tensor

forward_one_step(tgt, tgt_mask, memory, cache=None)[source]

Forward one step.

Parameters
  • tgt (torch.Tensor) – Input token ids, int64 (#batch, maxlen_out).

  • tgt_mask (torch.Tensor) – Input token mask (#batch, maxlen_out). dtype is torch.uint8 before PyTorch 1.2 and torch.bool in PyTorch 1.2 and later.

  • memory (torch.Tensor) – Encoded memory, float32 (#batch, maxlen_in, feat).

  • cache (List[torch.Tensor]) – List of cached tensors. Each tensor shape should be (#batch, maxlen_out - 1, size).

Returns

Output tensor (batch, maxlen_out, odim). List[torch.Tensor]: List of cache tensors of each decoder layer.

Return type

torch.Tensor

score(ys, state, x)[source]

Score.

espnet.nets.pytorch_backend.transformer.multi_layer_conv

Layer modules for FFT block in FastSpeech (Feed-forward Transformer).

class espnet.nets.pytorch_backend.transformer.multi_layer_conv.Conv1dLinear(in_chans, hidden_chans, kernel_size, dropout_rate)[source]

Bases: torch.nn.modules.module.Module

Conv1D + Linear for Transformer block.

A variant of MultiLayeredConv1d, which replaces the second conv layer with a linear layer.

Initialize Conv1dLinear module.

Parameters
  • in_chans (int) – Number of input channels.

  • hidden_chans (int) – Number of hidden channels.

  • kernel_size (int) – Kernel size of conv1d.

  • dropout_rate (float) – Dropout rate.

forward(x)[source]

Calculate forward propagation.

Parameters

x (torch.Tensor) – Batch of input tensors (B, T, in_chans).

Returns

Batch of output tensors (B, T, hidden_chans).

Return type

torch.Tensor

class espnet.nets.pytorch_backend.transformer.multi_layer_conv.MultiLayeredConv1d(in_chans, hidden_chans, kernel_size, dropout_rate)[source]

Bases: torch.nn.modules.module.Module

Multi-layered conv1d for Transformer block.

This is a module of multi-layered conv1d designed to replace the position-wise feed-forward network in the Transformer block, which is introduced in FastSpeech: Fast, Robust and Controllable Text to Speech.

Initialize MultiLayeredConv1d module.

Parameters
  • in_chans (int) – Number of input channels.

  • hidden_chans (int) – Number of hidden channels.

  • kernel_size (int) – Kernel size of conv1d.

  • dropout_rate (float) – Dropout rate.

forward(x)[source]

Calculate forward propagation.

Parameters

x (torch.Tensor) – Batch of input tensors (B, T, in_chans).

Returns

Batch of output tensors (B, T, hidden_chans).

Return type

torch.Tensor

espnet.nets.pytorch_backend.transformer.add_sos_eos

Utility functions for Transformer.

espnet.nets.pytorch_backend.transformer.add_sos_eos.add_sos_eos(ys_pad, sos, eos, ignore_id)[source]

Add <sos> and <eos> labels.

Parameters
  • ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)

  • sos (int) – index of <sos>

  • eos (int) – index of <eos>

  • ignore_id (int) – index of padding

Returns

padded tensor with <sos> prepended (B, Lmax) and padded tensor with <eos> appended (B, Lmax)

Return type

tuple[torch.Tensor, torch.Tensor]
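
A minimal sketch of the transformation (the padding values chosen for the two output tensors follow common practice and are an assumption):

import torch
from torch.nn.utils.rnn import pad_sequence

def add_sos_eos_sketch(ys_pad, sos, eos, ignore_id):
    ys = [y[y != ignore_id] for y in ys_pad]       # strip padding
    sos_t = ys_pad.new_full((1,), sos)
    eos_t = ys_pad.new_full((1,), eos)
    ys_in = [torch.cat([sos_t, y]) for y in ys]    # decoder input: <sos> + y
    ys_out = [torch.cat([y, eos_t]) for y in ys]   # decoder target: y + <eos>
    return (pad_sequence(ys_in, batch_first=True, padding_value=eos),
            pad_sequence(ys_out, batch_first=True, padding_value=ignore_id))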

espnet.nets.pytorch_backend.transformer.decoder_layer

Decoder self-attention layer definition.

class espnet.nets.pytorch_backend.transformer.decoder_layer.DecoderLayer(size, self_attn, src_attn, feed_forward, dropout_rate, normalize_before=True, concat_after=False)[source]

Bases: torch.nn.modules.module.Module

Single decoder layer module.

Parameters
  • size (int) – Input dimension.

  • self_attn (torch.nn.Module) – Self-attention module instance. MultiHeadedAttention instance can be used as the argument.

  • src_attn (torch.nn.Module) – Source-attention (encoder-decoder attention) module instance. MultiHeadedAttention instance can be used as the argument.

  • feed_forward (torch.nn.Module) – Feed-forward module instance. PositionwiseFeedForward, MultiLayeredConv1d, or Conv1dLinear instance can be used as the argument.

  • dropout_rate (float) – Dropout rate.

  • normalize_before (bool) – Whether to use layer_norm before the first block.

  • concat_after (bool) – Whether to concat attention layer’s input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)

Construct a DecoderLayer object.

forward(tgt, tgt_mask, memory, memory_mask, cache=None)[source]

Compute decoded features.

Parameters
  • tgt (torch.Tensor) – Input tensor (#batch, maxlen_out, size).

  • tgt_mask (torch.Tensor) – Mask for input tensor (#batch, maxlen_out).

  • memory (torch.Tensor) – Encoded memory, float32 (#batch, maxlen_in, size).

  • memory_mask (torch.Tensor) – Encoded memory mask (#batch, maxlen_in).

  • cache (List[torch.Tensor]) – List of cached tensors. Each tensor shape should be (#batch, maxlen_out - 1, size).

Returns

Output tensor(#batch, maxlen_out, size). torch.Tensor: Mask for output tensor (#batch, maxlen_out). torch.Tensor: Encoded memory (#batch, maxlen_in, size). torch.Tensor: Encoded memory mask (#batch, maxlen_in).

Return type

torch.Tensor

espnet.nets.pytorch_backend.transformer.lightconv

Lightweight Convolution Module.

class espnet.nets.pytorch_backend.transformer.lightconv.LightweightConvolution(wshare, n_feat, dropout_rate, kernel_size_str, lnum, use_kernel_mask=False, use_bias=False)[source]

Bases: torch.nn.modules.module.Module

Lightweight Convolution layer.

This implementation is based on https://github.com/pytorch/fairseq/tree/master/fairseq

Parameters
  • wshare (int) – the number of kernels of the convolution

  • n_feat (int) – the number of features

  • dropout_rate (float) – dropout rate

  • kernel_size_str (str) – kernel size (length)

  • lnum (int) – index of the layer

  • use_kernel_mask (bool) – whether to use a causal mask for the convolution kernel

  • use_bias (bool) – whether to use a bias term

Construct Lightweight Convolution layer.

forward(query, key, value, mask)[source]

Forward of ‘Lightweight Convolution’.

This function takes query, key and value but uses only the query. It exists for compatibility with the self-attention layer (attention.py).

Parameters
  • query (torch.Tensor) – (batch, time1, d_model) input tensor

  • key (torch.Tensor) – (batch, time2, d_model) NOT USED

  • value (torch.Tensor) – (batch, time2, d_model) NOT USED

  • mask (torch.Tensor) – (batch, time1, time2) mask

Returns

(batch, time1, d_model) output

Return type

x (torch.Tensor)

espnet.nets.pytorch_backend.transformer.plot

class espnet.nets.pytorch_backend.transformer.plot.PlotAttentionReport(att_vis_fn, data, outdir, converter, transform, device, reverse=False, ikey='input', iaxis=0, okey='output', oaxis=0)[source]

Bases: espnet.asr.asr_utils.PlotAttentionReport

get_attention_weights()[source]

Return attention weights.

Returns

attention weights (float). Its shape differs by backend:

  • pytorch -> (B, H, Lmax, Tmax) in the multi-head case, otherwise (B, Lmax, Tmax)

  • chainer -> (B, Lmax, Tmax)

Return type

numpy.ndarray

log_attentions(logger, step)[source]

Add image files of att_ws matrix to the tensorboard.

plotfn(*args, **kwargs)[source]
espnet.nets.pytorch_backend.transformer.plot.plot_multi_head_attention(data, attn_dict, outdir, suffix='png', savefn=<function savefig>, ikey='input', iaxis=0, okey='output', oaxis=0, subsampling_rate=4)[source]

Plot multi head attentions.

Parameters
  • data (dict) – utts info from json file

  • attn_dict (dict[str, torch.Tensor]) – multi head attention dict. Values should be torch.Tensor (head, input_length, output_length)

  • outdir (str) – dir to save fig

  • suffix (str) – filename suffix including image type (e.g., png)

  • savefn – function to save

  • ikey (str) – key to access input

  • iaxis (int) – dimension to access input

  • okey (str) – key to access output

  • oaxis (int) – dimension to access output

  • subsampling_rate – subsampling rate in encoder

espnet.nets.pytorch_backend.transformer.plot.savefig(plot, filename)[source]

espnet.nets.pytorch_backend.transformer.initializer

Parameter initialization.

espnet.nets.pytorch_backend.transformer.initializer.initialize(model, init_type='pytorch')[source]

Initialize Transformer module.

Parameters
  • model (torch.nn.Module) – transformer instance

  • init_type (str) – initialization type

espnet.nets.pytorch_backend.transformer.repeat

Repeat the same layer definition.

class espnet.nets.pytorch_backend.transformer.repeat.MultiSequential(*args)[source]

Bases: torch.nn.modules.container.Sequential

Multi-input multi-output torch.nn.Sequential.

forward(*args)[source]

Repeat.

espnet.nets.pytorch_backend.transformer.repeat.repeat(N, fn)[source]

Repeat module N times.

Parameters
  • N (int) – Number of times to repeat the module.

  • fn (Callable) – Function to generate module.

Returns

Repeated model instance.

Return type

MultiSequential
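
A sketch of the semantics: each generated module consumes and returns the full argument tuple, and fn builds a fresh module per call so the repeated blocks do not share parameters (whether fn receives the layer index depends on the version; the sketch assumes it does not):

import torch

class MultiSequentialSketch(torch.nn.Sequential):
    def forward(self, *args):
        for module in self:
            args = module(*args)   # every layer consumes and returns the tuple
        return args

def repeat_sketch(N, fn):
    return MultiSequentialSketch(*[fn() for _ in range(N)])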

espnet.nets.pytorch_backend.transformer.optimizer

Optimizer module.

class espnet.nets.pytorch_backend.transformer.optimizer.NoamOpt(model_size, factor, warmup, optimizer)[source]

Bases: object

Optimizer wrapper that implements the Noam learning-rate schedule.

Construct an NoamOpt object.

load_state_dict(state_dict)[source]

Load state_dict.

property param_groups

Return param_groups.

rate(step=None)[source]

Compute the learning rate for the given step.

state_dict()[source]

Return state_dict.

step()[source]

Update parameters and rate.

zero_grad()[source]

Reset gradient.

espnet.nets.pytorch_backend.transformer.optimizer.get_std_opt(model_params, d_model, warmup, factor)[source]

Get standard NoamOpt.
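
The schedule is the standard Noam formula from “Attention Is All You Need”: linear warm-up followed by inverse-square-root decay. A sketch of the rate computation, which the wrapper applies to every parameter group before each step:

def noam_rate(step, model_size, factor, warmup):
    # lr = factor * model_size^-0.5 * min(step^-0.5, step * warmup^-1.5)
    return factor * model_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)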

espnet.nets.pytorch_backend.transformer.dynamic_conv

Dynamic Convolution module.

class espnet.nets.pytorch_backend.transformer.dynamic_conv.DynamicConvolution(wshare, n_feat, dropout_rate, kernel_size_str, lnum, use_kernel_mask=False, use_bias=False)[source]

Bases: torch.nn.modules.module.Module

Dynamic Convolution layer.

This implementation is based on https://github.com/pytorch/fairseq/tree/master/fairseq

Parameters
  • wshare (int) – the number of kernels of the convolution

  • n_feat (int) – the number of features

  • dropout_rate (float) – dropout rate

  • kernel_size_str (str) – kernel size (length)

  • lnum (int) – index of the layer

  • use_kernel_mask (bool) – whether to use a causal mask for the convolution kernel

  • use_bias (bool) – whether to use a bias term

Construct Dynamic Convolution layer.

forward(query, key, value, mask)[source]

Forward of ‘Dynamic Convolution’.

This function takes query, key and value but uses only the query. It exists for compatibility with the self-attention layer (attention.py).

Parameters
  • query (torch.Tensor) – (batch, time1, d_model) input tensor

  • key (torch.Tensor) – (batch, time2, d_model) NOT USED

  • value (torch.Tensor) – (batch, time2, d_model) NOT USED

  • mask (torch.Tensor) – (batch, time1, time2) mask

Returns

(batch, time1, d_model) output

Return type

x (torch.Tensor)

espnet.nets.pytorch_backend.transformer.dynamic_conv2d

Dynamic 2-Dimensional Convolution module.

class espnet.nets.pytorch_backend.transformer.dynamic_conv2d.DynamicConvolution2D(wshare, n_feat, dropout_rate, kernel_size_str, lnum, use_kernel_mask=False, use_bias=False)[source]

Bases: torch.nn.modules.module.Module

Dynamic 2-Dimensional Convolution layer.

This implementation is based on https://github.com/pytorch/fairseq/tree/master/fairseq

Parameters
  • wshare (int) – the number of kernels of the convolution

  • n_feat (int) – the number of features

  • dropout_rate (float) – dropout rate

  • kernel_size_str (str) – kernel size (length)

  • lnum (int) – index of the layer

  • use_kernel_mask (bool) – whether to use a causal mask for the convolution kernel

  • use_bias (bool) – whether to use a bias term

Construct a Dynamic 2-Dimensional Convolution layer.

forward(query, key, value, mask)[source]

Forward of ‘Dynamic 2-Dimensional Convolution’.

This function takes query, key and value but uses only the query. It exists for compatibility with the self-attention layer (attention.py).

Parameters
  • query (torch.Tensor) – (batch, time1, d_model) input tensor

  • key (torch.Tensor) – (batch, time2, d_model) NOT USED

  • value (torch.Tensor) – (batch, time2, d_model) NOT USED

  • mask (torch.Tensor) – (batch, time1, time2) mask

Returns

(batch, time1, d_model) output

Return type

x (torch.Tensor)

espnet.nets.pytorch_backend.transformer.subsampling

Subsampling layer definition.

class espnet.nets.pytorch_backend.transformer.subsampling.Conv2dSubsampling(idim, odim, dropout_rate, pos_enc=None)[source]

Bases: torch.nn.modules.module.Module

Convolutional 2D subsampling (to 1/4 length).

Parameters
  • idim (int) – Input dimension.

  • odim (int) – Output dimension.

  • dropout_rate (float) – Dropout rate.

  • pos_enc (torch.nn.Module) – Custom position encoding layer.

Construct a Conv2dSubsampling object.

forward(x, x_mask)[source]

Subsample x.

Parameters
  • x (torch.Tensor) – Input tensor (#batch, time, idim).

  • x_mask (torch.Tensor) – Input mask (#batch, 1, time).

Returns

Subsampled tensor (#batch, time’, odim),

where time’ = time // 4.

torch.Tensor: Subsampled mask (#batch, 1, time’),

where time’ = time // 4.

Return type

torch.Tensor
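
The 1/4 reduction comes from two stride-2 convolutions; a sketch of the length arithmetic, assuming kernel size 3 and no padding (as in typical implementations):

def subsampled_length(time: int) -> int:
    # each Conv2d(kernel_size=3, stride=2) maps T -> (T - 3) // 2 + 1 = (T - 1) // 2
    after_first = (time - 1) // 2
    return (after_first - 1) // 2   # roughly time // 4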

class espnet.nets.pytorch_backend.transformer.subsampling.Conv2dSubsampling6(idim, odim, dropout_rate, pos_enc=None)[source]

Bases: torch.nn.modules.module.Module

Convolutional 2D subsampling (to 1/6 length).

Parameters
  • idim (int) – Input dimension.

  • odim (int) – Output dimension.

  • dropout_rate (float) – Dropout rate.

  • pos_enc (torch.nn.Module) – Custom position encoding layer.

Construct a Conv2dSubsampling6 object.

forward(x, x_mask)[source]

Subsample x.

Parameters
  • x (torch.Tensor) – Input tensor (#batch, time, idim).

  • x_mask (torch.Tensor) – Input mask (#batch, 1, time).

Returns

Subsampled tensor (#batch, time’, odim),

where time’ = time // 6.

torch.Tensor: Subsampled mask (#batch, 1, time’),

where time’ = time // 6.

Return type

torch.Tensor

class espnet.nets.pytorch_backend.transformer.subsampling.Conv2dSubsampling8(idim, odim, dropout_rate, pos_enc=None)[source]

Bases: torch.nn.modules.module.Module

Convolutional 2D subsampling (to 1/8 length).

Parameters
  • idim (int) – Input dimension.

  • odim (int) – Output dimension.

  • dropout_rate (float) – Dropout rate.

  • pos_enc (torch.nn.Module) – Custom position encoding layer.

Construct a Conv2dSubsampling8 object.

forward(x, x_mask)[source]

Subsample x.

Parameters
  • x (torch.Tensor) – Input tensor (#batch, time, idim).

  • x_mask (torch.Tensor) – Input mask (#batch, 1, time).

Returns

Subsampled tensor (#batch, time’, odim),

where time’ = time // 8.

torch.Tensor: Subsampled mask (#batch, 1, time’),

where time’ = time // 8.

Return type

torch.Tensor

espnet.nets.pytorch_backend.transformer.positionwise_feed_forward

Positionwise feed forward layer definition.

class espnet.nets.pytorch_backend.transformer.positionwise_feed_forward.PositionwiseFeedForward(idim, hidden_units, dropout_rate, activation=ReLU())[source]

Bases: torch.nn.modules.module.Module

Positionwise feed forward layer.

Parameters
  • idim (int) – Input dimension.

  • hidden_units (int) – The number of hidden units.

  • dropout_rate (float) – Dropout rate.

Construct a PositionwiseFeedForward object.

forward(x)[source]

Forward function.
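
Functionally, the layer applies two linear maps with an activation and dropout in between; a minimal equivalent sketch:

import torch

class PositionwiseFeedForwardSketch(torch.nn.Module):
    def __init__(self, idim, hidden_units, dropout_rate, activation=torch.nn.ReLU()):
        super().__init__()
        self.w_1 = torch.nn.Linear(idim, hidden_units)
        self.w_2 = torch.nn.Linear(hidden_units, idim)
        self.dropout = torch.nn.Dropout(dropout_rate)
        self.activation = activation

    def forward(self, x):
        # (B, T, idim) -> (B, T, hidden_units) -> (B, T, idim)
        return self.w_2(self.dropout(self.activation(self.w_1(x))))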

espnet.nets.pytorch_backend.transformer.__init__

Initialize sub package.

espnet.nets.pytorch_backend.transformer.encoder_layer

Encoder self-attention layer definition.

class espnet.nets.pytorch_backend.transformer.encoder_layer.EncoderLayer(size, self_attn, feed_forward, dropout_rate, normalize_before=True, concat_after=False)[source]

Bases: torch.nn.modules.module.Module

Encoder layer module.

Parameters
  • size (int) – Input dimension.

  • self_attn (torch.nn.Module) – Self-attention module instance. MultiHeadedAttention or RelPositionMultiHeadedAttention instance can be used as the argument.

  • feed_forward (torch.nn.Module) – Feed-forward module instance. PositionwiseFeedForward, MultiLayeredConv1d, or Conv1dLinear instance can be used as the argument.

  • dropout_rate (float) – Dropout rate.

  • normalize_before (bool) – Whether to use layer_norm before the first block.

  • concat_after (bool) – Whether to concat attention layer’s input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)

Construct an EncoderLayer object.

forward(x, mask, cache=None)[source]

Compute encoded features.

Parameters
  • x (torch.Tensor) – Input tensor (#batch, time, size).

  • mask (torch.Tensor) – Mask tensor for the input (#batch, time).

  • cache (torch.Tensor) – Cache tensor of the input (#batch, time - 1, size).

Returns

Output tensor (#batch, time, size). torch.Tensor: Mask tensor (#batch, time).

Return type

torch.Tensor

espnet.nets.pytorch_backend.transformer.argument

Transformer common arguments.

espnet.nets.pytorch_backend.transformer.argument.add_arguments_transformer_common(group)[source]

Add Transformer common arguments.

espnet.nets.pytorch_backend.transformer.mask

Mask module.

espnet.nets.pytorch_backend.transformer.mask.subsequent_mask(size, device='cpu', dtype=torch.bool)[source]

Create mask for subsequent steps (size, size).

Parameters
  • size (int) – size of mask

  • device (str) – “cpu” or “cuda” or torch.Tensor.device

  • dtype (torch.dtype) – result dtype

Return type

torch.Tensor

>>> subsequent_mask(3)
[[1, 0, 0],
 [1, 1, 0],
 [1, 1, 1]]
espnet.nets.pytorch_backend.transformer.mask.target_mask(ys_in_pad, ignore_id)[source]

Create mask for decoder self-attention.

Parameters
  • ys_in_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)

  • ignore_id (int) – index of padding

Return type

torch.Tensor (B, Lmax, Lmax)
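
A minimal sketch combining the padding mask with the causal mask from subsequent_mask (illustrative names):

import torch

def target_mask_sketch(ys_in_pad, ignore_id):
    non_pad = (ys_in_pad != ignore_id).unsqueeze(-2)              # (B, 1, Lmax)
    size = ys_in_pad.size(1)
    causal = torch.tril(torch.ones(size, size, dtype=torch.bool,
                                   device=ys_in_pad.device))      # (Lmax, Lmax)
    return non_pad & causal.unsqueeze(0)                          # (B, Lmax, Lmax)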

espnet.nets.pytorch_backend.transformer.encoder_mix

Encoder Mix definition.

class espnet.nets.pytorch_backend.transformer.encoder_mix.EncoderMix(idim, attention_dim=256, attention_heads=4, linear_units=2048, num_blocks_sd=4, num_blocks_rec=8, dropout_rate=0.1, positional_dropout_rate=0.1, attention_dropout_rate=0.0, input_layer='conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before=True, concat_after=False, positionwise_layer_type='linear', positionwise_conv_kernel_size=1, padding_idx=-1, num_spkrs=2)[source]

Bases: espnet.nets.pytorch_backend.transformer.encoder.Encoder, torch.nn.modules.module.Module

Transformer encoder module.

Parameters
  • idim (int) – input dim

  • attention_dim (int) – dimension of attention

  • attention_heads (int) – the number of heads of multi head attention

  • linear_units (int) – the number of units of position-wise feed forward

  • num_blocks_sd (int) – the number of speaker-differentiating encoder blocks

  • num_blocks_rec (int) – the number of shared recognition encoder blocks

  • dropout_rate (float) – dropout rate

  • attention_dropout_rate (float) – dropout rate in attention

  • positional_dropout_rate (float) – dropout rate after adding positional encoding

  • input_layer (str or torch.nn.Module) – input layer type

  • pos_enc_class (class) – PositionalEncoding or ScaledPositionalEncoding

  • normalize_before (bool) – whether to use layer_norm before the first block

  • concat_after (bool) – whether to concatenate the attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x)

  • positionwise_layer_type (str) – linear or conv1d

  • positionwise_conv_kernel_size (int) – kernel size of positionwise conv1d layer

  • padding_idx (int) – padding_idx for input_layer=embed

Construct an Encoder object.

forward(xs, masks)[source]

Encode input sequence.

Parameters
  • xs (torch.Tensor) – input tensor

  • masks (torch.Tensor) – input mask

Returns

position embedded tensor and mask

Return type

Tuple[torch.Tensor, torch.Tensor]

forward_one_step(xs, masks, cache=None)[source]

Encode input frame.

Parameters
  • xs (torch.Tensor) – input tensor

  • masks (torch.Tensor) – input mask

  • cache (List[torch.Tensor]) – cache tensors

Returns

position embedded tensor, mask and new cache

Return type

Tuple[torch.Tensor, torch.Tensor, List[torch.Tensor]]

espnet.nets.pytorch_backend.transformer.encoder

Encoder definition.

class espnet.nets.pytorch_backend.transformer.encoder.Encoder(idim, selfattention_layer_type='selfattn', attention_dim=256, attention_heads=4, conv_wshare=4, conv_kernel_length=11, conv_usebias=False, linear_units=2048, num_blocks=6, dropout_rate=0.1, positional_dropout_rate=0.1, attention_dropout_rate=0.0, input_layer='conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before=True, concat_after=False, positionwise_layer_type='linear', positionwise_conv_kernel_size=1, padding_idx=-1)[source]

Bases: torch.nn.modules.module.Module

Transformer encoder module.

Parameters
  • idim (int) – Input dimension.

  • attention_dim (int) – Dimension of attention.

  • attention_heads (int) – The number of heads of multi head attention.

  • linear_units (int) – The number of units of position-wise feed forward.

  • num_blocks (int) – The number of encoder blocks.

  • dropout_rate (float) – Dropout rate.

  • attention_dropout_rate (float) – Dropout rate in attention.

  • positional_dropout_rate (float) – Dropout rate after adding positional encoding.

  • input_layer (Union[str, torch.nn.Module]) – Input layer type.

  • pos_enc_class (torch.nn.Module) – Positional encoding module class. PositionalEncoding or ScaledPositionalEncoding

  • normalize_before (bool) – Whether to use layer_norm before the first block.

  • concat_after (bool) – Whether to concatenate the attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x)

  • positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.

  • positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.

  • padding_idx (int) – Padding idx for input_layer=embed.

Construct an Encoder object.

forward(xs, masks)[source]

Encode input sequence.

Parameters
  • xs (torch.Tensor) – Input tensor (#batch, time, idim).

  • masks (torch.Tensor) – Mask tensor (#batch, time).

Returns

Output tensor (#batch, time, attention_dim). torch.Tensor: Mask tensor (#batch, time).

Return type

torch.Tensor

forward_one_step(xs, masks, cache=None)[source]

Encode input frame.

Parameters
  • xs (torch.Tensor) – Input tensor.

  • masks (torch.Tensor) – Mask tensor.

  • cache (List[torch.Tensor]) – List of cache tensors.

Returns

Output tensor. torch.Tensor: Mask tensor. List[torch.Tensor]: List of new cache tensors.

Return type

torch.Tensor

get_positionwise_layer(positionwise_layer_type='linear', attention_dim=256, linear_units=2048, dropout_rate=0.1, positionwise_conv_kernel_size=1)[source]

Define positionwise layer.
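
A minimal end-to-end sketch of the encoder (make_non_pad_mask from espnet.nets.pytorch_backend.nets_utils is the usual way to derive the mask; with the default conv2d input layer, the time axis is subsampled by roughly a factor of 4):

import torch

from espnet.nets.pytorch_backend.nets_utils import make_non_pad_mask
from espnet.nets.pytorch_backend.transformer.encoder import Encoder

enc = Encoder(idim=83, attention_dim=256, attention_heads=4, num_blocks=6)
xs = torch.randn(4, 100, 83)                    # (#batch, time, idim)
ilens = [100, 90, 80, 70]
masks = make_non_pad_mask(ilens).unsqueeze(-2)  # (#batch, 1, time)
hs, hs_masks = enc(xs, masks)                   # hs: (#batch, ~time/4, 256)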

espnet.nets.pytorch_backend.transducer.vgg2l

VGG2L module definition for transformer encoder.

class espnet.nets.pytorch_backend.transducer.vgg2l.VGG2L(idim, odim)[source]

Bases: torch.nn.modules.module.Module

VGG2L module for transformer encoder.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

Construct a VGG2L object.

create_new_mask(x_mask, x)[source]

Create a subsampled version of x_mask.

Parameters
  • x_mask (torch.Tensor) – (B, 1, T)

  • x (torch.Tensor) – (B, sub(T), attention_dim)

Returns

(B, 1, sub(T))

Return type

x_mask (torch.Tensor)

forward(x, x_mask)[source]

VGG2L forward for x.

Parameters
  • x (torch.Tensor) – input tensor (B, T, idim)

  • x_mask (torch.Tensor) – (B, 1, T)

Returns

output tensor (B, sub(T), odim) x_mask (torch.Tensor): (B, 1, sub(T))

Return type

x (torch.Tensor)

espnet.nets.pytorch_backend.transducer.tdnn

TDNN modules definition for transformer encoder.

class espnet.nets.pytorch_backend.transducer.tdnn.TDNN(idim, odim, ctx_size=5, dilation=1, stride=1, batch_norm=True, relu=True, dropout_rate=0.0)[source]

Bases: torch.nn.modules.module.Module

TDNN implementation based on Peddinti et al.

Reference: https://www.danielpovey.com/files/2015_interspeech_multisplice.pdf

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • ctx_size (int) – size of context window

  • stride (int) – stride of the sliding blocks

  • dilation (int) – parameter to control the stride of elements within the neighborhood

  • batch_norm (bool) – whether to use batch normalization

  • relu (bool) – whether to use non-linearity layer (ReLU)

Construct a TDNN object.

forward(xs, masks)[source]

Forward TDNN.

Parameters
  • xs (torch.Tensor) – input tensor (B, seq_len, idim)

  • masks (torch.Tensor) – input mask (B, 1, seq_len)

Returns

output tensor (B, new_seq_len, odim) masks (torch.Tensor): output mask (B, 1, new_seq_len)

Return type

xs (torch.Tensor)
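
A usage sketch following the documented signature (with stride=1 and dilation=1, new_seq_len shrinks by ctx_size - 1 frames, as in a plain 1-D convolution without padding):

import torch

from espnet.nets.pytorch_backend.transducer.tdnn import TDNN

tdnn = TDNN(idim=80, odim=256, ctx_size=5, dilation=1, stride=1)
xs = torch.randn(4, 120, 80)                     # (B, seq_len, idim)
masks = torch.ones(4, 1, 120, dtype=torch.bool)  # (B, 1, seq_len)
xs_out, masks_out = tdnn(xs, masks)              # (B, 116, 256), (B, 1, 116)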

espnet.nets.pytorch_backend.transducer.blocks

Set of methods to create transformer-based block.

espnet.nets.pytorch_backend.transducer.blocks.build_blocks(net_part, idim, input_layer, blocks_arch, repeat_block=0, self_attn_type='self_attn', positional_encoding_type='abs_pos', positionwise_layer_type='linear', positionwise_activation_type='relu', conv_mod_activation_type='relu', dropout_rate_embed=0.0, padding_idx=-1)[source]

Build block for transformer-based models.

Parameters
  • net_part (str) – either ‘encoder’ or ‘decoder’

  • idim (int) – dimension of inputs

  • input_layer (str) – input layer type

  • blocks_arch (list) – list of blocks for network part (type and parameters)

  • repeat_block (int) – repeat provided blocks N times if N > 1

  • self_attn_type (str) – self-attention module type

  • positional_encoding_type (str) – positional encoding layer type

  • positionwise_layer_type (str) – positionwise layer type

  • positionwise_activation_type (str) – positionwise activation type

  • conv_mod_activation_type (str) – convolutional module activation type

  • dropout_rate_embed (float) – dropout rate for embedding

  • padding_idx (int) – padding index for embedding input layer (if specified)

Returns

input layer all_blocks (MultiSequential): all blocks for network part out_dim (int): dimension of last block output

Return type

in_layer (torch.nn.*)

espnet.nets.pytorch_backend.transducer.blocks.build_causal_conv1d_block(block_arch)[source]

Build function for causal conv1d block.

Parameters

block_arch (dict) – causal conv1d block parameters

Returns

function to create causal conv1d block

Return type

(function)

espnet.nets.pytorch_backend.transducer.blocks.build_conformer_block(block_arch, self_attn_class, pos_enc_class, pw_layer_type, pw_activation_type, conv_mod_activation_type)[source]

Build function for conformer block.

Parameters
  • block_arch (dict) – conformer block parameters

  • self_attn_class (class) – self-attention module class

  • pos_enc_class (class) – positional encoding class

  • pw_layer_type (str) – positionwise layer type

  • pw_activation_type (str) – positionwise activation type

  • conv_mod_activation_type (str) – convolutional module activation type

Returns

function to create conformer block

Return type

(function)

espnet.nets.pytorch_backend.transducer.blocks.build_input_layer(input_layer, idim, odim, pos_enc_class, dropout_rate_embed, dropout_rate, pos_dropout_rate, padding_idx)[source]

Build input layer.

Parameters
  • input_layer (str) – input layer type

  • idim (int) – input dimension

  • odim (int) – output dimension

  • pos_enc_class (class) – positional encoding class

  • dropout_rate_embed (float) – dropout rate for embedding layer

  • dropout_rate (float) – dropout rate for input layer

  • pos_dropout_rate (float) – dropout rate for positional encoding

  • padding_idx (int) – padding index for embedding input layer (if specified)

Returns

input layer module

Return type

(torch.nn.*)

espnet.nets.pytorch_backend.transducer.blocks.build_tdnn_block(block_arch)[source]

Build function for tdnn block.

Parameters

block_arch (dict) – tdnn block parameters

Returns

function to create tdnn block

Return type

(function)

espnet.nets.pytorch_backend.transducer.blocks.build_transformer_block(net_part, block_arch, pw_layer_type, pw_activation_type)[source]

Build function for transformer block.

Parameters
  • net_part (str) – either ‘encoder’ or ‘decoder’

  • block_arch (dict) – transformer block parameters

  • pw_layer_type (str) – positionwise layer type

  • pw_activation_type (str) – positionwise activation type

Returns

function to create transformer block

Return type

(function)

espnet.nets.pytorch_backend.transducer.blocks.check_and_prepare(net_part, blocks_arch, input_layer)[source]

Check consecutive block shapes match and prepare input parameters.

Parameters
  • net_part (str) – either ‘encoder’ or ‘decoder’

  • blocks_arch (list) – list of blocks for network part (type and parameters)

  • input_layer (str) – input layer type

Returns

input layer type input_layer_odim (int): output dim of input layer input_dropout_rate (float): dropout rate of input layer input_pos_dropout_rate (float): dropout rate of input layer positional enc. out_dim (int): output dim of last block

Return type

input_layer (str)

espnet.nets.pytorch_backend.transducer.blocks.get_pos_enc_and_att_class(net_part, pos_enc_type, self_attn_type)[source]

Get positional encoding and self attention module class.

Parameters
  • net_part (str) – either ‘encoder’ or ‘decoder’

  • pos_enc_type (str) – positional encoding type

  • self_attn_type (str) – self-attention type

Returns

positional encoding class self_attn_class (torch.nn.Module): self-attention class

Return type

pos_enc_class (torch.nn.Module)

espnet.nets.pytorch_backend.transducer.rnn_decoder

RNN-Transducer implementation for training and decoding.

class espnet.nets.pytorch_backend.transducer.rnn_decoder.DecoderRNNT(eprojs, odim, dtype, dlayers, dunits, blank, embed_dim, joint_dim, joint_activation_type='tanh', dropout=0.0, dropout_embed=0.0)[source]

Bases: espnet.nets.transducer_decoder_interface.TransducerDecoderInterface, torch.nn.modules.module.Module

RNN-T Decoder module.

Parameters
  • eprojs (int) – # encoder projection units

  • odim (int) – dimension of outputs

  • dtype (str) – gru or lstm

  • dlayers (int) – # prediction layers

  • dunits (int) – # prediction units

  • blank (int) – blank symbol id

  • embed_dim (int) – dimension of embeddings

  • joint_dim (int) – dimension of joint space

  • joint_activation_type (str) – joint network activation

  • dropout (float) – dropout rate

  • dropout_embed (float) – embedding dropout rate

Transducer initializer.

batch_score(hyps, batch_states, cache, init_tensor=None)[source]

Forward batch one step.

Parameters
  • hyps (list) – batch of hypotheses

  • batch_states (tuple) – batch of decoder states ([L x (B, dec_dim)], [L x (B, dec_dim)])

  • cache (dict) – states cache

Returns

decoder output (B, dec_dim) batch_states (tuple): batch of decoder states

([L x (B, dec_dim)], [L x (B, dec_dim)])

lm_tokens (torch.Tensor): batch of token ids for LM (B)

Return type

batch_y (torch.Tensor)

create_batch_states(batch_states, l_states, l_tokens=None)[source]

Create batch of decoder states.

Parameters
  • batch_states (tuple) – batch of decoder states ([L x (B, dec_dim)], [L x (B, dec_dim)])

  • l_states (list) – list of decoder states [B x ([L x (1, dec_dim)], [L x (1, dec_dim)])]

Returns

batch of decoder states

([L x (B, dec_dim)], [L x (B, dec_dim)])

Return type

batch_states (tuple)

forward(hs_pad, ys_in_pad, hlens=None)[source]

Forward function for transducer.

Parameters
  • hs_pad (torch.Tensor) – batch of padded hidden state sequences (B, Tmax, D)

  • ys_in_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax+1)

Returns

output (B, T, U, odim)

Return type

z (torch.Tensor)

init_state(init_tensor)[source]

Initialize decoder states.

Parameters

init_tensor (torch.Tensor) – batch of input features (B, emb_dim / dec_dim)

Returns

batch of decoder states

([L x (B, dec_dim)], [L x (B, dec_dim)])

Return type

(tuple)

rnn_forward(ey, state)[source]

RNN forward.

Parameters
  • ey (torch.Tensor) – batch of input features (B, emb_dim)

  • state (tuple) – batch of decoder states (L x (B, dec_dim), L x (B, dec_dim))

Returns

batch of output features (B, dec_dim) (tuple): batch of decoder states

(L x (B, dec_dim), L x (B, dec_dim))

Return type

y (torch.Tensor)

score(hyp, cache, init_tensor=None)[source]

Forward one step.

Parameters
  • hyp (dataclass) – hypothesis

  • cache (dict) – states cache

Returns

decoder outputs (1, dec_dim) state (tuple): decoder states

([L x (1, dec_dim)], [L x (1, dec_dim)]),

(torch.Tensor): token id for LM (1)

Return type

y (torch.Tensor)

select_state(batch_states, idx)[source]

Get decoder state from batch of states, for given id.

Parameters
  • batch_states (tuple) – batch of decoder states ([L x (B, dec_dim)], [L x (B, dec_dim)])

  • idx (int) – index to extract state from batch of states

Returns

decoder states for given id

([L x (1, dec_dim)], [L x (1, dec_dim)])

Return type

(tuple)

espnet.nets.pytorch_backend.transducer.joint_network

Transducer joint network implementation.

class espnet.nets.pytorch_backend.transducer.joint_network.JointNetwork(vocab_size: int, encoder_output_size: int, hidden_size: int, joint_space_size: int, joint_activation_type: str)[source]

Bases: torch.nn.modules.module.Module

Transducer joint network module.

Parameters
  • vocab_size – Output vocabulary size

  • encoder_output_size – Dimension of encoder outputs

  • hidden_size – Dimension of decoder hidden outputs

  • joint_space_size – Dimension of joint space

  • joint_activation_type – Activation type for joint network

Joint network initializer.

forward(h_enc: torch.Tensor, h_dec: torch.Tensor) → torch.Tensor[source]

Joint computation of z.

Parameters
  • h_enc – Batch of expanded hidden state (B, T, 1, D_enc)

  • h_dec – Batch of expanded hidden state (B, 1, U, D_dec)

Returns

Output (B, T, U, vocab_size)

Return type

z
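
The joint computation is the standard transducer recipe: project both streams to the joint space, combine them by broadcasting over the (T, U) grid, apply the activation, and project to the vocabulary. A shape-level sketch (the layer names here are illustrative, not the module's actual attributes):

import torch

B, T, U = 4, 100, 20
enc_dim, dec_dim, joint_dim, vocab = 256, 256, 320, 1000
lin_enc = torch.nn.Linear(enc_dim, joint_dim)
lin_dec = torch.nn.Linear(dec_dim, joint_dim)
lin_out = torch.nn.Linear(joint_dim, vocab)

h_enc = torch.randn(B, T, 1, enc_dim)  # expanded encoder states
h_dec = torch.randn(B, 1, U, dec_dim)  # expanded decoder states
z = lin_out(torch.tanh(lin_enc(h_enc) + lin_dec(h_dec)))  # (B, T, U, vocab)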

espnet.nets.pytorch_backend.transducer.transformer_encoder

Encoder definition for transformer-transducer models.

class espnet.nets.pytorch_backend.transducer.transformer_encoder.Encoder(idim, enc_arch, input_layer='linear', repeat_block=0, self_attn_type='selfattn', positional_encoding_type='abs_pos', positionwise_layer_type='linear', positionwise_activation_type='relu', conv_mod_activation_type='relu', normalize_before=True, padding_idx=-1)[source]

Bases: torch.nn.modules.module.Module

Transformer encoder module.

Parameters
  • idim (int) – input dim

  • enc_arch (list) – list of encoder blocks (type and parameters)

  • input_layer (str) – input layer type

  • repeat_block (int) – repeat provided block N times if N > 1

  • self_attn_type (str) – type of self-attention

  • positional_encoding_type (str) – positional encoding type

  • positionwise_layer_type (str) – positionwise layer type

  • positionwise_activation_type (str) – positionwise activation type

  • conv_mod_activation_type (str) – convolutional module activation type

  • normalize_before (bool) – whether to use layer_norm before the first block

  • padding_idx (int) – padding_idx for embedding input layer (if specified)

Construct a Transformer encoder object.

forward(xs, masks)[source]

Encode input sequence.

Parameters
  • xs (torch.Tensor) – input tensor

  • masks (torch.Tensor) – input mask

Returns

position-embedded input masks (torch.Tensor): position-embedded mask

Return type

xs (torch.Tensor)

espnet.nets.pytorch_backend.transducer.initializer

Parameter initialization for transducer RNN/Transformer parts.

espnet.nets.pytorch_backend.transducer.initializer.initializer(model, args)[source]

Initialize transducer model.

Parameters
  • model (torch.nn.Module) – transducer instance

  • args (Namespace) – argument Namespace containing options

espnet.nets.pytorch_backend.transducer.utils

Utility functions for transducer models.

espnet.nets.pytorch_backend.transducer.utils.check_state(state, max_len, pad_token)[source]

Left pad or trim state according to max_len.

Parameters
  • state (list) – list of L decoder states (in_len, dec_dim)

  • max_len (int) – maximum length authorized

  • pad_token (int) – padding token id

Returns

list of L padded decoder states (1, max_len, dec_dim)

Return type

final (list)

espnet.nets.pytorch_backend.transducer.utils.create_lm_batch_state(lm_states_list, lm_type, lm_layers)[source]

Create batch of LM states.

Parameters
  • lm_states_list (list or dict) – list of individual LM states

  • lm_type (str) – type of LM

  • lm_layers (int) – number of LM layers

Returns

batch of LM states

Return type

batch_states (list)

espnet.nets.pytorch_backend.transducer.utils.init_lm_state(lm_model)[source]

Initialize LM state.

Parameters

lm_model (torch.nn.Module) – LM module

Returns

initial LM state

Return type

lm_state (dict)

espnet.nets.pytorch_backend.transducer.utils.is_prefix(x, pref)[source]

Check prefix.

Parameters
  • x (list) – token id sequence

  • pref (list) – token id sequence

Returns

whether pref is a prefix of x.

Return type

(boolean)
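
For example (a sequence is not treated as a prefix of itself, so pref must be strictly shorter than x):

>>> is_prefix([1, 2, 3], [1, 2])
True
>>> is_prefix([1, 2, 3], [1, 3])
False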

espnet.nets.pytorch_backend.transducer.utils.pad_batch_state(state, pred_length, pad_token)[source]

Left pad batch of states and trim if necessary.

Parameters
  • state (list) – list of L decoder states (B, ?, dec_dim)

  • pred_length (int) – maximum length authorized (trimming)

  • pad_token (int) – padding token id

Returns

list of L padded decoder states (B, pred_length, dec_dim)

Return type

final (list)

espnet.nets.pytorch_backend.transducer.utils.pad_sequence(seqlist, pad_token)[source]

Left pad list of token id sequences.

Parameters
  • seqlist (list) – list of token id sequences

  • pad_token (int) – padding token id

Returns

list of padded token id sequences

Return type

final (list)
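
For example, left-padding to the length of the longest sequence:

>>> pad_sequence([[1, 2], [1, 2, 3]], 0)
[[0, 1, 2], [1, 2, 3]]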

espnet.nets.pytorch_backend.transducer.utils.prepare_loss_inputs(ys_pad, hlens, blank_id=0, ignore_id=-1)[source]

Prepare tensors for transducer loss computation.

Parameters
  • ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)

  • hlens (torch.Tensor) – batch of hidden sequence lengths (B) or batch of masks (B, 1, Tmax)

  • blank_id (int) – index of blank label

  • ignore_id (int) – index of initial padding

Returns

batch of padded target sequences + blank (B, Lmax + 1) target (torch.Tensor): batch of padded target sequences (B, Lmax) pred_len (torch.Tensor): batch of hidden sequence lengths (B) target_len (torch.Tensor): batch of output sequence lengths (B)

Return type

ys_in_pad (torch.Tensor)
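
A sketch of typical usage with a padded label batch (values are illustrative; -1 follows the default ignore_id):

import torch

from espnet.nets.pytorch_backend.transducer.utils import prepare_loss_inputs

ys_pad = torch.tensor([[5, 6, 7], [5, 6, -1]])  # (B, Lmax), -1 is padding
hlens = torch.tensor([40, 35])                  # encoder output lengths (B)
ys_in_pad, target, pred_len, target_len = prepare_loss_inputs(ys_pad, hlens)
# ys_in_pad: (B, Lmax + 1), label sequences with the blank prepended (decoder input)
# target, pred_len, target_len: the remaining inputs expected by the transducer loss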

espnet.nets.pytorch_backend.transducer.utils.recombine_hyps(hyps)[source]

Recombine hypotheses with equivalent output sequence.

Parameters

hyps (list) – list of hypotheses

Returns

list of recombined hypotheses

Return type

final (list)

espnet.nets.pytorch_backend.transducer.utils.select_lm_state(lm_states, idx, lm_type, lm_layers)[source]

Get LM state from batch for given id.

Parameters
  • lm_states (list or dict) – batch of LM states

  • idx (int) – index to extract state from batch state

  • lm_type (str) – type of LM

  • lm_layers (int) – number of LM layers

Returns

LM state for given id

Return type

idx_state (dict)

espnet.nets.pytorch_backend.transducer.utils.substract(x, subset)[source]

Remove elements of subset whose token id sequences exist in x.

Parameters
  • x (list) – set of hypotheses

  • subset (list) – subset of hypotheses

Returns

new set

Return type

final (list)

espnet.nets.pytorch_backend.transducer.transformer_decoder

Decoder definition for transformer-transducer models.

class espnet.nets.pytorch_backend.transducer.transformer_decoder.DecoderTT(odim, edim, jdim, dec_arch, input_layer='embed', repeat_block=0, joint_activation_type='tanh', positional_encoding_type='abs_pos', positionwise_layer_type='linear', positionwise_activation_type='relu', dropout_rate_embed=0.0, blank=0)[source]

Bases: espnet.nets.transducer_decoder_interface.TransducerDecoderInterface, torch.nn.modules.module.Module

Decoder module for transformer-transducer models.

Parameters
  • odim (int) – dimension of outputs

  • edim (int) – dimension of encoder outputs

  • jdim (int) – dimension of joint-space

  • dec_arch (list) – list of layer definitions

  • input_layer (str) – input layer type

  • repeat_block (int) – repeat provided blocks N times if N > 1

  • joint_activation_type (str) – joint network activation type

  • positional_encoding_type (str) – positional encoding type

  • positionwise_layer_type (str) – positionwise layer type

  • positionwise_activation_type (str) – positionwise activation type

  • dropout_rate_embed (float) – dropout rate for embedding layer (if specified)

  • blank (int) – blank symbol ID

Construct a Decoder object for transformer-transducer models.

batch_score(hyps, batch_states, cache, init_tensor=None)[source]

Forward batch one step.

Parameters
  • hyps (list) – batch of hypotheses

  • batch_states (list) – decoder states [L x (B, max_len, dec_dim)]

  • cache (dict) – states cache

Returns

decoder output (B, dec_dim) batch_states (list): decoder states

[L x (B, max_len, dec_dim)]

lm_tokens (torch.Tensor): batch of token ids for LM (B)

Return type

batch_y (torch.Tensor)

create_batch_states(batch_states, l_states, l_tokens)[source]

Create batch of decoder states.

Parameters
  • batch_states (list) – batch of decoder states [L x (B, max_len, dec_dim)]

  • l_states (list) – list of decoder states [B x [L x (1, max_len, dec_dim)]]

  • l_tokens (list) – list of token sequences for batch

Returns

batch of decoder and attention states

[L x (B, max_len, dec_dim)]

Return type

batch_states (list)

forward(tgt, tgt_mask, memory)[source]

Forward transformer-transducer decoder.

Parameters
  • tgt (torch.Tensor) – input token ids, int64 (batch, maxlen_out) if input_layer == “embed”; input tensor (batch, maxlen_out, #mels) in the other cases

  • tgt_mask (torch.Tensor) – input token mask (batch, maxlen_out); dtype=torch.uint8 before PyTorch 1.2, dtype=torch.bool in PyTorch 1.2 and later

  • memory (torch.Tensor) – encoded memory, float32 (batch, maxlen_in, feat)

Returns

joint output (batch, maxlen_in, maxlen_out, odim) tgt_mask (torch.Tensor): score mask before softmax (batch, maxlen_out)

Return type

z (torch.Tensor)

init_state(init_tensor=None)[source]

Initialize decoder states.

Parameters

init_tensor (torch.Tensor) – batch of input features (B, dec_dim)

Returns

batch of decoder states [L x None]

Return type

state (list)

score(hyp, cache, init_tensor=None)[source]

Forward one step.

Parameters
  • hyp (dataclass) – hypothesis

  • cache (dict) – states cache

Returns

decoder outputs (1, dec_dim) (list): decoder and attention states

[L x (1, max_len, dec_dim)]

lm_tokens (torch.Tensor): token id for LM (1)

Return type

y (torch.Tensor)

select_state(batch_states, idx)[source]

Get decoder state from batch of states, for given id.

Parameters
  • batch_states (list) – batch of decoder states [L x (B, max_len, dec_dim)]

  • idx (int) – index to extract state from batch of states

Returns

decoder states for given id

[L x (1, max_len, dec_dim)]

Return type

state_idx (list)

espnet.nets.pytorch_backend.transducer.causal_conv1d

CausalConv1d module definition for transformer decoder.

class espnet.nets.pytorch_backend.transducer.causal_conv1d.CausalConv1d(idim, odim, kernel_size, stride=1, dilation=1, groups=1, bias=True)[source]

Bases: torch.nn.modules.module.Module

CausalConv1d module for transformer decoder.

Parameters
  • idim (int) – dimension of inputs

  • odim (int) – dimension of outputs

  • kernel_size (int) – size of convolving kernel

  • stride (int) – stride of the convolution

  • dilation (int) – spacing between the kernel points

  • groups (int) – number of blocked connections from ichannels to ochannels

  • bias (bool) – whether to add a learnable bias to the output

Construct a CausalConv1d object.

forward(x, x_mask, cache=None)[source]

CausalConv1d forward for x.

Parameters
  • x (torch.Tensor) – input torch (B, U, idim)

  • x_mask (torch.Tensor) – (B, 1, U)

Returns

input torch (B, sub(U), attention_dim) x_mask (torch.Tensor): (B, 1, sub(U))

Return type

x (torch.Tensor)
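
Causality comes from the usual recipe of left-padding the time axis by (kernel_size - 1) * dilation so that output frame u never depends on inputs after u. A minimal sketch of that recipe (not this module's literal code):

import torch

B, U, idim, odim, k = 4, 20, 256, 256, 3
conv = torch.nn.Conv1d(idim, odim, kernel_size=k)
x = torch.randn(B, U, idim).transpose(1, 2)  # Conv1d expects (B, idim, U)
x = torch.nn.functional.pad(x, (k - 1, 0))   # pad on the left only
y = conv(x).transpose(1, 2)                  # (B, U, odim), causal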

espnet.nets.pytorch_backend.transducer.__init__

Initialize sub package.

espnet.nets.pytorch_backend.transducer.loss

Transducer loss module.

class espnet.nets.pytorch_backend.transducer.loss.TransLoss(trans_type, blank_id)[source]

Bases: torch.nn.modules.module.Module

Transducer loss module.

Parameters
  • trans_type (str) – type of transducer implementation to calculate loss.

  • blank_id (int) – blank symbol id

Construct a TransLoss object.

forward(pred_pad, target, pred_len, target_len)[source]

Compute path-aware regularization transducer loss.

Parameters
  • pred_pad (torch.Tensor) – Batch of predicted sequences (batch, maxlen_in, maxlen_out+1, odim)

  • target (torch.Tensor) – Batch of target sequences (batch, maxlen_out)

  • pred_len (torch.Tensor) – batch of lengths of predicted sequences (batch)

  • target_len (torch.Tensor) – batch of lengths of target sequences (batch)

Returns

transducer loss

Return type

loss (torch.Tensor)
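
A sketch of a training-step call, assuming a supported transducer loss backend (e.g. warp-transducer) is installed; pred_pad stands in for the joint network output documented above, and the int32 dtypes follow the common warp-transducer convention:

import torch

from espnet.nets.pytorch_backend.transducer.loss import TransLoss

criterion = TransLoss(trans_type="warp-transducer", blank_id=0)
B, T, U, odim = 2, 40, 6, 50
pred_pad = torch.randn(B, T, U + 1, odim)  # (batch, maxlen_in, maxlen_out+1, odim)
target = torch.randint(1, odim, (B, U), dtype=torch.int32)
pred_len = torch.tensor([T, 35], dtype=torch.int32)
target_len = torch.tensor([U, 4], dtype=torch.int32)
loss = criterion(pred_pad, target, pred_len, target_len)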

espnet.nets.pytorch_backend.transducer.transformer_decoder_layer

Decoder layer definition for transformer-transducer models.

class espnet.nets.pytorch_backend.transducer.transformer_decoder_layer.DecoderLayer(size, self_attn, feed_forward, dropout_rate)[source]

Bases: torch.nn.modules.module.Module

Single decoder layer module for transformer-transducer models.

Parameters
  • size (int) – input dim

  • self_attn (MultiHeadedAttention) – self attention module

  • feed_forward (PositionwiseFeedForward) – feed forward layer module

  • dropout_rate (float) – dropout rate

Construct a DecoderLayer object.

forward(tgt, tgt_mask, cache=None)[source]

Compute decoded features.

Parameters
  • tgt (torch.Tensor) – decoded previous target features (B, Lmax, idim)

  • tgt_mask (torch.Tensor) – mask for tgt (B, Lmax)

  • cache (torch.Tensor) – cached output (B, Lmax-1, idim)

Returns

decoder target features (B, Lmax, odim) tgt_mask (torch.Tensor): mask for tgt (B, Lmax)

Return type

tgt (torch.Tensor)

espnet.nets.pytorch_backend.transducer.rnn_att_decoder

RNN-Transducer with attention implementation for training and decoding.

class espnet.nets.pytorch_backend.transducer.rnn_att_decoder.DecoderRNNTAtt(eprojs, odim, dtype, dlayers, dunits, blank, att, embed_dim, joint_dim, joint_activation_type='tanh', dropout=0.0, dropout_embed=0.0)[source]

Bases: espnet.nets.transducer_decoder_interface.TransducerDecoderInterface, torch.nn.modules.module.Module

RNNT-Att Decoder module.

Parameters
  • eprojs (int) – # encoder projection units

  • odim (int) – dimension of outputs

  • dtype (str) – gru or lstm

  • dlayers (int) – # decoder layers

  • dunits (int) – # decoder units

  • blank (int) – blank symbol id

  • att (torch.nn.Module) – attention module

  • embed_dim (int) – dimension of embeddings

  • joint_dim (int) – dimension of joint space

  • joint_activation_type (str) – joint network activation

  • dropout (float) – dropout rate

  • dropout_embed (float) – embedding dropout rate

Transducer with attention initializer.

batch_score(hyps, batch_states, cache, init_tensor)[source]

Forward batch one step.

Parameters
  • hyps (list) – batch of hypotheses

  • batch_states (tuple) – batch of decoder and attention states (([L x (B, dec_dim)], [L x (B, dec_dim)]), (B, max_len))

  • cache (dict) – states cache

  • init_tensor – encoder outputs for att. computation (1, max_enc_len)

Returns

decoder output (B, dec_dim) batch_states (tuple): batch of decoder and attention states

(([L x (B, dec_dim)], [L x (B, dec_dim)]), (B, max_len))

lm_tokens (torch.Tensor): batch of token ids for LM (B)

Return type

batch_y (torch.Tensor)

calculate_all_attentions(hs_pad, hlens, ys_pad)[source]

Calculate all of attentions.

Parameters
  • hs_pad (torch.Tensor) – batch of padded hidden state sequences (B, Tmax, D)

  • hlens (torch.Tensor) – batch of lengths of hidden state sequences (B)

  • ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)

Returns

attention weights with the following shape,
  1. multi-head case => attention weights (B, H, Lmax, Tmax),

  2. other case => attention weights (B, Lmax, Tmax).

Return type

att_ws (ndarray)

create_batch_states(batch_states, l_states, l_tokens=None)[source]

Create batch of decoder and attention states.

Parameters
  • batch_states (tuple) – batch of decoder and attention states (([L x (B, dec_dim)], [L x (B, dec_dim)]), (B, max_len))

  • l_states (list) – list of single decoder and attention states [B x (([L x (1, dec_dim)], [L x (1, dec_dim)]), (1, max_len))]

Returns

batch of decoder and attention states

(([L x (B, dec_dim)], [L x (B, dec_dim)]), (B, max_len))

Return type

(tuple)

forward(hs_pad, ys_in_pad, hlens=None)[source]

Forward function for transducer with attention.

Parameters
  • hs_pad (torch.Tensor) – batch of padded hidden state sequences (B, Tmax, D)

  • ys_in_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax+1)

Returns

output (B, T, U, odim)

Return type

z (torch.Tensor)

init_state(init_tensor)[source]

Initialize decoder states.

Parameters

init_tensor (torch.Tensor) – batch of input features (B, (emb_dim + eprojs) / dec_dim)

Returns

batch of decoder and attention states

([L x (B, dec_dim)], [L x (B, dec_dim)], None)

Return type

(tuple)

rnn_forward(ey, state)[source]

RNN forward.

Parameters
  • ey (torch.Tensor) – batch of input features (B, (emb_dim + eprojs))

  • state (tuple) – batch of decoder states ([L x (B, dec_dim)], [L x (B, dec_dim)])

Returns

decoder output for one step (B, dec_dim) (tuple): batch of decoder states

([L x (B, dec_dim)], [L x (B, dec_dim)])

Return type

y (torch.Tensor)

score(hyp, cache, init_tensor)[source]

Forward one step.

Parameters
  • hyp (dataclass) – hypothesis

  • cache (dict) – states cache

  • init_tensor (torch.Tensor) – initial tensor (1, max_len, dec_dim)

Returns

decoder outputs (1, dec_dim) (tuple): decoder and attention states

(([L x (1, dec_dim)], [L x (1, dec_dim)]), (1, max_len))

lm_tokens (torch.Tensor): token id for LM (1)

Return type

y (torch.Tensor)

select_state(batch_states, idx)[source]

Get decoder and attention state from batch of states, for given id.

Parameters
  • batch_states (tuple) – batch of decoder and attention states (([L x (B, dec_dim)], [L x (B, dec_dim)]), (B, max_len))

  • idx (int) – index to extract state from batch of states

Returns

decoder and attention states

(([L x (1, dec_dim)], [L x (1, dec_dim)]), (1, max_len))

Return type

(tuple)

espnet.nets.pytorch_backend.rnn.encoders

class espnet.nets.pytorch_backend.rnn.encoders.Encoder(etype, idim, elayers, eunits, eprojs, subsample, dropout, in_channel=1)[source]

Bases: torch.nn.modules.module.Module

Encoder module

Parameters
  • etype (str) – type of encoder network

  • idim (int) – number of dimensions of encoder network

  • elayers (int) – number of layers of encoder network

  • eunits (int) – number of lstm units of encoder network

  • eprojs (int) – number of projection units of encoder network

  • subsample (np.ndarray) – list of subsampling numbers

  • dropout (float) – dropout rate

  • in_channel (int) – number of input channels

forward(xs_pad, ilens, prev_states=None)[source]

Encoder forward

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, D)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • prev_states (torch.Tensor) – batch of previous encoder hidden states (?, …)

Returns

batch of hidden state sequences (B, Tmax, eprojs)

Return type

torch.Tensor

class espnet.nets.pytorch_backend.rnn.encoders.RNN(idim, elayers, cdim, hdim, dropout, typ='blstm')[source]

Bases: torch.nn.modules.module.Module

RNN module

Parameters
  • idim (int) – dimension of inputs

  • elayers (int) – number of encoder layers

  • cdim (int) – number of rnn units (resulted in cdim * 2 if bidirectional)

  • hdim (int) – number of final projection units

  • dropout (float) – dropout rate

  • typ (str) – The RNN type

forward(xs_pad, ilens, prev_state=None)[source]

RNN forward

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, D)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • prev_state (torch.Tensor) – batch of previous RNN states

Returns

batch of hidden state sequences (B, Tmax, eprojs)

Return type

torch.Tensor
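
A minimal usage sketch of the bare RNN module (besides the hidden state sequences, the forward call also returns the updated lengths and the per-layer RNN states; that three-value return is an assumption based on how the Encoder wrapper chains these modules):

import torch

from espnet.nets.pytorch_backend.rnn.encoders import RNN

rnn = RNN(idim=83, elayers=2, cdim=320, hdim=320, dropout=0.1, typ="blstm")
xs_pad = torch.randn(4, 100, 83)            # (B, Tmax, D)
ilens = torch.tensor([100, 90, 80, 70])     # (B)
ys_pad, ylens, states = rnn(xs_pad, ilens)  # ys_pad: (B, Tmax, hdim)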

class espnet.nets.pytorch_backend.rnn.encoders.RNNP(idim, elayers, cdim, hdim, subsample, dropout, typ='blstm')[source]

Bases: torch.nn.modules.module.Module

RNN with projection layer module

Parameters
  • idim (int) – dimension of inputs

  • elayers (int) – number of encoder layers

  • cdim (int) – number of rnn units (resulted in cdim * 2 if bidirectional)

  • hdim (int) – number of projection units

  • subsample (np.ndarray) – list of subsampling numbers

  • dropout (float) – dropout rate

  • typ (str) – The RNN type

forward(xs_pad, ilens, prev_state=None)[source]

RNNP forward

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

  • prev_state (torch.Tensor) – batch of previous RNN states

Returns

batch of hidden state sequences (B, Tmax, hdim)

Return type

torch.Tensor

class espnet.nets.pytorch_backend.rnn.encoders.VGG2L(in_channel=1)[source]

Bases: torch.nn.modules.module.Module

VGG-like module

Parameters

in_channel (int) – number of input channels

forward(xs_pad, ilens, **kwargs)[source]

VGG2L forward

Parameters
  • xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, D)

  • ilens (torch.Tensor) – batch of lengths of input sequences (B)

Returns

batch of padded hidden state sequences (B, Tmax // 4, 128 * D // 4)

Return type

torch.Tensor

espnet.nets.pytorch_backend.rnn.encoders.encoder_for(args, idim, subsample)[source]

Instantiates an encoder module given the program arguments

Parameters
  • args (Namespace) – The arguments

  • idim (int or List of int) – dimension of input, e.g. 83, or list of input dimensions of each encoder, e.g. [83, 83]

  • subsample (List or List of List) – subsample factors, e.g. [1,2,2,1,1], or list of subsample factors of each encoder, e.g. [[1,2,2,1,1], [1,2,2,1,1]]

Returns

The encoder module

Return type

torch.nn.Module

espnet.nets.pytorch_backend.rnn.encoders.reset_backward_rnn_state(states)[source]

Set backward BRNN states to zeros.

Useful when processing sliding windows over the inputs.

espnet.nets.pytorch_backend.rnn.decoders

class espnet.nets.pytorch_backend.rnn.decoders.Decoder(eprojs, odim, dtype, dlayers, dunits, sos, eos, att, verbose=0, char_list=None, labeldist=None, lsm_weight=0.0, sampling_probability=0.0, dropout=0.0, context_residual=False, replace_sos=False, num_encs=1)[source]

Bases: torch.nn.modules.module.Module, espnet.nets.scorer_interface.ScorerInterface

Decoder module

Parameters
  • eprojs (int) – encoder projection units

  • odim (int) – dimension of outputs

  • dtype (str) – gru or lstm

  • dlayers (int) – decoder layers

  • dunits (int) – decoder units

  • sos (int) – start of sequence symbol id

  • eos (int) – end of sequence symbol id

  • att (torch.nn.Module) – attention module

  • verbose (int) – verbose level

  • char_list (list) – list of character strings

  • labeldist (ndarray) – distribution of label smoothing

  • lsm_weight (float) – label smoothing weight

  • sampling_probability (float) – scheduled sampling probability

  • dropout (float) – dropout rate

  • context_residual (bool) – if True, use context vector for token generation

  • replace_sos (bool) – use for multilingual (speech/text) translation

calculate_all_attentions(hs_pad, hlen, ys_pad, strm_idx=0, lang_ids=None)[source]

Calculate all of attentions

Parameters
  • hs_pad (torch.Tensor) – batch of padded hidden state sequences (B, Tmax, D) [in multi-encoder case, list of torch.Tensor, [(B, Tmax_1, D), (B, Tmax_2, D), …]]

  • hlen (torch.Tensor) – batch of lengths of hidden state sequences (B) [in multi-encoder case, list of torch.Tensor, [(B), (B), …]]

  • ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)

  • strm_idx (int) – stream index for parallel speaker attention in multi-speaker case

  • lang_ids (torch.Tensor) – batch of target language id tensor (B, 1)

Returns

attention weights with the following shape, 1) multi-head case => attention weights (B, H, Lmax, Tmax), 2) multi-encoder case => [(B, Lmax, Tmax1), (B, Lmax, Tmax2), …, (B, Lmax, NumEncs)], 3) other case => attention weights (B, Lmax, Tmax).

Return type

float ndarray

forward(hs_pad, hlens, ys_pad, strm_idx=0, lang_ids=None)[source]

Decoder forward

Parameters
  • hs_pad (torch.Tensor) – batch of padded hidden state sequences (B, Tmax, D) [in multi-encoder case, list of torch.Tensor, [(B, Tmax_1, D), (B, Tmax_2, D), …]]

  • hlens (torch.Tensor) – batch of lengths of hidden state sequences (B) [in multi-encoder case, list of torch.Tensor, [(B), (B), …]]

  • ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)

  • strm_idx (int) – stream index indicates the index of decoding stream.

  • lang_ids (torch.Tensor) – batch of target language id tensor (B, 1)

Returns

attention loss value

Return type

torch.Tensor

Returns

accuracy

Return type

float

init_state(x)[source]

Get an initial state for decoding (optional).

Parameters

x (torch.Tensor) – The encoded feature tensor

Returns: initial state

recognize_beam(h, lpz, recog_args, char_list, rnnlm=None, strm_idx=0)[source]

beam search implementation

Parameters
  • h (torch.Tensor) – encoder hidden state (T, eprojs) [in multi-encoder case, list of torch.Tensor, [(T1, eprojs), (T2, eprojs), …]]

  • lpz (torch.Tensor) – ctc log softmax output (T, odim) [in multi-encoder case, list of torch.Tensor, [(T1, odim), (T2, odim), …]]

  • recog_args (Namespace) – argument Namespace containing options

  • char_list – list of character strings

  • rnnlm (torch.nn.Module) – language module

  • strm_idx (int) – stream index for speaker parallel attention in multi-speaker case

Returns

N-best decoding results

Return type

list of dicts

recognize_beam_batch(h, hlens, lpz, recog_args, char_list, rnnlm=None, normalize_score=True, strm_idx=0, lang_ids=None)[source]
rnn_forward(ey, z_list, c_list, z_prev, c_prev)[source]
score(yseq, state, x)[source]

Score new token (required).

Parameters
  • yseq (torch.Tensor) – 1D torch.int64 prefix tokens.

  • state – Scorer state for prefix tokens

  • x (torch.Tensor) – The encoder feature that generates ys.

Returns

Tuple of

scores for next token that has a shape of (n_vocab) and next state for ys

Return type

tuple[torch.Tensor, Any]

zero_state(hs_pad)[source]
espnet.nets.pytorch_backend.rnn.decoders.decoder_for(args, odim, sos, eos, att, labeldist)[source]

espnet.nets.pytorch_backend.rnn.attentions

Attention modules for RNN.

class espnet.nets.pytorch_backend.rnn.attentions.AttAdd(eprojs, dunits, att_dim, han_mode=False)[source]

Bases: torch.nn.modules.module.Module

Additive attention

Parameters
  • eprojs (int) – # projection-units of encoder

  • dunits (int) – # units of decoder

  • att_dim (int) – attention dimension

  • han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_enc_h

forward(enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0)[source]

AttAdd forward

Parameters
  • enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)

  • enc_hs_len (list) – padded encoder hidden state length (B)

  • dec_z (torch.Tensor) – decoder hidden state (B x D_dec)

  • att_prev (torch.Tensor) – dummy (not used)

  • scaling (float) – scaling parameter before applying softmax

Returns

attention weighted encoder state (B, D_enc)

Return type

torch.Tensor

Returns

previous attention weights (B x T_max)

Return type

torch.Tensor

reset()[source]

reset states

class espnet.nets.pytorch_backend.rnn.attentions.AttCov(eprojs, dunits, att_dim, han_mode=False)[source]

Bases: torch.nn.modules.module.Module

Coverage mechanism attention

Reference: Get To The Point: Summarization with Pointer-Generator Networks

(https://arxiv.org/abs/1704.04368)

Parameters
  • eprojs (int) – # projection-units of encoder

  • dunits (int) – # units of decoder

  • att_dim (int) – attention dimension

  • han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_enc_h

forward(enc_hs_pad, enc_hs_len, dec_z, att_prev_list, scaling=2.0)[source]

AttCov forward

Parameters
  • enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)

  • enc_hs_len (list) – padded encoder hidden state length (B)

  • dec_z (torch.Tensor) – decoder hidden state (B x D_dec)

  • att_prev_list (list) – list of previous attention weight

  • scaling (float) – scaling parameter before applying softmax

Returns

attention weighted encoder state (B, D_enc)

Return type

torch.Tensor

Returns

list of previous attention weights

Return type

list

reset()[source]

reset states

class espnet.nets.pytorch_backend.rnn.attentions.AttCovLoc(eprojs, dunits, att_dim, aconv_chans, aconv_filts, han_mode=False)[source]

Bases: torch.nn.modules.module.Module

Coverage mechanism location aware attention

This attention is a combination of coverage and location-aware attentions.

Parameters
  • eprojs (int) – # projection-units of encoder

  • dunits (int) – # units of decoder

  • att_dim (int) – attention dimension

  • aconv_chans (int) – # channels of attention convolution

  • aconv_filts (int) – filter size of attention convolution

  • han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_enc_h

forward(enc_hs_pad, enc_hs_len, dec_z, att_prev_list, scaling=2.0)[source]

AttCovLoc forward

Parameters
  • enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)

  • enc_hs_len (list) – padded encoder hidden state length (B)

  • dec_z (torch.Tensor) – decoder hidden state (B x D_dec)

  • att_prev_list (list) – list of previous attention weight

  • scaling (float) – scaling parameter before applying softmax

Returns

attention weighted encoder state (B, D_enc)

Return type

torch.Tensor

Returns

list of previous attention weights

Return type

list

reset()[source]

reset states

class espnet.nets.pytorch_backend.rnn.attentions.AttDot(eprojs, dunits, att_dim, han_mode=False)[source]

Bases: torch.nn.modules.module.Module

Dot product attention

Parameters
  • eprojs (int) – # projection-units of encoder

  • dunits (int) – # units of decoder

  • att_dim (int) – attention dimension

  • han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_enc_h

forward(enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0)[source]

AttDot forward

Parameters
  • enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)

  • enc_hs_len (list) – padded encoder hidden state length (B)

  • dec_z (torch.Tensor) – dummy (not used)

  • att_prev (torch.Tensor) – dummy (not used)

  • scaling (float) – scaling parameter before applying softmax

Returns

attention weighted encoder state (B, D_enc)

Return type

torch.Tensor

Returns

previous attention weight (B x T_max)

Return type

torch.Tensor

reset()[source]

reset states

class espnet.nets.pytorch_backend.rnn.attentions.AttForward(eprojs, dunits, att_dim, aconv_chans, aconv_filts)[source]

Bases: torch.nn.modules.module.Module

Forward attention module.

Reference: Forward attention in sequence-to-sequence acoustic modeling for speech synthesis

Parameters
  • eprojs (int) – # projection-units of encoder

  • dunits (int) – # units of decoder

  • att_dim (int) – attention dimension

  • aconv_chans (int) – # channels of attention convolution

  • aconv_filts (int) – filter size of attention convolution

forward(enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=1.0, last_attended_idx=None, backward_window=1, forward_window=3)[source]

Calculate AttForward forward propagation.

Parameters
  • enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)

  • enc_hs_len (list) – padded encoder hidden state length (B)

  • dec_z (torch.Tensor) – decoder hidden state (B x D_dec)

  • att_prev (torch.Tensor) – attention weights of previous step

  • scaling (float) – scaling parameter before applying softmax

  • last_attended_idx (int) – index of the inputs of the last attended

  • backward_window (int) – backward window size in attention constraint

  • forward_window (int) – forward window size in attention constraint

Returns

attention weighted encoder state (B, D_enc)

Return type

torch.Tensor

Returns

previous attention weights (B x T_max)

Return type

torch.Tensor

reset()[source]

reset states

class espnet.nets.pytorch_backend.rnn.attentions.AttForwardTA(eunits, dunits, att_dim, aconv_chans, aconv_filts, odim)[source]

Bases: torch.nn.modules.module.Module

Forward attention with transition agent module.

Reference: Forward attention in sequence-to-sequence acoustic modeling for speech synthesis

Parameters
  • eunits (int) – # units of encoder

  • dunits (int) – # units of decoder

  • att_dim (int) – attention dimension

  • aconv_chans (int) – # channels of attention convolution

  • aconv_filts (int) – filter size of attention convolution

  • odim (int) – output dimension

forward(enc_hs_pad, enc_hs_len, dec_z, att_prev, out_prev, scaling=1.0, last_attended_idx=None, backward_window=1, forward_window=3)[source]

Calculate AttForwardTA forward propagation.

Parameters
  • enc_hs_pad (torch.Tensor) – padded encoder hidden state (B, Tmax, eunits)

  • enc_hs_len (list) – padded encoder hidden state length (B)

  • dec_z (torch.Tensor) – decoder hidden state (B, dunits)

  • att_prev (torch.Tensor) – attention weights of previous step

  • out_prev (torch.Tensor) – decoder outputs of previous step (B, odim)

  • scaling (float) – scaling parameter before applying softmax

  • last_attended_idx (int) – index of the inputs of the last attended

  • backward_window (int) – backward window size in attention constraint

  • forward_window (int) – forward window size in attention constraint

Returns

attention weighted encoder state (B, dunits)

Return type

torch.Tensor

Returns

previous attention weights (B, Tmax)

Return type

torch.Tensor

reset()[source]
class espnet.nets.pytorch_backend.rnn.attentions.AttLoc(eprojs, dunits, att_dim, aconv_chans, aconv_filts, han_mode=False)[source]

Bases: torch.nn.modules.module.Module

location-aware attention module.

Reference: Attention-Based Models for Speech Recognition

(https://arxiv.org/pdf/1506.07503.pdf)

Parameters
  • eprojs (int) – # projection-units of encoder

  • dunits (int) – # units of decoder

  • att_dim (int) – attention dimension

  • aconv_chans (int) – # channels of attention convolution

  • aconv_filts (int) – filter size of attention convolution

  • han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_enc_h

forward(enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0, last_attended_idx=None, backward_window=1, forward_window=3)[source]

Calculate AttLoc forward propagation.

Parameters
  • enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)

  • enc_hs_len (list) – padded encoder hidden state length (B)

  • dec_z (torch.Tensor) – decoder hidden state (B x D_dec)

  • att_prev (torch.Tensor) – previous attention weight (B x T_max)

  • scaling (float) – scaling parameter before applying softmax

  • last_attended_idx (int) – index of the inputs of the last attended

  • backward_window (int) – backward window size in attention constraint

  • forward_window (int) – forward window size in attention constraint

Returns

attention weighted encoder state (B, D_enc)

Return type

torch.Tensor

Returns

previous attention weights (B x T_max)

Return type

torch.Tensor

reset()[source]

reset states
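
A minimal sketch of driving the attention during decoding (the (att_c, att_w) pair matches the two Returns entries above; passing att_prev=None on the first step lets the module initialize the weights itself, which is the usual espnet convention):

import torch

from espnet.nets.pytorch_backend.rnn.attentions import AttLoc

att = AttLoc(eprojs=320, dunits=300, att_dim=320, aconv_chans=10, aconv_filts=100)
enc_hs_pad = torch.randn(4, 200, 320)  # (B, T_max, D_enc)
enc_hs_len = [200, 180, 160, 150]
dec_z = torch.zeros(4, 300)            # decoder hidden state (B, D_dec)

att.reset()                            # clear cached encoder projections
att_c, att_w = att(enc_hs_pad, enc_hs_len, dec_z, None)
# att_c: (B, D_enc) context vector; att_w: (B, T_max) attention weights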

class espnet.nets.pytorch_backend.rnn.attentions.AttLoc2D(eprojs, dunits, att_dim, att_win, aconv_chans, aconv_filts, han_mode=False)[source]

Bases: torch.nn.modules.module.Module

2D location-aware attention

This attention is an extended version of location-aware attention. It takes into account not only the previous frame’s attention weights but also those of earlier frames.

Parameters
  • eprojs (int) – # projection-units of encoder

  • dunits (int) – # units of decoder

  • att_dim (int) – attention dimension

  • aconv_chans (int) – # channels of attention convolution

  • aconv_filts (int) – filter size of attention convolution

  • att_win (int) – attention window size (default=5)

  • han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_enc_h

forward(enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0)[source]

AttLoc2D forward

Parameters
  • enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)

  • enc_hs_len (list) – padded encoder hidden state length (B)

  • dec_z (torch.Tensor) – decoder hidden state (B x D_dec)

  • att_prev (torch.Tensor) – previous attention weight (B x att_win x T_max)

  • scaling (float) – scaling parameter before applying softmax

Returns

attention weighted encoder state (B, D_enc)

Return type

torch.Tensor

Returns

previous attention weights (B x att_win x T_max)

Return type

torch.Tensor

reset()[source]

reset states

class espnet.nets.pytorch_backend.rnn.attentions.AttLocRec(eprojs, dunits, att_dim, aconv_chans, aconv_filts, han_mode=False)[source]

Bases: torch.nn.modules.module.Module

location-aware recurrent attention

This attention is an extended version of location-aware attention. With the use of an RNN, it takes the history of attention weights into account.

Parameters
  • eprojs (int) – # projection-units of encoder

  • dunits (int) – # units of decoder

  • att_dim (int) – attention dimension

  • aconv_chans (int) – # channels of attention convolution

  • aconv_filts (int) – filter size of attention convolution

  • han_mode (bool) – flag to switch on hierarchical attention mode and not store pre_compute_enc_h

forward(enc_hs_pad, enc_hs_len, dec_z, att_prev_states, scaling=2.0)[source]

AttLocRec forward

Parameters
  • enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)

  • enc_hs_len (list) – padded encoder hidden state length (B)

  • dec_z (torch.Tensor) – decoder hidden state (B x D_dec)

  • att_prev_states (tuple) – previous attention weight and lstm states ((B, T_max), ((B, att_dim), (B, att_dim)))

  • scaling (float) – scaling parameter before applying softmax

Returns

attention weighted encoder state (B, D_enc)

Return type

torch.Tensor

Returns

previous attention weights and lstm states (w, (hx, cx)) ((B, T_max), ((B, att_dim), (B, att_dim)))

Return type

tuple
