espnet2.enh package¶
espnet2.enh.espnet_model¶
Enhancement model module.

class
espnet2.enh.espnet_model.
ESPnetEnhancementModel
(encoder: espnet2.enh.encoder.abs_encoder.AbsEncoder, separator: Optional[espnet2.enh.separator.abs_separator.AbsSeparator], decoder: espnet2.enh.decoder.abs_decoder.AbsDecoder, mask_module: Optional[espnet2.diar.layers.abs_mask.AbsMask], loss_wrappers: Optional[List[espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper]], stft_consistency: bool = False, loss_type: str = 'mask_mse', mask_type: Optional[str] = None, flexible_numspk: bool = False, extract_feats_in_collect_stats: bool = False, normalize_variance: bool = False, normalize_variance_per_ch: bool = False, categories: list = [], category_weights: list = [], always_forward_in_48k: bool = False)[source]¶ Bases:
espnet2.train.abs_espnet_model.AbsESPnetModel
Speech enhancement or separation Frontend model
Main entry of speech enhancement/separation model training.
 Parameters:
encoder – waveform encoder that converts waveforms to feature representations
separator – separator that enhance or separate the feature representations
decoder – waveform decoder that converts the feature back to waveforms
mask_module – mask module that converts the feature to masks NOTE: Only used for compatibility with joint speaker diarization. See test/espnet2/enh/test_espnet_enh_s2t_model.py for details.
loss_wrappers – list of loss wrappers. Each loss wrapper contains a criterion for loss calculation and the corresponding loss weight. The losses are calculated in the order of the list and summed up.
stft_consistency – (deprecated, kept for compatibility) whether to compute the TF-domain loss while enforcing STFT consistency. NOTE: STFT consistency is now always used for frequency-domain spectrum losses.
loss_type – (deprecated, kept for compatibility) loss type
mask_type – (deprecated, kept for compatibility) mask type in TF-domain model
flexible_numspk – whether to allow the model to predict a variable number of speakers in its output. NOTE: This should be used when training a speech separation model for an unknown number of speakers.
extract_feats_in_collect_stats – used in espnet2/tasks/abs_task.py for determining whether or not to skip model building in collect_stats stage (stage 5 in egs2/*/enh1/enh.sh).
normalize_variance – whether to normalize the signal variance before model forward, and revert it back after.
normalize_variance_per_ch – whether to normalize the signal variance for each channel instead of the whole signal. NOTE: normalize_variance and normalize_variance_per_ch cannot be True at the same time.
categories – list of all possible categories of minibatches (order matters!) (e.g. [“1ch_8k_reverb”, “1ch_8k_both”] for multi-condition training). NOTE: this will be used to convert category index to the corresponding name for logging in forward_loss. Different categories will have different loss name suffixes.
category_weights – list of weights for each category. Used to set loss weights for batches of different categories.
always_forward_in_48k – whether to always upsample the input speech to 48kHz for forward, and then downsample to the original sample rate for loss calculation. NOTE: this can be useful to train a model capable of handling various sampling rates while unifying bandwidth extension + speech enhancement.
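The normalize_variance option describes a normalize, forward, revert pattern. The following is a minimal sketch of that pattern on a plain Python list, not the library's implementation: the helper names, the `eps` guard, and the use of the population standard deviation are all assumptions made for illustration.

```python
import math


def normalize_variance(signal, eps=1e-8):
    """Scale a mono signal (list of floats) to roughly unit standard deviation.

    Returns the scaled signal and the scale factor needed to undo the scaling.
    Hypothetical helper; `eps` only guards against division by zero.
    """
    mean = sum(signal) / len(signal)
    std = math.sqrt(sum((x - mean) ** 2 for x in signal) / len(signal))
    scale = std + eps
    return [x / scale for x in signal], scale


def enhance_with_normalization(signal, model):
    """Normalize, run `model` (any callable on a list), then revert the scale."""
    normalized, scale = normalize_variance(signal)
    enhanced = model(normalized)
    # Revert to the original signal scale after the forward pass.
    return [x * scale for x in enhanced]
```

With an identity `model`, the output matches the input up to floating-point error, which is a quick way to check that the revert step mirrors the normalization step.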

collect_feats
(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]¶

forward
(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor = None, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Frontend + Encoder + Decoder + Calc loss
 Parameters:
speech_mix – (Batch, samples) or (Batch, samples, channels)
speech_ref – (Batch, num_speaker, samples) or (Batch, num_speaker, samples, channels)
speech_mix_lengths – (Batch,), default None for the chunk iterator, because the chunk iterator does not return speech_lengths. See espnet2/iterators/chunk_iter_factory.py
kwargs – “utt_id” is among the input.
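Because speech_mix_lengths may be None when batches come from the chunk iterator, a caller needs some fallback. A plausible one, sketched below under the assumption that every chunk in the batch has the same number of samples, is to treat each item as full length. The helper name `resolve_lengths` is hypothetical, and nested lists stand in for tensors.

```python
def resolve_lengths(speech_mix, speech_mix_lengths=None):
    """Return per-utterance lengths, defaulting to the full sample count.

    speech_mix: list of equal-length sample lists, shape (Batch, samples).
    """
    if speech_mix_lengths is not None:
        return list(speech_mix_lengths)
    # Assumption: chunks are equal length, so each item spans all samples.
    num_samples = len(speech_mix[0])
    return [num_samples] * len(speech_mix)
```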

forward_enhance
(speech_mix: torch.Tensor, speech_lengths: torch.Tensor, additional: Optional[Dict] = None, fs: Optional[int] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶

forward_loss
(speech_pre: torch.Tensor, speech_lengths: torch.Tensor, feature_mix: torch.Tensor, feature_pre: List[torch.Tensor], others: OrderedDict, speech_ref: List[torch.Tensor], noise_ref: Optional[List[torch.Tensor]] = None, dereverb_speech_ref: Optional[List[torch.Tensor]] = None, category: Optional[torch.Tensor] = None, num_spk: Optional[int] = None, fs: Optional[int] = None) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶

static
sort_by_perm
(nn_output, perm)[source]¶ Sort the input list of tensors by the specified permutation.
 Parameters:
nn_output – List[torch.Tensor(Batch, …)], len(nn_output) == num_spk
perm – (Batch, num_spk) or List[torch.Tensor(num_spk)]
 Returns:
List[torch.Tensor(Batch, …)]
 Return type:
nn_output_new
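To make the shapes above concrete, here is a small sketch of reordering per-speaker outputs by a per-sample permutation. It uses nested Python lists in place of tensors, and the indexing convention (output slot i of sample b takes speaker perm[b][i]) is an assumption for illustration rather than a statement about the library's code.

```python
def sort_by_perm(nn_output, perm):
    """Reorder per-speaker outputs with a per-sample permutation.

    nn_output: list of length num_spk; nn_output[spk][b] is the output of
               speaker `spk` for batch item `b`.
    perm:      list of length Batch; perm[b] is a list of num_spk indices.
    """
    num_spk = len(nn_output)
    batch_size = len(perm)
    return [
        [nn_output[perm[b][i]][b] for b in range(batch_size)]
        for i in range(num_spk)
    ]
```

For example, with two speakers and two batch items, an identity permutation for the first item and a swap for the second exchanges the speakers only in the second item.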
espnet2.enh.abs_enh¶

class
espnet2.enh.abs_enh.
AbsEnhancement
(*args, **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract
forward
(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, collections.OrderedDict][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract
espnet2.enh.__init__¶
espnet2.enh.diffusion_enh¶
Enhancement model module.

class
espnet2.enh.diffusion_enh.
ESPnetDiffusionModel
(encoder: espnet2.enh.encoder.abs_encoder.AbsEncoder, diffusion: espnet2.enh.diffusion.abs_diffusion.AbsDiffusion, decoder: espnet2.enh.decoder.abs_decoder.AbsDecoder, num_spk: int = 1, normalize: bool = False, **kwargs)[source]¶ Bases:
espnet2.enh.espnet_model.ESPnetEnhancementModel
Target Speaker Extraction Frontend model

collect_feats
(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]¶

forward
(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor = None, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Frontend + Encoder + Decoder + Calc loss
 Parameters:
speech_mix – (Batch, samples) or (Batch, samples, channels)
speech_ref1 – (Batch, samples) or (Batch, samples, channels)
speech_ref2 – (Batch, samples) or (Batch, samples, channels)
.. –
speech_mix_lengths – (Batch,), default None for the chunk iterator, because the chunk iterator does not return speech_lengths. See espnet2/iterators/chunk_iter_factory.py
enroll_ref1 – (Batch, samples_aux) enrollment (raw audio or embedding) for speaker 1
enroll_ref2 – (Batch, samples_aux) enrollment (raw audio or embedding) for speaker 2
.. –
kwargs – “utt_id” is among the input.

espnet2.enh.espnet_model_tse¶
Enhancement model module.

class
espnet2.enh.espnet_model_tse.
ESPnetExtractionModel
(encoder: espnet2.enh.encoder.abs_encoder.AbsEncoder, extractor: espnet2.enh.extractor.abs_extractor.AbsExtractor, decoder: espnet2.enh.decoder.abs_decoder.AbsDecoder, loss_wrappers: List[espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper], num_spk: int = 1, flexible_numspk: bool = False, share_encoder: bool = True, extract_feats_in_collect_stats: bool = False)[source]¶ Bases:
espnet2.train.abs_espnet_model.AbsESPnetModel
Target Speaker Extraction Frontend model

collect_feats
(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]¶

forward
(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor = None, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Frontend + Encoder + Decoder + Calc loss
 Parameters:
speech_mix – (Batch, samples) or (Batch, samples, channels)
speech_ref1 – (Batch, samples) or (Batch, samples, channels)
speech_ref2 – (Batch, samples) or (Batch, samples, channels)
.. –
speech_mix_lengths – (Batch,), default None for the chunk iterator, because the chunk iterator does not return speech_lengths. See espnet2/iterators/chunk_iter_factory.py
enroll_ref1 – (Batch, samples_aux) enrollment (raw audio or embedding) for speaker 1
enroll_ref2 – (Batch, samples_aux) enrollment (raw audio or embedding) for speaker 2
.. –
kwargs – “utt_id” is among the input.
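The numbered keyword arguments listed above (speech_ref1, speech_ref2, ... and enroll_ref1, enroll_ref2, ...) arrive through **kwargs. One way to gather them into ordered lists is sketched below; the helper name `collect_numbered` and the stop-at-first-gap rule are assumptions for illustration.

```python
def collect_numbered(kwargs, prefix):
    """Collect kwargs[prefix + "1"], kwargs[prefix + "2"], ... in order.

    Stops at the first missing index, so gaps end the sequence.
    """
    items = []
    index = 1
    while f"{prefix}{index}" in kwargs:
        items.append(kwargs[f"{prefix}{index}"])
        index += 1
    return items
```

Called with `"speech_ref"` and then `"enroll_ref"`, this yields two lists whose positions line up per speaker, while unrelated keys such as "utt_id" are ignored.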

espnet2.enh.espnet_enh_s2t_model¶

class
espnet2.enh.espnet_enh_s2t_model.
ESPnetEnhS2TModel
(enh_model: espnet2.enh.espnet_model.ESPnetEnhancementModel, s2t_model: Union[espnet2.asr.espnet_model.ESPnetASRModel, espnet2.st.espnet_model.ESPnetSTModel, espnet2.diar.espnet_model.ESPnetDiarizationModel], calc_enh_loss: bool = True, bypass_enh_prob: float = 0)[source]¶ Bases:
espnet2.train.abs_espnet_model.AbsESPnetModel
Joint model Enhancement and Speech to Text.

batchify_nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor, batch_size: int = 100)¶ Compute negative log likelihood (nll) from the transformer decoder.
To avoid OOM, this function separates the input into batches, calls nll for each batch, then combines and returns the results.
 Parameters:
encoder_out – (Batch, Length, Dim)
encoder_out_lens – (Batch,)
ys_pad – (Batch, Length)
ys_pad_lens – (Batch,)
batch_size – int, number of samples each batch contains when computing nll; you may change this to avoid OOM or increase GPU memory usage
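The batching strategy described for batchify_nll can be sketched generically: slice the inputs into fixed-size batches, score each slice, and concatenate the results. In the sketch below, `compute_nll` is a hypothetical stand-in for the real nll method and plain lists stand in for tensors.

```python
def batchify_nll_sketch(compute_nll, encoder_out, ys_pad, batch_size=100):
    """Apply `compute_nll` to slices of at most `batch_size` items.

    compute_nll: callable taking (encoder_out_slice, ys_pad_slice) and
                 returning one score per item in the slice.
    """
    results = []
    for start in range(0, len(encoder_out), batch_size):
        end = start + batch_size
        # Only one slice is processed at a time, which bounds peak memory.
        results.extend(compute_nll(encoder_out[start:end], ys_pad[start:end]))
    return results
```

A smaller `batch_size` lowers peak memory at the cost of more calls; the combined result is the same either way as long as `compute_nll` scores items independently.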

collect_feats
(speech: torch.Tensor, speech_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]¶

encode
(speech: torch.Tensor, speech_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Frontend + Encoder. Note that this method is used by asr_inference.py
 Parameters:
speech – (Batch, Length, …)
speech_lengths – (Batch, )

encode_diar
(speech: torch.Tensor, speech_lengths: torch.Tensor, num_spk: int) → Tuple[torch.Tensor, torch.Tensor][source]¶ Frontend + Encoder. Note that this method is used by diar_inference.py
 Parameters:
speech – (Batch, Length, …)
speech_lengths – (Batch, )
num_spk – int

forward
(speech: torch.Tensor, speech_lengths: torch.Tensor = None, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Frontend + Encoder + Decoder + Calc loss
 Parameters:
speech – (Batch, Length, …)
speech_lengths – (Batch, ), default None for the chunk iterator, because the chunk iterator does not return speech_lengths. See espnet2/iterators/chunk_iter_factory.py
For the Enh+ASR task – text_spk1: (Batch, Length) text_spk2: (Batch, Length) … text_spk1_lengths: (Batch,) text_spk2_lengths: (Batch,) …
For other tasks – text: (Batch, Length), default None just to keep the argument order; text_lengths: (Batch,), default None for the same reason as speech_lengths

inherite_attributes
(inherite_enh_attrs: List[str] = [], inherite_s2t_attrs: List[str] = [])[source]¶

nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor) → torch.Tensor[source]¶ Compute negative log likelihood (nll) from the transformer decoder.
Normally, this function is called in batchify_nll.
 Parameters:
encoder_out – (Batch, Length, Dim)
encoder_out_lens – (Batch,)
ys_pad – (Batch, Length)
ys_pad_lens – (Batch,)

espnet2.enh.extractor.__init__¶
espnet2.enh.extractor.abs_extractor¶

class
espnet2.enh.extractor.abs_extractor.
AbsExtractor
(*args, **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract
forward
(input: torch.Tensor, ilens: torch.Tensor, input_aux: torch.Tensor, ilens_aux: torch.Tensor, suffix_tag: str = '', additional: Optional[Dict] = None) → Tuple[Tuple[torch.Tensor], torch.Tensor, collections.OrderedDict][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract
espnet2.enh.extractor.td_speakerbeam_extractor¶

class
espnet2.enh.extractor.td_speakerbeam_extractor.
TDSpeakerBeamExtractor
(input_dim: int, layer: int = 8, stack: int = 3, bottleneck_dim: int = 128, hidden_dim: int = 512, skip_dim: int = 128, kernel: int = 3, causal: bool = False, norm_type: str = 'gLN', pre_nonlinear: str = 'prelu', nonlinear: str = 'relu', i_adapt_layer: int = 7, adapt_layer_type: str = 'mul', adapt_enroll_dim: int = 128, use_spk_emb: bool = False, spk_emb_dim: int = 256)[source]¶ Bases:
espnet2.enh.extractor.abs_extractor.AbsExtractor
Time-Domain SpeakerBeam Extractor.
 Parameters:
input_dim – input feature dimension
layer – int, number of layers in each stack
stack – int, number of stacks
bottleneck_dim – bottleneck dimension
hidden_dim – number of convolution channel
skip_dim – int, number of skip connection channels
kernel – int, kernel size.
causal – bool, default False.
norm_type – str, choose from ‘BN’, ‘gLN’, ‘cLN’
pre_nonlinear – the nonlinear function right before mask estimation select from ‘prelu’, ‘relu’, ‘tanh’, ‘sigmoid’, ‘linear’
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’, ‘linear’
i_adapt_layer – int, index of adaptation layer
adapt_layer_type – str, type of adaptation layer see espnet2.enh.layers.adapt_layers for options
adapt_enroll_dim – int, dimensionality of the speaker embedding
use_spk_emb – bool, whether to use speaker embeddings as enrollment
spk_emb_dim – int, dimension of input speaker embeddings only used when use_spk_emb is True

forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, input_aux: torch.Tensor, ilens_aux: torch.Tensor, suffix_tag: str = '', additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ TDSpeakerBeam Forward.
 Parameters:
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
ilens (torch.Tensor) – input lengths [Batch]
input_aux (torch.Tensor or ComplexTensor) – Encoded auxiliary feature for the target speaker [B, T, N] or [B, N]
ilens_aux (torch.Tensor) – input lengths of auxiliary input for the target speaker [Batch]
suffix_tag (str) – suffix to append to the keys in others
additional (None or dict) – additional parameters not used in this model
 Returns:
masked (List[Union(torch.Tensor, ComplexTensor)]): [(B, T, N), …]
ilens (torch.Tensor): (B,)
others predicted data, e.g. masks: OrderedDict[
f'mask{suffix_tag}': torch.Tensor(Batch, Frames, Freq),
f'enroll_emb{suffix_tag}': torch.Tensor(Batch, adapt_enroll_dim/adapt_enroll_dim*2),
]
espnet2.enh.encoder.__init__¶
espnet2.enh.encoder.stft_encoder¶

class
espnet2.enh.encoder.stft_encoder.
STFTEncoder
(n_fft: int = 512, win_length: int = None, hop_length: int = 128, window='hann', center: bool = True, normalized: bool = False, onesided: bool = True, use_builtin_complex: bool = True, default_fs: int = 16000, spec_transform_type: str = None, spec_factor: float = 0.15, spec_abs_exponent: float = 0.5)[source]¶ Bases:
espnet2.enh.encoder.abs_encoder.AbsEncoder
STFT encoder for speech enhancement and separation

forward
(input: torch.Tensor, ilens: torch.Tensor, fs: int = None)[source]¶ Forward.
 Parameters:
input (torch.Tensor) – mixed speech [Batch, sample]
ilens (torch.Tensor) – input lengths [Batch]
fs (int) – sampling rate in Hz. If not None, reconfigure STFT window and hop lengths for a new sampling rate while keeping their duration fixed.
 Returns:
spectrum (ComplexTensor): [Batch, T, (C,) F]
flens (torch.Tensor): [Batch]
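Keeping the window and hop duration fixed across sampling rates amounts to scaling their sample counts by fs / default_fs. The following sketch shows that arithmetic only; the helper name `rescale_stft_params` and the use of integer floor division are assumptions, not the encoder's actual code.

```python
def rescale_stft_params(n_fft, win_length, hop_length, fs, default_fs=16000):
    """Scale STFT sizes (in samples) so their duration in seconds is unchanged.

    Returns the sizes untouched when `fs` is None or equals `default_fs`.
    """
    if fs is None or fs == default_fs:
        return n_fft, win_length, hop_length
    return (
        n_fft * fs // default_fs,
        win_length * fs // default_fs,
        hop_length * fs // default_fs,
    )
```

For instance, a 512-sample window with a 128-sample hop at 16 kHz (32 ms and 8 ms) becomes 1536 and 384 samples at 48 kHz, which is still 32 ms and 8 ms.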

forward_streaming
(input: torch.Tensor)[source]¶ Forward.
 Parameters:
input (torch.Tensor) – mixed speech [Batch, frame_length]
 Returns:
B, 1, F

property
output_dim
¶

streaming_frame
(audio)[source]¶ streaming_frame. It splits the continuous audio into frame-level audio chunks in the streaming simulation. Note that this function takes the entire long audio as input for a streaming simulation. You may refer to this function to manage your streaming input buffer in a real streaming application.
 Parameters:
audio – (B, T)
 Returns:
List [(B, frame_size),]
 Return type:
chunked
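The frame splitting described above can be illustrated with a simple sliding window. This sketch uses nested lists for a (B, T) batch and hypothetical `frame_size` and `hop_size` arguments; it drops any trailing samples that do not fill a whole frame and does no padding, both of which are simplifying assumptions.

```python
def split_into_frames(audio, frame_size, hop_size):
    """Split (B, T) audio into a list of (B, frame_size) chunks.

    audio: list of B equal-length sample lists.
    Consecutive chunks overlap by frame_size - hop_size samples.
    """
    num_samples = len(audio[0])
    frames = []
    for start in range(0, num_samples - frame_size + 1, hop_size):
        frames.append([utt[start:start + frame_size] for utt in audio])
    return frames
```

In a real streaming application the same window logic would be applied to an input buffer that grows as new samples arrive, rather than to the whole recording at once.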

espnet2.enh.encoder.conv_encoder¶

class
espnet2.enh.encoder.conv_encoder.
ConvEncoder
(channel: int, kernel_size: int, stride: int)[source]¶ Bases:
espnet2.enh.encoder.abs_encoder.AbsEncoder
Convolutional encoder for speech enhancement and separation

forward
(input: torch.Tensor, ilens: torch.Tensor, fs: int = None)[source]¶ Forward.
 Parameters:
input (torch.Tensor) – mixed speech [Batch, sample]
ilens (torch.Tensor) – input lengths [Batch]
fs (int) – sampling rate in Hz (Not used)
 Returns:
mixed feature after encoder [Batch, flens, channel]
 Return type:
feature (torch.Tensor)

property
output_dim
¶

streaming_frame
(audio: torch.Tensor)[source]¶ Stream frame.
It splits the continuous audio into frame-level audio chunks in the streaming simulation. Note that this function takes the entire long audio as input for a streaming simulation. You may refer to this function to manage your streaming input buffer in a real streaming application.
 Parameters:
audio – (B, T)
 Returns:
List [(B, frame_size),]
 Return type:
chunked

espnet2.enh.encoder.null_encoder¶

class
espnet2.enh.encoder.null_encoder.
NullEncoder
[source]¶ Bases:
espnet2.enh.encoder.abs_encoder.AbsEncoder
Null encoder.

forward
(input: torch.Tensor, ilens: torch.Tensor, fs: int = None)[source]¶ Forward.
 Parameters:
input (torch.Tensor) – mixed speech [Batch, sample]
ilens (torch.Tensor) – input lengths [Batch]
fs (int) – sampling rate in Hz (Not used)

property
output_dim
¶

espnet2.enh.encoder.abs_encoder¶

class
espnet2.enh.encoder.abs_encoder.
AbsEncoder
(*args, **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract
forward
(input: torch.Tensor, ilens: torch.Tensor, fs: int = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract property
output_dim
¶

streaming_frame
(audio: torch.Tensor)[source]¶ Stream frame.
It splits the continuous audio into frame-level audio chunks in the streaming simulation. Note that this function takes the entire long audio as input for a streaming simulation. You may refer to this function to manage your streaming input buffer in a real streaming application.
 Parameters:
audio – (B, T)
 Returns:
List [(B, frame_size),]
 Return type:
chunked

abstract
espnet2.enh.layers.adapt_layers¶

class
espnet2.enh.layers.adapt_layers.
ConcatAdaptLayer
(indim, enrolldim, ninputs=1)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(main, enroll)[source]¶ ConcatAdaptLayer forward.
 Parameters:
main – tensor or tuple or list; activations in the main neural network, which are adapted. A tuple/list may be useful when we want to apply the adaptation to both normal and skip connection at once
enroll – tensor or tuple or list; embedding extracted from enrollment. A tuple/list may be useful when we want to apply the adaptation to both normal and skip connection at once


class
espnet2.enh.layers.adapt_layers.
MulAddAdaptLayer
(indim, enrolldim, ninputs=1, do_addition=True)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(main, enroll)[source]¶ MulAddAdaptLayer Forward.
 Parameters:
main – tensor or tuple or list; activations in the main neural network, which are adapted. A tuple/list may be useful when we want to apply the adaptation to both normal and skip connection at once
enroll – tensor or tuple or list; embedding extracted from enrollment. A tuple/list may be useful when we want to apply the adaptation to both normal and skip connection at once
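As the class name and the do_addition flag suggest, this kind of layer adapts the main activations by multiplying them with (and optionally adding) values derived from the enrollment embedding. The sketch below illustrates that idea on flat Python lists. The split of the enrollment vector into a scale half and a shift half is an assumption made for the example, not a confirmed detail of the implementation.

```python
def mul_add_adapt(main, enroll, do_addition=True):
    """Scale (and optionally shift) activations with an enrollment vector.

    main:   list of activations of length D.
    enroll: list of length 2 * D when do_addition is True (assumed layout:
            first half scales, second half shifts), else length D.
    """
    if do_addition:
        half = len(enroll) // 2
        scale, shift = enroll[:half], enroll[half:]
        return [m * s + b for m, s, b in zip(main, scale, shift)]
    return [m * s for m, s in zip(main, enroll)]
```

With `do_addition=False` the layer reduces to a purely multiplicative gate, which is the simpler of the two variants.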

espnet2.enh.layers.dprnn¶

class
espnet2.enh.layers.dprnn.
DPRNN
(rnn_type, input_size, hidden_size, output_size, dropout=0, num_layers=1, bidirectional=True)[source]¶ Bases:
torch.nn.modules.module.Module
Deep dual-path RNN.
 Parameters:
rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.
input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).
hidden_size – int, dimension of the hidden state.
output_size – int, dimension of the output size.
dropout – float, dropout ratio. Default is 0.
num_layers – int, number of stacked RNN layers. Default is 1.
bidirectional – bool, whether the RNN layers are bidirectional. Default is True.

forward
(input)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class
espnet2.enh.layers.dprnn.
DPRNN_TAC
(rnn_type, input_size, hidden_size, output_size, dropout=0, num_layers=1, bidirectional=True)[source]¶ Bases:
torch.nn.modules.module.Module
Deep dual-path RNN with TAC applied to each layer/block.
 Parameters:
rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.
input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).
hidden_size – int, dimension of the hidden state.
output_size – int, dimension of the output size.
dropout – float, dropout ratio. Default is 0.
num_layers – int, number of stacked RNN layers. Default is 1.
bidirectional – bool, whether the RNN layers are bidirectional. Default is True.

forward
(input, num_mic)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class
espnet2.enh.layers.dprnn.
SingleRNN
(rnn_type, input_size, hidden_size, dropout=0, bidirectional=False)[source]¶ Bases:
torch.nn.modules.module.Module
Container module for a single RNN layer.
 Parameters:
rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.
input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).
hidden_size – int, dimension of the hidden state.
dropout – float, dropout ratio. Default is 0.
bidirectional – bool, whether the RNN layers are bidirectional. Default is False.

forward
(input, state=None)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.enh.layers.__init__¶
espnet2.enh.layers.dptnet¶

class
espnet2.enh.layers.dptnet.
DPTNet
(rnn_type, input_size, hidden_size, output_size, att_heads=4, dropout=0, activation='relu', num_layers=1, bidirectional=True, norm_type='gLN')[source]¶ Bases:
torch.nn.modules.module.Module
Dual-path transformer network.
 Parameters:
rnn_type (str) – select from ‘RNN’, ‘LSTM’ and ‘GRU’.
input_size (int) – dimension of the input feature. Input size must be a multiple of att_heads.
hidden_size (int) – dimension of the hidden state.
output_size (int) – dimension of the output size.
att_heads (int) – number of attention heads.
dropout (float) – dropout ratio. Default is 0.
activation (str) – activation function applied at the output of RNN.
num_layers (int) – number of stacked RNN layers. Default is 1.
bidirectional (bool) – whether the RNN layers are bidirectional. Default is True.
norm_type (str) – type of normalization to use after each inter- or intra-chunk Transformer block.
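The constraint that input_size must be a multiple of att_heads follows from multi-head attention splitting the feature dimension evenly across heads. A small pre-flight check, sketched below with a hypothetical helper name, makes the requirement explicit before a model is built.

```python
def attention_head_dim(input_size, att_heads=4):
    """Return the per-head feature size, or raise if it is not an integer."""
    if input_size % att_heads != 0:
        raise ValueError(
            f"input_size ({input_size}) must be a multiple of "
            f"att_heads ({att_heads})"
        )
    return input_size // att_heads
```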

forward
(input)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class
espnet2.enh.layers.dptnet.
ImprovedTransformerLayer
(rnn_type, input_size, att_heads, hidden_size, dropout=0.0, activation='relu', bidirectional=True, norm='gLN')[source]¶ Bases:
torch.nn.modules.module.Module
Container module of the (improved) Transformer proposed in [1].
 Reference:
Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation; Chen et al., Interspeech 2020.
 Parameters:
rnn_type (str) – select from ‘RNN’, ‘LSTM’ and ‘GRU’.
input_size (int) – Dimension of the input feature.
att_heads (int) – Number of attention heads.
hidden_size (int) – Dimension of the hidden state.
dropout (float) – Dropout ratio. Default is 0.
activation (str) – activation function applied at the output of RNN.
bidirectional (bool, optional) – True for bidirectional Inter-Chunk RNN (Intra-Chunk is always bidirectional).
norm (str, optional) – Type of normalization to use.

forward
(x, attn_mask=None)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.enh.layers.complex_utils¶
Beamformer module.

espnet2.enh.layers.complex_utils.
cat
(seq: Sequence[Union[torch_complex.tensor.ComplexTensor, torch.Tensor]], *args, **kwargs)[source]¶

espnet2.enh.layers.complex_utils.
complex_norm
(c: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], dim=1, keepdim=False) → torch.Tensor[source]¶

espnet2.enh.layers.complex_utils.
inverse
(c: Union[torch.Tensor, torch_complex.tensor.ComplexTensor]) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶

espnet2.enh.layers.complex_utils.
matmul
(a: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], b: Union[torch.Tensor, torch_complex.tensor.ComplexTensor]) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶

espnet2.enh.layers.complex_utils.
new_complex_like
(ref: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], real_imag: Tuple[torch.Tensor, torch.Tensor])[source]¶

espnet2.enh.layers.complex_utils.
reverse
(a: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], dim=0)[source]¶

espnet2.enh.layers.complex_utils.
solve
(b: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], a: Union[torch.Tensor, torch_complex.tensor.ComplexTensor])[source]¶ Solve the linear equation ax = b.
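Note the argument order: the right-hand side b comes first and the matrix a second. The tiny worked example below keeps that order and solves a 2x2 complex system with Cramer's rule in plain Python. It only illustrates the equation ax = b; the function name `solve_2x2` is hypothetical and this is not how the library computes the solution.

```python
def solve_2x2(b, a):
    """Solve a @ x = b for a 2x2 (possibly complex) matrix `a`.

    b: list of two numbers; a: two rows of two numbers each.
    """
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("matrix a is singular")
    # Cramer's rule: replace one column of `a` with `b` per unknown.
    x1 = (b[0] * a22 - a12 * b[1]) / det
    x2 = (a11 * b[1] - a21 * b[0]) / det
    return [x1, x2]
```

Multiplying `a` by the returned `x` should reproduce `b` up to floating-point error, which is a convenient sanity check for any solver.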
espnet2.enh.layers.dnn_wpe¶

class
espnet2.enh.layers.dnn_wpe.
DNN_WPE
(wtype: str = 'blstmp', widim: int = 257, wlayers: int = 3, wunits: int = 300, wprojs: int = 320, dropout_rate: float = 0.0, taps: int = 5, delay: int = 3, use_dnn_mask: bool = True, nmask: int = 1, nonlinear: str = 'sigmoid', iterations: int = 1, normalization: bool = False, eps: float = 1e-06, diagonal_loading: bool = True, diag_eps: float = 1e-07, mask_flooring: bool = False, flooring_thres: float = 1e-06, use_torch_solver: bool = True)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(data: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor) → Tuple[Union[torch.Tensor, torch_complex.tensor.ComplexTensor], torch.LongTensor, Union[torch.Tensor, torch_complex.tensor.ComplexTensor]][source]¶ DNN_WPE forward function.
 Notation:
B: Batch
C: Channel
T: Time or Sequence length
F: Freq or Some dimension of the feature vector
 Parameters:
data – (B, T, C, F)
ilens – (B,)
 Returns:
enhanced (torch.Tensor or List[torch.Tensor]): (B, T, C, F)
ilens: (B,)
masks (torch.Tensor or List[torch.Tensor]): (B, T, C, F)
power (List[torch.Tensor]): (B, F, T)

predict_mask
(data: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor) → Tuple[torch.Tensor, torch.LongTensor][source]¶ Predict mask for WPE dereverberation.
 Parameters:
data (torch.complex64/ComplexTensor) – (B, T, C, F), double precision
ilens (torch.Tensor) – (B,)
 Returns:
masks (torch.Tensor or List[torch.Tensor]): (B, T, C, F)
ilens (torch.Tensor): (B,)

espnet2.enh.layers.tcndenseunet¶

class
espnet2.enh.layers.tcndenseunet.
Conv2DActNorm
(in_channels, out_channels, ksz=(3, 3), stride=(1, 2), padding=(1, 0), upsample=False, activation=<class 'torch.nn.modules.activation.ELU'>)[source]¶ Bases:
torch.nn.modules.module.Module
Basic Conv2D + activation + instance norm building block.

forward
(inp)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.tcndenseunet.
DenseBlock
(in_channels, out_channels, num_freqs, pre_blocks=2, freq_proc_blocks=1, post_blocks=2, ksz=(3, 3), activation=<class 'torch.nn.modules.activation.ELU'>, hid_chans=32)[source]¶ Bases:
torch.nn.modules.module.Module
single DenseNet block as used in iNeuBe model.
 Parameters:
in_channels – number of input channels (image axis).
out_channels – number of output channels (image axis).
num_freqs – number of complex frequencies in the input STFT complex image-like tensor. The input is (batch, image_channels, frames, freqs).
pre_blocks – dense block before pointwise convolution block over frequency axis.
freq_proc_blocks – number of frequency axis processing blocks.
post_blocks – dense block after pointwise convolution block over frequency axis.
ksz – kernel size used in densenet Conv2D layers.
activation – activation function to use in the whole iNeuBe model, you can use any torch supported activation e.g. ‘relu’ or ‘elu’.
hid_chans – number of hidden channels in densenet Conv2D.

forward
(input)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class
espnet2.enh.layers.tcndenseunet.
FreqWiseBlock
(in_channels, num_freqs, out_channels, activation=<class 'torch.nn.modules.activation.ELU'>)[source]¶ Bases:
torch.nn.modules.module.Module
FreqWiseBlock, see iNeuBe paper.
Block that applies pointwise 2D convolution over STFT-like image tensor on frequency axis. The input is assumed to be [batch, image_channels, frames, freq].

forward
(inp)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.tcndenseunet.
TCNDenseUNet
(n_spk=1, in_freqs=257, mic_channels=1, hid_chans=32, hid_chans_dense=32, ksz_dense=(3, 3), ksz_tcn=3, tcn_repeats=4, tcn_blocks=7, tcn_channels=384, activation=<class 'torch.nn.modules.activation.ELU'>)[source]¶ Bases:
torch.nn.modules.module.Module
TCNDenseNet block from iNeuBe
Reference: Lu, Y. J., Cornell, S., Chang, X., Zhang, W., Li, C., Ni, Z., … & Watanabe, S. Towards Low-Distortion Multi-Channel Speech Enhancement: The ESPNET-Se Submission to the L3DAS22 Challenge. ICASSP 2022, pp. 9201-9205.
 Parameters:
n_spk – number of output sources/speakers.
in_freqs – number of complex STFT frequencies.
mic_channels – number of microphone channels (only fixed-array geometry is supported).
hid_chans – number of channels in the subsampling/upsampling conv layers.
hid_chans_dense – number of channels in the densenet layers (reduce this to reduce VRAM requirements).
ksz_dense – kernel size in the densenet layers throughout iNeuBe.
ksz_tcn – kernel size in the TCN submodule.
tcn_repeats – number of repetitions of blocks in the TCN submodule.
tcn_blocks – number of blocks in the TCN submodule.
tcn_channels – number of channels in the TCN submodule.
activation – activation function to use in the whole iNeuBe model; any torch-supported activation can be used, e.g. ‘relu’ or ‘elu’.

forward
(tf_rep)[source]¶ forward.
 Parameters:
tf_rep (torch.Tensor) – 4D tensor (multichannel complex STFT of mixture) of shape [B, T, C, F] (batch, frames, microphones, frequencies).
 Returns:
complex 3D tensor, the monaural STFT of the targets, of shape [B, T, F] (batch, frames, frequencies).
 Return type:
out (torch.Tensor)

class
espnet2.enh.layers.tcndenseunet.
TCNResBlock
(in_chan, out_chan, ksz=3, stride=1, dilation=1, activation=<class 'torch.nn.modules.activation.ELU'>)[source]¶ Bases:
torch.nn.modules.module.Module
Single depthwise-separable TCN block as used in the iNeuBe TCN.
 Parameters:
in_chan – number of input feature channels.
out_chan – number of output feature channels.
ksz – kernel size.
stride – stride in depthwise convolution.
dilation – dilation in depthwise convolution.
activation – activation function to use in the whole iNeuBe model; any torch-supported activation can be used, e.g. ‘relu’ or ‘elu’.

forward
(inp)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.enh.layers.beamformer_th¶
Beamformer module.

espnet2.enh.layers.beamformer_th.
apply_beamforming_vector
(beamform_vector: torch.Tensor, mix: torch.Tensor) → torch.Tensor[source]¶
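This entry carries no description. As a rough illustration only, and assuming the usual filter-and-sum convention (each enhanced value is the inner product of the conjugated weights with the channel observations; this convention is an assumption, not something stated above), one frequency bin could be processed as sketched below. The name `apply_bf_vector_sketch` is hypothetical, and plain Python lists stand in for tensors.

```python
def apply_bf_vector_sketch(beamform_vector, mix):
    """Filter-and-sum for a single frequency bin (pure-Python stand-in).

    beamform_vector: list of C complex weights.
    mix: C x T nested list of complex STFT values.
    Returns T complex values: sum over c of conj(w[c]) * mix[c][t].
    """
    num_frames = len(mix[0])
    return [
        sum(w.conjugate() * channel[t] for w, channel in zip(beamform_vector, mix))
        for t in range(num_frames)
    ]


# Two microphones, two frames (toy values chosen for illustration).
w = [0.5 + 0.0j, 0.0 - 0.5j]
mix = [[1 + 1j, 2 + 0j], [0 + 1j, 1 - 1j]]
enhanced = apply_bf_vector_sketch(w, mix)
```

The channel axis is summed out, so a (C, T) input yields T values per frequency bin.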

espnet2.enh.layers.beamformer_th.
blind_analytic_normalization
(ws, psd_noise, eps=1e-08)[source]¶ Blind analytic normalization (BAN) for post-filtering
 Parameters:
ws (torch.complex64) – beamformer vector (…, F, C)
psd_noise (torch.complex64) – noise PSD matrix (…, F, C, C)
eps (float) –
 Returns:
normalized beamformer vector (…, F)
 Return type:
ws_ban (torch.complex64)

espnet2.enh.layers.beamformer_th.
generalized_eigenvalue_decomposition
(a: torch.Tensor, b: torch.Tensor, eps=1e-06)[source]¶ Solves the generalized eigenvalue decomposition through Cholesky decomposition.
Ported from https://github.com/asteroid-team/asteroid/blob/master/asteroid/dsp/beamforming.py#L464
a @ e_vec = e_val * b @ e_vec
Cholesky decomposition on b: b = L @ L^H, where L is a lower triangular matrix.
Let C = L^-1 @ a @ L^-H; it is Hermitian.
=> C @ y = lambda * y
=> e_vec = L^-H @ y
Reference: https://www.netlib.org/lapack/lug/node54.html
 Parameters:
a – A complex Hermitian or real symmetric matrix whose eigenvalues and eigenvectors will be computed. (…, C, C)
b – A complex Hermitian or real symmetric positive-definite matrix. (…, C, C)
 Returns:
e_val – generalized eigenvalues (ascending order)
e_vec – generalized eigenvectors
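The Cholesky reduction described above can be traced by hand on a small case. The following is a minimal sketch restricted to real symmetric 2x2 inputs so that it needs only the standard library; the function name `gevd_2x2_sketch` and the closed-form 2x2 eigen-solver are illustrative choices, not the library routine.

```python
import math


def gevd_2x2_sketch(a, b):
    """Generalized EVD of real symmetric 2x2 `a` w.r.t. positive-definite `b`.

    Returns (e_val, e_vec): eigenvalues in ascending order and the matching
    eigenvectors, so that a @ e_vec[k] == e_val[k] * b @ e_vec[k].
    """
    # Cholesky factor of b: b = L @ L^T with L = [[l00, 0], [l10, l11]].
    l00 = math.sqrt(b[0][0])
    l10 = b[1][0] / l00
    l11 = math.sqrt(b[1][1] - l10 * l10)
    # Entries of L^-1 = [[i00, 0], [i10, i11]].
    i00, i10, i11 = 1 / l00, -l10 / (l00 * l11), 1 / l11
    # M = L^-1 @ a, then C = M @ L^-T (symmetric).
    m00, m01 = i00 * a[0][0], i00 * a[0][1]
    m10 = i10 * a[0][0] + i11 * a[1][0]
    m11 = i10 * a[0][1] + i11 * a[1][1]
    c00 = m00 * i00
    c01 = m00 * i10 + m01 * i11
    c11 = m10 * i10 + m11 * i11
    # Closed-form eigenvalues of the symmetric 2x2 matrix C.
    mean = (c00 + c11) / 2
    radius = math.hypot((c00 - c11) / 2, c01)
    e_val = [mean - radius, mean + radius]
    # Eigenvectors y of C.
    if abs(c01) > 1e-12:
        ys = [(c01, lam - c00) for lam in e_val]
    else:  # C is already diagonal
        ys = [(1.0, 0.0), (0.0, 1.0)] if c00 <= c11 else [(0.0, 1.0), (1.0, 0.0)]
    e_vec = []
    for y0, y1 in ys:
        norm = math.hypot(y0, y1)
        y0, y1 = y0 / norm, y1 / norm
        # Map back: e_vec = L^-T @ y, with L^-T = [[i00, i10], [0, i11]].
        e_vec.append((i00 * y0 + i10 * y1, i11 * y1))
    return e_val, e_vec


a = [[2.0, 1.0], [1.0, 3.0]]
b = [[2.0, 0.5], [0.5, 1.0]]
e_val, e_vec = gevd_2x2_sketch(a, b)

# Largest entry of a @ v - lambda * b @ v for each pair; should be near zero.
residuals = [
    max(
        abs(sum(a[i][j] * v[j] for j in range(2)) - lam * sum(b[i][j] * v[j] for j in range(2)))
        for i in range(2)
    )
    for lam, v in zip(e_val, e_vec)
]
```

The complex Hermitian case follows the same steps with conjugate transposes in place of transposes.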

espnet2.enh.layers.beamformer_th.
get_WPD_filter
(Phi: torch.Tensor, Rf: torch.Tensor, reference_vector: torch.Tensor, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → torch.Tensor[source]¶ Return the WPD vector.
WPD is the Weighted Power minimization Distortionless response convolutional beamformer, computed as follows:
h = (Rf^-1 @ Phi_{xx}) / tr[(Rf^-1) @ Phi_{xx}] @ u
 Reference:
T. Nakatani and K. Kinoshita, “A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation,” in IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903-907, June 2019, doi: 10.1109/LSP.2019.2911179. https://ieeexplore.ieee.org/document/8691481
 Parameters:
Phi (torch.complex64) – (B, F, (btaps+1) * C, (btaps+1) * C) is the PSD of zero-padded speech [x^T(t,f) 0 … 0]^T.
Rf (torch.complex64) – (B, F, (btaps+1) * C, (btaps+1) * C) is the power-normalized spatio-temporal covariance matrix.
reference_vector (torch.Tensor) – (B, (btaps+1) * C) is the reference_vector.
use_torch_solver (bool) – Whether to use solve instead of inverse
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(B, F, (btaps + 1) * C)
 Return type:
filter_matrix (torch.complex64)

espnet2.enh.layers.beamformer_th.
get_WPD_filter_v2
(Phi: torch.Tensor, Rf: torch.Tensor, reference_vector: torch.Tensor, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → torch.Tensor[source]¶ Return the WPD vector (v2).
 This implementation is more efficient than get_WPD_filter as
it skips unnecessary computation with zeros.
 Parameters:
Phi (torch.complex64) – (B, F, C, C) is speech PSD.
Rf (torch.complex64) – (B, F, (btaps+1) * C, (btaps+1) * C) is the power-normalized spatio-temporal covariance matrix.
reference_vector (torch.Tensor) – (B, C) is the reference_vector.
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(B, F, (btaps+1) * C)
 Return type:
filter_matrix (torch.complex64)

espnet2.enh.layers.beamformer_th.
get_WPD_filter_with_rtf
(psd_observed_bar: torch.Tensor, psd_speech: torch.Tensor, psd_noise: torch.Tensor, iterations: int = 3, reference_vector: Union[int, torch.Tensor] = 0, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-15) → torch.Tensor[source]¶ Return the WPD vector calculated with RTF.
WPD is the Weighted Power minimization Distortionless response convolutional beamformer, computed as follows:
h = (Rf^-1 @ vbar) / (vbar^H @ R^-1 @ vbar)
 Reference:
T. Nakatani and K. Kinoshita, “A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation,” in IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903-907, June 2019, doi: 10.1109/LSP.2019.2911179. https://ieeexplore.ieee.org/document/8691481
 Parameters:
psd_observed_bar (torch.complex64) – stacked observation covariance matrix
psd_speech (torch.complex64) – speech covariance matrix (…, F, C, C)
psd_noise (torch.complex64) – noise covariance matrix (…, F, C, C)
iterations (int) – number of iterations in power method
reference_vector (torch.Tensor or int) – (…, C) or scalar
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(…, F, C)
 Return type:
beamform_vector (torch.complex64)

espnet2.enh.layers.beamformer_th.
get_covariances
(Y: torch.Tensor, inverse_power: torch.Tensor, bdelay: int, btaps: int, get_vector: bool = False) → torch.Tensor[source]¶ Calculates the power-normalized spatio-temporal covariance matrix of the framed signal.
 Parameters:
Y – Complex STFT signal with shape (B, F, C, T)
inverse_power – Weighting factor with shape (B, F, T)
 Returns:
Correlation matrix: (B, F, (btaps+1) * C, (btaps+1) * C)
Correlation vector: (B, F, btaps + 1, C, C)

espnet2.enh.layers.beamformer_th.
get_gev_vector
(psd_noise: torch.Tensor, psd_speech: torch.Tensor, mode='power', reference_vector: Union[int, torch.Tensor] = 0, iterations: int = 3, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → torch.Tensor[source]¶ Return the generalized eigenvalue (GEV) beamformer vector:
psd_speech @ h = lambda * psd_noise @ h
 Reference:
Blind acoustic beamforming based on generalized eigenvalue decomposition; E. Warsitz and R. Haeb-Umbach, 2007.
 Parameters:
psd_noise (torch.complex64) – noise covariance matrix (…, F, C, C)
psd_speech (torch.complex64) – speech covariance matrix (…, F, C, C)
mode (str) – one of (“power”, “evd”); “power”: power method, “evd”: eigenvalue decomposition
reference_vector (torch.Tensor or int) – (…, C) or scalar
iterations (int) – number of iterations in power method
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(…, F, C)
 Return type:
beamform_vector (torch.complex64)

espnet2.enh.layers.beamformer_th.
get_lcmv_vector_with_rtf
(psd_n: torch.Tensor, rtf_mat: torch.Tensor, reference_vector: Union[int, torch.Tensor, None] = None, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → torch.Tensor[source]¶ Return the LCMV (Linearly Constrained Minimum Variance) vector calculated with RTF:
h = (Npsd^-1 @ rtf_mat) @ (rtf_mat^H @ Npsd^-1 @ rtf_mat)^-1 @ p
 Reference:
H. L. Van Trees, “Optimum array processing: Part IV of detection, estimation, and modulation theory,” John Wiley & Sons, 2004. (Chapter 6.7)
 Parameters:
psd_n (torch.complex64) – observation/noise covariance matrix (…, F, C, C)
rtf_mat (torch.complex64) – RTF matrix (…, F, C, num_spk)
reference_vector (torch.Tensor or int) – (…, num_spk) or scalar
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(…, F, C)
 Return type:
beamform_vector (torch.complex64)

espnet2.enh.layers.beamformer_th.
get_mvdr_vector
(psd_s, psd_n, reference_vector: Union[torch.Tensor, int], diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]¶ Return the MVDR (Minimum Variance Distortionless Response) vector:
h = (Npsd^-1 @ Spsd) / (Tr(Npsd^-1 @ Spsd)) @ u
 Reference:
On optimal frequency-domain multichannel linear filtering for noise reduction; M. Souden et al., 2010; https://ieeexplore.ieee.org/document/5089420
 Parameters:
psd_s (torch.complex64) – speech covariance matrix (…, F, C, C)
psd_n (torch.complex64) – observation/noise covariance matrix (…, F, C, C)
reference_vector (torch.Tensor) – (…, C) or an integer
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(…, F, C)
 Return type:
beamform_vector (torch.complex64)
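To make the MVDR formula above concrete, here is a minimal two-channel sketch in plain Python for one frequency bin. The helper names (`inv2`, `matmul2`, `mvdr_vector_sketch`) are hypothetical. Diagonal loading is omitted, the placement of `eps` in the denominator is an assumption, and `u` is taken to be a one-hot vector selecting the reference channel.

```python
def inv2(m):
    """Inverse of a 2x2 complex matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(x, y):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def mvdr_vector_sketch(psd_s, psd_n, ref=0, eps=1e-8):
    """h = (Npsd^-1 @ Spsd) / Tr(Npsd^-1 @ Spsd) @ u, with u one-hot at `ref`."""
    numerator = matmul2(inv2(psd_n), psd_s)
    trace = numerator[0][0] + numerator[1][1]
    # Multiplying by a one-hot u picks column `ref` of the normalized matrix.
    return [numerator[c][ref] / (trace + eps) for c in range(2)]


# Rank-1 speech PSD built from a toy steering vector, plus a noise PSD.
steering = [1.0 + 0.0j, 0.0 + 0.5j]
psd_s = [[2.0 * p * q.conjugate() for q in steering] for p in steering]
psd_n = [[1.0 + 0.0j, 0.2 + 0.0j], [0.2 + 0.0j, 1.0 + 0.0j]]

h = mvdr_vector_sketch(psd_s, psd_n, ref=0)
# Response of the beamformer to the steering vector: h^H @ steering.
response = sum(w.conjugate() * d for w, d in zip(h, steering))
```

For a rank-1 speech PSD the response equals the steering entry of the reference channel, which is the distortionless property: the speech component at the reference microphone passes through unchanged.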

espnet2.enh.layers.beamformer_th.
get_mvdr_vector_with_rtf
(psd_n: torch.Tensor, psd_speech: torch.Tensor, psd_noise: torch.Tensor, iterations: int = 3, reference_vector: Union[int, torch.Tensor, None] = None, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → torch.Tensor[source]¶ Return the MVDR (Minimum Variance Distortionless Response) vector calculated with RTF:
h = (Npsd^-1 @ rtf) / (rtf^H @ Npsd^-1 @ rtf)
 Reference:
On optimal frequency-domain multichannel linear filtering for noise reduction; M. Souden et al., 2010; https://ieeexplore.ieee.org/document/5089420
 Parameters:
psd_n (torch.complex64) – observation/noise covariance matrix (…, F, C, C)
psd_speech (torch.complex64) – speech covariance matrix (…, F, C, C)
psd_noise (torch.complex64) – noise covariance matrix (…, F, C, C)
iterations (int) – number of iterations in power method
reference_vector (torch.Tensor or int) – (…, C) or scalar
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(…, F, C)
 Return type:
beamform_vector (torch.complex64)

espnet2.enh.layers.beamformer_th.
get_mwf_vector
(psd_s, psd_n, reference_vector: Union[torch.Tensor, int], diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]¶ Return the MWF (Multichannel Wiener Filter) vector:
h = (Npsd^-1 @ Spsd) @ u
 Parameters:
psd_s (torch.complex64) – speech covariance matrix (…, F, C, C)
psd_n (torch.complex64) – power-normalized observation covariance matrix (…, F, C, C)
reference_vector (torch.Tensor or int) – (…, C) or scalar
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(…, F, C)
 Return type:
beamform_vector (torch.complex64)

espnet2.enh.layers.beamformer_th.
get_rank1_mwf_vector
(psd_speech, psd_noise, reference_vector: Union[torch.Tensor, int], denoising_weight: float = 1.0, approx_low_rank_psd_speech: bool = False, iterations: int = 3, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]¶ Return the R1-MWF (Rank-1 Multichannel Wiener Filter) vector:
h = (Npsd^-1 @ Spsd) / (mu + Tr(Npsd^-1 @ Spsd)) @ u
 Reference:
[1] Rank-1 constrained multichannel Wiener filter for speech recognition in noisy environments; Z. Wang et al, 2018 https://hal.inria.fr/hal-01634449/document
[2] Low-rank approximation based multichannel Wiener filter algorithms for noise reduction with application in cochlear implants; R. Serizel, 2014 https://ieeexplore.ieee.org/document/6730918
 Parameters:
psd_speech (torch.complex64) – speech covariance matrix (…, F, C, C)
psd_noise (torch.complex64) – noise covariance matrix (…, F, C, C)
reference_vector (torch.Tensor or int) – (…, C) or scalar
denoising_weight (float) – a trade-off parameter between noise reduction and speech distortion. A larger value leads to more noise reduction at the expense of more speech distortion. When denoising_weight = 0, it corresponds to the MVDR beamformer.
approx_low_rank_psd_speech (bool) – whether to replace original input psd_speech with its low-rank approximation as in [1]
iterations (int) – number of iterations in power method, only used when approx_low_rank_psd_speech = True
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(…, F, C)
 Return type:
beamform_vector (torch.complex64)

espnet2.enh.layers.beamformer_th.
get_rtf
(psd_speech, psd_noise, mode='power', reference_vector: Union[int, torch.Tensor] = 0, iterations: int = 3, diagonal_loading: bool = True, diag_eps: float = 1e-07)[source]¶ Calculate the relative transfer function (RTF).
 Parameters:
psd_speech (torch.complex64) – speech covariance matrix (…, F, C, C)
psd_noise (torch.complex64) – noise covariance matrix (…, F, C, C)
mode (str) – one of (“power”, “evd”); “power”: power method, “evd”: eigenvalue decomposition
reference_vector (torch.Tensor or int) – (…, C) or scalar
iterations (int) – number of iterations in power method
 Returns:
(…, F, C)
 Return type:
rtf (torch.complex64)

espnet2.enh.layers.beamformer_th.
get_rtf_matrix
(psd_speeches, psd_noises, diagonal_loading: bool = True, ref_channel: int = 0, rtf_iterations: int = 3, diag_eps: float = 1e-07, eps: float = 1e-08)[source]¶ Calculate the RTF matrix, in which each column is the relative transfer function of the corresponding source.

espnet2.enh.layers.beamformer_th.
get_sdw_mwf_vector
(psd_speech, psd_noise, reference_vector: Union[torch.Tensor, int], denoising_weight: float = 1.0, approx_low_rank_psd_speech: bool = False, iterations: int = 3, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]¶ Return the SDW-MWF (Speech Distortion Weighted Multichannel Wiener Filter) vector:
h = (Spsd + mu * Npsd)^-1 @ Spsd @ u
 Reference:
[1] Spatially pre-processed speech distortion weighted multi-channel Wiener filtering for noise reduction; A. Spriet et al, 2004 https://dl.acm.org/doi/abs/10.1016/j.sigpro.2004.07.028
[2] Rank-1 constrained multichannel Wiener filter for speech recognition in noisy environments; Z. Wang et al, 2018 https://hal.inria.fr/hal-01634449/document
[3] Low-rank approximation based multichannel Wiener filter algorithms for noise reduction with application in cochlear implants; R. Serizel, 2014 https://ieeexplore.ieee.org/document/6730918
 Parameters:
psd_speech (torch.complex64) – speech covariance matrix (…, F, C, C)
psd_noise (torch.complex64) – noise covariance matrix (…, F, C, C)
reference_vector (torch.Tensor or int) – (…, C) or scalar
denoising_weight (float) – a trade-off parameter between noise reduction and speech distortion. A larger value leads to more noise reduction at the expense of more speech distortion. The plain MWF is obtained with denoising_weight = 1 (the default).
approx_low_rank_psd_speech (bool) – whether to replace original input psd_speech with its low-rank approximation as in [2]
iterations (int) – number of iterations in power method, only used when approx_low_rank_psd_speech = True
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(…, F, C)
 Return type:
beamform_vector (torch.complex64)
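The SDW-MWF and R1-MWF formulas look different but coincide when the speech PSD has rank one, which is why the low-rank approximation matters. The sketch below checks this numerically for two channels in plain Python. All names are hypothetical, `mu` stands for denoising_weight, `u` is assumed one-hot at the reference channel, and diagonal loading and `eps` are left out.

```python
def inv2(m):
    """Inverse of a 2x2 complex matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(x, y):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def sdw_mwf_sketch(psd_speech, psd_noise, ref=0, mu=1.0):
    """h = (Spsd + mu * Npsd)^-1 @ Spsd @ u."""
    mixed = [[psd_speech[i][j] + mu * psd_noise[i][j] for j in range(2)] for i in range(2)]
    ws = matmul2(inv2(mixed), psd_speech)
    return [ws[c][ref] for c in range(2)]


def rank1_mwf_sketch(psd_speech, psd_noise, ref=0, mu=1.0):
    """h = (Npsd^-1 @ Spsd) / (mu + Tr(Npsd^-1 @ Spsd)) @ u."""
    numerator = matmul2(inv2(psd_noise), psd_speech)
    trace = numerator[0][0] + numerator[1][1]
    return [numerator[c][ref] / (mu + trace) for c in range(2)]


# Rank-1 speech PSD from a toy steering vector, plus a noise PSD.
steering = [1.0 + 0.0j, 0.3 - 0.4j]
psd_speech = [[1.5 * p * q.conjugate() for q in steering] for p in steering]
psd_noise = [[1.0 + 0.0j, 0.1 + 0.2j], [0.1 - 0.2j, 0.8 + 0.0j]]

h_sdw = sdw_mwf_sketch(psd_speech, psd_noise, mu=1.0)
h_r1 = rank1_mwf_sketch(psd_speech, psd_noise, mu=1.0)
max_diff = max(abs(x - y) for x, y in zip(h_sdw, h_r1))
```

With a full-rank speech PSD the two vectors generally differ. Setting `mu=0` in `rank1_mwf_sketch` reduces it to the MVDR expression, matching the note on denoising_weight for the R1-MWF.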

espnet2.enh.layers.beamformer_th.
gev_phase_correction
(vector)[source]¶ Phase correction to reduce distortions due to phase inconsistencies.
Ported from https://github.com/fgnt/nn-gev/blob/master/fgnt/beamforming.py#L169
 Parameters:
vector – Beamforming vector with shape (…, F, C)
 Returns:
w – phase-corrected beamforming vectors

espnet2.enh.layers.beamformer_th.
perform_WPD_filtering
(filter_matrix: torch.Tensor, Y: torch.Tensor, bdelay: int, btaps: int) → torch.Tensor[source]¶ Perform WPD filtering.
 Parameters:
filter_matrix – Filter matrix (B, F, (btaps + 1) * C)
Y – Complex STFT signal with shape (B, F, C, T)
 Returns:
(B, F, T)
 Return type:
enhanced (torch.complex64)

espnet2.enh.layers.beamformer_th.
prepare_beamformer_stats
(signal, masks_speech, mask_noise, powers=None, beamformer_type='mvdr', bdelay=3, btaps=5, eps=1e-06)[source]¶ Prepare necessary statistics for constructing the specified beamformer.
 Parameters:
signal (torch.complex64) – (…, F, C, T)
masks_speech (List[torch.Tensor]) – (…, F, C, T) masks for all speech sources
mask_noise (torch.Tensor) – (…, F, C, T) noise mask
powers (List[torch.Tensor]) – powers for all speech sources (…, F, T) used for wMPDR or WPD beamformers
beamformer_type (str) – one of the predefined beamformer types
bdelay (int) – delay factor, used for WPD beamformer
btaps (int) – number of filter taps, used for WPD beamformer
eps (torch.Tensor) – tiny constant
 Returns:
a dictionary containing all necessary statistics, e.g. “psd_n”, “psd_speech”, “psd_distortion”.
Note:
When masks_speech is a tensor or a single-element list, all returned statistics are tensors.
When masks_speech is a multi-element list, some returned statistics can be lists, e.g., “psd_n” for MVDR, “psd_speech” and “psd_distortion”.
 Return type:
beamformer_stats (dict)
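As a rough picture of what a mask-derived statistic such as “psd_speech” or “psd_n” can look like, the sketch below computes a mask-weighted spatial covariance for one frequency bin. It is an assumption-laden illustration rather than the library routine: the mask is averaged over channels, the weights are normalized by their sum plus `eps`, and `masked_psd_sketch` is a hypothetical name.

```python
def masked_psd_sketch(signal, mask, eps=1e-6):
    """Mask-weighted spatial covariance for one frequency bin.

    signal: C x T nested list of complex STFT values.
    mask:   C x T nested list of real values in [0, 1].
    Returns a C x C nested list: sum_t w[t] * x[t] x[t]^H / (sum_t w[t] + eps).
    """
    num_ch, num_frames = len(signal), len(signal[0])
    # One weight per frame: the mask averaged over channels (assumed convention).
    weights = [sum(mask[c][t] for c in range(num_ch)) / num_ch for t in range(num_frames)]
    norm = sum(weights) + eps
    return [
        [
            sum(weights[t] * signal[c][t] * signal[e][t].conjugate() for t in range(num_frames)) / norm
            for e in range(num_ch)
        ]
        for c in range(num_ch)
    ]


# Two channels, two frames: frame 0 is marked as speech, frame 1 as noise.
signal = [[1 + 0j, 0 + 1j], [1 + 0j, 1 + 0j]]
mask_speech = [[1.0, 0.0], [1.0, 0.0]]
mask_noise = [[0.0, 1.0], [0.0, 1.0]]

psd_speech = masked_psd_sketch(signal, mask_speech)
psd_noise = masked_psd_sketch(signal, mask_noise)
```

Both outputs are Hermitian C x C matrices, matching the (…, F, C, C) shapes that the beamformer functions in this module expect for PSD inputs.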

espnet2.enh.layers.beamformer_th.
signal_framing
(signal: torch.Tensor, frame_length: int, frame_step: int, bdelay: int, do_padding: bool = False, pad_value: int = 0, indices: List = None) → torch.Tensor[source]¶ Expand signal into several frames, with each frame of length frame_length.
 Parameters:
signal – (…, T)
frame_length – length of each segment
frame_step – step for selecting frames
bdelay – delay for WPD
do_padding – whether or not to pad the input signal at the beginning of the time dimension
pad_value – value to fill in the padding
 Returns:
if do_padding: (…, T, frame_length)
else: (…, T - bdelay - frame_length + 2, frame_length)
 Return type:
torch.Tensor
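The output lengths above can be reproduced with a small index sketch. It assumes one particular window layout that is consistent with the documented shapes: each window holds frame_length - 1 consecutive samples followed by one sample placed bdelay - 1 positions further on, and padding is added on the left. That layout and the name `signal_framing_sketch` are assumptions; a plain list stands in for the last (time) axis.

```python
def signal_framing_sketch(signal, frame_length, frame_step, bdelay, do_padding=False, pad_value=0):
    """Expand a 1-D sequence into windows of length `frame_length`."""
    taps = frame_length - 1
    if do_padding:
        # Left-pad so that the number of windows equals the input length
        # (when frame_step is 1).
        signal = [pad_value] * (bdelay + taps - 1) + list(signal)
    starts = range(0, len(signal) - taps - bdelay + 1, frame_step)
    # `taps` consecutive samples, then one sample after a gap of bdelay - 1.
    return [[*signal[i:i + taps], signal[i + taps + bdelay - 1]] for i in starts]


samples = list(range(10))  # T = 10
frames = signal_framing_sketch(samples, frame_length=3, frame_step=1, bdelay=3)
padded = signal_framing_sketch(samples, frame_length=3, frame_step=1, bdelay=3, do_padding=True)
```

Here `frames` has T - bdelay - frame_length + 2 = 6 windows and `padded` has T = 10 windows, each of length frame_length.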

espnet2.enh.layers.beamformer_th.
tik_reg
(mat, reg: float = 1e-08, eps: float = 1e-08)[source]¶ Perform Tikhonov regularization (only modifying real part).
 Parameters:
mat (torch.complex64) – input matrix (…, C, C)
reg (float) – regularization factor
eps (float) –
 Returns:
regularized matrix (…, C, C)
 Return type:
ret (torch.complex64)
espnet2.enh.layers.tcn¶

class
espnet2.enh.layers.tcn.
ChannelwiseLayerNorm
(channel_size, shape='BDT')[source]¶ Bases:
torch.nn.modules.module.Module
Channel-wise Layer Normalization (cLN).

class
espnet2.enh.layers.tcn.
Chomp1d
(chomp_size)[source]¶ Bases:
torch.nn.modules.module.Module
To ensure the output length is the same as the input.

class
espnet2.enh.layers.tcn.
DepthwiseSeparableConv
(in_channels, out_channels, skip_channels, kernel_size, stride, padding, dilation, norm_type='gLN', causal=False)[source]¶ Bases:
torch.nn.modules.module.Module

class
espnet2.enh.layers.tcn.
GlobalLayerNorm
(channel_size, shape='BDT')[source]¶ Bases:
torch.nn.modules.module.Module
Global Layer Normalization (gLN).

class
espnet2.enh.layers.tcn.
TemporalBlock
(in_channels, out_channels, skip_channels, kernel_size, stride, padding, dilation, norm_type='gLN', causal=False)[source]¶ Bases:
torch.nn.modules.module.Module

class
espnet2.enh.layers.tcn.
TemporalConvNet
(N, B, H, P, X, R, C, Sc=None, out_channel=None, norm_type='gLN', causal=False, pre_mask_nonlinear='linear', mask_nonlinear='relu')[source]¶ Bases:
torch.nn.modules.module.Module
Basic Module of TasNet.
 Parameters:
N – Number of filters in autoencoder
B – Number of channels in bottleneck 1x1-conv block
H – Number of channels in convolutional blocks
P – Kernel size in convolutional blocks
X – Number of convolutional blocks in each repeat
R – Number of repeats
C – Number of speakers
Sc – Number of channels in skip-connection paths’ 1x1-conv blocks
out_channel – Number of output channels; if it is None, N will be used instead.
norm_type – BN, gLN, cLN
causal – causal or non-causal
pre_mask_nonlinear – the non-linear function before masknet
mask_nonlinear – which non-linear function to use to generate the mask
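To get a feel for how P, X and R interact, the sketch below lists a dilation schedule and the resulting temporal receptive field. It assumes the Conv-TasNet convention in which the x-th block of every repeat uses dilation 2**x; that schedule is an assumption not stated in the parameter list above, and `tcn_receptive_field_sketch` is a hypothetical helper.

```python
def tcn_receptive_field_sketch(P, X, R):
    """Dilation per block and receptive field (in frames) of the conv stack.

    Assumes dilation 2**x for block x within each of the R repeats and
    stride 1, so every block widens the receptive field by (P - 1) * dilation.
    """
    dilations = [2 ** x for _ in range(R) for x in range(X)]
    receptive_field = 1 + sum((P - 1) * d for d in dilations)
    return dilations, receptive_field


# A commonly used configuration: kernel size 3, 8 blocks per repeat, 3 repeats.
dilations, receptive_field = tcn_receptive_field_sketch(P=3, X=8, R=3)
```

Under these assumptions the stack has X * R = 24 blocks and sees 1531 encoder frames; increasing X grows the receptive field exponentially, while increasing R grows it linearly.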

class
espnet2.enh.layers.tcn.
TemporalConvNetInformed
(N, B, H, P, X, R, Sc=None, out_channel=None, norm_type='gLN', causal=False, pre_mask_nonlinear='prelu', mask_nonlinear='relu', i_adapt_layer: int = 7, adapt_layer_type: str = 'mul', adapt_enroll_dim: int = 128, **adapt_layer_kwargs)[source]¶ Bases:
espnet2.enh.layers.tcn.TemporalConvNet
Basic Module of TasNet with adaptation layers.
 Parameters:
N – Number of filters in autoencoder
B – Number of channels in bottleneck 1x1-conv block
H – Number of channels in convolutional blocks
P – Kernel size in convolutional blocks
X – Number of convolutional blocks in each repeat
R – Number of repeats
Sc – Number of channels in skip-connection paths’ 1x1-conv blocks
out_channel – Number of output channels; if it is None, N will be used instead.
norm_type – BN, gLN, cLN
causal – causal or non-causal
pre_mask_nonlinear – the non-linear function before masknet
mask_nonlinear – which non-linear function to use to generate the mask
i_adapt_layer – int, index of the adaptation layer
adapt_layer_type – str, type of adaptation layer; see espnet2.enh.layers.adapt_layers for options
adapt_enroll_dim – int, dimensionality of the speaker embedding
espnet2.enh.layers.wpe¶

espnet2.enh.layers.wpe.
get_correlations
(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], inverse_power: torch.Tensor, taps, delay) → Tuple[Union[torch.Tensor, torch_complex.tensor.ComplexTensor], Union[torch.Tensor, torch_complex.tensor.ComplexTensor]][source]¶ Calculates weighted correlations of a window of length taps
 Parameters:
Y – Complex-valued STFT signal with shape (F, C, T)
inverse_power – Weighting factor with shape (F, T)
taps (int) – Length of the correlation window
delay (int) – Delay for the weighting factor
 Returns:
Correlation matrix of shape (F, taps*C, taps*C)
Correlation vector of shape (F, taps, C, C)

espnet2.enh.layers.wpe.
get_filter_matrix_conj
(correlation_matrix: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], correlation_vector: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], eps: float = 1e-10) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Calculate (conjugate) filter matrix based on correlations for one freq.
 Parameters:
correlation_matrix – Correlation matrix (F, taps * C, taps * C)
correlation_vector – Correlation vector (F, taps, C, C)
eps –
 Returns:
(F, taps, C, C)
 Return type:
filter_matrix_conj (torch.complex/ComplexTensor)

espnet2.enh.layers.wpe.
get_power
(signal, dim=-2) → torch.Tensor[source]¶ Calculates power for signal
 Parameters:
signal – Single frequency signal with shape (F, C, T).
dim – reduce_mean axis
 Returns:
Power with shape (F, T)
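The shape change from (F, C, T) to (F, T) corresponds to averaging over the channel axis. A minimal sketch with nested lists in place of tensors follows. Taking "power" to mean the squared magnitude real**2 + imag**2 is an assumption, and `get_power_sketch` is a hypothetical name.

```python
def get_power_sketch(signal):
    """Channel-averaged power.

    signal: F x C x T nested list of complex values.
    Returns an F x T nested list of floats.
    """
    return [
        [
            # Mean over channels of the squared magnitude at frame t.
            sum(ch[t].real ** 2 + ch[t].imag ** 2 for ch in freq_bin) / len(freq_bin)
            for t in range(len(freq_bin[0]))
        ]
        for freq_bin in signal
    ]


# F = 1 frequency bin, C = 2 channels, T = 2 frames.
stft = [[[1 + 1j, 2 + 0j], [0 + 1j, 0 + 0j]]]
power = get_power_sketch(stft)
```

The result keeps one value per frequency and frame, which is the form taken by the power and inverse_power arguments of the WPE routines in this module.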

espnet2.enh.layers.wpe.
is_torch_1_9_plus
= True¶ WPE PyTorch version, ported from https://github.com/fgnt/nara_wpe. Many functions are not yet sufficiently tested.

espnet2.enh.layers.wpe.
perform_filter_operation
(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], filter_matrix_conj: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], taps, delay) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶  Parameters:
Y – Complex-valued STFT signal of shape (F, C, T)
filter_matrix_conj – Filter matrix (F, taps, C, C)

espnet2.enh.layers.wpe.
signal_framing
(signal: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], frame_length: int, frame_step: int, pad_value=0) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Expands signal into frames of frame_length.
 Parameters:
signal – (B * F, D, T)
 Returns:
(B * F, D, T, W)
 Return type:
torch.Tensor

espnet2.enh.layers.wpe.
wpe
(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], taps=10, delay=3, iterations=3) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ WPE
 Parameters:
Y – Complex-valued STFT signal with shape (F, C, T)
taps – Number of filter taps
delay – Delay as a guard interval, such that X does not become zero.
iterations –
 Returns:
(F, C, T)
 Return type:
enhanced

espnet2.enh.layers.wpe.
wpe_one_iteration
(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], power: torch.Tensor, taps: int = 10, delay: int = 3, eps: float = 1e-10, inverse_power: bool = True) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ WPE for one iteration
 Parameters:
Y – Complex-valued STFT signal with shape (…, C, T)
power – (…, T)
taps – Number of filter taps
delay – Delay as a guard interval, such that X does not become zero.
eps –
inverse_power (bool) –
 Returns:
(…, C, T)
 Return type:
enhanced
espnet2.enh.layers.complexnn¶

class
espnet2.enh.layers.complexnn.
ComplexBatchNorm
(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, complex_axis=1)[source]¶ Bases:
torch.nn.modules.module.Module

extra_repr
()[source]¶ Set the extra representation of the module
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward
(inputs)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.complexnn.
ComplexConv2d
(in_channels, out_channels, kernel_size=(1, 1), stride=(1, 1), padding=(0, 0), dilation=1, groups=1, causal=True, complex_axis=1)[source]¶ Bases:
torch.nn.modules.module.Module
ComplexConv2d.
in_channels: real+imag
out_channels: real+imag
kernel_size: input [B,C,D,T], kernel size in [D,T]
padding: input [B,C,D,T], padding in [D,T]
causal: if causal, pads the left side of the time dimension, otherwise both sides

forward
(inputs)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.complexnn.
ComplexConvTranspose2d
(in_channels, out_channels, kernel_size=(1, 1), stride=(1, 1), padding=(0, 0), output_padding=(0, 0), causal=False, complex_axis=1, groups=1)[source]¶ Bases:
torch.nn.modules.module.Module
ComplexConvTranspose2d.
in_channels: real+imag
out_channels: real+imag

forward
(inputs)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.enh.layers.dc_crn¶

class
espnet2.enh.layers.dc_crn.
DC_CRN
(input_dim, input_channels: List = [2, 16, 32, 64, 128, 256], enc_hid_channels=8, enc_kernel_size=(1, 3), enc_padding=(0, 1), enc_last_kernel_size=(1, 4), enc_last_stride=(1, 2), enc_last_padding=(0, 1), enc_layers=5, skip_last_kernel_size=(1, 3), skip_last_stride=(1, 1), skip_last_padding=(0, 1), glstm_groups=2, glstm_layers=2, glstm_bidirectional=False, glstm_rearrange=False, output_channels=2)[source]¶ Bases:
torch.nn.modules.module.Module
Densely-Connected Convolutional Recurrent Network (DC-CRN).
Reference: Fig. 3 and Section III-B in [1]
 Parameters:
input_dim (int) – input feature dimension
input_channels (list) – number of input channels for the stacked DenselyConnectedBlock layers. Its length should be (number of DenselyConnectedBlock layers). It is recommended to use an even number of channels to avoid an AssertionError when glstm_bidirectional=True.
enc_hid_channels (int) – common number of intermediate channels for all DenselyConnectedBlock of the encoder
enc_kernel_size (tuple) – common kernel size for all DenselyConnectedBlock of the encoder
enc_padding (tuple) – common padding for all DenselyConnectedBlock of the encoder
enc_last_kernel_size (tuple) – common kernel size for the last Conv layer in all DenselyConnectedBlock of the encoder
enc_last_stride (tuple) – common stride for the last Conv layer in all DenselyConnectedBlock of the encoder
enc_last_padding (tuple) – common padding for the last Conv layer in all DenselyConnectedBlock of the encoder
enc_layers (int) – common total number of Conv layers for all DenselyConnectedBlock layers of the encoder
skip_last_kernel_size (tuple) – common kernel size for the last Conv layer in all DenselyConnectedBlock of the skip pathways
skip_last_stride (tuple) – common stride for the last Conv layer in all DenselyConnectedBlock of the skip pathways
skip_last_padding (tuple) – common padding for the last Conv layer in all DenselyConnectedBlock of the skip pathways
glstm_groups (int) – number of groups in each Grouped LSTM layer
glstm_layers (int) – number of Grouped LSTM layers
glstm_bidirectional (bool) – whether to use BLSTM or unidirectional LSTM in Grouped LSTM layers
glstm_rearrange (bool) – whether to apply the rearrange operation after each grouped LSTM layer
output_channels (int) – number of output channels (must be an even number to recover both real and imaginary parts)

class
espnet2.enh.layers.dc_crn.
DenselyConnectedBlock
(in_channels, out_channels, hid_channels=8, kernel_size=(1, 3), padding=(0, 1), last_kernel_size=(1, 4), last_stride=(1, 2), last_padding=(0, 1), last_output_padding=(0, 0), layers=5, transposed=False)[source]¶ Bases:
torch.nn.modules.module.Module
Densely-Connected Convolutional Block.
 Parameters:
in_channels (int) – number of input channels
out_channels (int) – number of output channels
hid_channels (int) – number of output channels in intermediate Conv layers
kernel_size (tuple) – kernel size for all but the last Conv layers
padding (tuple) – padding for all but the last Conv layers
last_kernel_size (tuple) – kernel size for the last GluConv layer
last_stride (tuple) – stride for the last GluConv layer
last_padding (tuple) – padding for the last GluConv layer
last_output_padding (tuple) – output padding for the last GluConvTranspose2d (only used when transposed=True)
layers (int) – total number of Conv layers
transposed (bool) – True to use GluConvTranspose2d in the last layer; False to use GluConv2d in the last layer

class
espnet2.enh.layers.dc_crn.
GLSTM
(hidden_size=1024, groups=2, layers=2, bidirectional=False, rearrange=False)[source]¶ Bases:
torch.nn.modules.module.Module
Grouped LSTM.
 Reference:
Efficient Sequence Learning with Group Recurrent Networks; Gao et al., 2018
 Parameters:
hidden_size (int) – total hidden size of all LSTMs in each grouped LSTM layer, i.e., the hidden size of each LSTM is hidden_size // groups
groups (int) – number of LSTMs in each grouped LSTM layer
layers (int) – number of grouped LSTM layers
bidirectional (bool) – whether to use BLSTM or unidirectional LSTM
rearrange (bool) – whether to apply the rearrange operation after each grouped LSTM layer

class
espnet2.enh.layers.dc_crn.
GluConv2d
(in_channels, out_channels, kernel_size, stride, padding=0)[source]¶ Bases:
torch.nn.modules.module.Module
Conv2d with Gated Linear Units (GLU).
Input and output shapes are the same as regular Conv2d layers.
Reference: Section III-B in [1]
 Parameters:
in_channels (int) – number of input channels
out_channels (int) – number of output channels
kernel_size (int/tuple) – kernel size in Conv2d
stride (int/tuple) – stride size in Conv2d
padding (int/tuple) – padding size in Conv2d
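The gating step of a GLU can be sketched in isolation. This assumes the layer multiplies the output of one convolution by the sigmoid of a second, parallel convolution, which is an assumption about the implementation; the two arrays below are random stand-ins for those convolution outputs and `glu_gate` is a hypothetical helper:

```python
import numpy as np


def glu_gate(main, gate):
    """Gate `main` element-wise with the sigmoid of `gate`."""
    return main * (1.0 / (1.0 + np.exp(-gate)))


# Stand-ins for the outputs of two convolutions with identical shapes
rng = np.random.default_rng(0)
main = rng.standard_normal((1, 4, 10, 8))
gate = rng.standard_normal((1, 4, 10, 8))
out = glu_gate(main, gate)
print(out.shape)
```

Because the sigmoid lies in (0, 1), the gate can only attenuate the main branch, never amplify it.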

class
espnet2.enh.layers.dc_crn.
GluConvTranspose2d
(in_channels, out_channels, kernel_size, stride, padding=0, output_padding=(0, 0))[source]¶ Bases:
torch.nn.modules.module.Module
ConvTranspose2d with Gated Linear Units (GLU).
Input and output shapes are the same as regular ConvTranspose2d layers.
Reference: Section III-B in [1]
 Parameters:
in_channels (int) – number of input channels
out_channels (int) – number of output channels
kernel_size (int/tuple) – kernel size in ConvTranspose2d
stride (int/tuple) – stride size in ConvTranspose2d
padding (int/tuple) – padding size in ConvTranspose2d
output_padding (int/tuple) – Additional size added to one side of each dimension in the output shape
espnet2.enh.layers.beamformer¶
Beamformer module.

espnet2.enh.layers.beamformer.
apply_beamforming_vector
(beamform_vector: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], mix: Union[torch.Tensor, torch_complex.tensor.ComplexTensor]) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶

espnet2.enh.layers.beamformer.
blind_analytic_normalization
(ws, psd_noise, eps=1e-08)[source]¶ Blind analytic normalization (BAN) for post-filtering
 Parameters:
ws (torch.complex64/ComplexTensor) – beamformer vector (…, F, C)
psd_noise (torch.complex64/ComplexTensor) – noise PSD matrix (…, F, C, C)
eps (float) –
 Returns:
normalized beamformer vector (…, F)
 Return type:
ws_ban (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.
generalized_eigenvalue_decomposition
(a: torch.Tensor, b: torch.Tensor, eps=1e-06)[source]¶ Solves the generalized eigenvalue decomposition through Cholesky decomposition.
Ported from https://github.com/asteroid-team/asteroid/blob/master/asteroid/dsp/beamforming.py#L464
a @ e_vec = e_val * b @ e_vec
Cholesky decomposition on b: b = L @ L^H, where L is a lower triangular matrix.
Let C = L^-1 @ a @ L^-H; it is Hermitian.
=> C @ y = lambda * y
=> e_vec = L^-H @ y
Reference: https://www.netlib.org/lapack/lug/node54.html
 Parameters:
a – A complex Hermitian or real symmetric matrix whose eigenvalues and eigenvectors will be computed. (…, C, C)
b – A complex Hermitian or real symmetric positive-definite matrix. (…, C, C)
 Returns:
generalized eigenvalues (ascending order) e_vec: generalized eigenvectors
 Return type:
e_val
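The Cholesky reduction described above can be sketched with NumPy. This is an illustrative re-implementation under the stated assumptions (Hermitian `a`, Hermitian positive-definite `b`), not the ESPnet code; `gevd_cholesky` is a hypothetical name and the `eps` handling of the original function is omitted:

```python
import numpy as np


def gevd_cholesky(a, b):
    """Solve a @ v = lam * b @ v for Hermitian `a` and Hermitian positive-definite `b`."""
    L = np.linalg.cholesky(b)                       # b = L @ L^H, L lower triangular
    L_inv = np.linalg.inv(L)
    L_inv_H = np.conj(np.swapaxes(L_inv, -1, -2))   # L^-H
    C = L_inv @ a @ L_inv_H                         # Hermitian standard-form matrix
    e_val, vecs = np.linalg.eigh(C)                 # eigenvalues in ascending order
    e_vec = L_inv_H @ vecs                          # map back: e_vec = L^-H @ y
    return e_val, e_vec


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
z = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
a = x + x.conj().T                    # Hermitian
b = z @ z.conj().T + 4 * np.eye(4)    # Hermitian positive-definite
e_val, e_vec = gevd_cholesky(a, b)

# Each column of e_vec satisfies a @ v = lam * b @ v
print(np.allclose(a @ e_vec, (b @ e_vec) * e_val))
```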

espnet2.enh.layers.beamformer.
get_WPD_filter
(Phi: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], Rf: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], reference_vector: torch.Tensor, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Return the WPD vector.
WPD is the Weighted Power minimization Distortionless response convolutional beamformer. As follows:
h = (Rf^-1 @ Phi_{xx}) / tr[(Rf^-1) @ Phi_{xx}] @ u
 Reference:
T. Nakatani and K. Kinoshita, “A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation,” in IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903-907, June 2019, doi: 10.1109/LSP.2019.2911179. https://ieeexplore.ieee.org/document/8691481
 Parameters:
Phi (torch.complex64/ComplexTensor) – (B, F, (btaps+1) * C, (btaps+1) * C) is the PSD of zero-padded speech [x^T(t,f) 0 … 0]^T.
Rf (torch.complex64/ComplexTensor) – (B, F, (btaps+1) * C, (btaps+1) * C) is the power-normalized spatio-temporal covariance matrix.
reference_vector (torch.Tensor) – (B, (btaps+1) * C) is the reference_vector.
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(B, F, (btaps + 1) * C)
 Return type:
filter_matrix (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.
get_WPD_filter_v2
(Phi: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], Rf: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], reference_vector: torch.Tensor, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Return the WPD vector (v2).
 This implementation is more efficient than get_WPD_filter as
it skips unnecessary computation with zeros.
 Parameters:
Phi (torch.complex64/ComplexTensor) – (B, F, C, C) is speech PSD.
Rf (torch.complex64/ComplexTensor) – (B, F, (btaps+1) * C, (btaps+1) * C) is the power-normalized spatio-temporal covariance matrix.
reference_vector (torch.Tensor) – (B, C) is the reference_vector.
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(B, F, (btaps+1) * C)
 Return type:
filter_matrix (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.
get_WPD_filter_with_rtf
(psd_observed_bar: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_speech: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_noise: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], iterations: int = 3, reference_vector: Union[int, torch.Tensor, None] = None, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-15) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Return the WPD vector calculated with RTF.
WPD is the Weighted Power minimization Distortionless response convolutional beamformer. As follows:
h = (Rf^-1 @ vbar) / (vbar^H @ R^-1 @ vbar)
 Reference:
T. Nakatani and K. Kinoshita, “A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation,” in IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903-907, June 2019, doi: 10.1109/LSP.2019.2911179. https://ieeexplore.ieee.org/document/8691481
 Parameters:
psd_observed_bar (torch.complex64/ComplexTensor) – stacked observation covariance matrix
psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)
psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)
iterations (int) – number of iterations in power method
reference_vector (torch.Tensor or int) – (…, C) or scalar
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(…, F, C)
 Return type:
beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.
get_covariances
(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], inverse_power: torch.Tensor, bdelay: int, btaps: int, get_vector: bool = False) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶  Calculates the power normalized spatiotemporal covariance
matrix of the framed signal.
 Parameters:
Y – Complex STFT signal with shape (B, F, C, T)
inverse_power – Weighting factor with shape (B, F, T)
 Returns:
(B, F, (btaps+1) * C, (btaps+1) * C) Correlation vector: (B, F, btaps + 1, C, C)
 Return type:
Correlation matrix

espnet2.enh.layers.beamformer.
get_gev_vector
(psd_noise: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_speech: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], mode='power', reference_vector: Union[int, torch.Tensor] = 0, iterations: int = 3, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Return the generalized eigenvalue (GEV) beamformer vector:
psd_speech @ h = lambda * psd_noise @ h
 Reference:
Blind acoustic beamforming based on generalized eigenvalue decomposition; E. Warsitz and R. Haeb-Umbach, 2007.
 Parameters:
psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)
psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)
mode (str) – one of (“power”, “evd”) “power”: power method “evd”: eigenvalue decomposition (only for torch built-in complex tensors)
reference_vector (torch.Tensor or int) – (…, C) or scalar
iterations (int) – number of iterations in power method
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(…, F, C)
 Return type:
beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.
get_lcmv_vector_with_rtf
(psd_n: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], rtf_mat: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], reference_vector: Union[int, torch.Tensor, None] = None, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶  Return the LCMV (Linearly Constrained Minimum Variance) vector
calculated with RTF:
h = (Npsd^-1 @ rtf_mat) @ (rtf_mat^H @ Npsd^-1 @ rtf_mat)^-1 @ p
 Reference:
H. L. Van Trees, “Optimum array processing: Part IV of detection, estimation, and modulation theory,” John Wiley & Sons, 2004. (Chapter 6.7)
 Parameters:
psd_n (torch.complex64/ComplexTensor) – observation/noise covariance matrix (…, F, C, C)
rtf_mat (torch.complex64/ComplexTensor) – RTF matrix (…, F, C, num_spk)
reference_vector (torch.Tensor or int) – (…, num_spk) or scalar
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(…, F, C)
 Return type:
beamform_vector (torch.complex64/ComplexTensor)
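The LCMV formula can be checked numerically for a single frequency bin. The sketch below uses NumPy rather than torch and omits diagonal loading and `eps`; `lcmv_vector` is a hypothetical name, and treating `p` as a one-hot selector over sources is an assumption made for the example:

```python
import numpy as np


def lcmv_vector(psd_n, rtf_mat, p):
    """h = (N^-1 @ A) @ (A^H @ N^-1 @ A)^-1 @ p for one frequency bin."""
    n_inv_a = np.linalg.solve(psd_n, rtf_mat)      # N^-1 @ A, shape (C, num_spk)
    gram = rtf_mat.conj().T @ n_inv_a              # A^H @ N^-1 @ A, shape (num_spk, num_spk)
    return n_inv_a @ np.linalg.solve(gram, p)      # shape (C,)


rng = np.random.default_rng(0)
C, num_spk = 4, 2
y = rng.standard_normal((C, C)) + 1j * rng.standard_normal((C, C))
psd_n = y @ y.conj().T + np.eye(C)                 # positive-definite noise covariance
rtf_mat = rng.standard_normal((C, num_spk)) + 1j * rng.standard_normal((C, num_spk))
p = np.array([1.0, 0.0])                           # keep source 0, null source 1
h = lcmv_vector(psd_n, rtf_mat, p)

# The linear constraints rtf_mat^H @ h = p hold by construction
constraints = rtf_mat.conj().T @ h
print(np.round(np.abs(constraints), 6))
```

With a single column in `rtf_mat` and `p = [1]`, the same expression reduces to the MVDR-with-RTF formula.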

espnet2.enh.layers.beamformer.
get_mvdr_vector
(psd_s, psd_n, reference_vector: torch.Tensor, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]¶ Return the MVDR (Minimum Variance Distortionless Response) vector:
h = (Npsd^-1 @ Spsd) / (Tr(Npsd^-1 @ Spsd)) @ u
 Reference:
On optimal frequency-domain multichannel linear filtering for noise reduction; M. Souden et al., 2010; https://ieeexplore.ieee.org/document/5089420
 Parameters:
psd_s (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)
psd_n (torch.complex64/ComplexTensor) – observation/noise covariance matrix (…, F, C, C)
reference_vector (torch.Tensor) – (…, C)
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(…, F, C)
 Return type:
beamform_vector (torch.complex64/ComplexTensor)
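A minimal NumPy sketch of the MVDR formula above, batched over frequency bins. It is not the ESPnet implementation: diagonal loading is left out, `mvdr_souden` is a hypothetical name, and the rank-1 speech covariance in the usage part is an assumption chosen so that the distortionless property can be verified exactly:

```python
import numpy as np


def mvdr_souden(psd_s, psd_n, u, eps=1e-8):
    """h = (N^-1 @ S) / Tr(N^-1 @ S) @ u, batched over leading dimensions."""
    numerator = np.linalg.solve(psd_n, psd_s)                          # (..., F, C, C)
    trace = np.trace(numerator, axis1=-2, axis2=-1)[..., None, None]
    ws = numerator / (trace + eps)
    return np.einsum("...fec,...c->...fe", ws, u)                      # (..., F, C)


rng = np.random.default_rng(0)
F, C = 3, 4
d = rng.standard_normal((F, C)) + 1j * rng.standard_normal((F, C))    # steering vectors
y = rng.standard_normal((F, C, C)) + 1j * rng.standard_normal((F, C, C))
psd_s = np.einsum("fc,fe->fce", d, d.conj())                           # rank-1 speech PSD
psd_n = y @ y.conj().swapaxes(-1, -2) + np.eye(C)                      # positive-definite noise PSD
u = np.zeros(C)
u[0] = 1.0                                                             # reference channel 0
h = mvdr_souden(psd_s, psd_n, u)

# Distortionless response: h^H @ d equals the steering vector at the reference channel
response = np.einsum("fc,fc->f", h.conj(), d)
print(np.allclose(response, d[:, 0]))
```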

espnet2.enh.layers.beamformer.
get_mvdr_vector_with_rtf
(psd_n: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_speech: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_noise: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], iterations: int = 3, reference_vector: Union[int, torch.Tensor, None] = None, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶  Return the MVDR (Minimum Variance Distortionless Response) vector
calculated with RTF:
h = (Npsd^-1 @ rtf) / (rtf^H @ Npsd^-1 @ rtf)
 Reference:
On optimal frequency-domain multichannel linear filtering for noise reduction; M. Souden et al., 2010; https://ieeexplore.ieee.org/document/5089420
 Parameters:
psd_n (torch.complex64/ComplexTensor) – observation/noise covariance matrix (…, F, C, C)
psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)
psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)
iterations (int) – number of iterations in power method
reference_vector (torch.Tensor or int) – (…, C) or scalar
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(…, F, C)
 Return type:
beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.
get_mwf_vector
(psd_s, psd_n, reference_vector: Union[torch.Tensor, int], diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]¶ Return the MWF (Multichannel Wiener Filter) vector:
h = (Npsd^-1 @ Spsd) @ u
 Parameters:
psd_s (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)
psd_n (torch.complex64/ComplexTensor) – power-normalized observation covariance matrix (…, F, C, C)
reference_vector (torch.Tensor or int) – (…, C) or scalar
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(…, F, C)
 Return type:
beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.
get_power_spectral_density_matrix
(xs, mask, normalization=True, reduction='mean', eps: float = 1e-15)[source]¶ Return the cross-channel power spectral density (PSD) matrix
 Parameters:
xs (torch.complex64/ComplexTensor) – (…, F, C, T)
reduction (str) – “mean” or “median”
mask (torch.Tensor) – (…, F, C, T)
normalization (bool) –
eps (float) –
 Returns
psd (torch.complex64/ComplexTensor): (…, F, C, C)
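A mask-weighted PSD estimate with the shapes listed above can be sketched as follows. This NumPy version covers only a "mean"-style channel reduction of the mask and is an assumption about how the mask is applied, not a copy of the ESPnet function; `psd_matrix` is a hypothetical name:

```python
import numpy as np


def psd_matrix(xs, mask, normalization=True, eps=1e-15):
    """Mask-weighted cross-channel PSD: sum over t of m(t) * x(t) @ x(t)^H."""
    mask = mask.mean(axis=-2, keepdims=True)                        # (..., F, 1, T)
    if normalization:
        mask = mask / (mask.sum(axis=-1, keepdims=True) + eps)      # weights sum to ~1 over T
    return np.einsum("...ct,...et->...ce", xs * mask, xs.conj())    # (..., F, C, C)


rng = np.random.default_rng(0)
F, C, T = 5, 3, 20
xs = rng.standard_normal((F, C, T)) + 1j * rng.standard_normal((F, C, T))
mask = rng.uniform(size=(F, C, T))
psd = psd_matrix(xs, mask)
print(psd.shape)
```

Since the mask is real-valued, the result is Hermitian in its last two dimensions, which is what the beamformer functions in this module expect of a covariance matrix.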

espnet2.enh.layers.beamformer.
get_rank1_mwf_vector
(psd_speech, psd_noise, reference_vector: Union[torch.Tensor, int], denoising_weight: float = 1.0, approx_low_rank_psd_speech: bool = False, iterations: int = 3, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]¶ Return the R1-MWF (Rank-1 Multichannel Wiener Filter) vector
h = (Npsd^-1 @ Spsd) / (mu + Tr(Npsd^-1 @ Spsd)) @ u
 Reference:
[1] Rank-1 constrained multichannel Wiener filter for speech recognition in noisy environments; Z. Wang et al, 2018 https://hal.inria.fr/hal-01634449/document
[2] Low-rank approximation based multichannel Wiener filter algorithms for noise reduction with application in cochlear implants; R. Serizel, 2014 https://ieeexplore.ieee.org/document/6730918
 Parameters:
psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)
psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)
reference_vector (torch.Tensor or int) – (…, C) or scalar
denoising_weight (float) – a trade-off parameter between noise reduction and speech distortion. A larger value leads to more noise reduction at the expense of more speech distortion. When denoising_weight = 0, it corresponds to MVDR beamformer.
approx_low_rank_psd_speech (bool) – whether to replace original input psd_speech with its low-rank approximation as in [1]
iterations (int) – number of iterations in power method, only used when approx_low_rank_psd_speech = True
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(…, F, C)
 Return type:
beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.
get_rtf
(psd_speech, psd_noise, mode='power', reference_vector: Union[int, torch.Tensor] = 0, iterations: int = 3)[source]¶ Calculate the relative transfer function (RTF)
 Algorithm of power method:
1) rtf = reference_vector
2) for i in range(iterations):
       rtf = (psd_noise^-1 @ psd_speech) @ rtf
       rtf = rtf / ||rtf||_2  # this normalization can be skipped
3) rtf = psd_noise @ rtf
4) rtf = rtf / rtf[…, ref_channel, :]
Note: 4) Normalization at the reference channel is not performed here.
 Parameters:
psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)
psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)
mode (str) – one of (“power”, “evd”) “power”: power method “evd”: eigenvalue decomposition
reference_vector (torch.Tensor or int) – (…, C) or scalar
iterations (int) – number of iterations in power method
 Returns:
(…, F, C, 1)
 Return type:
rtf (torch.complex64/ComplexTensor)
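The power-method algorithm listed above can be sketched in NumPy. This is an illustrative version, not the ESPnet code: `rtf_power_method` is a hypothetical name, the reference vector is assumed to be one-hot at `ref_channel`, and the rank-1 speech covariance in the usage part is chosen so that the true RTF is known:

```python
import numpy as np


def rtf_power_method(psd_speech, psd_noise, ref_channel=0, iterations=3):
    """Estimate the RTF as the principal eigenvector of psd_noise^-1 @ psd_speech."""
    phi = np.linalg.solve(psd_noise, psd_speech)             # psd_noise^-1 @ psd_speech
    rtf = np.zeros(psd_speech.shape[:-1] + (1,), dtype=complex)
    rtf[..., ref_channel, 0] = 1.0                           # one-hot start vector
    for _ in range(iterations):
        rtf = phi @ rtf
        rtf = rtf / np.linalg.norm(rtf, axis=-2, keepdims=True)
    return psd_noise @ rtf                                   # (..., F, C, 1), not yet normalized


rng = np.random.default_rng(0)
F, C = 2, 4
d = rng.standard_normal((F, C)) + 1j * rng.standard_normal((F, C))    # true transfer functions
y = rng.standard_normal((F, C, C)) + 1j * rng.standard_normal((F, C, C))
psd_speech = 2.0 * np.einsum("fc,fe->fce", d, d.conj())
psd_noise = y @ y.conj().swapaxes(-1, -2) + np.eye(C)

rtf = rtf_power_method(psd_speech, psd_noise)
# Normalization at the reference channel, done by the caller
rtf_normalized = rtf / rtf[..., 0:1, :]
print(np.allclose(rtf_normalized[..., 0], d / d[:, :1]))
```

For a full-rank speech covariance the estimate is only approximate and improves with more iterations.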

espnet2.enh.layers.beamformer.
get_rtf_matrix
(psd_speeches, psd_noises, diagonal_loading: bool = True, ref_channel: int = 0, rtf_iterations: int = 3, diag_eps: float = 1e-07, eps: float = 1e-08)[source]¶ Calculate the RTF matrix, where each column is the relative transfer function of the corresponding source.

espnet2.enh.layers.beamformer.
get_sdw_mwf_vector
(psd_speech, psd_noise, reference_vector: Union[torch.Tensor, int], denoising_weight: float = 1.0, approx_low_rank_psd_speech: bool = False, iterations: int = 3, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]¶ Return the SDW-MWF (Speech Distortion Weighted Multichannel Wiener Filter) vector
h = (Spsd + mu * Npsd)^-1 @ Spsd @ u
 Reference:
[1] Spatially preprocessed speech distortion weighted multichannel Wiener filtering for noise reduction; A. Spriet et al, 2004 https://dl.acm.org/doi/abs/10.1016/j.sigpro.2004.07.028
[2] Rank-1 constrained multichannel Wiener filter for speech recognition in noisy environments; Z. Wang et al, 2018 https://hal.inria.fr/hal-01634449/document
[3] Low-rank approximation based multichannel Wiener filter algorithms for noise reduction with application in cochlear implants; R. Serizel, 2014 https://ieeexplore.ieee.org/document/6730918
 Parameters:
psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)
psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)
reference_vector (torch.Tensor or int) – (…, C) or scalar
denoising_weight (float) – a trade-off parameter between noise reduction and speech distortion. A larger value leads to more noise reduction at the expense of more speech distortion. The plain MWF is obtained with denoising_weight = 1 (by default).
approx_low_rank_psd_speech (bool) – whether to replace original input psd_speech with its low-rank approximation as in [2]
iterations (int) – number of iterations in power method, only used when approx_low_rank_psd_speech = True
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
 Returns:
(…, F, C)
 Return type:
beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.
gev_phase_correction
(vector)[source]¶ Phase correction to reduce distortions due to phase inconsistencies.
Ported from https://github.com/fgnt/nn-gev/blob/master/fgnt/beamforming.py#L169
 Parameters:
vector – Beamforming vector with shape (…, F, C)
 Returns:
Phase corrected beamforming vectors
 Return type:
w

espnet2.enh.layers.beamformer.
perform_WPD_filtering
(filter_matrix: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], bdelay: int, btaps: int) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Perform WPD filtering.
 Parameters:
filter_matrix – Filter matrix (B, F, (btaps + 1) * C)
Y – Complex STFT signal with shape (B, F, C, T)
 Returns:
(B, F, T)
 Return type:
enhanced (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.
prepare_beamformer_stats
(signal, masks_speech, mask_noise, powers=None, beamformer_type='mvdr', bdelay=3, btaps=5, eps=1e-06)[source]¶ Prepare necessary statistics for constructing the specified beamformer.
 Parameters:
signal (torch.complex64/ComplexTensor) – (…, F, C, T)
masks_speech (List[torch.Tensor]) – (…, F, C, T) masks for all speech sources
mask_noise (torch.Tensor) – (…, F, C, T) noise mask
powers (List[torch.Tensor]) – powers for all speech sources (…, F, T) used for wMPDR or WPD beamformers
beamformer_type (str) – one of the predefined beamformer types
bdelay (int) – delay factor, used for WPD beamformers
btaps (int) – number of filter taps, used for WPD beamformers
eps (torch.Tensor) – tiny constant
 Returns:
 a dictionary containing all necessary statistics
e.g. “psd_n”, “psd_speech”, “psd_distortion”.
Note:
* When masks_speech is a tensor or a single-element list, all returned statistics are tensors;
* When masks_speech is a multi-element list, some returned statistics can be a list, e.g., “psd_n” for MVDR, “psd_speech” and “psd_distortion”.
 Return type:
beamformer_stats (dict)

espnet2.enh.layers.beamformer.
signal_framing
(signal: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], frame_length: int, frame_step: int, bdelay: int, do_padding: bool = False, pad_value: int = 0, indices: List = None) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Expand signal into several frames, with each frame of length frame_length.
 Parameters:
signal – (…, T)
frame_length – length of each segment
frame_step – step for selecting frames
bdelay – delay for WPD
do_padding – whether or not to pad the input signal at the beginning of the time dimension
pad_value – value to fill in the padding
 Returns:
if do_padding: (…, T, frame_length) else: (…, T - bdelay - frame_length + 2, frame_length)
 Return type:
torch.Tensor
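The output shape without padding can be reproduced with a small index table. The layout used here (frame_length - 1 contiguous past samples followed by one sample placed bdelay steps later) is an assumption chosen to be consistent with the documented shape; it is not taken from the ESPnet source, and `frame_indices` is a hypothetical helper:

```python
import numpy as np


def frame_indices(num_samples, frame_length, frame_step, bdelay):
    """Index table: (frame_length - 1) past taps, a gap, then the current sample."""
    taps = frame_length - 1
    last_start = num_samples - taps - bdelay + 1
    return np.array(
        [
            [*range(i, i + taps), i + taps + bdelay - 1]
            for i in range(0, last_start, frame_step)
        ]
    )


signal = np.arange(10)                       # (T,) with T = 10
idx = frame_indices(signal.shape[-1], frame_length=3, frame_step=1, bdelay=2)
frames = signal[..., idx]                    # (T - bdelay - frame_length + 2, frame_length)
print(frames.shape)
```

Padding the signal on the left with bdelay + frame_length - 2 values before framing would bring the number of frames back to T, matching the do_padding case.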

espnet2.enh.layers.beamformer.
tik_reg
(mat, reg: float = 1e-08, eps: float = 1e-08)[source]¶ Perform Tikhonov regularization (only modifying real part).
 Parameters:
mat (torch.complex64/ComplexTensor) – input matrix (…, C, C)
reg (float) – regularization factor
eps (float) –
 Returns:
regularized matrix (…, C, C)
 Return type:
ret (torch.complex64/ComplexTensor)
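One common form of this regularization adds a small real term to the diagonal, scaled by the trace of the matrix. The sketch below follows that form; the trace scaling and the role of `eps` as a floor for all-zero matrices are assumptions, and `tikhonov_regularize` is a hypothetical name rather than the ESPnet function:

```python
import numpy as np


def tikhonov_regularize(mat, reg=1e-8, eps=1e-8):
    """Add (reg * Re(trace(mat)) + eps) to the diagonal of each (C, C) matrix."""
    C = mat.shape[-1]
    trace = np.trace(mat, axis1=-2, axis2=-1).real[..., None, None]
    return mat + (reg * trace + eps) * np.eye(C)


rng = np.random.default_rng(0)
d = rng.standard_normal(4) + 1j * rng.standard_normal(4)
mat = np.outer(d, d.conj())                      # rank-1, hence singular
loaded = tikhonov_regularize(mat, reg=1e-3)

# Only the real part of the diagonal changes, and the matrix becomes invertible
print(np.linalg.matrix_rank(mat), np.linalg.matrix_rank(loaded))
```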
espnet2.enh.layers.dnsmos¶
espnet2.enh.layers.dnn_beamformer¶
DNN beamformer module.

class
espnet2.enh.layers.dnn_beamformer.
AttentionReference
(bidim, att_dim, eps=1e-06)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(psd_in: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor, scaling: float = 2.0) → Tuple[torch.Tensor, torch.LongTensor][source]¶ Attention-based reference forward function.
 Parameters:
psd_in (torch.complex64/ComplexTensor) – (B, F, C, C)
ilens (torch.Tensor) – (B,)
scaling (float) –
 Returns:
(B, C) ilens (torch.Tensor): (B,)
 Return type:
u (torch.Tensor)


class
espnet2.enh.layers.dnn_beamformer.
DNN_Beamformer
(bidim, btype: str = 'blstmp', blayers: int = 3, bunits: int = 300, bprojs: int = 320, num_spk: int = 1, use_noise_mask: bool = True, nonlinear: str = 'sigmoid', dropout_rate: float = 0.0, badim: int = 320, ref_channel: int = -1, beamformer_type: str = 'mvdr_souden', rtf_iterations: int = 2, mwf_mu: float = 1.0, eps: float = 1e-06, diagonal_loading: bool = True, diag_eps: float = 1e-07, mask_flooring: bool = False, flooring_thres: float = 1e-06, use_torch_solver: bool = True, use_torchaudio_api: bool = False, btaps: int = 5, bdelay: int = 3)[source]¶ Bases:
torch.nn.modules.module.Module
DNN mask based Beamformer.
 Citation:
Multichannel End-to-end Speech Recognition; T. Ochiai et al., 2017; http://proceedings.mlr.press/v70/ochiai17a/ochiai17a.pdf

apply_beamforming
(data, ilens, psd_n, psd_speech, psd_distortion=None, rtf_mat=None, spk=0)[source]¶ Beamforming with the provided statistics.
 Parameters:
data (torch.complex64/ComplexTensor) – (B, F, C, T)
ilens (torch.Tensor) – (B,)
psd_n (torch.complex64/ComplexTensor) – Noise covariance matrix for MVDR (B, F, C, C) Observation covariance matrix for MPDR/wMPDR (B, F, C, C) Stacked observation covariance for WPD (B,F,(btaps+1)*C,(btaps+1)*C)
psd_speech (torch.complex64/ComplexTensor) – Speech covariance matrix (B, F, C, C)
psd_distortion (torch.complex64/ComplexTensor) – Noise covariance matrix (B, F, C, C)
rtf_mat (torch.complex64/ComplexTensor) – RTF matrix (B, F, C, num_spk)
spk (int) – speaker index
 Returns:
(B, F, T) ws (torch.complex64/ComplexTensor): (B, F) or (B, F, (btaps+1)*C)
 Return type:
enhanced (torch.complex64/ComplexTensor)

forward
(data: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor, powers: Optional[List[torch.Tensor]] = None, oracle_masks: Optional[List[torch.Tensor]] = None) → Tuple[Union[torch.Tensor, torch_complex.tensor.ComplexTensor], torch.LongTensor, torch.Tensor][source]¶ DNN_Beamformer forward function.
 Notation:
B: Batch C: Channel T: Time or Sequence length F: Freq
 Parameters:
data (torch.complex64/ComplexTensor) – (B, T, C, F)
ilens (torch.Tensor) – (B,)
powers (List[torch.Tensor] or None) – used for wMPDR or WPD (B, F, T)
oracle_masks (List[torch.Tensor] or None) – oracle masks (B, F, C, T) if not None, oracle_masks will be used instead of self.mask
 Returns:
(B, T, F) ilens (torch.Tensor): (B,) masks (torch.Tensor): (B, T, C, F)
 Return type:
enhanced (torch.complex64/ComplexTensor)
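forward takes data as (B, T, C, F), whereas apply_beamforming and the functions in espnet2.enh.layers.beamformer are documented with (B, F, C, T). Converting between the two layouts is a single axis permutation; the NumPy sketch below only illustrates the shapes (the sizes are arbitrary) and does not call any ESPnet code:

```python
import numpy as np

B, T, C, F = 2, 50, 4, 129
data = np.zeros((B, T, C, F), dtype=complex)      # layout taken by forward()
data_bfct = np.transpose(data, (0, 3, 2, 1))      # layout taken by apply_beamforming()
restored = np.transpose(data_bfct, (0, 3, 2, 1))  # the same permutation undoes itself
print(data_bfct.shape, restored.shape)            # (2, 129, 4, 50) (2, 50, 4, 129)
```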

predict_mask
(data: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor) → Tuple[Tuple[torch.Tensor, ...], torch.LongTensor][source]¶ Predict masks for beamforming.
 Parameters:
data (torch.complex64/ComplexTensor) – (B, T, C, F), double precision
ilens (torch.Tensor) – (B,)
 Returns:
(B, T, C, F) ilens (torch.Tensor): (B,)
 Return type:
masks (torch.Tensor)
espnet2.enh.layers.ncsnpp¶

class
espnet2.enh.layers.ncsnpp.
NCSNpp
(scale_by_sigma=True, nonlinearity='swish', nf=128, ch_mult=(1, 1, 2, 2, 2, 2, 2), num_res_blocks=2, attn_resolutions=(16,), resamp_with_conv=True, conditional=True, fir=True, fir_kernel=[1, 3, 3, 1], skip_rescale=True, resblock_type='biggan', progressive='output_skip', progressive_input='input_skip', progressive_combine='sum', init_scale=0.0, fourier_scale=16, image_size=256, embedding_type='fourier', dropout=0.0, centered=True, **unused_kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
NCSN++ model, adapted from the https://github.com/yang-song/score_sde and https://github.com/sp-uhh/sgmse repositories.

forward
(x, time_cond)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.enh.layers.fasnet¶

class
espnet2.enh.layers.fasnet.
BF_module
(input_dim, feature_dim, hidden_dim, output_dim, num_spk=2, layer=4, segment_size=100, bidirectional=True, dropout=0.0, fasnet_type='ifasnet')[source]¶ Bases:
torch.nn.modules.module.Module

forward
(input, num_mic)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.fasnet.
FaSNet_base
(enc_dim, feature_dim, hidden_dim, layer, segment_size=24, nspk=2, win_len=16, context_len=16, dropout=0.0, sr=16000)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(input, num_mic)[source]¶ abstract forward function
input: shape (batch, max_num_ch, T) num_mic: shape (batch, ), the number of channels for each input.
Zero for fixed geometry configuration.

seg_signal_context
(x, window, context)[source]¶ Segmenting the signal into chunks with specific context.
 input:
x: size (B, ch, T) window: int context: int

espnet2.enh.layers.ifasnet¶
espnet2.enh.layers.mask_estimator¶

class
espnet2.enh.layers.mask_estimator.
MaskEstimator
(type, idim, layers, units, projs, dropout, nmask=1, nonlinear='sigmoid')[source]¶ Bases:
torch.nn.modules.module.Module

forward
(xs: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor) → Tuple[Tuple[torch.Tensor, ...], torch.LongTensor][source]¶ Mask estimator forward function.
 Parameters:
xs – (B, F, C, T)
ilens – (B,)
 Returns:
The hidden vector (B, F, C, T) masks: A tuple of the masks. (B, F, C, T) ilens: (B,)
 Return type:
hs (torch.Tensor)

espnet2.enh.layers.uses¶

class
espnet2.enh.layers.uses.
ATFBlock
(input_size, rnn_type='lstm', hidden_size=128, att_heads=4, dropout=0.0, activation='relu', bidirectional=True, norm_type='cLN', ch_mode='att', ch_att_dim=256, eps=1e-05, with_channel_modeling=True)[source]¶ Bases:
torch.nn.modules.module.Module
Container module for a single Attentive Time-Frequency Block.
 Parameters:
input_size (int) – dimension of the input feature.
rnn_type (str) – type of the RNN cell in the improved Transformer layer.
hidden_size (int) – hidden dimension of the RNN cell.
att_heads (int) – number of attention heads in Transformer.
dropout (float) – dropout ratio. Default is 0.
activation (str) – nonlinear activation function applied in each block.
bidirectional (bool) – whether the RNN layers are bidirectional.
norm_type (str) – normalization type in the improved Transformer layer.
ch_mode (str) – mode of channel modeling. Select from “att” and “tac”.
ch_att_dim (int) – dimension of the channel attention.
eps (float) – epsilon for layer normalization.
with_channel_modeling (bool) – whether to use channel modeling.

forward
(input, ref_channel=None)[source]¶ Forward.
 Parameters:
input (torch.Tensor) – feature sequence (batch, C, N, freq, time)
ref_channel (None or int) – index of the reference channel. if None, simply average all channels. if int, take the specified channel instead of averaging.
 Returns:
output sequence (batch, C, N, freq, time)
 Return type:
output (torch.Tensor)

class
espnet2.enh.layers.uses.
ChannelAttention
(input_dim, att_heads=4, att_dim=256, activation='relu', eps=1e-05)[source]¶ Bases:
torch.nn.modules.module.Module
Channel Attention module.
 Parameters:
input_dim (int) – dimension of the input feature.
att_heads (int) – number of attention heads in self-attention.
att_dim (int) – projection dimension for query and key before self-attention.
activation (str) – nonlinear activation function.
eps (float) – epsilon for layer normalization.

class
espnet2.enh.layers.uses.
ChannelTAC
(input_dim, eps=1e-05)[source]¶ Bases:
torch.nn.modules.module.Module
Channel Transform-Average-Concatenate (TAC) module.
 Parameters:
input_dim (int) – dimension of the input feature.
eps (float) – epsilon for layer normalization.

class
espnet2.enh.layers.uses.
LayerNormalization
(input_dim, dim=1, total_dim=4, eps=1e-05)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.uses.
USES
(input_size, output_size, bottleneck_size=64, num_blocks=6, num_spatial_blocks=3, segment_size=64, memory_size=20, memory_types=1, rnn_type='lstm', hidden_size=128, att_heads=4, dropout=0.0, activation='relu', bidirectional=True, norm_type='cLN', ch_mode='att', ch_att_dim=256, eps=1e-05)[source]¶ Bases:
torch.nn.modules.module.Module
Unconstrained Speech Enhancement and Separation (USES) Network.
 Reference:
[1] W. Zhang, K. Saijo, Z.-Q. Wang, S. Watanabe, and Y. Qian, “Toward Universal Speech Enhancement for Diverse Input Conditions,” in Proc. ASRU, 2023.
 Parameters:
input_size (int) – dimension of the input feature.
output_size (int) – dimension of the output.
bottleneck_size (int) – dimension of the bottleneck feature. Must be a multiple of att_heads.
num_blocks (int) – number of processing blocks.
num_spatial_blocks (int) – number of processing blocks with channel modeling.
segment_size (int) – number of frames in each non-overlapping segment. This is used to segment long utterances into smaller segments for efficient processing.
memory_size (int) – group size of global memory tokens. The basic use of memory tokens is to store the history information from previous segments. The memory tokens are updated by the output of the last block after processing each segment.
memory_types (int) – number of memory token groups. Each group corresponds to a different type of processing, i.e., the first group is used for denoising without dereverberation, the second group is used for denoising with dereverberation.
rnn_type (str) – type of the RNN cell in the improved Transformer layer.
hidden_size (int) – hidden dimension of the RNN cell.
att_heads (int) – number of attention heads in Transformer.
dropout (float) – dropout ratio. Default is 0.
activation (str) – nonlinear activation function applied in each block.
bidirectional (bool) – whether the RNN layers are bidirectional.
norm_type (str) – normalization type in the improved Transformer layer.
ch_mode (str) – mode of channel modeling. Select from “att” and “tac”.
ch_att_dim (int) – dimension of the channel attention.
eps (float) – epsilon for layer normalization.

forward
(input, ref_channel=None, mem_idx=None)[source]¶ USES forward.
 Parameters:
input (torch.Tensor) – input feature (batch, mics, input_size, freq, time)
ref_channel (None or int) – index of the reference channel. if None, simply average all channels. if int, take the specified channel instead of averaging.
mem_idx (None or int) – index of the memory token group. if None, use the only group of memory tokens in the model. if int, use the specified group from multiple existing groups.
 Returns:
output feature (batch, output_size, freq, time)
 Return type:
output (torch.Tensor)
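The constraints stated in the USES parameter list can be checked before a model is built. The following is a minimal sketch in plain Python and is not part of ESPnet: the helper names `check_uses_config` and `num_segments` are hypothetical, the range check on `mem_idx` is an assumption, and the segment count assumes the final, shorter segment is padded up to `segment_size`, which the text above does not state.

```python
import math


def check_uses_config(bottleneck_size=64, att_heads=4, memory_types=1, mem_idx=None):
    """Hypothetical helper: validate constraints stated in the USES docs."""
    # Documented: bottleneck_size must be a multiple of att_heads.
    if bottleneck_size % att_heads != 0:
        raise ValueError("bottleneck_size must be a multiple of att_heads")
    # Documented: mem_idx selects one of the memory token groups, and None
    # means the model has a single group. The range check is an assumption.
    if mem_idx is not None and not 0 <= mem_idx < memory_types:
        raise ValueError("mem_idx must index an existing memory group")
    return True


def num_segments(num_frames, segment_size=64):
    """Hypothetical helper: segments needed to cover num_frames frames.

    Assumes the last, shorter segment is padded to full length.
    """
    return math.ceil(num_frames / segment_size)
```

With the default `segment_size=64`, an utterance of 130 frames would need three segments under this padding assumption.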
espnet2.enh.layers.dpmulcat¶

class
espnet2.enh.layers.dpmulcat.
DPMulCat
(input_size: int, hidden_size: int, output_size: int, num_spk: int, dropout: float = 0.0, num_layers: int = 4, bidirectional: bool = True, input_normalize: bool = False)[source]¶ Bases:
torch.nn.modules.module.Module
Dual-path RNN module with MulCat blocks.
 Parameters:
input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).
hidden_size – int, dimension of the hidden state.
output_size – int, dimension of the output size.
num_spk – int, the number of speakers in the output.
dropout – float, the dropout rate in the LSTM layer. (Default: 0.0)
bidirectional – bool, whether the RNN layers are bidirectional. (Default: True)
num_layers – int, number of stacked MulCat blocks. (Default: 4)
input_normalize – bool, whether to apply GroupNorm on the input Tensor. (Default: False)

forward
(input)[source]¶ Compute output after DPMulCat module.
 Parameters:
input (torch.Tensor) – The input feature. Tensor of shape (batch, N, dim1, dim2) Apply RNN on dim1 first and then dim2
 Returns:
 list(torch.Tensor) or list(list(torch.Tensor))
In training mode, the module returns output of each DPMulCat block. In eval mode, the module only returns output in the last block.

class
espnet2.enh.layers.dpmulcat.
MulCatBlock
(input_size: int, hidden_size: int, dropout: float = 0.0, bidirectional: bool = True)[source]¶ Bases:
torch.nn.modules.module.Module
The MulCat block.
 Parameters:
input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).
hidden_size – int, dimension of the hidden state.
dropout – float, the dropout rate in the LSTM layer. (Default: 0.0)
bidirectional – bool, whether the RNN layers are bidirectional. (Default: True)
espnet2.enh.layers.skim¶

class
espnet2.enh.layers.skim.
MemLSTM
(hidden_size, dropout=0.0, bidirectional=False, mem_type='hc', norm_type='cLN')[source]¶ Bases:
torch.nn.modules.module.Module
the MemLSTM of SkiM
 Parameters:
hidden_size – int, dimension of the hidden state.
dropout – float, dropout ratio. Default is 0.
bidirectional – bool, whether the LSTM layers are bidirectional. Default is False.
mem_type – ‘hc’, ‘h’, ‘c’ or ‘id’. It controls whether the hidden (or cell) state of SegLSTM will be processed by MemLSTM. In ‘id’ mode, both the hidden and cell states will be identically returned.
norm_type – gLN, cLN. cLN is for causal implementation.

extra_repr
() → str[source]¶ Set the extra representation of the module
To print customized extra information, you should reimplement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward
(hc, S)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class
espnet2.enh.layers.skim.
SegLSTM
(input_size, hidden_size, dropout=0.0, bidirectional=False, norm_type='cLN')[source]¶ Bases:
torch.nn.modules.module.Module
the SegLSTM of SkiM
 Parameters:
input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).
hidden_size – int, dimension of the hidden state.
dropout – float, dropout ratio. Default is 0.
bidirectional – bool, whether the LSTM layers are bidirectional. Default is False.
norm_type – gLN, cLN. cLN is for causal implementation.

forward
(input, hc)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class
espnet2.enh.layers.skim.
SkiM
(input_size, hidden_size, output_size, dropout=0.0, num_blocks=2, segment_size=20, bidirectional=True, mem_type='hc', norm_type='gLN', seg_overlap=False)[source]¶ Bases:
torch.nn.modules.module.Module
Skipping Memory Net
 Parameters:
input_size – int, dimension of the input feature. Input shape should be (batch, length, input_size).
hidden_size – int, dimension of the hidden state.
output_size – int, dimension of the output size.
dropout – float, dropout ratio. Default is 0.
num_blocks – number of basic SkiM blocks
segment_size – segmentation size for splitting long features
bidirectional – bool, whether the RNN layers are bidirectional.
mem_type – ‘hc’, ‘h’, ‘c’, ‘id’ or None. It controls whether the hidden (or cell) state of SegLSTM will be processed by MemLSTM. In ‘id’ mode, both the hidden and cell states will be identically returned. When mem_type is None, the MemLSTM will be removed.
norm_type – gLN, cLN. cLN is for causal implementation.
seg_overlap – bool, whether the segmentation will reserve 50% overlap for adjacent segments. Default is False.
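The roles of segment_size and seg_overlap can be illustrated on a plain Python list. This is a sketch under stated assumptions, not ESPnet's code: the function name `split_into_segments` is hypothetical, and padding the tail with `pad_value` is an assumption of this example.

```python
def split_into_segments(frames, segment_size=20, seg_overlap=False, pad_value=0):
    """Hypothetical helper: cut a list of frames into fixed-size segments.

    With seg_overlap=True, adjacent segments share half of their frames
    (hop = segment_size // 2); otherwise segments do not overlap.
    The tail is padded with pad_value (an assumption of this sketch).
    """
    hop = segment_size // 2 if seg_overlap else segment_size
    segments = []
    start = 0
    while True:
        chunk = list(frames[start:start + segment_size])
        chunk += [pad_value] * (segment_size - len(chunk))
        segments.append(chunk)
        # Stop once the current segment reaches the end of the sequence.
        if start + segment_size >= len(frames):
            break
        start += hop
    return segments
```

For ten frames and `segment_size=4`, the non-overlapping split yields three segments (the last one padded), while the 50% overlap split yields four segments starting at frames 0, 2, 4 and 6.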

forward
(input)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.enh.layers.conv_utils¶

espnet2.enh.layers.conv_utils.
conv2d_output_shape
(h_w, kernel_size=1, stride=1, pad=0, dilation=1)[source]¶
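The entry above carries no description. For orientation, the standard output-size formula for a 2D convolution (as given in the PyTorch `Conv2d` documentation) is sketched below in plain Python. The function `conv2d_out_hw` is a hypothetical stand-in written for illustration. Whether ESPnet's `conv2d_output_shape` treats tuple arguments and rounding in exactly this way is an assumption.

```python
def _pair(v):
    """Expand an int to an (h, w) pair; pass tuples through."""
    return (v, v) if isinstance(v, int) else tuple(v)


def conv2d_out_hw(h_w, kernel_size=1, stride=1, pad=0, dilation=1):
    """Hypothetical helper: output (height, width) of a 2D convolution."""
    h_w, k, s, p, d = map(_pair, (h_w, kernel_size, stride, pad, dilation))
    # out = floor((in + 2*pad - dilation*(kernel - 1) - 1) / stride) + 1
    return tuple(
        (h_w[i] + 2 * p[i] - d[i] * (k[i] - 1) - 1) // s[i] + 1 for i in range(2)
    )
```

For example, a 32 x 32 input with a 3 x 3 kernel, stride 2 and padding 1 gives a 16 x 16 output under this formula.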
espnet2.enh.layers.dcunet¶

class
espnet2.enh.layers.dcunet.
ArgsComplexMultiplicationWrapper
(module_cls, *args, **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
Adapted from asteroid’s complex_nn.py, allowing
args/kwargs to be passed through forward().
Make a complex-valued module F from a real-valued module f by applying complex multiplication rules:
F(a + i b) = f1(a) - f1(b) + i (f2(b) + f2(a))
where f1, f2 are instances of f that do not share weights.
 Parameters:
module_cls (callable) – A class or function that returns a Torch module/functional. Constructor of f in the formula above. Called 2x with *args, **kwargs, to construct the real and imaginary component modules.

forward
(x, *args, **kwargs)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class
espnet2.enh.layers.dcunet.
BatchNorm
(num_features: int, eps: float = 1e-05, momentum: float = 0.1, affine: bool = True, track_running_stats: bool = True, device=None, dtype=None)[source]¶ Bases:
torch.nn.modules.batchnorm._BatchNorm

class
espnet2.enh.layers.dcunet.
ComplexBatchNorm
(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)[source]¶ Bases:
torch.nn.modules.module.Module

extra_repr
()[source]¶ Set the extra representation of the module
To print customized extra information, you should reimplement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.dcunet.
ComplexLinear
(input_dim, output_dim, complex_valued)[source]¶ Bases:
torch.nn.modules.module.Module
A potentially complex-valued linear layer. Reduces to a regular linear
layer if complex_valued=False.

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.dcunet.
DCUNet
(dcunet_architecture: str = 'DilDCUNetv2', dcunet_time_embedding: str = 'gfp', dcunet_temb_layers_global: int = 2, dcunet_temb_layers_local: int = 1, dcunet_temb_activation: str = 'silu', dcunet_time_embedding_complex: bool = False, dcunet_fix_length: str = 'pad', dcunet_mask_bound: str = 'none', dcunet_norm_type: str = 'bN', dcunet_activation: str = 'relu', embed_dim: int = 128, **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(spec, t) → torch.Tensor[source]¶ Input shape is expected to be $(batch, nfreqs, time)$, with $nfreqs - 1$
divisible by $f_0 * f_1 * … * f_N$ where $f_k$ are the frequency strides of the encoders, and $time - 1$ divisible by $t_0 * t_1 * … * t_N$ where $t_k$ are the time strides of the encoders.
 Parameters:
spec (Tensor) – complex spectrogram tensor. 1D, 2D or 3D tensor, time last.
 Returns:
Tensor, of shape (batch, time) or (time).
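The shape requirement for DCUNet.forward says that the number of frequency bins minus one, and likewise the number of frames minus one, must be divisible by the product of the corresponding encoder strides. A minimal plain-Python sketch of that check follows. The helper names and the stride values used in the example are hypothetical, and rounding a length up to the next valid size is only one possible way to satisfy the condition.

```python
import math


def is_valid_length(n, strides):
    """True if n - 1 is divisible by the product of the encoder strides."""
    return (n - 1) % math.prod(strides) == 0


def padded_length(n, strides):
    """Smallest length >= n that satisfies the divisibility condition."""
    p = math.prod(strides)
    # (-(n - 1)) % p is the number of extra samples needed to reach the
    # next multiple of p, counted from n - 1.
    return n + (-(n - 1)) % p
```

With three encoders of stride 2 (product 8), 257 frequency bins are already valid, whereas 100 frames would have to grow to 105.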


class
espnet2.enh.layers.dcunet.
DCUNetComplexDecoderBlock
(in_chan, out_chan, kernel_size, stride, padding, dilation, output_padding=(0, 0), norm_type='bN', activation='leaky_relu', embed_dim=None, temb_layers=1, temb_activation='swish', complex_time_embedding=False)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x, t_embed, output_size=None)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.dcunet.
DCUNetComplexEncoderBlock
(in_chan, out_chan, kernel_size, stride, padding, dilation, norm_type='bN', activation='leaky_relu', embed_dim=None, complex_time_embedding=False, temb_layers=1, temb_activation='silu')[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x, t_embed)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.dcunet.
DiffusionStepEmbedding
(embed_dim, complex_valued=False)[source]¶ Bases:
torch.nn.modules.module.Module
Diffusion-Step embedding as in DiffWave / Vaswani et al. 2017.

forward
(t)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
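A sinusoidal step embedding in the style of Vaswani et al. can be sketched in plain Python as below. This is an illustration only: the function name `sinusoidal_embedding`, the `max_period` value of 10000 and the layout (all sines followed by all cosines) are assumptions, and the exact frequency schedule used by DiffusionStepEmbedding is not stated in this page.

```python
import math


def sinusoidal_embedding(t, embed_dim=128, max_period=10000.0):
    """Hypothetical sketch of a sinusoidal step embedding for a scalar t.

    The first half holds sines and the second half cosines, at
    geometrically spaced frequencies from 1 down to 1 / max_period.
    """
    half = embed_dim // 2
    freqs = [max_period ** (-i / max(half - 1, 1)) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]
```

At `t = 0` every sine entry is 0 and every cosine entry is 1, so the embedding of the first step is a fixed, recognizable vector.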


class
espnet2.enh.layers.dcunet.
FeatureMapDense
(input_dim, output_dim, complex_valued=False)[source]¶ Bases:
torch.nn.modules.module.Module
A fully connected layer that reshapes outputs to feature maps.

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.dcunet.
GaussianFourierProjection
(embed_dim, scale=16, complex_valued=False)[source]¶ Bases:
torch.nn.modules.module.Module
Gaussian random features for encoding time steps.

forward
(t)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
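Gaussian random Fourier features for a scalar time step can be sketched without any deep-learning dependency. The class below is a hypothetical illustration: the name `GaussianFourierFeatures`, the `seed` argument, the factor of 2π and the sine-then-cosine layout are assumptions that follow a common convention for score-based models, not a statement of what ESPnet's layer does internally.

```python
import math
import random


class GaussianFourierFeatures:
    """Hypothetical plain-Python sketch of Gaussian random Fourier features."""

    def __init__(self, embed_dim=128, scale=16, seed=0):
        rng = random.Random(seed)
        # Fixed (non-trainable) random frequencies drawn from N(0, scale^2).
        self.w = [rng.gauss(0.0, 1.0) * scale for _ in range(embed_dim // 2)]

    def __call__(self, t):
        # Project the scalar time step onto each frequency, then take sin/cos.
        proj = [2.0 * math.pi * w * t for w in self.w]
        return [math.sin(p) for p in proj] + [math.cos(p) for p in proj]
```

Because the frequencies are drawn once and then frozen, the same time step always maps to the same embedding, and a larger `scale` spreads the frequencies over a wider range.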


class
espnet2.enh.layers.dcunet.
OnReIm
(module_cls, *args, **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


espnet2.enh.layers.dcunet.
unet_decoder_args
(encoders, *, skip_connections)[source]¶ Get list of decoder arguments for upsampling (right) side of a symmetric unet, given the arguments used to construct the encoder.
 Parameters:
encoders (tuple of length N of tuples of (in_chan, out_chan, kernel_size, stride, padding)) – List of arguments used to construct the encoders.
skip_connections (bool) – Whether to include skip connections in the calculation of decoder input channels.
 Returns:
 tuple of length N of tuples of
(in_chan, out_chan, kernel_size, stride, padding): Arguments to be used to construct decoders
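One plausible way to derive decoder arguments from encoder arguments in a symmetric U-Net is sketched below. It is not taken from the ESPnet source: the name `sketch_unet_decoder_args` is hypothetical, and the sketch assumes that skip connections are concatenated along the channel axis and that the deepest decoder receives no skip input.

```python
def sketch_unet_decoder_args(encoders, *, skip_connections):
    """Hypothetical sketch: mirror encoder arguments to get decoder arguments.

    Each encoder entry is (in_chan, out_chan, kernel_size, stride, padding).
    """
    decoder_args = []
    for in_chan, out_chan, kernel_size, stride, padding in reversed(encoders):
        # The deepest decoder reads only the bottleneck output; later decoders
        # also receive a skip connection carrying out_chan extra channels
        # (assumption: skips are concatenated along the channel axis).
        skip_chan = out_chan if (skip_connections and decoder_args) else 0
        decoder_args.append(
            (out_chan + skip_chan, in_chan, kernel_size, stride, padding)
        )
    return tuple(decoder_args)
```

For two encoders `(1, 32, 3, 2, 1)` and `(32, 64, 3, 2, 1)`, the sketch returns decoders with input channels 64 and 64 when skip connections are enabled, and 64 and 32 when they are not.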
espnet2.enh.layers.bsrnn¶

class
espnet2.enh.layers.bsrnn.
BSRNN
(input_dim=481, num_channel=16, num_layer=6, target_fs=48000, causal=True)[source]¶ Bases:
torch.nn.modules.module.Module
Band-Split RNN (BSRNN).
References
[1] J. Yu, H. Chen, Y. Luo, R. Gu, and C. Weng, “High fidelity speech enhancement with band-split RNN,” in Proc. ISCA Interspeech, 2023. https://isca-speech.org/archive/interspeech_2023/yu23b_interspeech.html
[2] J. Yu and Y. Luo, “Efficient monaural speech enhancement with universal sample rate band-split RNN,” in Proc. ICASSP, 2023. https://ieeexplore.ieee.org/document/10096020
 Parameters:
input_dim (int) – maximum number of frequency bins corresponding to target_fs
num_channel (int) – embedding dimension of each time-frequency bin
num_layer (int) – number of time and frequency RNN layers
target_fs (int) – maximum sampling frequency supported by the model
causal (bool) – whether or not to adopt causal processing. If True, LSTM will be used instead of BLSTM for time modeling.

forward
(x, fs=None)[source]¶ BSRNN forward.
 Parameters:
x (torch.Tensor) – input tensor of shape (B, T, F, 2)
fs (int, optional) – sampling rate of the input signal. if not None, the input signal will be truncated to only process the effective frequency subbands. if None, the input signal is assumed to be already truncated to only contain effective frequency subbands.
 Returns:
output tensor of shape (B, T, F, 2)
 Return type:
out (torch.Tensor)
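The effect of the fs argument can be pictured with a small calculation. The sketch below is an assumption-laden illustration, not ESPnet's logic: the helper `effective_num_bins` is hypothetical, and it assumes that the input_dim bins are evenly spaced between 0 Hz and target_fs / 2, so that an input with a lower sampling rate only occupies the bins up to its own Nyquist frequency.

```python
def effective_num_bins(fs, input_dim=481, target_fs=48000):
    """Hypothetical helper: frequency bins to keep for an input sampled at fs.

    Assumes bins are evenly spaced from 0 Hz to target_fs / 2, so the bin
    spacing is (target_fs / 2) / (input_dim - 1) Hz.
    """
    if fs > target_fs:
        raise ValueError("fs must not exceed target_fs")
    return (input_dim - 1) * fs // target_fs + 1
```

Under this assumption the default configuration has a 50 Hz bin spacing, and a 16 kHz input would use 161 of the 481 bins.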

class
espnet2.enh.layers.bsrnn.
BandSplit
(input_dim, target_fs=48000, channels=128)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x, fs=None)[source]¶ BandSplit forward.
 Parameters:
x (torch.Tensor) – input tensor of shape (B, T, F, 2)
fs (int, optional) – sampling rate of the input signal. if not None, the input signal will be truncated to only process the effective frequency subbands. if None, the input signal is assumed to be already truncated to only contain effective frequency subbands.
 Returns:
 output tensor of shape (B, N, T, K’)
K’ might be smaller than len(self.subbands) if fs < self.target_fs.
 Return type:
z (torch.Tensor)

espnet2.enh.layers.ncsnpp_utils.normalization¶
Normalization layers.

class
espnet2.enh.layers.ncsnpp_utils.normalization.
ConditionalBatchNorm2d
(num_features, num_classes, bias=True)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x, y)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.normalization.
ConditionalInstanceNorm2d
(num_features, num_classes, bias=True)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x, y)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.normalization.
ConditionalInstanceNorm2dPlus
(num_features, num_classes, bias=True)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x, y)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.normalization.
ConditionalNoneNorm2d
(num_features, num_classes, bias=True)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x, y)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.normalization.
ConditionalVarianceNorm2d
(num_features, num_classes, bias=False)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x, y)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.normalization.
InstanceNorm2dPlus
(num_features, bias=True)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.normalization.
NoneNorm2d
(num_features, bias=True)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.normalization.
VarianceNorm2d
(num_features, bias=False)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.enh.layers.ncsnpp_utils.__init__¶
espnet2.enh.layers.ncsnpp_utils.up_or_down_sampling¶
Layers used for upsampling or downsampling images.
Many functions are ported from https://github.com/NVlabs/stylegan2.

class
espnet2.enh.layers.ncsnpp_utils.up_or_down_sampling.
Conv2d
(in_ch, out_ch, kernel, up=False, down=False, resample_kernel=(1, 3, 3, 1), use_bias=True, kernel_init=None)[source]¶ Bases:
torch.nn.modules.module.Module
Conv2d layer with optimal upsampling and downsampling (StyleGAN2).

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


espnet2.enh.layers.ncsnpp_utils.up_or_down_sampling.
conv_downsample_2d
(x, w, k=None, factor=2, gain=1)[source]¶ Fused tf.nn.conv2d() followed by downsample_2d().
Padding is performed only once at the beginning, not between the operations. The fused op is considerably more efficient than performing the same calculation using standard TensorFlow ops. It supports gradients of arbitrary order.
 Parameters:
x – Input tensor of the shape [N, C, H, W] or [N, H, W, C].
w – Weight tensor of the shape [filterH, filterW, inChannels, outChannels]. Grouped convolution can be performed by inChannels = x.shape[0] // numGroups.
k – FIR filter of the shape [firH, firW] or [firN] (separable). The default is [1] * factor, which corresponds to average pooling.
factor – Integer downsampling factor (default: 2).
gain – Scaling factor for signal magnitude (default: 1.0).
 Returns:
Tensor of the shape [N, C, H // factor, W // factor] or [N, H // factor, W // factor, C], and same datatype as x.

espnet2.enh.layers.ncsnpp_utils.up_or_down_sampling.
downsample_2d
(x, k=None, factor=2, gain=1)[source]¶ Downsample a batch of 2D images with the given filter.
Accepts a batch of 2D images of the shape [N, C, H, W] or [N, H, W, C] and downsamples each image with the given filter. The filter is normalized so that if the input pixels are constant, they will be scaled by the specified gain. Pixels outside the image are assumed to be zero, and the filter is padded with zeros so that its shape is a multiple of the downsampling factor.
 Parameters:
x – Input tensor of the shape [N, C, H, W] or [N, H, W, C].
k – FIR filter of the shape [firH, firW] or [firN] (separable). The default is [1] * factor, which corresponds to average pooling.
factor – Integer downsampling factor (default: 2).
gain – Scaling factor for signal magnitude (default: 1.0).
 Returns:
Tensor of the shape [N, C, H // factor, W // factor]
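The statement that the default filter [1] * factor corresponds to average pooling can be checked on a single 2D image held as nested lists. The function below is a hypothetical plain-Python sketch, not the library routine. It handles one image without batch or channel axes and assumes the height and width are multiples of factor.

```python
def average_pool_2d(image, factor=2, gain=1.0):
    """Hypothetical sketch: downsample a 2D list by block averaging.

    Equivalent to filtering with the normalized separable kernel
    [1] * factor and keeping every factor-th sample.
    """
    h, w = len(image) // factor, len(image[0]) // factor
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            # Collect the factor x factor block that maps to output (i, j).
            block = [
                image[i * factor + di][j * factor + dj]
                for di in range(factor)
                for dj in range(factor)
            ]
            row.append(gain * sum(block) / len(block))
        out.append(row)
    return out
```

A constant image stays constant after downsampling and is scaled by `gain`, which matches the normalization described above.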

espnet2.enh.layers.ncsnpp_utils.up_or_down_sampling.
get_weight
(module, shape, weight_var='weight', kernel_init=None)[source]¶ Get/create weight tensor for a convolution or fully-connected layer.

espnet2.enh.layers.ncsnpp_utils.up_or_down_sampling.
upsample_2d
(x, k=None, factor=2, gain=1)[source]¶ Upsample a batch of 2D images with the given filter.
Accepts a batch of 2D images of the shape [N, C, H, W] or [N, H, W, C] and upsamples each image with the given filter. The filter is normalized so that if the input pixels are constant, they will be scaled by the specified gain. Pixels outside the image are assumed to be zero, and the filter is padded with zeros so that its shape is a multiple of the upsampling factor.
 Parameters:
x – Input tensor of the shape [N, C, H, W] or [N, H, W, C].
k – FIR filter of the shape [firH, firW] or [firN] (separable). The default is [1] * factor, which corresponds to nearest-neighbor upsampling.
factor – Integer upsampling factor (default: 2).
gain – Scaling factor for signal magnitude (default: 1.0).
 Returns:
Tensor of the shape [N, C, H * factor, W * factor]
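Nearest-neighbor upsampling, which the default filter [1] * factor is said to correspond to, can be sketched for one 2D image held as nested lists. The function name `nearest_upsample_2d` is hypothetical, and the sketch ignores batch and channel axes.

```python
def nearest_upsample_2d(image, factor=2, gain=1.0):
    """Hypothetical sketch: repeat every pixel factor times along each axis."""
    out = []
    for row in image:
        # Repeat each pixel horizontally, then repeat the widened row vertically.
        wide = [gain * px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out
```

Each input pixel becomes a factor x factor block, so a constant image remains constant (scaled by `gain`) after upsampling.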

espnet2.enh.layers.ncsnpp_utils.up_or_down_sampling.
upsample_conv_2d
(x, w, k=None, factor=2, gain=1)[source]¶ Fused upsample_2d() followed by tf.nn.conv2d().
Padding is performed only once at the beginning, not between the operations. The fused op is considerably more efficient than performing the same calculation using standard TensorFlow ops. It supports gradients of arbitrary order.
 Parameters:
x – Input tensor of the shape [N, C, H, W] or [N, H, W, C].
w – Weight tensor of the shape [filterH, filterW, inChannels, outChannels]. Grouped convolution can be performed by inChannels = x.shape[0] // numGroups.
k – FIR filter of the shape [firH, firW] or [firN] (separable). The default is [1] * factor, which corresponds to nearest-neighbor upsampling.
factor – Integer upsampling factor (default: 2).
gain – Scaling factor for signal magnitude (default: 1.0).
 Returns:
Tensor of the shape [N, C, H * factor, W * factor] or [N, H * factor, W * factor, C], and same datatype as x.
espnet2.enh.layers.ncsnpp_utils.layerspp¶
Layers for defining NCSN++.

class
espnet2.enh.layers.ncsnpp_utils.layerspp.
AttnBlockpp
(channels, skip_rescale=False, init_scale=0.0)[source]¶ Bases:
torch.nn.modules.module.Module
Channel-wise self-attention block. Modified from DDPM.

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layerspp.
Combine
(dim1, dim2, method='cat')[source]¶ Bases:
torch.nn.modules.module.Module
Combine information from skip connections.

forward
(x, y)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layerspp.
Downsample
(in_ch=None, out_ch=None, with_conv=False, fir=False, fir_kernel=(1, 3, 3, 1))[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layerspp.
GaussianFourierProjection
(embedding_size=256, scale=1.0)[source]¶ Bases:
torch.nn.modules.module.Module
Gaussian Fourier embeddings for noise levels.

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layerspp.
ResnetBlockBigGANpp
(act, in_ch, out_ch=None, temb_dim=None, up=False, down=False, dropout=0.1, fir=False, fir_kernel=(1, 3, 3, 1), skip_rescale=True, init_scale=0.0)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x, temb=None)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layerspp.
ResnetBlockDDPMpp
(act, in_ch, out_ch=None, temb_dim=None, conv_shortcut=False, dropout=0.1, skip_rescale=False, init_scale=0.0)[source]¶ Bases:
torch.nn.modules.module.Module
ResBlock adapted from DDPM.

forward
(x, temb=None)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layerspp.
Upsample
(in_ch=None, out_ch=None, with_conv=False, fir=False, fir_kernel=(1, 3, 3, 1))[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.enh.layers.ncsnpp_utils.layers¶
Common layers for defining score networks.

class
espnet2.enh.layers.ncsnpp_utils.layers.
AttnBlock
(channels)[source]¶ Bases:
torch.nn.modules.module.Module
Channel-wise self-attention block.

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layers.
CRPBlock
(features, n_stages, act=ReLU(), maxpool=True)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layers.
CondCRPBlock
(features, n_stages, num_classes, normalizer, act=ReLU())[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x, y)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layers.
CondMSFBlock
(in_planes, features, num_classes, normalizer)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(xs, y, shape)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layers.
CondRCUBlock
(features, n_blocks, n_stages, num_classes, normalizer, act=ReLU())[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x, y)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layers.
CondRefineBlock
(in_planes, features, num_classes, normalizer, act=ReLU(), start=False, end=False)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(xs, y, output_shape)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layers.
ConditionalResidualBlock
(input_dim, output_dim, num_classes, resample=1, act=ELU(alpha=1.0), normalization=<class 'espnet2.enh.layers.ncsnpp_utils.normalization.ConditionalInstanceNorm2dPlus'>, adjust_padding=False, dilation=None)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x, y)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layers.
ConvMeanPool
(input_dim, output_dim, kernel_size=3, biases=True, adjust_padding=False)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(inputs)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layers.
Dense
[source]¶ Bases:
torch.nn.modules.module.Module
Linear layer with default_init.

class
espnet2.enh.layers.ncsnpp_utils.layers.
Downsample
(channels, with_conv=False)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layers.
MSFBlock
(in_planes, features)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(xs, shape)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layers.
MeanPoolConv
(input_dim, output_dim, kernel_size=3, biases=True)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(inputs)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layers.
NIN
(in_dim, num_units, init_scale=0.1)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layers.
RCUBlock
(features, n_blocks, n_stages, act=ReLU())[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layers.
RefineBlock
(in_planes, features, act=ReLU(), start=False, end=False, maxpool=True)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(xs, output_shape)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layers.
ResidualBlock
(input_dim, output_dim, resample=None, act=ELU(alpha=1.0), normalization=<class 'torch.nn.modules.instancenorm.InstanceNorm2d'>, adjust_padding=False, dilation=1)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layers.
ResnetBlockDDPM
(act, in_ch, out_ch=None, temb_dim=None, conv_shortcut=False, dropout=0.1)[source]¶ Bases:
torch.nn.modules.module.Module
The ResNet Blocks used in DDPM.

forward
(x, temb=None)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layers.
Upsample
(channels, with_conv=False)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.layers.ncsnpp_utils.layers.
UpsampleConv
(input_dim, output_dim, kernel_size=3, biases=True)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(inputs)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


espnet2.enh.layers.ncsnpp_utils.layers.
ddpm_conv1x1
(in_planes, out_planes, stride=1, bias=True, init_scale=1.0, padding=0)[source]¶ 1x1 convolution with DDPM initialization.

espnet2.enh.layers.ncsnpp_utils.layers.
ddpm_conv3x3
(in_planes, out_planes, stride=1, bias=True, dilation=1, init_scale=1.0, padding=1)[source]¶ 3x3 convolution with DDPM initialization.

espnet2.enh.layers.ncsnpp_utils.layers.
default_init
(scale=1.0)[source]¶ The same initialization used in DDPM.

espnet2.enh.layers.ncsnpp_utils.layers.
get_act
(config)[source]¶ Get activation functions from the config file.

espnet2.enh.layers.ncsnpp_utils.layers.
ncsn_conv1x1
(in_planes, out_planes, stride=1, bias=True, dilation=1, init_scale=1.0, padding=0)[source]¶ 1x1 convolution. Same as NCSNv1/v2.
espnet2.enh.layers.ncsnpp_utils.upfirdn2d¶
UpFirDn2d functions for upsampling, padding, FIR filter and downsampling.
Functions are ported from https://github.com/NVlabs/stylegan2.
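The four stages named above can be illustrated in one dimension. This is a minimal NumPy sketch, not the ported 2-D routine: the function name `upfirdn1d` and its arguments are hypothetical, and only non-negative padding is handled.

```python
import numpy as np


def upfirdn1d(x, kernel, up=1, down=1, pad=(0, 0)):
    """1-D illustration of upsample -> pad -> FIR filter -> downsample."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    # 1) Upsample by inserting (up - 1) zeros after every sample.
    upsampled = np.zeros(len(x) * up)
    upsampled[::up] = x
    # 2) Zero-pad on the left and right (non-negative padding only here).
    padded = np.pad(upsampled, pad)
    # 3) FIR filtering; "valid" keeps only fully overlapping positions.
    filtered = np.convolve(padded, kernel, mode="valid")
    # 4) Downsample by keeping every `down`-th sample.
    return filtered[::down]


out = upfirdn1d([1.0, 2.0, 3.0], kernel=[1.0, 1.0], up=2, pad=(1, 0))
print(out)  # [1. 1. 2. 2. 3. 3.]
```

With a two-tap box filter and `up=2`, the sketch reduces to nearest-neighbour upsampling, which is an easy way to check the order of the stages.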
espnet2.enh.separator.dccrn_separator¶

class
espnet2.enh.separator.dccrn_separator.
DCCRNSeparator
(input_dim: int, num_spk: int = 1, rnn_layer: int = 2, rnn_units: int = 256, masking_mode: str = 'E', use_clstm: bool = True, bidirectional: bool = False, use_cbn: bool = False, kernel_size: int = 5, kernel_num: List[int] = [32, 64, 128, 256, 256, 256], use_builtin_complex: bool = True, use_noise_mask: bool = False)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
DCCRN separator.
 Parameters:
input_dim (int) – input dimension.
num_spk (int, optional) – number of speakers. Defaults to 1.
rnn_layer (int, optional) – number of lstm layers in the crn. Defaults to 2.
rnn_units (int, optional) – rnn units. Defaults to 256.
masking_mode (str, optional) – usage of the estimated mask. Defaults to “E”.
use_clstm (bool, optional) – whether to use complex LSTM. Defaults to True.
bidirectional (bool, optional) – whether to use BLSTM. Defaults to False.
use_cbn (bool, optional) – whether to use complex BN. Defaults to False.
kernel_size (int, optional) – convolution kernel size. Defaults to 5.
kernel_num (list, optional) – output dimension of each layer of the encoder.
use_builtin_complex (bool, optional) – torch.complex if True, else ComplexTensor.
use_noise_mask (bool, optional) – whether to estimate the mask of noise.

apply_masks
(masks: List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], real: torch.Tensor, imag: torch.Tensor)[source]¶ apply masks
 Parameters:
masks – est_masks, [(B, T, F), …]
real (torch.Tensor) – real part of the noisy spectrum, (B, F, T)
imag (torch.Tensor) – imag part of the noisy spectrum, (B, F, T)
 Returns:
[(B, T, F), …]
 Return type:
masked (List[Union(torch.Tensor, ComplexTensor)])

create_masks
(mask_tensor: torch.Tensor)[source]¶ create estimated mask for each speaker
 Parameters:
mask_tensor (torch.Tensor) – output of decoder, shape (B, 2*num_spk, F-1, T)

forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, F]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
 Returns:
[(B, T, F), …]
ilens (torch.Tensor): (B,)
others: predicted data, e.g. masks: OrderedDict[‘mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), …, ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq)]
 Return type:
masked (List[Union(torch.Tensor, ComplexTensor)])

property
num_spk
¶
espnet2.enh.separator.__init__¶
espnet2.enh.separator.rnn_separator¶

class
espnet2.enh.separator.rnn_separator.
RNNSeparator
(input_dim: int, rnn_type: str = 'blstm', num_spk: int = 2, predict_noise: bool = False, nonlinear: str = 'sigmoid', layer: int = 3, unit: int = 512, dropout: float = 0.0)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
RNN Separator
 Parameters:
input_dim – input feature dimension
rnn_type – string, select from ‘blstm’, ‘lstm’ etc.
bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.
num_spk – number of speakers
predict_noise – whether to output the estimated noise signal
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
layer – int, number of stacked RNN layers. Default is 3.
unit – int, dimension of the hidden state.
dropout – float, dropout ratio. Default is 0.

forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
 Returns:
[(B, T, N), …]
ilens (torch.Tensor): (B,)
others: predicted data, e.g. masks: OrderedDict[‘mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), …, ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq)]
 Return type:
masked (List[Union(torch.Tensor, ComplexTensor)])
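The three-part return value documented above (a list of masked features, the lengths, and an ordered dictionary of masks) can be sketched with NumPy arrays standing in for tensors. `toy_mask_separator` is a hypothetical stand-in and not the RNN model itself; only the key naming `mask_spk1 … mask_spkn` is taken from the description.

```python
from collections import OrderedDict

import numpy as np


def toy_mask_separator(feature, ilens, num_spk=2, seed=0):
    """Hypothetical stand-in that mimics the documented return structure."""
    rng = np.random.default_rng(seed)
    n_feat = feature.shape[-1]
    masks = []
    for _ in range(num_spk):
        # Random projection followed by a sigmoid, so values lie in (0, 1).
        weight = rng.standard_normal((n_feat, n_feat))
        masks.append(1.0 / (1.0 + np.exp(-(feature @ weight))))
    # Each separated stream is the input feature scaled by its mask.
    masked = [feature * m for m in masks]
    others = OrderedDict((f"mask_spk{i + 1}", m) for i, m in enumerate(masks))
    return masked, ilens, others


feature = np.random.default_rng(1).standard_normal((4, 50, 16))  # [B, T, N]
ilens = np.full(4, 50)
masked, olens, others = toy_mask_separator(feature, ilens)
print(len(masked), masked[0].shape, list(others))
```

The list has one entry per speaker, and each entry keeps the [B, T, N] shape of the input feature.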

property
num_spk
¶
espnet2.enh.separator.svoice_separator¶

class
espnet2.enh.separator.svoice_separator.
Decoder
(kernel_size)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(est_source)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.separator.svoice_separator.
Encoder
(enc_kernel_size: int, enc_feat_dim: int)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(mixture)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.separator.svoice_separator.
SVoiceSeparator
(input_dim: int, enc_dim: int, kernel_size: int, hidden_size: int, num_spk: int = 2, num_layers: int = 4, segment_size: int = 20, bidirectional: bool = True, input_normalize: bool = False)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
SVoice model for speech separation.
 Reference:
Voice Separation with an Unknown Number of Multiple Speakers; E. Nachmani et al., 2020; https://arxiv.org/abs/2003.01531
 Parameters:
enc_dim – int, dimension of the encoder module’s output. (Default: 128)
kernel_size – int, the kernel size of Conv1D layer in both encoder and decoder modules. (Default: 8)
hidden_size – int, dimension of the hidden state in RNN layers. (Default: 128)
num_spk – int, the number of speakers in the output. (Default: 2)
num_layers – int, number of stacked MulCat blocks. (Default: 4)
segment_size – dual-path segment size. (Default: 20)
bidirectional – bool, whether the RNN layers are bidirectional. (Default: True)
input_normalize – bool, whether to apply GroupNorm on the input Tensor. (Default: False)

forward
(input: torch.Tensor, ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[torch.Tensor], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
 Returns:
[(B, T, N), …]
ilens (torch.Tensor): (B,)
others: predicted data, e.g. masks: OrderedDict[‘mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), …, ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq)]
 Return type:
masked (List[Union(torch.Tensor, ComplexTensor)])

property
num_spk
¶

espnet2.enh.separator.svoice_separator.
overlap_and_add
(signal, frame_step)[source]¶ Reconstructs a signal from a framed representation.
Adds potentially overlapping frames of a signal with shape […, frames, frame_length], offsetting subsequent frames by frame_step. The resulting tensor has shape […, output_size] where
output_size = (frames - 1) * frame_step + frame_length
 Args:
 signal: A […, frames, frame_length] Tensor. All dimensions may be unknown,
and rank must be at least 2.
 frame_step: An integer denoting overlap offsets.
Must be less than or equal to frame_length.
 Returns:
 A Tensor with shape […, output_size] containing the
overlap-added frames of signal’s innermost two dimensions.
output_size = (frames - 1) * frame_step + frame_length
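The reconstruction rule above can be sketched directly from the output_size formula. This is a plain NumPy loop written for clarity, not the vectorized routine in the module; the name `overlap_and_add_np` is hypothetical.

```python
import numpy as np


def overlap_and_add_np(signal, frame_step):
    """Overlap-add the last two axes of `signal` ([..., frames, frame_length])."""
    signal = np.asarray(signal, dtype=float)
    frames, frame_length = signal.shape[-2:]
    output_size = (frames - 1) * frame_step + frame_length
    output = np.zeros(signal.shape[:-2] + (output_size,))
    for i in range(frames):
        # Frame i starts i * frame_step samples into the output.
        start = i * frame_step
        output[..., start:start + frame_length] += signal[..., i, :]
    return output


framed = np.ones((3, 4))  # 3 frames of length 4
result = overlap_and_add_np(framed, frame_step=2)
print(result)  # [1. 1. 2. 2. 2. 2. 1. 1.]
```

Here output_size = (3 - 1) * 2 + 4 = 8, and samples covered by two frames sum to 2.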
Based on
espnet2.enh.separator.dan_separator¶

class
espnet2.enh.separator.dan_separator.
DANSeparator
(input_dim: int, rnn_type: str = 'blstm', num_spk: int = 2, nonlinear: str = 'tanh', layer: int = 2, unit: int = 512, emb_D: int = 40, dropout: float = 0.0)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Deep Attractor Network Separator
 Reference:
DEEP ATTRACTOR NETWORK FOR SINGLE-MICROPHONE SPEAKER SEPARATION; Zhuo Chen et al., 2017; https://pubmed.ncbi.nlm.nih.gov/29430212/
 Parameters:
input_dim – input feature dimension
rnn_type – string, select from ‘blstm’, ‘lstm’ etc.
bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.
num_spk – number of speakers
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
layer – int, number of stacked RNN layers. Default is 2.
unit – int, dimension of the hidden state.
emb_D – int, dimension of the attribute vector for one tf-bin.
dropout – float, dropout ratio. Default is 0.

forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, F]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model e.g. “feature_ref”: list of reference spectra List[(B, T, F)]
 Returns:
[(B, T, N), …]
ilens (torch.Tensor): (B,)
others: predicted data, e.g. masks: OrderedDict[‘mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), …, ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq)]
 Return type:
masked (List[Union(torch.Tensor, ComplexTensor)])
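The attractor idea from the cited paper can be sketched in NumPy: each source gets an attractor equal to the mean embedding of the tf-bins assigned to it, and masks follow from the similarity between embeddings and attractors. This is an assumption-laden illustration, not this class's code: the one-hot assignment is random here (in training it would typically be derived from reference spectra such as “feature_ref”), and the softmax is taken from the paper rather than from the `nonlinear` option.

```python
import numpy as np


def attractor_masks(embedding, assignment):
    """embedding: [TF, D] per-bin embeddings; assignment: [TF, C] one-hot."""
    # Attractor of each source = mean embedding of the bins assigned to it.
    counts = assignment.sum(axis=0)[:, None]  # [C, 1]
    attractors = assignment.T @ embedding / np.maximum(counts, 1e-8)  # [C, D]
    # Similarity of every bin to every attractor, turned into masks by softmax.
    logits = embedding @ attractors.T  # [TF, C]
    logits -= logits.max(axis=1, keepdims=True)
    masks = np.exp(logits)
    masks /= masks.sum(axis=1, keepdims=True)
    return attractors, masks


rng = np.random.default_rng(0)
emb_D, num_spk, n_bins = 40, 2, 200
assignment = np.eye(num_spk)[rng.integers(0, num_spk, n_bins)]
embedding = rng.standard_normal((n_bins, emb_D))
attractors, masks = attractor_masks(embedding, assignment)
print(attractors.shape, masks.shape)  # (2, 40) (200, 2)
```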

property
num_spk
¶
espnet2.enh.separator.dptnet_separator¶

class
espnet2.enh.separator.dptnet_separator.
DPTNetSeparator
(input_dim: int, post_enc_relu: bool = True, rnn_type: str = 'lstm', bidirectional: bool = True, num_spk: int = 2, predict_noise: bool = False, unit: int = 256, att_heads: int = 4, dropout: float = 0.0, activation: str = 'relu', norm_type: str = 'gLN', layer: int = 6, segment_size: int = 20, nonlinear: str = 'relu')[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Dual-Path Transformer Network (DPTNet) Separator
 Parameters:
input_dim – input feature dimension
rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.
bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.
num_spk – number of speakers
predict_noise – whether to output the estimated noise signal
unit – int, dimension of the hidden state.
att_heads – number of attention heads.
dropout – float, dropout ratio. Default is 0.
activation – activation function applied at the output of RNN.
norm_type – type of normalization to use after each inter- or intra-chunk Transformer block.
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
layer – int, number of stacked RNN layers. Default is 6.
segment_size – dual-path segment size
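To make segment_size concrete, the following NumPy sketch splits a [T, N] feature sequence into chunks of segment_size frames. The 50% overlap and the end-only zero padding are assumptions chosen for illustration; the actual model may pad differently. The helper name `segment` is hypothetical.

```python
import numpy as np


def segment(feature, segment_size):
    """Split [T, N] features into 50%-overlapping chunks [S, segment_size, N]."""
    hop = segment_size // 2  # assumes segment_size >= 2
    n_frames = feature.shape[0]
    # Number of chunks needed to cover all frames (at least one).
    n_chunks = max(1, int(np.ceil((n_frames - segment_size) / hop)) + 1)
    # Zero-pad the end so that the last chunk is complete.
    padded_len = (n_chunks - 1) * hop + segment_size
    padded = np.pad(feature, ((0, padded_len - n_frames), (0, 0)))
    return np.stack(
        [padded[i * hop:i * hop + segment_size] for i in range(n_chunks)]
    )


feature = np.arange(50 * 3, dtype=float).reshape(50, 3)
chunks = segment(feature, segment_size=20)
print(chunks.shape)  # (4, 20, 3)
```

In a dual-path layout, intra-chunk processing runs along axis 1 (within a chunk) and inter-chunk processing along axis 0 (across chunks).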

forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
 Returns:
[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[
’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]
 Return type:
masked (List[Union(torch.Tensor, ComplexTensor)])

property
num_spk
¶
espnet2.enh.separator.dpcl_e2e_separator¶

class
espnet2.enh.separator.dpcl_e2e_separator.
DPCLE2ESeparator
(input_dim: int, rnn_type: str = 'blstm', num_spk: int = 2, predict_noise: bool = False, nonlinear: str = 'tanh', layer: int = 2, unit: int = 512, emb_D: int = 40, dropout: float = 0.0, alpha: float = 5.0, max_iteration: int = 500, threshold: float = 1e-05)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Deep Clustering End-to-End Separator
References
Single-Channel Multi-Speaker Separation using Deep Clustering; Yusuf Isik et al., 2016; https://www.isca-speech.org/archive/interspeech_2016/isik16_interspeech.html
 Parameters:
input_dim – input feature dimension
rnn_type – string, select from ‘blstm’, ‘lstm’ etc.
bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.
num_spk – number of speakers
predict_noise – whether to output the estimated noise signal
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
layer – int, number of stacked RNN layers. Default is 2.
unit – int, dimension of the hidden state.
emb_D – int, dimension of the feature vector for a tf-bin.
dropout – float, dropout ratio. Default is 0.
alpha – float, the clustering hardness parameter.
max_iteration – int, the max iterations of soft k-means.
threshold – float, the threshold to end the soft k-means process.
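The roles of alpha, max_iteration and threshold can be seen in a generic soft k-means loop. The sketch below is a textbook formulation in NumPy, assumed for illustration and not copied from this class: assignments are a softmax of -alpha times the squared distance, and iteration stops once the centers move less than threshold.

```python
import numpy as np


def soft_kmeans(embedding, num_spk, alpha=5.0, max_iteration=500,
                threshold=1e-5, seed=0):
    """embedding: [n_bins, D]. Returns (centers [C, D], gamma [n_bins, C])."""
    rng = np.random.default_rng(seed)
    centers = embedding[rng.choice(len(embedding), num_spk, replace=False)]
    for _ in range(max_iteration):
        # Soft assignment: softmax of -alpha * squared distance to each center.
        dist = ((embedding[:, None, :] - centers[None]) ** 2).sum(-1)
        logits = -alpha * dist
        logits -= logits.max(axis=1, keepdims=True)
        gamma = np.exp(logits)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # Update each center as the gamma-weighted mean of the embeddings.
        weight = np.maximum(gamma.sum(axis=0)[:, None], 1e-8)
        new_centers = gamma.T @ embedding / weight
        shift = np.abs(new_centers - centers).max()
        centers = new_centers
        if shift < threshold:
            break
    return centers, gamma


rng = np.random.default_rng(1)
embedding = np.concatenate([
    rng.normal(0.0, 0.3, size=(50, 2)),
    rng.normal(3.0, 0.3, size=(50, 2)),
])
centers, gamma = soft_kmeans(embedding, num_spk=2)
print(centers.shape, gamma.shape)  # (2, 2) (100, 2)
```

A larger alpha makes the assignments harder (closer to ordinary k-means), while a smaller alpha spreads each point over several clusters.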

forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, F]
ilens (torch.Tensor) – input lengths [Batch]
 Returns:
[(B, T, N), …]
ilens (torch.Tensor): (B,)
others: predicted data, e.g. V or masks: OrderedDict[‘mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), …, ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq)]
 Return type:
masked (List[Union(torch.Tensor, ComplexTensor)])

property
num_spk
¶
espnet2.enh.separator.conformer_separator¶

class
espnet2.enh.separator.conformer_separator.
ConformerSeparator
(input_dim: int, num_spk: int = 2, predict_noise: bool = False, adim: int = 384, aheads: int = 4, layers: int = 6, linear_units: int = 1536, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, normalize_before: bool = False, concat_after: bool = False, dropout_rate: float = 0.1, input_layer: str = 'linear', positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.1, nonlinear: str = 'relu', conformer_pos_enc_layer_type: str = 'rel_pos', conformer_self_attn_layer_type: str = 'rel_selfattn', conformer_activation_type: str = 'swish', use_macaron_style_in_conformer: bool = True, use_cnn_in_conformer: bool = True, conformer_enc_kernel_size: int = 7, padding_idx: int = -1)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Conformer separator.
 Parameters:
input_dim – input feature dimension
num_spk – number of speakers
predict_noise – whether to output the estimated noise signal
adim (int) – Dimension of attention.
aheads (int) – The number of heads of multi head attention.
linear_units (int) – The number of units of positionwise feed forward.
layers (int) – The number of transformer blocks.
dropout_rate (float) – Dropout rate.
input_layer (Union[str, torch.nn.Module]) – Input layer type.
attention_dropout_rate (float) – Dropout rate in attention.
positional_dropout_rate (float) – Dropout rate after adding positional encoding.
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))). If False, no additional linear layer is applied, i.e. x -> x + att(x).
conformer_pos_enc_layer_type (str) – Encoder positional encoding layer type.
conformer_self_attn_layer_type (str) – Encoder attention layer type.
conformer_activation_type (str) – Encoder activation function type.
positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.
positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.
use_macaron_style_in_conformer (bool) – Whether to use macaron style for positionwise layer.
use_cnn_in_conformer (bool) – Whether to use convolution module.
conformer_enc_kernel_size (int) – Kernel size of convolution module.
padding_idx (int) – Padding idx for input_layer=embed.
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
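The two residual forms selected by concat_after can be written out in a few lines. This NumPy sketch uses a placeholder `toy_att` in place of a real self-attention layer and omits the bias of the linear layer; all names here are hypothetical.

```python
import numpy as np


def residual_attention(x, att, linear_w=None, concat_after=False):
    """Combine a block input `x` with its attention output `att(x)`."""
    att_out = att(x)
    if concat_after:
        # x -> x + linear(concat(x, att(x))); linear_w maps 2*D features to D.
        return x + np.concatenate([x, att_out], axis=-1) @ linear_w
    # x -> x + att(x)
    return x + att_out


def toy_att(x):
    # Placeholder for a real self-attention layer.
    return 0.5 * x


dim = 4
x = np.arange(2 * 3 * dim, dtype=float).reshape(2, 3, dim)
# A projection that ignores the x half and passes att(x) through unchanged.
linear_w = np.vstack([np.zeros((dim, dim)), np.eye(dim)])
plain = residual_attention(x, toy_att)
concat = residual_attention(x, toy_att, linear_w, concat_after=True)
print(np.allclose(plain, concat))  # True
```

With this particular projection the two forms coincide; a learned projection would instead mix the input and the attention output.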

forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
 Returns:
[(B, T, N), …]
ilens (torch.Tensor): (B,)
others: predicted data, e.g. masks: OrderedDict[‘mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), …, ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq)]
 Return type:
masked (List[Union(torch.Tensor, ComplexTensor)])

property
num_spk
¶
espnet2.enh.separator.asteroid_models¶

class
espnet2.enh.separator.asteroid_models.
AsteroidModel_Converter
(encoder_output_dim: int, model_name: str, num_spk: int, pretrained_path: str = '', loss_type: str = 'si_snr', **model_related_kwargs)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
The class to convert the models from asteroid to AbsSeparator.
 Parameters:
encoder_output_dim – input feature dimension, default=1 after the NullEncoder
num_spk – number of speakers
loss_type – loss type of enhancement
model_name – Asteroid model names, e.g. ConvTasNet, DPTNet. Refers to https://github.com/asteroid-team/asteroid/blob/master/asteroid/models/__init__.py
pretrained_path – the name of pretrained model from Asteroid in HF hub. Refers to: https://github.com/asteroid-team/asteroid/blob/master/docs/source/readmes/pretrained_models.md and https://huggingface.co/models?filter=asteroid
model_related_kwargs – more args towards each specific asteroid model.

forward
(input: torch.Tensor, ilens: torch.Tensor = None, additional: Optional[Dict] = None)[source]¶ Whole forward of asteroid models.
 Parameters:
input (torch.Tensor) – Raw Waveforms [B, T]
ilens (torch.Tensor) – input lengths [B]
additional (Dict or None) – other data included in model
 Returns:
[(B, T), …]
ilens (torch.Tensor): (B,)
others: predicted data, e.g. masks: OrderedDict[‘mask_spk1’: torch.Tensor(Batch, T), ‘mask_spk2’: torch.Tensor(Batch, T), …, ‘mask_spkn’: torch.Tensor(Batch, T)]
 Return type:
estimated Waveforms (List[torch.Tensor])

forward_rawwav
(input: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Output with waveforms.

property
num_spk
¶
espnet2.enh.separator.tfgridnetv2_separator¶

class
espnet2.enh.separator.tfgridnetv2_separator.
AllHeadPReLULayerNormalization4DCF
(input_dimension, eps=1e-05)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.separator.tfgridnetv2_separator.
GridNetV2Block
(emb_dim, emb_ks, emb_hs, n_freqs, hidden_channels, n_head=4, approx_qk_dim=512, activation='prelu', eps=1e-05)[source]¶ Bases:
torch.nn.modules.module.Module

class
espnet2.enh.separator.tfgridnetv2_separator.
LayerNormalization4DCF
(input_dimension, eps=1e-05)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.separator.tfgridnetv2_separator.
TFGridNetV2
(input_dim, n_srcs=2, n_fft=128, stride=64, window='hann', n_imics=1, n_layers=6, lstm_hidden_units=192, attn_n_head=4, attn_approx_qk_dim=512, emb_dim=48, emb_ks=4, emb_hs=1, activation='prelu', eps=1e-05, use_builtin_complex=False)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Offline TFGridNetV2. Compared with TFGridNet, TFGridNetV2 speeds up the code
by vectorizing multiple heads in self-attention and by handling Deconv1D better in each intra- and inter-block when emb_ks == emb_hs.
Reference: [1] Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation”, in TASLP, 2023. [2] Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “TF-GridNet: Making Time-Frequency Domain Models Great Again for Monaural Speaker Separation”, in ICASSP, 2023.
NOTES: As outlined in the Reference, this model works best when trained with variance-normalized mixture input and target, e.g., with mixture of shape [batch, samples, microphones], you normalize it by dividing with torch.std(mixture, (1, 2)). You must do the same for the target signals. It is encouraged to do so when not using scale-invariant loss functions such as SI-SDR. Specifically, use:
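A minimal NumPy sketch of this normalization, with arrays standing in for tensors. The shapes, the random signals, and the choice to give the target the same [batch, samples, microphones] layout are assumptions made for illustration; `ddof=1` mirrors the unbiased default of torch.std.

```python
import numpy as np

rng = np.random.default_rng(0)
mixture = 0.05 * rng.standard_normal((2, 16000, 1))  # [batch, samples, microphones]
target = 0.03 * rng.standard_normal((2, 16000, 1))   # one target signal per item

# One standard deviation per batch item, computed over samples and microphones.
std = np.std(mixture, axis=(1, 2), keepdims=True, ddof=1)

# The mixture and the target are both divided by the mixture's std.
mixture_norm = mixture / std
target_norm = target / std

print(np.std(mixture_norm, axis=(1, 2), ddof=1))
```

Dividing the target by the mixture's standard deviation, rather than its own, keeps the relative scale between mixture and target unchanged.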
 Parameters:
input_dim – placeholder, not used
n_srcs – number of output sources/speakers.
n_fft – stft window size.
stride – stft stride.
window – stft window type choose between ‘hamming’, ‘hanning’ or None.
n_imics – number of microphone channels (only fixed-array geometry supported).
n_layers – number of TFGridNetV2 blocks.
lstm_hidden_units – number of hidden units in LSTM.
attn_n_head – number of heads in self-attention
attn_approx_qk_dim – approximate dimension of frame-level key and value tensors
emb_dim – embedding dimension
emb_ks – kernel size for unfolding and deconv1D
emb_hs – hop size for unfolding and deconv1D
activation – activation function to use in the whole TFGridNetV2 model, you can use any torch supported activation e.g. ‘relu’ or ‘elu’.
eps – small epsilon for normalization layers.
use_builtin_complex – whether to use builtin complex type or not.

forward
(input: torch.Tensor, ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[torch.Tensor], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.Tensor) – batched multichannel audio tensor with M audio channels and N samples [B, N, M]
ilens (torch.Tensor) – input lengths [B]
additional (Dict or None) – other data, currently unused in this model.
 Returns:
[(B, T), …] list of len n_srcs of mono audio tensors with T samples.
ilens (torch.Tensor): (B,)
additional (Dict or None): other data, currently unused in this model; it is also returned in the output.
 Return type:
enhanced (List[Union(torch.Tensor)])

property
num_spk
¶
espnet2.enh.separator.tfgridnetv3_separator¶

class
espnet2.enh.separator.tfgridnetv3_separator.
AllHeadPReLULayerNormalization4DC
(input_dimension, eps=1e-05)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
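The note above is inherited from torch.nn.Module. As a small sketch of what it means in practice (the `Scale` module and the `calls` list are made up for illustration), a registered forward hook runs when the module instance is called, but not when forward() is invoked directly:

```python
import torch


class Scale(torch.nn.Module):
    """Toy module that doubles its input."""

    def forward(self, x):
        return 2.0 * x


calls = []
layer = Scale()
# A forward hook is only triggered through Module.__call__.
handle = layer.register_forward_hook(
    lambda module, inputs, output: calls.append("hook")
)

x = torch.ones(3)
y_call = layer(x)            # runs forward() and the registered hook
y_direct = layer.forward(x)  # runs forward() only; the hook is skipped
handle.remove()
```

Both calls return the same tensor here; only the hook side effect differs.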


class
espnet2.enh.separator.tfgridnetv3_separator.
GridNetV3Block
(emb_dim, emb_ks, emb_hs, hidden_channels, n_head=4, qk_output_channel=4, activation='prelu', eps=1e-05)[source]¶ Bases:
torch.nn.modules.module.Module

class
espnet2.enh.separator.tfgridnetv3_separator.
LayerNormalization
(input_dim, dim=1, total_dim=4, eps=1e-05)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.separator.tfgridnetv3_separator.
TFGridNetV3
(input_dim, n_srcs=2, n_imics=1, n_layers=6, lstm_hidden_units=192, attn_n_head=4, attn_qk_output_channel=4, emb_dim=48, emb_ks=4, emb_hs=1, activation='prelu', eps=1e-05)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Offline TFGridNetV3.
On top of TFGridNetV2, TFGridNetV3 slightly modifies the internal architecture to make the model sampling-frequency-independent (SFI). This is achieved by making all network layers independent of the input time and frequency dimensions.
Reference:
[1] Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation”, in TASLP, 2023.
[2] Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “TF-GridNet: Making Time-Frequency Domain Models Great Again for Monaural Speaker Separation”, in ICASSP, 2023.
NOTES: As outlined in the Reference, this model works best when trained with variance-normalized mixture input and target. For example, with a mixture of shape [batch, samples, microphones], normalize it by dividing by torch.std(mixture, (1, 2)), and do the same for the target signals. This is especially encouraged when not using scale-invariant loss functions such as SI-SDR.
 Parameters:
input_dim – placeholder, not used
n_srcs – number of output sources/speakers.
n_fft – STFT window size.
stride – STFT stride.
window – STFT window type; choose between ‘hamming’, ‘hanning’ or None.
n_imics – number of microphone channels (only fixed-array geometry supported).
n_layers – number of TFGridNetV3 blocks.
lstm_hidden_units – number of hidden units in LSTM.
attn_n_head – number of heads in self-attention.
attn_qk_output_channel – output channels of point-wise conv2d for getting key and query.
emb_dim – embedding dimension
emb_ks – kernel size for unfolding and deconv1D
emb_hs – hop size for unfolding and deconv1D
activation – activation function to use in the whole TFGridNetV3 model, you can use any torch supported activation e.g. ‘relu’ or ‘elu’.
eps – small epsilon for normalization layers.
use_builtin_complex – whether to use builtin complex type or not.

forward
(input: torch.Tensor, ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[torch.Tensor], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.Tensor) – batched multichannel audio tensor with M audio channels and N samples [B, T, F]
ilens (torch.Tensor) – input lengths [B]
additional (Dict or None) – other data, currently unused in this model.
 Returns:
enhanced (List[torch.Tensor]) – [(B, T), …] list of length n_srcs of mono audio tensors with T samples.
ilens (torch.Tensor) – (B,)
additional (Dict or None) – other data, currently unused in this model; it is also returned in the output.

property
num_spk
¶
espnet2.enh.separator.dpcl_separator¶

class
espnet2.enh.separator.dpcl_separator.
DPCLSeparator
(input_dim: int, rnn_type: str = 'blstm', num_spk: int = 2, nonlinear: str = 'tanh', layer: int = 2, unit: int = 512, emb_D: int = 40, dropout: float = 0.0)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Deep Clustering Separator.
References
 [1] Deep clustering: Discriminative embeddings for segmentation and separation; John R. Hershey et al., 2016; https://ieeexplore.ieee.org/document/7471631
 [2] Manifold-Aware Deep Clustering: Maximizing Angles Between Embedding Vectors Based on Regular Simplex; Tanaka, K. et al., 2021; https://www.isca-speech.org/archive/interspeech_2021/tanaka21_interspeech.html
 Parameters:
input_dim – input feature dimension
rnn_type – string, select from ‘blstm’, ‘lstm’ etc.
bidirectional – bool, whether the interchunk RNN layers are bidirectional.
num_spk – number of speakers
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
layer – int, number of stacked RNN layers. Default is 2.
unit – int, dimension of the hidden state.
emb_D – int, dimension of the feature vector for a tf-bin.
dropout – float, dropout ratio. Default is 0.

forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, F]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
 Returns:
masked (List[Union(torch.Tensor, ComplexTensor)]) – [(B, T, N), …]
ilens (torch.Tensor) – (B,)
others (OrderedDict) – predicted data, e.g. tf_embedding:
OrderedDict[
‘tf_embedding’: learned embedding of all T-F bins (B, T * F, D),
]

property
num_spk
¶
espnet2.enh.separator.tfgridnet_separator¶

class
espnet2.enh.separator.tfgridnet_separator.
GridNetBlock
(emb_dim, emb_ks, emb_hs, n_freqs, hidden_channels, n_head=4, approx_qk_dim=512, activation='prelu', eps=1e-05)[source]¶ Bases:
torch.nn.modules.module.Module

class
espnet2.enh.separator.tfgridnet_separator.
LayerNormalization4D
(input_dimension, eps=1e-05)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.separator.tfgridnet_separator.
LayerNormalization4DCF
(input_dimension, eps=1e-05)[source]¶ Bases:
torch.nn.modules.module.Module

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.separator.tfgridnet_separator.
TFGridNet
(input_dim, n_srcs=2, n_fft=128, stride=64, window='hann', n_imics=1, n_layers=6, lstm_hidden_units=192, attn_n_head=4, attn_approx_qk_dim=512, emb_dim=48, emb_ks=4, emb_hs=1, activation='prelu', eps=1e-05, use_builtin_complex=False, ref_channel=1)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Offline TFGridNet
Reference:
[1] Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation”, in arXiv preprint arXiv:2211.12433, 2022.
[2] Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “TF-GridNet: Making Time-Frequency Domain Models Great Again for Monaural Speaker Separation”, in arXiv preprint arXiv:2209.03952, 2022.
NOTES: As outlined in the Reference, this model works best when trained with variance-normalized mixture input and target. For example, with a mixture of shape [batch, samples, microphones], normalize it by dividing by torch.std(mixture, (1, 2)), and do the same for the target signals. This is especially encouraged when not using scale-invariant loss functions such as SI-SDR.
 Parameters:
input_dim – placeholder, not used
n_srcs – number of output sources/speakers.
n_fft – STFT window size.
stride – STFT stride.
window – STFT window type; choose between ‘hamming’, ‘hanning’ or None.
n_imics – number of microphone channels (only fixed-array geometry supported).
n_layers – number of TFGridNet blocks.
lstm_hidden_units – number of hidden units in LSTM.
attn_n_head – number of heads in self-attention.
attn_approx_qk_dim – approximate dimension of frame-level key and value tensors.
emb_dim – embedding dimension
emb_ks – kernel size for unfolding and deconv1D
emb_hs – hop size for unfolding and deconv1D
activation – activation function to use in the whole TFGridNet model, you can use any torch supported activation e.g. ‘relu’ or ‘elu’.
eps – small epsilon for normalization layers.
use_builtin_complex – whether to use builtin complex type or not.
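As a rough sketch of how n_fft and stride determine the time-frequency grid that the blocks operate on: a one-sided STFT yields n_fft // 2 + 1 frequency bins, and, assuming a centered STFT that pads the signal (an assumption about the framing, not something stated above), the number of frames is 1 + num_samples // stride. The helper name `stft_grid` is hypothetical.

```python
def stft_grid(num_samples, n_fft=128, stride=64):
    """Return (n_frames, n_freqs) for a one-sided, centered STFT (assumed framing)."""
    n_freqs = n_fft // 2 + 1              # one-sided spectrum
    n_frames = 1 + num_samples // stride  # centered framing with padding
    return n_frames, n_freqs


frames, freqs = stft_grid(num_samples=16000)
print(frames, freqs)
```

With the default n_fft=128 and stride=64, one second of 16 kHz audio maps to a grid of 251 frames by 65 frequency bins under these assumptions.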

forward
(input: torch.Tensor, ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[torch.Tensor], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.Tensor) – batched multichannel audio tensor with M audio channels and N samples [B, N, M]
ilens (torch.Tensor) – input lengths [B]
additional (Dict or None) – other data, currently unused in this model.
 Returns:
enhanced (List[torch.Tensor]) – [(B, T), …] list of length n_srcs of mono audio tensors with T samples.
ilens (torch.Tensor) – (B,)
additional (Dict or None) – other data, currently unused in this model; it is also returned in the output.

property
num_spk
¶
espnet2.enh.separator.abs_separator¶

class
espnet2.enh.separator.abs_separator.
AbsSeparator
(*args, **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract
forward
(input: torch.Tensor, ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[Tuple[torch.Tensor], torch.Tensor, collections.OrderedDict][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract property
num_spk
¶
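To illustrate the contract that the separators in this package follow (a forward() returning a list of per-speaker outputs, the lengths, and an OrderedDict of extra predictions, plus a num_spk property), here is a minimal sketch. It deliberately does not import ESPnet: `SeparatorInterface` is a stand-in that mirrors the signature documented above, and `SigmoidMaskSeparator` is a toy model invented for this example.

```python
from abc import ABC, abstractmethod
from collections import OrderedDict

import torch


class SeparatorInterface(torch.nn.Module, ABC):
    """Stand-in mirroring the documented interface (not the ESPnet class)."""

    @abstractmethod
    def forward(self, input, ilens, additional=None):
        raise NotImplementedError

    @property
    @abstractmethod
    def num_spk(self):
        raise NotImplementedError


class SigmoidMaskSeparator(SeparatorInterface):
    """Toy separator: one linear layer per speaker predicts a sigmoid mask."""

    def __init__(self, input_dim, num_spk=2):
        super().__init__()
        self._num_spk = num_spk
        self.heads = torch.nn.ModuleList(
            [torch.nn.Linear(input_dim, input_dim) for _ in range(num_spk)]
        )

    def forward(self, input, ilens, additional=None):
        masks = [torch.sigmoid(head(input)) for head in self.heads]
        masked = [input * m for m in masks]  # apply each mask to the feature
        others = OrderedDict(
            (f"mask_spk{i + 1}", m) for i, m in enumerate(masks)
        )
        return masked, ilens, others

    @property
    def num_spk(self):
        return self._num_spk


sep = SigmoidMaskSeparator(input_dim=65)
feats = torch.rand(3, 100, 65)  # [B, T, F]
ilens = torch.tensor([100, 90, 80])
masked, olens, others = sep(feats, ilens)
```

The mask keys follow the 'mask_spk1', 'mask_spk2', … naming that appears in the Returns sections of the mask-based separators in this package.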

espnet2.enh.separator.fasnet_separator¶

class
espnet2.enh.separator.fasnet_separator.
FaSNetSeparator
(input_dim: int, enc_dim: int, feature_dim: int, hidden_dim: int, layer: int, segment_size: int, num_spk: int, win_len: int, context_len: int, fasnet_type: str, dropout: float = 0.0, sr: int = 16000, predict_noise: bool = False)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Filter-and-sum Network (FaSNet) Separator
 Parameters:
input_dim – required by AbsSeparator. Not used in this model.
enc_dim – encoder dimension
feature_dim – feature dimension
hidden_dim – hidden dimension in DPRNN
layer – number of DPRNN blocks in iFaSNet
segment_size – dual-path segment size
num_spk – number of speakers
win_len – window length in milliseconds
context_len – context length in milliseconds
fasnet_type – ‘fasnet’ or ‘ifasnet’. Select between the original FaSNet and the implicit FaSNet (iFaSNet)
dropout – dropout rate. Default is 0.
sr – sample rate of input audio
predict_noise – whether to output the estimated noise signal
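Because win_len and context_len are given in milliseconds while the audio is sampled at sr, it can help to see what they amount to in samples. The sketch below is plain arithmetic under the assumption of simple truncation; the helper name and the 4 ms / 16 ms values are hypothetical, and how the model itself rounds is not stated here.

```python
def ms_to_samples(length_ms, sr=16000):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(sr * length_ms / 1000)


win = ms_to_samples(4)   # win_len = 4 ms (hypothetical value)
ctx = ms_to_samples(16)  # context_len = 16 ms (hypothetical value)
print(win, ctx)
```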

forward
(input: torch.Tensor, ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[torch.Tensor], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.Tensor) – (Batch, samples, channels)
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
 Returns:
separated (List[Union(torch.Tensor, ComplexTensor)]) – [(B, T, N), …]
ilens (torch.Tensor) – (B,)
others (OrderedDict) – predicted data, e.g. masks:
OrderedDict[
‘mask_spk1’: torch.Tensor(Batch, Frames, Freq),
‘mask_spk2’: torch.Tensor(Batch, Frames, Freq),
…
‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]

property
num_spk
¶
espnet2.enh.separator.ineube_separator¶

class
espnet2.enh.separator.ineube_separator.
iNeuBe
(n_spk=1, n_fft=512, stride=128, window='hann', mic_channels=1, hid_chans=32, hid_chans_dense=32, ksz_dense=(3, 3), ksz_tcn=3, tcn_repeats=4, tcn_blocks=7, tcn_channels=384, activation='elu', output_from='dnn1', n_chunks=3, freeze_dnn1=False, tik_eps=1e-08)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
iNeuBe, iterative neural/beamforming enhancement
Reference: Lu, Y. J., Cornell, S., Chang, X., Zhang, W., Li, C., Ni, Z., … & Watanabe, S. Towards Low-Distortion Multi-Channel Speech Enhancement: The ESPNET-SE Submission to the L3DAS22 Challenge. ICASSP 2022, pp. 9201-9205.
NOTES: As outlined in the Reference, this model works best when coupled with the MultiResL1SpecLoss defined in criterions/time_domain.py. The model is trained with variance-normalized mixture input and target; e.g., with a mixture of shape [batch, microphones, samples], normalize it by dividing by torch.std(mixture, (1, 2)). You must do the same for the target signal. In the Reference, the variance normalization was performed offline (we normalized by the std computed on the entire training set and not for each input separately). However, we found that normalizing each input and target separately also works well.
 Parameters:
n_spk – number of output sources/speakers.
n_fft – STFT window size.
stride – STFT stride.
window – STFT window type; choose between ‘hamming’, ‘hanning’ or None.
mic_channels – number of microphone channels (only fixed-array geometry supported).
hid_chans – number of channels in the subsampling/upsampling conv layers.
hid_chans_dense – number of channels in the densenet layers (reduce this to reduce VRAM requirements).
ksz_dense – kernel size in the densenet layers throughout iNeuBe.
ksz_tcn – kernel size in the TCN submodule.
tcn_repeats – number of repetitions of blocks in the TCN submodule.
tcn_blocks – number of blocks in the TCN submodule.
tcn_channels – number of channels in the TCN submodule.
activation – activation function to use in the whole iNeuBe model, you can use any torch supported activation e.g. ‘relu’ or ‘elu’.
output_from – output the estimate from ‘dnn1’, ‘mfmcwf’ or ‘dnn2’.
n_chunks – number of future and past frames to consider for mfMCWF computation.
freeze_dnn1 – whether to freeze the dnn1 parameters during training of dnn2.
tik_eps – diagonal loading in the mfMCWF computation.

forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.Tensor/ComplexTensor) – batched multichannel audio tensor with C audio channels and T samples [B, T, C]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data, currently unused in this model.
 Returns:
enhanced (List[Union[torch.Tensor, ComplexTensor]]) – [(B, T), …] list of length n_spk of mono audio tensors with T samples.
ilens (torch.Tensor) – (B,)
additional (Dict or None) – other data, currently unused in this model; it is also returned in the output.

static
mfmcwf
(mixture, estimate, n_chunks, tik_eps)[source]¶ Multi-frame multi-channel Wiener filter.
 Parameters:
mixture (torch.Tensor) – multi-channel STFT complex mixture tensor of shape [B, T, C, F] (batch, frames, microphones, frequencies).
estimate (torch.Tensor) – monaural STFT complex estimate of the target source, of shape [B, T, F] (batch, frames, frequencies).
n_chunks (int) – number of past and future mfMCWF frames. If 0, a standard MCWF is computed.
tik_eps (float) – diagonal loading for matrix inversion in MCWF computation.
 Returns:
beamformed (torch.Tensor) – monaural STFT complex estimate of the target source after mfMCWF, of shape [B, T, F] (batch, frames, frequencies).

property
num_spk
¶

static
unfold
(tf_rep, chunk_size)[source]¶ Unfold an STFT representation to add context in the microphone-channel dimension.
 Parameters:
tf_rep (torch.Tensor) – 3D tensor (monaural complex STFT) of shape [B, T, F] (batch, frames, frequencies).
chunk_size (int) – number of past and future frames to consider.
 Returns:
est_unfolded (torch.Tensor) – complex STFT tensor with a context channel, of shape [B, T, C, F] (batch, frames, context, frequencies). This is the same layout as a multi-channel STFT with C microphones.
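The idea of unfolding a [B, T, F] representation into [B, T, C, F] by stacking neighbouring frames can be sketched as below. This is an illustrative reimplementation, not the ESPnet code: the function name `unfold_context`, the zero padding at the edges, and the choice C = 2 * chunk_size + 1 (chunk_size frames on each side plus the current frame) are all assumptions.

```python
import torch


def unfold_context(tf_rep, chunk_size):
    """Stack past/future frames of a [B, T, F] tensor into [B, T, C, F].

    Assumes C = 2 * chunk_size + 1 and zero padding at the sequence edges.
    """
    bsz, n_frames, n_freqs = tf_rep.shape
    # Zero-pad along the time axis so every frame has a full context window.
    pad = tf_rep.new_zeros(bsz, chunk_size, n_freqs)
    padded = torch.cat([pad, tf_rep, pad], dim=1)  # [B, T + 2 * chunk_size, F]
    # Take one time-shifted copy per context position and stack them.
    shifted = [padded[:, k : k + n_frames] for k in range(2 * chunk_size + 1)]
    return torch.stack(shifted, dim=2)  # [B, T, C, F]


spec = torch.randn(2, 10, 5, dtype=torch.complex64)
unfolded = unfold_context(spec, chunk_size=3)
```

In this sketch the middle context slot (index chunk_size) holds the current frame, lower indices hold past frames, and higher indices hold future frames.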
espnet2.enh.separator.transformer_separator¶

class
espnet2.enh.separator.transformer_separator.
TransformerSeparator
(input_dim: int, num_spk: int = 2, predict_noise: bool = False, adim: int = 384, aheads: int = 4, layers: int = 6, linear_units: int = 1536, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, normalize_before: bool = False, concat_after: bool = False, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.1, use_scaled_pos_enc: bool = True, nonlinear: str = 'relu')[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Transformer separator.
 Parameters:
input_dim – input feature dimension
num_spk – number of speakers
predict_noise – whether to output the estimated noise signal
adim (int) – Dimension of attention.
aheads (int) – The number of heads of multi-head attention.
linear_units (int) – The number of units of position-wise feed forward.
layers (int) – The number of transformer blocks.
dropout_rate (float) – Dropout rate.
attention_dropout_rate (float) – Dropout rate in attention.
positional_dropout_rate (float) – Dropout rate after adding positional encoding.
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. If True, an additional linear layer will be applied, i.e. x -> x + linear(concat(x, att(x))). If False, no additional linear layer will be applied, i.e. x -> x + att(x).
positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.
positionwise_conv_kernel_size (int) – Kernel size of position-wise conv1d layer.
use_scaled_pos_enc (bool) – use scaled positional encoding or not
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
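The two residual forms described for concat_after can be sketched with a generic self-attention layer. This is a simplified illustration built on torch.nn.MultiheadAttention, not the encoder layer that ESPnet uses; the class name `ResidualSelfAttention` and the small dimensions in the usage lines are made up for the example.

```python
import torch


class ResidualSelfAttention(torch.nn.Module):
    """Self-attention with either residual form selected by concat_after."""

    def __init__(self, adim=384, aheads=4, concat_after=False):
        super().__init__()
        self.att = torch.nn.MultiheadAttention(adim, aheads, batch_first=True)
        self.concat_after = concat_after
        if concat_after:
            # Extra linear layer mapping concat(x, att(x)) back to adim.
            self.concat_linear = torch.nn.Linear(2 * adim, adim)

    def forward(self, x):
        att_out, _ = self.att(x, x, x)
        if self.concat_after:
            # x -> x + linear(concat(x, att(x)))
            return x + self.concat_linear(torch.cat([x, att_out], dim=-1))
        # x -> x + att(x)
        return x + att_out


x = torch.randn(2, 5, 16)  # [B, T, adim]
y_plain = ResidualSelfAttention(adim=16, aheads=4, concat_after=False)(x)
y_concat = ResidualSelfAttention(adim=16, aheads=4, concat_after=True)(x)
```

Both variants keep the [B, T, adim] shape; the concat_after=True variant only adds the parameters of the extra linear layer.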

forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
 Returns:
masked (List[Union(torch.Tensor, ComplexTensor)]) – [(B, T, N), …]
ilens (torch.Tensor) – (B,)
others (OrderedDict) – predicted data, e.g. masks:
OrderedDict[
‘mask_spk1’: torch.Tensor(Batch, Frames, Freq),
‘mask_spk2’: torch.Tensor(Batch, Frames, Freq),
…
‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]

property
num_spk
¶
espnet2.enh.separator.tcn_separator¶

class
espnet2.enh.separator.tcn_separator.
TCNSeparator
(input_dim: int, num_spk: int = 2, predict_noise: bool = False, layer: int = 8, stack: int = 3, bottleneck_dim: int = 128, hidden_dim: int = 512, kernel: int = 3, causal: bool = False, norm_type: str = 'gLN', nonlinear: str = 'relu', pre_mask_nonlinear: str = 'prelu', masking: bool = True)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Temporal Convolution Separator
 Parameters:
input_dim – input feature dimension
num_spk – number of speakers
predict_noise – whether to output the estimated noise signal
layer – int, number of layers in each stack.
stack – int, number of stacks
bottleneck_dim – bottleneck dimension
hidden_dim – number of convolution channels
kernel – int, kernel size.
causal – bool, default False.
norm_type – str, choose from ‘BN’, ‘gLN’, ‘cLN’
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’, ‘linear’
pre_mask_nonlinear – the nonlinear function before masknet
masking – whether to use the masking or mapping based method
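To get a feel for how layer, stack and kernel interact, the receptive field of the stacked convolutions can be estimated. The sketch below assumes the usual Conv-TasNet convention in which the dilation doubles with every layer inside a stack (1, 2, 4, …) and each layer contains one dilated convolution; that convention is an assumption, since the dilation pattern is not spelled out in the parameter list above.

```python
def tcn_receptive_field(layer=8, stack=3, kernel=3):
    """Receptive field in frames, assuming dilations 1, 2, 4, ... per stack."""
    field = 1
    for _ in range(stack):
        for x in range(layer):
            dilation = 2 ** x
            # Each dilated convolution widens the field by (kernel - 1) * dilation.
            field += (kernel - 1) * dilation
    return field


print(tcn_receptive_field())
```

Under these assumptions the defaults (layer=8, stack=3, kernel=3) cover 1531 frames of the encoded feature sequence.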

forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
 Returns:
masked (List[Union(torch.Tensor, ComplexTensor)]) – [(B, T, N), …]
ilens (torch.Tensor) – (B,)
others (OrderedDict) – predicted data, e.g. masks:
OrderedDict[
‘mask_spk1’: torch.Tensor(Batch, Frames, Freq),
‘mask_spk2’: torch.Tensor(Batch, Frames, Freq),
…
‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]

property
num_spk
¶
espnet2.enh.separator.dc_crn_separator¶

class
espnet2.enh.separator.dc_crn_separator.
DC_CRNSeparator
(input_dim: int, num_spk: int = 2, predict_noise: bool = False, input_channels: List = [2, 16, 32, 64, 128, 256], enc_hid_channels: int = 8, enc_kernel_size: Tuple = (1, 3), enc_padding: Tuple = (0, 1), enc_last_kernel_size: Tuple = (1, 4), enc_last_stride: Tuple = (1, 2), enc_last_padding: Tuple = (0, 1), enc_layers: int = 5, skip_last_kernel_size: Tuple = (1, 3), skip_last_stride: Tuple = (1, 1), skip_last_padding: Tuple = (0, 1), glstm_groups: int = 2, glstm_layers: int = 2, glstm_bidirectional: bool = False, glstm_rearrange: bool = False, mode: str = 'masking', ref_channel: int = 0)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Densely-Connected Convolutional Recurrent Network (DC-CRN) Separator
 Reference:
Deep Learning Based Real-Time Speech Enhancement for Dual-Microphone Mobile Phones; Tan et al., 2020 https://web.cse.ohio-state.edu/~wang.77/papers/TZW.taslp21.pdf
 Parameters:
input_dim – input feature dimension
num_spk – number of speakers
predict_noise – whether to output the estimated noise signal
input_channels (list) – number of input channels for the stacked DenselyConnectedBlock layers Its length should be (number of DenselyConnectedBlock layers).
enc_hid_channels (int) – common number of intermediate channels for all DenselyConnectedBlock of the encoder
enc_kernel_size (tuple) – common kernel size for all DenselyConnectedBlock of the encoder
enc_padding (tuple) – common padding for all DenselyConnectedBlock of the encoder
enc_last_kernel_size (tuple) – common kernel size for the last Conv layer in all DenselyConnectedBlock of the encoder
enc_last_stride (tuple) – common stride for the last Conv layer in all DenselyConnectedBlock of the encoder
enc_last_padding (tuple) – common padding for the last Conv layer in all DenselyConnectedBlock of the encoder
enc_layers (int) – common total number of Conv layers for all DenselyConnectedBlock layers of the encoder
skip_last_kernel_size (tuple) – common kernel size for the last Conv layer in all DenselyConnectedBlock of the skip pathways
skip_last_stride (tuple) – common stride for the last Conv layer in all DenselyConnectedBlock of the skip pathways
skip_last_padding (tuple) – common padding for the last Conv layer in all DenselyConnectedBlock of the skip pathways
glstm_groups (int) – number of groups in each Grouped LSTM layer
glstm_layers (int) – number of Grouped LSTM layers
glstm_bidirectional (bool) – whether to use BLSTM or unidirectional LSTM in Grouped LSTM layers
glstm_rearrange (bool) – whether to apply the rearrange operation after each grouped LSTM layer
output_channels (int) – number of output channels (even number)
mode (str) – one of (“mapping”, “masking”). “mapping”: complex spectral mapping; “masking”: complex masking.
ref_channel (int) – index of the reference microphone

forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ DC-CRN Separator Forward.
 Parameters:
input (torch.Tensor or ComplexTensor) – Encoded feature [Batch, T, F] or [Batch, T, C, F]
ilens (torch.Tensor) – input lengths [Batch,]
 Returns:
masked (List[Union(torch.Tensor, ComplexTensor)]) – [(Batch, T, F), …]
ilens (torch.Tensor) – (B,)
others (OrderedDict) – predicted data, e.g. masks:
OrderedDict[
‘mask_spk1’: torch.Tensor(Batch, Frames, Freq),
‘mask_spk2’: torch.Tensor(Batch, Frames, Freq),
…
‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]

property
num_spk
¶
espnet2.enh.separator.neural_beamformer¶

class
espnet2.enh.separator.neural_beamformer.
NeuralBeamformer
(input_dim: int, num_spk: int = 1, loss_type: str = 'mask_mse', use_wpe: bool = False, wnet_type: str = 'blstmp', wlayers: int = 3, wunits: int = 300, wprojs: int = 320, wdropout_rate: float = 0.0, taps: int = 5, delay: int = 3, use_dnn_mask_for_wpe: bool = True, wnonlinear: str = 'crelu', multi_source_wpe: bool = True, wnormalization: bool = False, use_beamformer: bool = True, bnet_type: str = 'blstmp', blayers: int = 3, bunits: int = 300, bprojs: int = 320, badim: int = 320, ref_channel: int = 1, use_noise_mask: bool = True, bnonlinear: str = 'sigmoid', beamformer_type: str = 'mvdr_souden', rtf_iterations: int = 2, bdropout_rate: float = 0.0, shared_power: bool = True, use_torchaudio_api: bool = False, diagonal_loading: bool = True, diag_eps_wpe: float = 1e-07, diag_eps_bf: float = 1e-07, mask_flooring: bool = False, flooring_thres_wpe: float = 1e-06, flooring_thres_bf: float = 1e-06, use_torch_solver: bool = True)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator

forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.complex64/ComplexTensor) – mixed speech [Batch, Frames, Channel, Freq]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
 Returns:
enhanced (List[torch.complex64/ComplexTensor]) – enhanced speech (single-channel)
ilens (torch.Tensor) – output lengths
others (OrderedDict) – other predicted data:
OrderedDict[
‘dereverb1’: ComplexTensor(Batch, Frames, Channel, Freq),
‘mask_dereverb1’: torch.Tensor(Batch, Frames, Channel, Freq),
‘mask_noise1’: torch.Tensor(Batch, Frames, Channel, Freq),
‘mask_spk1’: torch.Tensor(Batch, Frames, Channel, Freq),
‘mask_spk2’: torch.Tensor(Batch, Frames, Channel, Freq),
…
‘mask_spkn’: torch.Tensor(Batch, Frames, Channel, Freq),
]

property
num_spk
¶

espnet2.enh.separator.uses_separator¶

class
espnet2.enh.separator.uses_separator.
USESSeparator
(input_dim: int, num_spk: int = 2, enc_channels: int = 256, bottleneck_size: int = 64, num_blocks: int = 6, num_spatial_blocks: int = 3, ref_channel: Optional[int] = None, segment_size: int = 64, memory_size: int = 20, memory_types: int = 1, rnn_type: str = 'lstm', bidirectional: bool = True, hidden_size: int = 128, att_heads: int = 4, dropout: float = 0.0, norm_type: str = 'cLN', activation: str = 'relu', ch_mode: Union[str, List[str]] = 'att', ch_att_dim: int = 256, eps: float = 1e-05, additional: dict = {})[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Unconstrained Speech Enhancement and Separation (USES) Network.
 Reference:
[1] W. Zhang, K. Saijo, Z.-Q. Wang, S. Watanabe, and Y. Qian, “Toward Universal Speech Enhancement for Diverse Input Conditions,” in Proc. ASRU, 2023.
 Parameters:
input_dim (int) – input feature dimension. Not used as the model is independent of the input size.
num_spk (int) – number of speakers.
enc_channels (int) – feature dimension after the Conv1D encoder.
bottleneck_size (int) – dimension of the bottleneck feature. Must be a multiple of att_heads.
num_blocks (int) – number of processing blocks.
num_spatial_blocks (int) – number of processing blocks with channel modeling.
ref_channel (int) – reference channel (used in channel modeling modules).
segment_size (int) – number of frames in each non-overlapping segment. This is used to segment long utterances into smaller chunks for efficient processing.
memory_size (int) – group size of global memory tokens. The basic use of memory tokens is to store the history information from previous segments. The memory tokens are updated by the output of the last block after processing each segment.
memory_types (int) – number of memory token groups. Each group corresponds to a different type of processing, i.e., the first group is used for denoising without dereverberation, and the second group is used for denoising with dereverberation.
rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.
bidirectional (bool) – whether the inter-chunk RNN layers are bidirectional.
hidden_size (int) – dimension of the hidden state.
att_heads (int) – number of attention heads.
dropout (float) – dropout ratio. Default is 0.
norm_type – type of normalization to use after each inter- or intra-chunk NN block.
activation – the nonlinear activation function.
ch_mode – str or list, mode of channel modeling. Select from “att” and “tac”.
ch_att_dim (int) – dimension of the channel attention.
ref_channel – Optional[int], index of the reference channel.
eps (float) – epsilon for layer normalization.
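Two of the constraints above lend themselves to a quick sanity check before building a model: bottleneck_size must be a multiple of att_heads, and an utterance of T frames is split into non-overlapping segments of segment_size frames. The helper below is a hypothetical pre-flight check, not part of ESPnet, and padding the last segment up to a full segment is an assumption about how the remainder is handled.

```python
import math


def check_uses_config(num_frames, segment_size=64, bottleneck_size=64, att_heads=4):
    """Validate the head constraint and count non-overlapping segments."""
    if bottleneck_size % att_heads != 0:
        raise ValueError("bottleneck_size must be a multiple of att_heads")
    n_segments = math.ceil(num_frames / segment_size)
    # Frames of padding needed so the last segment is full (assumed handling).
    padding = n_segments * segment_size - num_frames
    return n_segments, padding


result = check_uses_config(num_frames=1000)
print(result)
```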

forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.Tensor or ComplexTensor) – STFT spectrum [B, T, (C,) F (,2)], where
B is the batch size,
T is the number of time frames,
C is the number of microphone channels (optional),
F is the number of frequency bins,
2 is real and imaginary parts (optional if input is a complex tensor).
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model.
“mode”: one of (“no_dereverb”, “dereverb”, “both”)
1. “no_dereverb”: only use the first memory group for denoising without dereverberation
2. “dereverb”: only use the second memory group for denoising with dereverberation
3. “both”: use both memory groups for denoising with and without dereverberation
 Returns:
masked (List[Union(torch.Tensor, ComplexTensor)]) – [(B, T, F), …]
ilens (torch.Tensor) – (B,)
others (OrderedDict) – predicted data, e.g. masks:
OrderedDict[
‘mask_spk1’: torch.Tensor(Batch, Frames, Freq),
‘mask_spk2’: torch.Tensor(Batch, Frames, Freq),
…
‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]

property
num_spk
¶
espnet2.enh.separator.bsrnn_separator¶

class
espnet2.enh.separator.bsrnn_separator.
BSRNNSeparator
(input_dim: int, num_spk: int = 1, num_channels: int = 16, num_layers: int = 6, target_fs: int = 48000, causal: bool = True, ref_channel: Optional[int] = None)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Band-split RNN (BSRNN) separator.
 Reference:
[1] J. Yu, H. Chen, Y. Luo, R. Gu, and C. Weng, “High fidelity speech enhancement with band-split RNN,” in Proc. ISCA Interspeech, 2023. https://isca-speech.org/archive/interspeech_2023/yu23b_interspeech.html
[2] J. Yu, and Y. Luo, “Efficient monaural speech enhancement with universal sample rate band-split RNN,” in Proc. ICASSP, 2023. https://ieeexplore.ieee.org/document/10096020
 Parameters:
input_dim – (int) maximum number of frequency bins corresponding to target_fs
num_spk – (int) number of speakers.
num_channels – (int) feature dimension in the BandSplit block.
num_layers – (int) number of processing layers.
target_fs – (int) max sampling frequency that the model can handle.
causal (bool) – whether or not to apply causal modeling. If True, LSTM will be used instead of BLSTM for time modeling.
ref_channel – (int) reference channel. not used for now.

forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ BSRNN Forward.
 Parameters:
input (torch.Tensor or ComplexTensor) – STFT spectrum [B, T, (C,) F (,2)]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model. unused in this model.
 Returns:
masked (List[Union(torch.Tensor, ComplexTensor)]): [(B, T, F), …]
ilens (torch.Tensor): (B,)
others: predicted data, e.g. masks: OrderedDict[
'mask_spk1': torch.Tensor(Batch, Frames, Freq),
'mask_spk2': torch.Tensor(Batch, Frames, Freq),
…
'mask_spkn': torch.Tensor(Batch, Frames, Freq),
]

property
num_spk
¶
espnet2.enh.separator.dprnn_separator¶

class
espnet2.enh.separator.dprnn_separator.
DPRNNSeparator
(input_dim: int, rnn_type: str = 'lstm', bidirectional: bool = True, num_spk: int = 2, predict_noise: bool = False, nonlinear: str = 'relu', layer: int = 3, unit: int = 512, segment_size: int = 20, dropout: float = 0.0)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Dual-Path RNN (DPRNN) Separator
 Parameters:
input_dim – input feature dimension
rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.
bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.
num_spk – number of speakers
predict_noise – whether to output the estimated noise signal
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
layer – int, number of stacked RNN layers. Default is 3.
unit – int, dimension of the hidden state.
segment_size – dual-path segment size
dropout – float, dropout ratio. Default is 0.
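The segment_size parameter sets how long each chunk is when the encoded sequence is cut up for intra-chunk and inter-chunk processing. The following is a minimal sketch of such a segmentation in plain Python, assuming 50% overlap between neighbouring chunks and zero padding at both ends. The helper name split_into_segments and its padding scheme are hypothetical and are not taken from the ESPnet implementation.

```python
def split_into_segments(frames, segment_size, pad_value=0.0):
    """Cut a 1-D list of frames into chunks of length segment_size with 50% overlap."""
    if segment_size < 2 or segment_size % 2 != 0:
        raise ValueError("segment_size must be an even integer >= 2")
    hop = segment_size // 2
    # Pad half a segment on both sides so that every frame is covered twice.
    padded = [pad_value] * hop + list(frames) + [pad_value] * hop
    # Pad the tail so the last chunk is complete.
    rest = (len(padded) - segment_size) % hop
    if rest:
        padded += [pad_value] * (hop - rest)
    return [
        padded[start:start + segment_size]
        for start in range(0, len(padded) - segment_size + 1, hop)
    ]
```

With six frames and segment_size=4 this produces four chunks, and each input frame appears in exactly two of them.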

forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
 Returns:
masked (List[Union(torch.Tensor, ComplexTensor)]): [(B, T, N), …]
ilens (torch.Tensor): (B,)
others: predicted data, e.g. masks: OrderedDict[
'mask_spk1': torch.Tensor(Batch, Frames, Freq),
'mask_spk2': torch.Tensor(Batch, Frames, Freq),
…
'mask_spkn': torch.Tensor(Batch, Frames, Freq),
]

property
num_spk
¶
espnet2.enh.separator.skim_separator¶

class
espnet2.enh.separator.skim_separator.
SkiMSeparator
(input_dim: int, causal: bool = True, num_spk: int = 2, predict_noise: bool = False, nonlinear: str = 'relu', layer: int = 3, unit: int = 512, segment_size: int = 20, dropout: float = 0.0, mem_type: str = 'hc', seg_overlap: bool = False)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Skipping Memory (SkiM) Separator
 Parameters:
input_dim – input feature dimension
causal – bool, whether the system is causal.
num_spk – number of target speakers.
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
layer – int, number of SkiM blocks. Default is 3.
unit – int, dimension of the hidden state.
segment_size – segmentation size for splitting long features
dropout – float, dropout ratio. Default is 0.
mem_type – ‘hc’, ‘h’, ‘c’, ‘id’ or None. It controls whether the hidden (or cell) state of SegLSTM will be processed by MemLSTM. In ‘id’ mode, both the hidden and cell states will be identically returned. When mem_type is None, the MemLSTM will be removed.
seg_overlap – Bool, whether the segmentation will reserve 50% overlap for adjacent segments. Default is False.

forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
 Parameters:
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
 Returns:
masked (List[Union(torch.Tensor, ComplexTensor)]): [(B, T, N), …]
ilens (torch.Tensor): (B,)
others: predicted data, e.g. masks: OrderedDict[
'mask_spk1': torch.Tensor(Batch, Frames, Freq),
'mask_spk2': torch.Tensor(Batch, Frames, Freq),
…
'mask_spkn': torch.Tensor(Batch, Frames, Freq),
]

property
num_spk
¶
espnet2.enh.loss.__init__¶
espnet2.enh.loss.criterions.__init__¶
espnet2.enh.loss.criterions.time_domain¶

class
espnet2.enh.loss.criterions.time_domain.
CISDRLoss
(filter_length=512, name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.time_domain.TimeDomainLoss
CI-SDR loss
 Reference:
Convolutive Transfer Function Invariant SDR Training Criteria for Multi-Channel Reverberant Speech Separation; C. Boeddeker et al., 2021; https://arxiv.org/abs/2011.15003
 Parameters:
ref – (Batch, samples)
inf – (Batch, samples)
filter_length (int) – length of the time-invariant filter that allows slight distortion via filtering
 Returns:
(Batch,)
 Return type:
loss

forward
(ref: torch.Tensor, inf: torch.Tensor) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class
espnet2.enh.loss.criterions.time_domain.
MultiResL1SpecLoss
(window_sz=[512], hop_sz=None, eps=1e-08, time_domain_weight=0.5, normalize_variance=False, reduction='sum', name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.time_domain.TimeDomainLoss
Multi-Resolution L1 time-domain + STFT mag loss
Reference: Lu, Y. J., Cornell, S., Chang, X., Zhang, W., Li, C., Ni, Z., … & Watanabe, S. Towards Low-Distortion Multi-Channel Speech Enhancement: The ESPNET-Se Submission to the L3DAS22 Challenge. ICASSP 2022 p. 9201-9205.

window_sz
¶ (list) list of STFT window sizes.

hop_sz
¶ (list, optional) list of hop_sizes, default is each window_sz // 2.

eps
¶ (float) stability epsilon

time_domain_weight
¶ (float) weight for time domain loss.

normalize_variance
¶ whether or not to normalize the variance when calculating the loss.
 Type:
bool

reduction
¶ select from “sum” and “mean”
 Type:
str

forward
(target: torch.Tensor, estimate: torch.Tensor)[source]¶ forward.
 Parameters:
target – (Batch, T)
estimate – (Batch, T)
 Returns:
(Batch,)
 Return type:
loss

property
name
¶


class
espnet2.enh.loss.criterions.time_domain.
SDRLoss
(filter_length=512, use_cg_iter=None, clamp_db=None, zero_mean=True, load_diag=None, name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.time_domain.TimeDomainLoss
SDR loss.
 filter_length: int
The length of the distortion filter allowed (default: 512)
 use_cg_iter:
If provided, an iterative method is used to solve for the distortion filter coefficients instead of direct Gaussian elimination. This can speed up the computation of the metrics in case the filters are long. Using a value of 10 here has been shown to provide good accuracy in most cases and is sufficient when using this loss to train neural separation networks.
 clamp_db: float
clamp the output value in [-clamp_db, clamp_db]
 zero_mean: bool
When set to True, the mean of all signals is subtracted prior.
 load_diag:
If provided, this small value is added to the diagonal coefficients of the system matrices when solving for the filter coefficients. This can help stabilize the metric in the case where some of the reference signals may sometimes be zero.

class
espnet2.enh.loss.criterions.time_domain.
SISNRLoss
(clamp_db=None, zero_mean=True, eps=None, name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.time_domain.TimeDomainLoss
SI-SNR (or named SI-SDR) loss
A more stable SI-SNR loss with clamp from fast_bss_eval.

clamp_db
¶ float clamp the output value in [-clamp_db, clamp_db]

zero_mean
¶ bool When set to True, the mean of all signals is subtracted prior.

eps
¶ float Deprecated. Kept for compatibility.
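To make the roles of zero_mean and clamp_db concrete, here is a minimal sketch of the textbook scale-invariant SNR on plain Python lists. It is not the fast_bss_eval routine that this class delegates to. The function name si_snr_loss and the eps stabilizer used here are assumptions made for illustration.

```python
import math


def si_snr_loss(ref, inf, zero_mean=True, clamp_db=None, eps=1e-8):
    """Negative SI-SNR (in dB) between a reference and an estimate of equal length."""
    if zero_mean:
        mean_ref = sum(ref) / len(ref)
        mean_inf = sum(inf) / len(inf)
        ref = [r - mean_ref for r in ref]
        inf = [x - mean_inf for x in inf]
    # Project the estimate onto the reference to obtain the scaled target.
    dot = sum(r * x for r, x in zip(ref, inf))
    ref_energy = sum(r * r for r in ref) + eps
    target = [dot / ref_energy * r for r in ref]
    noise = [x - t for x, t in zip(inf, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    si_snr = 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))
    if clamp_db is not None:
        # Keep the value inside [-clamp_db, clamp_db].
        si_snr = max(-clamp_db, min(clamp_db, si_snr))
    return -si_snr
```

Rescaling the estimate leaves the value essentially unchanged, which is the scale invariance the name refers to. Clamping bounds the loss when the estimate is nearly perfect.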


class
espnet2.enh.loss.criterions.time_domain.
SNRLoss
(eps=1.1920928955078125e-07, name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.time_domain.TimeDomainLoss

forward
(ref: torch.Tensor, inf: torch.Tensor) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
espnet2.enh.loss.criterions.time_domain.
TimeDomainL1
(name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.time_domain.TimeDomainLoss

class
espnet2.enh.loss.criterions.time_domain.
TimeDomainLoss
(name, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss
,abc.ABC
Base class for all time-domain Enhancement loss modules.

property
is_dereverb_loss
¶

property
is_noise_loss
¶

property
name
¶

property
only_for_test
¶

property

class
espnet2.enh.loss.criterions.time_domain.
TimeDomainMSE
(name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.time_domain.TimeDomainLoss
espnet2.enh.loss.criterions.abs_loss¶

class
espnet2.enh.loss.criterions.abs_loss.
AbsEnhLoss
(*args, **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Base class for all Enhancement loss modules.
Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract
forward
(ref, inf) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

property
name
¶

property
only_for_test
¶

abstract
espnet2.enh.loss.criterions.tf_domain¶

class
espnet2.enh.loss.criterions.tf_domain.
FrequencyDomainAbsCoherence
(compute_on_mask=False, mask_type=None, name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss

property
compute_on_mask
¶

forward
(ref, inf) → torch.Tensor[source]¶ time-frequency absolute coherence loss.
 Reference:
Independent Vector Analysis with Deep Neural Network Source Priors; Li et al 2020; https://arxiv.org/abs/2008.11273
 Parameters:
ref – (Batch, T, F) or (Batch, T, C, F)
inf – (Batch, T, F) or (Batch, T, C, F)
 Returns:
(Batch,)
 Return type:
loss

property
mask_type
¶

property

class
espnet2.enh.loss.criterions.tf_domain.
FrequencyDomainCrossEntropy
(compute_on_mask=False, mask_type=None, ignore_id=-100, name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss

property
compute_on_mask
¶

forward
(ref, inf) → torch.Tensor[source]¶ time-frequency cross-entropy loss.
 Parameters:
ref – (Batch, T) or (Batch, T, C)
inf – (Batch, T, nclass) or (Batch, T, C, nclass)
 Returns:
(Batch,)
 Return type:
loss

property
mask_type
¶

property

class
espnet2.enh.loss.criterions.tf_domain.
FrequencyDomainDPCL
(compute_on_mask=False, mask_type='IBM', loss_type='dpcl', name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss

property
compute_on_mask
¶

forward
(ref, inf) → torch.Tensor[source]¶ time-frequency Deep Clustering loss.
References
 [1] Deep clustering: Discriminative embeddings for segmentation and separation; John R. Hershey. et al., 2016; https://ieeexplore.ieee.org/document/7471631
 [2] Manifold-Aware Deep Clustering: Maximizing Angles Between Embedding Vectors Based on Regular Simplex; Tanaka, K. et al., 2021; https://www.isca-speech.org/archive/interspeech_2021/tanaka21_interspeech.html
 Parameters:
ref – List[(Batch, T, F) * spks]
inf – (Batch, T*F, D)
 Returns:
(Batch,)
 Return type:
loss
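As a rough illustration of the deep clustering objective from reference [1], the sketch below evaluates ||V V^T - Y Y^T||_F^2 for a single utterance through the equivalent low-rank form ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2. Here V is the (T*F, D) embedding matrix and Y the (T*F, n_spk) one-hot assignment built from the references. The helper names are hypothetical, and the manifold-aware variant of reference [2] is not covered.

```python
def gram(a, b):
    """Return a^T b for row-major matrices a (N x D) and b (N x C)."""
    return [
        [sum(a[n][i] * b[n][j] for n in range(len(a))) for j in range(len(b[0]))]
        for i in range(len(a[0]))
    ]


def frobenius_sq(m):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(x * x for row in m for x in row)


def dpcl_loss(V, Y):
    """Deep clustering loss ||V V^T - Y Y^T||_F^2 without forming N x N matrices."""
    return (
        frobenius_sq(gram(V, V))
        - 2.0 * frobenius_sq(gram(V, Y))
        + frobenius_sq(gram(Y, Y))
    )
```

The low-rank form only needs D x D, D x C and C x C products, which matters because T*F is usually far larger than the embedding dimension.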

property
mask_type
¶

property

class
espnet2.enh.loss.criterions.tf_domain.
FrequencyDomainL1
(compute_on_mask=False, mask_type='IBM', name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss

property
compute_on_mask
¶

forward
(ref, inf) → torch.Tensor[source]¶ time-frequency L1 loss.
 Parameters:
ref – (Batch, T, F) or (Batch, T, C, F)
inf – (Batch, T, F) or (Batch, T, C, F)
 Returns:
(Batch,)
 Return type:
loss

property
mask_type
¶

property

class
espnet2.enh.loss.criterions.tf_domain.
FrequencyDomainLoss
(name, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss
,abc.ABC
Base class for all frequency-domain Enhancement loss modules.

abstract property
compute_on_mask
¶

property
is_dereverb_loss
¶

property
is_noise_loss
¶

abstract property
mask_type
¶

property
name
¶

property
only_for_test
¶

abstract property

class
espnet2.enh.loss.criterions.tf_domain.
FrequencyDomainMSE
(compute_on_mask=False, mask_type='IBM', name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss

property
compute_on_mask
¶

forward
(ref, inf) → torch.Tensor[source]¶ time-frequency MSE loss.
 Parameters:
ref – (Batch, T, F) or (Batch, T, C, F)
inf – (Batch, T, F) or (Batch, T, C, F)
 Returns:
(Batch,)
 Return type:
loss

property
mask_type
¶

property
espnet2.enh.loss.wrappers.abs_wrapper¶

class
espnet2.enh.loss.wrappers.abs_wrapper.
AbsLossWrapper
(*args, **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Base class for all Enhancement loss wrapper modules.
Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract
forward
(ref: List, inf: List, others: Dict) → Tuple[torch.Tensor, Dict, Dict][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

weight
= 1.0¶

abstract
espnet2.enh.loss.wrappers.__init__¶
espnet2.enh.loss.wrappers.pit_solver¶

class
espnet2.enh.loss.wrappers.pit_solver.
PITSolver
(criterion: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, weight=1.0, independent_perm=True, flexible_numspk=False)[source]¶ Bases:
espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper
Permutation Invariant Training Solver.
 Parameters:
criterion (AbsEnhLoss) – an instance of AbsEnhLoss
weight (float) – weight (between 0 and 1) of current loss for multi-task learning.
independent_perm (bool) – If True, PIT will be performed in forward to find the best permutation; If False, the permutation from the last LossWrapper output will be inherited. NOTE (wangyou): You should be careful about the ordering of loss wrappers defined in the yaml config, if this argument is False.
flexible_numspk (bool) – If True, num_spk will be taken from inf to handle flexible numbers of speakers. This is because ref may include dummy data in this case.

forward
(ref, inf, others={})[source]¶ PITSolver forward.
 Parameters:
ref (List[torch.Tensor]) – [(batch, …), …] x n_spk
inf (List[torch.Tensor]) – [(batch, …), …]
 Returns:
loss (torch.Tensor): minimum loss with the best permutation
stats (dict): for collecting training status
others (dict): in this PIT solver, permutation order will be returned
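The permutation search behind permutation invariant training can be sketched as follows for a single utterance: every ordering of the estimates is scored against the references and the cheapest one is kept. This is an illustration only. The names pit_loss and mse are hypothetical, and the batched, tensor-based solver in ESPnet is organised differently.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(ref, inf, criterion=mse):
    """Return (minimum mean loss, best permutation) over all speaker orderings."""
    if len(ref) != len(inf):
        raise ValueError("ref and inf must contain the same number of sources")
    best_loss, best_perm = None, None
    for perm in permutations(range(len(inf))):
        # perm[s] is the index of the estimate assigned to reference s.
        loss = sum(criterion(ref[s], inf[p]) for s, p in enumerate(perm)) / len(ref)
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

The search visits n_spk! orderings, which is fine for two or three speakers but grows quickly beyond that.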
espnet2.enh.loss.wrappers.fixed_order¶

class
espnet2.enh.loss.wrappers.fixed_order.
FixedOrderSolver
(criterion: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, weight=1.0)[source]¶ Bases:
espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper

forward
(ref, inf, others={})[source]¶ A naive fixed-order solver
 Parameters:
ref (List[torch.Tensor]) – [(batch, …), …] x n_spk
inf (List[torch.Tensor]) – [(batch, …), …]
 Returns:
loss (torch.Tensor): minimum loss with the best permutation
stats (dict): for collecting training status
others: reserved

espnet2.enh.loss.wrappers.dpcl_solver¶

class
espnet2.enh.loss.wrappers.dpcl_solver.
DPCLSolver
(criterion: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, weight=1.0)[source]¶ Bases:
espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper

forward
(ref, inf, others={})[source]¶ A naive DPCL solver
 Parameters:
ref (List[torch.Tensor]) – [(batch, …), …] x n_spk
inf (List[torch.Tensor]) – [(batch, …), …]
others (List) – other data included in this solver e.g. “tf_embedding” learned embedding of all TF bins (B, T * F, D)
 Returns:
loss (torch.Tensor): minimum loss with the best permutation
stats (dict): for collecting training status
others: reserved

espnet2.enh.loss.wrappers.mixit_solver¶

class
espnet2.enh.loss.wrappers.mixit_solver.
MixITSolver
(criterion: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, weight: float = 1.0)[source]¶ Bases:
espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper
Mixture Invariant Training Solver.
 Parameters:
criterion (AbsEnhLoss) – an instance of AbsEnhLoss
weight (float) – weight (between 0 and 1) of current loss for multi-task learning.

forward
(ref: Union[List[torch.Tensor], List[torch_complex.tensor.ComplexTensor]], inf: Union[List[torch.Tensor], List[torch_complex.tensor.ComplexTensor]], others: Dict = {})[source]¶ MixIT solver.
 Parameters:
ref (List[torch.Tensor]) – [(batch, …), …] x n_spk
inf (List[torch.Tensor]) – [(batch, …), …] x n_est
 Returns:
loss (torch.Tensor): minimum loss with the best permutation
stats (dict): for collecting training status
others (dict): in this PIT solver, permutation order will be returned
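Mixture invariant training differs from PIT in that the references are mixtures and there are usually more estimates than references. Each estimate is assigned to exactly one mixture, the assigned estimates are summed, and the assignment with the lowest loss is kept. The sketch below shows that search by brute force for a single utterance. The names mixit_loss and mse are hypothetical, and the ESPnet solver may restrict or vectorise the search differently.

```python
from itertools import product


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mixit_loss(ref_mixtures, est_sources, criterion=mse):
    """Return (minimum mean loss, best assignment of estimates to mixtures)."""
    n_mix = len(ref_mixtures)
    n_samples = len(ref_mixtures[0])
    best_loss, best_assign = None, None
    # assign[k] is the index of the mixture that estimate k is added to.
    for assign in product(range(n_mix), repeat=len(est_sources)):
        remixed = [[0.0] * n_samples for _ in range(n_mix)]
        for source, m in zip(est_sources, assign):
            for t, value in enumerate(source):
                remixed[m][t] += value
        loss = sum(criterion(r, x) for r, x in zip(ref_mixtures, remixed)) / n_mix
        if best_loss is None or loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign
```

For n_ref mixtures and n_est estimates the loop evaluates n_ref ** n_est assignments.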

property
name
¶
espnet2.enh.loss.wrappers.multilayer_pit_solver¶

class
espnet2.enh.loss.wrappers.multilayer_pit_solver.
MultiLayerPITSolver
(criterion: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, weight=1.0, independent_perm=True, layer_weights=None)[source]¶ Bases:
espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper
Multi-Layer Permutation Invariant Training Solver.
Compute the PIT loss given inferences of multiple layers and a single reference. It also supports single inference and single reference in evaluation stage.
 Parameters:
criterion (AbsEnhLoss) – an instance of AbsEnhLoss
weight (float) – weight (between 0 and 1) of current loss for multi-task learning.
independent_perm (bool) – If True, PIT will be performed in forward to find the best permutation; If False, the permutation from the last LossWrapper output will be inherited. Note: You should be careful about the ordering of loss wrappers defined in the yaml config, if this argument is False.
layer_weights (Optional[List[float]]) – weights for each layer. If not None, the loss of each layer will be weighted-summed using the specified weights.

forward
(ref, infs, others={})[source]¶ Permutation invariant training solver.
 Parameters:
ref (List[torch.Tensor]) – [(batch, …), …] x n_spk
infs (Union[List[torch.Tensor], List[List[torch.Tensor]]]) – [(batch, …), …]
 Returns:
loss (torch.Tensor): minimum loss with the best permutation
stats (dict): for collecting training status
others (dict): in this PIT solver, permutation order will be returned
espnet2.enh.decoder.stft_decoder¶

class
espnet2.enh.decoder.stft_decoder.
STFTDecoder
(n_fft: int = 512, win_length: int = None, hop_length: int = 128, window='hann', center: bool = True, normalized: bool = False, onesided: bool = True, default_fs: int = 16000, spec_transform_type: str = None, spec_factor: float = 0.15, spec_abs_exponent: float = 0.5)[source]¶ Bases:
espnet2.enh.decoder.abs_decoder.AbsDecoder
STFT decoder for speech enhancement and separation

forward
(input: torch_complex.tensor.ComplexTensor, ilens: torch.Tensor, fs: int = None)[source]¶ Forward.
 Parameters:
input (ComplexTensor) – spectrum [Batch, T, (C,) F]
ilens (torch.Tensor) – input lengths [Batch]
fs (int) – sampling rate in Hz If not None, reconfigure iSTFT window and hop lengths for a new sampling rate while keeping their duration fixed.
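Keeping the window and hop durations fixed under a new sampling rate amounts to scaling the sample counts by fs / default_fs. The sketch below shows that arithmetic. The helper name rescale_stft_params and the use of round() are assumptions, and the decoder itself may round differently.

```python
def rescale_stft_params(n_fft, win_length, hop_length, default_fs, fs=None):
    """Scale STFT sizes (in samples) so their duration in seconds stays the same."""
    if fs is None or fs == default_fs:
        return n_fft, win_length, hop_length
    ratio = fs / default_fs
    return (
        int(round(n_fft * ratio)),
        int(round(win_length * ratio)),
        int(round(hop_length * ratio)),
    )
```

For example, a 512-sample window with a 128-sample hop at 16 kHz (32 ms and 8 ms) becomes 1536 and 384 samples at 48 kHz.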

forward_streaming
(input_frame: torch.Tensor)[source]¶ Forward.
 Parameters:
input (ComplexTensor) – spectrum [Batch, 1, F]
output – wavs [Batch, 1, self.win_length]

streaming_merge
(chunks, ilens=None)[source]¶ streaming_merge. It merges the frame-level processed audio chunks in the streaming simulation. It is noted that, in real applications, the processed audio should be sent to the output channel frame by frame. You may refer to this function to manage your streaming output buffer.
 Parameters:
chunks – List [(B, frame_size),]
ilens – [B]
 Returns:
[B, T]
 Return type:
merge_audio
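Merging frame-level chunks into one waveform is, at its core, an overlap-add over a shared output buffer. The sketch below shows this for a single utterance without a batch dimension. The name overlap_add is hypothetical, and any window compensation or length trimming that the decoder performs is deliberately left out.

```python
def overlap_add(chunks, hop_size):
    """Overlap-add equally sized chunks, each shifted by hop_size samples."""
    if not chunks:
        return []
    frame_size = len(chunks[0])
    total = hop_size * (len(chunks) - 1) + frame_size
    out = [0.0] * total
    for i, chunk in enumerate(chunks):
        start = i * hop_size
        for k, value in enumerate(chunk):
            # Samples shared by neighbouring chunks accumulate here.
            out[start + k] += value
    return out
```

In a real streaming setup only the first hop_size samples of the buffer are final after each new chunk, so those can be emitted immediately and the rest kept as state.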

espnet2.enh.decoder.__init__¶
espnet2.enh.decoder.null_decoder¶

class
espnet2.enh.decoder.null_decoder.
NullDecoder
[source]¶ Bases:
espnet2.enh.decoder.abs_decoder.AbsDecoder
Null decoder, return the same args.
espnet2.enh.decoder.abs_decoder¶

class
espnet2.enh.decoder.abs_decoder.
AbsDecoder
(*args, **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract
forward
(input: torch.Tensor, ilens: torch.Tensor, fs: int = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

streaming_merge
(chunks: torch.Tensor, ilens: torch._VariableFunctionsClass.tensor = None)[source]¶ Stream merge.
It merges the frame-level processed audio chunks in the streaming simulation. It is noted that, in real applications, the processed audio should be sent to the output channel frame by frame. You may refer to this function to manage your streaming output buffer.
 Parameters:
chunks – List [(B, frame_size),]
ilens – [B]
 Returns:
[B, T]
 Return type:
merge_audio

abstract
espnet2.enh.decoder.conv_decoder¶

class
espnet2.enh.decoder.conv_decoder.
ConvDecoder
(channel: int, kernel_size: int, stride: int)[source]¶ Bases:
espnet2.enh.decoder.abs_decoder.AbsDecoder
Transposed Convolutional decoder for speech enhancement and separation

forward
(input: torch.Tensor, ilens: torch.Tensor, fs: int = None)[source]¶ Forward.
 Parameters:
input (torch.Tensor) – spectrum [Batch, T, F]
ilens (torch.Tensor) – input lengths [Batch]
fs (int) – sampling rate in Hz (Not used)

streaming_merge
(chunks: torch.Tensor, ilens: torch._VariableFunctionsClass.tensor = None)[source]¶ Stream Merge.
It merges the frame-level processed audio chunks in the streaming simulation. It is noted that, in real applications, the processed audio should be sent to the output channel frame by frame. You may refer to this function to manage your streaming output buffer.
 Parameters:
chunks – List [(B, frame_size),]
ilens – [B]
 Returns:
[B, T]
 Return type:
merge_audio

espnet2.enh.diffusion.abs_diffusion¶

class
espnet2.enh.diffusion.abs_diffusion.
AbsDiffusion
(*args, **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract
forward
(input: torch.Tensor, ilens: torch.Tensor)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract
espnet2.enh.diffusion.__init__¶
espnet2.enh.diffusion.sdes¶
Abstract SDE classes, Reverse SDE, and VE/VP SDEs.
Taken and adapted from https://github.com/yang-song/score_sde_pytorch and https://github.com/sp-uhh/sgmse

class
espnet2.enh.diffusion.sdes.
OUVESDE
(theta=1.5, sigma_min=0.05, sigma_max=0.5, N=1000, **ignored_kwargs)[source]¶ Bases:
espnet2.enh.diffusion.sdes.SDE
Construct an Ornstein-Uhlenbeck Variance Exploding SDE.
Note that the “steady-state mean” y is not provided at construction, but must rather be given as an argument to the methods which require it (e.g., sde or marginal_prob).
dx = theta (y - x) dt + sigma(t) dw
with
sigma(t) = sigma_min (sigma_max/sigma_min)^t * sqrt(2 log(sigma_max/sigma_min))
 Parameters:
theta – stiffness parameter.
sigma_min – smallest sigma.
sigma_max – largest sigma.
N – number of discretization steps
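The noise schedule of this SDE can be evaluated directly from the constructor arguments. The sketch below uses scalar Python floats and assumes the drift pulls x toward the steady-state mean y, i.e. theta * (y - x). The function names ouve_sigma and ouve_drift are hypothetical.

```python
import math


def ouve_sigma(t, sigma_min=0.05, sigma_max=0.5):
    """Diffusion coefficient sigma(t) of the OU variance exploding SDE."""
    log_ratio = math.log(sigma_max / sigma_min)
    return sigma_min * (sigma_max / sigma_min) ** t * math.sqrt(2.0 * log_ratio)


def ouve_drift(x, y, theta=1.5):
    """Drift term; assumed mean-reverting toward the steady-state mean y."""
    return theta * (y - x)
```

Between t = 0 and t = 1 the diffusion coefficient grows by exactly the factor sigma_max / sigma_min, and the drift vanishes once x has reached y.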

property
T
¶ End time of the SDE.

marginal_prob
(x0, t, y)[source]¶ Parameters to determine the marginal distribution of the SDE, $p_t(x|args)$.

prior_logp
(z)[source]¶ Compute log-density of the prior distribution.
Useful for computing the log-likelihood via probability flow ODE.
 Parameters:
z – latent code
 Returns:
log probability density

class
espnet2.enh.diffusion.sdes.
OUVPSDE
(beta_min, beta_max, stiffness=1, N=1000, **ignored_kwargs)[source]¶ Bases:
espnet2.enh.diffusion.sdes.SDE
OUVPSDE class.
!!! SGMSE authors observed instabilities around t=0.2. !!!
Construct an Ornstein-Uhlenbeck Variance Preserving SDE:
dx = 1/2 * beta(t) * stiffness * (y - x) dt + sqrt(beta(t)) * dw
with
beta(t) = beta_min + t (beta_max - beta_min)
Note that the “steady-state mean” y is not provided at construction, but must rather be given as an argument to the methods which require it (e.g., sde or marginal_prob).
 Parameters:
beta_min – smallest beta.
beta_max – largest beta.
stiffness – stiffness factor of the drift. 1 by default.
N – number of discretization steps
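A small sketch of the linear beta schedule and the resulting SDE coefficients on scalar floats. The drift sign is again assumed to be mean-reverting toward y, and the function names ouvp_beta and ouvp_coefficients are hypothetical.

```python
import math


def ouvp_beta(t, beta_min, beta_max):
    """Linear schedule beta(t) = beta_min + t * (beta_max - beta_min)."""
    return beta_min + t * (beta_max - beta_min)


def ouvp_coefficients(x, y, t, beta_min, beta_max, stiffness=1.0):
    """Return (drift, diffusion) of the OU variance preserving SDE at time t."""
    beta_t = ouvp_beta(t, beta_min, beta_max)
    drift = 0.5 * beta_t * stiffness * (y - x)  # assumed to pull x toward y
    diffusion = math.sqrt(beta_t)
    return drift, diffusion
```

Unlike the variance exploding case, both the drift and the diffusion here are driven by the same schedule beta(t).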

property
T
¶ End time of the SDE.

marginal_prob
(x0, t, y)[source]¶ Parameters to determine the marginal distribution of the SDE, $p_t(x|args)$.

prior_logp
(z)[source]¶ Compute log-density of the prior distribution.
Useful for computing the log-likelihood via probability flow ODE.
 Parameters:
z – latent code
 Returns:
log probability density

class
espnet2.enh.diffusion.sdes.
SDE
(N)[source]¶ Bases:
abc.ABC
SDE abstract class. Functions are designed for a minibatch of inputs.
Construct an SDE.
 Parameters:
N – number of discretization time steps.

abstract property
T
¶ End time of the SDE.

discretize
(x, t, *args)[source]¶ Discretize the SDE in the form: x_{i+1} = x_i + f_i(x_i) + G_i z_i.
Useful for reverse diffusion sampling and probability flow sampling. Defaults to Euler-Maruyama discretization.
 Parameters:
x – a torch tensor
t – a torch float representing the time step (from 0 to self.T)
 Returns:
f, G
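An Euler-Maruyama discretization turns the continuous drift and diffusion into the per-step quantities f and G from the formula above. The sketch below assumes a step size of dt = 1 / N, which corresponds to an end time T of 1, and works on scalars. Both function names are hypothetical.

```python
import math


def euler_maruyama_discretize(drift, diffusion, N):
    """Return (f, G) such that x_next = x + f + G * z for z ~ N(0, 1)."""
    dt = 1.0 / N  # assumes the SDE runs over the unit interval
    f = drift * dt
    G = diffusion * math.sqrt(dt)
    return f, G


def euler_maruyama_step(x, drift, diffusion, N, z):
    """Advance x by one discretized step using the noise sample z."""
    f, G = euler_maruyama_discretize(drift, diffusion, N)
    return x + f + G * z
```

The deterministic part scales with dt while the noise scales with sqrt(dt), which is why halving the step size does not halve the injected noise.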

abstract
marginal_prob
(x, t, *args)[source]¶ Parameters to determine the marginal distribution of the SDE, $p_t(x|args)$.

abstract
prior_logp
(z)[source]¶ Compute log-density of the prior distribution.
Useful for computing the log-likelihood via probability flow ODE.
 Parameters:
z – latent code
 Returns:
log probability density

abstract
prior_sampling
(shape, *args)[source]¶ Generate one sample from the prior distribution, $p_T(x|args)$ with shape shape.
espnet2.enh.diffusion.score_based_diffusion¶

class
espnet2.enh.diffusion.score_based_diffusion.
ScoreModel
(**kwargs)[source]¶ Bases:
espnet2.enh.diffusion.abs_diffusion.AbsDiffusion

enhance
(noisy_specturm, sampler_type='pc', predictor='reverse_diffusion', corrector='ald', N=30, corrector_steps=1, snr=0.5, **kwargs)[source]¶ Enhance function.
 Parameters:
noisy_specturm (torch.Tensor) – noisy feature in [Batch, T, F]
sampler_type (str) – sampler, ‘pc’ for Predictor-Corrector and ‘ode’ for ODE sampler.
predictor (str) – the name of Predictor. ‘reverse_diffusion’, ‘euler_maruyama’, or ‘none’
corrector (str) – the name of Corrector. ‘langevin’, ‘ald’ or ‘none’
N (int) – The number of reverse sampling steps.
corrector_steps (int) – number of steps in the Corrector.
snr (float) – The SNR to use for the corrector.
 Returns:
enhanced feature in [Batch, T, F]
 Return type:
X_Hat (torch.Tensor)
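How N, the predictor and the corrector interact in a Predictor-Corrector sampler can be sketched with a generic loop. Time runs backwards from T to a small eps in N steps. At each step the corrector refines the current state and the predictor then moves it to the next time step. The function pc_sample, the order of the two calls, and the default eps are assumptions made for illustration. The predictor and corrector arguments are plain callables that return (x, x_mean).

```python
def pc_sample(x_init, y, predictor, corrector, N=30, T=1.0, eps=0.03):
    """Run N reverse steps and return the final noise-free state x_mean."""
    x = x_init
    x_mean = x_init
    for i in range(N):
        # Time steps are spaced linearly from T down to eps.
        if N > 1:
            t = T - (T - eps) * i / (N - 1)
        else:
            t = T
        x, x_mean = corrector(x, t, y)
        x, x_mean = predictor(x, t, y)
    return x_mean
```

Returning x_mean rather than x drops the noise added in the last step, which is the usual reason update functions return both values.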

forward
(feature_ref, feature_mix)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.enh.diffusion.sampling.correctors¶

class
espnet2.enh.diffusion.sampling.correctors.
AnnealedLangevinDynamics
(sde, score_fn, snr, n_steps)[source]¶ Bases:
espnet2.enh.diffusion.sampling.correctors.Corrector
The original annealed Langevin dynamics predictor in NCSN/NCSNv2.

update_fn
(x, t, *args)[source]¶ One update of the corrector.
 Parameters:
x – A PyTorch tensor representing the current state
t – A PyTorch tensor representing the current time step.
*args – Possibly additional arguments, in particular y for OU processes
 Returns:
x: A PyTorch tensor of the next state.
x_mean: A PyTorch tensor. The next state without random noise. Useful for denoising.
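A Langevin-type corrector repeatedly nudges the state along the score and adds fresh noise. The scalar sketch below follows the commonly used annealed Langevin update in which the step size is tied to the marginal standard deviation std and the target snr. The exact step-size rule, the function name and the noise_fn hook are assumptions, not a transcription of this class.

```python
import math


def annealed_langevin_update(x, t, score_fn, std, snr, n_steps, noise_fn):
    """Apply n_steps Langevin corrections and return (x, x_mean)."""
    x_mean = x
    step_size = 2.0 * (snr * std) ** 2
    for _ in range(n_steps):
        grad = score_fn(x, t)
        # Noise-free move along the score, then add scaled Gaussian noise.
        x_mean = x + step_size * grad
        x = x_mean + math.sqrt(2.0 * step_size) * noise_fn()
    return x, x_mean
```

Passing noise_fn=lambda: random.gauss(0.0, 1.0) gives the stochastic version, while a function that returns 0.0 makes the update deterministic for testing.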


class
espnet2.enh.diffusion.sampling.correctors.
Corrector
(sde, score_fn, snr, n_steps)[source]¶ Bases:
abc.ABC
The abstract class for a corrector algorithm.

abstract
update_fn
(x, t, *args)[source]¶ One update of the corrector.
 Parameters:
x – A PyTorch tensor representing the current state
t – A PyTorch tensor representing the current time step.
*args – Possibly additional arguments, in particular y for OU processes
 Returns:
x: A PyTorch tensor of the next state.
x_mean: A PyTorch tensor. The next state without random noise. Useful for denoising.

abstract

class
espnet2.enh.diffusion.sampling.correctors.
LangevinCorrector
(sde, score_fn, snr, n_steps)[source]¶ Bases:
espnet2.enh.diffusion.sampling.correctors.Corrector

update_fn
(x, t, *args)[source]¶ One update of the corrector.
 Parameters:
x – A PyTorch tensor representing the current state
t – A PyTorch tensor representing the current time step.
*args – Possibly additional arguments, in particular y for OU processes
 Returns:
x: A PyTorch tensor of the next state.
x_mean: A PyTorch tensor. The next state without random noise. Useful for denoising.


class
espnet2.enh.diffusion.sampling.correctors.
NoneCorrector
(*args, **kwargs)[source]¶ Bases:
espnet2.enh.diffusion.sampling.correctors.Corrector
An empty corrector that does nothing.

update_fn
(x, t, *args)[source]¶ One update of the corrector.
 Parameters:
x – A PyTorch tensor representing the current state
t – A PyTorch tensor representing the current time step.
*args – Possibly additional arguments, in particular y for OU processes
 Returns:
x: A PyTorch tensor of the next state.
x_mean: A PyTorch tensor. The next state without random noise. Useful for denoising.

espnet2.enh.diffusion.sampling.__init__¶
Various sampling methods.

class espnet2.enh.diffusion.sampling.__init__.Predictor(sde, score_fn, probability_flow=False)
Bases: abc.ABC
The abstract class for a predictor algorithm.

abstract update_fn(x, t, *args)
One update of the predictor.
 Parameters:
x – A PyTorch tensor representing the current state
t – A PyTorch tensor representing the current time step.
*args – Possibly additional arguments, in particular y for OU processes
 Returns:
x – A PyTorch tensor of the next state.
x_mean – A PyTorch tensor of the next state without random noise. Useful for denoising.


class espnet2.enh.diffusion.sampling.__init__.Corrector(sde, score_fn, snr, n_steps)
Bases: abc.ABC
The abstract class for a corrector algorithm.

abstract update_fn(x, t, *args)
One update of the corrector.
 Parameters:
x – A PyTorch tensor representing the current state
t – A PyTorch tensor representing the current time step.
*args – Possibly additional arguments, in particular y for OU processes
 Returns:
x – A PyTorch tensor of the next state.
x_mean – A PyTorch tensor of the next state without random noise. Useful for denoising.

espnet2.enh.diffusion.sampling.predictors¶

class espnet2.enh.diffusion.sampling.predictors.EulerMaruyamaPredictor(sde, score_fn, probability_flow=False)
Bases: espnet2.enh.diffusion.sampling.predictors.Predictor

update_fn(x, t, *args)
One update of the predictor.
 Parameters:
x – A PyTorch tensor representing the current state
t – A PyTorch tensor representing the current time step.
*args – Possibly additional arguments, in particular y for OU processes
 Returns:
x – A PyTorch tensor of the next state.
x_mean – A PyTorch tensor of the next state without random noise. Useful for denoising.
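The class name refers to the Euler-Maruyama scheme for integrating stochastic differential equations. A minimal, dependency-free sketch of one such step in reverse time is given below; it uses Python lists instead of PyTorch tensors. The function name euler_maruyama_step and its arguments drift_fn, diffusion_fn, n_steps, and rng are hypothetical and only illustrate the general scheme; how the ESPnet class obtains the drift and diffusion from sde and score_fn is not shown here.

```python
import math
import random


def euler_maruyama_step(x, t, drift_fn, diffusion_fn, n_steps, rng):
    """One reverse-time Euler-Maruyama update; returns (x_next, x_mean)."""
    dt = -1.0 / n_steps  # negative: the reverse SDE runs from t=1 towards t=0
    f = drift_fn(x, t)   # reverse-time drift, one value per element of x
    g = diffusion_fn(t)  # scalar diffusion coefficient at time t
    # Deterministic part of the update (the "next state without random noise").
    x_mean = [xi + fi * dt for xi, fi in zip(x, f)]
    # Add Gaussian noise scaled by the diffusion coefficient and sqrt(|dt|).
    x_next = [mi + g * math.sqrt(-dt) * rng.gauss(0.0, 1.0) for mi in x_mean]
    return x_next, x_mean


# Toy coefficients chosen only for illustration.
x_next, x_mean = euler_maruyama_step(
    [1.0, 2.0],
    1.0,
    drift_fn=lambda x, t: list(x),
    diffusion_fn=lambda t: 0.5,
    n_steps=10,
    rng=random.Random(0),
)
```

With a zero diffusion coefficient the two return values coincide, which is one way to see why x_mean is described as the noise-free next state.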


class espnet2.enh.diffusion.sampling.predictors.NonePredictor(*args, **kwargs)
Bases: espnet2.enh.diffusion.sampling.predictors.Predictor
An empty predictor that does nothing.

update_fn(x, t, *args)
One update of the predictor.
 Parameters:
x – A PyTorch tensor representing the current state
t – A PyTorch tensor representing the current time step.
*args – Possibly additional arguments, in particular y for OU processes
 Returns:
x – A PyTorch tensor of the next state.
x_mean – A PyTorch tensor of the next state without random noise. Useful for denoising.
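The update_fn contract documented here (take x and t, return the pair x and x_mean) can be illustrated with a small abstract base class and a no-op subclass. The class names PredictorSketch and NoOpPredictorSketch are hypothetical, and returning the input unchanged for both values is an assumption about what "does nothing" means, not a statement about the ESPnet implementation.

```python
import abc


class PredictorSketch(abc.ABC):
    """Hypothetical base class mirroring the documented update_fn contract."""

    @abc.abstractmethod
    def update_fn(self, x, t, *args):
        """Return (x, x_mean): the next state with and without noise."""


class NoOpPredictorSketch(PredictorSketch):
    """Leaves the state untouched, so x and x_mean coincide."""

    def update_fn(self, x, t, *args):
        return x, x


x, x_mean = NoOpPredictorSketch().update_fn([0.1, 0.2], 0.5)
```

Because update_fn is declared abstract, instantiating PredictorSketch directly raises a TypeError; only subclasses that implement update_fn can be created.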


class espnet2.enh.diffusion.sampling.predictors.Predictor(sde, score_fn, probability_flow=False)
Bases: abc.ABC
The abstract class for a predictor algorithm.

abstract update_fn(x, t, *args)
One update of the predictor.
 Parameters:
x – A PyTorch tensor representing the current state
t – A PyTorch tensor representing the current time step.
*args – Possibly additional arguments, in particular y for OU processes
 Returns:
x – A PyTorch tensor of the next state.
x_mean – A PyTorch tensor of the next state without random noise. Useful for denoising.


class espnet2.enh.diffusion.sampling.predictors.ReverseDiffusionPredictor(sde, score_fn, probability_flow=False)
Bases: espnet2.enh.diffusion.sampling.predictors.Predictor

update_fn(x, t, *args)
One update of the predictor.
 Parameters:
x – A PyTorch tensor representing the current state
t – A PyTorch tensor representing the current time step.
*args – Possibly additional arguments, in particular y for OU processes
 Returns:
x – A PyTorch tensor of the next state.
x_mean – A PyTorch tensor of the next state without random noise. Useful for denoising.
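Predictors and correctors share the same update_fn return convention, so they can be alternated in a sampling loop. The sketch below shows one possible arrangement using plain callables and Python lists. The function pc_sample_sketch, its arguments, the linear time grid, and the corrector-before-predictor order are all assumptions made for illustration; the actual sampler in ESPnet may differ.

```python
def pc_sample_sketch(x, predictor_step, corrector_step, n_steps,
                     t_start=1.0, t_end=0.0):
    """Alternate corrector and predictor updates over a decreasing time grid."""
    x_mean = x
    for i in range(n_steps):
        # Linearly spaced time steps from t_start down to t_end.
        t = t_start + (t_end - t_start) * i / max(n_steps - 1, 1)
        x, _ = corrector_step(x, t)
        x, x_mean = predictor_step(x, t)
    # The noise-free x_mean of the last update serves as the denoised result.
    return x_mean


# Toy steps: the corrector is a no-op and the predictor halves the state.
result = pc_sample_sketch(
    [8.0, 4.0],
    predictor_step=lambda x, t: ([v / 2 for v in x], [v / 2 for v in x]),
    corrector_step=lambda x, t: (x, x),
    n_steps=3,
)
```

Returning x_mean rather than x at the end follows the note that the noise-free state is useful for denoising.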
