espnet2.enh package

espnet2.enh.espnet_model

class espnet2.enh.espnet_model.ESPnetEnhancementModel(enh_model: Optional[espnet2.enh.abs_enh.AbsEnhancement])[source]

Bases: espnet2.train.abs_espnet_model.AbsESPnetModel

Speech enhancement or separation frontend model.

collect_feats(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]
forward(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor = None, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Frontend + Encoder + Decoder + Calc loss

Parameters
  • speech_mix – (Batch, samples) or (Batch, samples, channels)

  • speech_ref – (Batch, num_speaker, samples) or (Batch, num_speaker, samples, channels)

  • speech_mix_lengths – (Batch,), default None for the chunk iterator, because the chunk iterator does not return speech_lengths; see espnet2/iterators/chunk_iter_factory.py

static si_snr_loss(ref, inf)[source]

SI-SNR loss.

Parameters
  • ref – (Batch, samples)

  • inf – (Batch, samples)

Returns

(Batch)

static si_snr_loss_zeromean(ref, inf)[source]

SI-SNR loss with zero-mean pre-processing.

Parameters
  • ref – (Batch, samples)

  • inf – (Batch, samples)

Returns

(Batch)
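
Example: a minimal sketch of the (zero-mean) SI-SNR loss computed by si_snr_loss / si_snr_loss_zeromean above. This is an illustrative re-implementation, not the ESPnet routine itself, which may differ in eps handling and sign convention:

    import torch

    def si_snr_zeromean_sketch(ref, inf, eps=1e-8):
        # Zero-mean pre-processing over the sample axis.
        ref = ref - ref.mean(dim=1, keepdim=True)
        inf = inf - inf.mean(dim=1, keepdim=True)
        # Project the estimate onto the reference (scale-invariant target).
        s_target = (torch.sum(ref * inf, dim=1, keepdim=True)
                    / (torch.sum(ref ** 2, dim=1, keepdim=True) + eps)) * ref
        e_noise = inf - s_target
        si_snr = 10 * torch.log10(
            torch.sum(s_target ** 2, dim=1) / (torch.sum(e_noise ** 2, dim=1) + eps) + eps
        )
        # Negative SI-SNR as a training loss (sign convention may differ from the ESPnet routine).
        return -si_snr

    ref = torch.randn(4, 16000)
    inf = ref + 0.1 * torch.randn(4, 16000)
    print(si_snr_zeromean_sketch(ref, inf).shape)  # torch.Size([4])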

static tf_l1_loss(ref, inf)[source]

Time-frequency L1 loss.

Parameters
  • ref – (Batch, T, F) or (Batch, T, C, F)

  • inf – (Batch, T, F) or (Batch, T, C, F)

Returns

(Batch)

static tf_mse_loss(ref, inf)[source]

Time-frequency MSE loss.

Parameters
  • ref – (Batch, T, F)

  • inf – (Batch, T, F)

Returns

(Batch)
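
Example: an illustrative per-utterance MSE/L1 over time-frequency bins (a sketch only; the ESPnet tf_l1_loss additionally handles the (Batch, T, C, F) layout):

    import torch

    def tf_mse_sketch(ref, inf):
        # Mean squared error over all time-frequency bins, per utterance.
        return ((ref - inf) ** 2).mean(dim=(1, 2))

    def tf_l1_sketch(ref, inf):
        # Mean absolute error over all time-frequency bins, per utterance.
        return (ref - inf).abs().mean(dim=(1, 2))

    ref, inf = torch.randn(4, 100, 257), torch.randn(4, 100, 257)
    print(tf_mse_sketch(ref, inf).shape, tf_l1_sketch(ref, inf).shape)
    # torch.Size([4]) torch.Size([4])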

espnet2.enh.abs_enh

class espnet2.enh.abs_enh.AbsEnhancement[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, collections.OrderedDict][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract forward_rawwav(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, collections.OrderedDict][source]

espnet2.enh.__init__

espnet2.enh.layers.conv_beamformer

This script is used to construct convolutional beamformers. Copyright 2020 Wangyou Zhang

espnet2.enh.layers.conv_beamformer.get_WPD_filter(Phi: torch_complex.tensor.ComplexTensor, Rf: torch_complex.tensor.ComplexTensor, reference_vector: torch.Tensor, eps: float = 1e-15) → torch_complex.tensor.ComplexTensor[source]

Return the WPD vector.

WPD is the Weighted Power minimization Distortionless response convolutional beamformer. It computes the filter as:

h = (Rf^-1 @ Phi_{xx}) / tr[(Rf^-1) @ Phi_{xx}] @ u

Reference:

T. Nakatani and K. Kinoshita, “A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation,” in IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903-907, June 2019, doi: 10.1109/LSP.2019.2911179. https://ieeexplore.ieee.org/document/8691481

Parameters
  • Phi (ComplexTensor) – (B, F, (btaps+1) * C, (btaps+1) * C) is the PSD of zero-padded speech [x^T(t,f) 0 … 0]^T.

  • Rf (ComplexTensor) – (B, F, (btaps+1) * C, (btaps+1) * C) is the power normalized spatio-temporal covariance matrix.

  • reference_vector (torch.Tensor) – (B, (btaps+1) * C) is the reference_vector.

  • eps (float) –

Returns

(B, F, (btaps + 1) * C)

Return type

filter_matrix (ComplexTensor)
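
Example: a sketch of the formula above using native torch complex tensors. The actual implementation operates on torch_complex.ComplexTensor and uses the eps argument for numerical stability; shapes follow the docstring:

    import torch

    def wpd_filter_sketch(Phi, Rf, u, eps=1e-15):
        # h = (Rf^-1 @ Phi) @ u / tr[Rf^-1 @ Phi]
        numerator = torch.linalg.solve(Rf, Phi)                  # (B, F, K, K)
        trace = numerator.diagonal(dim1=-2, dim2=-1).sum(-1)     # (B, F)
        ws = numerator / (trace[..., None, None] + eps)          # (B, F, K, K)
        return torch.einsum('bfkl,bl->bfk', ws, u.to(ws.dtype))  # (B, F, K)

    B, Fdim, C, btaps = 2, 5, 4, 3
    K = (btaps + 1) * C
    A = torch.randn(B, Fdim, K, K, dtype=torch.complex128)
    Rf = A @ A.conj().transpose(-2, -1) + K * torch.eye(K, dtype=torch.complex128)
    Phi = torch.randn(B, Fdim, K, K, dtype=torch.complex128)
    u = torch.zeros(B, K)
    u[:, 0] = 1.0                                                # one-hot reference vector
    print(wpd_filter_sketch(Phi, Rf, u).shape)                   # torch.Size([2, 5, 16])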

espnet2.enh.layers.conv_beamformer.get_WPD_filter_v2(Phi: torch_complex.tensor.ComplexTensor, Rf: torch_complex.tensor.ComplexTensor, reference_vector: torch.Tensor, eps: float = 1e-15) → torch_complex.tensor.ComplexTensor[source]

Return the WPD vector with filter v2.

WPD is the Weighted Power minimization Distortionless response convolutional beamformer. It computes the filter as:

h = (Rf^-1 @ Phi_{xx}) @ u / tr[(Rf^-1) @ Phi_{xx}]

This implementation is more efficient than get_WPD_filter because it skips unnecessary computation with zeros.

Reference:

T. Nakatani and K. Kinoshita, “A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation,” in IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903-907, June 2019, doi: 10.1109/LSP.2019.2911179. https://ieeexplore.ieee.org/document/8691481

Parameters
  • Phi (ComplexTensor) – (B, F, C, C) is speech PSD.

  • Rf (ComplexTensor) – (B, F, (btaps+1) * C, (btaps+1) * C) is the power normalized spatio-temporal covariance matrix.

  • reference_vector (torch.Tensor) – (B, C) is the reference_vector.

  • eps (float) –

Returns

(B, F, (btaps+1) * C)

Return type

filter_matrix (ComplexTensor)

espnet2.enh.layers.conv_beamformer.get_covariances(Y: torch_complex.tensor.ComplexTensor, inverse_power: torch.Tensor, bdelay: int, btaps: int, get_vector: bool = False) → torch_complex.tensor.ComplexTensor[source]

Calculates the power normalized spatio-temporal covariance matrix of the framed signal.

Parameters
  • Y – Complex STFT signal with shape (B, F, C, T)

  • inverse_power – Weighting factor with shape (B, F, T)

Returns

Correlation matrix of shape (B, F, (btaps+1) * C, (btaps+1) * C)
Correlation vector of shape (B, F, btaps + 1, C, C)

espnet2.enh.layers.conv_beamformer.inv(z)[source]
espnet2.enh.layers.conv_beamformer.perform_WPD_filtering(filter_matrix: torch_complex.tensor.ComplexTensor, Y: torch_complex.tensor.ComplexTensor, bdelay: int, btaps: int) → torch_complex.tensor.ComplexTensor[source]

Perform WPD filtering with the given filter matrix.

Parameters
  • filter_matrix – Filter matrix (B, F, (btaps + 1) * C)

  • Y – Complex STFT signal with shape (B, F, C, T)

Returns

(B, F, T)

Return type

enhanced (ComplexTensor)

espnet2.enh.layers.conv_beamformer.signal_framing(signal: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], frame_length: int, frame_step: int, bdelay: int, do_padding: bool = False, pad_value: int = 0, indices: List = None) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

Expand signal into several frames, with each frame of length frame_length.

Parameters
  • signal – (…, T)

  • frame_length – length of each segment

  • frame_step – step for selecting frames

  • bdelay – delay for WPD

  • do_padding – whether or not to pad the input signal at the beginning of the time dimension

  • pad_value – value to fill in the padding

Returns

if do_padding: (…, T, frame_length)
else: (…, T - bdelay - frame_length + 2, frame_length)

Return type

torch.Tensor
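
Example: plain frame expansion with torch.Tensor.unfold, for comparison. This does not reproduce the bdelay offset, padding options, or ComplexTensor handling of the function above:

    import torch

    signal = torch.arange(10.0)         # (..., T) with T = 10
    frames = signal.unfold(-1, 4, 1)    # frame_length = 4, frame_step = 1
    print(frames.shape)                 # torch.Size([7, 4])
    print(frames[0], frames[-1])        # tensor([0., 1., 2., 3.]) tensor([6., 7., 8., 9.])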

espnet2.enh.layers.dnn_beamformer

class espnet2.enh.layers.dnn_beamformer.AttentionReference(bidim, att_dim)[source]

Bases: torch.nn.modules.module.Module

forward(psd_in: torch_complex.tensor.ComplexTensor, ilens: torch.LongTensor, scaling: float = 2.0) → Tuple[torch.Tensor, torch.LongTensor][source]

The forward function

Parameters
  • psd_in (ComplexTensor) – (B, F, C, C)

  • ilens (torch.Tensor) – (B,)

  • scaling (float) –

Returns

(B, C)
ilens (torch.Tensor): (B,)

Return type

u (torch.Tensor)

class espnet2.enh.layers.dnn_beamformer.DNN_Beamformer(bidim, btype: str = 'blstmp', blayers: int = 3, bunits: int = 300, bprojs: int = 320, num_spk: int = 1, use_noise_mask: bool = True, nonlinear: str = 'sigmoid', dropout_rate: float = 0.0, badim: int = 320, ref_channel: int = -1, beamformer_type: str = 'mvdr', eps: float = 1e-06, btaps: int = 5, bdelay: int = 3)[source]

Bases: torch.nn.modules.module.Module

DNN mask-based beamformer.

Citation:

Multichannel End-to-end Speech Recognition; T. Ochiai et al., 2017; https://arxiv.org/abs/1703.04783

forward(data: torch_complex.tensor.ComplexTensor, ilens: torch.LongTensor) → Tuple[torch_complex.tensor.ComplexTensor, torch.LongTensor, torch.Tensor][source]

The forward function

Notation:

B: Batch
C: Channel
T: Time or Sequence length
F: Freq

Parameters
  • data (ComplexTensor) – (B, T, C, F), double precision

  • ilens (torch.Tensor) – (B,)

Returns

(B, T, F), double precision
ilens (torch.Tensor): (B,)
masks (torch.Tensor): (B, T, C, F)

Return type

enhanced (ComplexTensor)
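
Example: a minimal usage sketch following the shapes documented above; the layer sizes are reduced for illustration and are not a recommended configuration:

    import torch
    from torch_complex.tensor import ComplexTensor
    from espnet2.enh.layers.dnn_beamformer import DNN_Beamformer

    B, T, C, F = 2, 100, 4, 257
    # Double-precision complex STFT input, (B, T, C, F).
    data = ComplexTensor(torch.randn(B, T, C, F, dtype=torch.double),
                         torch.randn(B, T, C, F, dtype=torch.double))
    ilens = torch.LongTensor([100, 80])

    beamformer = DNN_Beamformer(bidim=F, blayers=1, bunits=32, bprojs=32, badim=32).double()
    enhanced, ilens_out, masks = beamformer(data, ilens)
    print(enhanced.shape)  # (B, T, F)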

predict_mask(data: torch_complex.tensor.ComplexTensor, ilens: torch.LongTensor) → Tuple[Tuple[torch.Tensor, ...], torch.LongTensor][source]

Predict masks for beamforming

Parameters
  • data (ComplexTensor) – (B, T, C, F), double precision

  • ilens (torch.Tensor) – (B,)

Returns

(B, T, C, F)
ilens (torch.Tensor): (B,)

Return type

masks (torch.Tensor)

espnet2.enh.layers.mask_estimator

class espnet2.enh.layers.mask_estimator.MaskEstimator(type, idim, layers, units, projs, dropout, nmask=1, nonlinear='sigmoid')[source]

Bases: torch.nn.modules.module.Module

forward(xs: torch_complex.tensor.ComplexTensor, ilens: torch.LongTensor) → Tuple[Tuple[torch.Tensor, ...], torch.LongTensor][source]

The forward function

Parameters
  • xs – (B, F, C, T)

  • ilens – (B,)

Returns

The hidden vector (B, F, C, T)
masks: A tuple of the masks, each (B, F, C, T)
ilens: (B,)

Return type

hs (torch.Tensor)
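
Example: a minimal usage sketch following the shapes documented above (small layer sizes chosen only for illustration):

    import torch
    from torch_complex.tensor import ComplexTensor
    from espnet2.enh.layers.mask_estimator import MaskEstimator

    B, F, C, T = 2, 257, 4, 100
    xs = ComplexTensor(torch.randn(B, F, C, T), torch.randn(B, F, C, T))
    ilens = torch.LongTensor([100, 80])

    estimator = MaskEstimator('blstmp', idim=F, layers=1, units=32, projs=32,
                              dropout=0.0, nmask=2)
    masks, ilens_out = estimator(xs, ilens)
    print(len(masks), masks[0].shape)  # 2 torch.Size([2, 257, 4, 100])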

espnet2.enh.layers.dnn_wpe

class espnet2.enh.layers.dnn_wpe.DNN_WPE(wtype: str = 'blstmp', widim: int = 257, wlayers: int = 3, wunits: int = 300, wprojs: int = 320, dropout_rate: float = 0.0, taps: int = 5, delay: int = 3, use_dnn_mask: bool = True, nonlinear: str = 'sigmoid', iterations: int = 1, normalization: bool = False)[source]

Bases: torch.nn.modules.module.Module

forward(data: torch_complex.tensor.ComplexTensor, ilens: torch.LongTensor) → Tuple[torch_complex.tensor.ComplexTensor, torch.LongTensor, torch_complex.tensor.ComplexTensor][source]

The forward function

Notation:

B: Batch
C: Channel
T: Time or Sequence length
F: Freq or some dimension of the feature vector

Parameters
  • data – (B, C, T, F), double precision

  • ilens – (B,)

Returns

(B, C, T, F), double precision
ilens: (B,)

Return type

data

predict_mask(data: torch_complex.tensor.ComplexTensor, ilens: torch.LongTensor) → Tuple[torch.Tensor, torch.LongTensor][source]

Predict mask for WPE dereverberation

Parameters
  • data (ComplexTensor) – (B, T, C, F), double precision

  • ilens (torch.Tensor) – (B,)

Returns

(B, T, C, F)
ilens (torch.Tensor): (B,)

Return type

masks (torch.Tensor)

espnet2.enh.layers.__init__

espnet2.enh.nets.tf_mask_net

class espnet2.enh.nets.tf_mask_net.TFMaskingNet(n_fft: int = 512, win_length: int = None, hop_length: int = 128, rnn_type: str = 'blstm', layer: int = 3, unit: int = 512, dropout: float = 0.0, num_spk: int = 2, nonlinear: str = 'sigmoid', utt_mvn: bool = False, mask_type: str = 'IRM', loss_type: str = 'mask_mse')[source]

Bases: espnet2.enh.abs_enh.AbsEnhancement

TF Masking Speech Separation Net.

forward(input: torch.Tensor, ilens: torch.Tensor)[source]

Forward.

Parameters
  • input (torch.Tensor) – mixed speech [Batch, sample]

  • ilens (torch.Tensor) – input lengths [Batch]

Returns

[(B, T, F), …]
ilens (torch.Tensor): (B,)
predicted masks: OrderedDict[
    'spk1': torch.Tensor(Batch, Frames, Channel, Freq),
    'spk2': torch.Tensor(Batch, Frames, Channel, Freq),
    ...
    'spkn': torch.Tensor(Batch, Frames, Channel, Freq),
]

Return type

separated (list[ComplexTensor])

forward_rawwav(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Output with waveforms.

Parameters
  • input (torch.Tensor) – mixed speech [Batch, sample]

  • ilens (torch.Tensor) – input lengths [Batch]

Returns

predicted speech [Batch, num_speaker, sample]
output lengths
predicted masks: OrderedDict[
    'spk1': torch.Tensor(Batch, Frames, Channel, Freq),
    'spk2': torch.Tensor(Batch, Frames, Channel, Freq),
    ...
    'spkn': torch.Tensor(Batch, Frames, Channel, Freq),
]
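
Example: a minimal usage sketch with a tiny configuration; outputs are indexed rather than unpacked to stay agnostic to the exact return arity:

    import torch
    from espnet2.enh.nets.tf_mask_net import TFMaskingNet

    model = TFMaskingNet(n_fft=128, hop_length=64, layer=1, unit=32, num_spk=2)
    mixture = torch.randn(2, 16000)              # (Batch, sample)
    ilens = torch.LongTensor([16000, 12000])

    outputs = model.forward_rawwav(mixture, ilens)
    separated = outputs[0]                       # per-speaker waveforms (see Returns above)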

espnet2.enh.nets.beamformer_net

class espnet2.enh.nets.beamformer_net.BeamformerNet(num_spk: int = 1, normalize_input: bool = False, mask_type: str = 'IPM^2', loss_type: str = 'mask_mse', n_fft: int = 512, win_length: int = None, hop_length: int = 128, center: bool = True, window: Optional[str] = 'hann', normalized: bool = False, onesided: bool = True, use_wpe: bool = False, wnet_type: str = 'blstmp', wlayers: int = 3, wunits: int = 300, wprojs: int = 320, wdropout_rate: float = 0.0, taps: int = 5, delay: int = 3, use_dnn_mask_for_wpe: bool = True, wnonlinear: str = 'crelu', use_beamformer: bool = True, bnet_type: str = 'blstmp', blayers: int = 3, bunits: int = 300, bprojs: int = 320, badim: int = 320, ref_channel: int = -1, use_noise_mask: bool = True, bnonlinear: str = 'sigmoid', beamformer_type='mvdr', bdropout_rate=0.0)[source]

Bases: espnet2.enh.abs_enh.AbsEnhancement

TF-masking-based beamformer.

forward(input: torch.Tensor, ilens: torch.Tensor)[source]

Forward.

Parameters
  • input (torch.Tensor) – mixed speech [Batch, Nsample, Channel]

  • ilens (torch.Tensor) – input lengths [Batch]

Returns

torch.Tensor or List[torch.Tensor]
output lengths
predicted masks: OrderedDict[
    'dereverb': torch.Tensor(Batch, Frames, Channel, Freq),
    'spk1': torch.Tensor(Batch, Frames, Channel, Freq),
    'spk2': torch.Tensor(Batch, Frames, Channel, Freq),
    ...
    'spkn': torch.Tensor(Batch, Frames, Channel, Freq),
    'noise1': torch.Tensor(Batch, Frames, Channel, Freq),
]

Return type

enhanced speech (single-channel)

forward_rawwav(input: torch.Tensor, ilens: torch.Tensor)[source]

Output with waveforms.

Parameters
  • input (torch.Tensor) – mixed speech [Batch, Nsample, Channel]

  • ilens (torch.Tensor) – input lengths [Batch]

Returns

torch.Tensor(Batch, Nsamples), or List[torch.Tensor(Batch, Nsamples)]
output lengths
predicted masks: OrderedDict[
    'dereverb': torch.Tensor(Batch, Frames, Channel, Freq),
    'spk1': torch.Tensor(Batch, Frames, Channel, Freq),
    'spk2': torch.Tensor(Batch, Frames, Channel, Freq),
    ...
    'spkn': torch.Tensor(Batch, Frames, Channel, Freq),
    'noise1': torch.Tensor(Batch, Frames, Channel, Freq),
]

Return type

predicted speech wavs (single-channel)

espnet2.enh.nets.tasnet

class espnet2.enh.nets.tasnet.ChannelwiseLayerNorm(channel_size)[source]

Bases: torch.nn.modules.module.Module

Channel-wise Layer Normalization (cLN)

forward(y)[source]

Forward.

Parameters

y – [M, N, K], M is batch size, N is channel size, K is length

Returns

[M, N, K]

Return type

cLN_y

reset_parameters()[source]
class espnet2.enh.nets.tasnet.Chomp1d(chomp_size)[source]

Bases: torch.nn.modules.module.Module

To ensure the output length is the same as the input.

forward(x)[source]

Forward.

Parameters

x – [M, H, Kpad]

Returns

[M, H, K]

class espnet2.enh.nets.tasnet.Decoder(N, L)[source]

Bases: torch.nn.modules.module.Module

forward(mixture_w, est_mask)[source]

Forward

Parameters
  • mixture_w – [M, N, K]

  • est_mask – [M, C, N, K]

Returns

[M, C, T]

Return type

est_source

class espnet2.enh.nets.tasnet.DepthwiseSeparableConv(in_channels, out_channels, kernel_size, stride, padding, dilation, norm_type='gLN', causal=False)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Forward.

Parameters

x – [M, H, K]

Returns

[M, B, K]

Return type

result

class espnet2.enh.nets.tasnet.Encoder(L, N)[source]

Bases: torch.nn.modules.module.Module

Estimation of the nonnegative mixture weight by a 1-D conv layer.

forward(mixture)[source]

Forward.

Parameters

mixture – [M, T], M is batch size, T is #samples

Returns

[M, N, K], where K = (T-L)/(L/2)+1 = 2T/L-1

Return type

mixture_w
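
Example: a worked instance of the frame-count formula above:

    T, L = 16000, 20          # samples, filter length
    K = 2 * T // L - 1        # stride is L/2, so K = (T - L) / (L / 2) + 1 = 2T/L - 1
    print(K)                  # 1599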

class espnet2.enh.nets.tasnet.GlobalLayerNorm(channel_size)[source]

Bases: torch.nn.modules.module.Module

Global Layer Normalization (gLN)

forward(y)[source]

Forward.

Parameters

y – [M, N, K], M is batch size, N is channel size, K is length

Returns

[M, N, K]

Return type

gLN_y
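
Example: a sketch of global layer normalization. In the ESPnet module, gamma and beta are learnable nn.Parameter tensors and the eps handling may differ:

    import torch

    def gln_sketch(y, gamma, beta, eps=1e-8):
        # Normalize over both the channel (N) and length (K) dimensions.
        mean = y.mean(dim=(1, 2), keepdim=True)
        var = y.var(dim=(1, 2), keepdim=True, unbiased=False)
        return gamma * (y - mean) / torch.sqrt(var + eps) + beta

    M, N, K = 2, 8, 100
    y = torch.randn(M, N, K)
    gamma, beta = torch.ones(1, N, 1), torch.zeros(1, N, 1)
    print(gln_sketch(y, gamma, beta).shape)  # torch.Size([2, 8, 100])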

reset_parameters()[source]
class espnet2.enh.nets.tasnet.TasNet(N: int = 256, L: int = 20, B: int = 256, H: int = 512, P: int = 3, X: int = 8, R: int = 4, num_spk: int = 2, norm_type: str = 'gLN', causal: bool = False, mask_nonlinear: str = 'relu', loss_type: str = 'si_snr')[source]

Bases: espnet2.enh.abs_enh.AbsEnhancement

Main TasNet class.

Parameters
  • N – Number of filters in autoencoder

  • L – Length of the filters (in samples)

  • B – Number of channels in bottleneck 1 * 1-conv block

  • H – Number of channels in convolutional blocks

  • P – Kernel size in convolutional blocks

  • X – Number of convolutional blocks in each repeat

  • R – Number of repeats

  • num_spk – Number of speakers

  • norm_type – BN, gLN, cLN

  • causal – causal or non-causal

  • mask_nonlinear – use which non-linear function to generate mask

Reference:

Luo Y, Mesgarani N. Tasnet: time-domain audio separation network for real-time, single-channel speech separation

Based on https://github.com/kaituoxu/Conv-TasNet

forward(mixture, ilens=None)[source]

Forward from mixture to estimated sources.

Parameters
  • mixture – [M, T], M is batch size, T is #samples

  • ilens (torch.Tensor) – input lengths [Batch]

Returns

[M, C, T]
lens: [Batch]

Return type

est_source
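
Example: a minimal usage sketch with tiny hyperparameters chosen only for illustration; outputs are indexed rather than unpacked to stay agnostic to the exact return arity:

    import torch
    from espnet2.enh.nets.tasnet import TasNet

    model = TasNet(N=64, L=20, B=64, H=128, P=3, X=2, R=2, num_spk=2)
    mixture = torch.randn(2, 16000)              # (M, T)
    ilens = torch.LongTensor([16000, 16000])

    outputs = model(mixture, ilens)
    est_source = outputs[0]                      # (M, num_spk, T) per the docstring above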

forward_rawwav(mixture, ilens=None)[source]
classmethod load_model(path)[source]
classmethod load_model_from_package(package)[source]
static serialize(model, optimizer, epoch, tr_loss=None, cv_loss=None)[source]
class espnet2.enh.nets.tasnet.TemporalBlock(in_channels, out_channels, kernel_size, stride, padding, dilation, norm_type='gLN', causal=False)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Forward.

Parameters

x – [M, B, K]

Returns

[M, B, K]

class espnet2.enh.nets.tasnet.TemporalConvNet(N, B, H, P, X, R, C, norm_type='gLN', causal=False, mask_nonlinear='relu')[source]

Bases: torch.nn.modules.module.Module

Basic module of TasNet.

Parameters
  • N – Number of filters in autoencoder

  • B – Number of channels in bottleneck 1 * 1-conv block

  • H – Number of channels in convolutional blocks

  • P – Kernel size in convolutional blocks

  • X – Number of convolutional blocks in each repeat

  • R – Number of repeats

  • C – Number of speakers

  • norm_type – BN, gLN, cLN

  • causal – causal or non-causal

  • mask_nonlinear – use which non-linear function to generate mask

forward(mixture_w)[source]

Keep this API the same as TasNet.

Parameters

mixture_w – [M, N, K], M is batch size

Returns

[M, C, N, K]

Return type

est_mask

espnet2.enh.nets.tasnet.check_nonlinear(nolinear_type)[source]
espnet2.enh.nets.tasnet.chose_norm(norm_type, channel_size)[source]

The input of normalization will be (M, C, K), where M is batch size, C is channel size, and K is sequence length.

espnet2.enh.nets.tasnet.overlap_and_add(signal, frame_step)[source]

Reconstructs a signal from a framed representation.

Adds potentially overlapping frames of a signal with shape […, frames, frame_length], offsetting subsequent frames by frame_step. The resulting tensor has shape […, output_size] where

output_size = (frames - 1) * frame_step + frame_length

Parameters
  • signal – A […, frames, frame_length] Tensor. All dimensions may be unknown, and rank must be at least 2.

  • frame_step – An integer denoting overlap offsets. Must be less than or equal to frame_length.

Returns

A Tensor with shape […, output_size] containing the overlap-added frames of signal’s inner-most two dimensions, where output_size = (frames - 1) * frame_step + frame_length.

Based on https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/contrib/signal/python/ops/reconstruction_ops.py
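
Example: a toy re-implementation of overlap-add for the 2-D case (illustration only; the library function handles arbitrary leading dimensions):

    import torch

    def overlap_and_add_sketch(frames, frame_step):
        # frames: (n_frames, frame_length) -> signal: (output_size,)
        n_frames, frame_length = frames.shape
        output_size = (n_frames - 1) * frame_step + frame_length
        signal = frames.new_zeros(output_size)
        for i in range(n_frames):
            signal[i * frame_step: i * frame_step + frame_length] += frames[i]
        return signal

    print(overlap_and_add_sketch(torch.ones(3, 4), 2))
    # tensor([1., 1., 2., 2., 2., 2., 1., 1.])   output_size = (3 - 1) * 2 + 4 = 8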

espnet2.enh.nets.tasnet.remove_pad(inputs, inputs_lengths)[source]

Remove pad.

Parameters
  • inputs – torch.Tensor, [B, C, T] or [B, T], B is batch size

  • inputs_lengths – torch.Tensor, [B]

Returns

a list containing B items, each item is [C, T], T varies

Return type

results
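
Example: a minimal usage sketch following the shapes documented above:

    import torch
    from espnet2.enh.nets.tasnet import remove_pad

    padded = torch.randn(3, 2, 100)              # (B, C, T), zero-padded to T = 100
    lengths = torch.LongTensor([100, 80, 60])

    trimmed = remove_pad(padded, lengths)
    print([t.shape for t in trimmed])            # shapes: (2, 100), (2, 80), (2, 60)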

espnet2.enh.nets.__init__