espnet2.layers package

espnet2.layers.time_warp

Time warp module.

class espnet2.layers.time_warp.TimeWarp(window: int = 80, mode: str = 'bicubic')[source]

Bases: torch.nn.modules.module.Module

Time warping using torch.nn.functional.interpolate.

Parameters
  • window – time warp parameter

  • mode – Interpolate mode

extra_repr()[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(x: torch.Tensor, x_lengths: torch.Tensor = None)[source]

Forward function.

Parameters
  • x – (Batch, Time, Freq)

  • x_lengths – (Batch,)

espnet2.layers.time_warp.time_warp(x: torch.Tensor, window: int = 80, mode: str = 'bicubic')[source]

Time warping using torch.nn.functional.interpolate.

Parameters
  • x – (Batch, Time, Freq)

  • window – time warp parameter

  • mode – Interpolate mode
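As a rough illustration of the idea, the sketch below warps a single 1-D feature track in pure Python. It is not the library implementation: it uses linear rather than bicubic interpolation, and the helper names resample_linear and time_warp_1d, as well as the way the center and warped positions are drawn, are assumptions made for this example.

```python
import random


def resample_linear(seq, new_len):
    """Linearly resample a non-empty list of floats to new_len points."""
    if new_len == 1 or len(seq) == 1:
        return [seq[0]] * new_len
    scale = (len(seq) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(seq) - 1)
        frac = pos - lo
        out.append(seq[lo] * (1.0 - frac) + seq[hi] * frac)
    return out


def time_warp_1d(seq, window=80, rng=random):
    """Stretch one side of a random center point and squeeze the other.

    Hypothetical helper: the total length is preserved, and inputs that
    are too short for the given window are returned unchanged.
    """
    t = len(seq)
    if window < 1 or t - window <= window:
        return list(seq)
    # Assumed sampling scheme: center in [window, t - window),
    # warped position within +/- window of the center.
    center = rng.randrange(window, t - window)
    warped = rng.randrange(center - window, center + window) + 1
    left = resample_linear(seq[:center], warped)
    right = resample_linear(seq[center:], t - warped)
    return left + right
```

The output always has as many frames as the input, which is why the module can be dropped into a pipeline without changing x_lengths.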

espnet2.layers.abs_normalize

class espnet2.layers.abs_normalize.AbsNormalize[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, input_lengths: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.layers.log_mel

class espnet2.layers.log_mel.LogMel(fs: int = 16000, n_fft: int = 512, n_mels: int = 80, fmin: float = None, fmax: float = None, htk: bool = False, log_base: float = None)[source]

Bases: torch.nn.modules.module.Module

Convert STFT to fbank feats

The arguments are the same as those of librosa.filters.mel.

Parameters
  • fs – number > 0 [scalar] sampling rate of the incoming signal

  • n_fft – int > 0 [scalar] number of FFT components

  • n_mels – int > 0 [scalar] number of Mel bands to generate

  • fmin – float >= 0 [scalar] lowest frequency (in Hz)

  • fmax – float >= 0 [scalar] highest frequency (in Hz). If None, use fmax = fs / 2.0

  • htk – use HTK formula instead of Slaney

extra_repr()[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(feat: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.layers.mask_along_axis

class espnet2.layers.mask_along_axis.MaskAlongAxis(mask_width_range: Union[int, Sequence[int]] = (0, 30), num_mask: int = 2, dim: Union[int, str] = 'time', replace_with_zero: bool = True)[source]

Bases: torch.nn.modules.module.Module

extra_repr()[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(spec: torch.Tensor, spec_lengths: torch.Tensor = None)[source]

Forward function.

Parameters

spec – (Batch, Length, Freq)

class espnet2.layers.mask_along_axis.MaskAlongAxisVariableMaxWidth(mask_width_ratio_range: Union[float, Sequence[float]] = (0.0, 0.05), num_mask: int = 2, dim: Union[int, str] = 'time', replace_with_zero: bool = True)[source]

Bases: torch.nn.modules.module.Module

Mask input spec along a specified axis with variable maximum width.

Formula:

max_width = max_width_ratio * seq_len
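The formula above can be sketched as a small helper that turns a ratio range into an absolute width range for a given sequence length. The function name width_range_from_ratio, the use of floor, and the clamping to [0, seq_len] are assumptions for illustration, not confirmed details of the class.

```python
import math


def width_range_from_ratio(seq_len, mask_width_ratio_range=(0.0, 0.05)):
    """Convert (min_ratio, max_ratio) into (min_width, max_width) in frames."""
    min_ratio, max_ratio = mask_width_ratio_range
    # Clamp so the widths stay inside the sequence (assumed behavior).
    min_width = max(0, math.floor(seq_len * min_ratio))
    max_width = min(seq_len, math.floor(seq_len * max_ratio))
    return min_width, max_width
```

With the default ratio range, a 1000-frame utterance allows masks of up to 50 frames, while a 200-frame utterance allows at most 10, so short utterances are not masked as aggressively as they would be with a fixed width.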

extra_repr()[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(spec: torch.Tensor, spec_lengths: torch.Tensor = None)[source]

Forward function.

Parameters

spec – (Batch, Length, Freq)

espnet2.layers.mask_along_axis.mask_along_axis(spec: torch.Tensor, spec_lengths: torch.Tensor, mask_width_range: Sequence[int] = (0, 30), dim: int = 1, num_mask: int = 2, replace_with_zero: bool = True)[source]

Apply mask along the specified direction.

Parameters
  • spec – (Batch, Length, Freq)

  • spec_lengths – (Length): lengths are not used in this implementation

  • mask_width_range – The mask width is selected randomly within this range
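A minimal pure-Python sketch of masking along the time axis is shown below for a single utterance stored as a list of frames. The function name mask_along_time, the inclusive sampling of the width, and the uniform choice of the start position are assumptions; the library function operates on batched tensors and may sample differently.

```python
import random


def mask_along_time(spec, mask_width_range=(0, 30), num_mask=2, rng=random):
    """Return a copy of spec (list of T frames of F floats) with zeroed spans."""
    num_frames = len(spec)
    out = [list(frame) for frame in spec]
    lo, hi = mask_width_range
    for _ in range(num_mask):
        # Draw a width, never longer than the utterance itself.
        width = min(rng.randint(lo, hi), num_frames)
        start = rng.randint(0, num_frames - width)
        for t in range(start, start + width):
            out[t] = [0.0] * len(out[t])  # corresponds to replace_with_zero=True
    return out
```

The input is left untouched and at most num_mask spans are zeroed; spans may overlap, in which case fewer frames end up masked.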

espnet2.layers.sinc_conv

Sinc convolutions.

class espnet2.layers.sinc_conv.BarkScale[source]

Bases: object

Bark frequency scale.

Has wider bandwidths at lower frequencies, see: Critical bandwidth: BARK Zwicker and Terhardt, 1980

classmethod bank(channels: int, fs: float) → torch.Tensor[source]

Obtain initialization values for the Bark scale.

Parameters
  • channels – Number of channels.

  • fs – Sample rate.

Returns

Filter start frequencies and filter stop frequencies, each as a torch.Tensor.

Return type

torch.Tensor

static convert(f)[source]

Convert Hz to Bark.

static invert(x)[source]

Convert Bark to Hz.

class espnet2.layers.sinc_conv.LogCompression[source]

Bases: torch.nn.modules.module.Module

Log Compression Activation.

Activation function log(abs(x) + 1).

Initialize.

forward(x: torch.Tensor) → torch.Tensor[source]

Forward.

Applies the Log Compression function elementwise on tensor x.
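For a single scalar, the activation log(abs(x) + 1) can be written as follows. This is only a scalar sketch using the standard math module; the function name log_compression is chosen for this example, and the module itself applies the same expression elementwise to a tensor.

```python
import math


def log_compression(x):
    """Scalar form of the activation log(abs(x) + 1)."""
    return math.log(abs(x) + 1.0)
```

The function maps 0 to 0, is symmetric in the sign of x, and grows only logarithmically for large magnitudes, which compresses the dynamic range of the filter outputs.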

class espnet2.layers.sinc_conv.MelScale[source]

Bases: object

Mel frequency scale.

classmethod bank(channels: int, fs: float) → torch.Tensor[source]

Obtain initialization values for the mel scale.

Parameters
  • channels – Number of channels.

  • fs – Sample rate.

Returns

Filter start frequencies and filter stop frequencies, each as a torch.Tensor.

Return type

torch.Tensor

static convert(f)[source]

Convert Hz to mel.

static invert(x)[source]

Convert mel to Hz.
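To make the convert/invert pair concrete, the sketch below uses one widely used form of the mel formula, 2595 * log10(1 + f / 700). The constants are an assumption: several mel variants exist, and the ones used by this class may differ. The function names hz_to_mel and mel_to_hz are likewise chosen for this example.

```python
import math


def hz_to_mel(f):
    """Hz to mel using a common HTK-style formula (assumed variant)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Whatever constants are used, the two functions must be exact inverses of each other, so that filter edges spaced evenly on the mel axis can be mapped back to Hz.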

class espnet2.layers.sinc_conv.SincConv(in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding: int = 0, dilation: int = 1, window_func: str = 'hamming', scale_type: str = 'mel', fs: Union[int, float] = 16000)[source]

Bases: torch.nn.modules.module.Module

Sinc Convolution.

This module performs a convolution using Sinc filters in time domain as kernel. Sinc filters function as band passes in spectral domain. The filtering is done as a convolution in time domain, and no transformation to spectral domain is necessary.

This implementation of the Sinc convolution is heavily inspired by Ravanelli et al. https://github.com/mravanelli/SincNet, and adapted for the ESPnet toolkit. Combine Sinc convolutions with a log compression activation function, as in: https://arxiv.org/abs/2010.07597

Notes: Currently, the same filters are applied to all input channels. The windowing function is applied to the kernel to obtain a smoother filter, and not to the input values, which differs from traditional ASR.

Initialize Sinc convolutions.

Parameters
  • in_channels – Number of input channels.

  • out_channels – Number of output channels.

  • kernel_size – Sinc filter kernel size (needs to be an odd number).

  • stride – See torch.nn.functional.conv1d.

  • padding – See torch.nn.functional.conv1d.

  • dilation – See torch.nn.functional.conv1d.

  • window_func – Window function on the filter, one of [“hamming”, “none”].

  • fs (str, int, float) – Sample rate of the input data

forward(xs: torch.Tensor) → torch.Tensor[source]

Sinc convolution forward function.

Parameters

xs – Batch in form of torch.Tensor (B, C_in, D_in).

Returns

Batch in form of torch.Tensor (B, C_out, D_out).

Return type

xs

get_odim(idim: int) → int[source]

Obtain the output dimension of the filter.
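Since stride, padding and dilation follow torch.nn.functional.conv1d, the output length can be estimated with the standard 1-D convolution length formula. The sketch below assumes get_odim follows that formula; the function name conv1d_output_length is chosen for this example.

```python
def conv1d_output_length(idim, kernel_size, stride=1, padding=0, dilation=1):
    """Standard output-length formula for a 1-D convolution."""
    return (idim + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1
```

For example, 400 input samples with a kernel of size 101 give 300 output samples, and adding padding=50 keeps the length at 400.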

static hamming_window(x: torch.Tensor) → torch.Tensor[source]

Hamming Windowing function.

init_filters()[source]

Initialize filters with filterbank values.

static none_window(x: torch.Tensor) → torch.Tensor[source]

Identity-like windowing function.

static sinc(x: torch.Tensor) → torch.Tensor[source]

Sinc function.

espnet2.layers.__init__

espnet2.layers.utterance_mvn

class espnet2.layers.utterance_mvn.UtteranceMVN(norm_means: bool = True, norm_vars: bool = False, eps: float = 1e-20)[source]

Bases: espnet2.layers.abs_normalize.AbsNormalize

extra_repr()[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(x: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Forward function

Parameters
  • x – (B, L, …)

  • ilens – (B,)

espnet2.layers.utterance_mvn.utterance_mvn(x: torch.Tensor, ilens: torch.Tensor = None, norm_means: bool = True, norm_vars: bool = False, eps: float = 1e-20) → Tuple[torch.Tensor, torch.Tensor][source]

Apply utterance mean and variance normalization

Parameters
  • x – (B, T, D), assumed zero padded

  • ilens – (B,)

  • norm_means

  • norm_vars

  • eps
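The following pure-Python sketch shows the normalization for one zero-padded utterance. It is an illustration under assumptions, not the library code: the helper name utterance_mvn_single is hypothetical, statistics are taken over the first `length` frames only, the standard deviation is clamped from below by eps, and padded frames are reset to zero.

```python
import math


def utterance_mvn_single(frames, length, norm_means=True, norm_vars=False,
                         eps=1e-20):
    """Normalize one utterance given as a list of T frames of D floats."""
    dim = len(frames[0])
    valid = frames[:length]
    # Per-dimension mean over the valid (non-padded) frames.
    mean = [sum(f[d] for f in valid) / length for d in range(dim)]
    centered = [[f[d] - mean[d] for d in range(dim)] for f in valid]
    std = None
    if norm_vars:
        var = [sum(c[d] ** 2 for c in centered) / length for d in range(dim)]
        std = [max(math.sqrt(v), eps) for v in var]
    out = []
    for f, c in zip(valid, centered):
        row = c if norm_means else list(f)
        if std is not None:
            row = [v / s for v, s in zip(row, std)]
        out.append(row)
    # Keep the padding at zero so later layers still see a zero-padded input.
    out += [[0.0] * dim for _ in range(len(frames) - length)]
    return out
```

Because the statistics are computed per utterance, no statistics file is needed, in contrast to global mean and variance normalization.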

espnet2.layers.stft

class espnet2.layers.stft.Stft(n_fft: int = 512, win_length: int = None, hop_length: int = 128, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True)[source]

Bases: torch.nn.modules.module.Module, espnet2.layers.inversible_interface.InversibleInterface

extra_repr()[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(input: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

STFT forward function.

Parameters
  • input – (Batch, Nsamples) or (Batch, Nsamples, Channels)

  • ilens – (Batch)

Returns

(Batch, Frames, Freq, 2) or (Batch, Frames, Channels, Freq, 2)

Return type

output
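The Frames and Freq sizes of the output can be estimated from the constructor arguments. The sketch below assumes the usual STFT framing convention (with center=True the signal is padded by n_fft // 2 on each side) and that onesided=True keeps n_fft // 2 + 1 frequency bins; the function names are chosen for this example.

```python
def stft_num_frames(nsamples, n_fft=512, hop_length=128, center=True):
    """Number of STFT frames for a signal of nsamples samples."""
    if center:
        nsamples = nsamples + 2 * (n_fft // 2)  # padding on both sides
    return (nsamples - n_fft) // hop_length + 1


def stft_num_freq_bins(n_fft=512, onesided=True):
    """Number of frequency bins per frame."""
    return n_fft // 2 + 1 if onesided else n_fft
```

With the default settings, one second of 16 kHz audio (16000 samples) yields 126 frames of 257 bins each.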

inverse(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Inverse STFT.

Parameters
  • input – Tensor(batch, T, F, 2) or ComplexTensor(batch, T, F)

  • ilens – (batch,)

Returns

(batch, samples), together with ilens of shape (batch,)

Return type

wavs

espnet2.layers.global_mvn

class espnet2.layers.global_mvn.GlobalMVN(stats_file: Union[pathlib.Path, str], norm_means: bool = True, norm_vars: bool = True, eps: float = 1e-20)[source]

Bases: espnet2.layers.abs_normalize.AbsNormalize, espnet2.layers.inversible_interface.InversibleInterface

Apply global mean and variance normalization

TODO(kamo): Make this class portable somehow

Parameters
  • stats_file – npy file

  • norm_means – Apply mean normalization

  • norm_vars – Apply var normalization

  • eps

extra_repr()[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(x: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Forward function

Parameters
  • x – (B, L, …)

  • ilens – (B,)

inverse(x: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

espnet2.layers.inversible_interface

class espnet2.layers.inversible_interface.InversibleInterface[source]

Bases: abc.ABC

abstract inverse(input: torch.Tensor, input_lengths: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

espnet2.layers.label_aggregation

class espnet2.layers.label_aggregation.LabelAggregate(win_length: int = 512, hop_length: int = 128, center: bool = True)[source]

Bases: torch.nn.modules.module.Module

extra_repr()[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(input: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

LabelAggregate forward function.

Parameters
  • input – (Batch, Nsamples, Label_dim)

  • ilens – (Batch)

Returns

(Batch, Frames, Label_dim)

Return type

output