espnet2.layers package

espnet2.layers.time_warp

Time warp module.

class espnet2.layers.time_warp.TimeWarp(window: int = 80, mode: str = 'bicubic')[source]

Bases: torch.nn.modules.module.Module

Time warping using torch.nn.functional.interpolate.

Parameters:
  • window – time warp parameter

  • mode – interpolation mode

extra_repr()[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(x: torch.Tensor, x_lengths: torch.Tensor = None)[source]

Forward function.

Parameters:
  • x – (Batch, Time, Freq)

  • x_lengths – (Batch,)

espnet2.layers.time_warp.time_warp(x: torch.Tensor, window: int = 80, mode: str = 'bicubic')[source]

Time warping using torch.nn.functional.interpolate.

Parameters:
  • x – (Batch, Time, Freq)

  • window – time warp parameter

  • mode – interpolation mode
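
Example (a minimal sketch; it assumes forward returns the warped features together with the unchanged lengths):

   import torch

   from espnet2.layers.time_warp import TimeWarp

   warp = TimeWarp(window=80, mode="bicubic")
   x = torch.randn(2, 200, 80)           # (Batch, Time, Freq)
   x_lengths = torch.tensor([200, 160])
   y, y_lengths = warp(x, x_lengths)     # y: warped features, same shape as x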

espnet2.layers.label_aggregation

class espnet2.layers.label_aggregation.LabelAggregate(win_length: int = 512, hop_length: int = 128, center: bool = True)[source]

Bases: torch.nn.modules.module.Module

extra_repr()[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(input: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

LabelAggregate forward function.

Parameters:
  • input – (Batch, Nsamples, Label_dim)

  • ilens – (Batch)

Returns:

(Batch, Frames, Label_dim)

Return type:

output
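
Example (a minimal sketch; the label tensor below is random and only illustrates the expected shapes):

   import torch

   from espnet2.layers.label_aggregation import LabelAggregate

   agg = LabelAggregate(win_length=512, hop_length=128, center=True)
   labels = torch.randint(0, 2, (2, 16000, 4)).float()  # (Batch, Nsamples, Label_dim)
   ilens = torch.tensor([16000, 12000])
   frame_labels, olens = agg(labels, ilens)             # (Batch, Frames, Label_dim)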

espnet2.layers.__init__

espnet2.layers.create_adapter_fn

espnet2.layers.create_adapter_fn.create_houlsby_adapter(model: torch.nn.modules.module.Module, bottleneck: int = 32, target_layers: List[int] = [])[source]
espnet2.layers.create_adapter_fn.create_lora_adapter(model: torch.nn.modules.module.Module, rank: int = 8, alpha: int = 8, dropout_rate: float = 0.0, target_modules: List[str] = ['query'], bias_type: Optional[str] = 'none')[source]

Create LoRA adapter for the base model.

See: https://arxiv.org/pdf/2106.09685.pdf

Parameters:
  • model (torch.nn.Module) – Base model to be adapted.

  • rank (int) – Rank of LoRA matrices. Defaults to 8.

  • alpha (int) – Constant number for LoRA scaling. Defaults to 8.

  • dropout_rate (float) – Dropout probability for LoRA layers. Defaults to 0.0.

  • target_modules (List[str]) – List of module(s) to apply LoRA adaptation to, e.g., [“query”, “key”, “value”] to adapt all matching layers, or [“encoder.encoders.blocks.0.attn.key”] for one specific layer.

  • bias_type (str) – Bias training type for LoRA adaptation; one of [“none”, “all”, “lora_only”]. “none” trains no bias vectors; “all” trains all bias vectors, including LayerNorm biases; “lora_only” trains only the bias vectors in LoRA-adapted modules.
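
Example (a hedged sketch: TinyAttention and its “query” submodule are hypothetical names chosen only so that target_modules matches something; loralib must be installed for the LoRA layers to be created):

   import torch

   from espnet2.layers.create_adapter_fn import create_lora_adapter

   class TinyAttention(torch.nn.Module):
       """Toy stand-in for a real attention block."""

       def __init__(self):
           super().__init__()
           self.query = torch.nn.Linear(8, 8)  # name matches target_modules below

       def forward(self, x):
           return self.query(x)

   model = TinyAttention()
   create_lora_adapter(model, rank=4, alpha=4, target_modules=["query"])
   # After adaptation, only the LoRA parameters (plus any biases selected by
   # bias_type) are meant to be trained; the original weights stay frozen.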

espnet2.layers.create_adapter_fn.create_new_houlsby_module(target_module: torch.nn.modules.module.Module, bottleneck: int)[source]

Create a new Houlsby adapter module for the given target module.

Currently, only Wav2Vec2EncoderLayerStableLayerNorm and TransformerSentenceEncoderLayer are supported.

espnet2.layers.create_adapter_fn.create_new_lora_module(target_module: torch.nn.modules.module.Module, rank: int, alpha: int, dropout_rate: float)[source]

Create a new LoRA module for the given target module.

espnet2.layers.create_adapter_utils

espnet2.layers.create_adapter_utils.check_target_module_exists(key: str, target_modules: List[str])[source]

Check whether any of the target_modules matches the given key.

espnet2.layers.create_adapter_utils.get_submodules(model: torch.nn.modules.module.Module, key: str)[source]

Return the submodules of the given key.

espnet2.layers.create_adapter_utils.replace_module(parent_module: torch.nn.modules.module.Module, child_name: str, old_module: torch.nn.modules.module.Module, new_module: torch.nn.modules.module.Module)[source]

Replace the target module with the new module.
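
These three helpers compose as in the following sketch; the (parent, target, target_name) return order of get_submodules is an assumption based on similar utilities in peft:

   import torch

   from espnet2.layers.create_adapter_utils import (
       check_target_module_exists,
       get_submodules,
       replace_module,
   )

   model = torch.nn.Sequential(torch.nn.Linear(4, 4))
   keys = [k for k, _ in model.named_modules() if k]
   for key in keys:
       if check_target_module_exists(key, ["0"]):
           parent, target, target_name = get_submodules(model, key)
           replace_module(parent, target_name, target, torch.nn.Identity())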

espnet2.layers.log_mel

class espnet2.layers.log_mel.LogMel(fs: int = 16000, n_fft: int = 512, n_mels: int = 80, fmin: float = None, fmax: float = None, htk: bool = False, log_base: float = None)[source]

Bases: torch.nn.modules.module.Module

Convert STFT to fbank features.

The arguments are the same as those of librosa.filters.mel.

Parameters:
  • fs – number > 0 [scalar] sampling rate of the incoming signal

  • n_fft – int > 0 [scalar] number of FFT components

  • n_mels – int > 0 [scalar] number of Mel bands to generate

  • fmin – float >= 0 [scalar] lowest frequency (in Hz)

  • fmax – float >= 0 [scalar] highest frequency (in Hz). If None, use fmax = fs / 2.0

  • htk – use HTK formula instead of Slaney
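
Example (a sketch of the usual frontend flow: STFT, power spectrum, then LogMel; computing the power from the real/imaginary parts is an assumption about the expected input):

   import torch

   from espnet2.layers.log_mel import LogMel
   from espnet2.layers.stft import Stft

   stft = Stft(n_fft=512, hop_length=128)
   logmel = LogMel(fs=16000, n_fft=512, n_mels=80)

   wav = torch.randn(2, 16000)                      # (Batch, Nsamples)
   ilens = torch.tensor([16000, 12000])
   spec, flens = stft(wav, ilens)                   # (Batch, Frames, Freq, 2)
   power = spec[..., 0] ** 2 + spec[..., 1] ** 2    # power spectrum
   feats, flens = logmel(power, flens)              # (Batch, Frames, n_mels)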

extra_repr()[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(feat: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.layers.mask_along_axis

class espnet2.layers.mask_along_axis.MaskAlongAxis(mask_width_range: Union[int, Sequence[int]] = (0, 30), num_mask: int = 2, dim: Union[int, str] = 'time', replace_with_zero: bool = True)[source]

Bases: torch.nn.modules.module.Module

extra_repr()[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(spec: torch.Tensor, spec_lengths: torch.Tensor = None)[source]

Forward function.

Parameters:

spec – (Batch, Length, Freq)
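
Example (a SpecAugment-style sketch; dim accepts “time”/“freq” or an integer axis per the signature):

   import torch

   from espnet2.layers.mask_along_axis import MaskAlongAxis

   time_mask = MaskAlongAxis(mask_width_range=(0, 30), num_mask=2, dim="time")
   spec = torch.randn(4, 200, 80)                   # (Batch, Length, Freq)
   spec_lengths = torch.tensor([200, 180, 150, 120])
   masked, _ = time_mask(spec, spec_lengths)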

class espnet2.layers.mask_along_axis.MaskAlongAxisVariableMaxWidth(mask_width_ratio_range: Union[float, Sequence[float]] = (0.0, 0.05), num_mask: int = 2, dim: Union[int, str] = 'time', replace_with_zero: bool = True)[source]

Bases: torch.nn.modules.module.Module

Mask input spec along a specified axis with variable maximum width.

Formula:

max_width = max_width_ratio * seq_len

extra_repr()[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(spec: torch.Tensor, spec_lengths: torch.Tensor = None)[source]

Forward function.

Parameters:

spec – (Batch, Length, Freq)
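
Example (a minimal sketch; following the formula above, a ratio range of (0.0, 0.05) on 400 frames allows mask widths up to 20 frames):

   import torch

   from espnet2.layers.mask_along_axis import MaskAlongAxisVariableMaxWidth

   mask = MaskAlongAxisVariableMaxWidth(
       mask_width_ratio_range=(0.0, 0.05), num_mask=2, dim="time"
   )
   spec = torch.randn(4, 400, 80)       # max mask width = 0.05 * 400 = 20 frames
   masked, _ = mask(spec)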

espnet2.layers.mask_along_axis.mask_along_axis(spec: torch.Tensor, spec_lengths: torch.Tensor, mask_width_range: Sequence[int] = (0, 30), dim: int = 1, num_mask: int = 2, replace_with_zero: bool = True)[source]

Apply mask along the specified direction.

Parameters:
  • spec – (Batch, Length, Freq)

  • spec_lengths – (Batch,): lengths are not used in this implementation

  • mask_width_range – Select the width randomly within this range

espnet2.layers.inversible_interface

class espnet2.layers.inversible_interface.InversibleInterface[source]

Bases: abc.ABC

abstract inverse(input: torch.Tensor, input_lengths: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

espnet2.layers.stft

class espnet2.layers.stft.Stft(n_fft: int = 512, win_length: Optional[int] = None, hop_length: int = 128, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True)[source]

Bases: torch.nn.modules.module.Module, espnet2.layers.inversible_interface.InversibleInterface

extra_repr()[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(input: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

STFT forward function.

Parameters:
  • input – (Batch, Nsamples) or (Batch, Nsamples, Channels)

  • ilens – (Batch)

Returns:

(Batch, Frames, Freq, 2) or (Batch, Frames, Channels, Freq, 2)

Return type:

output

inverse(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Inverse STFT.

Parameters:
  • input – Tensor(batch, T, F, 2) or ComplexTensor(batch, T, F)

  • ilens – (batch,)

Returns:

(batch, samples), ilens: (batch,)

Return type:

wavs
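
Example (an analysis/synthesis sketch; passing the original sample lengths to inverse is an assumption about how its ilens is interpreted):

   import torch

   from espnet2.layers.stft import Stft

   stft = Stft(n_fft=512, hop_length=128)
   wav = torch.randn(2, 16000)                # (Batch, Nsamples)
   ilens = torch.tensor([16000, 12000])
   spec, flens = stft(wav, ilens)             # (Batch, Frames, Freq, 2)
   recon, rlens = stft.inverse(spec, ilens)   # (batch, samples), close to wav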

espnet2.layers.utterance_mvn

class espnet2.layers.utterance_mvn.UtteranceMVN(norm_means: bool = True, norm_vars: bool = False, eps: float = 1e-20)[source]

Bases: espnet2.layers.abs_normalize.AbsNormalize

extra_repr()[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(x: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Forward function.

Parameters:
  • x – (B, L, …)

  • ilens – (B,)

espnet2.layers.utterance_mvn.utterance_mvn(x: torch.Tensor, ilens: torch.Tensor = None, norm_means: bool = True, norm_vars: bool = False, eps: float = 1e-20) → Tuple[torch.Tensor, torch.Tensor][source]

Apply utterance mean and variance normalization.

Parameters:
  • x – (B, T, D), assumed zero padded

  • ilens – (B,)

  • norm_means

  • norm_vars

  • eps
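
Example (a minimal sketch of the functional form):

   import torch

   from espnet2.layers.utterance_mvn import utterance_mvn

   x = torch.randn(2, 100, 80)          # (B, T, D), zero padded beyond ilens
   ilens = torch.tensor([100, 60])
   x_norm, ilens = utterance_mvn(x, ilens, norm_means=True, norm_vars=False)
   # Means (and optionally variances) are computed per utterance over the
   # first ilens[b] frames of each sample.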

espnet2.layers.abs_normalize

class espnet2.layers.abs_normalize.AbsNormalize(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, input_lengths: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.layers.houlsby_adapter_layer

class espnet2.layers.houlsby_adapter_layer.HoulsbyTransformerSentenceEncoderLayer(bottleneck: int = 32, **kwargs)[source]

Bases: s3prl.upstream.wav2vec2.wav2vec2_model.TransformerSentenceEncoderLayer

Implements a Transformer Encoder Layer used in BERT/XLM style pre-trained models.

forward(x: torch.Tensor, self_attn_mask: torch.Tensor = None, self_attn_padding_mask: torch.Tensor = None, need_weights: bool = False, att_args=None)[source]

LayerNorm is applied either before or after the self-attention/ffn modules, similar to the original Transformer implementation.

class espnet2.layers.houlsby_adapter_layer.Houlsby_Adapter(input_size: int, bottleneck: int)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
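
Example (a minimal sketch; importing this module requires s3prl, and applying the residual connection outside the adapter is an assumption following the usual Houlsby pattern):

   import torch

   from espnet2.layers.houlsby_adapter_layer import Houlsby_Adapter

   adapter = Houlsby_Adapter(input_size=768, bottleneck=32)
   x = torch.randn(2, 50, 768)        # (Batch, Time, Hidden)
   y = x + adapter(x)                 # down-project, nonlinearity, up-project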

espnet2.layers.create_adapter

Definition of the low-rank adaptation (LoRA) for large models.

References

  1. LoRA: Low-Rank Adaptation of Large Language Models (https://arxiv.org/pdf/2106.09685.pdf)

  2. https://github.com/microsoft/LoRA.git

  3. https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora.py

espnet2.layers.create_adapter.create_adapter(model: torch.nn.modules.module.Module, adapter: str, adapter_conf: dict)[source]

Create adapter for the base model.

Parameters:
  • model (torch.nn.Module) – Base model to be adapted.

  • adapter (str) – Name of the adapter

  • adapter_conf (dict) – Configuration for the adapter, e.g., {“rank”: 8, “alpha”: 8, …} for LoRA
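
Example (a dispatching sketch; the tiny Sequential model and its “query” submodule are hypothetical stand-ins for a real ESPnet model, and the “lora” adapter name follows the configuration example above):

   import torch

   from espnet2.layers.create_adapter import create_adapter

   model = torch.nn.Sequential()         # stands in for a full ESPnet model
   model.query = torch.nn.Linear(8, 8)   # name matches the LoRA target below
   create_adapter(
       model,
       adapter="lora",
       adapter_conf={"rank": 8, "alpha": 8, "target_modules": ["query"]},
   )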

espnet2.layers.augmentation

class espnet2.layers.augmentation.DataAugmentation(effects: List[Union[Tuple[float, List[Tuple[float, str, Dict]]], Tuple[float, str, Dict]]], apply_n: Tuple[int, int] = [1, 1])[source]

Bases: object

A series of data augmentation effects that can be applied to a given waveform.

Note: Currently we only support single-channel waveforms.

Parameters:
  • effects (list) –

    a list of effects to be applied to the waveform.

    Example:

       [
           [0.1, "lowpass", {"cutoff_freq": 1000, "Q": 0.707}],
           [0.1, "highpass", {"cutoff_freq": 3000, "Q": 0.707}],
           [0.1, "equalization", {"center_freq": 1000, "gain": 0, "Q": 0.707}],
           [
               0.1,
               [
                   [0.3, "speed_perturb", {"factor": 0.9}],
                   [0.3, "speed_perturb", {"factor": 1.1}],
               ],
           ],
       ]

    Description:
    • The above list defines a series of data augmentation effects that will be randomly sampled to apply to a given waveform.

    • The data structure of each element can be either type1=Tuple[float, str, Dict] or type2=Tuple[float, List[type1]].

    • In type1, the three values are the weight of sampling this effect, the name (key) of the effect, and the keyword arguments for the effect.

    • In type2, the first value is the weight of sampling this effect. The second value is a list of type1 elements, defined as above.

    • Note that the effects defined in each type2 element are mutually exclusive (i.e., only one of them can be applied each time). This is useful when you want to avoid applying some specific effects at the same time.

  • apply_n (list) – range of the number of effects to be applied to the waveform.
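
Example (a usage sketch built from the effects list above; calling the object as augment(waveform, sample_rate) is an assumption about its __call__ signature):

   import torch

   from espnet2.layers.augmentation import DataAugmentation

   effects = [
       [0.9, "lowpass", {"cutoff_freq": 1000, "Q": 0.707}],
       [0.1, "highpass", {"cutoff_freq": 3000, "Q": 0.707}],
   ]
   augment = DataAugmentation(effects, apply_n=[1, 1])
   wav = torch.randn(16000)     # single-channel waveform
   out = augment(wav, 16000)    # applies exactly one effect, sampled by weight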

espnet2.layers.augmentation.bandpass_filtering(waveform, sample_rate: int, center_freq: int = 3000, Q: float = 0.707, const_skirt_gain: bool = False)[source]

Bandpass filter the input signal.

Parameters:
  • waveform (torch.Tensor) – audio signal (…, time)

  • sample_rate (int) – sampling rate in Hz

  • center_freq (int) – filter’s center frequency

  • Q (float or torch.Tensor) – https://en.wikipedia.org/wiki/Q_factor

  • const_skirt_gain (bool) – If True, uses a constant skirt gain (peak gain = Q). If False, uses a constant 0dB peak gain.

Returns:

filtered signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.bandreject_filtering(waveform, sample_rate: int, center_freq: int = 3000, Q: float = 0.707)[source]

Two-pole band-reject filter the input signal.

Parameters:
  • waveform (torch.Tensor) – audio signal (…, time)

  • sample_rate (int) – sampling rate in Hz

  • center_freq (int) – filter’s center frequency

  • Q (float or torch.Tensor) – https://en.wikipedia.org/wiki/Q_factor

Returns:

filtered signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.bandwidth_limitation(waveform, sample_rate: int, res_type='random')[source]

Apply the bandwidth limitation distortion to the input signal.

Parameters:
  • waveform (np.ndarray) – a single speech sample (…, Time)

  • sample_rate (int) – input sampling rate in Hz

  • fs_new (int) – effective sampling rate in Hz (determined internally; not an argument of this function)

  • res_type (str) – resampling method

Returns:

bandwidth-limited speech sample (…, Time)

Return type:

ret (np.ndarray)

espnet2.layers.augmentation.clipping(waveform, sample_rate: int, min_quantile: float = 0.0, max_quantile: float = 0.9)[source]

Apply the clipping distortion to the input signal.

Parameters:
  • waveform (torch.Tensor) – audio signal (…, time)

  • sample_rate (int) – sampling rate in Hz (not used)

  • min_quantile (float) – lower bound on the total percent of samples to be clipped

  • max_quantile (float) – upper bound on the total percent of samples to be clipped

Returns:

clipped signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.codecs(waveform, sample_rate: int, format: str, compression: Optional[float] = None, encoding: Optional[str] = None, bits_per_sample: Optional[int] = None)[source]

Apply the specified codecs to the input signal.

Warning: This function requires torchaudio 2.1 or later to work.

Note

  1. This function only supports CPU backend.

  2. The GSM codec can be used to emulate phone line channel effects.

Parameters:
  • waveform (torch.Tensor) – audio signal (…, time)

  • sample_rate (int) – sampling rate in Hz

  • format (str) – file format. Valid values are “wav”, “mp3”, “ogg”, “vorbis”, “amr-nb”, “amb”, “flac”, “sph”, “gsm”, and “htk”.

  • compression (float or None, optional) – used for formats other than WAV. For more details see torchaudio.backend.sox_io_backend.save().

  • encoding (str or None, optional) – change the encoding for the supported formats. Valid values are “PCM_S” (signed integer Linear PCM), “PCM_U” (unsigned integer Linear PCM), “PCM_F” (floating point PCM), “ULAW” (mu-law), and “ALAW” (a-law). For more details see torchaudio.backend.sox_io_backend.save().

  • bits_per_sample (int or None, optional) – change the bit depth for the supported formats. For more details see torchaudio.backend.sox_io_backend.save().

Returns:

compressed signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.contrast(waveform, sample_rate: int = 16000, enhancement_amount: float = 75.0)[source]

Apply contrast effect to the input signal to make it sound louder.

Parameters:
  • waveform (torch.Tensor) – audio signal (…, time)

  • sample_rate (int) – sampling rate in Hz (not used)

  • enhancement_amount (float) – controls the amount of the enhancement. The allowed range is 0–100. Note that enhancement_amount = 0 still gives a significant contrast enhancement.

Returns:

filtered signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.corrupt_phase(waveform, sample_rate, scale: float = 0.5, n_fft: float = 0.032, win_length: Optional[float] = None, hop_length: float = 0.008, window: Optional[str] = 'hann')[source]

Add random noise to the phase of the input waveform.

Parameters:
  • waveform (torch.Tensor) – audio signal (…, time)

  • sample_rate (int) – sampling rate in Hz

  • scale (float) – scale factor for the phase noise

  • n_fft (float) – length of FFT (in seconds)

  • win_length (float or None) – The window length (in seconds) used for STFT. If None, it is treated as equal to n_fft

  • hop_length (float) – The hop size (in seconds) used for STFT

  • window (str or None) – The windowing function applied to the signal after padding with zeros

Returns:

phase-corrupted signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.deemphasis(waveform, sample_rate: int, coeff: float = 0.97)[source]

De-emphasize a waveform along the time dimension.

y[i] = x[i] + coeff * y[i - 1]

Parameters:
  • waveform (torch.Tensor) – audio signal (…, time)

  • sample_rate (int) – sampling rate in Hz (not used)

  • coeff (float) – de-emphasis coefficient. Typically between 0.0 and 1.0.

Returns:

de-emphasized signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.equalization_filtering(waveform, sample_rate: int, center_freq: int = 1000, gain: float = 0.0, Q: float = 0.707)[source]

Equalization filter the input signal.

Parameters:
  • waveform (torch.Tensor) – audio signal (…, time)

  • sample_rate (int) – sampling rate in Hz

  • center_freq (int) – filter’s center frequency

  • gain (float or torch.Tensor) – desired gain at the boost (or attenuation) in dB

  • Q (float or torch.Tensor) – https://en.wikipedia.org/wiki/Q_factor

Returns:

filtered signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.highpass_filtering(waveform, sample_rate: int, cutoff_freq: int = 3000, Q: float = 0.707)[source]

Highpass filter the input signal.

Parameters:
  • waveform (torch.Tensor) – audio signal (…, time)

  • sample_rate (int) – sampling rate in Hz

  • cutoff_freq (int) – filter cutoff frequency

  • Q (float or torch.Tensor) – https://en.wikipedia.org/wiki/Q_factor

Returns:

filtered signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.lowpass_filtering(waveform, sample_rate: int, cutoff_freq: int = 1000, Q: float = 0.707)[source]

Lowpass filter the input signal.

Parameters:
  • waveform (torch.Tensor) – audio signal (…, time)

  • sample_rate (int) – sampling rate in Hz

  • cutoff_freq (int) – filter cutoff frequency

  • Q (float or torch.Tensor) – https://en.wikipedia.org/wiki/Q_factor

Returns:

filtered signal (…, time)

Return type:

ret (torch.Tensor)
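
The filtering functions share one calling convention; for instance (a sketch using the documented defaults):

   import torch

   from espnet2.layers.augmentation import highpass_filtering, lowpass_filtering

   wav = torch.randn(16000)                                    # (..., time)
   low = lowpass_filtering(wav, sample_rate=16000, cutoff_freq=1000, Q=0.707)
   high = highpass_filtering(wav, sample_rate=16000, cutoff_freq=3000, Q=0.707)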

espnet2.layers.augmentation.pitch_shift(waveform, sample_rate: int, n_steps: int, bins_per_octave: int = 12, n_fft: float = 0.032, win_length: Optional[float] = None, hop_length: float = 0.008, window: Optional[str] = 'hann')[source]

Shift the pitch of a waveform by n_steps steps.

Note: this function is slow.

Parameters:
  • waveform (torch.Tensor) – audio signal (…, time)

  • sample_rate (int) – sampling rate in Hz

  • n_steps (int) – the (fractional) number of steps to shift the pitch: -4 shifts the pitch down by 4/bins_per_octave octaves, and 4 shifts it up by 4/bins_per_octave octaves

  • bins_per_octave (int) – number of steps per octave

  • n_fft (float) – length of FFT (in seconds)

  • win_length (float or None) – The window length (in seconds) used for STFT. If None, it is treated as equal to n_fft

  • hop_length (float) – The hop size (in seconds) used for STFT

  • window (str or None) – The windowing function applied to the signal after padding with zeros

Returns:

filtered signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.polarity_inverse(waveform, sample_rate)[source]
espnet2.layers.augmentation.preemphasis(waveform, sample_rate: int, coeff: float = 0.97)[source]

Pre-emphasize a waveform along the time dimension.

y[i] = x[i] - coeff * x[i - 1]

Parameters:
  • waveform (torch.Tensor) – audio signal (…, time)

  • sample_rate (int) – sampling rate in Hz (not used)

  • coeff (float) – pre-emphasis coefficient. Typically between 0.0 and 1.0.

Returns:

pre-emphasized signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.reverse(waveform, sample_rate)[source]
espnet2.layers.augmentation.speed_perturb(waveform, sample_rate: int, factor: float)[source]

Speed perturbation which also changes the pitch.

Note: This function should be used with caution as it changes the signal duration.

Parameters:
  • waveform (torch.Tensor) – audio signal (…, time)

  • sample_rate (int) – sampling rate in Hz

  • factor (float) – speed factor (e.g., 0.9 for 90% speed)

Returns:

perturbed signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.time_stretch(waveform, sample_rate: int, factor: float, n_fft: float = 0.032, win_length: Optional[float] = None, hop_length: float = 0.008, window: Optional[str] = 'hann')[source]

Time scaling (speeding up or slowing down in time without modifying the pitch) via phase vocoder.

Note: This function should be used with caution as it changes the signal duration.

Parameters:
  • waveform (torch.Tensor) – audio signal (…, time)

  • sample_rate (int) – sampling rate in Hz

  • factor (float) – speed-up factor (e.g., 0.9 for 90% speed and 1.3 for 130% speed)

  • n_fft (float) – length of FFT (in seconds)

  • win_length (float or None) – The window length (in seconds) used for STFT. If None, it is treated as equal to n_fft

  • hop_length (float) – The hop size (in seconds) used for STFT

  • window (str or None) – The windowing function applied to the signal after padding with zeros

Returns:

perturbed signal (…, time)

Return type:

ret (torch.Tensor)
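
To contrast the two duration-changing effects (a sketch; both outputs differ in length from the input, so any cached lengths must be recomputed):

   import torch

   from espnet2.layers.augmentation import speed_perturb, time_stretch

   wav = torch.randn(16000)
   slow = speed_perturb(wav, 16000, factor=0.9)   # longer output, lower pitch
   fast = time_stretch(wav, 16000, factor=1.1)    # shorter output, same pitch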

espnet2.layers.augmentation.weighted_sample_without_replacement(population, weights, k, rng=random)[source]

espnet2.layers.sinc_conv

Sinc convolutions.

class espnet2.layers.sinc_conv.BarkScale[source]

Bases: object

Bark frequency scale.

Has wider bandwidths at lower frequencies; see “Critical bandwidth: BARK” (Zwicker and Terhardt, 1980).

classmethod bank(channels: int, fs: float) → torch.Tensor[source]

Obtain initialization values for the Bark scale.

Parameters:
  • channels – Number of channels.

  • fs – Sample rate.

Returns:

Filter start frequencies. torch.Tensor: Filter stop frequencies.

Return type:

torch.Tensor

static convert(f)[source]

Convert Hz to Bark.

static invert(x)[source]

Convert Bark to Hz.
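
Example (a round-trip sketch; tensor inputs are assumed to be accepted, since the module is torch-based):

   import torch

   from espnet2.layers.sinc_conv import BarkScale

   f_hz = torch.tensor([100.0, 1000.0, 4000.0])
   bark = BarkScale.convert(f_hz)    # Hz -> Bark
   back = BarkScale.invert(bark)     # Bark -> Hz, approximately f_hz again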

class espnet2.layers.sinc_conv.LogCompression[source]

Bases: torch.nn.modules.module.Module

Log Compression Activation.

Activation function log(abs(x) + 1).

Initialize.

forward(x: torch.Tensor) → torch.Tensor[source]

Forward.

Applies the Log Compression function elementwise on tensor x.

class espnet2.layers.sinc_conv.MelScale[source]

Bases: object

Mel frequency scale.

classmethod bank(channels: int, fs: float) → torch.Tensor[source]

Obtain initialization values for the mel scale.

Parameters:
  • channels – Number of channels.

  • fs – Sample rate.

Returns:

Filter start frequencies. torch.Tensor: Filter stop frequencies.

Return type:

torch.Tensor

static convert(f)[source]

Convert Hz to mel.

static invert(x)[source]

Convert mel to Hz.

class espnet2.layers.sinc_conv.SincConv(in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding: int = 0, dilation: int = 1, window_func: str = 'hamming', scale_type: str = 'mel', fs: Union[int, float] = 16000)[source]

Bases: torch.nn.modules.module.Module

Sinc Convolution.

This module performs a convolution using Sinc filters in time domain as kernel. Sinc filters function as band passes in spectral domain. The filtering is done as a convolution in time domain, and no transformation to spectral domain is necessary.

This implementation of the Sinc convolution is heavily inspired by Ravanelli et al., https://github.com/mravanelli/SincNet, and adapted for the ESPnet toolkit. Sinc convolutions can be combined with a log compression activation function, as in: https://arxiv.org/abs/2010.07597

Notes: Currently, the same filters are applied to all input channels. The windowing function is applied on the kernel to obtain a smoother filter, and not on the input values, which differs from traditional ASR.

Initialize Sinc convolutions.

Parameters:
  • in_channels – Number of input channels.

  • out_channels – Number of output channels.

  • kernel_size – Sinc filter kernel size (needs to be an odd number).

  • stride – See torch.nn.functional.conv1d.

  • padding – See torch.nn.functional.conv1d.

  • dilation – See torch.nn.functional.conv1d.

  • window_func – Window function on the filter, one of [“hamming”, “none”].

  • fs (int, float) – Sample rate of the input data.
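
Example (a shape-oriented sketch; kernel_size must be odd, per the parameter description):

   import torch

   from espnet2.layers.sinc_conv import SincConv

   conv = SincConv(in_channels=1, out_channels=128, kernel_size=101, fs=16000)
   wav = torch.randn(8, 1, 400)      # (B, C_in, D_in)
   out = conv(wav)                   # (B, C_out, D_out)
   assert out.size(-1) == conv.get_odim(400)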

forward(xs: torch.Tensor) → torch.Tensor[source]

Sinc convolution forward function.

Parameters:

xs – Batch in form of torch.Tensor (B, C_in, D_in).

Returns:

Batch in form of torch.Tensor (B, C_out, D_out).

Return type:

xs

get_odim(idim: int) → int[source]

Obtain the output dimension of the filter.

static hamming_window(x: torch.Tensor) → torch.Tensor[source]

Hamming Windowing function.

init_filters()[source]

Initialize filters with filterbank values.

static none_window(x: torch.Tensor) → torch.Tensor[source]

Identity-like windowing function.

static sinc(x: torch.Tensor) → torch.Tensor[source]

Sinc function.

espnet2.layers.global_mvn

class espnet2.layers.global_mvn.GlobalMVN(stats_file: Union[pathlib.Path, str], norm_means: bool = True, norm_vars: bool = True, eps: float = 1e-20)[source]

Bases: espnet2.layers.abs_normalize.AbsNormalize, espnet2.layers.inversible_interface.InversibleInterface

Apply global mean and variance normalization.

TODO(kamo): Make this class portable somehow

Parameters:
  • stats_file – npy file

  • norm_means – Apply mean normalization

  • norm_vars – Apply var normalization

  • eps

extra_repr()[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(x: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Forward function.

Parameters:
  • x – (B, L, …)

  • ilens – (B,)

inverse(x: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]