espnet2.layers package¶

espnet2.layers.time_warp¶

Time warp module.

class espnet2.layers.time_warp.TimeWarp(window: int = 80, mode: str = 'bicubic')[source]¶

Bases: torch.nn.modules.module.Module

Time warping using torch.nn.functional.interpolate.

Parameters:

window – time warp parameter
mode – Interpolate mode

extra_repr()[source]¶

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(x: torch.Tensor, x_lengths: torch.Tensor = None)[source]¶

Forward function.

Parameters:

x – (Batch, Time, Freq)
x_lengths – (Batch,)

espnet2.layers.time_warp.time_warp(x: torch.Tensor, window: int = 80, mode: str = 'bicubic')[source]¶

Time warping using torch.nn.functional.interpolate.

Parameters:

x – (Batch, Time, Freq)
window – time warp parameter
mode – Interpolate mode
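The idea behind time warping can be sketched on a single 1-D sequence in plain Python. This is only an illustration under stated assumptions: it picks a center frame at least `window` frames from either end, picks a new position for it within `window` frames, and resamples the two halves so the total length is unchanged. Linear interpolation is swapped in for the bicubic mode, and the names `resample_linear` and `time_warp_1d` are hypothetical, not part of ESPnet.

```python
import random


def resample_linear(seq, new_len):
    """Resample a 1-D sequence to new_len points with linear interpolation."""
    if new_len <= 0:
        return []
    if len(seq) == 1 or new_len == 1:
        return [seq[0]] * new_len
    scale = (len(seq) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        pos = i * scale
        lo = min(int(pos), len(seq) - 2)
        frac = pos - lo
        out.append(seq[lo] * (1.0 - frac) + seq[lo + 1] * frac)
    return out


def time_warp_1d(seq, window=80, rng=random):
    """Warp a 1-D sequence in time while keeping its length (assumed scheme)."""
    t = len(seq)
    if window <= 0 or t - window <= window:
        return list(seq)  # too short to warp; return an unchanged copy
    center = rng.randrange(window, t - window)
    warped = rng.randrange(center - window, center + window) + 1
    left = resample_linear(seq[:center], warped)
    right = resample_linear(seq[center:], t - warped)
    return left + right
```

The real layer operates on (Batch, Time, Freq) tensors; the sketch only shows why the output keeps the input length.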

espnet2.layers.label_aggregation¶

class espnet2.layers.label_aggregation.LabelAggregate(win_length: int = 512, hop_length: int = 128, center: bool = True)[source]¶

Bases: torch.nn.modules.module.Module

extra_repr()[source]¶

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(input: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶

LabelAggregate forward function.

Parameters:

input – (Batch, Nsamples, Label_dim)
ilens – (Batch)

Returns:

(Batch, Frames, Label_dim)

Return type:

output

espnet2.layers.init¶

espnet2.layers.create_adapter_fn¶

espnet2.layers.create_adapter_fn.create_houlsby_adapter(model: torch.nn.modules.module.Module, bottleneck: int = 32, target_layers: List[int] = [])[source]¶

espnet2.layers.create_adapter_fn.create_lora_adapter(model: torch.nn.modules.module.Module, rank: int = 8, alpha: int = 8, dropout_rate: float = 0.0, target_modules: List[str] = ['query'], bias_type: Optional[str] = 'none')[source]¶

Create LoRA adapter for the base model.

See: https://arxiv.org/pdf/2106.09685.pdf

Parameters:

model (torch.nn.Module) – Base model to be adapted.
rank (int) – Rank of LoRA matrices. Defaults to 8.
alpha (int) – Constant number for LoRA scaling. Defaults to 8.
dropout_rate (float) – Dropout probability for LoRA layers. Defaults to 0.0.
target_modules (List[str]) – List of module(s) to apply LoRA adaptation. e.g. [“query”, “key”, “value”] for all layers, while [“encoder.encoders.blocks.0.attn.key”] for a specific layer.
bias_type (str) – Bias training type for LoRA adaptation, can be one of [“none”, “all”, “lora_only”]. “none” means not training any bias vectors; “all” means training all bias vectors, including LayerNorm biases; “lora_only” means only training bias vectors in LoRA adapted modules.
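To make the roles of `rank` and `alpha` concrete, here is a minimal plain-Python sketch of the LoRA forward pass as described in the referenced paper: the frozen weight is applied as usual, and a low-rank update B(Ax) scaled by alpha / rank is added. The matrix names `w`, `a`, `b` and the helper names are illustrative assumptions, not ESPnet identifiers.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x).

    w: (out, in) frozen weight, a: (rank, in), b: (out, rank).
    """
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [h + scale * u for h, u in zip(base, update)]
```

With `b` initialized to zeros the adapted layer reproduces the base layer exactly, which is why training can start from the pretrained behavior.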

espnet2.layers.create_adapter_fn.create_new_houlsby_module(target_module: torch.nn.modules.module.Module, bottleneck: int)[source]¶

Create a new houlsby adapter module for the given target module.

Currently supports only Wav2Vec2EncoderLayerStableLayerNorm and TransformerSentenceEncoderLayer.

espnet2.layers.create_adapter_fn.create_new_lora_module(target_module: torch.nn.modules.module.Module, rank: int, alpha: int, dropout_rate: float)[source]¶: Create a new lora module for the given target module.

espnet2.layers.create_adapter_utils¶

espnet2.layers.create_adapter_utils.check_target_module_exists(key: str, target_modules: List[str])[source]¶: Check if any of the target_modules matches the given key.
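One plausible matching rule, consistent with the `target_modules` examples given for `create_lora_adapter` (a short name such as "query" selecting every layer, a full dotted path selecting one layer), is suffix matching on the module key. The sketch below is an assumption about the behavior, and `key_matches_targets` is a hypothetical name.

```python
from typing import List


def key_matches_targets(key: str, target_modules: List[str]) -> bool:
    """Return True if the dotted module key ends with any target name."""
    return any(
        key == target or key.endswith("." + target) for target in target_modules
    )
```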

espnet2.layers.create_adapter_utils.get_submodules(model: torch.nn.modules.module.Module, key: str)[source]¶: Return the submodules of the given key.

espnet2.layers.create_adapter_utils.replace_module(parent_module: torch.nn.modules.module.Module, child_name: str, old_module: torch.nn.modules.module.Module, new_module: torch.nn.modules.module.Module)[source]¶: Replace the target module with the new module.

espnet2.layers.log_mel¶

class espnet2.layers.log_mel.LogMel(fs: int = 16000, n_fft: int = 512, n_mels: int = 80, fmin: float = None, fmax: float = None, htk: bool = False, log_base: float = None)[source]¶

Bases: torch.nn.modules.module.Module

Convert STFT to fbank feats

The arguments are the same as those of librosa.filters.mel.

Parameters:

fs – number > 0 [scalar] sampling rate of the incoming signal
n_fft – int > 0 [scalar] number of FFT components
n_mels – int > 0 [scalar] number of Mel bands to generate
fmin – float >= 0 [scalar] lowest frequency (in Hz)
fmax – float >= 0 [scalar] highest frequency (in Hz). If None, use fmax = fs / 2.0
htk – use HTK formula instead of Slaney

extra_repr()[source]¶

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(feat: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.layers.mask_along_axis¶

class espnet2.layers.mask_along_axis.MaskAlongAxis(mask_width_range: Union[int, Sequence[int]] = (0, 30), num_mask: int = 2, dim: Union[int, str] = 'time', replace_with_zero: bool = True)[source]¶

Bases: torch.nn.modules.module.Module

extra_repr()[source]¶

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(spec: torch.Tensor, spec_lengths: torch.Tensor = None)[source]¶

Forward function.

Parameters:

spec – (Batch, Length, Freq)

class espnet2.layers.mask_along_axis.MaskAlongAxisVariableMaxWidth(mask_width_ratio_range: Union[float, Sequence[float]] = (0.0, 0.05), num_mask: int = 2, dim: Union[int, str] = 'time', replace_with_zero: bool = True)[source]¶

Bases: torch.nn.modules.module.Module

Mask input spec along a specified axis with variable maximum width.

Formula: max_width = max_width_ratio * seq_len
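The formula can be illustrated with a small helper that turns a ratio range into an absolute width range for one sequence length. Flooring the product and clamping to [0, seq_len] are assumptions made for this sketch; `width_range_for_length` is a hypothetical name.

```python
import math


def width_range_for_length(ratio_range, seq_len):
    """Convert (min_ratio, max_ratio) into (min_width, max_width) in frames."""
    min_width = max(0, math.floor(ratio_range[0] * seq_len))
    max_width = min(seq_len, math.floor(ratio_range[1] * seq_len))
    return min_width, max_width
```

With the default range (0.0, 0.05), a 1000-frame input allows masks up to 50 frames wide, while a 200-frame input allows at most 10.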

extra_repr()[source]¶

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(spec: torch.Tensor, spec_lengths: torch.Tensor = None)[source]¶

Forward function.

Parameters:

spec – (Batch, Length, Freq)

espnet2.layers.mask_along_axis.mask_along_axis(spec: torch.Tensor, spec_lengths: torch.Tensor, mask_width_range: Sequence[int] = (0, 30), dim: int = 1, num_mask: int = 2, replace_with_zero: bool = True)[source]¶

Apply mask along the specified direction.

Parameters:

spec – (Batch, Length, Freq)
spec_lengths – (Length): Not using lengths in this implementation
mask_width_range – Select the width randomly between this range

espnet2.layers.inversible_interface¶

class espnet2.layers.inversible_interface.InversibleInterface[source]¶

Bases: abc.ABC

abstract inverse(input: torch.Tensor, input_lengths: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶

espnet2.layers.stft¶

class espnet2.layers.stft.Stft(n_fft: int = 512, win_length: Optional[int] = None, hop_length: int = 128, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True)[source]¶

Bases: torch.nn.modules.module.Module, espnet2.layers.inversible_interface.InversibleInterface

extra_repr()[source]¶

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(input: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶

STFT forward function.

Parameters:

input – (Batch, Nsamples) or (Batch, Nsample, Channels)
ilens – (Batch)

Returns:

(Batch, Frames, Freq, 2) or (Batch, Frames, Channels, Freq, 2)

Return type:

output
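The Frames and Freq sizes of the output can be estimated from the constructor arguments. The sketch below follows the framing convention of torch.stft (when center is true the signal is padded by n_fft // 2 on both sides); that the Stft layer matches it in every configuration is an assumption, and `stft_output_shape` is a hypothetical helper.

```python
def stft_output_shape(n_samples, n_fft=512, hop_length=128, center=True, onesided=True):
    """Return (frames, freq_bins) for one waveform of n_samples samples."""
    padded = n_samples + 2 * (n_fft // 2) if center else n_samples
    frames = (padded - n_fft) // hop_length + 1
    freq_bins = n_fft // 2 + 1 if onesided else n_fft
    return frames, freq_bins
```

For one second of 16 kHz audio with the default settings this gives 126 frames of 257 frequency bins.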

inverse(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶

Inverse STFT.

Parameters:

input – Tensor(batch, T, F, 2) or ComplexTensor(batch, T, F)
ilens – (batch,)

Returns:

wavs – (batch, samples)
ilens – (batch,)

espnet2.layers.utterance_mvn¶

class espnet2.layers.utterance_mvn.UtteranceMVN(norm_means: bool = True, norm_vars: bool = False, eps: float = 1e-20)[source]¶

Bases: espnet2.layers.abs_normalize.AbsNormalize

extra_repr()[source]¶

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(x: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶

Forward function

Parameters:

x – (B, L, …)
ilens – (B,)

espnet2.layers.utterance_mvn.utterance_mvn(x: torch.Tensor, ilens: torch.Tensor = None, norm_means: bool = True, norm_vars: bool = False, eps: float = 1e-20) → Tuple[torch.Tensor, torch.Tensor][source]¶

Apply utterance mean and variance normalization

Parameters:

x – (B, T, D), assumed zero padded
ilens – (B,)
norm_means –
norm_vars –
eps –
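A minimal plain-Python sketch of utterance-level normalization for a single zero-padded utterance follows. Statistics are computed over the valid frames only, and padded frames stay zero. Using the population variance and clamping the standard deviation at `eps` are assumptions; `utterance_mvn_single` is a hypothetical name.

```python
import math


def utterance_mvn_single(frames, length, norm_means=True, norm_vars=False, eps=1e-20):
    """Normalize the first `length` frames of a (T, D) nested list."""
    valid = frames[:length]
    dim = len(valid[0])
    mean = [sum(f[d] for f in valid) / length for d in range(dim)]
    out = [list(f) for f in valid]
    if norm_means:
        out = [[v - m for v, m in zip(f, mean)] for f in out]
    if norm_vars:
        var = [sum((f[d] - mean[d]) ** 2 for f in valid) / length for d in range(dim)]
        std = [max(math.sqrt(v), eps) for v in var]
        out = [[v / s for v, s in zip(f, std)] for f in out]
    padding = [[0.0] * dim for _ in range(len(frames) - length)]
    return out + padding
```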

espnet2.layers.abs_normalize¶

class espnet2.layers.abs_normalize.AbsNormalize(*args, **kwargs)[source]¶

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, input_lengths: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.layers.houlsby_adapter_layer¶

class espnet2.layers.houlsby_adapter_layer.HoulsbyTransformerSentenceEncoderLayer(bottleneck: int = 32, **kwargs)[source]¶

Bases: s3prl.upstream.wav2vec2.wav2vec2_model.TransformerSentenceEncoderLayer

Implements a Transformer Encoder Layer used in BERT/XLM style pre-trained models.

forward(x: torch.Tensor, self_attn_mask: torch.Tensor = None, self_attn_padding_mask: torch.Tensor = None, need_weights: bool = False, att_args=None)[source]¶

LayerNorm is applied either before or after the self-attention/ffn modules, similar to the original Transformer implementation.

class espnet2.layers.houlsby_adapter_layer.Houlsby_Adapter(input_size: int, bottleneck: int)[source]¶

Bases: torch.nn.modules.module.Module

forward(x)[source]¶

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.layers.create_adapter¶

Definition of the low-rank adaptation (LoRA) for large models.

References

LoRA: Low-Rank Adaptation of Large Language Models (https://arxiv.org/pdf/2106.09685.pdf)
https://github.com/microsoft/LoRA.git
https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora.py

espnet2.layers.create_adapter.create_adapter(model: torch.nn.modules.module.Module, adapter: str, adapter_conf: dict)[source]¶

Create adapter for the base model.

Parameters:

model (torch.nn.Module) – Base model to be adapted.
adapter (str) – Name of adapter
adapter_conf (dict) – Configuration for the adapter e.g. {“rank”: 8, “alpha”: 8, …} for lora

espnet2.layers.augmentation¶

class espnet2.layers.augmentation.DataAugmentation(effects: List[Union[Tuple[float, List[Tuple[float, str, Dict]]], Tuple[float, str, Dict]]], apply_n: Tuple[int, int] = [1, 1])[source]¶

Bases: object

A series of data augmentation effects that can be applied to a given waveform.

Note: Currently we only support single-channel waveforms.

Parameters:

effects (list) – a list of effects to be applied to the waveform.

Example:

    [
        [0.1, "lowpass", {"cutoff_freq": 1000, "Q": 0.707}],
        [0.1, "highpass", {"cutoff_freq": 3000, "Q": 0.707}],
        [0.1, "equalization", {"center_freq": 1000, "gain": 0, "Q": 0.707}],
        [
            0.1,
            [
                [0.3, "speed_perturb", {"factor": 0.9}],
                [0.3, "speed_perturb", {"factor": 1.1}],
            ],
        ],
    ]

Description:
- The above list defines a series of data augmentation effects that will be randomly sampled to apply to a given waveform.
- The data structure of each element can be either type1=Tuple[float, str, Dict] or type2=Tuple[float, List[type1]].
- In type1, the three values are the weight of sampling this effect, the name (key) of the effect, and the keyword arguments for the effect.
- In type2, the first value is the weight of sampling this effect. The second value is a list of type1 elements, which are defined as above.
- Note that the effects defined in each type2 element are mutually exclusive (i.e., only one of them can be applied each time). This can be useful when you want to avoid applying some specific effects at the same time.

apply_n (list) – range of the number of effects to be applied to the waveform.
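The sampling scheme described above can be sketched in plain Python. The function below draws the number of effects from `apply_n`, samples top-level entries by weight without replacement, and resolves each type2 group to exactly one of its members. The inclusive `apply_n` range and sampling without replacement are assumptions, and `sample_effects` is a hypothetical name; it returns (name, kwargs) pairs rather than applying anything to audio.

```python
import random


def sample_effects(effects, apply_n=(1, 1), rng=random):
    """Pick effect names and kwargs from a type1/type2 effects list."""
    n = rng.randint(apply_n[0], apply_n[1])
    pool = list(effects)
    chosen = []
    for _ in range(min(n, len(pool))):
        weights = [entry[0] for entry in pool]
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        entry = pool.pop(idx)
        if isinstance(entry[1], str):
            # type1: (weight, name, kwargs)
            chosen.append((entry[1], entry[2]))
        else:
            # type2: (weight, [type1, ...]); only one member is used
            group = entry[1]
            sub = rng.choices(group, weights=[s[0] for s in group], k=1)[0]
            chosen.append((sub[1], sub[2]))
    return chosen
```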

espnet2.layers.augmentation.bandpass_filtering(waveform, sample_rate: int, center_freq: int = 3000, Q: float = 0.707, const_skirt_gain: bool = False)[source]¶

Bandpass filter the input signal.

Parameters:

waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
center_freq (int) – filter’s center frequency
Q (float or torch.Tensor) – https://en.wikipedia.org/wiki/Q_factor
const_skirt_gain (bool) – If True, uses a constant skirt gain (peak gain = Q). If False, uses a constant 0dB peak gain.

Returns:

filtered signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.bandreject_filtering(waveform, sample_rate: int, center_freq: int = 3000, Q: float = 0.707)[source]¶

Two-pole band-reject filter the input signal.

Parameters:

waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
center_freq (int) – filter’s center frequency
Q (float or torch.Tensor) – https://en.wikipedia.org/wiki/Q_factor

Returns:

filtered signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.bandwidth_limitation(waveform, sample_rate: int, res_type='random')[source]¶

Apply the bandwidth limitation distortion to the input signal.

Parameters:

waveform (np.ndarray) – a single speech sample (…, Time)
sample_rate (int) – input sampling rate in Hz
fs_new (int) – effective sampling rate in Hz
res_type (str) – resampling method

Returns:

bandwidth-limited speech sample (…, Time)

Return type:

ret (np.ndarray)

espnet2.layers.augmentation.clipping(waveform, sample_rate: int, min_quantile: float = 0.0, max_quantile: float = 0.9)[source]¶

Apply the clipping distortion to the input signal.

Parameters:

waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz (not used)
min_quantile (float) – lower bound on the total percent of samples to be clipped
max_quantile (float) – upper bound on the total percent of samples to be clipped

Returns:

clipped signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.codecs(waveform, sample_rate: int, format: str, compression: Optional[float] = None, encoding: Optional[str] = None, bits_per_sample: Optional[int] = None)[source]¶

Apply the specified codecs to the input signal.

Warning: Wait until torchaudio 2.1 for this function to work.

Note

This function only supports CPU backend.
The GSM codec can be used to emulate phone line channel effects.

Parameters:

waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
format (str) – file format. Valid values are “wav”, “mp3”, “ogg”, “vorbis”, “amr-nb”, “amb”, “flac”, “sph”, “gsm”, and “htk”.
compression (float or None, optional) – used for formats other than WAV. For more details see torchaudio.backend.sox_io_backend.save().
encoding (str or None, optional) – change the encoding for the supported formats. Valid values are “PCM_S” (signed integer Linear PCM), “PCM_U” (unsigned integer Linear PCM), “PCM_F” (floating point PCM), “ULAW” (mu-law), and “ALAW” (a-law). For more details see torchaudio.backend.sox_io_backend.save().
bits_per_sample (int or None, optional) – change the bit depth for the supported formats. For more details see torchaudio.backend.sox_io_backend.save().

Returns:

compressed signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.contrast(waveform, sample_rate: int = 16000, enhancement_amount: float = 75.0)[source]¶

Apply contrast effect to the input signal to make it sound louder.

Parameters:

waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz (not used)
enhancement_amount (float) – controls the amount of the enhancement. Allowed range of values: 0-100. Note that enhancement_amount = 0 still gives a significant contrast enhancement.

Returns:

filtered signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.corrupt_phase(waveform, sample_rate, scale: float = 0.5, n_fft: float = 0.032, win_length: Optional[float] = None, hop_length: float = 0.008, window: Optional[str] = 'hann')[source]¶

Add random noise to the phase of the input waveform.

Parameters:

waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
scale (float) – scale factor for the phase noise
n_fft (float) – length of FFT (in seconds)
win_length (float or None) – The window length (in seconds) used for STFT. If None, it is treated as equal to n_fft
hop_length (float) – The hop size (in seconds) used for STFT
window (str or None) – The windowing function applied to the signal after padding with zeros

Returns:

phase-corrupted signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.deemphasis(waveform, sample_rate: int, coeff: float = 0.97)[source]¶

De-emphasize a waveform along the time dimension.

y[i] = x[i] + coeff * y[i - 1]

Parameters:

waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz (not used)
coeff (float) – de-emphasis coefficient. Typically between 0.0 and 1.0.

Returns:

de-emphasized signal (…, time)

Return type:

ret (torch.Tensor)
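The recursion y[i] = x[i] + coeff * y[i - 1] can be written out directly for a single 1-D signal. The sketch assumes y[-1] = 0, so the first output sample equals the first input sample; `deemphasis_1d` is a hypothetical name used only for illustration.

```python
def deemphasis_1d(x, coeff=0.97):
    """Apply y[i] = x[i] + coeff * y[i - 1] to a list of samples."""
    y = []
    prev = 0.0  # assumed initial state y[-1] = 0
    for sample in x:
        prev = sample + coeff * prev
        y.append(prev)
    return y
```

An impulse decays geometrically: with coeff = 0.5 the input [1, 0, 0] becomes [1, 0.5, 0.25].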

espnet2.layers.augmentation.equalization_filtering(waveform, sample_rate: int, center_freq: int = 1000, gain: float = 0.0, Q: float = 0.707)[source]¶

Equalization filter the input signal.

Parameters:

waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
center_freq (int) – filter’s center frequency
gain (float or torch.Tensor) – desired gain at the boost (or attenuation) in dB
Q (float or torch.Tensor) – https://en.wikipedia.org/wiki/Q_factor

Returns:

filtered signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.highpass_filtering(waveform, sample_rate: int, cutoff_freq: int = 3000, Q: float = 0.707)[source]¶

Highpass filter the input signal.

Parameters:

waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
cutoff_freq (int) – filter cutoff frequency
Q (float or torch.Tensor) – https://en.wikipedia.org/wiki/Q_factor

Returns:

filtered signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.lowpass_filtering(waveform, sample_rate: int, cutoff_freq: int = 1000, Q: float = 0.707)[source]¶

Lowpass filter the input signal.

Parameters:

waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
cutoff_freq (int) – filter cutoff frequency
Q (float or torch.Tensor) – https://en.wikipedia.org/wiki/Q_factor

Returns:

filtered signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.pitch_shift(waveform, sample_rate: int, n_steps: int, bins_per_octave: int = 12, n_fft: float = 0.032, win_length: Optional[float] = None, hop_length: float = 0.008, window: Optional[str] = 'hann')[source]¶

Shift the pitch of a waveform by n_steps steps.

Note: this function is slow.

Parameters:

waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
n_steps (int) – the (fractional) steps to shift the pitch: -4 shifts the pitch down by 4/bins_per_octave octaves, and 4 shifts it up by 4/bins_per_octave octaves
bins_per_octave (int) – number of steps per octave
n_fft (float) – length of FFT (in seconds)
win_length (float or None) – The window length (in seconds) used for STFT. If None, it is treated as equal to n_fft
hop_length (float) – The hop size (in seconds) used for STFT
window (str or None) – The windowing function applied to the signal after padding with zeros

Returns:

filtered signal (…, time)

Return type:

ret (torch.Tensor)
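Two small calculations help when choosing these arguments: the frequency ratio implied by `n_steps` and `bins_per_octave`, and the STFT sizes in samples implied by the second-based `n_fft` and `hop_length`. Both helpers are hypothetical; rounding to the nearest sample is an assumption about how the conversion is done.

```python
def pitch_ratio(n_steps, bins_per_octave=12):
    """Frequency ratio for a shift of n_steps / bins_per_octave octaves."""
    return 2.0 ** (n_steps / bins_per_octave)


def seconds_to_samples(duration, sample_rate):
    """Convert a duration in seconds to a whole number of samples."""
    return round(duration * sample_rate)
```

At 16 kHz the defaults n_fft = 0.032 and hop_length = 0.008 correspond to 512 and 128 samples.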

espnet2.layers.augmentation.polarity_inverse(waveform, sample_rate)[source]¶

espnet2.layers.augmentation.preemphasis(waveform, sample_rate: int, coeff: float = 0.97)[source]¶

Pre-emphasize a waveform along the time dimension.

y[i] = x[i] - coeff * x[i - 1]

Parameters:

waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz (not used)
coeff (float) – pre-emphasis coefficient. Typically between 0.0 and 1.0.

Returns:

pre-emphasized signal (…, time)

Return type:

ret (torch.Tensor)
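For a single 1-D signal, the filter y[i] = x[i] - coeff * x[i - 1] can be sketched as follows. Leaving the first sample unchanged (treating x[-1] as 0) is an assumption of the sketch, and `preemphasis_1d` is a hypothetical name.

```python
def preemphasis_1d(x, coeff=0.97):
    """Apply y[i] = x[i] - coeff * x[i - 1] to a list of samples."""
    return [
        x[i] - coeff * x[i - 1] if i > 0 else x[i]
        for i in range(len(x))
    ]
```

A constant signal is attenuated after the first sample: with coeff = 0.5 the input [1, 1, 1] becomes [1, 0.5, 0.5].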

espnet2.layers.augmentation.reverse(waveform, sample_rate)[source]¶

espnet2.layers.augmentation.speed_perturb(waveform, sample_rate: int, factor: float)[source]¶

Speed perturbation which also changes the pitch.

Note: This function should be used with caution as it changes the signal duration.

Parameters:

waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
factor (float) – speed factor (e.g., 0.9 for 90% speed)
lengths (torch.Tensor) – lengths of the input signals

Returns:

perturbed signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.time_stretch(waveform, sample_rate: int, factor: float, n_fft: float = 0.032, win_length: Optional[float] = None, hop_length: float = 0.008, window: Optional[str] = 'hann')[source]¶

Time scaling (speed up in time without modifying pitch) via phase vocoder.

Note: This function should be used with caution as it changes the signal duration.

Parameters:

waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
factor (float) – speed-up factor (e.g., 0.9 for 90% speed and 1.3 for 130% speed)
n_fft (float) – length of FFT (in seconds)
win_length (float or None) – The window length (in seconds) used for STFT. If None, it is treated as equal to n_fft
hop_length (float) – The hop size (in seconds) used for STFT
window (str or None) – The windowing function applied to the signal after padding with zeros

Returns:

perturbed signal (…, time)

Return type:

ret (torch.Tensor)

espnet2.layers.augmentation.weighted_sample_without_replacement(population, weights, k, rng=random)[source]¶
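This function has no description on this page. One well-known way to draw k distinct items with probability proportional to their weights is the Efraimidis-Spirakis key method: give each item the key u ** (1 / w) for a uniform random u and keep the k largest keys. The sketch below shows that technique only; it is not claimed to be how ESPnet implements the function, and `weighted_sample_no_replacement` is a hypothetical name.

```python
import random


def weighted_sample_no_replacement(population, weights, k, rng=random):
    """Draw up to k distinct items, favoring larger weights."""
    keyed = []
    for item, weight in zip(population, weights):
        if weight > 0:  # zero-weight items can never be selected
            keyed.append((rng.random() ** (1.0 / weight), item))
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in keyed[:k]]
```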

espnet2.layers.sinc_conv¶

Sinc convolutions.

class espnet2.layers.sinc_conv.BarkScale[source]¶

Bases: object

Bark frequency scale.

Has wider bandwidths at lower frequencies, see: Critical bandwidth: BARK Zwicker and Terhardt, 1980

classmethod bank(channels: int, fs: float) → torch.Tensor[source]¶

Obtain initialization values for the Bark scale.

Parameters:

channels – Number of channels.
fs – Sample rate.

Returns:

Filter start frequencies, followed by a second torch.Tensor with the filter stop frequencies.

Return type:

torch.Tensor

static convert(f)[source]¶: Convert Hz to Bark.

static invert(x)[source]¶: Convert Bark to Hz.

class espnet2.layers.sinc_conv.LogCompression[source]¶

Bases: torch.nn.modules.module.Module

Log Compression Activation.

Activation function log(abs(x) + 1).

Initialize.

forward(x: torch.Tensor) → torch.Tensor[source]¶

Forward.

Applies the Log Compression function elementwise on tensor x.
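The activation log(abs(x) + 1) is easy to check by hand. The following plain-Python version works on a list of floats instead of a tensor; `log_compression` is a hypothetical name.

```python
import math


def log_compression(values):
    """Apply log(abs(x) + 1) to each element of a list."""
    return [math.log(abs(v) + 1.0) for v in values]
```

The function maps 0 to 0, is symmetric in the sign of its input, and grows slowly for large magnitudes.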

class espnet2.layers.sinc_conv.MelScale[source]¶

Bases: object

Mel frequency scale.

classmethod bank(channels: int, fs: float) → torch.Tensor[source]¶

Obtain initialization values for the mel scale.

Parameters:

channels – Number of channels.
fs – Sample rate.

Returns:

Filter start frequencies, followed by a second torch.Tensor with the filter stop frequencies.

Return type:

torch.Tensor

static convert(f)[source]¶: Convert Hz to mel.

static invert(x)[source]¶: Convert mel to Hz.
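The constants used by `convert` and `invert` are not stated on this page. As an illustration only, the sketch below uses the common HTK-style mel formula, mel = 2595 * log10(1 + f / 700), and its inverse; the choice of formula and the function names are assumptions.

```python
import math


def hz_to_mel(f):
    """Convert a frequency in Hz to mel (HTK-style formula, assumed)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Convert a mel value back to Hz (inverse of hz_to_mel)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Under this formula 1000 Hz maps to roughly 1000 mel, and the two functions are inverses of each other.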

class espnet2.layers.sinc_conv.SincConv(in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding: int = 0, dilation: int = 1, window_func: str = 'hamming', scale_type: str = 'mel', fs: Union[int, float] = 16000)[source]¶

Bases: torch.nn.modules.module.Module

Sinc Convolution.

This module performs a convolution using Sinc filters in time domain as kernel. Sinc filters function as band passes in spectral domain. The filtering is done as a convolution in time domain, and no transformation to spectral domain is necessary.

This implementation of the Sinc convolution is heavily inspired by Ravanelli et al. https://github.com/mravanelli/SincNet, and adapted for the ESPnet toolkit. Combine Sinc convolutions with a log compression activation function, as in: https://arxiv.org/abs/2010.07597

Notes: Currently, the same filters are applied to all input channels. The windowing function is applied to the kernel to obtain a smoother filter, not to the input values, which differs from traditional ASR.

Initialize Sinc convolutions.

Parameters:

in_channels – Number of input channels.
out_channels – Number of output channels.
kernel_size – Sinc filter kernel size (needs to be an odd number).
stride – See torch.nn.functional.conv1d.
padding – See torch.nn.functional.conv1d.
dilation – See torch.nn.functional.conv1d.
window_func – Window function on the filter, one of [“hamming”, “none”].
fs (str, int, float) – Sample rate of the input data

forward(xs: torch.Tensor) → torch.Tensor[source]¶

Sinc convolution forward function.

Parameters:

xs – Batch in form of torch.Tensor (B, C_in, D_in).

Returns:

Batch in form of torch.Tensor (B, C_out, D_out).

Return type:

xs

get_odim(idim: int) → int[source]¶: Obtain the output dimension of the filter.
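Since stride, padding, and dilation follow torch.nn.functional.conv1d, the output length can be estimated with the standard 1-D convolution length formula. That `get_odim` uses exactly this formula is an assumption; `conv1d_output_length` is a hypothetical helper.

```python
def conv1d_output_length(idim, kernel_size, stride=1, padding=0, dilation=1):
    """Standard output length of a 1-D convolution."""
    span = idim + 2 * padding - dilation * (kernel_size - 1) - 1
    return span // stride + 1
```

For example, 400 input samples with a kernel of size 101 and no padding give 300 output samples.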

static hamming_window(x: torch.Tensor) → torch.Tensor[source]¶: Hamming Windowing function.

init_filters()[source]¶: Initialize filters with filterbank values.

static none_window(x: torch.Tensor) → torch.Tensor[source]¶: Identity-like windowing function.

static sinc(x: torch.Tensor) → torch.Tensor[source]¶: Sinc function.
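For reference, the unnormalized sinc function and a Hamming window can be sketched in plain Python. The handling of x = 0, the window taking a size rather than a tensor of positions, and the function names are all assumptions made for illustration; they do not mirror the signatures of the static methods listed here.

```python
import math


def sinc(x):
    """Unnormalized sinc: sin(x) / x, with sinc(0) defined as 1."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def hamming_window(size):
    """Symmetric Hamming window of the given odd or even size."""
    if size == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (size - 1))
        for n in range(size)
    ]
```

Multiplying a truncated sinc kernel by such a window tapers its ends, which is the smoothing effect the class notes describe.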

espnet2.layers.global_mvn¶

class espnet2.layers.global_mvn.GlobalMVN(stats_file: Union[pathlib.Path, str], norm_means: bool = True, norm_vars: bool = True, eps: float = 1e-20)[source]¶

Bases: espnet2.layers.abs_normalize.AbsNormalize, espnet2.layers.inversible_interface.InversibleInterface

Apply global mean and variance normalization

TODO(kamo): Make this class portable somehow

Parameters:

stats_file – npy file
norm_means – Apply mean normalization
norm_vars – Apply var normalization
eps –

extra_repr()[source]¶

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(x: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶

Forward function

Parameters:

x – (B, L, …)
ilens – (B,)

inverse(x: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶
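Global normalization and its inverse can be sketched for a (T, D) nested list, given per-dimension mean and standard deviation. How the statistics are stored in `stats_file` is not described here, so the sketch takes them as plain lists; clamping the standard deviation at `eps` and the function names are assumptions.

```python
def global_mvn(frames, mean, std, norm_means=True, norm_vars=True, eps=1e-20):
    """Subtract the global mean and divide by the global std per dimension."""
    out = []
    for frame in frames:
        row = list(frame)
        if norm_means:
            row = [v - m for v, m in zip(row, mean)]
        if norm_vars:
            row = [v / max(s, eps) for v, s in zip(row, std)]
        out.append(row)
    return out


def global_mvn_inverse(frames, mean, std, norm_means=True, norm_vars=True, eps=1e-20):
    """Undo global_mvn: multiply by the std, then add the mean."""
    out = []
    for frame in frames:
        row = list(frame)
        if norm_vars:
            row = [v * max(s, eps) for v, s in zip(row, std)]
        if norm_means:
            row = [v + m for v, m in zip(row, mean)]
        out.append(row)
    return out
```

The inverse applies the two steps in the opposite order, so a round trip returns the original features.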
