espnet2.layers package¶
espnet2.layers.abs_normalize¶
-
class
espnet2.layers.abs_normalize.
AbsNormalize
(*args, **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(input: torch.Tensor, input_lengths: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.layers.sinc_conv¶
Sinc convolutions.
-
class
espnet2.layers.sinc_conv.
BarkScale
[source]¶ Bases:
object
Bark frequency scale.
Has wider bandwidths at lower frequencies, see: Critical bandwidth: BARK Zwicker and Terhardt, 1980
-
class
espnet2.layers.sinc_conv.
LogCompression
[source]¶ Bases:
torch.nn.modules.module.Module
Log Compression Activation.
Activation function log(abs(x) + 1).
Initialize.
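The activation above can be illustrated with a minimal plain-Python sketch. The function name `log_compression` is hypothetical and it operates on a list of floats rather than a torch.Tensor; it only shows the formula log(abs(x) + 1), not the module's implementation.

```python
import math


def log_compression(values):
    """Apply log(abs(x) + 1) element-wise to a list of floats."""
    return [math.log(abs(x) + 1.0) for x in values]


# The function is even (sign is discarded) and maps 0 to 0.
compressed = log_compression([-1.0, 0.0, 1.0])
```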
-
class
espnet2.layers.sinc_conv.
MelScale
[source]¶ Bases:
object
Mel frequency scale.
-
class
espnet2.layers.sinc_conv.
SincConv
(in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding: int = 0, dilation: int = 1, window_func: str = 'hamming', scale_type: str = 'mel', fs: Union[int, float] = 16000)[source]¶ Bases:
torch.nn.modules.module.Module
Sinc Convolution.
This module performs a convolution using Sinc filters in time domain as kernel. Sinc filters function as band passes in spectral domain. The filtering is done as a convolution in time domain, and no transformation to spectral domain is necessary.
This implementation of the Sinc convolution is heavily inspired by Ravanelli et al. https://github.com/mravanelli/SincNet, and adapted for the ESPnet toolkit. Combine Sinc convolutions with a log compression activation function, as in: https://arxiv.org/abs/2010.07597
Notes: Currently, the same filters are applied to all input channels. The windowing function is applied to the kernel to obtain a smoother filter, not to the input values, which differs from traditional ASR.
Initialize Sinc convolutions.
- Parameters:
in_channels – Number of input channels.
out_channels – Number of output channels.
kernel_size – Sinc filter kernel size (needs to be an odd number).
stride – See torch.nn.functional.conv1d.
padding – See torch.nn.functional.conv1d.
dilation – See torch.nn.functional.conv1d.
window_func – Window function on the filter, one of [“hamming”, “none”].
fs (str, int, float) – Sample rate of the input data
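To make the idea of a windowed Sinc band-pass kernel concrete, here is a rough plain-Python sketch. It assumes the common construction of a band-pass as the difference of two low-pass sinc filters with a Hamming window on the kernel; the helper names `sinc` and `bandpass_kernel` and the 300-3400 Hz band are illustrative and are not taken from the module.

```python
import math


def sinc(x):
    """Normalized sinc: sin(pi * x) / (pi * x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def bandpass_kernel(f_low, f_high, kernel_size, fs=16000):
    """Windowed-sinc band-pass kernel (difference of two low-pass filters)."""
    if kernel_size < 3 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be an odd number >= 3")
    half = kernel_size // 2
    lo, hi = f_low / fs, f_high / fs  # cutoffs in cycles per sample
    kernel = []
    for i in range(kernel_size):
        n = i - half  # tap index centered on zero
        ideal = 2 * hi * sinc(2 * hi * n) - 2 * lo * sinc(2 * lo * n)
        # Hamming window applied to the kernel, not to the input signal.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        kernel.append(ideal * window)
    return kernel


kernel = bandpass_kernel(300.0, 3400.0, kernel_size=101)
```

An odd kernel size gives the kernel a single center tap, which keeps it symmetric around zero.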
espnet2.layers.stft¶
-
class
espnet2.layers.stft.
Stft
(n_fft: int = 512, win_length: int = None, hop_length: int = 128, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True)[source]¶ Bases:
torch.nn.modules.module.Module
,espnet2.layers.inversible_interface.InversibleInterface
-
extra_repr
()[source]¶ Set the extra representation of the module
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
-
forward
(input: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶ STFT forward function.
- Parameters:
input – (Batch, Nsamples) or (Batch, Nsample, Channels)
ilens – (Batch)
- Returns:
(Batch, Frames, Freq, 2) or (Batch, Frames, Channels, Freq, 2)
- Return type:
output
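The output shapes listed above can be estimated ahead of time. The sketch below assumes the usual torch.stft framing conventions (1 + Nsamples // hop_length frames when center=True, and n_fft // 2 + 1 frequency bins when onesided=True); the helper `stft_output_shape` is hypothetical and does not call the module.

```python
def stft_output_shape(batch, n_samples, n_fft=512, hop_length=128,
                      center=True, onesided=True, channels=None):
    """Expected (Batch, Frames, [Channels,] Freq, 2) shape of the STFT output."""
    if center:
        # Signal is padded by n_fft // 2 on both sides before framing.
        frames = 1 + n_samples // hop_length
    else:
        frames = 1 + (n_samples - n_fft) // hop_length
    freq = n_fft // 2 + 1 if onesided else n_fft
    if channels is None:
        return (batch, frames, freq, 2)
    return (batch, frames, channels, freq, 2)


# One second of 16 kHz audio with the default settings.
shape = stft_output_shape(batch=4, n_samples=16000)
```

The trailing dimension of size 2 holds the real and imaginary parts.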
-
inverse
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶ Inverse STFT.
- Parameters:
input – Tensor(batch, T, F, 2) or ComplexTensor(batch, T, F)
ilens – (batch,)
- Returns:
(batch, samples) ilens: (batch,)
- Return type:
wavs
-
espnet2.layers.log_mel¶
-
class
espnet2.layers.log_mel.
LogMel
(fs: int = 16000, n_fft: int = 512, n_mels: int = 80, fmin: float = None, fmax: float = None, htk: bool = False, log_base: float = None)[source]¶ Bases:
torch.nn.modules.module.Module
Convert STFT to fbank feats
The arguments are the same as those of librosa.filters.mel
- Parameters:
fs – number > 0 [scalar] sampling rate of the incoming signal
n_fft – int > 0 [scalar] number of FFT components
n_mels – int > 0 [scalar] number of Mel bands to generate
fmin – float >= 0 [scalar] lowest frequency (in Hz)
fmax – float >= 0 [scalar] highest frequency (in Hz). If None, use fmax = fs / 2.0
htk – use HTK formula instead of Slaney
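As a rough illustration of the htk option and the fmax default, the sketch below uses the standard HTK mel formula, mel = 2595 * log10(1 + f / 700), to place n_mels + 2 band edges between fmin and fmax. The helper names are hypothetical, and treating fmin = None as 0 Hz is an assumption; the module itself delegates filter construction to librosa.

```python
import math


def hz_to_mel_htk(hz):
    """HTK mel scale: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(fs=16000, n_mels=80, fmin=None, fmax=None):
    """Band edges (Hz) spaced evenly on the HTK mel scale."""
    fmin = 0.0 if fmin is None else fmin      # assumption: None means 0 Hz
    fmax = fs / 2.0 if fmax is None else fmax  # documented default
    lo, hi = hz_to_mel_htk(fmin), hz_to_mel_htk(fmax)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz_htk(lo + i * step) for i in range(n_mels + 2)]


edges = mel_band_edges()
```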
-
extra_repr
()[source]¶ Set the extra representation of the module
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
-
forward
(feat: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.layers.create_lora_adapter¶
Definition of the low-rank adaptation (LoRA) for large models.
References
LoRA: Low-Rank Adaptation of Large Language Models (https://arxiv.org/pdf/2106.09685.pdf)
https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora.py
-
espnet2.layers.create_lora_adapter.
check_target_module_exists
(key: str, target_modules: List[str])[source]¶ Check if the target_modules matches the given key.
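One plausible reading of this check is suffix matching on the dotted module name, so that a short name selects every layer and a full path selects one layer. This is an assumption, not confirmed by the text here; the helper `matches_target` is hypothetical.

```python
def matches_target(key, target_modules):
    """Return True if the module name `key` ends with any target name."""
    return any(key.endswith(target) for target in target_modules)


# A short target name matches the attention query of every block.
hit_all = matches_target("encoder.encoders.blocks.3.attn.query", ["query"])
# A full path matches only that specific module.
hit_one = matches_target(
    "encoder.encoders.blocks.3.attn.key",
    ["encoder.encoders.blocks.0.attn.key"],
)
```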
-
espnet2.layers.create_lora_adapter.
create_lora_adapter
(model: torch.nn.modules.module.Module, rank: int = 8, alpha: int = 8, dropout_rate: float = 0.0, target_modules: List[str] = ['query'], bias_type: str = 'none')[source]¶ Create LoRA adapter for the base model.
See: https://arxiv.org/pdf/2106.09685.pdf
- Parameters:
model (torch.nn.Module) – Base model to be adapted.
rank (int) – Rank of LoRA matrices. Defaults to 8.
alpha (int) – Constant number for LoRA scaling. Defaults to 8.
dropout_rate (float) – Dropout probability for LoRA layers. Defaults to 0.0.
target_modules (List[str]) – List of module(s) to apply LoRA adaptation. e.g. [“query”, “key”, “value”] for all layers, while [“encoder.encoders.blocks.0.attn.key”] for a specific layer.
bias_type (str) – Bias training type for LoRA adaptation, can be one of [“none”, “all”, “lora_only”]. “none” means not training any bias vectors; “all” means training all bias vectors, including LayerNorm biases; “lora_only” means only training bias vectors in LoRA adapted modules.
- Returns:
The LoRA adapted model.
- Return type:
torch.nn.Module
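The roles of rank and alpha can be sketched with the update rule from the LoRA paper, W' = W + (alpha / rank) * B A, where the base weight W stays frozen. The code below is plain Python on nested lists with hypothetical helper names; it illustrates the arithmetic only and is not how the adapter is built.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(weight, lora_a, lora_b, rank, alpha):
    """Return W + (alpha / rank) * (B @ A)."""
    scaling = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight, shape (out=2, in=2)
lora_a = [[1.0, 2.0]]              # A, shape (rank=1, in=2)
lora_b = [[0.5], [0.0]]            # B, shape (out=2, rank=1)
adapted = lora_effective_weight(weight, lora_a, lora_b, rank=1, alpha=2)
```

When B is all zeros, as in the paper's initialization, the adapted weight equals the base weight.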
-
espnet2.layers.create_lora_adapter.
create_new_module
(target_module: torch.nn.modules.module.Module, rank: int, alpha: int, dropout_rate: float)[source]¶ Create a new module for the given target module.
espnet2.layers.augmentation¶
-
class
espnet2.layers.augmentation.
DataAugmentation
(effects: List[Union[Tuple[float, List[Tuple[float, str, Dict]]], Tuple[float, str, Dict]]], apply_n: Tuple[int, int] = [1, 1])[source]¶ Bases:
object
A series of data augmentation effects that can be applied to a given waveform.
Note: Currently we only support single-channel waveforms.
- Parameters:
effects (list) –
a list of effects to be applied to the waveform.
Example:
[
    [0.1, "lowpass", {"cutoff_freq": 1000, "Q": 0.707}],
    [0.1, "highpass", {"cutoff_freq": 3000, "Q": 0.707}],
    [0.1, "equalization", {"center_freq": 1000, "gain": 0, "Q": 0.707}],
    [
        0.1,
        [
            [0.3, "speed_perturb", {"factor": 0.9}],
            [0.3, "speed_perturb", {"factor": 1.1}],
        ],
    ],
]
- Description:
The above list defines a series of data augmentation effects that will be randomly sampled to apply to a given waveform.
The data structure of each element can be either type1=Tuple[float, str, Dict] or type2=Tuple[float, List[type1]].
In type1, the three values are the weight of sampling this effect, the name (key) of the effect, and the keyword arguments for the effect.
In type2, the first value is the weight of sampling this effect. The second value is a list of type1 elements which are similarly defined as above.
Note that the effects defined in each type2 element are mutually exclusive (i.e., only one of them can be applied each time). This can be useful when you want to avoid applying some specific effects at the same time.
apply_n (list) – range of the number of effects to be applied to the waveform.
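The sampling behaviour described above can be sketched as follows. This is an illustrative guess at the logic, assuming apply_n is an inclusive range and that top-level entries are drawn by weight without replacement; the function `sample_effects` is hypothetical and returns only (name, kwargs) pairs instead of applying anything to audio.

```python
import random


def sample_effects(effects, apply_n=(1, 1), rng=random):
    """Pick effect (name, kwargs) pairs according to their weights."""
    n = rng.randint(apply_n[0], apply_n[1])  # assumed inclusive range
    pool = list(effects)
    chosen = []
    for _ in range(min(n, len(pool))):
        weights = [entry[0] for entry in pool]
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        entry = pool.pop(idx)  # without replacement at the top level
        if isinstance(entry[1], str):
            # type1: (weight, name, kwargs)
            chosen.append((entry[1], entry[2]))
        else:
            # type2: (weight, [type1, ...]); exactly one sub-effect is used
            subs = entry[1]
            sub = rng.choices(subs, weights=[s[0] for s in subs], k=1)[0]
            chosen.append((sub[1], sub[2]))
    return chosen


effects = [
    [0.1, "lowpass", {"cutoff_freq": 1000, "Q": 0.707}],
    [0.1, "highpass", {"cutoff_freq": 3000, "Q": 0.707}],
    [0.1, "equalization", {"center_freq": 1000, "gain": 0, "Q": 0.707}],
    [0.1, [
        [0.3, "speed_perturb", {"factor": 0.9}],
        [0.3, "speed_perturb", {"factor": 1.1}],
    ]],
]
picked = sample_effects(effects, apply_n=(2, 2), rng=random.Random(0))
```

Because the two speed_perturb variants live in one type2 entry, at most one of them can appear in a single draw.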
-
espnet2.layers.augmentation.
bandpass_filtering
(waveform, sample_rate: int, center_freq: int = 3000, Q: float = 0.707, const_skirt_gain: bool = False)[source]¶ Bandpass filter the input signal.
- Parameters:
waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
center_freq (int) – filter’s center frequency
Q (float or torch.Tensor) – https://en.wikipedia.org/wiki/Q_factor
const_skirt_gain (bool) – If True, uses a constant skirt gain (peak gain = Q). If False, uses a constant 0dB peak gain.
- Returns:
filtered signal (…, time)
- Return type:
ret (torch.Tensor)
-
espnet2.layers.augmentation.
bandreject_filtering
(waveform, sample_rate: int, center_freq: int = 3000, Q: float = 0.707)[source]¶ Two-pole band-reject filter the input signal.
- Parameters:
waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
center_freq (int) – filter’s center frequency
Q (float or torch.Tensor) – https://en.wikipedia.org/wiki/Q_factor
- Returns:
filtered signal (…, time)
- Return type:
ret (torch.Tensor)
-
espnet2.layers.augmentation.
clipping
(waveform, sample_rate: int, min_quantile: float = 0.0, max_quantile: float = 0.9)[source]¶ Apply the clipping distortion to the input signal.
- Parameters:
waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz (not used)
min_quantile (float) – lower bound on the total percent of samples to be clipped
max_quantile (float) – upper bound on the total percent of samples to be clipped
- Returns:
clipped signal (…, time)
- Return type:
ret (torch.Tensor)
-
espnet2.layers.augmentation.
codecs
(waveform, sample_rate: int, format: str, compression: Optional[float] = None, encoding: Optional[str] = None, bits_per_sample: Optional[int] = None)[source]¶ Apply the specified codecs to the input signal.
Warning: Wait until torchaudio 2.1 for this function to work.
Note
This function only supports CPU backend.
The GSM codec can be used to emulate phone line channel effects.
- Parameters:
waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
format (str) – file format. Valid values are “wav”, “mp3”, “ogg”, “vorbis”, “amr-nb”, “amb”, “flac”, “sph”, “gsm”, and “htk”.
compression (float or None, optional) –
used for formats other than WAV
For more details see torchaudio.backend.sox_io_backend.save().
encoding (str or None, optional) – change the encoding for the supported formats. Valid values are “PCM_S” (signed integer Linear PCM), “PCM_U” (unsigned integer Linear PCM), “PCM_F” (floating point PCM), “ULAW” (mu-law), and “ALAW” (a-law). For more details see torchaudio.backend.sox_io_backend.save().
bits_per_sample (int or None, optional) – change the bit depth for the supported formats. For more details see torchaudio.backend.sox_io_backend.save().
- Returns:
compressed signal (…, time)
- Return type:
ret (torch.Tensor)
-
espnet2.layers.augmentation.
contrast
(waveform, sample_rate: int = 16000, enhancement_amount: float = 75.0)[source]¶ Apply contrast effect to the input signal to make it sound louder.
- Parameters:
waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz (not used)
enhancement_amount (float) – controls the amount of the enhancement. Allowed range of values for enhancement_amount: 0-100. Note that enhancement_amount = 0 still gives a significant contrast enhancement.
- Returns:
filtered signal (…, time)
- Return type:
ret (torch.Tensor)
-
espnet2.layers.augmentation.
corrupt_phase
(waveform, sample_rate, scale: float = 0.5, n_fft: float = 0.032, win_length: Optional[float] = None, hop_length: float = 0.008, window: Optional[str] = 'hann')[source]¶ Add random noise to the phase of the input waveform.
- Parameters:
waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
scale (float) – scale factor for the phase noise
n_fft (float) – length of FFT (in seconds)
win_length (float or None) – The window length (in seconds) used for STFT. If None, it is treated as equal to n_fft
hop_length (float) – The hop size (in seconds) used for STFT
window (str or None) – The windowing function applied to the signal after padding with zeros
- Returns:
phase-corrupted signal (…, time)
- Return type:
ret (torch.Tensor)
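Since n_fft, win_length and hop_length are given as durations rather than sample counts, they have to be converted using the sample rate. The sketch below shows one way to do that; the helper name is hypothetical, and rounding to the nearest sample is an assumption (the library may truncate instead).

```python
def stft_params_in_samples(sample_rate, n_fft=0.032, win_length=None,
                           hop_length=0.008):
    """Convert STFT durations to sample counts at the given rate."""
    n_fft_samples = round(n_fft * sample_rate)
    if win_length is None:
        win_samples = n_fft_samples  # documented fallback: same as n_fft
    else:
        win_samples = round(win_length * sample_rate)
    hop_samples = round(hop_length * sample_rate)
    return n_fft_samples, win_samples, hop_samples


# At 16 kHz the defaults correspond to a 512-point FFT and a 128-sample hop.
params = stft_params_in_samples(16000)
```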
-
espnet2.layers.augmentation.
deemphasis
(waveform, sample_rate: int, coeff: float = 0.97)[source]¶ De-emphasize a waveform along the time dimension.
y[i] = x[i] + coeff * y[i - 1]
- Parameters:
waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz (not used)
coeff (float) – de-emphasis coefficient. Typically between 0.0 and 1.0.
- Returns:
de-emphasized signal (…, time)
- Return type:
ret (torch.Tensor)
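The recursion y[i] = x[i] + coeff * y[i - 1] can be traced with a minimal plain-Python sketch. It works on a list of floats and assumes y[-1] = 0 before the first sample; the function name `deemphasis_sketch` is hypothetical.

```python
def deemphasis_sketch(x, coeff=0.97):
    """Apply y[i] = x[i] + coeff * y[i - 1] with y[-1] taken as 0."""
    y = []
    previous = 0.0
    for sample in x:
        previous = sample + coeff * previous
        y.append(previous)
    return y


# A unit impulse decays geometrically with ratio coeff.
out = deemphasis_sketch([1.0, 0.0, 0.0], coeff=0.5)
```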
-
espnet2.layers.augmentation.
equalization_filtering
(waveform, sample_rate: int, center_freq: int = 1000, gain: float = 0.0, Q: float = 0.707)[source]¶ Equalization filter the input signal.
- Parameters:
waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
center_freq (int) – filter’s center frequency
gain (float or torch.Tensor) – desired gain at the boost (or attenuation) in dB
Q (float or torch.Tensor) – https://en.wikipedia.org/wiki/Q_factor
- Returns:
filtered signal (…, time)
- Return type:
ret (torch.Tensor)
-
espnet2.layers.augmentation.
highpass_filtering
(waveform, sample_rate: int, cutoff_freq: int = 3000, Q: float = 0.707)[source]¶ Highpass filter the input signal.
- Parameters:
waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
cutoff_freq (int) – filter cutoff frequency
Q (float or torch.Tensor) – https://en.wikipedia.org/wiki/Q_factor
- Returns:
filtered signal (…, time)
- Return type:
ret (torch.Tensor)
-
espnet2.layers.augmentation.
lowpass_filtering
(waveform, sample_rate: int, cutoff_freq: int = 1000, Q: float = 0.707)[source]¶ Lowpass filter the input signal.
- Parameters:
waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
cutoff_freq (int) – filter cutoff frequency
Q (float or torch.Tensor) – https://en.wikipedia.org/wiki/Q_factor
- Returns:
filtered signal (…, time)
- Return type:
ret (torch.Tensor)
-
espnet2.layers.augmentation.
pitch_shift
(waveform, sample_rate: int, n_steps: int, bins_per_octave: int = 12, n_fft: float = 0.032, win_length: Optional[float] = None, hop_length: float = 0.008, window: Optional[str] = 'hann')[source]¶ Shift the pitch of a waveform by n_steps steps.
Note: this function is slow.
- Parameters:
waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
n_steps (int) – the (fractional) steps to shift the pitch: -4 shifts the pitch down by 4/bins_per_octave octaves; 4 shifts the pitch up by 4/bins_per_octave octaves
bins_per_octave (int) – number of steps per octave
n_fft (float) – length of FFT (in seconds)
win_length (float or None) – The window length (in seconds) used for STFT. If None, it is treated as equal to n_fft
hop_length (float) – The hop size (in seconds) used for STFT
window (str or None) – The windowing function applied to the signal after padding with zeros
- Returns:
filtered signal (…, time)
- Return type:
ret (torch.Tensor)
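A shift of n_steps with bins_per_octave steps per octave corresponds to a frequency ratio of 2 ** (n_steps / bins_per_octave). The small sketch below computes that ratio; the helper `pitch_ratio` is hypothetical and does not process audio.

```python
def pitch_ratio(n_steps, bins_per_octave=12):
    """Frequency ratio produced by shifting n_steps steps."""
    return 2.0 ** (n_steps / bins_per_octave)


# Twelve semitones up doubles the frequency; twelve down halves it.
up_octave = pitch_ratio(12)
down_octave = pitch_ratio(-12)
```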
-
espnet2.layers.augmentation.
preemphasis
(waveform, sample_rate: int, coeff: float = 0.97)[source]¶ Pre-emphasize a waveform along the time dimension.
y[i] = x[i] - coeff * x[i - 1]
- Parameters:
waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz (not used)
coeff (float) – pre-emphasis coefficient. Typically between 0.0 and 1.0.
- Returns:
pre-emphasized signal (…, time)
- Return type:
ret (torch.Tensor)
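The filter y[i] = x[i] - coeff * x[i - 1] can be written out in a few lines of plain Python. The sketch assumes x[-1] = 0, so the first sample passes through unchanged; the function name `preemphasis_sketch` is hypothetical.

```python
def preemphasis_sketch(x, coeff=0.97):
    """Apply y[i] = x[i] - coeff * x[i - 1] with x[-1] taken as 0."""
    return [
        x[i] - coeff * (x[i - 1] if i > 0 else 0.0)
        for i in range(len(x))
    ]


# A constant signal is attenuated after the first sample.
out = preemphasis_sketch([1.0, 1.0, 1.0], coeff=0.5)
```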
-
espnet2.layers.augmentation.
speed_perturb
(waveform, sample_rate: int, factor: float)[source]¶ Speed perturbation which also changes the pitch.
Note: This function should be used with caution as it changes the signal duration.
- Parameters:
waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
factor (float) – speed factor (e.g., 0.9 for 90% speed)
lengths (torch.Tensor) – lengths of the input signals
- Returns:
perturbed signal (…, time)
- Return type:
ret (torch.Tensor)
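The caution about duration can be quantified: playing a signal at `factor` times the original speed changes its length to roughly n_samples / factor. The helper below is a hypothetical approximation; the exact output length depends on the resampler's rounding.

```python
def perturbed_length(n_samples, factor):
    """Approximate number of samples after speed perturbation."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return round(n_samples / factor)


# One second at 16 kHz gets longer at 90% speed and shorter at 110% speed.
slower = perturbed_length(16000, 0.9)
faster = perturbed_length(16000, 1.1)
```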
-
espnet2.layers.augmentation.
time_stretch
(waveform, sample_rate: int, factor: float, n_fft: float = 0.032, win_length: Optional[float] = None, hop_length: float = 0.008, window: Optional[str] = 'hann')[source]¶ Time scaling (speed up in time without modifying pitch) via phase vocoder.
Note: This function should be used with caution as it changes the signal duration.
- Parameters:
waveform (torch.Tensor) – audio signal (…, time)
sample_rate (int) – sampling rate in Hz
factor (float) – speed-up factor (e.g., 0.9 for 90% speed and 1.3 for 130% speed)
n_fft (float) – length of FFT (in seconds)
win_length (float or None) – The window length (in seconds) used for STFT. If None, it is treated as equal to n_fft
hop_length (float) – The hop size (in seconds) used for STFT
window (str or None) – The windowing function applied to the signal after padding with zeros
- Returns:
perturbed signal (…, time)
- Return type:
ret (torch.Tensor)
-
espnet2.layers.augmentation.
weighted_sample_without_replacement
(population, weights, k, rng=random)[source]¶
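This function has no description here. One standard way to draw k distinct items with probability proportional to weight is the Efraimidis-Spirakis key method, sketched below; whether the library uses this exact technique is an assumption, and the name `weighted_sample_sketch` is hypothetical.

```python
import random


def weighted_sample_sketch(population, weights, k, rng=random):
    """Draw k distinct items, favouring larger (positive) weights."""
    # Each item gets the key u ** (1 / w) with u uniform in [0, 1);
    # the k largest keys form the sample.
    keys = [rng.random() ** (1.0 / w) for w in weights]
    order = sorted(range(len(population)), key=lambda i: keys[i], reverse=True)
    return [population[i] for i in order[:k]]


population = ["lowpass", "highpass", "clipping", "contrast"]
picked = weighted_sample_sketch(
    population, [0.1, 0.1, 0.5, 0.3], k=2, rng=random.Random(0)
)
```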
espnet2.layers.__init__¶
espnet2.layers.global_mvn¶
-
class
espnet2.layers.global_mvn.
GlobalMVN
(stats_file: Union[pathlib.Path, str], norm_means: bool = True, norm_vars: bool = True, eps: float = 1e-20)[source]¶ Bases:
espnet2.layers.abs_normalize.AbsNormalize
,espnet2.layers.inversible_interface.InversibleInterface
Apply global mean and variance normalization
TODO(kamo): Make this class portable somehow
- Parameters:
stats_file – npy file
norm_means – Apply mean normalization
norm_vars – Apply var normalization
eps –
-
extra_repr
()[source]¶ Set the extra representation of the module
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
espnet2.layers.utterance_mvn¶
-
class
espnet2.layers.utterance_mvn.
UtteranceMVN
(norm_means: bool = True, norm_vars: bool = False, eps: float = 1e-20)[source]¶
-
espnet2.layers.utterance_mvn.
utterance_mvn
(x: torch.Tensor, ilens: torch.Tensor = None, norm_means: bool = True, norm_vars: bool = False, eps: float = 1e-20) → Tuple[torch.Tensor, torch.Tensor][source]¶ Apply utterance mean and variance normalization
- Parameters:
x – (B, T, D), assumed zero padded
ilens – (B,)
norm_means –
norm_vars –
eps –
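The effect of norm_means, norm_vars and eps can be illustrated for a single utterance. The sketch below is plain Python on a (T, D) nested list, computes statistics only over the first `length` frames, and leaves padded frames untouched; the function name and these details are assumptions rather than the library's exact behaviour.

```python
import math


def utterance_mvn_sketch(frames, length, norm_means=True, norm_vars=False,
                         eps=1e-20):
    """Normalize one (T, D) utterance using its own mean and variance."""
    valid = frames[:length]
    dim = len(valid[0])
    out = [list(frame) for frame in frames]
    for d in range(dim):
        mean = sum(frame[d] for frame in valid) / length
        var = sum((frame[d] - mean) ** 2 for frame in valid) / length
        std = max(math.sqrt(var), eps)  # eps guards against division by zero
        for t in range(length):
            if norm_means:
                out[t][d] -= mean
            if norm_vars:
                out[t][d] /= std
    return out


# Two valid frames followed by one zero-padded frame.
normalized = utterance_mvn_sketch([[1.0], [3.0], [0.0]], length=2)
```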
espnet2.layers.label_aggregation¶
-
class
espnet2.layers.label_aggregation.
LabelAggregate
(win_length: int = 512, hop_length: int = 128, center: bool = True)[source]¶ Bases:
torch.nn.modules.module.Module
espnet2.layers.time_warp¶
Time warp module.
-
class
espnet2.layers.time_warp.
TimeWarp
(window: int = 80, mode: str = 'bicubic')[source]¶ Bases:
torch.nn.modules.module.Module
Time warping using torch.interpolate.
- Parameters:
window – time warp parameter
mode – Interpolate mode
espnet2.layers.mask_along_axis¶
-
class
espnet2.layers.mask_along_axis.
MaskAlongAxis
(mask_width_range: Union[int, Sequence[int]] = (0, 30), num_mask: int = 2, dim: Union[int, str] = 'time', replace_with_zero: bool = True)[source]¶ Bases:
torch.nn.modules.module.Module
-
class
espnet2.layers.mask_along_axis.
MaskAlongAxisVariableMaxWidth
(mask_width_ratio_range: Union[float, Sequence[float]] = (0.0, 0.05), num_mask: int = 2, dim: Union[int, str] = 'time', replace_with_zero: bool = True)[source]¶ Bases:
torch.nn.modules.module.Module
Mask input spec along a specified axis with variable maximum width.
- Formula:
max_width = max_width_ratio * seq_len
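The formula above turns a ratio range into concrete mask widths that scale with the sequence length. A minimal sketch, assuming the products are floored and clamped to [0, seq_len]; the helper `mask_width_bounds` is hypothetical.

```python
import math


def mask_width_bounds(seq_len, mask_width_ratio_range=(0.0, 0.05)):
    """Convert a ratio range into (min_width, max_width) in frames."""
    low = math.floor(seq_len * mask_width_ratio_range[0])
    high = math.floor(seq_len * mask_width_ratio_range[1])
    return max(0, low), min(seq_len, high)


# With the default range, a 1000-frame input allows masks up to 50 frames.
bounds = mask_width_bounds(1000)
```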
-
espnet2.layers.mask_along_axis.
mask_along_axis
(spec: torch.Tensor, spec_lengths: torch.Tensor, mask_width_range: Sequence[int] = (0, 30), dim: int = 1, num_mask: int = 2, replace_with_zero: bool = True)[source]¶ Apply mask along the specified direction.
- Parameters:
spec – (Batch, Length, Freq)
spec_lengths – (Length): Not using lengths in this implementation
mask_width_range – Select the width randomly between this range
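The masking idea can be sketched for the time axis of a single (Length, Freq) spectrogram. The code below is plain Python on nested lists; it assumes widths are drawn from [low, high) and that masked frames are set to zero (the replace_with_zero=True case). The function name `mask_time_sketch` is hypothetical.

```python
import random


def mask_time_sketch(spec, mask_width_range=(0, 30), num_mask=2, rng=random):
    """Zero out up to num_mask random spans of frames; returns a copy."""
    length = len(spec)
    out = [list(row) for row in spec]
    for _ in range(num_mask):
        width = rng.randrange(mask_width_range[0], mask_width_range[1])
        start = rng.randrange(0, max(1, length - width))
        for t in range(start, min(length, start + width)):
            out[t] = [0.0] * len(out[t])
    return out


spec = [[1.0, 1.0, 1.0] for _ in range(100)]
masked = mask_time_sketch(spec, rng=random.Random(0))
```

The spans may overlap, so the number of masked frames is at most num_mask times the largest width.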
espnet2.layers.inversible_interface¶
-
class
-
abstract