espnet2.enh.layers.uses2_comp.USES2_Comp
class espnet2.enh.layers.uses2_comp.USES2_Comp(input_size, output_size, bottleneck_size=64, num_blocks=4, num_spatial_blocks=2, segment_size=64, memory_size=20, memory_types=1, input_resolution=(130, 64), window_size=(10, 8), mlp_ratio=4, qkv_bias=True, qk_scale=None, att_heads=4, dropout=0.0, att_dropout=0.0, drop_path=0.0, use_checkpoint=False, rnn_type='lstm', hidden_size=128, activation='relu', bidirectional=True, norm_type='cLN', ch_mode='att_tac', ch_att_dim=256, eps=1e-05)
Bases: Module
Unconstrained Speech Enhancement and Separation v2 (USES2-Comp) Network.
References:
[1] W. Zhang, J.-w. Jung, and Y. Qian, “Improving Design of Input Condition Invariant Speech Enhancement,” in Proc. ICASSP, 2024.
[2] W. Zhang, K. Saijo, Z.-Q. Wang, S. Watanabe, and Y. Qian, “Toward Universal Speech Enhancement for Diverse Input Conditions,” in Proc. ASRU, 2023.
- Parameters:
input_size (int) – dimension of the input feature.
output_size (int) – dimension of the output.
bottleneck_size (int) – dimension of the bottleneck feature. Must be a multiple of att_heads.
num_blocks (int) – number of processing blocks.
num_spatial_blocks (int) – number of processing blocks with channel modeling.
segment_size (int) – number of frames in each non-overlapping segment. This is used to segment long utterances into smaller segments for efficient processing.
memory_size (int) – group size of global memory tokens. The basic use of memory tokens is to store the history information from previous segments. The memory tokens are updated by the output of the last block after processing each segment.
memory_types (int) – number of memory token groups. Each group corresponds to a different type of processing, i.e., the first group is used for denoising without dereverberation, while the second group is used for denoising with dereverberation.
input_resolution (tuple) – frequency and time dimension of the input feature. Only used for efficient training. Should be close to the actual spectrum size (F, T) of training samples.
window_size (tuple) – size of the Time-Frequency window in Swin-Transformer.
mlp_ratio (int) – ratio of the MLP hidden size to embedding size in BasicLayer.
qkv_bias (bool) – If True, add a learnable bias to query, key, value in BasicLayer.
qk_scale (float) – if set, overrides the default qk scale of head_dim ** -0.5 in BasicLayer.
att_heads (int) – number of attention heads in Transformer.
dropout (float) – dropout ratio. Default is 0.
att_dropout (float) – dropout ratio in attention in BasicLayer.
drop_path (float) – drop-path ratio in BasicLayer.
use_checkpoint (bool) – whether to use checkpointing to save memory.
rnn_type (str) – type of the RNN cell in the improved Transformer layer.
hidden_size (int) – hidden dimension of the RNN cell.
activation (str) – non-linear activation function applied in each block.
bidirectional (bool) – whether the RNN layers are bidirectional.
norm_type (str) – normalization type in the improved Transformer layer.
ch_mode (str) – mode of channel modeling. Select from “att”, “tac”, and “att_tac”.
ch_att_dim (int) – dimension of the channel attention.
eps (float) – epsilon for layer normalization.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
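A minimal construction sketch (a hedged example, not taken from the ESPnet recipes; the feature sizes below are illustrative assumptions):

```python
import torch

from espnet2.enh.layers.uses2_comp import USES2_Comp

# Illustrative sizes (assumptions): input_size / output_size would normally
# match the feature dimension produced by the preceding encoder.
model = USES2_Comp(
    input_size=16,
    output_size=16,
    bottleneck_size=64,          # must be a multiple of att_heads (64 = 16 * 4)
    att_heads=4,
    num_blocks=4,
    num_spatial_blocks=2,
    segment_size=64,
    memory_size=20,
    memory_types=1,
    input_resolution=(130, 64),  # should be close to the actual (freq, time) size
    window_size=(10, 8),
    ch_mode="att_tac",
)
```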
forward(input, ref_channel=None, mem_idx=None)
USES2-Comp forward.
- Parameters:
- input (torch.Tensor) – input feature (batch, mics, input_size, freq, time)
- ref_channel (None or int) – index of the reference channel.
- mem_idx (None or int) – index of the memory token group to use (see memory_types).
- Returns: output feature (batch, output_size, freq, time)
- Return type: torch.Tensor
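Continuing the construction sketch above, a shape-level usage example (all tensor sizes are arbitrary assumptions, chosen to match the default input_resolution=(130, 64)):

```python
# Input: (batch, mics, input_size, freq, time); here a 2-mic batch of 2 samples.
x = torch.randn(2, 2, 16, 130, 64)

# ref_channel selects the reference microphone for multi-channel input.
y = model(x, ref_channel=0)

print(y.shape)  # expected: torch.Size([2, 16, 130, 64]) -> (batch, output_size, freq, time)
```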