espnet2.enh.layers.uses.USES
class espnet2.enh.layers.uses.USES(input_size, output_size, bottleneck_size=64, num_blocks=6, num_spatial_blocks=3, segment_size=64, memory_size=20, memory_types=1, rnn_type='lstm', hidden_size=128, att_heads=4, dropout=0.0, activation='relu', bidirectional=True, norm_type='cLN', ch_mode='att', ch_att_dim=256, eps=1e-05)
Bases: Module
Unconstrained Speech Enhancement and Separation (USES) Network.
Reference: [1] W. Zhang, K. Saijo, Z.-Q. Wang, S. Watanabe, and Y. Qian, “Toward Universal Speech Enhancement for Diverse Input Conditions,” in Proc. ASRU, 2023.
- Parameters:
input_size (int) – dimension of the input feature.
output_size (int) – dimension of the output.
bottleneck_size (int) – dimension of the bottleneck feature. Must be a multiple of att_heads.
num_blocks (int) – number of processing blocks.
num_spatial_blocks (int) – number of processing blocks with channel modeling.
segment_size (int) – number of frames in each non-overlapping segment. This is used to segment long utterances into smaller segments for efficient processing.
memory_size (int) – group size of global memory tokens. The basic use of memory tokens is to store the history information from previous segments. The memory tokens are updated by the output of the last block after processing each segment.
memory_types (int) – number of memory token groups. Each group corresponds to a different type of processing, i.e., the first group is used for denoising without dereverberation, the second group is used for denoising with dereverberation, and so on.
rnn_type (str) – type of the RNN cell in the improved Transformer layer.
hidden_size (int) – hidden dimension of the RNN cell.
att_heads (int) – number of attention heads in Transformer.
dropout (float) – dropout ratio. Default is 0.
activation (str) – non-linear activation function applied in each block.
bidirectional (bool) – whether the RNN layers are bidirectional.
norm_type (str) – normalization type in the improved Transformer layer.
ch_mode (str) – mode of channel modeling. Select from “att” and “tac”.
ch_att_dim (int) – dimension of the channel attention.
eps (float) – epsilon for layer normalization.
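Below is a minimal construction sketch based on the signature above; the configuration values are illustrative assumptions, not recommended settings.

```python
import torch

from espnet2.enh.layers.uses import USES

# Illustrative configuration (assumed values, not recommended settings).
# bottleneck_size must be a multiple of att_heads (64 = 16 * 4 here).
model = USES(
    input_size=4,
    output_size=4,
    bottleneck_size=64,
    num_blocks=6,
    num_spatial_blocks=3,
    segment_size=64,
    memory_size=20,
    memory_types=1,
    att_heads=4,
)
```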
forward(input, ref_channel=None, mem_idx=None)
USES forward.
- Parameters:
- input (torch.Tensor) – input feature (batch, mics, input_size, freq, time)
- ref_channel (None or int) – index of the reference channel. If None, average over all channels. If int, use the specified channel instead of averaging.
- mem_idx (None or int) – index of the memory token group. If None, use the only group of memory tokens in the model. If int, use the specified group from the multiple existing groups.
- Returns: output feature (batch, output_size, freq, time)
- Return type: output (torch.Tensor)
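A usage sketch of forward, assuming the model constructed above; the tensor sizes are arbitrary placeholders chosen only to match the documented (batch, mics, input_size, freq, time) layout.

```python
# Dummy multi-channel input with shape (batch, mics, input_size, freq, time).
batch, mics, freq, time = 1, 2, 65, 100
x = torch.randn(batch, mics, 4, freq, time)  # 4 matches input_size above

model.eval()
with torch.no_grad():
    # ref_channel=0 uses the first microphone as reference instead of
    # averaging; mem_idx is omitted since memory_types=1 above.
    y = model(x, ref_channel=0)

print(y.shape)  # expected (batch, output_size, freq, time) == (1, 4, 65, 100)
```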