espnet2.enh.separator.uses2_separator.USES2Separator
class espnet2.enh.separator.uses2_separator.USES2Separator(input_dim: int, num_spk: int = 2, enc_channels: int = 256, bottleneck_size: int = 64, num_blocks: int = 4, num_spatial_blocks: int = 2, ref_channel: int | None = None, tf_mode: str = 'comp', swin_block_depth: int | Tuple[int] = (4, 4, 4, 4), segment_size: int = 64, memory_size: int = 20, memory_types: int = 1, input_resolution: Tuple[int, int] = (130, 64), window_size: Tuple[int, int] = (10, 8), mlp_ratio: int = 4, qkv_bias: bool = True, qk_scale: float | None = None, rnn_type: str = 'lstm', bidirectional: bool = True, hidden_size: int = 128, att_heads: int = 4, dropout: float = 0.0, att_dropout: float = 0.0, drop_path: float = 0.0, norm_type: str = 'cLN', activation: str = 'relu', use_checkpoint: bool = False, ch_mode: str | List[str] = 'att_tac', ch_att_dim: int = 256, eps: float = 1e-05, additional: dict = {})
Bases: AbsSeparator
Unconstrained Speech Enhancement and Separation v2 (USES2) Network.
References:
[1] W. Zhang, J.-w. Jung, and Y. Qian, “Improving Design of Input Condition Invariant Speech Enhancement,” in Proc. ICASSP, 2024.
[2] W. Zhang, K. Saijo, Z.-Q. Wang, S. Watanabe, and Y. Qian, “Toward Universal Speech Enhancement for Diverse Input Conditions,” in Proc. ASRU, 2023.
- Parameters:
input_dim (int) – input feature dimension. Not used as the model is independent of the input size.
num_spk (int) – number of speakers.
enc_channels (int) – feature dimension after the Conv1D encoder.
bottleneck_size (int) – dimension of the bottleneck feature. Must be a multiple of att_heads.
num_blocks (int) – number of processing blocks.
num_spatial_blocks (int) – number of processing blocks with channel modeling.
ref_channel (int, optional) – index of the reference channel (used in channel modeling modules).
tf_mode (str) – mode of Time-Frequency modeling. Select from “swin” and “comp”.
swin_block_depth (Tuple[int]) – depth of each Swin-Transformer block.
segment_size (int) – number of frames in each non-overlapping segment. Only used when tf_mode is “comp”; long utterances are segmented into smaller chunks for efficient processing.
memory_size (int) – group size of global memory tokens. Only used when tf_mode is “comp”. The memory tokens store the history information from previous segments and are updated by the output of the last block after processing each segment.
memory_types (int) – number of memory token groups. Only used when tf_mode is “comp”. Each group corresponds to a different type of processing, i.e., the first group is used for denoising without dereverberation and the second group is used for denoising with dereverberation.
input_resolution (tuple) – frequency and time dimension of the input feature. Only used for efficient training. Should be close to the actual spectrum size (F, T) of training samples.
window_size (tuple) – size of the Time-Frequency window in Swin-Transformer.
mlp_ratio (int) – ratio of the MLP hidden size to embedding size in BasicLayer.
qkv_bias (bool) – If True, add a learnable bias to query, key, value in BasicLayer.
qk_scale (float) – Override default qk scale of head_dim ** -0.5 in BasicLayer if set.
rnn_type (str) – select from ‘RNN’, ‘LSTM’, and ‘GRU’.
bidirectional (bool) – whether the inter-chunk RNN layers are bidirectional.
hidden_size (int) – dimension of the hidden state.
att_heads (int) – number of attention heads.
dropout (float) – dropout ratio. Default is 0.
att_dropout (float) – attention dropout ratio in BasicLayer.
drop_path (float) – drop-path ratio in BasicLayer.
norm_type (str) – type of normalization to use after each inter- or intra-chunk NN block.
activation (str) – the nonlinear activation function.
use_checkpoint (bool) – whether to use checkpointing to save memory.
ch_mode (str or list) – mode of channel modeling. Select from “att”, “tac”, and “att_tac”.
ch_att_dim (int) – dimension of the channel attention.
eps (float) – epsilon for layer normalization.
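Example (an illustrative instantiation sketch; the values below simply echo the documented defaults, and input_dim is a placeholder since the model ignores it):

from espnet2.enh.separator.uses2_separator import USES2Separator

separator = USES2Separator(
    input_dim=128,        # placeholder: the model is independent of the input size
    num_spk=2,
    enc_channels=256,
    bottleneck_size=64,   # must be a multiple of att_heads
    num_blocks=4,
    num_spatial_blocks=2,
    tf_mode="comp",       # "comp" enables the segment_size / memory_* options
    segment_size=64,
    memory_size=20,
    memory_types=1,
    att_heads=4,
    ch_mode="att_tac",
)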
forward(input: Tensor | ComplexTensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor | ComplexTensor], Tensor, OrderedDict]
Forward.
Parameters:
input (torch.Tensor or ComplexTensor) – STFT spectrum [B, T, (C,) F (,2)], where B is the batch size, T is the number of time frames, C is the number of microphone channels (optional), F is the number of frequency bins, and 2 is the real and imaginary parts (optional if input is a complex tensor).
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in the model. “mode”: one of (“no_dereverb”, “dereverb”, “both”); only used when self.tf_mode == “comp”.
- ”no_dereverb”: only use the first memory group for denoising without dereverberation
- ”dereverb”: only use the second memory group for denoising with dereverberation
- ”both”: use both memory groups for denoising with and without dereverberation
Returns:
masked (List[Union[torch.Tensor, ComplexTensor]]): [(B, T, F), …]
ilens (torch.Tensor): (B,)
others (OrderedDict): predicted data, e.g. masks:
’mask_spk1’: torch.Tensor(Batch, Frames, Freq),
’mask_spk2’: torch.Tensor(Batch, Frames, Freq),
…
’mask_spkn’: torch.Tensor(Batch, Frames, Freq)
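Example (a minimal forward-pass sketch; shapes are illustrative, a single-channel complex STFT input is assumed, and separator is the instance constructed above):

import torch

B, T, F = 4, 130, 129                                # batch, frames, frequency bins (illustrative)
spec = torch.randn(B, T, F, dtype=torch.complex64)   # STFT spectrum [B, T, F]
ilens = torch.full((B,), T, dtype=torch.long)        # input lengths [Batch]

# "mode" selects the memory-token group(s); only used when self.tf_mode == "comp"
masked, olens, others = separator(spec, ilens, additional={"mode": "no_dereverb"})

len(masked)          # num_spk separated spectra, each of shape (B, T, F)
list(others.keys())  # e.g. ["mask_spk1", "mask_spk2"]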
property num_spk