espnet2.enh.separator.uses_separator.USESSeparator
class espnet2.enh.separator.uses_separator.USESSeparator(input_dim: int, num_spk: int = 2, enc_channels: int = 256, bottleneck_size: int = 64, num_blocks: int = 6, num_spatial_blocks: int = 3, ref_channel: int | None = None, segment_size: int = 64, memory_size: int = 20, memory_types: int = 1, rnn_type: str = 'lstm', bidirectional: bool = True, hidden_size: int = 128, att_heads: int = 4, dropout: float = 0.0, norm_type: str = 'cLN', activation: str = 'relu', ch_mode: str | List[str] = 'att', ch_att_dim: int = 256, eps: float = 1e-05, additional: dict = {})
Bases: AbsSeparator
Unconstrained Speech Enhancement and Separation (USES) Network.
Reference: [1] W. Zhang, K. Saijo, Z.-Q. Wang, S. Watanabe, and Y. Qian, “Toward Universal Speech Enhancement for Diverse Input Conditions,” in Proc. ASRU, 2023.
- Parameters:
input_dim (int) – input feature dimension. Not used, as the model is independent of the input size.
num_spk (int) – number of speakers.
enc_channels (int) – feature dimension after the Conv1D encoder.
bottleneck_size (int) – dimension of the bottleneck feature. Must be a multiple of att_heads.
num_blocks (int) – number of processing blocks.
num_spatial_blocks (int) – number of processing blocks with channel modeling.
ref_channel (Optional[int]) – index of the reference channel (used in the channel modeling modules).
segment_size (int) – number of frames in each non-overlapping segment. This is used to segment long utterances into smaller chunks for efficient processing.
memory_size (int) – group size of global memory tokens. The basic use of memory tokens is to store the history information from previous segments. The memory tokens are updated by the output of the last block after processing each segment.
memory_types (int) – number of memory token groups. Each group corresponds to a different type of processing, i.e., the first group is used for denoising without dereverberation and the second group is used for denoising with dereverberation.
rnn_type (str) – select from ‘RNN’, ‘LSTM’, and ‘GRU’.
bidirectional (bool) – whether the inter-chunk RNN layers are bidirectional.
hidden_size (int) – dimension of the hidden state.
att_heads (int) – number of attention heads.
dropout (float) – dropout ratio. Default is 0.
norm_type (str) – type of normalization to use after each inter- or intra-chunk NN block.
activation (str) – the nonlinear activation function.
ch_mode (str or List[str]) – mode of channel modeling. Select from “att” and “tac”.
ch_att_dim (int) – dimension of the channel attention.
eps (float) – epsilon for layer normalization.
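The snippet below is a minimal instantiation sketch, assuming espnet2 is installed; apart from input_dim, num_spk, and ref_channel, the values mirror the documented defaults.

```python
# Minimal instantiation sketch; hyperparameters mirror the documented defaults.
from espnet2.enh.separator.uses_separator import USESSeparator

separator = USESSeparator(
    input_dim=257,         # ignored: the model is independent of the input size
    num_spk=2,             # number of output speakers
    enc_channels=256,      # feature dimension after the Conv1D encoder
    bottleneck_size=64,    # must be a multiple of att_heads (64 = 4 * 16)
    num_blocks=6,
    num_spatial_blocks=3,  # blocks that also perform channel modeling
    ref_channel=0,         # reference microphone for channel modeling
    segment_size=64,       # frames per non-overlapping segment
    memory_size=20,        # group size of global memory tokens
    memory_types=1,        # one memory group: denoising without dereverberation
)
```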
forward(input: Tensor | ComplexTensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor | ComplexTensor], Tensor, OrderedDict]
Forward.
Parameters:
input (torch.Tensor or ComplexTensor) – STFT spectrum [B, T, (C,) F (,2)], where B is the batch size, T is the number of time frames, C is the number of microphone channels (optional), F is the number of frequency bins, and 2 denotes the real and imaginary parts (optional if the input is a complex tensor).
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in the model. “mode” must be one of (“no_dereverb”, “dereverb”, “both”):
- “no_dereverb”: only use the first memory group, for denoising without dereverberation
- “dereverb”: only use the second memory group, for denoising with dereverberation
- “both”: use both memory groups, for denoising with and without dereverberation
Returns:
masked (List[Union[torch.Tensor, ComplexTensor]]): [(B, T, F), …]
ilens (torch.Tensor): (B,)
others (OrderedDict): predicted data, e.g. masks:
‘mask_spk1’: torch.Tensor(Batch, Frames, Freq),
‘mask_spk2’: torch.Tensor(Batch, Frames, Freq),
…
‘mask_spkn’: torch.Tensor(Batch, Frames, Freq)
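A forward-pass sketch under the shapes documented above: the random single-channel spectrum is purely illustrative, ComplexTensor comes from the torch_complex package, and whether the “mode” key in additional is required (rather than only meaningful when memory_types > 1) may depend on the ESPnet version.

```python
# Forward-pass sketch on dummy single-channel complex STFT input.
import torch
from torch_complex.tensor import ComplexTensor

B, T, F = 4, 100, 257                          # batch, time frames, freq bins
spec = ComplexTensor(torch.randn(B, T, F),     # real part
                     torch.randn(B, T, F))     # imaginary part
ilens = torch.full((B,), T, dtype=torch.long)  # every utterance has T frames

# "mode" selects the memory group(s); "no_dereverb" uses the first group.
masked, olens, others = separator(spec, ilens, additional={"mode": "no_dereverb"})

print(len(masked))       # num_spk separated spectra
print(masked[0].shape)   # each of shape (B, T, F)
```

A multi-channel input of shape [B, T, C, F] would additionally exercise the channel-modeling blocks, in which case ref_channel determines the reference microphone.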
property num_spk