espnet2.enh.separator.svoice_separator.SVoiceSeparator
class espnet2.enh.separator.svoice_separator.SVoiceSeparator(input_dim: int, enc_dim: int, kernel_size: int, hidden_size: int, num_spk: int = 2, num_layers: int = 4, segment_size: int = 20, bidirectional: bool = True, input_normalize: bool = False)
Bases: AbsSeparator
SVoice model for speech separation.
Reference: Voice Separation with an Unknown Number of Multiple Speakers; E. Nachmani et al., 2020; https://arxiv.org/abs/2003.01531
Parameters:
- enc_dim – int, dimension of the encoder module’s output. (Default: 128)
- kernel_size – int, the kernel size of Conv1D layer in both encoder and decoder modules. (Default: 8)
- hidden_size – int, dimension of the hidden state in RNN layers. (Default: 128)
- num_spk – int, the number of speakers in the output. (Default: 2)
- num_layers – int, number of stacked MulCat blocks. (Default: 4)
- segment_size – dual-path segment size. (Default: 20)
- bidirectional – bool, whether the RNN layers are bidirectional. (Default: True)
- input_normalize – bool, whether to apply GroupNorm on the input Tensor. (Default: False)
Initializes internal Module state, shared by both nn.Module and ScriptModule.
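A minimal instantiation sketch. The dimensions below are illustrative rather than recommended values, and `input_dim` is assumed to equal the feature dimension of the encoder output fed to the separator:

```python
import torch

from espnet2.enh.separator.svoice_separator import SVoiceSeparator

# Illustrative configuration for a 2-speaker separator; adjust the dimensions
# to match the encoder used in the enhancement pipeline.
separator = SVoiceSeparator(
    input_dim=128,        # assumed to equal the encoder output feature dim N
    enc_dim=128,
    kernel_size=8,
    hidden_size=128,
    num_spk=2,
    num_layers=4,
    segment_size=20,
    bidirectional=True,
    input_normalize=False,
)
```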
forward(input: Tensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor], Tensor, OrderedDict]
Forward.
Parameters:
- input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
- ilens (torch.Tensor) – input lengths [Batch]
- additional (Dict or None) – other data included in the model. NOTE: not used in this model
Returns:
- masked (List[Union[torch.Tensor, ComplexTensor]]): separated features [(B, T, N), …]
- ilens (torch.Tensor): (B,)
- others (OrderedDict): other predicted data, e.g. masks:
  ‘mask_spk1’: torch.Tensor(Batch, Frames, Freq),
  ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq),
  …
  ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq)
Return type: masked (List[Union[torch.Tensor, ComplexTensor]])
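A usage sketch of `forward`, assuming the documented shapes (an encoded feature of shape [B, T, N] with N equal to `input_dim`) and the `separator` instance from the example above:

```python
# Dummy encoded features: batch of 4 utterances, 100 frames, feature dim 128.
feats = torch.randn(4, 100, 128)
ilens = torch.full((4,), 100, dtype=torch.long)

masked, olens, others = separator(feats, ilens)

# Per the docstring, `masked` is a list of separated feature streams of shape
# (B, T, N), `olens` holds the output lengths, and `others` is an OrderedDict
# with any additional predictions (e.g. per-speaker masks).
print(len(masked), masked[0].shape, olens.shape)
```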
property num_spk