espnet2.enh.separator.conformer_separator.ConformerSeparator
class espnet2.enh.separator.conformer_separator.ConformerSeparator(input_dim: int, num_spk: int = 2, predict_noise: bool = False, adim: int = 384, aheads: int = 4, layers: int = 6, linear_units: int = 1536, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, normalize_before: bool = False, concat_after: bool = False, dropout_rate: float = 0.1, input_layer: str = 'linear', positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.1, nonlinear: str = 'relu', conformer_pos_enc_layer_type: str = 'rel_pos', conformer_self_attn_layer_type: str = 'rel_selfattn', conformer_activation_type: str = 'swish', use_macaron_style_in_conformer: bool = True, use_cnn_in_conformer: bool = True, conformer_enc_kernel_size: int = 7, padding_idx: int = -1)
Bases: AbsSeparator
Conformer separator.
- Parameters:
- input_dim – input feature dimension
- num_spk – number of speakers
- predict_noise – whether to output the estimated noise signal
- adim (int) – Dimension of attention.
- aheads (int) – The number of heads of multi head attention.
- linear_units (int) – The number of units of position-wise feed forward.
- layers (int) – The number of transformer blocks.
- dropout_rate (float) – Dropout rate.
- input_layer (Union[str, torch.nn.Module]) – Input layer type.
- attention_dropout_rate (float) – Dropout rate in attention.
- positional_dropout_rate (float) – Dropout rate after adding positional encoding.
- normalize_before (bool) – Whether to use layer_norm before the first block.
- concat_after (bool) – Whether to concatenate the attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))). If False, no additional linear layer is applied, i.e. x -> x + att(x).
- conformer_pos_enc_layer_type (str) – Encoder positional encoding layer type.
- conformer_self_attn_layer_type (str) – Encoder attention layer type.
- conformer_activation_type (str) – Encoder activation function type.
- positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.
- positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.
- use_macaron_style_in_conformer (bool) – Whether to use macaron style for positionwise layer.
- use_cnn_in_conformer (bool) – Whether to use convolution module.
- conformer_enc_kernel_size (int) – Kernel size of convolution module.
- padding_idx (int) – Padding idx for input_layer=embed.
- nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
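The two residual forms described for concat_after can be sketched in plain PyTorch. This is only an illustration of the formulas in the parameter list, not ESPnet's encoder layer: the class name `ResidualAttention` is hypothetical, `torch.nn.MultiheadAttention` stands in for the relative-position self-attention the separator uses by default, and dropout and layer normalization are left out.

```python
import torch
import torch.nn as nn


class ResidualAttention(nn.Module):
    """Hypothetical helper showing the two concat_after variants."""

    def __init__(self, dim: int, heads: int, concat_after: bool):
        super().__init__()
        # Stand-in attention module (assumption: not the rel_selfattn layer).
        self.att = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.concat_after = concat_after
        if concat_after:
            # Projects concat(x, att(x)) of size 2*dim back to dim.
            self.concat_linear = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        att_out, _ = self.att(x, x, x)
        if self.concat_after:
            # x -> x + linear(concat(x, att(x)))
            return x + self.concat_linear(torch.cat((x, att_out), dim=-1))
        # x -> x + att(x)
        return x + att_out


x = torch.rand(2, 5, 8)  # (batch, frames, dim)
y_plain = ResidualAttention(8, 2, concat_after=False)(x)
y_concat = ResidualAttention(8, 2, concat_after=True)(x)
```

Both variants keep the feature dimension unchanged; the concat_after=True form only adds one extra linear projection per block.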
forward(input: Tensor | ComplexTensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor | ComplexTensor], Tensor, OrderedDict]
Forward.
Parameters:
- input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
- ilens (torch.Tensor) – input lengths [Batch]
- additional (Dict or None) – other data included in model. NOTE: not used in this model.
Returns:
- masked (List[Union(torch.Tensor, ComplexTensor)]) – [(B, T, N), …]
- ilens (torch.Tensor) – (B,)
- others (OrderedDict) – other predicted data, e.g. masks: OrderedDict[‘mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq)]
property num_spk