espnet2.enh.separator.transformer_separator.TransformerSeparator

About 1 min

espnet2.enh.separator.transformer_separator.TransformerSeparator

class espnet2.enh.separator.transformer_separator.TransformerSeparator(input_dim: int, num_spk: int = 2, predict_noise: bool = False, adim: int = 384, aheads: int = 4, layers: int = 6, linear_units: int = 1536, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, normalize_before: bool = False, concat_after: bool = False, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.1, use_scaled_pos_enc: bool = True, nonlinear: str = 'relu')

Bases: AbsSeparator

Transformer separator.

Parameters:
- input_dim – input feature dimension
- num_spk – number of speakers
- predict_noise – whether to output the estimated noise signal
- adim (int) – Dimension of attention.
- aheads (int) – The number of heads of multi head attention.
- linear_units (int) – The number of units of position-wise feed forward.
- layers (int) – The number of transformer blocks.
- dropout_rate (float) – Dropout rate.
- attention_dropout_rate (float) – Dropout rate in attention.
- positional_dropout_rate (float) – Dropout rate after adding positional encoding.
- normalize_before (bool) – Whether to use layer_norm before the first block.
- concat_after (bool) – Whether to concat attention layer’s input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
- positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.
- positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.
- use_scaled_pos_enc (bool) – use scaled positional encoding or not
- nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’

forward(input: Tensor | ComplexTensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor | ComplexTensor], Tensor, OrderedDict]

Forward.

Parameters:
- input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
- ilens (torch.Tensor) – input lengths [Batch]
- additional (Dict or None) – other data included in model NOTE: not used in this model
Returns: [(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[
’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]
Return type: masked (List[Union(torch.Tensor, ComplexTensor)])

property num_spk