espnet2.enh.separator.tfgridnetv3_separator.TFGridNetV3
class espnet2.enh.separator.tfgridnetv3_separator.TFGridNetV3(input_dim, n_srcs=2, n_imics=1, n_layers=6, lstm_hidden_units=192, attn_n_head=4, attn_qk_output_channel=4, emb_dim=48, emb_ks=4, emb_hs=1, activation='prelu', eps=1e-05)
Bases: AbsSeparator
Offline TFGridNetV3.
On top of TFGridNetV2, TFGridNetV3 slightly modifies the internal architecture to make the model sampling-frequency-independent (SFI). This is achieved by making all network layers independent of the input time and frequency dimensions.
Reference: [1] Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation”, in TASLP, 2023. [2] Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “TF-GridNet: Making Time-Frequency Domain Models Great Again for Monaural Speaker Separation”, in ICASSP, 2023.
NOTES: As outlined in the references, this model works best when trained with variance-normalized mixture input and targets. For example, with a mixture of shape [batch, samples, microphones], normalize it by dividing by torch.std(mixture, (1, 2)). You must apply the same normalization to the target signals. This is especially encouraged when not using scale-invariant loss functions such as SI-SDR. Specifically, use:
std_ = std(mix)
mix = mix / std_
tgt = tgt / std_
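The normalization above can be sketched in PyTorch as follows. This is an illustrative snippet, not ESPnet library code; the tensor shapes are example values, and keepdim=True is used so the per-example std broadcasts correctly against the [batch, samples, microphones] layout:

```python
import torch

# Illustrative shapes: 4 mixtures, 16000 samples, 2 microphones.
mixture = torch.randn(4, 16000, 2)  # [batch, samples, microphones]
target = torch.randn(4, 16000, 2)   # one target source, same layout

# Per-example std over the samples and microphones dimensions.
# keepdim=True gives shape [batch, 1, 1] so division broadcasts.
std_ = torch.std(mixture, dim=(1, 2), keepdim=True)

# Normalize mixture and target by the SAME std, preserving their
# relative scale (important for non-scale-invariant losses).
mixture = mixture / std_
target = target / std_
```

After this, each mixture in the batch has unit standard deviation, while the targets are scaled consistently with their mixture.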
- Parameters:
- input_dim – placeholder, not used
- n_srcs – number of output sources/speakers.
- n_fft – STFT window size.
- stride – STFT stride.
- window – STFT window type; choose between 'hamming', 'hanning', or None.
- n_imics – number of microphone channels (only fixed-array geometry supported).
- n_layers – number of TFGridNetV3 blocks.
- lstm_hidden_units – number of hidden units in LSTM.
- attn_n_head – number of heads in self-attention
- attn_qk_output_channel – output channels of the point-wise Conv2d used to compute keys and queries
- emb_dim – embedding dimension
- emb_ks – kernel size for unfolding and deconv1D
- emb_hs – hop size for unfolding and deconv1D
- activation – activation function used throughout the TFGridNetV3 model; any torch-supported activation can be used, e.g., 'relu' or 'elu'.
- eps – small epsilon for normalization layers.
- use_builtin_complex – whether to use builtin complex type or not.
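To make the emb_ks/emb_hs parameters concrete: inside each block, TFGridNet-style models unfold one axis (time or frequency) with kernel size emb_ks and hop emb_hs before the LSTM, and undo this with a transposed convolution of the same kernel and stride. The sketch below only demonstrates the unfolding shape arithmetic with the class defaults; it is not ESPnet code, and B and F are arbitrary example values:

```python
import torch

emb_dim, emb_ks, emb_hs = 48, 4, 1  # defaults from the constructor
B, F = 2, 65                        # example batch size and frequency bins
x = torch.randn(B, emb_dim, F)      # per-frame embeddings along frequency

# Unfold the frequency axis: [B, emb_dim, F] -> [B, emb_dim * emb_ks, n_chunks]
# (the trailing singleton dim adapts the 1-D axis to 2-D unfold).
unfolded = torch.nn.functional.unfold(
    x.unsqueeze(-1), kernel_size=(emb_ks, 1), stride=(emb_hs, 1)
)

# Number of overlapping chunks produced by the sliding window.
n_chunks = (F - emb_ks) // emb_hs + 1  # 62 for these values
```

Because no layer's weights depend on F (or on the number of time frames), this unfold/deconv scheme is what lets the model stay sampling-frequency-independent.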
Initializes internal Module state, shared by both nn.Module and ScriptModule.
forward(input: Tensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor], Tensor, OrderedDict]
Forward.
Parameters:
- input (torch.Tensor) – batched input tensor of shape [B, T, F]
- ilens (torch.Tensor) – input lengths [B]
- additional (Dict or None) – other data, currently unused in this model.
Returns:
- enhanced (List[torch.Tensor]): [(B, T), …] list of length n_srcs of mono audio tensors with T samples.
- ilens (torch.Tensor): (B,)
- additional (Dict or None): other data, currently unused in this model; returned unchanged in the output.
Return type: Tuple[List[torch.Tensor], torch.Tensor, OrderedDict]
property num_spk
Number of output sources/speakers (n_srcs).