espnet2.enh.separator.tfgridnetv2_separator.TFGridNetV2
class espnet2.enh.separator.tfgridnetv2_separator.TFGridNetV2(input_dim, n_srcs=2, n_fft=128, stride=64, window='hann', n_imics=1, n_layers=6, lstm_hidden_units=192, attn_n_head=4, attn_approx_qk_dim=512, emb_dim=48, emb_ks=4, emb_hs=1, activation='prelu', eps=1e-05, use_builtin_complex=False)
Bases: AbsSeparator
Offline TFGridNetV2. Compared with TFGridNet, TFGridNetV2 speeds up the code
by vectorizing multiple heads in self-attention and by handling Deconv1D more efficiently in each intra- and inter-block when emb_ks == emb_hs.
Reference:
[1] Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation”, in TASLP, 2023.
[2] Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “TF-GridNet: Making Time-Frequency Domain Models Great Again for Monaural Speaker Separation”, in ICASSP, 2023.
NOTES: As outlined in the references, this model works best when trained with variance-normalized mixture input and targets, e.g., with a mixture of shape [batch, samples, microphones], you normalize it by dividing by torch.std(mixture, (1, 2)). You must do the same for the target signals. This is especially encouraged when not using scale-invariant loss functions such as SI-SDR. Specifically, use:

std_ = std(mix)
mix = mix / std_
tgt = tgt / std_
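As a concrete, self-contained sketch of this normalization (shapes and variable names below are illustrative, not part of the API):

import torch

# Illustrative batch: 4 mixtures, 16000 samples, 1 microphone channel.
mix = torch.randn(4, 16000, 1)   # [batch, samples, microphones]
tgt = torch.randn(4, 2, 16000)   # assumed [batch, n_srcs, samples] layout

# Per-utterance std over samples and channels; keepdim keeps the result
# broadcastable so each example is scaled by its own statistic.
std_ = torch.std(mix, dim=(1, 2), keepdim=True)  # [batch, 1, 1]
mix = mix / std_
tgt = tgt / std_  # [batch, 1, 1] broadcasts over [batch, n_srcs, samples]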
- Parameters:
- input_dim – placeholder, not used.
- n_srcs – number of output sources/speakers.
- n_fft – STFT window size.
- stride – STFT stride (hop size).
- window – STFT window type; choose between ‘hamming’, ‘hanning’, or None.
- n_imics – number of microphone channels (only fixed-array geometry is supported).
- n_layers – number of TFGridNetV2 blocks.
- lstm_hidden_units – number of hidden units in the LSTM.
- attn_n_head – number of heads in self-attention.
- attn_approx_qk_dim – approximate dimension of frame-level key and value tensors.
- emb_dim – embedding dimension.
- emb_ks – kernel size for unfolding and Deconv1D.
- emb_hs – hop size for unfolding and Deconv1D.
- activation – activation function used throughout the TFGridNetV2 model; any torch-supported activation can be given, e.g., ‘relu’ or ‘elu’.
- eps – small epsilon for normalization layers.
- use_builtin_complex – whether to use the builtin complex type or not.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
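A minimal construction sketch (the values shown are the documented defaults; input_dim is a placeholder, so any integer works):

from espnet2.enh.separator.tfgridnetv2_separator import TFGridNetV2

separator = TFGridNetV2(
    input_dim=0,            # placeholder, not used by the model
    n_srcs=2,               # separate two speakers
    n_fft=128,
    stride=64,
    window="hann",
    n_imics=1,              # single microphone channel
    n_layers=6,
    lstm_hidden_units=192,
    attn_n_head=4,
    attn_approx_qk_dim=512,
    emb_dim=48,
    emb_ks=4,
    emb_hs=1,
    activation="prelu",
)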
forward(input: Tensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor], Tensor, OrderedDict]
Forward.
Parameters:
- input (torch.Tensor) – batched multi-channel audio tensor with M audio channels and N samples [B, N, M]
- ilens (torch.Tensor) – input lengths [B]
- additional (Dict or None) – other data, currently unused in this model.
Returns:
- enhanced (List[torch.Tensor]) – list of length n_srcs of mono audio tensors, each of shape (B, T) with T samples.
- ilens (torch.Tensor) – input lengths of shape (B,).
- additional (Dict or None) – other data, currently unused in this model; returned unchanged in the output.
Return type: Tuple[List[torch.Tensor], torch.Tensor, OrderedDict]
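A short usage sketch with dummy single-channel input, assuming separator was constructed as above with n_srcs=2 and n_imics=1 (with random weights the outputs are not meaningful, but the shapes hold):

import torch

mix = torch.randn(4, 16000, 1)                     # [B, N, M]
ilens = torch.full((4,), 16000, dtype=torch.long)  # all utterances full length

enhanced, olens, others = separator(mix, ilens)
print(len(enhanced), enhanced[0].shape)  # 2 mono outputs of shape (B, T)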
property num_spk – number of output sources/speakers (i.e., n_srcs).
static pad2(input_tensor, target_len)
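pad2 is not documented here; below is a minimal sketch of its expected behavior, assuming it right-pads the last (time) dimension with zeros up to target_len, consistent with its use for aligning unfolded frames. The name pad2_sketch is hypothetical:

import torch
import torch.nn.functional as F

def pad2_sketch(input_tensor: torch.Tensor, target_len: int) -> torch.Tensor:
    # Zero-pad only the trailing dimension; a no-op when the tensor is
    # already target_len long.
    return F.pad(input_tensor, (0, target_len - input_tensor.shape[-1]))

x = torch.randn(2, 3, 10)
assert pad2_sketch(x, 12).shape == (2, 3, 12)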