espnet2.enh.layers.bsrnn.BSRNN
class espnet2.enh.layers.bsrnn.BSRNN(input_dim=481, num_channel=16, num_layer=6, target_fs=48000, causal=True, num_spk=1, norm_type='GN')
Bases: Module
Band-Split RNN (BSRNN).
References
[1] J. Yu, H. Chen, Y. Luo, R. Gu, and C. Weng, “High fidelity speech enhancement with band-split RNN,” in Proc. ISCA Interspeech, 2023. https://isca-speech.org/archive/interspeech_2023/yu23b_interspeech.html
[2] J. Yu and Y. Luo, “Efficient monaural speech enhancement with universal sample rate band-split RNN,” in Proc. ICASSP, 2023. https://ieeexplore.ieee.org/document/10096020
- Parameters:
- input_dim (int) – maximum number of frequency bins corresponding to target_fs
- num_channel (int) – embedding dimension of each time-frequency bin
- num_layer (int) – number of time and frequency RNN layers
- target_fs (int) – maximum sampling frequency supported by the model
- causal (bool) – whether to adopt causal processing; if True, a unidirectional LSTM is used instead of a BLSTM for temporal modeling
- num_spk (int) – number of outputs to be generated
- norm_type (str) – type of normalization layer (cfLN / cLN / BN / GN)
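Since `input_dim` is the number of frequency bins at the maximum supported sampling rate `target_fs`, an input recorded at a lower rate only occupies a proportional prefix of those bins. The sketch below illustrates that relationship; the linear-proportion formula is an assumption for illustration and may differ from the exact rounding used inside ESPnet.

```python
# Hedged sketch: how many frequency bins are "effective" for an input sampled
# at fs < target_fs, assuming bins are linearly spaced from 0 up to the
# Nyquist frequency target_fs / 2. The exact rounding rule used internally by
# the ESPnet implementation is an assumption here.
def effective_bins(input_dim: int, fs: int, target_fs: int) -> int:
    """Number of frequency bins covering spectral content up to fs / 2."""
    return int(input_dim * fs / target_fs)

# With the defaults (input_dim=481 bins spanning 0..24 kHz at target_fs=48000),
# a 16 kHz input only occupies roughly the first third of the bins:
print(effective_bins(481, 16000, 48000))
```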
forward(x, fs=None)
BSRNN forward.
- Parameters:
- x (torch.Tensor) – input tensor of shape (B, T, F, 2)
- fs (int, optional) – sampling rate of the input signal. If not None, the input is truncated so that only the effective frequency subbands are processed. If None, the input is assumed to be already truncated to contain only the effective frequency subbands.
- Returns: output tensor of shape (B, num_spk, T, F, 2)
- Return type: out (torch.Tensor)
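The forward pass expects a complex STFT packed as real/imaginary pairs in the last dimension. A minimal sketch of producing such an input with PyTorch, assuming an STFT with `n_fft=960` and `hop_length=480` (chosen here so that a 48 kHz signal yields the default 481 frequency bins; these STFT settings are an assumption, not part of the BSRNN API):

```python
import torch

# Batch of two 1-second waveforms at 48 kHz (assumed target_fs).
signal = torch.randn(2, 48000)

# STFT with 960-point FFT -> 960 // 2 + 1 = 481 frequency bins,
# matching the default input_dim of the model.
spec = torch.stft(
    signal,
    n_fft=960,
    hop_length=480,
    window=torch.hann_window(960),
    return_complex=True,
)  # (batch, freq, frames)

# Rearrange to (batch, frames, freq) and unpack complex values into a
# trailing real/imag axis, giving the documented (B, T, F, 2) layout.
x = torch.view_as_real(spec.transpose(1, 2))
print(x.shape)  # (B, T, F, 2)
```

The output of `forward` then adds a speaker axis, giving shape (B, num_spk, T, F, 2).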