espnet2.enh.layers.bsrnn.BSRNN
class espnet2.enh.layers.bsrnn.BSRNN(input_dim=481, num_channel=16, num_layer=6, target_fs=48000, causal=True, num_spk=1, norm_type='GN')
Bases: Module
Band-Split RNN (BSRNN).
References
[1] J. Yu, H. Chen, Y. Luo, R. Gu, and C. Weng, “High fidelity speech enhancement with band-split RNN,” in Proc. ISCA Interspeech, 2023. https://isca-speech.org/archive/interspeech_2023/yu23b_interspeech.html
[2] J. Yu and Y. Luo, “Efficient monaural speech enhancement with universal sample rate band-split RNN,” in Proc. ICASSP, 2023. https://ieeexplore.ieee.org/document/10096020
- Parameters:
- input_dim (int) – maximum number of frequency bins corresponding to target_fs
- num_channel (int) – embedding dimension of each time-frequency bin
- num_layer (int) – number of time and frequency RNN layers
- target_fs (int) – maximum sampling frequency supported by the model
- causal (bool) – whether to adopt causal processing; if True, a unidirectional LSTM is used instead of a BLSTM for temporal modeling
- num_spk (int) – number of outputs to be generated
- norm_type (str) – type of normalization layer (cfLN / cLN / BN / GN)
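Since `input_dim` is the number of frequency bins at the maximum supported sampling rate `target_fs`, an input recorded at a lower rate only occupies a proportional prefix of those bins. The sketch below illustrates that relationship; the linear-proportion formula is an assumption for illustration and may differ from the exact rounding used inside ESPnet.

```python
# Hedged sketch: how many frequency bins are "effective" for an input sampled
# at fs < target_fs, assuming bins are linearly spaced from 0 up to the
# Nyquist frequency target_fs / 2. The exact rounding rule used internally by
# the ESPnet implementation is an assumption here.
def effective_bins(input_dim: int, fs: int, target_fs: int) -> int:
    """Number of frequency bins covering spectral content up to fs / 2."""
    return int(input_dim * fs / target_fs)

# With the defaults (input_dim=481 bins spanning 0..24 kHz at target_fs=48000),
# a 16 kHz input only occupies roughly the first third of the bins:
print(effective_bins(481, 16000, 48000))
```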
forward(x, fs=None)
BSRNN forward.
- Parameters:
- x (torch.Tensor) – input tensor of shape (B, T, F, 2)
- fs (int, optional) – sampling rate of the input signal. If not None, the input is truncated so that only the effective frequency subbands are processed. If None, the input is assumed to be already truncated to contain only the effective frequency subbands.
- Returns: output tensor of shape (B, num_spk, T, F, 2)
- Return type: out (torch.Tensor)
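The forward pass expects a complex STFT packed as real/imaginary pairs in the last dimension. A minimal sketch of producing such an input with PyTorch, assuming an STFT with `n_fft=960` and `hop_length=480` (chosen here so that a 48 kHz signal yields the default 481 frequency bins; these STFT settings are an assumption, not part of the BSRNN API):

```python
import torch

# Batch of two 1-second waveforms at 48 kHz (assumed target_fs).
signal = torch.randn(2, 48000)

# STFT with 960-point FFT -> 960 // 2 + 1 = 481 frequency bins,
# matching the default input_dim of the model.
spec = torch.stft(
    signal,
    n_fft=960,
    hop_length=480,
    window=torch.hann_window(960),
    return_complex=True,
)  # (batch, freq, frames)

# Rearrange to (batch, frames, freq) and unpack complex values into a
# trailing real/imag axis, giving the documented (B, T, F, 2) layout.
x = torch.view_as_real(spec.transpose(1, 2))
print(x.shape)  # (B, T, F, 2)
```

The output of `forward` then adds a speaker axis, giving shape (B, num_spk, T, F, 2).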