espnet2.enh.extractor.td_speakerbeam_extractor.TDSpeakerBeamExtractor
class espnet2.enh.extractor.td_speakerbeam_extractor.TDSpeakerBeamExtractor(input_dim: int, layer: int = 8, stack: int = 3, bottleneck_dim: int = 128, hidden_dim: int = 512, skip_dim: int = 128, kernel: int = 3, causal: bool = False, norm_type: str = 'gLN', pre_nonlinear: str = 'prelu', nonlinear: str = 'relu', i_adapt_layer: int = 7, adapt_layer_type: str = 'mul', adapt_enroll_dim: int = 128, use_spk_emb: bool = False, spk_emb_dim: int = 256)
Bases: AbsExtractor
Time-Domain SpeakerBeam Extractor.
- Parameters:
- input_dim – input feature dimension
- layer – int, number of layers in each stack
- stack – int, number of stacks
- bottleneck_dim – bottleneck dimension
- hidden_dim – number of convolution channels
- skip_dim – int, number of skip connection channels
- kernel – int, kernel size.
- causal – bool, default False.
- norm_type – str, choose from ‘BN’, ‘gLN’, ‘cLN’
- pre_nonlinear – the nonlinear function right before mask estimation; select from ‘prelu’, ‘relu’, ‘tanh’, ‘sigmoid’, ‘linear’
- nonlinear – the nonlinear function for mask estimation; select from ‘relu’, ‘tanh’, ‘sigmoid’, ‘linear’
- i_adapt_layer – int, index of adaptation layer
- adapt_layer_type – str, type of adaptation layer; see espnet2.enh.layers.adapt_layers for options
- adapt_enroll_dim – int, dimensionality of the speaker embedding
- use_spk_emb – bool, whether to use speaker embeddings as enrollment
- spk_emb_dim – int, dimension of input speaker embeddings; only used when use_spk_emb is True
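Example (a minimal instantiation sketch; the values below are illustrative assumptions, and input_dim must match the feature dimension produced by the upstream encoder):

import torch
from espnet2.enh.extractor.td_speakerbeam_extractor import TDSpeakerBeamExtractor

extractor = TDSpeakerBeamExtractor(
    input_dim=256,           # must match the encoder's output feature dimension
    layer=8,
    stack=3,
    i_adapt_layer=7,         # index of the conv block where speaker adaptation is applied
    adapt_layer_type="mul",  # multiplicative adaptation with the enrollment embedding
    adapt_enroll_dim=128,
)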
forward(input: Tensor | ComplexTensor, ilens: Tensor, input_aux: Tensor, ilens_aux: Tensor, suffix_tag: str = '', additional: Dict | None = None) → Tuple[List[Tensor | ComplexTensor], Tensor, OrderedDict]
TD-SpeakerBeam Forward.
Parameters:
- input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
- ilens (torch.Tensor) – input lengths [Batch]
- input_aux (torch.Tensor or ComplexTensor) – Encoded auxiliary feature for the target speaker [B, T, N] or [B, N]
- ilens_aux (torch.Tensor) – input lengths of auxiliary input for the target speaker [Batch]
- suffix_tag (str) – suffix to append to the keys in others
- additional (None or dict) – additional parameters; not used in this model
Returns:
- masked (List[Union[torch.Tensor, ComplexTensor]]): [(B, T, N), …]
- ilens (torch.Tensor): (B,)
- others (OrderedDict): predicted data, e.g. masks:
    f’mask{suffix_tag}’: torch.Tensor(Batch, Frames, Freq),
    f’enroll_emb{suffix_tag}’: torch.Tensor(Batch, adapt_enroll_dim or adapt_enroll_dim*2)
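Example (a hedged forward sketch using random tensors in place of real encoder outputs; shapes follow the docstring above, and all values are illustrative):

import torch
from espnet2.enh.extractor.td_speakerbeam_extractor import TDSpeakerBeamExtractor

extractor = TDSpeakerBeamExtractor(input_dim=256)

B, T, N = 4, 100, 256
feats = torch.rand(B, T, N)                    # encoded mixture [B, T, N]
ilens = torch.full((B,), T, dtype=torch.long)  # input lengths [B]
aux = torch.rand(B, T, N)                      # encoded enrollment utterance for the target speaker
ilens_aux = torch.full((B,), T, dtype=torch.long)

masked, olens, others = extractor(feats, ilens, aux, ilens_aux, suffix_tag="_spk1")
# masked: list containing one tensor of shape (B, T, N)
# others: OrderedDict with keys "mask_spk1" and "enroll_emb_spk1"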