espnet2.diar package

espnet2.diar.abs_diar

class espnet2.diar.abs_diar.AbsDiarization(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, collections.OrderedDict][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract forward_rawwav(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, collections.OrderedDict][source]
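
For orientation, a minimal sketch of a subclass that satisfies this interface (the ToyDiarization name and its layer sizes are illustrative, not part of ESPnet):

    from collections import OrderedDict
    from typing import Tuple

    import torch

    from espnet2.diar.abs_diar import AbsDiarization


    class ToyDiarization(AbsDiarization):
        """Illustrative subclass; a real model would wrap a proper encoder."""

        def __init__(self, idim: int = 80, num_spk: int = 2):
            super().__init__()
            self.linear = torch.nn.Linear(idim, num_spk)

        def forward(
            self, input: torch.Tensor, ilens: torch.Tensor
        ) -> Tuple[torch.Tensor, torch.Tensor, OrderedDict]:
            # input: (Batch, Frames, idim) -> per-frame speaker activities
            return self.linear(input), ilens, OrderedDict()

        def forward_rawwav(
            self, input: torch.Tensor, ilens: torch.Tensor
        ) -> Tuple[torch.Tensor, torch.Tensor, OrderedDict]:
            # This toy model treats raw waveforms like any other input.
            return self.forward(input, ilens)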

espnet2.diar.__init__

espnet2.diar.label_processor

class espnet2.diar.label_processor.LabelProcessor(win_length: int = 512, hop_length: int = 128, center: bool = True)[source]

Bases: torch.nn.modules.module.Module

Label aggregator for speaker diarization

forward(input: torch.Tensor, ilens: torch.Tensor)[source]

Forward.

Parameters:
  • input – (Batch, Nsamples, Label_dim)

  • ilens – (Batch)

Returns:
  output: (Batch, Frames, Label_dim)
  olens: (Batch,)
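
A usage sketch of the aggregator, which converts sample-level speaker labels into frame-level labels matching the frontend's framing (the shapes below are illustrative):

    import torch

    from espnet2.diar.label_processor import LabelProcessor

    processor = LabelProcessor(win_length=512, hop_length=128, center=True)

    # Sample-level 0/1 activity for 2 speakers over 16000 samples.
    labels = torch.randint(0, 2, (4, 16000, 2)).float()
    ilens = torch.tensor([16000, 16000, 12000, 8000])

    frame_labels, olens = processor(labels, ilens)
    print(frame_labels.shape)  # (4, Frames, 2); Frames follows the framing above
    print(olens)               # frame-level lengths per utterance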

espnet2.diar.espnet_model

class espnet2.diar.espnet_model.ESPnetDiarizationModel(frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], label_aggregator: torch.nn.modules.module.Module, encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, decoder: espnet2.diar.decoder.abs_decoder.AbsDecoder, attractor: Optional[espnet2.diar.attractor.abs_attractor.AbsAttractor], diar_weight: float = 1.0, attractor_weight: float = 1.0)[source]

Bases: espnet2.train.abs_espnet_model.AbsESPnetModel

Speaker Diarization model

If “attractor” is None, SA-EEND is used; otherwise, EEND-EDA is used. For details on SA-EEND and EEND-EDA, refer to the following papers: SA-EEND: https://arxiv.org/pdf/1909.06247.pdf; EEND-EDA: https://arxiv.org/pdf/2005.09921.pdf, https://arxiv.org/pdf/2106.10654.pdf
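
Since the variant is chosen purely by the attractor argument, a construction sketch might look as follows (the component choices here, e.g. TransformerEncoder over 80-dim features with frontend=None, are illustrative, not the only valid configuration):

    from espnet2.asr.encoder.transformer_encoder import TransformerEncoder
    from espnet2.diar.attractor.rnn_attractor import RnnAttractor
    from espnet2.diar.decoder.linear_decoder import LinearDecoder
    from espnet2.diar.espnet_model import ESPnetDiarizationModel
    from espnet2.diar.label_processor import LabelProcessor

    encoder = TransformerEncoder(input_size=80, output_size=256)
    decoder = LinearDecoder(encoder_output_size=256, num_spk=2)

    # SA-EEND: fixed number of speakers, no attractor.
    sa_eend = ESPnetDiarizationModel(
        frontend=None, specaug=None, normalize=None,
        label_aggregator=LabelProcessor(),
        encoder=encoder, decoder=decoder, attractor=None,
    )

    # EEND-EDA: an attractor enables a flexible number of speakers.
    eend_eda = ESPnetDiarizationModel(
        frontend=None, specaug=None, normalize=None,
        label_aggregator=LabelProcessor(),
        encoder=encoder, decoder=decoder,
        attractor=RnnAttractor(encoder_output_size=256),
    )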

attractor_loss(att_prob, label)[source]
static calc_diarization_error(pred, label, length)[source]
collect_feats(speech: torch.Tensor, speech_lengths: torch.Tensor, spk_labels: torch.Tensor = None, spk_labels_lengths: torch.Tensor = None, **kwargs) → Dict[str, torch.Tensor][source]
create_length_mask(length, max_len, num_output)[source]
encode(speech: torch.Tensor, speech_lengths: torch.Tensor, bottleneck_feats: torch.Tensor, bottleneck_feats_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Frontend + Encoder

Parameters:
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch,)

  • bottleneck_feats – (Batch, Length, …): used for enh + diar

forward(speech: torch.Tensor, speech_lengths: torch.Tensor = None, spk_labels: torch.Tensor = None, spk_labels_lengths: torch.Tensor = None, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Frontend + Encoder + Decoder + Calc loss

Parameters:
  • speech – (Batch, samples)

  • speech_lengths – (Batch,); default None for the chunk iterator, because the chunk iterator does not return speech_lengths (see espnet2/iterators/chunk_iter_factory.py)

  • spk_labels – (Batch, )

  • kwargs – “utt_id” is among the inputs.

pit_loss(pred, label, lengths)[source]
pit_loss_single_permute(pred, label, length)[source]
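
The permutation-invariant training (PIT) objective behind pit_loss can be sketched as a brute-force search over speaker permutations; the following is a simplified reimplementation for illustration, not the class's actual code:

    from itertools import permutations

    import torch
    import torch.nn.functional as F


    def pit_bce(pred: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
        """Simplified PIT loss for (Batch, Frames, num_spk) logits/labels."""
        num_spk = pred.size(-1)
        losses = []
        for perm in permutations(range(num_spk)):
            permuted = label[..., list(perm)]
            # Per-utterance BCE, averaged over frames and speakers.
            bce = F.binary_cross_entropy_with_logits(
                pred, permuted, reduction="none"
            ).mean(dim=(1, 2))
            losses.append(bce)
        # (num_perms, Batch) -> keep the best permutation per utterance.
        return torch.stack(losses).min(dim=0).values.mean()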

espnet2.diar.layers.tcn_nomask

class espnet2.diar.layers.tcn_nomask.ChannelwiseLayerNorm(channel_size)[source]

Bases: torch.nn.modules.module.Module

Channel-wise Layer Normalization (cLN).

forward(y)[source]

Forward.

Parameters:

y – [M, N, K], M is batch size, N is channel size, K is length

Returns:
  cLN_y: [M, N, K]

reset_parameters()[source]
class espnet2.diar.layers.tcn_nomask.Chomp1d(chomp_size)[source]

Bases: torch.nn.modules.module.Module

Truncates trailing frames so that the output length is the same as the input length.

forward(x)[source]

Forward.

Parameters:

x – [M, H, Kpad]

Returns:

[M, H, K]
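
Chomp1d is typically paired with a padded causal convolution: padding of (kernel_size - 1) * dilation keeps the output aligned, and Chomp1d removes the surplus frames at the end. A small demo:

    import torch

    from espnet2.diar.layers.tcn_nomask import Chomp1d

    pad = 2  # (kernel_size - 1) * dilation for kernel_size=3, dilation=1
    conv = torch.nn.Conv1d(8, 8, kernel_size=3, padding=pad)
    chomp = Chomp1d(chomp_size=pad)

    x = torch.randn(4, 8, 100)  # (M, H, K)
    y = conv(x)                 # (M, H, K + pad) = (4, 8, 102)
    print(chomp(y).shape)       # back to (4, 8, 100)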

class espnet2.diar.layers.tcn_nomask.DepthwiseSeparableConv(in_channels, out_channels, kernel_size, stride, padding, dilation, norm_type='gLN', causal=False)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Forward.

Parameters:

x – [M, H, K]

Returns:
  result: [M, B, K]

class espnet2.diar.layers.tcn_nomask.GlobalLayerNorm(channel_size)[source]

Bases: torch.nn.modules.module.Module

Global Layer Normalization (gLN).

forward(y)[source]

Forward.

Parameters:

y – [M, N, K], M is batch size, N is channel size, K is length

Returns:
  gLN_y: [M, N, K]

reset_parameters()[source]
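
The two normalizations differ only in the axes they aggregate over: cLN normalizes each time step over the channel dimension, while gLN normalizes over channels and time jointly. A quick shape check:

    import torch

    from espnet2.diar.layers.tcn_nomask import (
        ChannelwiseLayerNorm,
        GlobalLayerNorm,
    )

    x = torch.randn(4, 64, 100)  # (M, N, K) = (batch, channels, length)

    cln = ChannelwiseLayerNorm(64)  # per-time-step statistics over channels
    gln = GlobalLayerNorm(64)       # statistics over channels and time

    print(cln(x).shape, gln(x).shape)  # both keep (4, 64, 100)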
class espnet2.diar.layers.tcn_nomask.TemporalBlock(in_channels, out_channels, kernel_size, stride, padding, dilation, norm_type='gLN', causal=False)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Forward.

Parameters:

x – [M, B, K]

Returns:

[M, B, K]

class espnet2.diar.layers.tcn_nomask.TemporalConvNet(N, B, H, P, X, R, norm_type='gLN', causal=False)[source]

Bases: torch.nn.modules.module.Module

Basic module of TasNet.

Parameters:
  • N – Number of filters in autoencoder

  • B – Number of channels in bottleneck 1 * 1-conv block

  • H – Number of channels in convolutional blocks

  • P – Kernel size in convolutional blocks

  • X – Number of convolutional blocks in each repeat

  • R – Number of repeats

  • norm_type – BN, gLN, cLN

  • causal – causal or non-causal

forward(mixture_w)[source]

Keep this API same with TasNet.

Parameters:

mixture_w – [M, N, K], M is batch size

Returns:
  bottleneck_feature: [M, B, K]
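
The hyperparameter naming follows Conv-TasNet. An instantiation sketch with illustrative sizes:

    import torch

    from espnet2.diar.layers.tcn_nomask import TemporalConvNet

    tcn = TemporalConvNet(
        N=256,  # encoder filters (input channels)
        B=128,  # bottleneck channels
        H=512,  # channels inside each conv block
        P=3,    # kernel size
        X=8,    # blocks per repeat, with dilations 1, 2, ..., 2**7
        R=3,    # repeats
    )

    mixture_w = torch.randn(4, 256, 200)  # (M, N, K)
    feats = tcn(mixture_w)
    print(feats.shape)                    # (4, 128, 200): (M, B, K)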

espnet2.diar.layers.tcn_nomask.check_nonlinear(nolinear_type)[source]
espnet2.diar.layers.tcn_nomask.chose_norm(norm_type, channel_size)[source]

The input to the normalization is (M, C, K), where M is the batch size, C is the channel size, and K is the sequence length.

espnet2.diar.layers.abs_mask

class espnet2.diar.layers.abs_mask.AbsMask(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input, ilens, bottleneck_feat, num_spk) → Tuple[Tuple[torch.Tensor], torch.Tensor, collections.OrderedDict][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract property max_num_spk

espnet2.diar.layers.multi_mask

class espnet2.diar.layers.multi_mask.MultiMask(input_dim: int, bottleneck_dim: int = 128, max_num_spk: int = 3, mask_nonlinear='relu')[source]

Bases: espnet2.diar.layers.abs_mask.AbsMask

Multiple 1x1 convolution layer Module.

This module corresponds to the final 1x1 conv block and non-linear function in TCNSeparator. It has multiple 1x1 conv blocks, one of which is selected according to the given num_spk to handle a flexible number of speakers.

Parameters:
  • input_dim – Number of filters in autoencoder

  • bottleneck_dim – Number of channels in bottleneck 1 * 1-conv block

  • max_num_spk – Number of mask_conv1x1 modules (>= max number of speakers in the dataset)

  • mask_nonlinear – the non-linear function used to generate the mask

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, bottleneck_feat: torch.Tensor, num_spk: int) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Keep this API same with TasNet.

Parameters:
  • input – [M, K, N], M is batch size

  • ilens (torch.Tensor) – (M,)

  • bottleneck_feat – [M, K, B]

  • num_spk – number of speakers (training: oracle; inference: estimated by another module, e.g. EEND-EDA)

Returns:
  masked (List[Union[torch.Tensor, ComplexTensor]]): [(M, K, N), ...]
  ilens (torch.Tensor): (M,)
  others (OrderedDict): predicted data, e.g. masks:
    ‘mask_spk1’: torch.Tensor(Batch, Frames, Freq),
    ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq),
    ...
    ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq)

property max_num_spk
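
A usage sketch, assuming the shapes documented above (the feature sizes are illustrative):

    import torch

    from espnet2.diar.layers.multi_mask import MultiMask

    mask_module = MultiMask(input_dim=256, bottleneck_dim=128, max_num_spk=3)

    feats = torch.randn(4, 200, 256)        # encoded feature [M, K, N]
    bottleneck = torch.randn(4, 200, 128)   # separator output [M, K, B]
    ilens = torch.tensor([200, 200, 150, 120])

    # num_spk selects one of the max_num_spk 1x1 conv blocks (here: 2 <= 3).
    masked, olens, others = mask_module(feats, ilens, bottleneck, num_spk=2)
    print(len(masked), masked[0].shape)  # 2 masked sources, each (4, 200, 256)
    print(list(others.keys()))           # ['mask_spk1', 'mask_spk2']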

espnet2.diar.layers.__init__

espnet2.diar.attractor.rnn_attractor

class espnet2.diar.attractor.rnn_attractor.RnnAttractor(encoder_output_size: int, layer: int = 1, unit: int = 512, dropout: float = 0.1, attractor_grad: bool = True)[source]

Bases: espnet2.diar.attractor.abs_attractor.AbsAttractor

Encoder-decoder attractor (EDA) for speaker diarization.

forward(enc_input: torch.Tensor, ilens: torch.Tensor, dec_input: torch.Tensor)[source]

Forward.

Parameters:
  • enc_input (torch.Tensor) – hidden_space [Batch, T, F]

  • ilens (torch.Tensor) – input lengths [Batch]

  • dec_input (torch.Tensor) – decoder input (zeros) [Batch, num_spk + 1, F]

Returns:
  attractor: [Batch, num_spk + 1, F]
  att_prob: [Batch, num_spk + 1, 1]
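
A usage sketch (sizes are illustrative); dec_input is zeros with num_spk + 1 slots, where the extra slot lets att_prob signal that no further speaker exists:

    import torch

    from espnet2.diar.attractor.rnn_attractor import RnnAttractor

    attractor_module = RnnAttractor(encoder_output_size=256)

    enc_out = torch.randn(4, 150, 256)  # [Batch, T, F]
    ilens = torch.tensor([150, 150, 120, 100])
    num_spk = 2
    dec_input = torch.zeros(4, num_spk + 1, 256)

    attractor, att_prob = attractor_module(enc_out, ilens, dec_input)
    print(attractor.shape)  # (4, 3, 256)
    print(att_prob.shape)   # (4, 3, 1)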

espnet2.diar.attractor.__init__

espnet2.diar.attractor.abs_attractor

class espnet2.diar.attractor.abs_attractor.AbsAttractor(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(enc_input: torch.Tensor, ilens: torch.Tensor, dec_input: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.diar.separator.__init__

espnet2.diar.separator.tcn_separator_nomask

class espnet2.diar.separator.tcn_separator_nomask.TCNSeparatorNomask(input_dim: int, layer: int = 8, stack: int = 3, bottleneck_dim: int = 128, hidden_dim: int = 512, kernel: int = 3, causal: bool = False, norm_type: str = 'gLN')[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Temporal Convolution Separator

Note that this separator is equivalent to TCNSeparator except that it has no mask estimation part. It outputs the intermediate bottleneck feats, which are used as the input to the diarization branch in the enh_diar task. This separator is followed by the MultiMask module, which estimates the masks.

Parameters:
  • input_dim – input feature dimension

  • layer – int, number of layers in each stack.

  • stack – int, number of stacks

  • bottleneck_dim – bottleneck dimension

  • hidden_dim – number of convolution channels

  • kernel – int, kernel size.

  • causal – bool, default False.

  • norm_type – str, choose from ‘BN’, ‘gLN’, ‘cLN’

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward.

Parameters:
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]

  • ilens (torch.Tensor) – input lengths [Batch]

Returns:
  feats (torch.Tensor): [B, T, bottleneck_dim]
  ilens (torch.Tensor): (B,)

property num_spk
property output_dim
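
A usage sketch (feature sizes are illustrative); the output is the bottleneck feature consumed by the MultiMask module and the diarization branch:

    import torch

    from espnet2.diar.separator.tcn_separator_nomask import TCNSeparatorNomask

    separator = TCNSeparatorNomask(input_dim=256, bottleneck_dim=128)

    feature = torch.randn(4, 200, 256)  # encoded feature [B, T, N]
    ilens = torch.tensor([200, 200, 150, 120])

    bottleneck_feats, olens = separator(feature, ilens)
    print(bottleneck_feats.shape)       # (4, 200, 128): [B, T, bottleneck_dim]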

espnet2.diar.decoder.linear_decoder

class espnet2.diar.decoder.linear_decoder.LinearDecoder(encoder_output_size: int, num_spk: int = 2)[source]

Bases: espnet2.diar.decoder.abs_decoder.AbsDecoder

Linear decoder for speaker diarization

forward(input: torch.Tensor, ilens: torch.Tensor)[source]

Forward.

Parameters:
  • input (torch.Tensor) – hidden_space [Batch, T, F]

  • ilens (torch.Tensor) – input lengths [Batch]

property num_spk
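
A usage sketch (sizes are illustrative); the decoder projects encoder states to per-frame, per-speaker activity scores:

    import torch

    from espnet2.diar.decoder.linear_decoder import LinearDecoder

    decoder = LinearDecoder(encoder_output_size=256, num_spk=2)

    hidden = torch.randn(4, 150, 256)  # [Batch, T, F] encoder output
    ilens = torch.tensor([150, 150, 120, 100])

    output = decoder(hidden, ilens)    # per-frame speaker activity scores
    print(output.shape)                # (4, 150, 2)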

espnet2.diar.decoder.__init__

espnet2.diar.decoder.abs_decoder

class espnet2.diar.decoder.abs_decoder.AbsDecoder(*args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract property num_spk