espnet2.gan_svs package

espnet2.gan_svs.abs_gan_svs

GAN-based SVS abstract class.

class espnet2.gan_svs.abs_gan_svs.AbsGANSVS(*args, **kwargs)[source]

Bases: espnet2.svs.abs_svs.AbsSVS, abc.ABC

GAN-based SVS model abstract class.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(forward_generator, *args, **kwargs) → Dict[str, Union[torch.Tensor, Dict[str, torch.Tensor], int]][source]

Return generator or discriminator loss.
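A minimal sketch of a concrete subclass is shown below. The class name and the placeholder loss are illustrative (not part of ESPnet), and the dict keys follow the return convention used throughout this package; other abstract methods inherited from AbsSVS, such as inference, are omitted for brevity.

from typing import Any, Dict

import torch

from espnet2.gan_svs.abs_gan_svs import AbsGANSVS


class ToyGANSVS(AbsGANSVS):
    """Illustrative subclass; a real model computes actual GAN losses."""

    def forward(self, forward_generator: bool, *args, **kwargs) -> Dict[str, Any]:
        # Placeholder scalar loss; a real model would run the generator or the
        # discriminator here depending on forward_generator.
        loss = torch.tensor(0.0, requires_grad=True)
        return {
            "loss": loss,
            "stats": {"loss": float(loss)},
            "weight": torch.tensor(1.0),
            "optim_idx": 0 if forward_generator else 1,
        }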

espnet2.gan_svs.espnet_model

GAN-based Singing-voice-synthesis ESPnet model.

class espnet2.gan_svs.espnet_model.ESPnetGANSVSModel(text_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], feats_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], score_feats_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], label_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], pitch_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], ying_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], duration_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], energy_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], pitch_normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], energy_normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], svs: espnet2.gan_svs.abs_gan_svs.AbsGANSVS)[source]

Bases: espnet2.train.abs_gan_espnet_model.AbsGANESPnetModel

ESPnet model for GAN-based singing voice synthesis task.

Initialize ESPnetGANSVSModel module.

collect_feats(text: torch.Tensor, text_lengths: torch.Tensor, singing: torch.Tensor, singing_lengths: torch.Tensor, label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, phn_cnt: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, duration_phn: Optional[torch.Tensor] = None, duration_phn_lengths: Optional[torch.Tensor] = None, duration_ruled_phn: Optional[torch.Tensor] = None, duration_ruled_phn_lengths: Optional[torch.Tensor] = None, duration_syb: Optional[torch.Tensor] = None, duration_syb_lengths: Optional[torch.Tensor] = None, slur: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, energy_lengths: Optional[torch.Tensor] = None, ying: Optional[torch.Tensor] = None, ying_lengths: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, **kwargs) → Dict[str, torch.Tensor][source]

Calculate features and return them as a dict.

Parameters:
  • text (Tensor) – Text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • singing (Tensor) – Singing waveform tensor (B, T_wav).

  • singing_lengths (Tensor) – Singing length tensor (B,).

  • label (Optional[Tensor]) – Label tensor (B, T_label).

  • label_lengths (Optional[Tensor]) – Label length tensor (B,).

  • phn_cnt (Optional[Tensor]) – Number of phones in each syllable (B, T_syb).

  • midi (Optional[Tensor]) – Midi tensor (B, T_label).

  • midi_lengths (Optional[Tensor]) – Midi length tensor (B,).

  • duration_phn (Optional[Tensor]) – duration tensor (T_label).

  • duration_ruled_phn (Optional[Tensor]) – duration tensor (T_phone).

  • duration_syb (Optional[Tensor]) – duration tensor (T_phone).

  • slur (Optional[Tensor]) – slur tensor (B, T_slur).

  • pitch (Optional[Tensor]) – Pitch tensor (B, T_wav), i.e. the f0 sequence.

  • pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).

  • energy (Optional[Tensor]) – Energy tensor.

  • energy_lengths (Optional[Tensor]) – Energy length tensor (B,).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).

  • sids (Optional[Tensor]) – Speaker ID tensor (B, 1).

  • lids (Optional[Tensor]) – Language ID tensor (B, 1).

Returns:

Dict of features.

Return type:

Dict[str, Tensor]

forward(text: torch.Tensor, text_lengths: torch.Tensor, singing: torch.Tensor, singing_lengths: torch.Tensor, feats: Optional[torch.Tensor] = None, feats_lengths: Optional[torch.Tensor] = None, label: Optional[torch.Tensor] = None, label_lengths: Optional[torch.Tensor] = None, phn_cnt: Optional[torch.Tensor] = None, midi: Optional[torch.Tensor] = None, midi_lengths: Optional[torch.Tensor] = None, duration_phn: Optional[torch.Tensor] = None, duration_phn_lengths: Optional[torch.Tensor] = None, duration_ruled_phn: Optional[torch.Tensor] = None, duration_ruled_phn_lengths: Optional[torch.Tensor] = None, duration_syb: Optional[torch.Tensor] = None, duration_syb_lengths: Optional[torch.Tensor] = None, slur: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, pitch_lengths: Optional[torch.Tensor] = None, energy: Optional[torch.Tensor] = None, energy_lengths: Optional[torch.Tensor] = None, ying: Optional[torch.Tensor] = None, ying_lengths: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, forward_generator: bool = True, **kwargs) → Dict[str, Any][source]

Return generator or discriminator loss with dict format.

Parameters:
  • text (Tensor) – Text index tensor (B, T_text).

  • text_lengths (Tensor) – Text length tensor (B,).

  • singing (Tensor) – Singing waveform tensor (B, T_wav).

  • singing_lengths (Tensor) – Singing length tensor (B,).

  • label (Optional[Tensor]) – Label tensor (B, T_label).

  • label_lengths (Optional[Tensor]) – Label length tensor (B,).

  • phn_cnt (Optional[Tensor]) – Number of phones in each syllable (B, T_syb).

  • midi (Optional[Tensor]) – Midi tensor (B, T_label).

  • midi_lengths (Optional[Tensor]) – Midi length tensor (B,).

  • duration_phn (Optional[Tensor]) – duration tensor (B, T_label).

  • duration_phn_lengths (Optional[Tensor]) – duration length tensor (B,).

  • duration_ruled_phn (Optional[Tensor]) – duration tensor (B, T_phone).

  • duration_ruled_phn_lengths (Optional[Tensor]) – duration length tensor (B,).

  • duration_syb (Optional[Tensor]) – duration tensor (B, T_syllable).

  • duration_syb_lengths (Optional[Tensor]) – duration length tensor (B,).

  • slur (Optional[Tensor]) – slur tensor (B, T_slur).

  • pitch (Optional[Tensor]) – Pitch tensor (B, T_wav), i.e. the f0 sequence.

  • pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).

  • energy (Optional[Tensor]) – Energy tensor.

  • energy_lengths (Optional[Tensor]) – Energy length tensor (B,).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).

  • sids (Optional[Tensor]) – Speaker ID tensor (B, 1).

  • lids (Optional[Tensor]) – Language ID tensor (B, 1).

  • forward_generator (bool) – Whether to forward generator.

  • kwargs – Additional keyword arguments; “utt_id” is among the input keys.

Returns:

  • loss (Tensor): Loss scalar tensor.

  • stats (Dict[str, float]): Statistics to be monitored.

  • weight (Tensor): Weight tensor to summarize losses.

  • optim_idx (int): Optimizer index (0 for G and 1 for D).

Return type:

Dict[str, Any]
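A hedged sketch of how a training loop might consume this dict is shown below; model, batch, and the two optimizers are assumed to exist elsewhere, and the generator/discriminator alternation simply follows the optim_idx convention above (this is not ESPnet's actual trainer).

def gan_train_step(model, batch, gen_optim, disc_optim):
    """One generator update followed by one discriminator update."""
    stats = {}
    for forward_generator in (True, False):
        out = model(**batch, forward_generator=forward_generator)
        # optim_idx == 0 selects the generator optimizer, 1 the discriminator.
        optim = gen_optim if out["optim_idx"] == 0 else disc_optim
        optim.zero_grad()
        out["loss"].backward()
        optim.step()
        stats.update(out["stats"])
    return stats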

espnet2.gan_svs.__init__

espnet2.gan_svs.joint.joint_score2wav

Joint score-to-wav module for end-to-end training.

class espnet2.gan_svs.joint.joint_score2wav.JointScore2Wav(idim: int, odim: int, segment_size: int = 32, sampling_rate: int = 22050, score2mel_type: str = 'xiaoice', score2mel_params: Dict[str, Any] = {'adim': 384, 'aheads': 4, 'conformer_activation_type': 'swish', 'conformer_dec_kernel_size': 31, 'conformer_enc_kernel_size': 7, 'conformer_pos_enc_layer_type': 'rel_pos', 'conformer_rel_pos_type': 'latest', 'conformer_self_attn_layer_type': 'rel_selfattn', 'decoder_concat_after': False, 'decoder_normalize_before': True, 'decoder_type': 'transformer', 'dlayers': 6, 'dunits': 1536, 'duration_predictor_chans': 384, 'duration_predictor_dropout_rate': 0.1, 'duration_predictor_kernel_size': 3, 'duration_predictor_layers': 2, 'elayers': 6, 'encoder_concat_after': False, 'encoder_normalize_before': True, 'encoder_type': 'transformer', 'eunits': 1536, 'init_dec_alpha': 1.0, 'init_enc_alpha': 1.0, 'init_type': 'xavier_uniform', 'lambda_dur': 0.1, 'lambda_mel': 1, 'lambda_pitch': 0.01, 'lambda_vuv': 0.01, 'langs': None, 'loss_function': 'XiaoiceSing2', 'loss_type': 'L1', 'midi_dim': 129, 'positionwise_conv_kernel_size': 1, 'positionwise_layer_type': 'conv1d', 'postnet_chans': 512, 'postnet_dropout_rate': 0.5, 'postnet_filts': 5, 'postnet_layers': 5, 'reduction_factor': 1, 'spk_embed_dim': None, 'spk_embed_integration_type': 'add', 'spks': None, 'tempo_dim': 500, 'transformer_dec_attn_dropout_rate': 0.1, 'transformer_dec_dropout_rate': 0.1, 'transformer_dec_positional_dropout_rate': 0.1, 'transformer_enc_attn_dropout_rate': 0.1, 'transformer_enc_dropout_rate': 0.1, 'transformer_enc_positional_dropout_rate': 0.1, 'use_batch_norm': True, 'use_cnn_in_conformer': True, 'use_macaron_style_in_conformer': True, 'use_masking': False, 'use_scaled_pos_enc': True, 'use_weighted_masking': False, 'zero_triu': False}, vocoder_type: str = 'hifigan_generator', vocoder_params: Dict[str, Any] = {'bias': True, 'channels': 512, 'global_channels': -1, 'kernel_size': 7, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'resblock_dilations': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'resblock_kernel_sizes': [3, 7, 11], 'upsample_kernel_sizes': [16, 16, 4, 4], 'upsample_scales': [8, 8, 2, 2], 'use_additional_convs': True, 'use_weight_norm': True}, use_pqmf: bool = False, pqmf_params: Dict[str, Any] = {'beta': 9.0, 'cutoff_ratio': 0.142, 'subbands': 4, 'taps': 62}, discriminator_type: str = 'hifigan_multi_scale_multi_period_discriminator', discriminator_params: Dict[str, Any] = {'follow_official_norm': False, 'period_discriminator_params': {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'periods': [2, 3, 5, 7, 11], 'scale_discriminator_params': {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'scale_downsample_pooling': 'AvgPool1d', 'scale_downsample_pooling_params': {'kernel_size': 4, 'padding': 2, 'stride': 2}, 'scales': 1}, generator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, 
discriminator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, use_feat_match_loss: bool = True, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, use_mel_loss: bool = True, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 22050, 'hop_length': 256, 'log_base': None, 'n_fft': 1024, 'n_mels': 80, 'win_length': None, 'window': 'hann'}, lambda_score2mel: float = 1.0, lambda_adv: float = 1.0, lambda_feat_match: float = 2.0, lambda_mel: float = 45.0, cache_generator_outputs: bool = False)[source]

Bases: espnet2.gan_svs.abs_gan_svs.AbsGANSVS

General class to jointly train score2mel and vocoder parts.

Initialize JointScore2Wav module.

Parameters:
  • idim (int) – Input vocabulary size.

  • odim (int) – Acoustic feature dimension. The actual number of output channels is 1 since the model is an end-to-end text-to-wave model, but odim is kept for compatibility to indicate the acoustic feature dimension.

  • segment_size (int) – Segment size for random windowed inputs.

  • sampling_rate (int) – Sampling rate. Not used for training, but referred to when saving waveforms during inference.

  • score2mel_type (str) – The score2mel model type.

  • score2mel_params (Dict[str, Any]) – Parameter dict for the score2mel model.

  • use_pqmf (bool) – Whether to use PQMF for multi-band vocoder.

  • pqmf_params (Dict[str, Any]) – Parameter dict for PQMF module.

  • vocoder_type (str) – The vocoder model type.

  • vocoder_params (Dict[str, Any]) – Parameter dict for vocoder model.

  • discriminator_type (str) – Discriminator type.

  • discriminator_params (Dict[str, Any]) – Parameter dict for discriminator.

  • generator_adv_loss_params (Dict[str, Any]) – Parameter dict for generator adversarial loss.

  • discriminator_adv_loss_params (Dict[str, Any]) – Parameter dict for discriminator adversarial loss.

  • use_feat_match_loss (bool) – Whether to use feat match loss.

  • feat_match_loss_params (Dict[str, Any]) – Parameter dict for feat match loss.

  • use_mel_loss (bool) – Whether to use mel loss.

  • mel_loss_params (Dict[str, Any]) – Parameter dict for mel loss.

  • lambda_score2mel (float) – Loss scaling coefficient for the score2mel model loss.

  • lambda_adv (float) – Loss scaling coefficient for adversarial loss.

  • lambda_feat_match (float) – Loss scaling coefficient for feat match loss.

  • lambda_mel (float) – Loss scaling coefficient for mel loss.

  • cache_generator_outputs (bool) – Whether to cache generator outputs.
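A minimal construction sketch follows; only idim and odim are required, every other keyword argument falls back to the defaults listed in the signature above, and the sizes chosen here are purely illustrative.

from espnet2.gan_svs.joint.joint_score2wav import JointScore2Wav

# idim/odim are illustrative (token vocabulary size and mel dimension).
model = JointScore2Wav(idim=70, odim=80)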

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, singing: torch.Tensor, singing_lengths: torch.Tensor, label: Optional[Dict[str, torch.Tensor]] = None, label_lengths: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, pitch: torch.LongTensor = None, duration: Optional[Dict[str, torch.Tensor]] = None, slur: torch.LongTensor = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, forward_generator: bool = True) → Dict[str, Any][source]

Perform generator forward.

Parameters:
  • text (LongTensor) – Batch of padded character ids (B, Tmax).

  • text_lengths (LongTensor) – Batch of lengths of each input batch (B,).

  • feats (Tensor) – Batch of padded target features (B, Lmax, odim).

  • feats_lengths (LongTensor) – Batch of the lengths of each target (B,).

  • singing (Tensor) – Singing waveform tensor (B, T_wav).

  • singing_lengths (Tensor) – Singing length tensor (B,).

  • label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).

  • label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).

  • melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).

  • pitch (FloatTensor) – Batch of padded f0 (B, Tmax).

  • duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, Tmax).

  • slur (FloatTensor) – Batch of padded slur (B, Tmax).

  • spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).

  • sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).

  • lids (Optional[Tensor]) – Batch of language IDs (B, 1).

  • forward_generator (bool) – Whether to forward generator.

Returns:

  • loss (Tensor): Loss scalar tensor.

  • stats (Dict[str, float]): Statistics to be monitored.

  • weight (Tensor): Weight tensor to summarize losses.

  • optim_idx (int): Optimizer index (0 for G and 1 for D).

Return type:

Dict[str, Any]

inference(text: torch.Tensor, feats: Optional[torch.Tensor] = None, label: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, slur: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, alpha: float = 1.0, max_len: Optional[int] = None, use_teacher_forcing: bool = False) → Dict[str, torch.Tensor][source]

Run inference.

Parameters:
  • text (Tensor) – Input text index tensor (T_text,).

  • feats (Tensor) – Feature tensor (T_feats, aux_channels).

  • label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).

  • melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).

  • duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, Tmax).

  • pitch (FloatTensor) – Batch of padded f0 (B, Tmax).

  • slur (LongTensor) – Batch of padded slur (B, Tmax).

  • sids (Tensor) – Speaker index tensor (1,).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (spk_embed_dim,).

  • lids (Tensor) – Language index tensor (1,).

  • noise_scale (float) – Noise scale value for flow.

  • noise_scale_dur (float) – Noise scale value for duration predictor.

  • alpha (float) – Alpha parameter to control the speed of generated singing.

  • max_len (Optional[int]) – Maximum length.

  • use_teacher_forcing (bool) – Whether to use teacher forcing.

Returns:

  • wav (Tensor): Generated waveform tensor (T_wav,).

  • feat_gan (Tensor): Generated feature tensor (T_text, C).

Return type:

Dict[str, Tensor]

property require_raw_singing

Return whether or not singing is required.

property require_vocoder

Return whether or not vocoder is required.

espnet2.gan_svs.joint.__init__

espnet2.gan_svs.pits.modules

class espnet2.gan_svs.pits.modules.WN(hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=0, p_dropout=0)[source]

Bases: torch.nn.modules.module.Module

forward(x, x_mask, g=None, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels)[source]
remove_weight_norm()[source]
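A hedged usage sketch for the WN block; the (B, C, T) feature layout and the (B, 1, T) mask are assumptions based on the surrounding VITS-style modules rather than statements from the docstring.

import torch

from espnet2.gan_svs.pits.modules import WN

wn = WN(hidden_channels=192, kernel_size=5, dilation_rate=1, n_layers=4)
x = torch.randn(2, 192, 100)    # assumed (B, hidden_channels, T)
x_mask = torch.ones(2, 1, 100)  # assumed (B, 1, T)
y = wn(x, x_mask)               # output keeps the layout of x
wn.remove_weight_norm()         # strip weight norm, e.g. before export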

espnet2.gan_svs.pits.ying_decoder

class espnet2.gan_svs.pits.ying_decoder.YingDecoder(hidden_channels, kernel_size, dilation_rate, n_layers, yin_start, yin_scope, yin_shift_range, gin_channels=0)[source]

Bases: torch.nn.modules.module.Module

Ying decoder module.

Initialize the YingDecoder module.

Parameters:
  • hidden_channels (int) – Number of hidden channels.

  • kernel_size (int) – Size of the convolutional kernel.

  • dilation_rate (int) – Dilation rate of the convolutional layers.

  • n_layers (int) – Number of convolutional layers.

  • yin_start (int) – Start point of the yin target signal.

  • yin_scope (int) – Scope of the yin target signal.

  • yin_shift_range (int) – Maximum number of frames to shift the yin target signal.

  • gin_channels (int, optional) – Number of global conditioning channels. Defaults to 0.
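A construction-only sketch; all argument values below are illustrative and simply exercise the parameters documented above.

from espnet2.gan_svs.pits.ying_decoder import YingDecoder

decoder = YingDecoder(
    hidden_channels=192,
    kernel_size=5,
    dilation_rate=1,
    n_layers=4,
    yin_start=20,        # illustrative crop start
    yin_scope=50,        # illustrative crop width
    yin_shift_range=10,  # illustrative maximum shift in frames
)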

crop_scope(x, yin_start, scope_shift)[source]

Crop the input tensor.

Parameters:
  • x (torch.Tensor) – Input tensor of shape [B, C, T].

  • yin_start (int) – Starting point of the yin target signal.

  • scope_shift (torch.Tensor) – Shift tensor of shape [B].

Returns:

Cropped tensor of shape [B, C, yin_scope].

Return type:

torch.Tensor

forward(z_yin, yin_gt, z_mask, g=None)[source]

Forward pass of the decoder.

Parameters:
  • z_yin (torch.Tensor) – The input yin note sequence of shape (B, C, T_yin).

  • yin_gt (torch.Tensor) – The ground truth yin note sequence of shape (B, C, T_yin).

  • z_mask (torch.Tensor) – The mask tensor of shape (B, 1, T_yin).

  • g (torch.Tensor) – The global conditioning tensor.

Returns:

  • Tensor: The predicted yin note sequence of shape (B, C, T_yin).

  • Tensor: The shifted ground truth yin note sequence of shape (B, C, T_yin).

  • Tensor: The cropped ground truth yin note sequence of shape (B, C, T_yin).

  • Tensor: The cropped input yin note sequence of shape (B, C, T_yin).

  • Tensor: The scope shift tensor of shape (B,).

Return type:

torch.Tensor

infer(z_yin, z_mask, g=None)[source]

Generate yin prediction.

Parameters:
  • z_yin (torch.Tensor) – Input yin target tensor of shape [B, yin_scope, C].

  • z_mask (torch.Tensor) – Input mask tensor of shape [B, yin_scope, 1].

  • g (torch.Tensor, optional) – Global conditioning tensor of shape [B, gin_channels, 1]. Defaults to None.

Returns:

Predicted yin tensor of shape [B, yin_scope, C].

Return type:

torch.Tensor

espnet2.gan_svs.vits.generator

Generator module in VISinger.

This code is based on https://github.com/jaywalnut310/vits.

This is a module of VISinger described in VISinger: Variational Inference with Adversarial Learning for End-to-End Singing Voice Synthesis.

class espnet2.gan_svs.vits.generator.VISingerGenerator(vocabs: int, aux_channels: int = 513, hidden_channels: int = 192, spks: Optional[int] = None, langs: Optional[int] = None, spk_embed_dim: Optional[int] = None, global_channels: int = -1, segment_size: int = 32, text_encoder_attention_heads: int = 2, text_encoder_ffn_expand: int = 4, text_encoder_blocks: int = 6, text_encoder_positionwise_layer_type: str = 'conv1d', text_encoder_positionwise_conv_kernel_size: int = 1, text_encoder_positional_encoding_layer_type: str = 'rel_pos', text_encoder_self_attention_layer_type: str = 'rel_selfattn', text_encoder_activation_type: str = 'swish', text_encoder_normalize_before: bool = True, text_encoder_dropout_rate: float = 0.1, text_encoder_positional_dropout_rate: float = 0.0, text_encoder_attention_dropout_rate: float = 0.0, text_encoder_conformer_kernel_size: int = 7, use_macaron_style_in_text_encoder: bool = True, use_conformer_conv_in_text_encoder: bool = True, decoder_kernel_size: int = 7, decoder_channels: int = 512, decoder_downsample_scales: List[int] = [2, 2, 8, 8], decoder_downsample_kernel_sizes: List[int] = [4, 4, 16, 16], decoder_upsample_scales: List[int] = [8, 8, 2, 2], decoder_upsample_kernel_sizes: List[int] = [16, 16, 4, 4], decoder_resblock_kernel_sizes: List[int] = [3, 7, 11], decoder_resblock_dilations: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], use_avocodo=False, projection_filters: List[int] = [0, 1, 1, 1], projection_kernels: List[int] = [0, 5, 7, 11], n_harmonic: int = 64, use_weight_norm_in_decoder: bool = True, posterior_encoder_kernel_size: int = 5, posterior_encoder_layers: int = 16, posterior_encoder_stacks: int = 1, posterior_encoder_base_dilation: int = 1, posterior_encoder_dropout_rate: float = 0.0, use_weight_norm_in_posterior_encoder: bool = True, flow_flows: int = 4, flow_kernel_size: int = 5, flow_base_dilation: int = 1, flow_layers: int = 4, flow_dropout_rate: float = 0.0, use_weight_norm_in_flow: bool = True, use_only_mean_in_flow: bool = True, generator_type: str = 'visinger', vocoder_generator_type: str = 'hifigan', fs: int = 22050, hop_length: int = 256, win_length: Optional[int] = 1024, n_fft: int = 1024, use_phoneme_predictor: bool = False, expand_f0_method: str = 'repeat')[source]

Bases: torch.nn.modules.module.Module

Generator module in VISinger.

Initialize VITS generator module.

Parameters:
  • vocabs (int) – Input vocabulary size.

  • aux_channels (int) – Number of acoustic feature channels.

  • hidden_channels (int) – Number of hidden channels.

  • spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.

  • langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.

  • spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.

  • global_channels (int) – Number of global conditioning channels.

  • segment_size (int) – Segment size for decoder.

  • text_encoder_attention_heads (int) – Number of heads in conformer block of text encoder.

  • text_encoder_ffn_expand (int) – Expansion ratio of FFN in conformer block of text encoder.

  • text_encoder_blocks (int) – Number of conformer blocks in text encoder.

  • text_encoder_positionwise_layer_type (str) – Position-wise layer type in conformer block of text encoder.

  • text_encoder_positionwise_conv_kernel_size (int) – Position-wise convolution kernel size in conformer block of text encoder. Only used when the above layer type is conv1d or conv1d-linear.

  • text_encoder_positional_encoding_layer_type (str) – Positional encoding layer type in conformer block of text encoder.

  • text_encoder_self_attention_layer_type (str) – Self-attention layer type in conformer block of text encoder.

  • text_encoder_activation_type (str) – Activation function type in conformer block of text encoder.

  • text_encoder_normalize_before (bool) – Whether to apply layer norm before self-attention in conformer block of text encoder.

  • text_encoder_dropout_rate (float) – Dropout rate in conformer block of text encoder.

  • text_encoder_positional_dropout_rate (float) – Dropout rate for positional encoding in conformer block of text encoder.

  • text_encoder_attention_dropout_rate (float) – Dropout rate for attention in conformer block of text encoder.

  • text_encoder_conformer_kernel_size (int) – Conformer conv kernel size. It is used only when use_conformer_conv_in_text_encoder = True.

  • use_macaron_style_in_text_encoder (bool) – Whether to use macaron style FFN in conformer block of text encoder.

  • use_conformer_conv_in_text_encoder (bool) – Whether to use convolution in conformer block of text encoder.

  • decoder_kernel_size (int) – Decoder kernel size.

  • decoder_channels (int) – Number of decoder initial channels.

  • decoder_downsample_scales (List[int]) – List of downsampling scales in decoder.

  • decoder_downsample_kernel_sizes (List[int]) – List of kernel sizes for downsampling layers in decoder.

  • decoder_upsample_scales (List[int]) – List of upsampling scales in decoder.

  • decoder_upsample_kernel_sizes (List[int]) – List of kernel sizes for upsampling layers in decoder.

  • decoder_resblock_kernel_sizes (List[int]) – List of kernel sizes for resblocks in decoder.

  • decoder_resblock_dilations (List[List[int]]) – List of list of dilations for resblocks in decoder.

  • use_avocodo (bool) – Whether to use Avocodo model in the generator.

  • projection_filters (List[int]) – List of projection filter sizes.

  • projection_kernels (List[int]) – List of projection kernel sizes.

  • n_harmonic (int) – Number of harmonic components.

  • use_weight_norm_in_decoder (bool) – Whether to apply weight normalization in decoder.

  • posterior_encoder_kernel_size (int) – Posterior encoder kernel size.

  • posterior_encoder_layers (int) – Number of layers of posterior encoder.

  • posterior_encoder_stacks (int) – Number of stacks of posterior encoder.

  • posterior_encoder_base_dilation (int) – Base dilation of posterior encoder.

  • posterior_encoder_dropout_rate (float) – Dropout rate for posterior encoder.

  • use_weight_norm_in_posterior_encoder (bool) – Whether to apply weight normalization in posterior encoder.

  • flow_flows (int) – Number of flows in flow.

  • flow_kernel_size (int) – Kernel size in flow.

  • flow_base_dilation (int) – Base dilation in flow.

  • flow_layers (int) – Number of layers in flow.

  • flow_dropout_rate (float) – Dropout rate in flow.

  • use_weight_norm_in_flow (bool) – Whether to apply weight normalization in flow.

  • use_only_mean_in_flow (bool) – Whether to use only mean in flow.

  • generator_type (str) – Type of generator to use for the model.

  • vocoder_generator_type (str) – Type of vocoder generator to use for the model.

  • fs (int) – Sample rate of the audio.

  • hop_length (int) – Number of samples between successive frames in STFT.

  • win_length (int) – Window size of the STFT.

  • n_fft (int) – Length of the FFT window to be used.

  • use_phoneme_predictor (bool) – Whether to use phoneme predictor in the model.

  • expand_f0_method (str) – The method used to expand F0. Use “repeat” or “interpolation”.

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, label: torch.Tensor = None, label_lengths: torch.Tensor = None, melody: torch.Tensor = None, gt_dur: torch.Tensor = None, score_dur: torch.Tensor = None, slur: torch.Tensor = None, pitch: torch.Tensor = None, ying: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]][source]

Calculate forward propagation.

Parameters:
  • text (LongTensor) – Batch of padded character ids (B, Tmax).

  • text_lengths (LongTensor) – Batch of lengths of each input batch (B,).

  • feats (Tensor) – Batch of padded target features (B, Lmax, odim).

  • feats_lengths (LongTensor) – Batch of the lengths of each target (B,).

  • label (LongTensor) – Batch of padded label ids (B, Tmax).

  • label_lengths (LongTensor) – Batch of the lengths of padded label ids (B, ).

  • melody (LongTensor) – Batch of padded midi (B, Tmax).

  • gt_dur (LongTensor) – Batch of padded ground truth duration (B, Tmax).

  • score_dur (LongTensor) – Batch of padded score duration (B, Tmax).

  • pitch (FloatTensor) – Batch of padded f0 (B, Tmax).

  • ying (Optional[Tensor]) – Batch of padded ying (B, Tmax).

  • spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).

  • sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).

  • lids (Optional[Tensor]) – Batch of language IDs (B, 1).

Returns:

  • Tensor: Waveform tensor (B, 1, segment_size * upsample_factor).

  • Tensor: Duration negative log-likelihood (NLL) tensor (B,).

  • Tensor: Monotonic attention weight tensor (B, 1, T_feats, T_text).

  • Tensor: Segments start index tensor (B,).

  • Tensor: Text mask tensor (B, 1, T_text).

  • Tensor: Feature mask tensor (B, 1, T_feats).

  • Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]:

  • Tensor: Posterior encoder hidden representation (B, H, T_feats).

  • Tensor: Flow hidden representation (B, H, T_feats).

  • Tensor: Expanded text encoder projected mean (B, H, T_feats).

  • Tensor: Expanded text encoder projected scale (B, H, T_feats).

  • Tensor: Posterior encoder projected mean (B, H, T_feats).

  • Tensor: Posterior encoder projected scale (B, H, T_feats).

Return type:

Tensor

inference(text: torch.Tensor, text_lengths: torch.Tensor, feats: Optional[torch.Tensor] = None, feats_lengths: Optional[torch.Tensor] = None, label: torch.Tensor = None, label_lengths: torch.Tensor = None, melody: torch.Tensor = None, score_dur: torch.Tensor = None, slur: torch.Tensor = None, gt_dur: Optional[torch.Tensor] = None, pitch: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, spembs: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, alpha: float = 1.0, max_len: Optional[int] = None, use_teacher_forcing: bool = False) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Run inference.

Parameters:
  • text (LongTensor) – Batch of padded character ids (B, Tmax).

  • text_lengths (LongTensor) – Batch of lengths of each input batch (B,).

  • feats (Tensor) – Batch of padded target features (B, Lmax, odim).

  • feats_lengths (LongTensor) – Batch of the lengths of each target (B,).

  • label (LongTensor) – Batch of padded label ids (B, Tmax).

  • label_lengths (LongTensor) – Batch of the lengths of padded label ids (B, ).

  • melody (LongTensor) – Batch of padded midi (B, Tmax).

  • gt_dur (LongTensor) – Batch of padded ground truth duration (B, Tmax).

  • score_dur (LongTensor) – Batch of padded score duration (B, Tmax).

  • pitch (FloatTensor) – Batch of padded f0 (B, Tmax).

  • ying (Optional[Tensor]) – Batch of padded ying (B, Tmax).

  • spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).

  • sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).

  • lids (Optional[Tensor]) – Batch of language IDs (B, 1).

  • noise_scale (float) – Noise scale parameter for flow.

  • noise_scale_dur (float) – Noise scale parameter for duration predictor.

  • alpha (float) – Alpha parameter to control the speed of generated speech.

  • max_len (Optional[int]) – Maximum length of acoustic feature sequence.

  • use_teacher_forcing (bool) – Whether to use teacher forcing.

Returns:

Generated waveform tensor (B, T_wav).

Return type:

Tensor

espnet2.gan_svs.vits.modules

class espnet2.gan_svs.vits.modules.Projection(hidden_channels, out_channels)[source]

Bases: torch.nn.modules.module.Module

forward(x, x_mask)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.gan_svs.vits.modules.sequence_mask(length, max_length=None)[source]
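A hedged sketch of sequence_mask; it is assumed to return a boolean-style mask of shape (B, max_length) marking the valid positions of each sequence.

import torch

from espnet2.gan_svs.vits.modules import sequence_mask

lengths = torch.tensor([3, 5])
mask = sequence_mask(lengths, max_length=5)
# Expected pattern (True where the position index is less than the length):
# [[1, 1, 1, 0, 0],
#  [1, 1, 1, 1, 1]]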

espnet2.gan_svs.vits.vits

VITS/VISinger module for GAN-SVS task.

class espnet2.gan_svs.vits.vits.VITS(idim: int, odim: int, sampling_rate: int = 22050, generator_type: str = 'visinger', vocoder_generator_type: str = 'hifigan', generator_params: Dict[str, Any] = {'decoder_channels': 512, 'decoder_kernel_size': 7, 'decoder_resblock_dilations': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'decoder_resblock_kernel_sizes': [3, 7, 11], 'decoder_upsample_kernel_sizes': [16, 16, 4, 4], 'decoder_upsample_scales': [8, 8, 2, 2], 'expand_f0_method': 'repeat', 'flow_base_dilation': 1, 'flow_dropout_rate': 0.0, 'flow_flows': 4, 'flow_kernel_size': 5, 'flow_layers': 4, 'global_channels': -1, 'hidden_channels': 192, 'langs': None, 'posterior_encoder_base_dilation': 1, 'posterior_encoder_dropout_rate': 0.0, 'posterior_encoder_kernel_size': 5, 'posterior_encoder_layers': 16, 'posterior_encoder_stacks': 1, 'projection_filters': [0, 1, 1, 1], 'projection_kernels': [0, 5, 7, 11], 'segment_size': 32, 'spk_embed_dim': None, 'spks': None, 'text_encoder_activation_type': 'swish', 'text_encoder_attention_dropout_rate': 0.0, 'text_encoder_attention_heads': 2, 'text_encoder_blocks': 6, 'text_encoder_conformer_kernel_size': 7, 'text_encoder_dropout_rate': 0.1, 'text_encoder_ffn_expand': 4, 'text_encoder_normalize_before': True, 'text_encoder_positional_dropout_rate': 0.0, 'text_encoder_positional_encoding_layer_type': 'rel_pos', 'text_encoder_positionwise_conv_kernel_size': 1, 'text_encoder_positionwise_layer_type': 'conv1d', 'text_encoder_self_attention_layer_type': 'rel_selfattn', 'use_conformer_conv_in_text_encoder': True, 'use_macaron_style_in_text_encoder': True, 'use_only_mean_in_flow': True, 'use_phoneme_predictor': False, 'use_weight_norm_in_decoder': True, 'use_weight_norm_in_flow': True, 'use_weight_norm_in_posterior_encoder': True}, discriminator_type: str = 'hifigan_multi_scale_multi_period_discriminator', discriminator_params: Dict[str, Any] = {'avocodo': {'combd': {'combd_d_d': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]], 'combd_d_g': [[1, 4, 16, 64, 256, 1], [1, 4, 16, 64, 256, 1], [1, 4, 16, 64, 256, 1]], 'combd_d_k': [[7, 11, 11, 11, 11, 5], [11, 21, 21, 21, 21, 5], [15, 41, 41, 41, 41, 5]], 'combd_d_p': [[3, 5, 5, 5, 5, 2], [5, 10, 10, 10, 10, 2], [7, 20, 20, 20, 20, 2]], 'combd_d_s': [[1, 1, 4, 4, 4, 1], [1, 1, 4, 4, 4, 1], [1, 1, 4, 4, 4, 1]], 'combd_h_u': [[16, 64, 256, 1024, 1024, 1024], [16, 64, 256, 1024, 1024, 1024], [16, 64, 256, 1024, 1024, 1024]], 'combd_op_f': [1, 1, 1], 'combd_op_g': [1, 1, 1], 'combd_op_k': [3, 3, 3]}, 'pqmf_config': {'lv1': [2, 256, 0.25, 10.0], 'lv2': [4, 192, 0.13, 10.0]}, 'sbd': {'pqmf_config': {'fsbd': [64, 256, 0.1, 9.0], 'sbd': [16, 256, 0.03, 10.0]}, 'sbd_band_ranges': [[0, 6], [0, 11], [0, 16], [0, 64]], 'sbd_dilations': [[[5, 7, 11], [5, 7, 11], [5, 7, 11], [5, 7, 11], [5, 7, 11]], [[3, 5, 7], [3, 5, 7], [3, 5, 7], [3, 5, 7], [3, 5, 7]], [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]], [[1, 2, 3], [1, 2, 3], [1, 2, 3], [2, 3, 5], [2, 3, 5]]], 'sbd_filters': [[64, 128, 256, 256, 256], [64, 128, 256, 256, 256], [64, 128, 256, 256, 256], [32, 64, 128, 128, 128]], 'sbd_kernel_sizes': [[[7, 7, 7], [7, 7, 7], [7, 7, 7], [7, 7, 7], [7, 7, 7]], [[5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5]], [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], [[5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5]]], 'sbd_strides': [[1, 1, 3, 3, 1], [1, 1, 3, 3, 1], [1, 1, 3, 3, 1], [1, 1, 3, 3, 1]], 'sbd_transpose': [False, False, False, True], 'use_sbd': True}}, 'hifigan_multi_scale_multi_period_discriminator': 
{'follow_official_norm': False, 'period_discriminator_params': {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'periods': [2, 3, 5, 7, 11], 'scale_discriminator_params': {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'scale_downsample_pooling': 'AvgPool1d', 'scale_downsample_pooling_params': {'kernel_size': 4, 'padding': 2, 'stride': 2}, 'scales': 1}}, generator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, discriminator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 22050, 'hop_length': 256, 'log_base': None, 'n_fft': 1024, 'n_mels': 80, 'win_length': None, 'window': 'hann'}, lambda_adv: float = 1.0, lambda_mel: float = 45.0, lambda_feat_match: float = 2.0, lambda_dur: float = 0.1, lambda_kl: float = 1.0, lambda_pitch: float = 10.0, lambda_phoneme: float = 1.0, lambda_c_yin: float = 45.0, cache_generator_outputs: bool = True)[source]

Bases: espnet2.gan_svs.abs_gan_svs.AbsGANSVS

VITS module (generator + discriminator).

This is a module of VITS described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Initialize VITS module.

Parameters:
  • idim (int) – Input vocabulary size.

  • odim (int) – Acoustic feature dimension. The actual number of output channels is 1 since VITS is an end-to-end text-to-wave model, but odim is kept for compatibility to indicate the acoustic feature dimension.

  • sampling_rate (int) – Sampling rate. Not used for training, but referred to when saving waveforms during inference.

  • generator_type (str) – Generator type.

  • vocoder_generator_type (str) – Type of vocoder generator to use in the model.

  • generator_params (Dict[str, Any]) – Parameter dict for generator.

  • discriminator_type (str) – Discriminator type.

  • discriminator_params (Dict[str, Any]) – Parameter dict for discriminator.

  • generator_adv_loss_params (Dict[str, Any]) – Parameter dict for generator adversarial loss.

  • discriminator_adv_loss_params (Dict[str, Any]) – Parameter dict for discriminator adversarial loss.

  • feat_match_loss_params (Dict[str, Any]) – Parameter dict for feat match loss.

  • mel_loss_params (Dict[str, Any]) – Parameter dict for mel loss.

  • lambda_adv (float) – Loss scaling coefficient for adversarial loss.

  • lambda_mel (float) – Loss scaling coefficient for mel spectrogram loss.

  • lambda_feat_match (float) – Loss scaling coefficient for feat match loss.

  • lambda_dur (float) – Loss scaling coefficient for duration loss.

  • lambda_kl (float) – Loss scaling coefficient for KL divergence loss.

  • lambda_pitch (float) – Loss scaling coefficient for pitch loss.

  • lambda_phoneme (float) – Loss scaling coefficient for phoneme loss.

  • lambda_c_yin (float) – Loss scaling coefficient for yin loss.

  • cache_generator_outputs (bool) – Whether to cache generator outputs.
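A minimal construction sketch with illustrative sizes; only idim and odim are required here, and the remaining keyword arguments keep the defaults shown in the signature above.

from espnet2.gan_svs.vits.vits import VITS

# idim/odim are illustrative (token vocabulary size and acoustic feature dim).
svs = VITS(idim=70, odim=80)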

forward(text: torch.Tensor, text_lengths: torch.Tensor, feats: torch.Tensor, feats_lengths: torch.Tensor, singing: torch.Tensor, singing_lengths: torch.Tensor, label: Optional[Dict[str, torch.Tensor]] = None, label_lengths: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, pitch: torch.LongTensor = None, ying: torch.Tensor = None, duration: Optional[Dict[str, torch.Tensor]] = None, slur: torch.LongTensor = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, forward_generator: bool = True) → Dict[str, Any][source]

Perform generator forward.

Parameters:
  • text (LongTensor) – Batch of padded character ids (B, T_text).

  • text_lengths (LongTensor) – Batch of lengths of each input batch (B,).

  • feats (Tensor) – Batch of padded target features (B, Lmax, odim).

  • feats_lengths (LongTensor) – Batch of the lengths of each target (B,).

  • singing (Tensor) – Singing waveform tensor (B, T_wav).

  • singing_lengths (Tensor) – Singing length tensor (B,).

  • label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, T_text).

  • label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B, ).

  • melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, T_text).

  • pitch (FloatTensor) – Batch of padded f0 (B, T_feats).

  • duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, T_text).

  • slur (FloatTensor) – Batch of padded slur (B, T_text).

  • spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).

  • sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).

  • lids (Optional[Tensor]) – Batch of language IDs (B, 1).

  • forward_generator (bool) – Whether to forward generator.

Returns:

  • loss (Tensor): Loss scalar tensor.

  • stats (Dict[str, float]): Statistics to be monitored.

  • weight (Tensor): Weight tensor to summarize losses.

  • optim_idx (int): Optimizer index (0 for G and 1 for D).

Return type:

Dict[str, Any]

inference(text: torch.Tensor, feats: Optional[torch.Tensor] = None, label: Optional[Dict[str, torch.Tensor]] = None, melody: Optional[Dict[str, torch.Tensor]] = None, pitch: Optional[torch.Tensor] = None, duration: Optional[Dict[str, torch.Tensor]] = None, slur: Optional[Dict[str, torch.Tensor]] = None, spembs: Optional[torch.Tensor] = None, sids: Optional[torch.Tensor] = None, lids: Optional[torch.Tensor] = None, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, alpha: float = 1.0, max_len: Optional[int] = None, use_teacher_forcing: bool = False) → Dict[str, torch.Tensor][source]

Run inference.

Parameters:
  • text (Tensor) – Input text index tensor (T_text,).

  • feats (Tensor) – Feature tensor (T_feats, aux_channels).

  • label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, T_text).

  • melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, T_text).

  • pitch (FloatTensor) – Batch of padded f0 (B, T_feats).

  • slur (LongTensor) – Batch of padded slur (B, T_text).

  • sids (Tensor) – Speaker index tensor (1,).

  • spembs (Optional[Tensor]) – Speaker embedding tensor (spk_embed_dim,).

  • lids (Tensor) – Language index tensor (1,).

  • noise_scale (float) – Noise scale value for flow.

  • noise_scale_dur (float) – Noise scale value for duration predictor.

  • alpha (float) – Alpha parameter to control the speed of generated singing.

  • max_len (Optional[int]) – Maximum length.

  • use_teacher_forcing (bool) – Whether to use teacher forcing.

  • duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, T_text).

Returns:

  • wav (Tensor): Generated waveform tensor (T_wav,).

Return type:

Dict[str, Tensor]

property require_raw_singing

Return whether or not singing is required.

property require_vocoder

Return whether or not vocoder is required.

espnet2.gan_svs.vits.duration_predictor

Duration predictor modules in VISinger.

class espnet2.gan_svs.vits.duration_predictor.DurationPredictor(channels, filter_channels, kernel_size, dropout_rate, global_channels=0)[source]

Bases: torch.nn.modules.module.Module

Initialize duration predictor module.

Parameters:
  • channels (int) – Number of input channels.

  • filter_channels (int) – Number of filter channels.

  • kernel_size (int) – Size of the convolutional kernel.

  • dropout_rate (float) – Dropout rate.

  • global_channels (int, optional) – Number of global conditioning channels.

forward(x, x_mask, g=None)[source]

Forward pass through the duration predictor module.

Parameters:
  • x (Tensor) – Input tensor (B, in_channels, T).

  • x_mask (Tensor) – Mask tensor (B, 1, T).

  • g (Tensor, optional) – Global condition tensor (B, global_channels, 1).

Returns:

Predicted duration tensor (B, 2, T).

Return type:

Tensor
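A hedged usage sketch; the tensor layouts follow the forward docstring above, while the channel sizes and dropout rate are illustrative.

import torch

from espnet2.gan_svs.vits.duration_predictor import DurationPredictor

dp = DurationPredictor(
    channels=192, filter_channels=256, kernel_size=3, dropout_rate=0.5
)
x = torch.randn(2, 192, 50)    # (B, in_channels, T)
x_mask = torch.ones(2, 1, 50)  # (B, 1, T)
dur = dp(x, x_mask)            # (B, 2, T) per the docstring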

espnet2.gan_svs.vits.pitch_predictor

class espnet2.gan_svs.vits.pitch_predictor.Decoder(out_channels: int = 192, attention_dim: int = 192, attention_heads: int = 2, linear_units: int = 768, blocks: int = 6, pw_layer_type: str = 'conv1d', pw_conv_kernel_size: int = 3, pos_enc_layer_type: str = 'rel_pos', self_attention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', normalize_before: bool = True, use_macaron_style: bool = False, use_conformer_conv: bool = False, conformer_kernel_size: int = 7, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0, global_channels: int = -1)[source]

Bases: torch.nn.modules.module.Module

Pitch or Mel decoder module in VISinger 2.

Initialize Decoder in VISinger 2.

Parameters:
  • out_channels (int) – The output dimension of the module.

  • attention_dim (int) – The dimension of the attention mechanism.

  • attention_heads (int) – The number of attention heads.

  • linear_units (int) – The number of units in the linear layer.

  • blocks (int) – The number of encoder blocks.

  • pw_layer_type (str) – The type of position-wise layer to use.

  • pw_conv_kernel_size (int) – The kernel size of the position-wise convolutional layer.

  • pos_enc_layer_type (str) – The type of positional encoding layer to use.

  • self_attention_layer_type (str) – The type of self-attention layer to use.

  • activation_type (str) – The type of activation function to use.

  • normalize_before (bool) – Whether to normalize the data before the position-wise layer or after.

  • use_macaron_style (bool) – Whether to use the macaron style or not.

  • use_conformer_conv (bool) – Whether to use Conformer style conv or not.

  • conformer_kernel_size (int) – The kernel size of the conformer convolutional layer.

  • dropout_rate (float) – The dropout rate to use.

  • positional_dropout_rate (float) – The positional dropout rate to use.

  • attention_dropout_rate (float) – The attention dropout rate to use.

  • global_channels (int) – The number of channels to use for global conditioning.

forward(x, x_lengths, g=None)[source]

Forward pass of the Decoder.

Parameters:
  • x (Tensor) – Input tensor (B, 2 + attention_dim, T).

  • x_lengths (Tensor) – Length tensor (B,).

  • g (Tensor, optional) – Global conditioning tensor (B, global_channels, 1).

Returns:

  • Tensor: Output tensor (B, 1, T).

  • Tensor: Output mask (B, 1, T).

Return type:

Tensor
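A hedged usage sketch for the VISinger 2 pitch/mel decoder; the input layout (B, 2 + attention_dim, T) and the two returned tensors follow the forward docstring above, and all sizes are illustrative assumptions.

import torch

from espnet2.gan_svs.vits.pitch_predictor import Decoder

dec = Decoder(out_channels=1, attention_dim=192)
x = torch.randn(2, 2 + 192, 60)    # (B, 2 + attention_dim, T)
x_lengths = torch.tensor([60, 55])
y, y_mask = dec(x, x_lengths)      # (B, 1, T) output and (B, 1, T) mask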

espnet2.gan_svs.vits.length_regulator

Length regulator related modules.

class espnet2.gan_svs.vits.length_regulator.LengthRegulator(pad_value=0.0)[source]

Bases: torch.nn.modules.module.Module

Length regulator module.

Initialize length regulator module.

Parameters:

pad_value (float, optional) – Value used for padding.

LR(x, duration, use_state_info=False)[source]

Length regulates input mel-spectrograms to match duration.

Parameters:
  • x (Tensor) – Input tensor (B, dim, T).

  • duration (Tensor) – Duration tensor (B, T).

  • use_state_info (bool, optional) – Whether to use position information or not.

Returns:

  • Tensor: Output tensor (B, dim, D_frame).

  • Tensor: Output length (B,).

Return type:

Tensor

expand(batch, predicted, use_state_info=False)[source]

Expand input mel-spectrogram based on the predicted duration.

Parameters:
  • batch (Tensor) – Input tensor (T, dim).

  • predicted (Tensor) – Predicted duration tensor (T,).

  • use_state_info (bool, optional) – Whether to use position information or not.

Returns:

Output tensor (D_frame, dim).

Return type:

Tensor

forward(x, duration, use_state_info=False)[source]

Forward pass through the length regulator module.

Parameters:
  • x (Tensor) – Input tensor (B, dim, T).

  • duration (Tensor) – Duration tensor (B, T).

  • use_state_info (bool, optional) – Whether to use position information or not.

Returns:

  • Tensor: Output tensor (B, dim, D_frame).

  • Tensor: Output length (B,).

Return type:

Tensor
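A hedged sketch of length regulation: each input frame is repeated according to its duration, so the time axis grows from T to the total duration. Shapes follow the forward docstring above; the feature dimension and durations are illustrative.

import torch

from espnet2.gan_svs.vits.length_regulator import LengthRegulator

lr = LengthRegulator()
x = torch.randn(1, 192, 4)               # (B, dim, T)
duration = torch.tensor([[2, 1, 3, 2]])  # (B, T); expands to 8 frames in total
out, out_lengths = lr(x, duration)       # (B, dim, 8) and (B,)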

espnet2.gan_svs.vits.__init__

espnet2.gan_svs.vits.prior_decoder

class espnet2.gan_svs.vits.prior_decoder.PriorDecoder(out_channels: int = 384, attention_dim: int = 192, attention_heads: int = 2, linear_units: int = 768, blocks: int = 6, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 3, positional_encoding_layer_type: str = 'rel_pos', self_attention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', normalize_before: bool = True, use_macaron_style: bool = False, use_conformer_conv: bool = False, conformer_kernel_size: int = 7, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0, global_channels: int = 0)[source]

Bases: torch.nn.modules.module.Module

Initialize prior decoder module.

Parameters:
  • out_channels (int) – Output channels of the prior decoder. Defaults to 384.

  • attention_dim (int) – Dimension of the attention mechanism. Defaults to 192.

  • attention_heads (int) – Number of attention heads. Defaults to 2.

  • linear_units (int) – Number of units in the linear layer. Defaults to 768.

  • blocks (int) – Number of blocks in the encoder. Defaults to 6.

  • positionwise_layer_type (str) – Type of the positionwise layer. Defaults to “conv1d”.

  • positionwise_conv_kernel_size (int) – Kernel size of the positionwise convolutional layer. Defaults to 3.

  • positional_encoding_layer_type (str) – Type of positional encoding layer. Defaults to “rel_pos”.

  • self_attention_layer_type (str) – Type of self-attention layer. Defaults to “rel_selfattn”.

  • activation_type (str) – Type of activation. Defaults to “swish”.

  • normalize_before (bool) – Flag for normalization. Defaults to True.

  • use_macaron_style (bool) – Flag for macaron style. Defaults to False.

  • use_conformer_conv (bool) – Flag for using conformer convolution. Defaults to False.

  • conformer_kernel_size (int) – Kernel size for conformer convolution. Defaults to 7.

  • dropout_rate (float) – Dropout rate. Defaults to 0.1.

  • positional_dropout_rate (float) – Dropout rate for positional encoding. Defaults to 0.0.

  • attention_dropout_rate (float) – Dropout rate for attention. Defaults to 0.0.

  • global_channels (int) – Number of global channels. Defaults to 0.

forward(x, x_lengths, g=None)[source]

Forward pass of the PriorDecoder module.

Parameters:
  • x (Tensor) – Input tensor (B, attention_dim + 2, T).

  • x_lengths (Tensor) – Length tensor (B,).

  • g (Tensor) – Global conditioning tensor for multi-singer models (B, global_channels, 1).

Returns:

  • Tensor: Output tensor (B, out_channels, T).

  • Tensor: Output mask tensor (B, 1, T).

Return type:

Tensor

espnet2.gan_svs.vits.text_encoder

Text encoder module in VISinger.

This code is based on https://github.com/jaywalnut310/vits and https://github.com/zhangyongmao/VISinger2.

class espnet2.gan_svs.vits.text_encoder.TextEncoder(vocabs: int, attention_dim: int = 192, attention_heads: int = 2, linear_units: int = 768, blocks: int = 6, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 3, positional_encoding_layer_type: str = 'rel_pos', self_attention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', normalize_before: bool = True, use_macaron_style: bool = False, use_conformer_conv: bool = False, conformer_kernel_size: int = 7, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0, use_slur=True)[source]

Bases: torch.nn.modules.module.Module

Text encoder module in VISinger.

This is a module of text encoder described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Instead of the relative positional Transformer, we use conformer architecture as the encoder module, which contains additional convolution layers.

Initialize TextEncoder module.

Parameters:
  • vocabs (int) – Vocabulary size.

  • attention_dim (int) – Attention dimension.

  • attention_heads (int) – Number of attention heads.

  • linear_units (int) – Number of linear units of positionwise layers.

  • blocks (int) – Number of encoder blocks.

  • positionwise_layer_type (str) – Positionwise layer type.

  • positionwise_conv_kernel_size (int) – Positionwise layer’s kernel size.

  • positional_encoding_layer_type (str) – Positional encoding layer type.

  • self_attention_layer_type (str) – Self-attention layer type.

  • activation_type (str) – Activation function type.

  • normalize_before (bool) – Whether to apply LayerNorm before attention.

  • use_macaron_style (bool) – Whether to use macaron style components.

  • use_conformer_conv (bool) – Whether to use conformer conv layers.

  • conformer_kernel_size (int) – Conformer’s conv kernel size.

  • dropout_rate (float) – Dropout rate.

  • positional_dropout_rate (float) – Dropout rate for positional encoding.

  • attention_dropout_rate (float) – Dropout rate for attention.

  • use_slur (bool) – Whether to use slur embedding.

forward(phone: torch.Tensor, phone_lengths: torch.Tensor, midi_id: torch.Tensor, dur: torch.Tensor, slur: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

Calculate forward propagation.

Parameters:
  • phone (Tensor) – Input index tensor (B, T_text).

  • phone_lengths (Tensor) – Length tensor (B,).

  • midi_id (Tensor) – Input midi tensor (B, T_text).

  • dur (Tensor) – Input duration tensor (B, T_text).

  • slur (Optional[Tensor]) – Input slur tensor (B, T_text).

Returns:

  • Tensor: Encoded hidden representation (B, attention_dim, T_text).

  • Tensor: Mask tensor for padded part (B, 1, T_text).

  • Tensor: Encoded hidden representation for duration (B, attention_dim, T_text).

  • Tensor: Encoded hidden representation for pitch (B, attention_dim, T_text).

Return type:

Tensor
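A construction-only sketch; the vocabulary size is illustrative and the remaining arguments keep the defaults from the signature above. At call time the encoder expects phone, midi, and duration ids of shape (B, T_text) plus the per-example lengths, as documented in forward().

from espnet2.gan_svs.vits.text_encoder import TextEncoder

# vocabs is illustrative; use_slur=True (the default) also expects a slur tensor.
encoder = TextEncoder(vocabs=70)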

espnet2.gan_svs.vits.phoneme_predictor

class espnet2.gan_svs.vits.phoneme_predictor.PhonemePredictor(vocabs: int, hidden_channels: int = 192, attention_dim: int = 192, attention_heads: int = 2, linear_units: int = 768, blocks: int = 2, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 3, positional_encoding_layer_type: str = 'rel_pos', self_attention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', normalize_before: bool = True, use_macaron_style: bool = False, use_conformer_conv: bool = False, conformer_kernel_size: int = 7, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0)[source]

Bases: torch.nn.modules.module.Module

Phoneme Predictor module in VISinger.

Initialize PhonemePredictor module.

Parameters:
  • vocabs (int) – The number of vocabulary.

  • hidden_channels (int) – The number of hidden channels.

  • attention_dim (int) – The number of attention dimension.

  • attention_heads (int) – The number of attention heads.

  • linear_units (int) – The number of linear units.

  • blocks (int) – The number of encoder blocks.

  • positionwise_layer_type (str) – The type of position-wise layer.

  • positionwise_conv_kernel_size (int) – The size of position-wise convolution kernel.

  • positional_encoding_layer_type (str) – The type of positional encoding layer.

  • self_attention_layer_type (str) – The type of self-attention layer.

  • activation_type (str) – The type of activation function.

  • normalize_before (bool) – Whether to apply normalization before the position-wise layer or not.

  • use_macaron_style (bool) – Whether to use macaron style or not.

  • use_conformer_conv (bool) – Whether to use Conformer convolution or not.

  • conformer_kernel_size (int) – The size of Conformer kernel.

  • dropout_rate (float) – The dropout rate.

  • positional_dropout_rate (float) – The dropout rate for positional encoding.

  • attention_dropout_rate (float) – The dropout rate for attention.

forward(x, x_mask)[source]

Perform forward propagation.

Parameters:
  • x (Tensor) – The input tensor of shape (B, dim, length).

  • x_mask (Tensor) – The mask tensor for the input tensor of shape (B, length).

Returns:

The predicted phoneme tensor of shape (length, B, vocab_size).

Return type:

Tensor

espnet2.gan_svs.avocodo.__init__

class espnet2.gan_svs.avocodo.__init__.MDC(in_channels, out_channels, strides, kernel_size, dilations, use_spectral_norm=False)[source]

Bases: torch.nn.modules.module.Module

Multiscale Dilated Convolution from https://arxiv.org/pdf/1609.07093.pdf

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.gan_svs.avocodo.__init__.SBD(h, use_spectral_norm=False)[source]

Bases: torch.nn.modules.module.Module

SBD (Sub-band Discriminator) from https://arxiv.org/pdf/2206.13404.pdf

forward(y, y_hat)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.gan_svs.avocodo.__init__.AvocodoDiscriminator(combd: Dict[str, Any] = {'combd_d_d': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]], 'combd_d_g': [[1, 4, 16, 64, 256, 1], [1, 4, 16, 64, 256, 1], [1, 4, 16, 64, 256, 1]], 'combd_d_k': [[7, 11, 11, 11, 11, 5], [11, 21, 21, 21, 21, 5], [15, 41, 41, 41, 41, 5]], 'combd_d_p': [[3, 5, 5, 5, 5, 2], [5, 10, 10, 10, 10, 2], [7, 20, 20, 20, 20, 2]], 'combd_d_s': [[1, 1, 4, 4, 4, 1], [1, 1, 4, 4, 4, 1], [1, 1, 4, 4, 4, 1]], 'combd_h_u': [[16, 64, 256, 1024, 1024, 1024], [16, 64, 256, 1024, 1024, 1024], [16, 64, 256, 1024, 1024, 1024]], 'combd_op_f': [1, 1, 1], 'combd_op_g': [1, 1, 1], 'combd_op_k': [3, 3, 3]}, sbd: Dict[str, Any] = {'pqmf_config': {'fsbd': [64, 256, 0.1, 9.0], 'sbd': [16, 256, 0.03, 10.0]}, 'sbd_band_ranges': [[0, 6], [0, 11], [0, 16], [0, 64]], 'sbd_dilations': [[[5, 7, 11], [5, 7, 11], [5, 7, 11], [5, 7, 11], [5, 7, 11]], [[3, 5, 7], [3, 5, 7], [3, 5, 7], [3, 5, 7], [3, 5, 7]], [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]], [[1, 2, 3], [1, 2, 3], [1, 2, 3], [2, 3, 5], [2, 3, 5]]], 'sbd_filters': [[64, 128, 256, 256, 256], [64, 128, 256, 256, 256], [64, 128, 256, 256, 256], [32, 64, 128, 128, 128]], 'sbd_kernel_sizes': [[[7, 7, 7], [7, 7, 7], [7, 7, 7], [7, 7, 7], [7, 7, 7]], [[5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5]], [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], [[5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5]]], 'sbd_strides': [[1, 1, 3, 3, 1], [1, 1, 3, 3, 1], [1, 1, 3, 3, 1], [1, 1, 3, 3, 1]], 'sbd_transpose': [False, False, False, True], 'segment_size': 8192, 'use_sbd': True}, pqmf_config: Dict[str, Any] = {'lv1': [2, 256, 0.25, 10.0], 'lv2': [4, 192, 0.13, 10.0]}, projection_filters: List[int] = [0, 1, 1, 1])[source]

Bases: torch.nn.modules.module.Module

Avocodo Discriminator module

forward(y: torch.Tensor, y_hats: torch.Tensor) → List[List[torch.Tensor]][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.gan_svs.avocodo.__init__.AvocodoDiscriminatorPlus(combd: Dict[str, Any] = {'combd_d_d': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]], 'combd_d_g': [[1, 4, 16, 64, 256, 1], [1, 4, 16, 64, 256, 1], [1, 4, 16, 64, 256, 1]], 'combd_d_k': [[7, 11, 11, 11, 11, 5], [11, 21, 21, 21, 21, 5], [15, 41, 41, 41, 41, 5]], 'combd_d_p': [[3, 5, 5, 5, 5, 2], [5, 10, 10, 10, 10, 2], [7, 20, 20, 20, 20, 2]], 'combd_d_s': [[1, 1, 4, 4, 4, 1], [1, 1, 4, 4, 4, 1], [1, 1, 4, 4, 4, 1]], 'combd_h_u': [[16, 64, 256, 1024, 1024, 1024], [16, 64, 256, 1024, 1024, 1024], [16, 64, 256, 1024, 1024, 1024]], 'combd_op_f': [1, 1, 1], 'combd_op_g': [1, 1, 1], 'combd_op_k': [3, 3, 3]}, sbd: Dict[str, Any] = {'pqmf_config': {'fsbd': [64, 256, 0.1, 9.0], 'sbd': [16, 256, 0.03, 10.0]}, 'sbd_band_ranges': [[0, 6], [0, 11], [0, 16], [0, 64]], 'sbd_dilations': [[[5, 7, 11], [5, 7, 11], [5, 7, 11], [5, 7, 11], [5, 7, 11]], [[3, 5, 7], [3, 5, 7], [3, 5, 7], [3, 5, 7], [3, 5, 7]], [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]], [[1, 2, 3], [1, 2, 3], [1, 2, 3], [2, 3, 5], [2, 3, 5]]], 'sbd_filters': [[64, 128, 256, 256, 256], [64, 128, 256, 256, 256], [64, 128, 256, 256, 256], [32, 64, 128, 128, 128]], 'sbd_kernel_sizes': [[[7, 7, 7], [7, 7, 7], [7, 7, 7], [7, 7, 7], [7, 7, 7]], [[5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5]], [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], [[5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5]]], 'sbd_strides': [[1, 1, 3, 3, 1], [1, 1, 3, 3, 1], [1, 1, 3, 3, 1], [1, 1, 3, 3, 1]], 'sbd_transpose': [False, False, False, True], 'segment_size': 8192, 'use_sbd': True}, pqmf_config: Dict[str, Any] = {'lv1': [2, 256, 0.25, 10.0], 'lv2': [4, 192, 0.13, 10.0]}, projection_filters: List[int] = [0, 1, 1, 1], sample_rate: int = 22050, multi_freq_disc_params: Dict[str, Any] = {'divisors': [32, 16, 8, 4, 2, 1, 1], 'domain': 'double', 'hidden_channels': [256, 512, 512], 'hop_length_factors': [4, 8, 16], 'mel_scale': True, 'strides': [1, 2, 1, 2, 1, 2, 1]})[source]

Bases: torch.nn.modules.module.Module

Avocodo discriminator with additional MFD.

forward(y: torch.Tensor, y_hats: torch.Tensor) → List[List[torch.Tensor]][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.gan_svs.avocodo.__init__.AvocodoGenerator(in_channels: int = 80, out_channels: int = 1, channels: int = 512, global_channels: int = -1, kernel_size: int = 7, upsample_scales: List[int] = [8, 8, 2, 2], upsample_kernel_sizes: List[int] = [16, 16, 4, 4], resblock_kernel_sizes: List[int] = [3, 7, 11], resblock_dilations: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], projection_filters: List[int] = [0, 1, 1, 1], projection_kernels: List[int] = [0, 5, 7, 11], use_additional_convs: bool = True, bias: bool = True, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.2}, use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

Avocodo generator module.

Initialize AvocodoGenerator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • channels (int) – Number of hidden representation channels.

  • global_channels (int) – Number of global conditioning channels.

  • kernel_size (int) – Kernel size of initial and final conv layer.

  • upsample_scales (List[int]) – List of upsampling scales.

  • upsample_kernel_sizes (List[int]) – List of kernel sizes for upsample layers.

  • resblock_kernel_sizes (List[int]) – List of kernel sizes for residual blocks.

  • resblock_dilations (List[List[int]]) – List of list of dilations for residual blocks.

  • use_additional_convs (bool) – Whether to use additional conv layers in residual blocks.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

apply_weight_norm()[source]

Apply weight normalization module to all of the layers.

forward(c: torch.Tensor, g: Optional[torch.Tensor] = None) → torch.Tensor[source]

Calculate forward propagation.

Parameters:
  • c (Tensor) – Input tensor (B, in_channels, T).

  • g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).

Returns:

List of output tensors (B, out_channels, T).

Return type:

List[Tensor]

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

reset_parameters()[source]

Reset parameters.

This initialization follows the official implementation manner. https://github.com/jik876/hifi-gan/blob/master/models.py
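
Example (a minimal usage sketch for AvocodoGenerator under the defaults above; the shapes and the interpretation of the returned list are taken from the docstrings, and the input size is illustrative):

    import torch
    from espnet2.gan_svs.avocodo import AvocodoGenerator

    generator = AvocodoGenerator()      # defaults: in_channels=80, total upsampling 8 * 8 * 2 * 2 = 256
    mel = torch.randn(1, 80, 32)        # input features (B, in_channels, T)
    waveforms = generator(mel)          # documented: list of output tensors (B, out_channels, T')
    for wav in waveforms:
        print(wav.shape)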

class espnet2.gan_svs.avocodo.__init__.CoMBD(h, pqmf_list=None, use_spectral_norm=False)[source]

Bases: torch.nn.modules.module.Module

CoMBD (Collaborative Multi-band Discriminator) module

from https://arxiv.org/abs/2206.13404

forward(ys, ys_hat)[source]

Forward CoMBD.

Parameters:
  • ys (List[Tensor]) – List of ground truth signals of shape (B, 1, T).

  • ys_hat (List[Tensor]) – List of predicted signals of shape (B, 1, T).

Returns:

Tuple containing the list of output tensors of shape (B, C_out, T_out) for real and fake, respectively, and the list of feature maps of shape (B, C, T) at each Conv1d layer for real and fake, respectively.

Return type:

Tuple[List[Tensor], List[Tensor], List[List[Tensor]], List[List[Tensor]]]

class espnet2.gan_svs.avocodo.__init__.CoMBDBlock(h_u: List[int], d_k: List[int], d_s: List[int], d_d: List[int], d_g: List[int], d_p: List[int], op_f: int, op_k: int, op_g: int, use_spectral_norm=False)[source]

Bases: torch.nn.modules.module.Module

CoMBD (Collaborative Multi-band Discriminator) block module

forward(x)[source]

Forward pass through the CoMBD block.

Parameters:

x (Tensor) – Input tensor of shape (B, C_in, T_in).

Returns:

Tuple containing the output tensor of shape (B, C_out, T_out) and a list of feature maps of shape (B, C, T) at each Conv1d layer.

Return type:

Tuple[Tensor, List[Tensor]]

class espnet2.gan_svs.avocodo.__init__.SBDBlock(segment_dim, strides, filters, kernel_size, dilations, use_spectral_norm=False)[source]

Bases: torch.nn.modules.module.Module

SBD (Sub-band Discriminator) Block

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.gan_svs.avocodo.avocodo

Avocodo Modules.

This code is modified from https://github.com/ncsoft/avocodo.

class espnet2.gan_svs.avocodo.avocodo.AvocodoDiscriminator(combd: Dict[str, Any] = {'combd_d_d': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]], 'combd_d_g': [[1, 4, 16, 64, 256, 1], [1, 4, 16, 64, 256, 1], [1, 4, 16, 64, 256, 1]], 'combd_d_k': [[7, 11, 11, 11, 11, 5], [11, 21, 21, 21, 21, 5], [15, 41, 41, 41, 41, 5]], 'combd_d_p': [[3, 5, 5, 5, 5, 2], [5, 10, 10, 10, 10, 2], [7, 20, 20, 20, 20, 2]], 'combd_d_s': [[1, 1, 4, 4, 4, 1], [1, 1, 4, 4, 4, 1], [1, 1, 4, 4, 4, 1]], 'combd_h_u': [[16, 64, 256, 1024, 1024, 1024], [16, 64, 256, 1024, 1024, 1024], [16, 64, 256, 1024, 1024, 1024]], 'combd_op_f': [1, 1, 1], 'combd_op_g': [1, 1, 1], 'combd_op_k': [3, 3, 3]}, sbd: Dict[str, Any] = {'pqmf_config': {'fsbd': [64, 256, 0.1, 9.0], 'sbd': [16, 256, 0.03, 10.0]}, 'sbd_band_ranges': [[0, 6], [0, 11], [0, 16], [0, 64]], 'sbd_dilations': [[[5, 7, 11], [5, 7, 11], [5, 7, 11], [5, 7, 11], [5, 7, 11]], [[3, 5, 7], [3, 5, 7], [3, 5, 7], [3, 5, 7], [3, 5, 7]], [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]], [[1, 2, 3], [1, 2, 3], [1, 2, 3], [2, 3, 5], [2, 3, 5]]], 'sbd_filters': [[64, 128, 256, 256, 256], [64, 128, 256, 256, 256], [64, 128, 256, 256, 256], [32, 64, 128, 128, 128]], 'sbd_kernel_sizes': [[[7, 7, 7], [7, 7, 7], [7, 7, 7], [7, 7, 7], [7, 7, 7]], [[5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5]], [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], [[5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5]]], 'sbd_strides': [[1, 1, 3, 3, 1], [1, 1, 3, 3, 1], [1, 1, 3, 3, 1], [1, 1, 3, 3, 1]], 'sbd_transpose': [False, False, False, True], 'segment_size': 8192, 'use_sbd': True}, pqmf_config: Dict[str, Any] = {'lv1': [2, 256, 0.25, 10.0], 'lv2': [4, 192, 0.13, 10.0]}, projection_filters: List[int] = [0, 1, 1, 1])[source]

Bases: torch.nn.modules.module.Module

Avocodo Discriminator module

forward(y: torch.Tensor, y_hats: torch.Tensor) → List[List[torch.Tensor]][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.gan_svs.avocodo.avocodo.AvocodoDiscriminatorPlus(combd: Dict[str, Any] = {'combd_d_d': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]], 'combd_d_g': [[1, 4, 16, 64, 256, 1], [1, 4, 16, 64, 256, 1], [1, 4, 16, 64, 256, 1]], 'combd_d_k': [[7, 11, 11, 11, 11, 5], [11, 21, 21, 21, 21, 5], [15, 41, 41, 41, 41, 5]], 'combd_d_p': [[3, 5, 5, 5, 5, 2], [5, 10, 10, 10, 10, 2], [7, 20, 20, 20, 20, 2]], 'combd_d_s': [[1, 1, 4, 4, 4, 1], [1, 1, 4, 4, 4, 1], [1, 1, 4, 4, 4, 1]], 'combd_h_u': [[16, 64, 256, 1024, 1024, 1024], [16, 64, 256, 1024, 1024, 1024], [16, 64, 256, 1024, 1024, 1024]], 'combd_op_f': [1, 1, 1], 'combd_op_g': [1, 1, 1], 'combd_op_k': [3, 3, 3]}, sbd: Dict[str, Any] = {'pqmf_config': {'fsbd': [64, 256, 0.1, 9.0], 'sbd': [16, 256, 0.03, 10.0]}, 'sbd_band_ranges': [[0, 6], [0, 11], [0, 16], [0, 64]], 'sbd_dilations': [[[5, 7, 11], [5, 7, 11], [5, 7, 11], [5, 7, 11], [5, 7, 11]], [[3, 5, 7], [3, 5, 7], [3, 5, 7], [3, 5, 7], [3, 5, 7]], [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]], [[1, 2, 3], [1, 2, 3], [1, 2, 3], [2, 3, 5], [2, 3, 5]]], 'sbd_filters': [[64, 128, 256, 256, 256], [64, 128, 256, 256, 256], [64, 128, 256, 256, 256], [32, 64, 128, 128, 128]], 'sbd_kernel_sizes': [[[7, 7, 7], [7, 7, 7], [7, 7, 7], [7, 7, 7], [7, 7, 7]], [[5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5]], [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], [[5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5]]], 'sbd_strides': [[1, 1, 3, 3, 1], [1, 1, 3, 3, 1], [1, 1, 3, 3, 1], [1, 1, 3, 3, 1]], 'sbd_transpose': [False, False, False, True], 'segment_size': 8192, 'use_sbd': True}, pqmf_config: Dict[str, Any] = {'lv1': [2, 256, 0.25, 10.0], 'lv2': [4, 192, 0.13, 10.0]}, projection_filters: List[int] = [0, 1, 1, 1], sample_rate: int = 22050, multi_freq_disc_params: Dict[str, Any] = {'divisors': [32, 16, 8, 4, 2, 1, 1], 'domain': 'double', 'hidden_channels': [256, 512, 512], 'hop_length_factors': [4, 8, 16], 'mel_scale': True, 'strides': [1, 2, 1, 2, 1, 2, 1]})[source]

Bases: torch.nn.modules.module.Module

Avocodo discriminator with additional MFD.

forward(y: torch.Tensor, y_hats: torch.Tensor) → List[List[torch.Tensor]][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.gan_svs.avocodo.avocodo.AvocodoGenerator(in_channels: int = 80, out_channels: int = 1, channels: int = 512, global_channels: int = -1, kernel_size: int = 7, upsample_scales: List[int] = [8, 8, 2, 2], upsample_kernel_sizes: List[int] = [16, 16, 4, 4], resblock_kernel_sizes: List[int] = [3, 7, 11], resblock_dilations: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], projection_filters: List[int] = [0, 1, 1, 1], projection_kernels: List[int] = [0, 5, 7, 11], use_additional_convs: bool = True, bias: bool = True, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.2}, use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

Avocodo generator module.

Initialize AvocodoGenerator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • channels (int) – Number of hidden representation channels.

  • global_channels (int) – Number of global conditioning channels.

  • kernel_size (int) – Kernel size of initial and final conv layer.

  • upsample_scales (List[int]) – List of upsampling scales.

  • upsample_kernel_sizes (List[int]) – List of kernel sizes for upsample layers.

  • resblock_kernel_sizes (List[int]) – List of kernel sizes for residual blocks.

  • resblock_dilations (List[List[int]]) – List of list of dilations for residual blocks.

  • use_additional_convs (bool) – Whether to use additional conv layers in residual blocks.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

apply_weight_norm()[source]

Apply weight normalization module to all of the layers.

forward(c: torch.Tensor, g: Optional[torch.Tensor] = None) → torch.Tensor[source]

Calculate forward propagation.

Parameters:
  • c (Tensor) – Input tensor (B, in_channels, T).

  • g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).

Returns:

List of output tensors (B, out_channels, T).

Return type:

List[Tensor]

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

reset_parameters()[source]

Reset parameters.

This initialization follows the official implementation manner. https://github.com/jik876/hifi-gan/blob/master/models.py

class espnet2.gan_svs.avocodo.avocodo.CoMBD(h, pqmf_list=None, use_spectral_norm=False)[source]

Bases: torch.nn.modules.module.Module

CoMBD (Collaborative Multi-band Discriminator) module

from https://arxiv.org/abs/2206.13404

forward(ys, ys_hat)[source]

Forward CoMBD.

Parameters:
  • ys (List[Tensor]) – List of ground truth signals of shape (B, 1, T).

  • ys_hat (List[Tensor]) – List of predicted signals of shape (B, 1, T).

Returns:

Tuple containing the list of output tensors of shape (B, C_out, T_out) for real and fake, respectively, and the list of feature maps of shape (B, C, T) at each Conv1d layer for real and fake, respectively.

Return type:

Tuple[List[Tensor], List[Tensor], List[List[Tensor]], List[List[Tensor]]]

class espnet2.gan_svs.avocodo.avocodo.CoMBDBlock(h_u: List[int], d_k: List[int], d_s: List[int], d_d: List[int], d_g: List[int], d_p: List[int], op_f: int, op_k: int, op_g: int, use_spectral_norm=False)[source]

Bases: torch.nn.modules.module.Module

CoMBD (Collaborative Multi-band Discriminator) block module

forward(x)[source]

Forward pass through the CoMBD block.

Parameters:

x (Tensor) – Input tensor of shape (B, C_in, T_in).

Returns:

Tuple containing the output tensor of shape (B, C_out, T_out) and a list of feature maps of shape (B, C, T) at each Conv1d layer.

Return type:

Tuple[Tensor, List[Tensor]]

class espnet2.gan_svs.avocodo.avocodo.MDC(in_channels, out_channels, strides, kernel_size, dilations, use_spectral_norm=False)[source]

Bases: torch.nn.modules.module.Module

Multiscale Dilated Convolution from https://arxiv.org/pdf/1609.07093.pdf

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.gan_svs.avocodo.avocodo.MDCDConfig(h)[source]

Bases: object

class espnet2.gan_svs.avocodo.avocodo.SBD(h, use_spectral_norm=False)[source]

Bases: torch.nn.modules.module.Module

SBD (Sub-band Discriminator) from https://arxiv.org/pdf/2206.13404.pdf

forward(y, y_hat)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.gan_svs.avocodo.avocodo.SBDBlock(segment_dim, strides, filters, kernel_size, dilations, use_spectral_norm=False)[source]

Bases: torch.nn.modules.module.Module

SBD (Sub-band Discriminator) Block

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.gan_svs.avocodo.avocodo.get_padding(kernel_size, dilation=1)[source]

espnet2.gan_svs.uhifigan.uhifigan

Unet-based HiFi-GAN Modules.

This code is based on https://github.com/jik876/hifi-gan and https://github.com/kan-bayashi/ParallelWaveGAN.

class espnet2.gan_svs.uhifigan.uhifigan.UHiFiGANGenerator(in_channels=80, out_channels=1, channels=512, global_channels: int = -1, kernel_size=7, downsample_scales=(2, 2, 8, 8), downsample_kernel_sizes=(4, 4, 16, 16), upsample_scales=(8, 8, 2, 2), upsample_kernel_sizes=(16, 16, 4, 4), resblock_kernel_sizes=(3, 7, 11), resblock_dilations=[(1, 3, 5), (1, 3, 5), (1, 3, 5)], projection_filters: List[int] = [0, 1, 1, 1], projection_kernels: List[int] = [0, 5, 7, 11], dropout=0.3, use_additional_convs=True, bias=True, nonlinear_activation='LeakyReLU', nonlinear_activation_params={'negative_slope': 0.1}, use_causal_conv=False, use_weight_norm=True, use_avocodo=False)[source]

Bases: torch.nn.modules.module.Module

UHiFiGAN generator module.

Initialize Unet-based HiFiGANGenerator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • channels (int) – Number of hidden representation channels.

  • global_channels (int) – Number of global conditioning channels.

  • kernel_size (int) – Kernel size of initial and final conv layer.

  • upsample_scales (list) – List of upsampling scales.

  • upsample_kernel_sizes (list) – List of kernel sizes for upsampling layers.

  • resblock_kernel_sizes (list) – List of kernel sizes for residual blocks.

  • resblock_dilations (list) – List of dilation list for residual blocks.

  • use_additional_convs (bool) – Whether to use additional conv layers in residual blocks.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (dict) – Hyperparameters for activation function.

  • use_causal_conv (bool) – Whether to use causal structure.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

apply_weight_norm()[source]

Apply weight normalization module to all of the layers.

forward(c=None, f0=None, excitation=None, g: Optional[torch.Tensor] = None)[source]

Calculate forward propagation.

Parameters:
  • c (Tensor) – Input tensor (B, in_channels, T).

  • f0 (Tensor) – Input tensor (B, 1, T).

  • excitation (Tensor) – Input tensor (B, frame_len, T).

Returns:

Output tensor (B, out_channels, T).

Return type:

Tensor

inference(excitation=None, f0=None, c=None, normalize_before=False)[source]

Perform inference.

Parameters:
  • c (Union[Tensor, ndarray]) – Input tensor (T, in_channels).

  • normalize_before (bool) – Whether to perform normalization.

Returns:

Output tensor (T * prod(upsample_scales), out_channels).

Return type:

Tensor

register_stats(stats)[source]

Register stats for de-normalization as buffer.

Parameters:

stats (str) – Path of statistics file (“.npy” or “.h5”).

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

reset_parameters()[source]

Reset parameters.

This initialization follows the official implementation manner. https://github.com/jik876/hifi-gan/blob/master/models.py

espnet2.gan_svs.uhifigan.sine_generator

class espnet2.gan_svs.uhifigan.sine_generator.SineGen(sample_rate, harmonic_num=0, sine_amp=0.1, noise_std=0.003, voiced_threshold=0, flag_for_pulse=False)[source]

Bases: torch.nn.modules.module.Module

Definition of the sine generator.

SineGen(sample_rate, harmonic_num=0, sine_amp=0.1, noise_std=0.003, voiced_threshold=0, flag_for_pulse=False)

Parameters:
  • sample_rate – Sampling rate in Hz.

  • harmonic_num – Number of harmonic overtones. (Default: 0)

  • sine_amp – Amplitude of the sine waveform. (Default: 0.1)

  • noise_std – Standard deviation of the additive Gaussian noise. (Default: 0.003)

  • voiced_threshold – F0 threshold for voiced/unvoiced classification. (Default: 0)

  • flag_for_pulse – Whether this SineGen is used inside PulseGen. (Default: False)

Note: when flag_for_pulse is True, the first time step of a voiced segment is always sin(np.pi) or cos(0).

forward(f0)[source]

Forward SineGen.

sine_tensor, uv = forward(f0) input F0: tensor(batchsize=1, length, dim=1)

f0 for unvoiced steps should be 0

output sine_tensor: tensor(batchsize=1, length, dim) output uv: tensor(batchsize=1, length, 1)
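
Example (a minimal usage sketch assuming a sample-level F0 contour in the (batchsize, length, 1) layout described above; the constant 220 Hz contour is illustrative, and the defensive unpacking accounts for variants that additionally return a noise tensor):

    import torch
    from espnet2.gan_svs.uhifigan import SineGen

    sine_gen = SineGen(sample_rate=24000, harmonic_num=7)
    f0 = torch.full((1, 2400, 1), 220.0)   # 0.1 s of a constant 220 Hz contour; 0 marks unvoiced steps
    out = sine_gen(f0)
    sine_tensor, uv = out[0], out[1]       # the docstring documents sine_tensor and uv
    print(sine_tensor.shape, uv.shape)     # documented: (batchsize, length, dim), (batchsize, length, 1)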

espnet2.gan_svs.uhifigan.__init__

class espnet2.gan_svs.uhifigan.__init__.UHiFiGANGenerator(in_channels=80, out_channels=1, channels=512, global_channels: int = -1, kernel_size=7, downsample_scales=(2, 2, 8, 8), downsample_kernel_sizes=(4, 4, 16, 16), upsample_scales=(8, 8, 2, 2), upsample_kernel_sizes=(16, 16, 4, 4), resblock_kernel_sizes=(3, 7, 11), resblock_dilations=[(1, 3, 5), (1, 3, 5), (1, 3, 5)], projection_filters: List[int] = [0, 1, 1, 1], projection_kernels: List[int] = [0, 5, 7, 11], dropout=0.3, use_additional_convs=True, bias=True, nonlinear_activation='LeakyReLU', nonlinear_activation_params={'negative_slope': 0.1}, use_causal_conv=False, use_weight_norm=True, use_avocodo=False)[source]

Bases: torch.nn.modules.module.Module

UHiFiGAN generator module.

Initialize Unet-based HiFiGANGenerator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • channels (int) – Number of hidden representation channels.

  • global_channels (int) – Number of global conditioning channels.

  • kernel_size (int) – Kernel size of initial and final conv layer.

  • upsample_scales (list) – List of upsampling scales.

  • upsample_kernel_sizes (list) – List of kernel sizes for upsampling layers.

  • resblock_kernel_sizes (list) – List of kernel sizes for residual blocks.

  • resblock_dilations (list) – List of dilation list for residual blocks.

  • use_additional_convs (bool) – Whether to use additional conv layers in residual blocks.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (dict) – Hyperparameters for activation function.

  • use_causal_conv (bool) – Whether to use causal structure.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

apply_weight_norm()[source]

Apply weight normalization module to all of the layers.

forward(c=None, f0=None, excitation=None, g: Optional[torch.Tensor] = None)[source]

Calculate forward propagation.

Parameters:
  • c (Tensor) – Input tensor (B, in_channels, T).

  • f0 (Tensor) – Input tensor (B, 1, T).

  • excitation (Tensor) – Input tensor (B, frame_len, T).

Returns:

Output tensor (B, out_channels, T).

Return type:

Tensor

inference(excitation=None, f0=None, c=None, normalize_before=False)[source]

Perform inference.

Parameters:
  • c (Union[Tensor, ndarray]) – Input tensor (T, in_channels).

  • normalize_before (bool) – Whether to perform normalization.

Returns:

Output tensor (T * prod(upsample_scales), out_channels).

Return type:

Tensor

register_stats(stats)[source]

Register stats for de-normalization as buffer.

Parameters:

stats (str) – Path of statistics file (“.npy” or “.h5”).

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

reset_parameters()[source]

Reset parameters.

This initialization follows the official implementation manner. https://github.com/jik876/hifi-gan/blob/master/models.py

class espnet2.gan_svs.uhifigan.__init__.SineGen(sample_rate, harmonic_num=0, sine_amp=0.1, noise_std=0.003, voiced_threshold=0, flag_for_pulse=False)[source]

Bases: torch.nn.modules.module.Module

Definition of the sine generator.

SineGen(sample_rate, harmonic_num=0, sine_amp=0.1, noise_std=0.003, voiced_threshold=0, flag_for_pulse=False)

Parameters:
  • sample_rate – Sampling rate in Hz.

  • harmonic_num – Number of harmonic overtones. (Default: 0)

  • sine_amp – Amplitude of the sine waveform. (Default: 0.1)

  • noise_std – Standard deviation of the additive Gaussian noise. (Default: 0.003)

  • voiced_threshold – F0 threshold for voiced/unvoiced classification. (Default: 0)

  • flag_for_pulse – Whether this SineGen is used inside PulseGen. (Default: False)

Note: when flag_for_pulse is True, the first time step of a voiced segment is always sin(np.pi) or cos(0).

forward(f0)[source]

Forward SineGen.

sine_tensor, uv = forward(f0) input F0: tensor(batchsize=1, length, dim=1)

f0 for unvoiced steps should be 0

output sine_tensor: tensor(batchsize=1, length, dim) output uv: tensor(batchsize=1, length, 1)

espnet2.gan_svs.visinger2.visinger2_vocoder

VISinger2 HiFi-GAN Modules.

This code is based on https://github.com/zhangyongmao/VISinger2

class espnet2.gan_svs.visinger2.visinger2_vocoder.BaseFrequenceDiscriminator(in_channels, hidden_channels=512, divisors=[32, 16, 8, 4, 2, 1, 1], strides=[1, 2, 1, 2, 1, 2, 1])[source]

Bases: torch.nn.modules.module.Module

Base Frequence Discriminator

Parameters:
  • in_channels (int) – Number of input channels.

  • hidden_channels (int, optional) – Number of channels in hidden layers. Defaults to 512.

  • divisors (List[int], optional) – List of divisors for the number of channels in each layer. The length of the list determines the number of layers. Defaults to [32, 16, 8, 4, 2, 1, 1].

  • strides (List[int], optional) – List of stride values for each layer. The length of the list determines the number of layers. Defaults to [1, 2, 1, 2, 1, 2, 1].

forward(x)[source]

Perform forward pass through the base frequency discriminator.

Parameters:

x (torch.Tensor) – Input tensor of shape (B, in_channels, freq_bins, time_steps).

Returns:

List of output tensors from each layer of the discriminator, where the first tensor corresponds to the output of the first layer, and so on.

Return type:

List[torch.Tensor]
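
Example (a minimal usage sketch with a random magnitude spectrogram shaped as documented above; the in_channels=1 setting and the spectrogram size are illustrative):

    import torch
    from espnet2.gan_svs.visinger2.visinger2_vocoder import BaseFrequenceDiscriminator

    disc = BaseFrequenceDiscriminator(in_channels=1)
    spec = torch.randn(1, 1, 128, 80)      # (B, in_channels, freq_bins, time_steps)
    outputs = disc(spec)                   # documented: list of per-layer output tensors
    print(len(outputs), outputs[-1].shape)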

class espnet2.gan_svs.visinger2.visinger2_vocoder.ConvReluNorm(in_channels, hidden_channels, out_channels, kernel_size, n_layers, dropout_rate)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.gan_svs.visinger2.visinger2_vocoder.Generator_Harm(hidden_channels: int = 192, n_harmonic: int = 64, kernel_size: int = 3, padding: int = 1, dropout_rate: float = 0.1, sample_rate: int = 22050, hop_size: int = 256)[source]

Bases: torch.nn.modules.module.Module

Initialize harmonic generator module.

Parameters:
  • hidden_channels (int) – Number of channels in the input and hidden layers.

  • n_harmonic (int) – Number of harmonic channels.

  • kernel_size (int) – Size of the convolutional kernel.

  • padding (int) – Amount of padding added to the input.

  • dropout_rate (float) – Dropout rate.

  • sample_rate (int) – Sampling rate of the input audio.

  • hop_size (int) – Hop size used in the analysis of the input audio.

forward(f0, harm, mask)[source]

Generate harmonics from F0 and harmonic data.

Parameters:
  • f0 (Tensor) – Pitch (F0) tensor (B, 1, T).

  • harm (Tensor) – Harmonic data tensor (B, hidden_channels, T).

  • mask (Tensor) – Mask tensor for harmonic data (B, 1, T).

Returns:

Harmonic signal tensor (B, n_harmonic, T * hop_length).

Return type:

Tensor
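
Example (a minimal usage sketch for the harmonic generator; the shapes follow the docstring above, while the random inputs and frame count are illustrative):

    import torch
    from espnet2.gan_svs.visinger2 import Generator_Harm

    harm_gen = Generator_Harm(hidden_channels=192, n_harmonic=64, sample_rate=22050, hop_size=256)
    B, T = 1, 20
    f0 = torch.rand(B, 1, T) * 200.0 + 100.0   # frame-level F0 (B, 1, T)
    harm = torch.randn(B, 192, T)              # harmonic features (B, hidden_channels, T)
    mask = torch.ones(B, 1, T)                 # non-padding mask (B, 1, T)
    harmonics = harm_gen(f0, harm, mask)       # documented: (B, n_harmonic, T * hop_length)
    print(harmonics.shape)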

class espnet2.gan_svs.visinger2.visinger2_vocoder.Generator_Noise(win_length: int = 1024, hop_length: int = 256, n_fft: int = 1024, hidden_channels: int = 192, kernel_size: int = 3, padding: int = 1, dropout_rate: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

Initialize the Generator_Noise module.

Parameters:
  • win_length (int, optional) – Window length. If None, set to n_fft.

  • hop_length (int) – Hop length.

  • n_fft (int) – FFT size.

  • hidden_channels (int) – Number of hidden representation channels.

  • kernel_size (int) – Size of the convolutional kernel.

  • padding (int) – Size of the padding applied to the input.

  • dropout_rate (float) – Dropout rate.

forward(x, mask)[source]

Forward Generator Noise.

Parameters:
  • x (Tensor) – Input tensor (B, hidden_channels, T).

  • mask (Tensor) – Mask tensor (B, 1, T).

Returns:

Output tensor (B, 1, T * hop_size).

Return type:

Tensor
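
Example (a minimal usage sketch for the noise generator; the shapes follow the docstring above, and the random inputs are illustrative):

    import torch
    from espnet2.gan_svs.visinger2 import Generator_Noise

    noise_gen = Generator_Noise(win_length=1024, hop_length=256, n_fft=1024, hidden_channels=192)
    x = torch.randn(1, 192, 20)    # hidden features (B, hidden_channels, T)
    mask = torch.ones(1, 1, 20)    # non-padding mask (B, 1, T)
    noise = noise_gen(x, mask)     # documented: (B, 1, T * hop_size)
    print(noise.shape)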

class espnet2.gan_svs.visinger2.visinger2_vocoder.LayerNorm(channels, eps=1e-05)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.gan_svs.visinger2.visinger2_vocoder.MelScale(n_mels: int = 128, sample_rate: int = 24000, f_min: float = 0.0, f_max: Optional[float] = None, n_stft: Optional[int] = None)[source]

Bases: torch.nn.modules.module.Module

Turn a normal STFT into a mel-frequency STFT using a conversion matrix. This uses triangular filter banks. The user can control which device the filter bank (fb) is on (e.g., fb.to(spec_f.device)).

Parameters:
  • n_mels (int, optional) – Number of mel filterbanks. (Default: 128)

  • sample_rate (int, optional) – Sample rate of the audio signal. (Default: 24000)

  • f_min (float, optional) – Minimum frequency. (Default: 0.0)

  • f_max (float, optional) – Maximum frequency. (Default: None, which corresponds to sample_rate // 2)

  • n_stft (int, optional) – Number of bins in STFT. Calculated from the first input if None is given. See n_fft in Spectrogram. (Default: None)

forward(specgram: torch.Tensor) → torch.Tensor[source]

Forward MelScale

Parameters:

specgram (Tensor) – A spectrogram STFT of dimension (…, freq, time).

Returns:

Mel frequency spectrogram of size (…, n_mels, time).

Return type:

Tensor
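
Example (a minimal usage sketch converting a random magnitude spectrogram to a mel spectrogram; n_stft is passed explicitly so the filter bank does not need to be inferred from the first input, and the sizes are illustrative):

    import torch
    from espnet2.gan_svs.visinger2.visinger2_vocoder import MelScale

    mel_scale = MelScale(n_mels=80, sample_rate=24000, n_stft=513)
    spec = torch.randn(1, 513, 100).abs()   # magnitude STFT (..., freq, time)
    mel = mel_scale(spec)                   # documented: (..., n_mels, time)
    print(mel.shape)                        # torch.Size([1, 80, 100])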

class espnet2.gan_svs.visinger2.visinger2_vocoder.MultiFrequencyDiscriminator(sample_rate: int = 22050, hop_lengths=[128, 256, 512], hidden_channels=[256, 512, 512], domain='double', mel_scale=True, divisors=[32, 16, 8, 4, 2, 1, 1], strides=[1, 2, 1, 2, 1, 2, 1])[source]

Bases: torch.nn.modules.module.Module

Multi-Frequency Discriminator module in UnivNet.

Initialize Multi-Frequency Discriminator module.

Parameters:
  • hop_lengths (list) – List of hop lengths.

  • hidden_channels (list) – List of number of channels in hidden layers.

  • domain (str) – Domain of input signal. Default is “double”.

  • mel_scale (bool) – Whether to use mel-scale frequency. Default is True.

  • divisors (list) – List of divisors for each layer in the discriminator. Default is [32, 16, 8, 4, 2, 1, 1].

  • strides (list) – List of strides for each layer in the discriminator. Default is [1, 2, 1, 2, 1, 2, 1].

forward(x)[source]

Forward pass of Multi-Frequency Discriminator module.

Parameters:

x (Tensor) – Input tensor (B, 1, T * hop_size).

Returns:

List of feature maps.

Return type:

List[Tensor]
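
Example (a minimal usage sketch feeding a random waveform chunk through the multi-frequency discriminator; the one-second input length and the explicitly passed defaults are illustrative):

    import torch
    from espnet2.gan_svs.visinger2.visinger2_vocoder import MultiFrequencyDiscriminator

    mfd = MultiFrequencyDiscriminator(sample_rate=22050, hop_lengths=[128, 256, 512])
    wav = torch.randn(1, 1, 22050)   # one second of audio (B, 1, T)
    feature_maps = mfd(wav)          # documented: list of feature maps
    print(len(feature_maps))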

class espnet2.gan_svs.visinger2.visinger2_vocoder.TorchSTFT(sample_rate, fft_size, hop_size, win_size, normalized=False, domain='linear', mel_scale=False, ref_level_db=20, min_level_db=-100)[source]

Bases: torch.nn.modules.module.Module

complex(x)[source]
transform(x)[source]
class espnet2.gan_svs.visinger2.visinger2_vocoder.VISinger2Discriminator(scales: int = 1, scale_downsample_pooling: str = 'AvgPool1d', scale_downsample_pooling_params: Dict[str, Any] = {'kernel_size': 4, 'padding': 2, 'stride': 2}, scale_discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1}, follow_official_norm: bool = True, periods: List[int] = [2, 3, 5, 7, 11], period_discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, multi_freq_disc_params: Dict[str, Any] = {'divisors': [32, 16, 8, 4, 2, 1, 1], 'domain': 'double', 'hidden_channels': [256, 512, 512], 'hop_length_factors': [4, 8, 16], 'mel_scale': True, 'sample_rate': 22050, 'strides': [1, 2, 1, 2, 1, 2, 1]})[source]

Bases: torch.nn.modules.module.Module

Discriminator module for VISinger2, including MSD, MPD, and MFD.

Parameters:
  • scales (int) – Number of scales to be used in the multi-scale discriminator.

  • scale_downsample_pooling (str) – Type of pooling used for downsampling.

  • scale_downsample_pooling_params (Dict[str, Any]) – Parameters for the downsampling pooling layer.

  • scale_discriminator_params (Dict[str, Any]) – Parameters for the scale discriminator.

  • follow_official_norm (bool) – Whether to follow the official normalization.

  • periods (List[int]) – List of periods to be used in the multi-period discriminator.

  • period_discriminator_params (Dict[str, Any]) – Parameters for the period discriminator.

  • multi_freq_disc_params (Dict[str, Any]) – Parameters for the multi-frequency discriminator.

  • use_spectral_norm (bool) – Whether to use spectral normalization or not.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.gan_svs.visinger2.visinger2_vocoder.VISinger2VocoderGenerator(in_channels: int = 80, out_channels: int = 1, channels: int = 512, global_channels: int = -1, kernel_size: int = 7, upsample_scales: List[int] = [8, 8, 2, 2], upsample_kernel_sizes: List[int] = [16, 16, 4, 4], resblock_kernel_sizes: List[int] = [3, 7, 11], resblock_dilations: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], n_harmonic: int = 64, use_additional_convs: bool = True, bias: bool = True, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.1}, use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

Initialize HiFiGANGenerator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • channels (int) – Number of hidden representation channels.

  • global_channels (int) – Number of global conditioning channels.

  • kernel_size (int) – Kernel size of initial and final conv layer.

  • upsample_scales (List[int]) – List of upsampling scales.

  • upsample_kernel_sizes (List[int]) – List of kernel sizes for upsample layers.

  • resblock_kernel_sizes (List[int]) – List of kernel sizes for residual blocks.

  • resblock_dilations (List[List[int]]) – List of list of dilations for residual blocks.

  • n_harmonic (int) – Number of harmonics used to synthesize a sound signal.

  • use_additional_convs (bool) – Whether to use additional conv layers in residual blocks.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

apply_weight_norm()[source]

Apply weight normalization module to all of the layers.

forward(c, ddsp, g: Optional[torch.Tensor] = None) → torch.Tensor[source]

Calculate forward propagation.

Parameters:
  • c (Tensor) – Input tensor (B, in_channels, T).

  • ddsp (Tensor) – Input tensor (B, n_harmonic + 2, T * hop_length).

  • g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).

Returns:

Output tensor (B, out_channels, T).

Return type:

Tensor

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

reset_parameters()[source]

Reset parameters.

This initialization follows the official implementation manner. https://github.com/jik876/hifi-gan/blob/master/models.py
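
Example (a minimal usage sketch for VISinger2VocoderGenerator under the defaults above; the total upsampling factor prod(upsample_scales) = 256 and the (n_harmonic + 2)-channel ddsp input follow the docstring, while the random inputs are illustrative):

    import torch
    from espnet2.gan_svs.visinger2 import VISinger2VocoderGenerator

    generator = VISinger2VocoderGenerator()    # defaults: in_channels=80, n_harmonic=64
    T, hop = 16, 8 * 8 * 2 * 2                 # hop = prod(upsample_scales) = 256
    c = torch.randn(1, 80, T)                  # acoustic features (B, in_channels, T)
    ddsp = torch.randn(1, 64 + 2, T * hop)     # DDSP source signals (B, n_harmonic + 2, T * hop)
    wav = generator(c, ddsp)
    print(wav.shape)                           # output waveform (B, out_channels, ...)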

espnet2.gan_svs.visinger2.visinger2_vocoder.create_fb_matrix(n_freqs: int, f_min: float, f_max: float, n_mels: int, sample_rate: int, norm: Optional[str] = None) → torch.Tensor[source]

Create a frequency bin conversion matrix.

Parameters:
  • n_freqs (int) – Number of frequencies to highlight/apply

  • f_min (float) – Minimum frequency (Hz)

  • f_max (float) – Maximum frequency (Hz)

  • n_mels (int) – Number of mel filterbanks

  • sample_rate (int) – Sample rate of the audio waveform

  • norm (Optional[str]) – If ‘slaney’, divide the triangular mel weights by the width of the mel band (area normalization). (Default: None)

Returns:

Triangular filter banks (fb matrix) of size (n_freqs, n_mels) meaning number of frequencies to highlight/apply to x the number of filterbanks. Each column is a filterbank so that assuming there is a matrix A of size (…, n_freqs), the applied result would be A * create_fb_matrix(A.size(-1), …).

Return type:

Tensor
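
Example (a minimal usage sketch that builds a filter-bank matrix and applies it to a random magnitude spectrogram, following the shapes described above; the concrete sizes are illustrative):

    import torch
    from espnet2.gan_svs.visinger2.visinger2_vocoder import create_fb_matrix

    n_freqs, n_mels, sample_rate = 513, 80, 24000
    fb = create_fb_matrix(n_freqs, 0.0, sample_rate / 2, n_mels, sample_rate)
    print(fb.shape)                            # torch.Size([513, 80])

    spec = torch.randn(1, 100, n_freqs).abs()  # (..., time, n_freqs)
    mel = spec @ fb                            # apply the filter banks: (..., time, n_mels)
    print(mel.shape)                           # torch.Size([1, 100, 80])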

espnet2.gan_svs.visinger2.ddsp

espnet2.gan_svs.visinger2.ddsp.amp_to_impulse_response(amp, target_size)[source]
espnet2.gan_svs.visinger2.ddsp.extract_loudness(signal, sampling_rate, block_size, n_fft=2048)[source]
espnet2.gan_svs.visinger2.ddsp.extract_pitch(signal, sampling_rate, block_size)[source]
espnet2.gan_svs.visinger2.ddsp.fft_convolve(signal, kernel)[source]
espnet2.gan_svs.visinger2.ddsp.gru(n_input, hidden_size)[source]
espnet2.gan_svs.visinger2.ddsp.harmonic_synth(pitch, amplitudes, sampling_rate)[source]
espnet2.gan_svs.visinger2.ddsp.init_kernels(win_len, win_inc, fft_len, win_type=None, invers=False)[source]
espnet2.gan_svs.visinger2.ddsp.mean_std_loudness(dataset)[source]
espnet2.gan_svs.visinger2.ddsp.mlp(in_size, hidden_size, n_layers)[source]
espnet2.gan_svs.visinger2.ddsp.multiscale_fft(signal, scales, overlap)[source]
espnet2.gan_svs.visinger2.ddsp.remove_above_nyquist(amplitudes, pitch, sampling_rate)[source]
espnet2.gan_svs.visinger2.ddsp.resample(x, factor: int)[source]
espnet2.gan_svs.visinger2.ddsp.safe_log(x)[source]
espnet2.gan_svs.visinger2.ddsp.scale_function(x)[source]
espnet2.gan_svs.visinger2.ddsp.upsample(signal, factor)[source]

espnet2.gan_svs.visinger2.__init__

class espnet2.gan_svs.visinger2.__init__.Generator_Harm(hidden_channels: int = 192, n_harmonic: int = 64, kernel_size: int = 3, padding: int = 1, dropout_rate: float = 0.1, sample_rate: int = 22050, hop_size: int = 256)[source]

Bases: torch.nn.modules.module.Module

Initialize harmonic generator module.

Parameters:
  • hidden_channels (int) – Number of channels in the input and hidden layers.

  • n_harmonic (int) – Number of harmonic channels.

  • kernel_size (int) – Size of the convolutional kernel.

  • padding (int) – Amount of padding added to the input.

  • dropout_rate (float) – Dropout rate.

  • sample_rate (int) – Sampling rate of the input audio.

  • hop_size (int) – Hop size used in the analysis of the input audio.

forward(f0, harm, mask)[source]

Generate harmonics from F0 and harmonic data.

Parameters:
  • f0 (Tensor) – Pitch (F0) tensor (B, 1, T).

  • harm (Tensor) – Harmonic data tensor (B, hidden_channels, T).

  • mask (Tensor) – Mask tensor for harmonic data (B, 1, T).

Returns:

Harmonic signal tensor (B, n_harmonic, T * hop_length).

Return type:

Tensor

class espnet2.gan_svs.visinger2.__init__.Generator_Noise(win_length: int = 1024, hop_length: int = 256, n_fft: int = 1024, hidden_channels: int = 192, kernel_size: int = 3, padding: int = 1, dropout_rate: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

Initialize the Generator_Noise module.

Parameters:
  • win_length (int, optional) – Window length. If None, set to n_fft.

  • hop_length (int) – Hop length.

  • n_fft (int) – FFT size.

  • hidden_channels (int) – Number of hidden representation channels.

  • kernel_size (int) – Size of the convolutional kernel.

  • padding (int) – Size of the padding applied to the input.

  • dropout_rate (float) – Dropout rate.

forward(x, mask)[source]

Forward Generator Noise.

Parameters:
  • x (Tensor) – Input tensor (B, hidden_channels, T).

  • mask (Tensor) – Mask tensor (B, 1, T).

Returns:

Output tensor (B, 1, T * hop_size).

Return type:

Tensor

class espnet2.gan_svs.visinger2.__init__.VISinger2Discriminator(scales: int = 1, scale_downsample_pooling: str = 'AvgPool1d', scale_downsample_pooling_params: Dict[str, Any] = {'kernel_size': 4, 'padding': 2, 'stride': 2}, scale_discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1}, follow_official_norm: bool = True, periods: List[int] = [2, 3, 5, 7, 11], period_discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, multi_freq_disc_params: Dict[str, Any] = {'divisors': [32, 16, 8, 4, 2, 1, 1], 'domain': 'double', 'hidden_channels': [256, 512, 512], 'hop_length_factors': [4, 8, 16], 'mel_scale': True, 'sample_rate': 22050, 'strides': [1, 2, 1, 2, 1, 2, 1]})[source]

Bases: torch.nn.modules.module.Module

Discriminator module for VISinger2, including MSD, MPD, and MFD.

Parameters:
  • scales (int) – Number of scales to be used in the multi-scale discriminator.

  • scale_downsample_pooling (str) – Type of pooling used for downsampling.

  • scale_downsample_pooling_params (Dict[str, Any]) – Parameters for the downsampling pooling layer.

  • scale_discriminator_params (Dict[str, Any]) – Parameters for the scale discriminator.

  • follow_official_norm (bool) – Whether to follow the official normalization.

  • periods (List[int]) – List of periods to be used in the multi-period discriminator.

  • period_discriminator_params (Dict[str, Any]) – Parameters for the period discriminator.

  • multi_freq_disc_params (Dict[str, Any]) – Parameters for the multi-frequency discriminator.

  • use_spectral_norm (bool) – Whether to use spectral normalization or not.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.gan_svs.visinger2.__init__.VISinger2VocoderGenerator(in_channels: int = 80, out_channels: int = 1, channels: int = 512, global_channels: int = -1, kernel_size: int = 7, upsample_scales: List[int] = [8, 8, 2, 2], upsample_kernel_sizes: List[int] = [16, 16, 4, 4], resblock_kernel_sizes: List[int] = [3, 7, 11], resblock_dilations: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], n_harmonic: int = 64, use_additional_convs: bool = True, bias: bool = True, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.1}, use_weight_norm: bool = True)[source]

Bases: torch.nn.modules.module.Module

Initialize HiFiGANGenerator module.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • channels (int) – Number of hidden representation channels.

  • global_channels (int) – Number of global conditioning channels.

  • kernel_size (int) – Kernel size of initial and final conv layer.

  • upsample_scales (List[int]) – List of upsampling scales.

  • upsample_kernel_sizes (List[int]) – List of kernel sizes for upsample layers.

  • resblock_kernel_sizes (List[int]) – List of kernel sizes for residual blocks.

  • resblock_dilations (List[List[int]]) – List of list of dilations for residual blocks.

  • n_harmonic (int) – Number of harmonics used to synthesize a sound signal.

  • use_additional_convs (bool) – Whether to use additional conv layers in residual blocks.

  • bias (bool) – Whether to add bias parameter in convolution layers.

  • nonlinear_activation (str) – Activation function module name.

  • nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.

  • use_weight_norm (bool) – Whether to use weight norm. If set to true, it will be applied to all of the conv layers.

apply_weight_norm()[source]

Apply weight normalization module to all of the layers.

forward(c, ddsp, g: Optional[torch.Tensor] = None) → torch.Tensor[source]

Calculate forward propagation.

Parameters:
  • c (Tensor) – Input tensor (B, in_channels, T).

  • ddsp (Tensor) – Input tensor (B, n_harmonic + 2, T * hop_length).

  • g (Optional[Tensor]) – Global conditioning tensor (B, global_channels, 1).

Returns:

Output tensor (B, out_channels, T).

Return type:

Tensor

remove_weight_norm()[source]

Remove weight normalization module from all of the layers.

reset_parameters()[source]

Reset parameters.

This initialization follows the official implementation manner. https://github.com/jik876/hifi-gan/blob/master/models.py

espnet2.gan_svs.utils.expand_f0

Function to expand frame-level F0 to waveform length.

espnet2.gan_svs.utils.expand_f0.expand_f0(f0_frame, hop_length, method='interpolation')[source]

Expand f0 to output wave length.

Parameters:
  • f0_frame (Tensor) – Input tensor (B, 1, frame_len).

  • hop_length (int) – Hop length.

  • method (str) – Method to expand f0. Choose either ‘interpolation’ or ‘repeat’.

Returns:

Output tensor (B, 1, wav_len).

Return type:

Tensor
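
Example (a minimal usage sketch expanding a frame-level F0 contour to waveform length; the 50-frame contour and the hop length of 256 are illustrative):

    import torch
    from espnet2.gan_svs.utils.expand_f0 import expand_f0

    f0_frame = torch.rand(1, 1, 50) * 200.0 + 100.0                      # frame-level F0 (B, 1, frame_len)
    f0_wav = expand_f0(f0_frame, hop_length=256, method="interpolation")
    print(f0_wav.shape)                                                  # documented: (B, 1, wav_len)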

espnet2.gan_svs.utils.__init__