espnet2.gan_codec.hificodec.hificodec.HiFiCodec

class espnet2.gan_codec.hificodec.hificodec.HiFiCodec(sampling_rate: int = 16000, generator_params: Dict[str, Any] = {'hidden_dim': 256, 'quantizer_bins': 1024, 'quantizer_decay': 0.99, 'quantizer_kmeans_init': True, 'quantizer_kmeans_iters': 50, 'quantizer_n_q': 8, 'quantizer_target_bandwidth': [7.5, 15], 'quantizer_threshold_ema_dead_code': 2, 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'resblock_kernel_sizes': [3, 7, 11], 'resblock_num': '1', 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 11, 8, 4], 'upsample_rates': [8, 5, 4, 2]}, discriminator_params: Dict[str, Any] = {'msstft_discriminator_params': {'activation': 'LeakyReLU', 'activation_params': {'negative_slope': 0.2}, 'filters': 32, 'hop_lengths': [256, 512, 128, 64, 32], 'in_channels': 1, 'n_ffts': [1024, 2048, 512, 256, 128], 'norm': 'weight_norm', 'out_channels': 1, 'win_lengths': [1024, 2048, 512, 256, 128]}, 'periods': [2, 3, 5, 7, 11], 'periods_discriminator_params': {'bias': False, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'scale_discriminator_params': {'bias': False, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'scale_downsample_pooling': 'AvgPool1d', 'scale_downsample_pooling_params': {'kernel_size': 4, 'padding': 2, 'stride': 2}, 'scale_follow_official_norm': False, 'scales': 3}, generator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, discriminator_adv_loss_params: Dict[str, Any] = 
{'average_by_discriminators': False, 'loss_type': 'mse'}, use_feat_match_loss: bool = True, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, use_mel_loss: bool = True, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 16000, 'log_base': None, 'n_mels': 80, 'range_end': 11, 'range_start': 6, 'window': 'hann'}, use_dual_decoder: bool = True, lambda_quantization: float = 1.0, lambda_reconstruct: float = 1.0, lambda_commit: float = 1.0, lambda_adv: float = 1.0, lambda_feat_match: float = 2.0, lambda_mel: float = 45.0, cache_generator_outputs: bool = False, use_loss_balancer: bool = False, balance_ema_decay: float = 0.99)

Bases: AbsGANCodec

HiFiCodec model.

Initialize HiFiCodec model.
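The `lambda_*` arguments in the signature above weight the individual generator loss terms. The sketch below shows one way such weights could combine, assuming a plain weighted sum when `use_loss_balancer` is False. The helper `weighted_generator_loss` and the raw loss values are hypothetical and are not part of ESPnet.

```python
# Default weights copied from the constructor signature above.
LAMBDAS = {
    "adv": 1.0,
    "feat_match": 2.0,
    "mel": 45.0,
    "commit": 1.0,
    "quantization": 1.0,
    "reconstruct": 1.0,
}


def weighted_generator_loss(terms: dict, lambdas: dict = LAMBDAS) -> float:
    """Combine raw loss terms as a plain weighted sum (an assumption)."""
    return sum(lambdas[name] * value for name, value in terms.items())


# Made-up raw loss values, for illustration only.
example_terms = {
    "adv": 0.5,
    "feat_match": 0.25,
    "mel": 0.1,
    "commit": 0.2,
    "quantization": 0.2,
    "reconstruct": 0.3,
}
total = weighted_generator_loss(example_terms)
print(f"weighted generator loss: {total:.2f}")
```

With the default weights the mel term dominates: a raw mel loss of 0.1 contributes 4.5 to the total, while an adversarial loss of 0.5 contributes only 0.5.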

decode(x: Tensor, **kwargs) → Tensor

Run decoding.

Parameters: x (Tensor) – Input codes (T_code, N_stream).
Returns: Generated waveform (T_wav,).
Return type: Tensor

encode(x: Tensor, **kwargs) → Tensor

Run encoding.

Parameters: x (Tensor) – Input audio (T_wav,).
Returns: Generated codes (T_code, N_stream).
Return type: Tensor
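To get a feel for how T_wav relates to T_code, the following sketch estimates the code frame rate from the default constructor arguments. It assumes that one code frame spans the product of `upsample_rates` waveform samples; the helper `expected_code_frames` is hypothetical, and the real model's padding may give a slightly different length.

```python
from math import prod

# Defaults copied from the constructor signature.
sampling_rate = 16000
upsample_rates = [8, 5, 4, 2]

# Assumption: one code frame covers prod(upsample_rates) waveform samples.
hop = prod(upsample_rates)
frame_rate = sampling_rate / hop


def expected_code_frames(num_samples: int) -> int:
    """Rough T_code estimate using ceiling division; actual padding may differ."""
    return -(-num_samples // hop)


print(hop, frame_rate, expected_code_frames(sampling_rate))
```

Under this assumption, one second of 16 kHz audio maps to about 50 code frames, each covering 320 samples.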

forward(audio: Tensor, forward_generator: bool = True, **kwargs) → Dict[str, Any]

Perform a generator or discriminator forward pass, depending on forward_generator.

Parameters:
- audio (Tensor) – Audio waveform tensor (B, T_wav).
- forward_generator (bool) – Whether to forward the generator; if False, the discriminator is forwarded instead.
Returns:
- loss (Tensor): Loss scalar tensor.
- stats (Dict[str, float]): Statistics to be monitored.
- weight (Tensor): Weight tensor to summarize losses.
- optim_idx (int): Optimizer index (0 for G and 1 for D).
Return type: Dict[str, Any]
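The returned `optim_idx` lets a training loop pick the matching optimizer. The sketch below illustrates that dispatch with a hypothetical `StubCodec` that mimics only the documented return keys. It is not the ESPnet trainer, and the real turn order and update logic may differ.

```python
from typing import Any, Dict, List, Sequence


class StubCodec:
    """Hypothetical stand-in that returns only the documented keys."""

    def forward(
        self, audio: Sequence, forward_generator: bool = True, **kwargs
    ) -> Dict[str, Any]:
        return {
            "loss": 0.0,  # placeholder scalar
            "stats": {},  # placeholder statistics
            "weight": len(audio),  # placeholder: batch size
            "optim_idx": 0 if forward_generator else 1,  # 0 for G, 1 for D
        }


def train_step(model, audio: Sequence, optimizers: Sequence[str]) -> List[str]:
    """Run one generator turn, then one discriminator turn."""
    used = []
    for forward_generator in (True, False):
        out = model.forward(audio, forward_generator=forward_generator)
        # Select the optimizer that matches the returned index.
        optimizer = optimizers[out["optim_idx"]]
        # A real loop would backpropagate out["loss"] and step the optimizer here.
        used.append(optimizer)
    return used


print(train_step(StubCodec(), [[0.0] * 4], ["generator_opt", "discriminator_opt"]))
```

Keeping the optimizer choice keyed on `optim_idx` rather than on the `forward_generator` flag means the loop relies only on what the model reports back.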

inference(x: Tensor, **kwargs) → Dict[str, Tensor]

Run inference.

Parameters: x (Tensor) – Input audio (T_wav,).
Returns:
- wav (Tensor): Generated waveform tensor (T_wav,).
- codec (Tensor): Generated neural codec (T_code, N_stream).
Return type: Dict[str, Tensor]

meta_info() → Dict[str, Any]

Return meta information of the codec.