espnet2.speechlm.tokenizer.codec_tokenizer.CodecTokenizer
class espnet2.speechlm.tokenizer.codec_tokenizer.CodecTokenizer(codec_choice: str, codec_fs: int, device: str = 'cpu', dump_audio: bool = False, checkpoint_path: str | None = None, config_path: str | None = None, max_token_per_frame: int = 32)
Bases: AbsTokenizer
Codec Tokenizer implementation
Use cases:
- Use encode and decode for discrete (de)tokenization.
- Use encode_continuous and decode_continuous for continuous (de)tokenization.
- Use forward and detokenize for discrete (de)tokenization with a flattened sequence style, which is more friendly for speech LM tasks.
Codec Tokenizer initialization
Each codec implementation should set all of the following attributes:
- self.n_codebook (int): the number of codec codebooks.
- self.size_codebook (int): the dimension of each codebook.
- self.sample_rate (int): the sample rate the model was trained on.
- self.subsample (int): the subsample rate, a.k.a. the frame shift.
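The attributes above fix the token budget for a given utterance. A minimal sketch of that arithmetic, using illustrative values (a 16 kHz codec with a 320-sample frame shift and 8 codebooks, which are assumptions and not fixed by CodecTokenizer itself):

```python
# Rough token-budget arithmetic implied by the attributes above.
# The concrete values below are illustrative assumptions, not
# values mandated by CodecTokenizer.
sample_rate = 16000   # self.sample_rate
subsample = 320       # self.subsample (frame shift in samples)
n_codebook = 8        # self.n_codebook

n_sample = sample_rate * 2    # a 2-second utterance
T = n_sample // subsample     # number of codec frames
flat_len = T * n_codebook     # length of the flattened code sequence

print(T, flat_len)            # 100 800
```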
decode(codes)
Recover the waveform from the codes.
Input:
    codes (torch.Tensor): int tensor in shape [B, T, n_codebook]
Output:
    waveform (torch.Tensor): float tensor in shape [B, n_sample]
decode_continuous(z)
Recover the waveform from the continuous codec representations.
Input:
    z (torch.Tensor): float tensor in shape [B, T, D], the continuous codec representations
Output:
    waveform (torch.Tensor): float tensor in shape [B, n_sample]
detokenize(codes, n_codebook=None)
Convert flattened codec codes back into a resynthesized audio waveform.
Input:
    codes (torch.Tensor): int tensor in shape [B, T * n_codebook] or [T * n_codebook]
Output:
    waveform (torch.Tensor): float tensor in shape [B, n_sample] or [n_sample]
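The [B, T, n_codebook] to [T * n_codebook] relationship implies a frame-major interleaving, where each frame contributes n_codebook consecutive entries. A minimal sketch of that reshape with plain Python lists (whether espnet additionally applies per-codebook vocabulary offsets is not shown here; this only illustrates the layout):

```python
# Sketch of the frame-major flattening implied by the shapes above.
def flatten_codes(codes):
    """[T, n_codebook] nested list -> [T * n_codebook] flat list."""
    return [c for frame in codes for c in frame]

def unflatten_codes(flat, n_codebook):
    """[T * n_codebook] flat list -> [T, n_codebook] nested list."""
    assert len(flat) % n_codebook == 0
    return [flat[i:i + n_codebook] for i in range(0, len(flat), n_codebook)]

codes = [[11, 25], [7, 3], [42, 0]]   # T=3 frames, n_codebook=2
flat = flatten_codes(codes)
print(flat)                           # [11, 25, 7, 3, 42, 0]
assert unflatten_codes(flat, 2) == codes
```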
encode(wavs)
Convert audio waveforms into codec codes.
Input:
    wavs (torch.Tensor): float tensor in shape [B, 1, n_sample]
Output:
    codes (torch.Tensor): int tensor in shape [B, T, n_codebook]
encode_continuous(wavs)
Convert audio waveforms into continuous codec encoding results.
Input:
    wavs (torch.Tensor): float tensor in shape [B, 1, n_sample]
Output:
    z (torch.Tensor): float tensor in shape [B, T, D]
forward(wavs)
Convert audio waveforms into flattened codec codes and resynthesize the audio.
Input:
    wavs (torch.Tensor): float tensor in shape [B, 1, n_sample]
Output:
    codes (torch.Tensor): int tensor in shape [B, T * n_codebook]
    resyn_audio (torch.Tensor): float tensor in shape [B, n_sample]
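How forward relates to encode can be sketched with shape-only stand-ins: forward flattens encode's [B, T, n_codebook] output into [B, T * n_codebook]. No real codec is involved below; the toy codes are arbitrary integers introduced purely for illustration:

```python
# Shape-only sketch: forward's flattened output is encode's output
# with the codebook axis folded into the time axis.
B, T, n_codebook = 2, 4, 3

# Stand-in for encode(wavs): int codes in shape [B, T, n_codebook].
codes = [[[b * 100 + t * 10 + q for q in range(n_codebook)]
          for t in range(T)] for b in range(B)]

# forward(wavs) additionally flattens each sample to [T * n_codebook]:
flat = [[c for frame in sample for c in frame] for sample in codes]

assert len(flat) == B
assert all(len(sample) == T * n_codebook for sample in flat)
# detokenize(codes, n_codebook) inverts this flattening before
# decoding back to a waveform of shape [B, n_sample].
```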