espnet2.speechlm.tokenizer.codec_tokenizer.CodecTokenizer
class espnet2.speechlm.tokenizer.codec_tokenizer.CodecTokenizer(codec_choice: str, codec_fs: int, device: str = 'cpu', dump_audio: bool = False, checkpoint_path: str | None = None, config_path: str | None = None, max_token_per_frame: int = 32)
Bases: AbsTokenizer
Codec Tokenizer implementation
Use cases:
- Use encode and decode for discrete (de)tokenization.
- Use encode_continuous and decode_continuous for continuous (de)tokenization.
- Use forward and detokenize for discrete (de)tokenization with a flattened sequence style, which is more friendly for speech LM tasks.
Codec Tokenizer initialization
Each codec implementation should set all of the following attributes:
- self.n_codebook (int): the number of codec codebooks.
- self.size_codebook (int): the dimension of each codebook.
- self.sample_rate (int): the sample rate the model was trained on.
- self.subsample (int): the subsample rate, a.k.a. the frame shift.
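The attributes above fix the token budget for a given utterance. A minimal sketch of that arithmetic, using illustrative values (a 16 kHz codec with a 320-sample frame shift and 8 codebooks, which are assumptions and not fixed by CodecTokenizer itself):

```python
# Rough token-budget arithmetic implied by the attributes above.
# The concrete values below are illustrative assumptions, not
# values mandated by CodecTokenizer.
sample_rate = 16000   # self.sample_rate
subsample = 320       # self.subsample (frame shift in samples)
n_codebook = 8        # self.n_codebook

n_sample = sample_rate * 2    # a 2-second utterance
T = n_sample // subsample     # number of codec frames
flat_len = T * n_codebook     # length of the flattened code sequence

print(T, flat_len)            # 100 800
```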
decode(codes)
Recover the waveform from the codes.
Input:
    codes (torch.Tensor): int tensor in shape [B, T, n_codebook]
Output:
    waveform (torch.Tensor): float tensor in shape [B, n_sample]
decode_continuous(z)
Recover the waveform from the continuous codec representations.
Input:
    z (torch.Tensor): float tensor in shape [B, T, D], the continuous codec representations
Output:
    waveform (torch.Tensor): float tensor in shape [B, n_sample]
detokenize(codes, n_codebook=None)
Convert flattened codec codes back into a resynthesized audio waveform.
Input:
    codes (torch.Tensor): int tensor in shape [B, T * n_codebook] or [T * n_codebook]
Output:
    waveform (torch.Tensor): float tensor in shape [B, n_sample] or [n_sample]
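The [B, T, n_codebook] to [T * n_codebook] relationship implies a frame-major interleaving, where each frame contributes n_codebook consecutive entries. A minimal sketch of that reshape with plain Python lists (whether espnet additionally applies per-codebook vocabulary offsets is not shown here; this only illustrates the layout):

```python
# Sketch of the frame-major flattening implied by the shapes above.
def flatten_codes(codes):
    """[T, n_codebook] nested list -> [T * n_codebook] flat list."""
    return [c for frame in codes for c in frame]

def unflatten_codes(flat, n_codebook):
    """[T * n_codebook] flat list -> [T, n_codebook] nested list."""
    assert len(flat) % n_codebook == 0
    return [flat[i:i + n_codebook] for i in range(0, len(flat), n_codebook)]

codes = [[11, 25], [7, 3], [42, 0]]   # T=3 frames, n_codebook=2
flat = flatten_codes(codes)
print(flat)                           # [11, 25, 7, 3, 42, 0]
assert unflatten_codes(flat, 2) == codes
```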
encode(wavs)
Convert audio waveforms into codec codes.
Input:
    wavs (torch.Tensor): float tensor in shape [B, 1, n_sample]
Output:
    codes (torch.Tensor): int tensor in shape [B, T, n_codebook]
encode_continuous(wavs)
Convert audio waveforms into continuous codec encoding results.
Input:
    wavs (torch.Tensor): float tensor in shape [B, 1, n_sample]
Output:
    z (torch.Tensor): float tensor in shape [B, T, D]
forward(wavs)
Convert audio waveforms into flattened codec codes and resynthesize the audio.
Input:
    wavs (torch.Tensor): float tensor in shape [B, 1, n_sample]
Output:
    codes (torch.Tensor): int tensor in shape [B, T * n_codebook]
    resyn_audio (torch.Tensor): float tensor in shape [B, n_sample]
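How forward relates to encode can be sketched with shape-only stand-ins: forward flattens encode's [B, T, n_codebook] output into [B, T * n_codebook]. No real codec is involved below; the toy codes are arbitrary integers introduced purely for illustration:

```python
# Shape-only sketch: forward's flattened output is encode's output
# with the codebook axis folded into the time axis.
B, T, n_codebook = 2, 4, 3

# Stand-in for encode(wavs): int codes in shape [B, T, n_codebook].
codes = [[[b * 100 + t * 10 + q for q in range(n_codebook)]
          for t in range(T)] for b in range(B)]

# forward(wavs) additionally flattens each sample to [T * n_codebook]:
flat = [[c for frame in sample for c in frame] for sample in codes]

assert len(flat) == B
assert all(len(sample) == T * n_codebook for sample in flat)
# detokenize(codes, n_codebook) inverts this flattening before
# decoding back to a waveform of shape [B, n_sample].
```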