espnet2.speechlm.tokenizer.codec_tokenizer.CodecTokenizer
class espnet2.speechlm.tokenizer.codec_tokenizer.CodecTokenizer(codec_choice: str, codec_fs: int, device: str = 'cpu', dump_audio: bool = False, checkpoint_path: str | None = None, config_path: str | None = None, max_token_per_frame: int = 32)
Bases: AbsTokenizer
Codec Tokenizer implementation.
Use cases:
- use encode and decode for discrete (de)tokenization
- use encode_continuous and decode_continuous for continuous (de)tokenization
- use forward and detokenize for discrete (de)tokenization with a flattened sequence style, which is more friendly for speech LM tasks
Codec Tokenizer initialization.
Each codec implementation should set all of the following attributes:
- self.n_codebook (int): the number of codec codebooks.
- self.size_codebook (int): the dimension of each codebook.
- self.sample_rate (int): the sample rate the model was trained on.
- self.subsample (int): the subsample rate, a.k.a. the frame shift.
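A minimal construction sketch; the codec_choice value, sample rate, and device below are illustrative assumptions, not values prescribed by the class:

```python
import torch

from espnet2.speechlm.tokenizer.codec_tokenizer import CodecTokenizer

# Illustrative arguments; supported codec_choice strings and whether a
# checkpoint/config path is required depend on the installed codec backends.
tokenizer = CodecTokenizer(
    codec_choice="EnCodec",   # assumed backend name
    codec_fs=16000,           # sample rate fed to the codec
    device="cuda" if torch.cuda.is_available() else "cpu",
    dump_audio=True,          # also resynthesize audio in forward()
)

# After construction, the chosen backend populates:
#   tokenizer.n_codebook, tokenizer.size_codebook,
#   tokenizer.sample_rate, tokenizer.subsample
```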
decode(codes)
Recover the waveform from the codes.
Input:
- codes (torch.Tensor): int tensor of shape [B, T, n_codebook]
Output:
- waveform (torch.Tensor): float tensor of shape [B, n_sample]
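A shape-focused sketch, reusing the tokenizer constructed above and treating size_codebook as the number of entries per codebook (an assumption); the random codes are placeholders for the output of encode():

```python
import torch

B, T = 2, 100  # batch size, number of codec frames

# Placeholder codes; in practice they come from tokenizer.encode().
codes = torch.randint(
    0, tokenizer.size_codebook, (B, T, tokenizer.n_codebook)
)

with torch.no_grad():
    waveform = tokenizer.decode(codes)  # [B, n_sample]

# n_sample is roughly T * tokenizer.subsample (exact length is backend dependent).
```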
decode_continuous(z)
Recover the waveform from the continuous codec representations.
Input:
- z (torch.Tensor): float tensor of shape [B, T, D], the continuous codec representations
Output:
- waveform (torch.Tensor): float tensor of shape [B, n_sample]
detokenize(codes, n_codebook=None)
Convert flattened codec codes back into resynthesized audio.
Input:
- codes (torch.Tensor): int tensor of shape [B, T * n_codebook] or [T * n_codebook]
Output:
- waveform (torch.Tensor): float tensor of shape [B, n_sample] or [n_sample]
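A detokenization sketch that reuses the flattened codes produced by forward() (documented below); the unbatched call relies on the [T * n_codebook] input form listed above:

```python
import torch

wavs = torch.randn(2, 1, tokenizer.sample_rate)    # [B, 1, n_sample] placeholder audio

with torch.no_grad():
    flat_codes, _ = tokenizer.forward(wavs)        # [B, T * n_codebook]
    batched = tokenizer.detokenize(flat_codes)     # [B, n_sample]
    single = tokenizer.detokenize(flat_codes[0])   # unbatched input: [n_sample]
```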
encode(wavs)
Convert audio waveforms into codec codes.
Input:
- wavs (torch.Tensor): float tensor of shape [B, 1, n_sample]
Output:
- codes (torch.Tensor): int tensor of shape [B, T, n_codebook]
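An encoding sketch with one second of random audio standing in for real speech, again assuming the tokenizer constructed earlier:

```python
import torch

wavs = torch.randn(2, 1, tokenizer.sample_rate)    # [B, 1, n_sample]

with torch.no_grad():
    codes = tokenizer.encode(wavs)                 # [B, T, n_codebook] integer codes

# T is roughly n_sample / tokenizer.subsample (exact value is backend dependent).
```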
encode_continuous(wavs)
Convert audio waveforms into continuous codec representations.
Input:
- wavs (torch.Tensor): float tensor of shape [B, 1, n_sample]
Output:
- z (torch.Tensor): float tensor of shape [B, T, D]
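A round-trip sketch through the continuous path, exercising both encode_continuous and decode_continuous; D is whatever latent dimension the chosen codec backend uses:

```python
import torch

wavs = torch.randn(2, 1, tokenizer.sample_rate)    # [B, 1, n_sample]

with torch.no_grad():
    z = tokenizer.encode_continuous(wavs)          # [B, T, D] continuous features
    waveform = tokenizer.decode_continuous(z)      # [B, n_sample], no quantization involved
```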
forward(wavs)
Convert audio waveforms into flattened codec codes and resynthesize the audio.
Input:
- wavs (torch.Tensor): float tensor of shape [B, 1, n_sample]
Output:
- codes (torch.Tensor): int tensor of shape [B, T * n_codebook]
- resyn_audio (torch.Tensor): float tensor of shape [B, n_sample]
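A sketch of the flattened pipeline intended for speech LM data preparation, assuming dump_audio=True so that resynthesized audio is returned alongside the codes (that dependence on the flag is an assumption, not stated above):

```python
import torch

wavs = torch.randn(2, 1, tokenizer.sample_rate)    # [B, 1, n_sample]

with torch.no_grad():
    codes, resyn_audio = tokenizer.forward(wavs)   # [B, T * n_codebook], [B, n_sample]

    # The flattened codes are what a speech LM consumes;
    # detokenize() maps them back to audio.
    audio_again = tokenizer.detokenize(codes)
```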