espnet2.speechlm.core_lm.valle.ValleLM
class espnet2.speechlm.core_lm.valle.ValleLM(vocab_size: int, nq: int, share_emb: bool = True, att_unit: int = 256, head: int = 2, ar_layer: int = 4, nar_layer: int = 4, n_ctx: int = 3000)
Bases: AbsCoreLM
Initialize the Vall-E model.
- Parameters:
- vocab_size (int) – Dimension of the vocabulary.
- nq (int) – Number of codes for each token / frame, usually for speech codec.
- share_emb (bool) – If true, share the embedding and lm_head weight.
- att_unit (int) – Dimension of the Transformer attention.
- head (int) – Number of heads in Transformer attention.
- ar_layer (int) – Number of layers in AR Transformer.
- nar_layer (int) – Number of layers in NAR Transformer.
- n_ctx (int) – Maximum context length of the AR & NAR Transformers.
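A minimal configuration sketch for the constructor above. The argument names match the signature; the values are illustrative, not documented defaults, and the commented-out instantiation assumes espnet2 is installed.

```python
# Hypothetical hyperparameter set for a small Vall-E model.
# Names match the ValleLM constructor; values are illustrative only.
config = dict(
    vocab_size=1024,   # size of the joint token vocabulary
    nq=8,              # codes per frame (e.g., 8 codec quantizer levels)
    share_emb=True,    # tie embedding and lm_head weights
    att_unit=256,      # Transformer attention dimension
    head=2,            # attention heads
    ar_layer=4,        # AR Transformer depth
    nar_layer=4,       # NAR Transformer depth
    n_ctx=3000,        # maximum context length
)

# Multi-head attention requires the model dimension to split evenly
# across heads, so att_unit must be divisible by head.
assert config["att_unit"] % config["head"] == 0

# from espnet2.speechlm.core_lm.valle import ValleLM
# model = ValleLM(**config)   # requires espnet2 to be installed
```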
forward(dec_seq: Tensor, dec_seq_lengths: Tensor | None = None, enc_seq: Tensor | None = None, enc_seq_lengths: Tensor | None = None, prefix_len: Tensor | None = None) → Tuple[Tensor, Tensor, Dict]
Vall-E forward pass for training.
- Parameters:
- dec_seq (LongTensor) – Batch of decoder sequences (B, T, nq).
- dec_seq_lengths (LongTensor) – Lengths of batched decoder sequences (B,).
- enc_seq (LongTensor) – Batch of encoder sequences (B, T, nq); kept for interface consistency, may not be used.
- enc_seq_lengths (LongTensor) – Lengths of batched encoder sequences (B,); kept for interface consistency, may not be used.
- prefix_len (LongTensor) – Lengths of condition part in dec_seq (B,).
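A sketch of the (B, T, nq) layout and the role of prefix_len, using plain Python lists in place of LongTensors. The masking rule shown (train only on frames after the conditioning prefix) is the usual Vall-E setup, stated here as an assumption rather than read from the implementation.

```python
# Toy batch: B utterances, T frames, nq codec levels per frame.
B, T, nq = 2, 6, 4
dec_seq = [[[0] * nq for _ in range(T)] for _ in range(B)]   # (B, T, nq)
dec_seq_lengths = [6, 5]   # valid frames per utterance
prefix_len = [2, 3]        # frames of conditioning (prompt) per utterance

# Assumed loss convention: only the target part contributes, i.e. frames
# in [prefix_len[b], dec_seq_lengths[b]); the prefix is condition-only.
target_mask = [
    [prefix_len[b] <= t < dec_seq_lengths[b] for t in range(T)]
    for b in range(B)
]
n_target = sum(sum(row) for row in target_mask)  # → 4 + 2 = 6 target frames
```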
inference(prefix: Tensor, opts: SpeechLMInferenceOptions, enc_seq: Tensor = None, suffix: Tensor = None)
Vall-E Inference.
- Parameters:
- prefix (LongTensor) – Prefix part of dec_seq (B, T, nq).
- opts (SpeechLMInferenceOptions) – Inference options.
- enc_seq (LongTensor) – Encoder token sequence (B, T, nq).
- suffix (LongTensor) – Suffix part of dec_seq (B, T, nq), usually the target sequence for teacher-forcing.
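Vall-E inference proceeds in two stages: the AR Transformer generates the first codebook level token by token, and the NAR Transformer then predicts the remaining nq - 1 levels for all frames. The sketch below illustrates that control flow with plain ints standing in for tokens; `ar_step` and the NAR fill rule are dummy placeholders, not the model's actual decoding.

```python
# Two-stage Vall-E decoding, schematically.
nq = 4
prefix_level0 = [7, 7, 7]          # level-0 codes of the prompt

def ar_step(history):
    # Hypothetical greedy AR step; the real model samples from the
    # AR Transformer's next-token distribution.
    return (history[-1] + 1) % 100

# Stage 1 (AR): autoregressively extend the first codebook level.
generated = list(prefix_level0)
for _ in range(3):                 # generate 3 new frames
    generated.append(ar_step(generated))
level0 = generated[len(prefix_level0):]   # → [8, 9, 10]

# Stage 2 (NAR): predict levels 1..nq-1 for every generated frame in
# parallel, conditioned on the lower levels (dummy rule here).
codes = [[c] + [c + lvl for lvl in range(1, nq)] for c in level0]
# codes[t] holds nq entries: one code per quantizer level for frame t.
```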
prepare_input(dec_seq_emb, prefix_len, level)