espnet2.ps2st.espnet_model.ESPnetQwen2AudioModel
class espnet2.ps2st.espnet_model.ESPnetQwen2AudioModel(model_name: str = 'Qwen/Qwen2-Audio-7B-Instruct', vocab_size: int = 50000, token_list: Tuple[str, ...] | List[str] = (), ignore_id: int = -1, decode_config_path: str | None = None, pytest_mode: bool | None = False)
Bases: AbsESPnetModel
ESPnet model integrating Qwen2-Audio from the Hugging Face transformers library.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
collect_feats(speech: Tensor, speech_lengths: Tensor, text: Tensor, text_lengths: Tensor, **kwargs) → Dict[str, Tensor]
Collect features for statistics computation
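A plausible shape for this method is sketched below. This is a hypothetical illustration, not the actual implementation: in many ESPnet models, collect_feats simply repackages the raw speech and its lengths under standard keys for statistics computation. Plain lists stand in for torch.Tensor so the sketch stays self-contained.

```python
from typing import Dict

# Hypothetical sketch of a collect_feats implementation: return the raw
# speech and its lengths under the conventional "feats"/"feats_lengths"
# keys used for feature-statistics computation.
def collect_feats(speech, speech_lengths, text, text_lengths, **kwargs) -> Dict[str, object]:
    return {"feats": speech, "feats_lengths": speech_lengths}

out = collect_feats(
    speech=[[0.1, 0.2], [0.3, 0.4]],
    speech_lengths=[2, 2],
    text=[[1], [2]],
    text_lengths=[1, 1],
)
```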
forward(speech: Tensor, speech_lengths: Tensor, text: Tensor, text_lengths: Tensor, **kwargs) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Forward pass required by the AbsESPnetModel interface
- Returns: loss – scalar tensor (dummy value; this model is inference-only), stats – dictionary of statistics for logging, weight – batch size used for normalization
- Return type: Tuple[Tensor, Dict[str, Tensor], Tensor]
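The (loss, stats, weight) return contract can be sketched with a minimal stub. This is an illustration of the interface only, not the real model: plain floats and lists replace torch.Tensor so the example is self-contained.

```python
from typing import Dict, List, Tuple

# Minimal stand-in illustrating the AbsESPnetModel forward contract:
# a scalar loss, a stats dict for logging, and a weight used by the
# trainer to normalize the accumulated statistics.
class ForwardContractStub:
    def forward(
        self,
        speech: List[List[float]],
        speech_lengths: List[int],
        text: List[List[int]],
        text_lengths: List[int],
        **kwargs,
    ) -> Tuple[float, Dict[str, float], int]:
        batch_size = len(speech)
        loss = 0.0  # dummy loss: the wrapped model is inference-only
        stats = {"loss": loss, "batch_size": float(batch_size)}
        weight = batch_size  # normalization weight for the logged stats
        return loss, stats, weight

loss, stats, weight = ForwardContractStub().forward(
    speech=[[0.1, 0.2], [0.3, 0.4]],
    speech_lengths=[2, 2],
    text=[[1], [2]],
    text_lengths=[1, 1],
)
```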
inference(input_ids: Tensor, attention_mask: Tensor, input_features: Tensor, feature_attention_mask: Tensor, **kwargs) → str
Custom inference method using Qwen2-Audio
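A caller is expected to pass the four tensors named in the signature above, typically produced by the Qwen2-Audio processor (e.g. transformers.AutoProcessor). The helper below is hypothetical; it only assembles the expected keyword arguments, since actually running the 7B model is out of scope for a doc example.

```python
from typing import Any, Dict

# Hypothetical helper: packs the keyword arguments that
# ESPnetQwen2AudioModel.inference expects, per its signature.
# In practice each value would be a torch.Tensor from the
# Qwen2-Audio processor; any list-like stands in here.
def build_inference_inputs(
    input_ids: Any,
    attention_mask: Any,
    input_features: Any,
    feature_attention_mask: Any,
) -> Dict[str, Any]:
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "input_features": input_features,
        "feature_attention_mask": feature_attention_mask,
    }

inputs = build_inference_inputs(
    input_ids=[[1, 2, 3]],
    attention_mask=[[1, 1, 1]],
    input_features=[[0.0, 0.1]],
    feature_attention_mask=[[1, 1]],
)
# Usage sketch: text = model.inference(**inputs)  # returns a str
```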
