---
license: apache-2.0
datasets:
- HuggingFaceFV/finevideo
- lmms-lab/LLaVA-Video-178K
- ShareGPT4Video/ShareGPT4Video
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen2-7B
- lmms-lab/llava-onevision-qwen2-7b-ov
- openai/whisper-large-v3
pipeline_tag: video-text-to-text
library_name: transformers
---

# video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

Official model release of [video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models](https://github.com/bytedance/video-SALMONN-2)

[GitHub Link](https://github.com/bytedance/video-SALMONN-2)

[Paper Link](https://arxiv.org/abs/2506.15220)

## Results

[Image: results figure]