---
base_model:
- Qwen/Qwen2-7B
- lmms-lab/llava-onevision-qwen2-7b-ov
- openai/whisper-large-v3
datasets:
- HuggingFaceFV/finevideo
- lmms-lab/LLaVA-Video-178K
- ShareGPT4Video/ShareGPT4Video
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
pipeline_tag: video-text-to-text
paper: https://huggingface.co/papers/2506.15220
---

# video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

This repository contains the official model release for **video-SALMONN 2**, an advanced audio-visual large language model (LLM) designed for enhanced video (with paired audio) captioning.

[![Paper](https://img.shields.io/badge/Paper-PDF-green)](https://huggingface.co/papers/2506.15220)
[![Code](https://img.shields.io/badge/Code-GitHub-blue)](https://github.com/bytedance/video-SALMONN-2)
[![Project Page](https://img.shields.io/badge/Project_Page-Website-orange)](https://video-salmonn-2.github.io)

## Abstract

Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2's captioning accuracy, reducing the captioning error rates by 28%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining highly competitive performance with the state of the art on widely used video question-answering benchmarks among models of similar size.

## 🔥 News

- **2025-07-08**: We release the 7B version of video-SALMONN 2.
- **2025-06-18**: We release the code of video-SALMONN 2.

## ⚡️ Future Plans

- Release the high-frame-rate version and the 72B version.

## 👀 Team

**Team Tsinghua**: Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Chao Zhang

**Team ByteDance**: Wei Li, Zejun Ma

## 🌈 How to Use

### How to train a model

1. Prepare the dataset following `scripts/example_sft.json` and `scripts/example_dpo.json`.
2. Download the LLaVA-OneVision model from [Hugging Face](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov).
3. Modify the parameters in `scripts/train_sft.sh` and `scripts/train_dpo.sh`.
4. Run `bash scripts/train_sft.sh` or `bash scripts/train_dpo.sh`.
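As a conceptual aid for the DPO stage launched by `scripts/train_dpo.sh`, the multi-round DPO (MrDPO) procedure described in the abstract (refresh the DPO reference model each round, train a freshly re-initialised LoRA adapter on caption preference pairs with ground-truth caption guidance, then merge the adapter into the backbone) can be sketched as follows. Every function in this sketch is a hypothetical stub used only to illustrate the control flow; none of these names come from this repository, and the actual training entry point is `scripts/train_dpo.sh`.

```python
# Conceptual sketch of the multi-round DPO (MrDPO) loop described in the paper.
# All functions below are hypothetical stubs for illustration only; they are not
# part of the video-SALMONN 2 codebase.


def clone_frozen_reference(backbone):
    """Hypothetical stub: snapshot the current backbone as the frozen DPO reference."""
    return dict(backbone)


def attach_new_lora(backbone):
    """Hypothetical stub: re-initialise a fresh LoRA adapter on top of the backbone."""
    return {"backbone": backbone, "lora": "freshly initialised"}


def dpo_step(policy, reference, batch, gt_caption_guidance=True):
    """Hypothetical stub: one DPO update on a (preferred, rejected) caption pair,
    optionally stabilised with guidance from the ground-truth caption."""
    pass  # the real update optimises the DPO loss against the frozen reference


def merge_lora_into_backbone(policy):
    """Hypothetical stub: merge the trained LoRA weights into the backbone."""
    return policy["backbone"]


def mrdpo_train(backbone, preference_data, num_rounds=3, steps_per_round=1000):
    """Sketch of the MrDPO control flow: reference refresh, LoRA re-init, merge."""
    for _ in range(num_rounds):
        reference = clone_frozen_reference(backbone)  # refresh the DPO reference each round
        policy = attach_new_lora(backbone)            # LoRA acts as a proxy for this round's updates
        for step in range(steps_per_round):
            batch = preference_data[step % len(preference_data)]
            dpo_step(policy, reference, batch, gt_caption_guidance=True)
        backbone = merge_lora_into_backbone(policy)   # merge, then re-initialise next round
    return backbone
```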
### How to evaluate a checkpoint

1. Prepare the dataset following `scripts/example_sft.json`.
2. Modify the parameters in `scripts/eval.sh`.
3. Run `bash scripts/eval.sh`.

### Quick Inference Example

You can easily load `video-SALMONN 2` using the `transformers` library. Below is a quick example for inference. For more advanced usage, including training and evaluation scripts, please refer to the [official GitHub repository](https://github.com/bytedance/video-SALMONN-2).

```python
import numpy as np
import torch
from transformers import AutoProcessor, LlavaAVQwenForCausalLM
# qwen_vl_utils is provided by the project's repository
from qwen_vl_utils import process_vision_info
from decord import VideoReader, cpu  # requires decord: pip install decord

# Load model and processor
model_name = "tsinghua-ee/video-SALMONN-2"  # replace with your model path if local
model = LlavaAVQwenForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # use torch.float16 on GPUs without bfloat16 support
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()  # move the model to GPU
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Example video path (replace with your own video file)
video_path = "./examples/video1.mp4"

# Load video frames, sampling at 1 FPS
vr = VideoReader(video_path, ctx=cpu(0))
fps = float(vr.get_avg_fps())
frame_indices = np.arange(0, len(vr), round(fps))
video_frames = [vr[int(idx)].asnumpy() for idx in frame_indices]

# Prepare messages for the model
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": video_frames},
            {"type": "text", "text": "Describe this video in detail, including sound information."},
        ],
    }
]

# Process inputs
text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text_input],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}  # move inputs to GPU

# Generate response
generated_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens and print the output
output_text = processor.batch_decode(
    generated_ids[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0].strip()
print("Generated response:")
print(output_text)
```

## ✨ Citation

If you find video-SALMONN 2 useful, please cite the paper:

```bibtex
@article{tang2025video,
  title={{video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models}},
  author={Changli Tang and Yixuan Li and Yudong Yang and Jimin Zhuang and Guangzhi Sun and Wei Li and Zejun Ma and Chao Zhang},
  journal={arXiv preprint arXiv:2506.15220},
  year={2025},
}
```