---
base_model:
- Qwen/Qwen2-7B
- lmms-lab/llava-onevision-qwen2-7b-ov
- openai/whisper-large-v3
datasets:
- HuggingFaceFV/finevideo
- lmms-lab/LLaVA-Video-178K
- ShareGPT4Video/ShareGPT4Video
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
pipeline_tag: video-text-to-text
paper: https://huggingface.co/papers/2506.15220
---

# video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

This repository contains the official model release for **video-SALMONN 2**, an advanced audio-visual large language model (LLM) designed for enhanced video (with paired audio) captioning.

[![Paper](https://img.shields.io/badge/Paper-PDF-green)](https://huggingface.co/papers/2506.15220) [![Code](https://img.shields.io/badge/Code-GitHub-blue)](https://github.com/bytedance/video-SALMONN-2) [![Project Page](https://img.shields.io/badge/Project_Page-Website-orange)](https://video-salmonn-2.github.io)

## Abstract

Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2's captioning accuracy, reducing the captioning error rates by 28%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining highly competitive performance to the state-of-the-art on widely used video question-answering benchmarks among models of similar size.
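The MrDPO procedure described above can be read as a simple outer loop over DPO rounds. The Python-style pseudocode below is only a conceptual illustration of that loop, not the released training code (see `scripts/train_dpo.sh` in the GitHub repository for that); every helper function name in it (`merge_lora`, `reinit_lora`, `build_preference_pairs`, `dpo_train`) is a hypothetical placeholder.

```python
# Conceptual sketch of multi-round DPO (MrDPO) as described in the abstract.
# NOT the released implementation; all helper functions are hypothetical
# placeholders standing in for the actual training code.
def mrdpo(policy, num_rounds, steps_per_round=1_000):
    for _ in range(num_rounds):
        # Merge the current LoRA into the base weights; the merged model
        # serves as the frozen DPO reference for this round.
        reference = merge_lora(policy)
        # Re-initialise a fresh LoRA module on top of the merged weights,
        # acting as a proxy for this round's parameter update.
        policy = reinit_lora(reference)
        # Preference pairs are scored with the proposed completeness and
        # accuracy metrics; ground-truth captions provide extra guidance
        # to stabilise training (see the paper for details).
        pairs = build_preference_pairs(policy)
        policy = dpo_train(policy, reference, pairs, steps=steps_per_round)
    return merge_lora(policy)
```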

## 🔥 News

-   **2025-07-08**: We release the 7B version of video-SALMONN 2.
-   **2025-06-18**: We release the code of video-SALMONN 2.

## ⚡️ Future Plans

-   Release the high-frame-rate version and the 72B version.

## 👀 Team

**Team Tsinghua**: Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Chao Zhang

**Team ByteDance**: Wei Li, Zejun Ma

## 🌈 How to Use

### How to train a model

1.  Prepare the dataset following `scripts/example_sft.json` and `scripts/example_dpo.json`.
2.  Download the LLaVA-OneVision model from [Hugging Face](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) (a download sketch follows this list).
3.  Modify the parameters in `scripts/train_sft.sh` and `scripts/train_dpo.sh`.
4.  Run `bash scripts/train_sft.sh` or `bash scripts/train_dpo.sh`.
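
For step 2, one option is to fetch the LLaVA-OneVision weights programmatically with `huggingface_hub` before editing the training scripts. This is a minimal sketch; the `local_dir` path is only an illustrative choice, and the paths in `scripts/train_sft.sh` / `scripts/train_dpo.sh` should be pointed at wherever you place the checkpoint.

```python
# Minimal sketch: download the LLaVA-OneVision base checkpoint (step 2).
# The local_dir below is an illustrative choice, not a required layout.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lmms-lab/llava-onevision-qwen2-7b-ov",
    local_dir="./checkpoints/llava-onevision-qwen2-7b-ov",
)
```

After adjusting the script parameters (step 3), launch training with `bash scripts/train_sft.sh` or `bash scripts/train_dpo.sh` as in step 4.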

### How to evaluate a checkpoint

1.  Prepare the dataset following `scripts/example_sft.json`.
2.  Modify the parameters in `scripts/eval.sh`.
3.  Run `bash scripts/eval.sh`.

### Quick Inference Example

You can easily load `video-SALMONN 2` using the `transformers` library. Below is a quick example for inference. For more advanced usage, including training and evaluation scripts, please refer to the [official GitHub repository](https://github.com/bytedance/video-SALMONN-2).

```python
import numpy as np
import torch
from transformers import AutoProcessor, LlavaAVQwenForCausalLM
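# Note: LlavaAVQwenForCausalLM may not be available in a stock transformers
# install; it appears to be provided through the project's codebase, so if
# this import fails, set up the environment from the official repository.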
# Ensure qwen_vl_utils is available, usually from the project's repository
from qwen_vl_utils import process_vision_info 
from decord import VideoReader, cpu # Requires decord: pip install decord

# Load model and processor
model_name = "tsinghua-ee/video-SALMONN-2" # Replace with your model path if local
model = LlavaAVQwenForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16, # Use torch.float16 for GPUs that don't support bfloat16
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval().cuda() # Move model to GPU
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Example video path (replace with your video)
video_path = "./examples/video1.mp4" # You'll need to provide an example video file

# Load video frames
vr = VideoReader(video_path, ctx=cpu(0))
fps = float(vr.get_avg_fps())
# Sample frames at 1 FPS
frame_indices = np.arange(0, len(vr), round(fps))
video_frames = [vr[int(idx)].asnumpy() for idx in frame_indices]

# Prepare messages for the model
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_frames,
            },
            {"type": "text", "text": "Describe this video in detail, including sound information."},
        ],
    }
]

# Process inputs
text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages) # Ensure process_vision_info is available from qwen_vl_utils

inputs = processor(
    text=[text_input],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = {k: v.to(model.device) for k, v in inputs.items()} # Move inputs to GPU

# Generate response
generated_ids = model.generate(**inputs, max_new_tokens=512)

# Decode and print output
output_text = processor.batch_decode(
    generated_ids[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0].strip()

print("Generated response:")
print(output_text)
```
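
The example above samples one frame per second, which can produce a long visual sequence for lengthy videos. The sketch below shows one way to cap the number of sampled frames; it reuses the `vr` and `fps` objects from the example, and `max_frames` is an arbitrary illustrative value, not a limit prescribed by the model.

```python
# Optional: cap the number of sampled frames to bound memory use.
# `max_frames` is an illustrative value; tune it to your GPU memory and
# to how much temporal detail your prompt needs.
max_frames = 120
frame_indices = np.arange(0, len(vr), round(fps))
if len(frame_indices) > max_frames:
    # Fall back to uniformly spaced frames across the whole video.
    frame_indices = np.linspace(0, len(vr) - 1, num=max_frames).astype(int)
video_frames = [vr[int(idx)].asnumpy() for idx in frame_indices]
```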

## ✨ Citation
If you find video-SALMONN 2 useful, please cite the paper:

```bibtex
@article{tang2025video,
    title={{video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models}}, 
    author={Changli Tang and Yixuan Li and Yudong Yang and Jimin Zhuang and Guangzhi Sun and Wei Li and Zejun Ma and Chao Zhang},
    journal={arXiv preprint arXiv:2506.15220},
    year={2025},
}
```