---
base_model:
  - Qwen/Qwen2-7B
  - lmms-lab/llava-onevision-qwen2-7b-ov
  - openai/whisper-large-v3
datasets:
  - HuggingFaceFV/finevideo
  - lmms-lab/LLaVA-Video-178K
  - ShareGPT4Video/ShareGPT4Video
language:
  - en
library_name: transformers
license: apache-2.0
metrics:
  - accuracy
pipeline_tag: video-text-to-text
paper: https://huggingface.co/papers/2506.15220
---

# video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

This repository contains the official model release for video-SALMONN 2, an advanced audio-visual large language model (LLM) designed for enhanced video (with paired audio) captioning.

[Paper](https://huggingface.co/papers/2506.15220) | Code | Project Page

## Abstract

Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2's captioning accuracy, reducing the captioning error rates by 28%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining highly competitive performance to the state-of-the-art on widely used video question-answering benchmarks among models of similar size.
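To make the MrDPO recipe above concrete, here is a hypothetical sketch of the per-round "merge and re-initialise the LoRA module" cycle using the `peft` library. It is not the repository's training code: the small Qwen2 stand-in model, the LoRA configuration, and the commented-out `run_dpo_round` call are illustrative placeholders, and the DPO objective with caption guidance is elided.

```python
# Conceptual sketch of the MrDPO round structure described in the abstract.
# NOT the repository's training code; the model name, LoRA config, and the
# commented-out run_dpo_round call are illustrative placeholders.
import copy

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B", torch_dtype=torch.bfloat16)
lora_cfg = LoraConfig(r=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")

for round_idx in range(3):                    # each round is ~1,000 DPO steps in the paper
    policy = get_peft_model(base, lora_cfg)   # fresh LoRA adapter for this round
    reference = copy.deepcopy(policy).eval()  # periodically refreshed DPO reference model
    # run_dpo_round(policy, reference, ...)   # placeholder: DPO loss plus ground-truth caption guidance
    base = policy.merge_and_unload()          # merge LoRA into the base as a proxy parameter update
```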

## 🔥 News

- 2025-07-08: We release the 7B version of video-SALMONN 2.
- 2025-06-18: We release the code of video-SALMONN 2.

## ⚡️ Future Plans

- Release the high-frame-rate and 72B versions.

## 👀 Team

Team Tsinghua: Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Chao Zhang

Team ByteDance: Wei Li, Zejun Ma

## 🌈 How to Use

### How to train a model

1. Prepare the dataset following `scripts/example_sft.json` and `scripts/example_dpo.json` (a small sketch for inspecting these templates follows this list).
2. Download the LLaVA-OneVision model from Hugging Face.
3. Modify the parameters in `scripts/train_sft.sh` and `scripts/train_dpo.sh`.
4. Run `bash scripts/train_sft.sh` or `bash scripts/train_dpo.sh`.
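As a starting point, here is a minimal sketch for inspecting the provided templates before assembling your own dataset. The file paths come from the steps above; the exact field names are whatever the templates define, and the snippet assumes each file is a JSON list of entries.

```python
# Minimal sketch: load the bundled SFT/DPO templates and mirror their structure
# when building your own training data. Assumes each file is a JSON list of entries.
import json

with open("scripts/example_sft.json") as f:
    sft_examples = json.load(f)
with open("scripts/example_dpo.json") as f:
    dpo_examples = json.load(f)

print(f"{len(sft_examples)} SFT entries, {len(dpo_examples)} DPO entries")
print(json.dumps(sft_examples[0], indent=2))  # copy this structure for your own data
```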

### How to evaluate a checkpoint

1. Prepare the dataset following `scripts/example_sft.json`.
2. Modify the parameters in `scripts/eval.sh`.
3. Run `bash scripts/eval.sh`.

### Quick Inference Example

You can easily load video-SALMONN 2 using the `transformers` library. Below is a quick example for inference. For more advanced usage, including training and evaluation scripts, please refer to the official GitHub repository.

```python
import numpy as np
import torch
from transformers import AutoProcessor, LlavaAVQwenForCausalLM
# Ensure qwen_vl_utils is available, usually from the project's repository
from qwen_vl_utils import process_vision_info
from decord import VideoReader, cpu  # Requires decord: pip install decord

# Load model and processor
model_name = "tsinghua-ee/video-SALMONN-2"  # Replace with your model path if local
model = LlavaAVQwenForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # Use torch.float16 for GPUs that don't support bfloat16
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval().cuda()  # Move model to GPU
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Example video path (replace with your video)
video_path = "./examples/video1.mp4"  # You'll need to provide an example video file

# Load video frames
vr = VideoReader(video_path, ctx=cpu(0))
fps = float(vr.get_avg_fps())
# Sample frames at 1 FPS
frame_indices = np.array([i for i in range(0, len(vr), round(fps))])
video_frames = [vr[int(idx)].asnumpy() for idx in frame_indices]

# Prepare messages for the model
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_frames,
            },
            {"type": "text", "text": "Describe this video in detail, including sound information."},
        ],
    }
]

# Process inputs
text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)  # Ensure process_vision_info is available from qwen_vl_utils

inputs = processor(
    text=[text_input],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}  # Move inputs to GPU

# Generate response
generated_ids = model.generate(**inputs, max_new_tokens=512)

# Decode and print output
output_text = processor.batch_decode(
    generated_ids[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0].strip()

print("Generated response:")
print(output_text)
```

## ✨ Citation

If you find video-SALMONN 2 useful, please cite the paper:

```bibtex
@article{tang2025video,
    title={{video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models}},
    author={Changli Tang and Yixuan Li and Yudong Yang and Jimin Zhuang and Guangzhi Sun and Wei Li and Zejun Ma and Chao Zhang},
    journal={arXiv preprint arXiv:2506.15220},
    year={2025},
}
```