---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
---

# TBAC-VLR1-3B-preview

## Overview

This is a multimodal language model fine-tuned by **Tencent PCG Basic Algorithm Center**. Starting from Qwen2.5-VL-3B-Instruct, TBAC-VLR1-3B-preview is trained with Group Relative Policy Optimization (GRPO) to improve multimodal reasoning, achieving **state-of-the-art** results on several multimodal reasoning benchmarks among models of the same size.

## Performance

| Model | **Average** | **MathVista** | **MathVision** | **MathVerse** | **DynaMath** | **WeMath** | **LogicVista** |
| :-----------------------: | :---------: | :-----------: | :------------: | :-----------: | :----------: | :--------: | :------------: |
| Qwen2-VL-2B | 20.5 | 48.0 | 16.1 | 17.5 | 3.8 | 10.8 | 26.6 |
| InternVL2.5-2B | 21.2 | 51.1 | 14.0 | 22.3 | 4.4 | 8.0 | 27.3 |
| InternVL3-2B | 29.1 | 57.6 | 20.2 | 24.5 | 14.8 | 22.9 | 40.3 |
| Qwen2.5-VL-3B | 31.8 | 61.2 | 21.9 | 31.2 | 13.2 | 22.9 | 40.3 |
| VLM-R1-3B-Math-0305 | 33.4 | 62.7 | 21.9 | 32.2 | 13.0 | 30.0 | 40.5 |
| Taichu-VLR-3B | 33.6 | 64.9 | 23.1 | 32.1 | 12.6 | 30.4 | 38.7 |
| VLAA-Thinker-Qwen2.5VL-3B | 35.4 | 61.0 | 24.4 | 36.4 | 18.2 | 33.8 | 38.5 |
| **TBAC-VLR1-3B-preview** | **35.7** | 64.8 | 25.0 | 33.2 | 17.7 | 32.4 | 40.8 |

![Performance](./assets/performance.png)

Results for the compared models are taken from https://opencompass.org.cn. Results for our model are self-reported and were obtained by running each benchmark offline.

## Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "TencentBAC/TBAC-VLR1-3B-preview", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("TencentBAC/TBAC-VLR1-3B-preview")

# Placeholders: replace with your own image and question.
image_path = "path/to/image.png"
query = "Your question about the image."

messages = [
    {
        "role": "system",
        # System prompt kept verbatim.
        "content": (
            "You are a helpful assistant. "
            "The user asks a question, and you solve it. You need first think about "
            "the reasoning process in the mind and then provides the user with the "
            "answer. The answer are enclosed within \\boxed{} tags i.e., reasoning "
            "process here \\boxed{ answer here }."
        ),
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": query},
        ],
    },
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to wherever device_map="auto" placed the model.
inputs = inputs.to(model.device)

# Inference: generate the output (raise max_new_tokens if the reasoning is cut off)
generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Strip the prompt tokens so only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

## Citation

If you find our model useful in your research, please consider giving it a ❤️ and citing it. Thanks!

```
@misc{Xu2025tbacvlr1,
  title={TBAC-VLR1-3B-preview},
  author={Junzhe Xu and Yuyang yin},
  url={https://huggingface.co/TencentBAC/TBAC-VLR1-3B-preview},
  year={2025},
}
```

---

**About**

Created by the Tencent PCG Basic Algorithm Center. All rights reserved.
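
## Extracting the boxed answer

The system prompt in the Usage section asks the model to wrap its final answer in `\boxed{...}`. The following is a minimal sketch of how that answer might be pulled out of the decoded text. The helper name `extract_boxed_answer` is hypothetical and not part of the released code, and the choice to take the last `\boxed{...}` occurrence is an assumption about how the model formats its output.

```python
from typing import Optional


def extract_boxed_answer(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, or None.

    Hypothetical helper: braces are matched by depth so that nested
    LaTeX such as \\boxed{\\frac{1}{2}} is returned whole.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # the model did not emit a boxed answer

    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Closing brace of \boxed{...} itself: stop here.
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never balanced (e.g. output was truncated)
```

With the Usage snippet, this could be applied as `extract_boxed_answer(output_text[0])`. A `None` result may simply mean that generation stopped before the answer was written, in which case a larger `max_new_tokens` is worth trying.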