Qwen2.5-VL-7B-Cognition-Full-SFT
This repo contains the Qwen2.5-VL 7B Instruct weights after supervised fine-tuning (SFT) on the CogIP-Bench dataset. It is the model used to demonstrate the effectiveness of cognition alignment with Qwen-Image. For more details, see the GitHub repo.
Figure: Qualitative comparison of images generated by the Qwen-Image pipeline using different LLM backbones with the same prompt, showing the effect of pretraining versus supervised fine-tuning (SFT) on image cognition properties. For each image pair, left: base model; right: SFT model. Generation prompts are shown under each pair. Images generated with our SFT MLLM backbone better reflect the cognitive cues embedded in the prompts.
Quick load example
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "foolen/qwen2.5-vl-7b-cognition-full-sft"

# Qwen2.5-VL is a vision-language model, so it needs the multimodal
# processor (tokenizer + image processor) and the conditional-generation
# class rather than AutoTokenizer/AutoModelForCausalLM.
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",  # or pass load_in_4bit=True with bitsandbytes installed
)
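After loading, inference follows the standard Qwen2.5-VL chat flow: build a multimodal message list, render it with the processor's chat template, and call `generate`. A minimal sketch is below; the image URL and prompt are placeholder assumptions (not from this repo), the official `qwen_vl_utils.process_vision_info` helper requires the separate `qwen-vl-utils` package, and the `RUN_INFERENCE` guard keeps the snippet importable without downloading the 7B weights.

```python
MODEL_ID = "foolen/qwen2.5-vl-7b-cognition-full-sft"

# Standard Qwen2.5-VL chat format: a list of role/content turns whose
# content mixes typed items ("image", "text"). URL and prompt here are
# placeholders for illustration only.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/photo.jpg"},
            {"type": "text", "text": "Describe the cognitive cues in this scene."},
        ],
    }
]

RUN_INFERENCE = False  # flip to True once the weights are available

if RUN_INFERENCE:
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )

    # Render the chat template to a prompt string, extract the images,
    # then tokenize everything together.
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=128)
    print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

The guard also makes the message-building step reusable on its own, e.g. for batching prompts before a single model load.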
Model tree for foolen/qwen2.5-vl-7b-cognition-full-sft
Base model: Qwen/Qwen2.5-VL-7B-Instruct