Model Card for Kvasir-VQA-x1 Fine-Tuned Models
Fine-tuned vision–language models for Visual Question Answering (VQA) in gastrointestinal (GI) endoscopy, trained on the Kvasir-VQA-x1 benchmark.
Overview
These models extend strong multimodal backbones (Qwen2.5-VL, Qwen2.5-VL-Transf., and MedGemma) using parameter-efficient LoRA fine-tuning on clinically validated image–question–answer pairs from Kvasir-VQA-x1.
They are designed to generate concise, clinically accurate responses to natural-language questions about endoscopic findings, instruments, and anatomical landmarks.
Key Resources
- Dataset: SimulaMet/Kvasir-VQA-x1
- ArXiv: arXiv:2506.09958
- GitHub: Simula/Kvasir-VQA-x1
- Colab Demo: Usage Notebook
- Published in: Data Engineering in Medical Imaging (DEMI), MICCAI 2025
- Springer Chapter: SpringerLink DOI:10.1007/978-3-032-08009-7_6
Model Summary
| Model | Base Model | Hugging Face | Training Logs (W&B) |
|---|---|---|---|
| Qwen2.5-VL-KvasirVQA-x1-ft | Qwen2.5-VL-7B-Instruct | SimulaMet/Qwen2.5-VL-KvasirVQA-x1-ft | W&B Run 7mk4gz8s |
| Qwen2.5-VL-Transf-KvasirVQA-x1-ft | Qwen2.5-VL-7B-Transf. | SimulaMet/Qwen2.5-VL-Transf-KvasirVQA-x1-ft | W&B Run megwnbz6 |
| MedGemma-KvasirVQA-x1-ft | MedGemma-4B-IT | SimulaMet/MedGemma-KvasirVQA-x1-ft | W&B Run 7mk4gz8s |
Training Configuration
| Attribute | Specification |
|---|---|
| GPUs | 4–8 × A100 (80 GB) |
| Precision | bfloat16 (DeepSpeed ZeRO-2) |
| Frameworks | Transformers + Swift + PEFT |
| Optimizer | Fused AdamW |
| Scheduler | Linear / Cosine (model-specific) |
| Effective Batch Size | 36 (MedGemma) / 32 (Qwen) |
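The effective batch size in the table is the product of the per-device batch size, the number of GPUs, and the gradient accumulation steps. The card does not report the per-device batch size or the accumulation steps, so the settings in this sketch are hypothetical combinations chosen only because they reproduce the reported totals of 36 and 32.

```python
def effective_batch_size(per_device: int, num_gpus: int, grad_accum_steps: int) -> int:
    """Number of samples that contribute to a single optimizer step."""
    return per_device * num_gpus * grad_accum_steps


# Hypothetical settings (not taken from the model card) that match the reported totals.
medgemma_total = effective_batch_size(per_device=3, num_gpus=4, grad_accum_steps=3)
qwen_total = effective_batch_size(per_device=2, num_gpus=8, grad_accum_steps=2)

print(medgemma_total, qwen_total)  # 36 32
```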
Evaluation Highlights
| Model | Params | Epochs | LR | LoRA (r/α) | Time | Eval Acc. | Eval Loss |
|---|---|---|---|---|---|---|---|
| MedGemma-Transf. | 4.3 B | 4 | 2e-5 | 16 / 64 | 27 h | 84.97 % | 0.4111 |
| Qwen2.5-VL-Transf. | 8.3 B | 4 | 2e-5 | 16 / 64 | 30.9 h | 85.91 % | 0.3883 |
| Qwen2.5-VL | 8.3 B | 3 | 2e-5 | 16 / 64 | 23 h | 85.78 % | 0.3906 |
(Evaluated on a 1% held-out subset of the training data.)
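For the LoRA settings in the table (r = 16, α = 64), the standard LoRA formulation scales the low-rank update by α / r, and each adapted weight matrix of shape d_out × d_in adds r · (d_in + d_out) trainable parameters. The sketch below works through that arithmetic. The 4096 × 4096 layer shape is a hypothetical example, not a dimension reported in this card, and variants such as rank-stabilized LoRA use a different scaling.

```python
def lora_scaling(r: int, alpha: int) -> float:
    """Scaling applied to the low-rank update in standard LoRA."""
    return alpha / r


def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


r, alpha = 16, 64      # values from the table above
d_in = d_out = 4096    # hypothetical layer shape, for illustration only

added = lora_trainable_params(d_in, d_out, r)
full = d_in * d_out

print(lora_scaling(r, alpha))          # 4.0
print(added, f"{added / full:.2%}")    # 131072 0.78%
```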
Evaluation Protocol
Traditional n-gram metrics (BLEU, ROUGE) fail to capture clinical correctness, so these models are evaluated using an LLM-based structured adjudicator (Qwen/Qwen3-30B-A3B). Each model prediction is graded per clinical aspect (polyp_type, instrument_presence, etc.) with binary scores and textual justifications:
{
  "eval_json": {
    "polyp_type": {"score": 1, "reason": "Model correctly identified a sessile polyp."},
    "instrument_presence": {"score": 0, "reason": "Failed to mention visible biopsy forceps."}
  }
}
This yields fine-grained, reproducible, category-wise accuracy metrics that reflect clinical correctness rather than n-gram overlap. See the paper for details.
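As a rough illustration of how such adjudicator outputs could be turned into category-wise accuracy, the sketch below averages the binary scores per clinical aspect. It assumes each record follows the `eval_json` shape shown above. The function name `category_accuracy` and the second record are made up for the example, and the paper's actual aggregation may differ.

```python
from collections import defaultdict


def category_accuracy(records):
    """Average the binary adjudicator scores per clinical aspect."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        for aspect, verdict in record["eval_json"].items():
            totals[aspect] += 1
            correct[aspect] += int(verdict["score"])
    return {aspect: correct[aspect] / totals[aspect] for aspect in totals}


# Toy input: the first record mirrors the example above, the second is invented.
records = [
    {"eval_json": {
        "polyp_type": {"score": 1, "reason": "Model correctly identified a sessile polyp."},
        "instrument_presence": {"score": 0, "reason": "Failed to mention visible biopsy forceps."},
    }},
    {"eval_json": {
        "polyp_type": {"score": 1, "reason": "(invented)"},
        "instrument_presence": {"score": 1, "reason": "(invented)"},
    }},
]

print(category_accuracy(records))
# {'polyp_type': 1.0, 'instrument_presence': 0.5}
```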
Usage Example
# Notebook cell; drop the leading "!" when installing from a shell.
!pip install ms-swift==3.8.0 bitsandbytes qwen_vl_utils==0.0.11

import torch
from swift.llm import PtEngine, RequestConfig, InferRequest
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization to reduce GPU memory use at inference time.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)

# Load the base model and attach the fine-tuned LoRA adapter.
engine = PtEngine(
    adapters=["SimulaMet/Qwen2.5-VL-KvasirVQA-x1-ft"],  # or use other fine-tuned model IDs
    model_id_or_path="Qwen/Qwen2.5-VL-7B-Instruct",  # or use other base model IDs
    quantization_config=bnb_config,
    attn_impl="sdpa",
    use_hf=True,
)

# Sampling settings for generation.
req_cfg = RequestConfig(max_tokens=512, temperature=0.3, top_k=20, top_p=0.7, repetition_penalty=1.05)

# One request: an endoscopy image plus a natural-language question.
infer_requests = [
    InferRequest(messages=[{
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1/resolve/main/images/clb0kvxvm90y4074yf50vf5nq.jpg"},
            {"type": "text", "text": "What is shown in the image?"}
        ],
    }])
]

# Run inference and print the answer for the first request.
resp = engine.infer(infer_requests, req_cfg)
print(resp[0].choices[0].message.content)
See detailed examples in the Colab usage notebook.
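To ask several questions about the same image, the message payload from the usage example can be generated by a small helper. This is only a sketch: `build_messages` is a hypothetical helper, not part of ms-swift, and the second question is illustrative. It reproduces the `messages` structure used above without loading any model.

```python
def build_messages(image_url: str, question: str) -> list[dict]:
    """Build one user turn with an image and a text question."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": question},
        ],
    }]


IMAGE_URL = (
    "https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1/"
    "resolve/main/images/clb0kvxvm90y4074yf50vf5nq.jpg"
)
questions = [
    "What is shown in the image?",
    "Are any instruments visible?",  # illustrative question, not from the card
]

batch = [build_messages(IMAGE_URL, q) for q in questions]
print(len(batch))  # 2
```

With the engine from the usage example, each entry could then be wrapped as `InferRequest(messages=m)` and the resulting list passed to `engine.infer(...)` together with the request config.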
License
See the license of each base model.
Citation
If you use these models or the dataset, please cite:
@incollection{Gautam2025Oct,
  author    = {Gautam, Sushant and Riegler, Michael and Halvorsen, P{\aa}l},
  title     = {{Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy}},
  booktitle = {{Data Engineering in Medical Imaging}},
  journal   = {SpringerLink},
  pages     = {53--63},
  year      = {2025},
  month     = oct,
  isbn      = {978-3-032-08009-7},
  publisher = {Springer},
  address   = {Cham, Switzerland},
  doi       = {10.1007/978-3-032-08009-7_6}
}