# 🧠 Model Card for Kvasir-VQA-x1 Fine-Tuned Models

Fine-tuned vision–language models for Visual Question Answering (VQA) in gastrointestinal (GI) endoscopy, trained on the Kvasir-VQA-x1 benchmark.

## 🧩 Overview

These models extend strong multimodal backbones (Qwen2.5-VL, Qwen2.5-VL-Transf., and MedGemma) using parameter-efficient LoRA fine-tuning on clinically validated image–question–answer pairs from Kvasir-VQA-x1.
They are designed to generate concise, clinically accurate responses to natural-language questions about endoscopic findings, instruments, and anatomical landmarks.
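
For orientation, a minimal sketch of what such a LoRA adapter configuration looks like in Hugging Face PEFT. Only r = 16 and α = 64 come from the evaluation table further down; the dropout value and target modules below are assumptions, not the exact training recipe:

```python
from peft import LoraConfig

# Illustrative LoRA adapter settings (a sketch, not the exact recipe used here).
# r=16 and lora_alpha=64 match the "LoRA (r/alpha)" column in the evaluation table.
lora_config = LoraConfig(
    r=16,                     # low-rank dimension
    lora_alpha=64,            # scaling factor
    lora_dropout=0.05,        # assumed value; not stated in this card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
```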

## 🔗 Key Resources

- **Dataset:** [SimulaMet/Kvasir-VQA-x1](https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1)
- **Paper:** [Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy](https://doi.org/10.1007/978-3-032-08009-7_6)
- **Training logs:** per-model W&B runs listed in the table below

## 📊 Model Summary

| Model | Base Model | Hugging Face | Training Logs (W&B) |
| --- | --- | --- | --- |
| Qwen2.5-VL-KvasirVQA-x1-ft | Qwen2.5-VL-7B-Instruct | 🔗 [SimulaMet/Qwen2.5-VL-KvasirVQA-x1-ft](https://huggingface.co/SimulaMet/Qwen2.5-VL-KvasirVQA-x1-ft) | Run `7mk4gz8s` |
| Qwen2.5-VL-Transf-KvasirVQA-x1-ft | Qwen2.5-VL-7B-Transf. | 🔗 [SimulaMet/Qwen2.5-VL-Transf-KvasirVQA-x1-ft](https://huggingface.co/SimulaMet/Qwen2.5-VL-Transf-KvasirVQA-x1-ft) | Run `megwnbz6` |
| MedGemma-KvasirVQA-x1-ft | MedGemma-4B-IT | 🔗 [SimulaMet/MedGemma-KvasirVQA-x1-ft](https://huggingface.co/SimulaMet/MedGemma-KvasirVQA-x1-ft) | Run `7mk4gz8s` |

โš™๏ธ Training Configuration

| Attribute | Specification |
| --- | --- |
| GPUs | 4–8 × A100 (80 GB) |
| Precision | bfloat16 (DeepSpeed ZeRO-2) |
| Frameworks | Transformers + Swift + PEFT |
| Optimizer | Fused AdamW |
| Scheduler | Linear / Cosine (model-specific) |
| Effective Batch Size | 36 (MedGemma) / 32 (Qwen) |
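
As a rough illustration, the table maps onto Hugging Face `TrainingArguments` roughly as follows. This is a sketch, not the exact ms-swift recipe; the output directory, DeepSpeed config file, and per-device batch split are assumptions:

```python
from transformers import TrainingArguments

# Illustrative mapping of the table above onto TrainingArguments.
args = TrainingArguments(
    output_dir="kvasir-vqa-x1-ft",   # hypothetical output path
    bf16=True,                        # bfloat16 precision
    deepspeed="ds_zero2.json",        # ZeRO-2 config (hypothetical file name)
    optim="adamw_torch_fused",        # fused AdamW
    lr_scheduler_type="cosine",       # or "linear", model-specific
    learning_rate=2e-5,               # LR from the evaluation table below
    per_device_train_batch_size=4,    # 8 GPUs x 4 x 1 accum -> effective batch 32
    gradient_accumulation_steps=1,
    num_train_epochs=3,
)
```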

## 🧪 Evaluation Highlights

| Model | Params | Epochs | LR | LoRA (r/α) | Time | Eval Acc. | Eval Loss |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MedGemma-Transf. | 4.3 B | 4 | 2e-5 | 16 / 64 | 27 h | 84.97 % | 0.4111 |
| Qwen2.5-VL-Transf. | 8.3 B | 4 | 2e-5 | 16 / 64 | 30.9 h | 85.91 % | 0.3883 |
| Qwen2.5-VL | 8.3 B | 3 | 2e-5 | 16 / 64 | 23 h | 85.78 % | 0.3906 |

*(Evaluation performed on a 1 % held-out subset of the training data.)*

## 🧮 Evaluation Protocol

Traditional n-gram metrics (BLEU, ROUGE) fail to capture clinical correctness, so these models are evaluated with an LLM-based structured adjudicator (Qwen/Qwen3-30B-A3B). Each model prediction is graded per clinical aspect (`polyp_type`, `instrument_presence`, etc.) with binary scores and textual justifications:

```json
{
  "eval_json": {
    "polyp_type": {"score": 1, "reason": "Model correctly identified a sessile polyp."},
    "instrument_presence": {"score": 0, "reason": "Failed to mention visible biopsy forceps."}
  }
}
```

This yields fine-grained, reproducible, category-wise accuracy metrics that reflect clinical reasoning performance more faithfully than surface-level text overlap. See the paper for details.
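
As a sketch of how such structured verdicts can be aggregated into category-wise accuracy, assuming only the `eval_json` field layout from the example above (the helper function and sample records below are hypothetical):

```python
from collections import defaultdict

def categorywise_accuracy(records):
    """Aggregate binary adjudicator scores into per-aspect accuracy.

    `records` is a list of adjudicator outputs shaped like the example
    above, i.e. {"eval_json": {aspect: {"score": 0 or 1, "reason": ...}}}.
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for rec in records:
        for aspect, verdict in rec["eval_json"].items():
            totals[aspect] += 1
            correct[aspect] += verdict["score"]
    return {aspect: correct[aspect] / totals[aspect] for aspect in totals}

# Hypothetical usage with two adjudicated predictions:
records = [
    {"eval_json": {"polyp_type": {"score": 1, "reason": "..."},
                   "instrument_presence": {"score": 0, "reason": "..."}}},
    {"eval_json": {"polyp_type": {"score": 1, "reason": "..."}}},
]
print(categorywise_accuracy(records))  # {'polyp_type': 1.0, 'instrument_presence': 0.0}
```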

๐Ÿ–ผ๏ธ Usage Example

```bash
pip install ms-swift==3.8.0 bitsandbytes qwen_vl_utils==0.0.11
```

```python
import torch
from swift.llm import PtEngine, RequestConfig, InferRequest
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization so the 7B base model fits in modest GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# Load the base model and attach the fine-tuned LoRA adapter
engine = PtEngine(
    adapters=["SimulaMet/Qwen2.5-VL-KvasirVQA-x1-ft"],  # or another fine-tuned adapter ID
    model_id_or_path="Qwen/Qwen2.5-VL-7B-Instruct",     # or the matching base model ID
    quantization_config=bnb_config,
    attn_impl="sdpa",
    use_hf=True,
)

# Generation settings: low temperature favors concise, deterministic clinical answers
req_cfg = RequestConfig(max_tokens=512, temperature=0.3, top_k=20, top_p=0.7, repetition_penalty=1.05)

# A single VQA request: one endoscopic image plus a natural-language question
infer_requests = [
    InferRequest(messages=[{
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1/resolve/main/images/clb0kvxvm90y4074yf50vf5nq.jpg"},
            {"type": "text", "text": "What is shown in the image?"},
        ],
    }])
]

resp = engine.infer(infer_requests, req_cfg)
print(resp[0].choices[0].message.content)
```

👉 See detailed examples in the Colab usage notebook.

## 📄 License

Each fine-tuned model inherits the license of its base model; see the respective base models' LICENSE files.

## 📢 Citation

If you use these models or the dataset, please cite:

```bibtex
@incollection{Gautam2025Oct,
  author    = {Gautam, Sushant and Riegler, Michael and Halvorsen, P{\aa}l},
  title     = {{Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy}},
  booktitle = {{Data Engineering in Medical Imaging}},
  pages     = {53--63},
  year      = {2025},
  month     = oct,
  isbn      = {978-3-032-08009-7},
  publisher = {Springer},
  address   = {Cham, Switzerland},
  doi       = {10.1007/978-3-032-08009-7_6}
}
```