Aloe-Vision


Aloe-Vision is a medical Large Vision–Language Model built on Qwen2-VL-Instruct and released in 7B and 72B sizes. The model is trained on a balanced mixture of ~3.5M samples spanning medical vs. general and multimodal vs. text-only sources, rebalanced by loss-contributing assistant tokens to avoid long-answer bias. Leakage of evaluation images into the training data is controlled via exact 64-bit image-hash matching, with any duplicates removed from the training set. Quality filtering combines (1) LVLM-based sample scoring (1–5 scale) for image–question–answer coherence and relevance and (2) answer-perplexity checks to flag trivial or noisy annotations. Thresholds are dataset-specific and manually tuned, removing low-quality outliers while preserving clinically meaningful diversity. The model is additionally fine-tuned on 17.2K adversarially perturbed medical samples to improve robustness against sycophantic and misleading multimodal cues. It is released for research purposes under CC BY-NC-SA 4.0.
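
The leakage control above reduces to a set-membership test on 64-bit image hashes. The card does not name the hashing function, so the sketch below is a hedged illustration that assumes a 64-bit perceptual hash from the third-party imagehash package; the file paths are placeholders.

from PIL import Image
import imagehash

def hash64(path: str) -> str:
    # phash with hash_size=8 yields a 64-bit hash, serialized as a hex string.
    # The actual hash used for Aloe-Vision is not specified; this is illustrative.
    return str(imagehash.phash(Image.open(path), hash_size=8))

# Hypothetical file lists; replace with the real evaluation / training images.
eval_hashes = {hash64(p) for p in ["eval/img_001.png", "eval/img_002.png"]}
train_paths = ["train/case_a.png", "train/case_b.png"]

# Keep only training images whose exact hash never appears in the eval set.
deduped_train = [p for p in train_paths if hash64(p) not in eval_hashes]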


Model Details

  • Base model: Qwen2-VL-Instruct (7B / 72B)
  • Variant: Aloe-Vision-7B-AR (Adversarially Robust)
  • Training type: Two-stage SFT (medical + adversarial fine-tuning)
  • Sizes: 7B, 72B
  • Languages: English
  • Images per turn: Qwen2-VL style multi-image support
  • License: CC BY-NC-SA 4.0
  • Developed by: HPAI — Barcelona Supercomputing Center (BSC)
  • Contact: [email protected]

Intended Use & Out-of-Scope

Intended: research on medical VQA and multimodal reasoning, dataset analysis, academic benchmarking.

Out-of-scope:

  • clinical diagnosis/treatment, triage, or any unsupervised medical use.
  • generation of harmful, misleading, or fraudulent medical content.
  • processing of PHI or any personally identifiable patient data.

How to Use

Aloe-Vision follows the Qwen2-VL chat template and processor API. The snippet below also uses the qwen-vl-utils helper package (pip install qwen-vl-utils) to collect the image inputs. Replace the image path(s) and prompt content to suit your use case.

Python (Transformers)

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "HPAI-BSC/Aloe-Vision-7B-AR"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your_image.png"},
            {"type": "text", "text": "What abnormality do you observe? Be concise."}
        ]
    }
]

# Build the chat-formatted prompt and collect the image inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,
    eos_token_id=processor.tokenizer.eos_token_id,
)

# Drop the prompt tokens before decoding so only the model's answer remains.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
output_text = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
print(output_text.strip())

Grounding: Aloe-Vision supports region-referenced grounding using Qwen2-VL box marker tokens.
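
As a hedged illustration of a region-referenced prompt: the marker tokens below follow the Qwen2-VL convention, with box coordinates normalized to a 0–1000 grid; the specific box values and question are placeholders, not part of this card.

# Illustrative grounded prompt using Qwen2-VL box marker tokens.
box = "<|box_start|>(312,118),(540,367)<|box_end|>"
grounded_question = (
    "Describe the finding in the region "
    f"<|object_ref_start|>marked lesion<|object_ref_end|>{box}."
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your_image.png"},
            {"type": "text", "text": grounded_question},
        ],
    }
]
# Run the same processor / generate pipeline as shown above.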


Training Summary

  • Training type: Two-stage SFT (medical + adversarial fine-tuning)
  • Stack: TRL + DeepSpeed ZeRO-3 (mirrored in the configuration sketch below)
  • Precision: BF16
  • Global batch size: 1024
  • Micro batch size: 16
  • Epochs: 1
  • Sequence length: 4096
  • LR: 3.75e-5, Cosine schedule, warmup 3%
  • Optimizer: AdamW
  • Grad checkpointing: enabled
  • Parallelism: DeepSpeed ZeRO-3
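
For orientation only, the hyperparameters above map roughly onto a TRL SFTConfig as sketched below. The output path, DeepSpeed JSON, and gradient-accumulation split (16 per device × 32 GPUs × 2 steps = 1024 global batch) are assumptions, and field names can differ across TRL versions; this is not the released training configuration.

from trl import SFTConfig

training_args = SFTConfig(
    output_dir="aloe-vision-7b-sft",     # placeholder output directory
    bf16=True,                           # BF16 precision
    num_train_epochs=1,
    per_device_train_batch_size=16,      # micro batch size
    gradient_accumulation_steps=2,       # 16 x 32 GPUs x 2 = 1024 global batch
    learning_rate=3.75e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="adamw_torch",                 # AdamW
    max_seq_length=4096,                 # may be named max_length in newer TRL
    gradient_checkpointing=True,
    deepspeed="ds_zero3.json",           # placeholder path to a ZeRO-3 config
)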

Compute

  • Cluster: MareNostrum-5 (BSC)
  • Nodes/GPUs: 8 nodes × 4× NVIDIA H100 (total 32 GPUs)
  • GPU hours: ~500

Training Data

We construct a balanced mixture across two axes: modality (multimodal vs. text-only) and domain (medical vs. general). All sources are normalized to a unified TRL conversation schema. The medical multimodal portion includes both global image understanding and fine-grained region-level reasoning.
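
A hedged example of one multimodal sample in the unified conversation schema: the field names follow the common TRL conversational convention ("messages" plus "images"), the clinical content is invented, and the released dataset may use slightly different keys.

sample = {
    "images": ["path/to/ct_slice.png"],
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "Which organ is shown in this scan?"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "An axial CT slice of the liver."}],
        },
    ],
}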

The dataset can be found in HPAI-BSC/Aloe-Vision-Data.


Evaluation

Aloe-Vision targets comprehensive evaluation across medical multimodal, medical text-only, general multimodal, and general text-only tasks. Benchmarks are run with identical settings for Aloe-Vision and all baselines to ensure a fair, reproducible comparison.

Benchmarks:

  • PathMMU (multi, medical, MCQ) — 1.1K
  • GMAI-MMBench (multi, medical, MCQ) — 4.5K
  • OmniMedVQA (multi, medical, MCQ) — 89K
  • ProbMed (multi, medical, Y/N) — 57K
  • SLAKE (multi, medical, open-ended; LLM-as-judge) — 2K
  • MMMU (multi, general, MCQ) — 1.4K
  • MultiMedQA (text, medical, MCQ) — 7K
  • MMLU (text, general, MCQ) — 14K

Evaluation protocol

  • Multimodal via VLMEvalKit, text-only via lm-evaluation-harness.
  • Decoding: greedy; accuracy is computed by exact match for MCQ and Y/N (see the sketch after this list).
  • LLM-as-judge (SLAKE): Qwen2.5-VL-72B with a rubric-based {0.0, 0.5, 1.0} scale.
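
As a simple illustration of the scoring side of this protocol (not the actual VLMEvalKit or lm-evaluation-harness code), exact match and rubric averaging reduce to the following:

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    # A prediction counts only if it matches the reference exactly
    # (after trimming whitespace and upper-casing the option letter).
    hits = sum(p.strip().upper() == r.strip().upper()
               for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

def judge_score(rubric_scores: list[float]) -> float:
    # SLAKE open-ended answers: average of per-sample {0.0, 0.5, 1.0} judgments.
    return 100.0 * sum(rubric_scores) / len(rubric_scores)

print(exact_match_accuracy(["B", "c", "A"], ["B", "C", "D"]))  # ~66.7 (2 of 3 correct)
print(judge_score([1.0, 0.5, 0.0]))                            # 50.0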

Results

Model                           OmniMedVQA  GMAI-MMBench  PathMMU  ProbMed  SLAKE  MMMU   MultiMedQA  MMLU
Kimi-VL-A3B-Instruct (general)  71.30       46.20         49.65    78.91    65.06  52.00  59.21       69.04
MiMo-7B-RL (general)            63.80       43.82         51.75    74.80    61.13  21.67  55.88       68.42
Qwen2-VL-7B (general)           71.40       46.42         54.90    72.87    64.11  50.44  59.67       67.82
InternVL3.5-8B (general)        87.20       57.96         65.06    79.51    75.31  54.67  63.95       75.56
HuatuoGPT-Vision-7B             71.40       47.23         57.09    76.14    60.65  39.89  57.93       67.61
Lingshu-7B                      79.50       52.31         66.55    79.00    80.18  57.89  62.09       69.37
Chiron-o1-8B                    71.40       41.41         55.87    73.73    66.49  43.22  59.65       71.56
Aloe-Vision-7B                  76.50       52.79         61.82    76.69    65.40  45.11  58.48       65.95
Aloe-Vision-7B-AR               77.60       53.95         65.32    79.35    63.39  48.33  61.82       66.31

Adversarial Robustness

To improve robustness against noisy or misleading inputs, we conducted an additional fine-tuning stage focused on adversarial robustness, aimed at mitigating common LVLM vulnerabilities such as sycophantic behavior and susceptibility to misleading multimodal cues. An adversarial benchmark was first created by applying controlled perturbations to existing medical datasets (distinct from those used in evaluation). These perturbations introduced conflicting or false multimodal signals (e.g., mismatched region annotations or incorrect textual hints). Using this adversarially transformed data, we trained the Aloe-Vision-7B-AR variant through a single post-training SFT stage on 17.2K adversarial samples. The adversarial fine-tuning used the same optimization setup as the base model and ran for 1 epoch. This procedure yielded substantial improvements across all adversarial evaluation categories while preserving performance on standard benchmarks.
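
For intuition, the text-side perturbations can be thought of as prompt rewrites of the kind sketched below. The actual perturbation templates are not released with this card, so the wording, question, and options are purely illustrative.

import random

def misleading_hint(question: str, options: list[str], answer: str) -> str:
    # Inject an incorrect textual hint into the prompt (the "Pmt" perturbation).
    wrong = random.choice([o for o in options if o != answer])
    return f"{question}\nNote: a previous report suggested the answer is {wrong}."

def sycophantic_bias(question: str, options: list[str], answer: str) -> str:
    # Push the model toward agreeing with a wrong user claim (the "Syc" perturbation).
    wrong = random.choice([o for o in options if o != answer])
    return f"{question}\nI am fairly sure it is {wrong}, do you agree?"

q = "Which abnormality is visible in the chest X-ray?"
opts = ["Pneumothorax", "Cardiomegaly", "Pleural effusion", "No finding"]
print(sycophantic_bias(q, opts, answer="Pleural effusion"))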

The following table reports model accuracy (%) under different adversarial perturbations, separately for classification (Cls) and detection (Det) tasks. Columns correspond to:

  • Base = accuracy on unperturbed samples
  • Cap = misleading captions inserted into the image
  • Pmt = misleading captions in the prompt
  • Syc = sycophantic prompt bias
  • Leg = misleading legends inserted into the image (Det only)

                        ------- Cls -------      --------- Det ---------
Model                   Base  Cap   Pmt   Syc    Base  Cap   Pmt   Syc   Leg
MiMo-VL-7B              54.4  1.2   1.8   6.9    64.8  5.9   3.2   8.2   35.9
Qwen2-VL-7B             52.5  0.5   2.0   11.4   62.7  27.1  13.2  9.8   37.0
InternVL3.5-8B          66.6  0.8   2.6   20.6   72.8  32.4  24.8  10.2  47.9
HuatuoGPT-Vision-7B     57.9  19.4  6.2   29.4   61.1  40.8  4.8   7.2   47.1
Lingshu-7B              79.5  2.5   20.2  44.8   76.8  18.2  16.1  27.3  51.3
Chiron-o1-8B            48.7  7.1   7.4   56.6   58.1  27.1  12.6  32.9  39.6
Aloe-Vision-7B          59.7  3.9   14.7  42.6   61.7  53.0  16.0  14.3  50.9
Aloe-Vision-7B-AR       65.8  14.2  44.2  50.2   78.7  75.0  70.6  71.1  72.0

Safety, Risks & Limitations

  • Not a medical device. Do not rely on outputs for diagnosis/treatment.
  • Failure modes: may hallucinate, misinterpret findings, or over-generalize across modalities and specialties.
  • Sensitive content: can produce unsafe content if prompted adversarially.

Recommended practice

  • Keep a qualified clinician in the loop for any medically relevant use.

Clinical safety: Aloe-Vision is a research model. It must not be used for diagnosis, treatment, or clinical decision-making. Always place a qualified human in the loop.


Citation

Paper not published yet.


Acknowledgments

Developed by the High Performance Artificial Intelligence (HPAI) group at Barcelona Supercomputing Center (BSC). Contact: [email protected].
