Aloe-Vision is a medical Large Vision–Language Model built on Qwen2-VL-Instruct, released in 7B and 72B sizes. The model is trained on a balanced mixture of ~3.5M samples spanning medical vs. general and multimodal vs. text-only sources, rebalanced by loss-contributing assistant tokens to avoid long-answer bias. We control leakage of evaluation images into the training data via exact 64-bit image-hash matching, removing any matches from the training set. Quality filtering combines (1) LVLM-based sample scoring (1–5 scale) for image–question–answer coherence and relevance and (2) answer perplexity checks to flag trivial or noisy annotations. Filtering thresholds are dataset-specific and manually tuned, removing low-quality outliers while preserving clinically meaningful diversity. The model is further fine-tuned on 17.2K adversarially perturbed medical samples to improve robustness against sycophantic and misleading multimodal cues. Aloe-Vision is released for research purposes under CC BY-NC-SA 4.0.
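As an intuition for the leakage-control step, the sketch below drops training samples whose 64-bit image hash exactly matches an evaluation image. It is an illustrative reconstruction, not the released pipeline: the choice of imagehash.phash as the hash function and all paths are assumptions.

```python
# Illustrative leakage control: remove training samples whose 64-bit image hash
# exactly matches an evaluation image. imagehash.phash (8x8 -> 64 bits) is an
# assumed choice; only "exact 64-bit image-hash matching" is specified.
from PIL import Image
import imagehash

def hash64(path: str) -> str:
    return str(imagehash.phash(Image.open(path)))

eval_image_paths = ["eval/img_0001.png"]                          # placeholder
train_samples = [{"image": "train/img_1234.png", "text": "..."}]  # placeholder

eval_hashes = {hash64(p) for p in eval_image_paths}
train_samples = [s for s in train_samples if hash64(s["image"]) not in eval_hashes]
```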
Model Details
- Base model: Qwen2-VL-Instruct (7B / 72B)
- Variant: Aloe-Vision-7B-AR (Adversarially Robust)
- Training type: Two-stage SFT (medical + adversarial fine-tuning)
- Sizes: 7B, 72B
- Languages: English
- Images per turn: Qwen2-VL style multi-image support
- License: CC BY-NC-SA 4.0
- Developed by: HPAI — Barcelona Supercomputing Center (BSC)
- Contact: [email protected]
Intended Use & Out-of-Scope
Intended: research on medical VQA and multimodal reasoning, dataset analysis, academic benchmarking.
Out-of-scope:
- clinical diagnosis/treatment, triage, or any unsupervised medical use.
- generation of harmful, misleading, or fraudulent medical content.
- processing of PHI or any personally identifiable patient data.
How to Use
Aloe-Vision follows the Qwen2-VL chat template and processor API; the example below also uses the qwen-vl-utils helper package (pip install qwen-vl-utils) for vision preprocessing. Replace the image path(s) and prompt content to suit your use case.
Python (Transformers)
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "HPAI-BSC/Aloe-Vision-7B-AR"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your_image.png"},
            {"type": "text", "text": "What abnormality do you observe? Be concise."},
        ],
    }
]

# Build the chat prompt and extract the vision inputs (Qwen2-VL style).
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,
    eos_token_id=processor.tokenizer.eos_token_id,
)

# Keep only the newly generated tokens, then decode.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
output_text = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
print(output_text.strip())
Grounding: Aloe-Vision supports region-referenced grounding using Qwen2-VL box marker tokens.
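For grounded queries, a region can be referenced directly in the text content using the Qwen2-VL box marker tokens, with coordinates normalized to a 0–1000 grid. The box values and wording below are placeholders for illustration, not outputs of the model:

```python
# Hypothetical region-referenced question; box coordinates (0-1000 normalized) are placeholders.
region_question = (
    "Describe the finding inside the region "
    "<|box_start|>(312,148),(476,302)<|box_end|>. Be concise."
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your_image.png"},
            {"type": "text", "text": region_question},
        ],
    }
]
# Generation proceeds exactly as in the snippet above.
```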
Training Summary
- Training type: Two-stage SFT (medical + adversarial fine-tuning)
- Stack: TRL + DeepSpeed ZeRO-3
- Precision: BF16
- Global batch size: 1024
- Micro batch size: 16
- Epochs: 1
- Sequence length: 4096
- LR: 3.75e-5, Cosine schedule, warmup 3%
- Optimizer: AdamW
- Grad checkpointing: enabled
- Parallelism: DeepSpeed ZeRO-3
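For reference, the hyperparameters above map roughly onto a TRL SFTConfig as sketched below. This is an illustrative reconstruction, not the released training script; the output directory and DeepSpeed config path are placeholders.

```python
from trl import SFTConfig

# Illustrative mapping of the reported hyperparameters (not the released script).
# Global batch 1024 = 16 (micro batch) x 32 GPUs x 2 (gradient accumulation).
training_args = SFTConfig(
    output_dir="aloe-vision-7b-sft",   # placeholder
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    learning_rate=3.75e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="adamw_torch",
    bf16=True,
    gradient_checkpointing=True,
    max_seq_length=4096,               # renamed to `max_length` in recent TRL releases
    deepspeed="ds_zero3_config.json",  # placeholder path to a ZeRO-3 config
)
```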
Compute
- Cluster: MareNostrum-5 (BSC)
- Nodes/GPUs: 8 nodes × 4× NVIDIA H100 (total 32 GPUs)
- GPU hours: ~500
Training Data
We construct a balanced mixture across two axes: modality (multimodal vs. text-only) and domain (medical vs. general). All sources are normalized to a unified TRL conversation schema. The medical multimodal portion covers both global image understanding and fine-grained region-level reasoning.
The dataset can be found in HPAI-BSC/Aloe-Vision-Data.
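For illustration, a single multimodal record in this unified schema looks roughly like the example below; the field values are placeholders and the exact keys may differ in the released dataset:

```python
# Illustrative TRL-style conversational record (values are placeholders).
sample = {
    "images": ["path/to/chest_xray.png"],
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "Is there evidence of pleural effusion?"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "Yes, a small left-sided pleural effusion is visible."}],
        },
    ],
}
```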
Evaluation
Aloe-Vision is evaluated across medical multimodal, medical text-only, general multimodal, and general text-only tasks. All benchmarks are run with identical settings for Aloe-Vision and the baselines to ensure a fair, reproducible comparison.
Benchmarks:
- PathMMU (multi, medical, MCQ) — 1.1K
- GMAI-MMBench (multi, medical, MCQ) — 4.5K
- OmniMedVQA (multi, medical, MCQ) — 89K
- ProbMed (multi, medical, Y/N) — 57K
- SLAKE (multi, medical, open-ended; LLM-as-judge) — 2K
- MMMU (multi, general, MCQ) — 1.4K
- MultiMedQA (text, medical, MCQ) — 7K
- MMLU (text, general, MCQ) — 14K
Evaluation protocol
- Multimodal via VLMEvalKit, text-only via lm-evaluation-harness.
- Decoding: greedy; accuracy is computed by exact match for MCQ and Y/N items.
- LLM-as-judge (SLAKE): Qwen2.5-VL-72B with a rubric-based {0.0, 0.5, 1.0} scale.
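As a simplified illustration of how the scores are computed (not the VLMEvalKit or lm-evaluation-harness internals):

```python
# Simplified scoring sketch; the real harnesses also handle prompting and answer extraction.

def exact_match_accuracy(predictions, references):
    """Exact-match accuracy (%) for MCQ and Y/N items, e.g. 'B' vs. 'B', 'yes' vs. 'no'."""
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

def judge_accuracy(ratings):
    """Mean of rubric-based LLM-as-judge ratings in {0.0, 0.5, 1.0}, reported as a percentage."""
    assert all(r in (0.0, 0.5, 1.0) for r in ratings)
    return 100.0 * sum(ratings) / len(ratings)

print(exact_match_accuracy(["B", "yes"], ["B", "no"]))  # 50.0
print(judge_accuracy([1.0, 0.5, 0.0, 1.0]))             # 62.5
```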
Results
All scores are reported as percentages (accuracy for MCQ and Y/N benchmarks, LLM-as-judge score for SLAKE).
| Model | OmniMedVQA | GMAI-MMBench | PathMMU | ProbMed | SLAKE | MMMU | MultiMedQA | MMLU |
|---|---|---|---|---|---|---|---|---|
| Kimi-VL-A3B-Instruct (general) | 71.30 | 46.20 | 49.65 | 78.91 | 65.06 | 52.00 | 59.21 | 69.04 |
| MiMo-7B-RL (general) | 63.80 | 43.82 | 51.75 | 74.80 | 61.13 | 21.67 | 55.88 | 68.42 |
| Qwen2-VL-7B (general) | 71.40 | 46.42 | 54.90 | 72.87 | 64.11 | 50.44 | 59.67 | 67.82 |
| InternVL3.5-8B (general) | 87.20 | 57.96 | 65.06 | 79.51 | 75.31 | 54.67 | 63.95 | 75.56 |
| HuatuoGPT-Vision-7B | 71.40 | 47.23 | 57.09 | 76.14 | 60.65 | 39.89 | 57.93 | 67.61 |
| Lingshu-7B | 79.50 | 52.31 | 66.55 | 79.00 | 80.18 | 57.89 | 62.09 | 69.37 |
| Chiron-o1-8B | 71.40 | 41.41 | 55.87 | 73.73 | 66.49 | 43.22 | 59.65 | 71.56 |
| Aloe-Vision-7B | 76.50 | 52.79 | 61.82 | 76.69 | 65.40 | 45.11 | 58.48 | 65.95 |
| Aloe-Vision-7B-AR | 77.60 | 53.95 | 65.32 | 79.35 | 63.39 | 48.33 | 61.82 | 66.31 |
Adversarial Robustness
To improve robustness against noisy or misleading inputs, we conducted an additional fine-tuning stage focused on adversarial robustness. This stage aims to mitigate common LVLM vulnerabilities such as sycophantic behavior and susceptibility to misleading multimodal cues. An adversarial benchmark was first created by applying controlled perturbations to existing medical datasets (distinct from those used in evaluation). These perturbations introduce conflicting or false multimodal signals, e.g., mismatched region annotations or incorrect textual hints. Using this adversarially transformed data, we trained the Aloe-Vision-7B-AR variant through a single post-training SFT stage on 17.2K adversarial samples, using the same optimization setup as the base model for 1 epoch. This procedure yields substantial improvements across all adversarial evaluation categories while preserving performance on standard benchmarks.
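For intuition, the sketch below shows the kinds of perturbation templates involved; the wording is illustrative and not the exact transformation used to build the benchmark:

```python
# Illustrative perturbation templates (not the exact ones used for the benchmark).
clean_question = "Which abnormality is shown in the image?"
wrong_hint = "pneumothorax"  # deliberately incorrect

perturbed = {
    # Misleading caption: a false textual hint accompanies the image or prompt.
    "misleading_caption": f"Figure 1: Typical presentation of {wrong_hint}. {clean_question}",
    # Sycophantic bias: the user asserts a wrong answer and asks for agreement.
    "sycophancy": f"{clean_question} I am fairly sure this is {wrong_hint}, do you agree?",
}
print(perturbed["sycophancy"])
```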
The following table reports model accuracy (%) under different adversarial perturbations, separately for classification (Cls) and detection (Det) tasks. Columns correspond to:
- Base = accuracy on unperturbed samples
- Cap = misleading captions inserted into the image
- Pmt = misleading captions in the prompt
- Syc = sycophantic prompt bias
- Leg = misleading legends inserted into the image (detection only)
| Model | Cls Base | Cls Cap | Cls Pmt | Cls Syc | Det Base | Det Cap | Det Pmt | Det Syc | Det Leg |
|---|---|---|---|---|---|---|---|---|---|
| MiMo-VL-7B | 54.4 | 1.2 | 1.8 | 6.9 | 64.8 | 5.9 | 3.2 | 8.2 | 35.9 |
| Qwen2-VL-7B | 52.5 | 0.5 | 2.0 | 11.4 | 62.7 | 27.1 | 13.2 | 9.8 | 37.0 |
| InternVL3.5-8B | 66.6 | 0.8 | 2.6 | 20.6 | 72.8 | 32.4 | 24.8 | 10.2 | 47.9 |
| HuatuoGPT-Vision-7B | 57.9 | 19.4 | 6.2 | 29.4 | 61.1 | 40.8 | 4.8 | 7.2 | 47.1 |
| Lingshu-7B | 79.5 | 2.5 | 20.2 | 44.8 | 76.8 | 18.2 | 16.1 | 27.3 | 51.3 |
| Chiron-o1-8B | 48.7 | 7.1 | 7.4 | 56.6 | 58.1 | 27.1 | 12.6 | 32.9 | 39.6 |
| Aloe-Vision-7B | 59.7 | 3.9 | 14.7 | 42.6 | 61.7 | 53.0 | 16.0 | 14.3 | 50.9 |
| Aloe-Vision-7B-AR | 65.8 | 14.2 | 44.2 | 50.2 | 78.7 | 75.0 | 70.6 | 71.1 | 72.0 |
Safety, Risks & Limitations
- Not a medical device. Do not rely on outputs for diagnosis/treatment.
- Failure modes: may hallucinate, misinterpret findings, or over-generalize across modalities and specialties.
- Sensitive content: can produce unsafe content if prompted adversarially.
Recommended practice
- Keep a qualified clinician in the loop for any medically relevant use.
Clinical safety: Aloe-Vision is a research model. It must not be used for diagnosis, treatment, or clinical decision-making. Always place a qualified human in the loop.
Citation
Paper not published yet.
Acknowledgments
Developed by the High Performance Artificial Intelligence (HPAI) group at Barcelona Supercomputing Center (BSC). Contact: [email protected].