Aloe-Vision is a medical Large Vision–Language Model built on Qwen2-VL-Instruct, released in 7B and 72B sizes. The model is trained on a balanced mixture of ~3.5M samples spanning medical vs. general and multimodal vs. text-only sources, rebalanced by loss-contributing assistant tokens to avoid long-answer bias. We control leakage of evaluation images into the training data via exact 64-bit image-hash matching, removing any matches from the training set. Quality filtering combines (1) LVLM-based sample scoring (1–5 scale) for image–question–answer coherence and relevance and (2) answer perplexity checks to flag trivial or noisy annotations. Filtering thresholds are dataset-specific and manually tuned, removing low-quality outliers while preserving clinically meaningful diversity. The model is further fine-tuned on 17.2K adversarially perturbed medical samples to improve robustness against sycophantic and misleading multimodal cues. Aloe-Vision is released for research purposes under CC BY-NC-SA 4.0.
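As an intuition for the leakage-control step, the sketch below drops training samples whose 64-bit image hash exactly matches an evaluation image. It is an illustrative reconstruction, not the released pipeline: the choice of imagehash.phash as the hash function and all paths are assumptions.

```python
# Illustrative leakage control: remove training samples whose 64-bit image hash
# exactly matches an evaluation image. imagehash.phash (8x8 -> 64 bits) is an
# assumed choice; only "exact 64-bit image-hash matching" is specified.
from PIL import Image
import imagehash

def hash64(path: str) -> str:
    return str(imagehash.phash(Image.open(path)))

eval_image_paths = ["eval/img_0001.png"]                          # placeholder
train_samples = [{"image": "train/img_1234.png", "text": "..."}]  # placeholder

eval_hashes = {hash64(p) for p in eval_image_paths}
train_samples = [s for s in train_samples if hash64(s["image"]) not in eval_hashes]
```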
Model Details
- Base model: Qwen2-VL-Instruct (7B / 72B)
- Variant: Aloe-Vision-7B-AR (Adversarially Robust)
- Training type: Two-stage SFT (medical + adversarial fine-tuning)
- Sizes: 7B, 72B
- Languages: English
- Images per turn: Qwen2-VL style multi-image support
- License: CC BY-NC-SA 4.0
- Developed by: HPAI — Barcelona Supercomputing Center (BSC)
- Contact: [email protected]
Intended Use & Out-of-Scope
Intended: research on medical VQA and multimodal reasoning, dataset analysis, academic benchmarking.
Out-of-scope:
- clinical diagnosis/treatment, triage, or any unsupervised medical use.
- generation of harmful, misleading, or fraudulent medical content.
- processing of PHI or any personally identifiable patient data.
How to Use
Aloe-Vision follows the Qwen2-VL chat template and processor API; the example below also uses the qwen-vl-utils helper package (pip install qwen-vl-utils) for vision preprocessing. Replace the image path(s) and prompt content to suit your use case.
Python (Transformers)
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "HPAI-BSC/Aloe-Vision-7B-AR"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your_image.png"},
            {"type": "text", "text": "What abnormality do you observe? Be concise."},
        ],
    }
]

# Build the chat prompt and extract the vision inputs (Qwen2-VL style).
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,
    eos_token_id=processor.tokenizer.eos_token_id,
)

# Keep only the newly generated tokens, then decode.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
output_text = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
print(output_text.strip())
Grounding: Aloe-Vision supports region-referenced grounding using Qwen2-VL box marker tokens.
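For grounded queries, a region can be referenced directly in the text content using the Qwen2-VL box marker tokens, with coordinates normalized to a 0–1000 grid. The box values and wording below are placeholders for illustration, not outputs of the model:

```python
# Hypothetical region-referenced question; box coordinates (0-1000 normalized) are placeholders.
region_question = (
    "Describe the finding inside the region "
    "<|box_start|>(312,148),(476,302)<|box_end|>. Be concise."
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your_image.png"},
            {"type": "text", "text": region_question},
        ],
    }
]
# Generation proceeds exactly as in the snippet above.
```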
Training Summary
- Training type: Two-stage SFT (medical + adversarial fine-tuning)
- Stack: TRL + DeepSpeed ZeRO-3
- Precision: BF16
- Global batch size: 1024
- Micro batch size: 16
- Epochs: 1
- Sequence length: 4096
- LR: 3.75e-5, Cosine schedule, warmup 3%
- Optimizer: AdamW
- Grad checkpointing: enabled
- Parallelism: DeepSpeed ZeRO-3
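For reference, the hyperparameters above map roughly onto a TRL SFTConfig as sketched below. This is an illustrative reconstruction, not the released training script; the output directory and DeepSpeed config path are placeholders.

```python
from trl import SFTConfig

# Illustrative mapping of the reported hyperparameters (not the released script).
# Global batch 1024 = 16 (micro batch) x 32 GPUs x 2 (gradient accumulation).
training_args = SFTConfig(
    output_dir="aloe-vision-7b-sft",   # placeholder
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    learning_rate=3.75e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="adamw_torch",
    bf16=True,
    gradient_checkpointing=True,
    max_seq_length=4096,               # renamed to `max_length` in recent TRL releases
    deepspeed="ds_zero3_config.json",  # placeholder path to a ZeRO-3 config
)
```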
Compute
- Cluster: MareNostrum-5 (BSC)
- Nodes/GPUs: 8 nodes × 4× NVIDIA H100 (total 32 GPUs)
- GPU hours: ~500
Training Data
We construct a balanced mixture across two axes: modality (multimodal vs. text-only) and domain (medical vs. general). All sources are normalized to a unified TRL conversation schema. The medical multimodal portion covers both global image understanding and fine-grained region-level reasoning.
The dataset can be found in HPAI-BSC/Aloe-Vision-Data.
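For illustration, a single multimodal record in this unified schema looks roughly like the example below; the field values are placeholders and the exact keys may differ in the released dataset:

```python
# Illustrative TRL-style conversational record (values are placeholders).
sample = {
    "images": ["path/to/chest_xray.png"],
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "Is there evidence of pleural effusion?"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "Yes, a small left-sided pleural effusion is visible."}],
        },
    ],
}
```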
Evaluation
Aloe-Vision is evaluated across medical multimodal, medical text-only, general multimodal, and general text-only tasks. All benchmarks are run with identical settings for Aloe-Vision and the baselines to ensure a fair, reproducible comparison.
Benchmarks:
- PathMMU (multi, medical, MCQ) — 1.1K
- GMAI-MMBench (multi, medical, MCQ) — 4.5K
- OmniMedVQA (multi, medical, MCQ) — 89K
- ProbMed (multi, medical, Y/N) — 57K
- SLAKE (multi, medical, open-ended; LLM-as-judge) — 2K
- MMMU (multi, general, MCQ) — 1.4K
- MultiMedQA (text, medical, MCQ) — 7K
- MMLU (text, general, MCQ) — 14K
Evaluation protocol
- Multimodal via VLMEvalKit, text-only via lm-evaluation-harness.
- Decoding: greedy; accuracy is computed by exact match for MCQ and Y/N items.
- LLM-as-judge (SLAKE): Qwen2.5-VL-72B with a rubric-based {0.0, 0.5, 1.0} scale.
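As a simplified illustration of how the scores are computed (not the VLMEvalKit or lm-evaluation-harness internals):

```python
# Simplified scoring sketch; the real harnesses also handle prompting and answer extraction.

def exact_match_accuracy(predictions, references):
    """Exact-match accuracy (%) for MCQ and Y/N items, e.g. 'B' vs. 'B', 'yes' vs. 'no'."""
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

def judge_accuracy(ratings):
    """Mean of rubric-based LLM-as-judge ratings in {0.0, 0.5, 1.0}, reported as a percentage."""
    assert all(r in (0.0, 0.5, 1.0) for r in ratings)
    return 100.0 * sum(ratings) / len(ratings)

print(exact_match_accuracy(["B", "yes"], ["B", "no"]))  # 50.0
print(judge_accuracy([1.0, 0.5, 0.0, 1.0]))             # 62.5
```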
Results
All scores are reported as percentages (accuracy for MCQ and Y/N benchmarks, LLM-as-judge score for SLAKE).
| Model | OmniMedVQA | GMAI-MMBench | PathMMU | ProbMed | SLAKE | MMMU | MultiMedQA | MMLU |
|---|---|---|---|---|---|---|---|---|
| Kimi-VL-A3B-Instruct (general) | 71.30 | 46.20 | 49.65 | 78.91 | 65.06 | 52.00 | 59.21 | 69.04 |
| MiMo-7B-RL (general) | 63.80 | 43.82 | 51.75 | 74.80 | 61.13 | 21.67 | 55.88 | 68.42 |
| Qwen2-VL-7B (general) | 71.40 | 46.42 | 54.90 | 72.87 | 64.11 | 50.44 | 59.67 | 67.82 |
| InternVL3.5-8B (general) | 87.20 | 57.96 | 65.06 | 79.51 | 75.31 | 54.67 | 63.95 | 75.56 |
| HuatuoGPT-Vision-7B | 71.40 | 47.23 | 57.09 | 76.14 | 60.65 | 39.89 | 57.93 | 67.61 |
| Lingshu-7B | 79.50 | 52.31 | 66.55 | 79.00 | 80.18 | 57.89 | 62.09 | 69.37 |
| Chiron-o1-8B | 71.40 | 41.41 | 55.87 | 73.73 | 66.49 | 43.22 | 59.65 | 71.56 |
| Aloe-Vision-7B | 76.50 | 52.79 | 61.82 | 76.69 | 65.40 | 45.11 | 58.48 | 65.95 |
| Aloe-Vision-7B-AR | 77.60 | 53.95 | 65.32 | 79.35 | 63.39 | 48.33 | 61.82 | 66.31 |
Adversarial Robustness
To improve robustness against noisy or misleading inputs, we conducted an additional fine-tuning stage focused on adversarial robustness. This stage aims to mitigate common LVLM vulnerabilities such as sycophantic behavior and susceptibility to misleading multimodal cues. An adversarial benchmark was first created by applying controlled perturbations to existing medical datasets (distinct from those used in evaluation). These perturbations introduce conflicting or false multimodal signals, e.g., mismatched region annotations or incorrect textual hints. Using this adversarially transformed data, we trained the Aloe-Vision-7B-AR variant through a single post-training SFT stage on 17.2K adversarial samples, using the same optimization setup as the base model for 1 epoch. This procedure yields substantial improvements across all adversarial evaluation categories while preserving performance on standard benchmarks.
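For intuition, the sketch below shows the kinds of perturbation templates involved; the wording is illustrative and not the exact transformation used to build the benchmark:

```python
# Illustrative perturbation templates (not the exact ones used for the benchmark).
clean_question = "Which abnormality is shown in the image?"
wrong_hint = "pneumothorax"  # deliberately incorrect

perturbed = {
    # Misleading caption: a false textual hint accompanies the image or prompt.
    "misleading_caption": f"Figure 1: Typical presentation of {wrong_hint}. {clean_question}",
    # Sycophantic bias: the user asserts a wrong answer and asks for agreement.
    "sycophancy": f"{clean_question} I am fairly sure this is {wrong_hint}, do you agree?",
}
print(perturbed["sycophancy"])
```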
The following table reports model accuracy (%) under different adversarial perturbations, separately for classification (Cls) and detection (Det) tasks. Columns correspond to:
- Base = accuracy on unperturbed samples
- Cap = misleading captions inserted into the image
- Pmt = misleading captions in the prompt
- Syc = sycophantic prompt bias
- Leg = misleading legends inserted into the image (detection only)
| Model | Cls Base | Cls Cap | Cls Pmt | Cls Syc | Det Base | Det Cap | Det Pmt | Det Syc | Det Leg |
|---|---|---|---|---|---|---|---|---|---|
| MiMo-VL-7B | 54.4 | 1.2 | 1.8 | 6.9 | 64.8 | 5.9 | 3.2 | 8.2 | 35.9 |
| Qwen2-VL-7B | 52.5 | 0.5 | 2.0 | 11.4 | 62.7 | 27.1 | 13.2 | 9.8 | 37.0 |
| InternVL3.5-8B | 66.6 | 0.8 | 2.6 | 20.6 | 72.8 | 32.4 | 24.8 | 10.2 | 47.9 |
| HuatuoGPT-Vision-7B | 57.9 | 19.4 | 6.2 | 29.4 | 61.1 | 40.8 | 4.8 | 7.2 | 47.1 |
| Lingshu-7B | 79.5 | 2.5 | 20.2 | 44.8 | 76.8 | 18.2 | 16.1 | 27.3 | 51.3 |
| Chiron-o1-8B | 48.7 | 7.1 | 7.4 | 56.6 | 58.1 | 27.1 | 12.6 | 32.9 | 39.6 |
| Aloe-Vision-7B | 59.7 | 3.9 | 14.7 | 42.6 | 61.7 | 53.0 | 16.0 | 14.3 | 50.9 |
| Aloe-Vision-7B-AR | 65.8 | 14.2 | 44.2 | 50.2 | 78.7 | 75.0 | 70.6 | 71.1 | 72.0 |
Safety, Risks & Limitations
- Not a medical device. Do not rely on outputs for diagnosis/treatment.
- Failure modes: may hallucinate, misinterpret findings, or over-generalize across modalities and specialties.
- Sensitive content: can produce unsafe content if prompted adversarially.
Recommended practice
- Keep a qualified clinician in the loop for any medically relevant use.
Clinical safety: Aloe-Vision is a research model. It must not be used for diagnosis, treatment, or clinical decision-making. Always place a qualified human in the loop.
Citation
Paper not published yet.
Acknowledgments
Developed by the High Performance Artificial Intelligence (HPAI) group at Barcelona Supercomputing Center (BSC). Contact: [email protected].