
Model Card for EvoQwen2.5-VL-Retriever-3B-v1

EvoQwen2.5-VL-Retriever-3B-v1 is a high-performance multimodal retrieval model built on the Qwen2.5-VL-3B-Instruct backbone and using multi-vector late interaction. The model is fine-tuned with an evolutionary training framework (Evo-Retriever), enabling accurate retrieval over complex visual documents.
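For intuition, the sketch below shows how multi-vector late-interaction (ColBERT-style MaxSim) scoring works in principle. It is illustrative only, not the model's internal implementation, and the tensor shapes and dimensions are assumptions.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Illustrative late-interaction (MaxSim) score between one query and one page.

    query_emb: (num_query_tokens, dim) multi-vector query embedding
    doc_emb:   (num_doc_patches, dim)  multi-vector page embedding
    Assumes both are L2-normalized, so the dot product is cosine similarity.
    """
    sim = query_emb @ doc_emb.T              # (num_query_tokens, num_doc_patches)
    # Each query token keeps only its best-matching patch; per-token maxima are summed.
    return sim.max(dim=1).values.sum()

# Toy example with random, normalized embeddings (dim=128 is an assumption).
q = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(700, 128), dim=-1)
print(maxsim_score(q, d))
```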

Version Specificity

• Model: ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-3B-v1

• Parameter Size: 3 billion (3B)

• Features: As the smaller model in this series, it outperforms other models of similar size on evaluation benchmarks, delivering higher retrieval accuracy in resource-constrained scenarios.

Performance

| Model | ViDoRe V2 (nDCG@5) | MMEB VisDoc (ndcg_linear@5) |
|---|---|---|
| ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-3B-v1 | 63.00 | 75.96 |
| ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-7B-v1 | 65.24 | 77.10 |

Usage

Make sure you have installed Transformers, PyTorch, Pillow, and colpali-engine.


```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

model_name = "ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-3B-v1"

model = ColQwen2_5.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = ColQwen2_5_Processor.from_pretrained(model_name)

# Your inputs
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
```
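The returned `scores` tensor has one row per query and one column per image, so the best-matching page for each query is a row-wise argmax. A minimal follow-up sketch (the printout format is ours, not part of colpali-engine):

```python
# scores: (num_queries, num_images); higher means more relevant.
best = scores.argmax(dim=1)
for i, query in enumerate(queries):
    print(f"{query!r} -> image #{best[i].item()} (score {scores[i, best[i]].item():.2f})")
```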

Training Parameters

All models are fine-tuned with the Evo-Retriever paradigm using a two-stage training schedule (one epoch per stage). Unless otherwise noted, parameter-efficient fine-tuning uses low-rank adapters (LoRA) with rank 32 for both the 3B and 7B models. Training is performed in bfloat16 precision with the paged_adamw_8bit optimizer on an 8-GPU H20 server under a data-parallel strategy, with a learning rate of 2e-5, cosine decay, 2% warm-up steps, and a batch size of 32.
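No training script is released with this card; the snippet below is only a hypothetical sketch of how the reported hyperparameters could be expressed with peft and transformers. The LoRA target modules, LoRA alpha/dropout, and the per-device batch split (assuming the reported batch size of 32 is the global batch over 8 GPUs) are assumptions, not values stated above.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Hypothetical LoRA setup mirroring the reported rank-32 adapters.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,                                            # assumption: not stated in the card
    lora_dropout=0.05,                                        # assumption: not stated in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: not stated in the card
    task_type="FEATURE_EXTRACTION",
)

# Hypothetical arguments mirroring the reported schedule (applied once per stage).
training_args = TrainingArguments(
    output_dir="evo-retriever-3b-stage1",
    num_train_epochs=1,                  # one epoch per stage
    per_device_train_batch_size=4,       # 4 x 8 GPUs = batch size 32 (split is an assumption)
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.02,                   # 2% warm-up steps
    bf16=True,
    optim="paged_adamw_8bit",
)
```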
