---
language:
- en
license: mit
library_name: peft
tags:
- reranking
- information-retrieval
- listwise
- lora
- peft
- generative
base_model: meta-llama/Llama-3.1-8B
datasets:
- abdoelsayed/DeAR-COT
pipeline_tag: text-generation
---
# DeAR-8B-Reranker-Listwise-LoRA-v1

## Model Description
DeAR-8B-Reranker-Listwise-LoRA-v1 is a LoRA adapter for listwise neural reranking. The adapter enables generative document ranking with Chain-of-Thought reasoning while requiring only ~100MB of storage, and it achieves near full-model performance on complex ranking tasks.
## Model Details
- Model Type: LoRA Adapter for Listwise Reranking
- Base Model: meta-llama/Llama-3.1-8B
- Adapter Size: ~100MB
- Training Method: LoRA with Supervised Fine-tuning + CoT
- LoRA Rank: 16
- LoRA Alpha: 32
- Framework: LLaMA-Factory
## Key Features

- **Lightweight:** Only ~100MB vs. 16GB for the full model
- **CoT Reasoning:** Generates ranking explanations
- **Listwise:** Considers relationships between documents
- **State-of-the-Art:** Outperforms GPT-4 on NovelEval
- **Efficient:** Faster training and deployment
## Usage

### Load with PEFT
```python
import torch
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM

# Load LoRA adapter (automatically loads the base model)
adapter_path = "abdoelsayed/dear-8b-reranker-listwise-lora-v1"
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

tokenizer = AutoTokenizer.from_pretrained(adapter_path, use_fast=True)
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_path,
    torch_dtype=dtype,
    device_map="auto",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Prepare ranking prompt
query = "When did Thomas Edison invent the light bulb?"
documents = [
    "Lightning strike at Seoul National University",
    "Thomas Edison tried to invent a device for car but failed",
    "Coffee is good for diet",
    "KEPCO fixes light problems",
    "Thomas Edison invented the light bulb in 1879",
]

doc_list = "\n".join([f"[{i}] {doc}" for i, doc in enumerate(documents)])
prompt = f"""I will provide you with {len(documents)} passages, each indicated by a number identifier [].
Rank the passages based on their relevance to the search query: {query}.
{doc_list}
Search Query: {query}.
Rank the passages above based on their relevance to the search query. Output the ranking as a list of numbers."""

# Generate ranking (greedy decoding, so no sampling parameters are needed)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
    )

ranking = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(f"Ranking: {ranking}")
# Output: [4] > [1] > [0] > [3] > [2]
```
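### Merge Adapter for Deployment

If you prefer a standalone checkpoint, the LoRA weights can be folded into the base model with PEFT's `merge_and_unload`, so inference no longer routes through the adapter. A minimal sketch (the output directory name is illustrative):

```python
# Fold the LoRA weights into the base model for standalone serving
merged_model = model.merge_and_unload()

# Save the merged weights and tokenizer (directory name is illustrative)
merged_model.save_pretrained("dear-8b-listwise-merged")
tokenizer.save_pretrained("dear-8b-listwise-merged")
```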
### 4-bit Quantization (Low Memory)
```python
import torch
from transformers import BitsAndBytesConfig
from peft import AutoPeftModelForCausalLM

# Load with 4-bit quantization (requires the bitsandbytes package)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_path,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```
### Complete Reranking Pipeline
```python
import re
from typing import List

import torch
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM


class ListwiseLoRAReranker:
    def __init__(self, adapter_path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(adapter_path, use_fast=True)
        self.model = AutoPeftModelForCausalLM.from_pretrained(
            adapter_path,
            torch_dtype=torch.bfloat16,
            device_map="auto",
            low_cpu_mem_usage=True,
        )
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

    def create_prompt(self, query: str, documents: List[str]) -> str:
        doc_list = "\n".join([f"[{i}] {doc[:300]}" for i, doc in enumerate(documents)])
        return f"""I will provide you with {len(documents)} passages, each indicated by a number identifier [].
Rank the passages based on their relevance to the search query: {query}.
{doc_list}
Search Query: {query}.
Rank the passages above based on their relevance to the search query. Output the ranking as a list of numbers."""

    def parse_ranking(self, text: str, num_docs: int) -> List[int]:
        # Extract identifiers like [3]; keep the first occurrence of each valid index
        numbers = re.findall(r'\[(\d+)\]', text)
        ranking = []
        for n in numbers:
            idx = int(n)
            if idx < num_docs and idx not in ranking:
                ranking.append(idx)
        # Append any documents the model omitted
        for i in range(num_docs):
            if i not in ranking:
                ranking.append(i)
        return ranking[:num_docs]

    def rerank(self, query: str, documents: List[str]) -> List[int]:
        prompt = self.create_prompt(query, documents)
        inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=50,
                do_sample=False,
                pad_token_id=self.tokenizer.pad_token_id,
            )
        output_text = self.tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        )
        return self.parse_ranking(output_text, len(documents))


# Usage
reranker = ListwiseLoRAReranker("abdoelsayed/dear-8b-reranker-listwise-lora-v1")
ranking = reranker.rerank(query, documents)
print(f"Ranked indices: {ranking}")
```
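In a retrieve-then-rerank setup, the listwise adapter only sees a small candidate set produced by a first-stage retriever. The sketch below illustrates that flow with the external `rank_bm25` package as a stand-in first stage; the package, the whitespace tokenization, and the top-20 cutoff are assumptions for illustration, not part of this model card.

```python
# Illustrative retrieve-then-rerank flow (rank_bm25 is an external, assumed dependency)
from rank_bm25 import BM25Okapi

corpus = documents  # in practice, a large passage collection
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

scores = bm25.get_scores(query.lower().split())

# Take the top-k first-stage candidates, then let the listwise reranker order them
top_k = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:20]
candidates = [corpus[i] for i in top_k]

reranked = reranker.rerank(query, candidates)
final_order = [top_k[i] for i in reranked]  # map back to corpus indices
print(f"Final order (corpus indices): {final_order}")
```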
## Training Details

### LoRA Configuration
```yaml
lora_rank: 16
lora_alpha: 32
target_modules:
- q_proj
- v_proj
- k_proj
- o_proj
- gate_proj
- up_proj
- down_proj
lora_dropout: 0.05
task_type: CAUSAL_LM
```
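For reference, the same adapter configuration expressed with PEFT's `LoraConfig`, as a sketch for readers who reproduce the setup directly with `peft` rather than LLaMA-Factory (only the values above come from this card):

```python
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=16,                      # lora_rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type=TaskType.CAUSAL_LM,
)
```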
### Training Setup
- Framework: LLaMA-Factory
- Dataset: DeAR-COT
- Learning Rate: 1e-5
- Batch Size: 4
- Gradient Accumulation: 4
- Epochs: 2
- Max Length: 2048
- GPUs: 4x A100 (80GB)
- Training Time: ~24 hours (~3x faster than full fine-tuning)
- Memory: ~50GB per GPU
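A LLaMA-Factory style training config matching the settings above might look roughly like the following. The key names follow LLaMA-Factory's YAML conventions, but the exact recipe is an assumption here; consult the GitHub repository for the actual training files.

```yaml
# Illustrative LLaMA-Factory SFT config (key names assumed; values taken from this card)
model_name_or_path: meta-llama/Llama-3.1-8B
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target: q_proj,v_proj,k_proj,o_proj,gate_proj,up_proj,down_proj
dataset: dear_cot              # registered name for abdoelsayed/DeAR-COT (assumed)
cutoff_len: 2048
learning_rate: 1.0e-5
num_train_epochs: 2
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
bf16: true
output_dir: saves/dear-8b-listwise-lora
```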
## Advantages of LoRA
| Feature | LoRA | Full Model |
|---|---|---|
| Storage | 100MB | 16GB |
| Training Time | 24h | 72h |
| Training Memory | 50GB | 70GB |
| Performance | 99% | 100% |
| Deployment | Fast | Slow |
## Performance Comparison

### TREC Deep Learning (NDCG@10)
| Method | DL19 | DL20 | Avg |
|---|---|---|---|
| LoRA | 77.6 | 75.3 | 76.5 |
| Full | 77.9 | 75.6 | 76.8 |
| RankGPT-4 | 75.6 | 70.6 | 73.1 |
### NovelEval
| Method | NDCG@10 |
|---|---|
| LoRA | 90.6 |
| Full | 91.0 |
| GPT-4 | 87.9 |
## When to Use

**Best for:**
- Resource-constrained environments
- Multiple domain-specific versions
- Fast experimentation
- Complex reasoning queries

**Use the full model for:**
- Absolute maximum performance
- Single production deployment
## Limitations
- Slightly lower performance (-0.3 NDCG@10)
- Still slower than pointwise models (~11s)
- Limited to ~20-50 documents per query (a sliding-window workaround is sketched below)
- Requires base model for inference
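The document-count limit is typically worked around by reranking long candidate lists with a sliding window, as is common for listwise LLM rerankers. A minimal sketch built on the `ListwiseLoRAReranker` class from the usage section (the window and stride values are illustrative, not taken from the paper):

```python
from typing import List

def sliding_window_rerank(reranker, query: str, documents: List[str],
                          window: int = 20, stride: int = 10) -> List[int]:
    """Rerank a long candidate list with a fixed-size window that slides
    from the bottom of the list to the top (illustrative parameters)."""
    order = list(range(len(documents)))
    end = len(order)
    while end > 0:
        start = max(0, end - window)
        window_ids = order[start:end]
        # Rerank only the documents inside the current window
        ranked = reranker.rerank(query, [documents[i] for i in window_ids])
        order[start:end] = [window_ids[i] for i in ranked]
        end -= stride
    return order

# Example: final_order = sliding_window_rerank(reranker, query, candidate_passages)
```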
## Related Models

**Full Version:**

**Other LoRA:**

**Resources:**
## Citation

```bibtex
@article{abdallah2025dear,
  title={DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation},
  author={Abdallah, Abdelrahman and Mozafari, Jamshid and Piryani, Bhawna and Jatowt, Adam},
  journal={arXiv preprint arXiv:2508.16998},
  year={2025}
}
```
## License
MIT License
## More Information
- GitHub: DataScienceUIBK/DeAR-Reranking
- Paper: arXiv:2508.16998
- Collection: DeAR Models