---
language:
- en
license: mit
library_name: peft
tags:
- reranking
- information-retrieval
- pointwise
- lora
- peft
- ranknet
base_model: meta-llama/Llama-3.1-8B
datasets:
- Tevatron/msmarco-passage
- abdoelsayed/DeAR-COT
pipeline_tag: text-classification
---

# DeAR-8B-Reranker-RankNet-LoRA-v1

## Model Description

**DeAR-8B-Reranker-RankNet-LoRA-v1** is a LoRA (Low-Rank Adaptation) adapter for neural reranking. This lightweight adapter can be applied to LLaMA-3.1-8B to create a pointwise reranker with minimal storage overhead. It achieves performance comparable to the full fine-tuned model while requiring only ~100MB of storage.

## Model Details

- **Model Type:** LoRA Adapter for Pointwise Reranking
- **Base Model:** meta-llama/Llama-3.1-8B
- **Adapter Size:** ~100MB (vs 16GB for the full model)
- **Training Method:** LoRA with RankNet Loss + Knowledge Distillation
- **LoRA Rank:** 16
- **LoRA Alpha:** 32
- **Target Modules:** q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj

## Key Features

✅ **Lightweight:** Only ~100MB vs 16GB for the full model
✅ **Efficient Training:** Trains ~3x faster than full fine-tuning
✅ **Easy Deployment:** Just load the adapter on top of the base model
✅ **Comparable Performance:** ~98% of full-model performance
✅ **Memory Efficient:** Lower GPU memory usage during training

## Usage

### Option 1: Load with PEFT (Recommended)

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel, PeftConfig

# Load LoRA adapter
adapter_path = "abdoelsayed/dear-8b-reranker-ranknet-lora-v1"

# Get base model from adapter config
config = PeftConfig.from_pretrained(adapter_path)
base_model_name = config.base_model_name_or_path

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Load base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=1,
    torch_dtype=torch.bfloat16
)
base_model.config.pad_token_id = tokenizer.pad_token_id  # needed for correct pooling with padded inputs

# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload()  # Merge adapter into base model

model.eval().cuda()

# Use the model
query = "What is machine learning?"
document = "Machine learning is a subset of artificial intelligence..."

inputs = tokenizer(
    f"query: {query}",
    f"document: {document}",
    return_tensors="pt",
    truncation=True,
    max_length=228,
    padding="max_length"
)
inputs = {k: v.cuda() for k, v in inputs.items()}

with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()

print(f"Relevance score: {score}")
```

### Option 2: Use Helper Function

```python
import torch
from typing import List, Tuple
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel, PeftConfig


def load_lora_ranker(adapter_path: str, device: str = "cuda"):
    """Load LoRA adapter and merge with base model."""
    # Get base model path from adapter config
    peft_config = PeftConfig.from_pretrained(adapter_path)
    base_model_name = peft_config.base_model_name_or_path

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id
    tokenizer.padding_side = "right"

    # Load base model
    base_model = AutoModelForSequenceClassification.from_pretrained(
        base_model_name,
        num_labels=1,
        torch_dtype=torch.bfloat16
    )
    base_model.config.pad_token_id = tokenizer.pad_token_id  # required for batched (padded) inference

    # Load LoRA adapter and merge
    model = PeftModel.from_pretrained(base_model, adapter_path)
    model = model.merge_and_unload()

    model.eval().to(device)
    return tokenizer, model


# Load model
tokenizer, model = load_lora_ranker("abdoelsayed/dear-8b-reranker-ranknet-lora-v1")


# Rerank documents
@torch.inference_mode()
def rerank(tokenizer, model, query: str, docs: List[Tuple[str, str]], batch_size: int = 64):
    """Rerank documents for a query."""
    device = next(model.parameters()).device
    scores = []

    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        queries = [f"query: {query}"] * len(batch)
        documents = [f"document: {title} {text}" for title, text in batch]

        inputs = tokenizer(
            queries,
            documents,
            return_tensors="pt",
            truncation=True,
            max_length=228,
            padding=True
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}

        logits = model(**inputs).logits.squeeze(-1)
        scores.extend(logits.cpu().tolist())

    return sorted(enumerate(scores), key=lambda x: x[1], reverse=True)


# Example
query = "When did Thomas Edison invent the light bulb?"
docs = [
    ("", "Thomas Edison invented the light bulb in 1879"),
    ("", "Coffee is good for diet"),
    ("", "Lightning strike at Seoul"),
]

ranking = rerank(tokenizer, model, query, docs)
print(ranking)  # (index, score) pairs sorted best-first, e.g. [(0, 5.2), (2, -3.1), (1, -4.8)]
```

### Using Without Merging (Memory Efficient)

```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSequenceClassification

adapter_path = "abdoelsayed/dear-8b-reranker-ranknet-lora-v1"
config = PeftConfig.from_pretrained(adapter_path)

# Load base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    config.base_model_name_or_path,
    num_labels=1,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load adapter (without merging)
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()

# Use model (adapter layers will be applied automatically)
# ... same inference code as above ...
```
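
A side benefit of keeping the adapter unmerged is that PEFT can temporarily bypass it, which makes it easy to sanity-check how much the adapter shifts a score relative to the raw base model. A minimal sketch, assuming `model` and `config` come from the unmerged block above and reusing the Option 1 query/document strings as placeholders:

```python
import torch
from transformers import AutoTokenizer

# Tokenizer loaded the same way as in Option 1
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer(
    "query: What is machine learning?",
    "document: Machine learning is a subset of artificial intelligence...",
    return_tensors="pt",
    truncation=True,
    max_length=228,
)
device = next(model.parameters()).device
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    adapted_score = model(**inputs).logits.squeeze().item()
    # Temporarily bypass the LoRA layers to score with the raw base model
    with model.disable_adapter():
        base_score = model(**inputs).logits.squeeze().item()

print(f"with adapter: {adapted_score:.3f}, without adapter: {base_score:.3f}")
```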

## Performance

| Benchmark | LoRA | Full Model | Difference |
|-----------|------|------------|------------|
| TREC DL19 | 74.2 | 74.5 | -0.3 |
| TREC DL20 | 72.5 | 72.8 | -0.3 |
| BEIR (Avg) | 44.9 | 45.2 | -0.3 |
| MS MARCO | 68.6 | 68.9 | -0.3 |

✅ **98% of full-model performance with only 0.6% of the storage!**

## Training Details

### LoRA Configuration

```python
lora_config = {
    "r": 16,                # LoRA rank
    "lora_alpha": 32,       # Scaling factor
    "target_modules": [
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "SEQ_CLS"
}
```
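
These values can also be read straight from the published adapter rather than copied from this card; a quick sanity check with PEFT's config loader (the expected values in the comments are the ones listed above):

```python
from peft import PeftConfig

# Fetch the adapter's stored configuration from the Hub and inspect it
cfg = PeftConfig.from_pretrained("abdoelsayed/dear-8b-reranker-ranknet-lora-v1")
print(cfg.r, cfg.lora_alpha)       # expected: 16 32
print(sorted(cfg.target_modules))  # expected: the seven projection modules listed above
print(cfg.lora_dropout, cfg.task_type)
```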

### Training Hyperparameters

```python
training_args = {
    "learning_rate": 1e-4,        # Higher than for full fine-tuning
    "batch_size": 4,              # Larger batches possible due to lower memory
    "gradient_accumulation": 2,
    "epochs": 2,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
    "max_length": 228,
    "bf16": True
}
```
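
The adapter was trained with a RankNet loss plus knowledge distillation from the 13B teacher listed under Resources. The released training code is not reproduced in this card; the sketch below only illustrates one common way to combine the two objectives, using the teacher's pairwise score differences as soft targets for the student's pairwise differences. The function name and tensor shapes are assumptions for illustration, not the actual DeAR training code.

```python
import torch
import torch.nn.functional as F

def ranknet_distill_loss(student_scores: torch.Tensor,
                         teacher_scores: torch.Tensor) -> torch.Tensor:
    """RankNet-style pairwise loss with soft targets from a teacher.

    Both tensors have shape (batch, n_docs): one relevance score per
    query-document pair. For every document pair (i, j), the student's
    score difference is pushed toward the teacher's pairwise preference.
    """
    # Pairwise score differences: (batch, n_docs, n_docs)
    s_diff = student_scores.unsqueeze(-1) - student_scores.unsqueeze(-2)
    t_diff = teacher_scores.unsqueeze(-1) - teacher_scores.unsqueeze(-2)

    # Teacher preference P(doc_i > doc_j), RankNet-style soft label
    soft_targets = torch.sigmoid(t_diff)

    # Cross-entropy between student pair logits and teacher preferences
    return F.binary_cross_entropy_with_logits(s_diff, soft_targets)

# Example with random scores for 2 queries x 4 candidate documents
student = torch.randn(2, 4)
teacher = torch.randn(2, 4)
print(ranknet_distill_loss(student, teacher))
```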

### Hardware

- **GPUs:** 4x NVIDIA A100 (40GB)
- **Training Time:** ~12 hours (~3x faster than the full model)
- **Memory Usage:** ~28GB per GPU (vs ~38GB for full fine-tuning)
- **Trainable Parameters:** 67M (0.8% of total)

## Advantages of LoRA Version

| Aspect | LoRA | Full Model |
|--------|------|------------|
| Storage | 100MB | 16GB |
| Training Time | 12h | 36h |
| Training Memory | 28GB | 38GB |
| Performance | 98% | 100% |
| Loading Time | Fast | Slow |
| Easy Updates | ✅ Yes | ❌ No |

## When to Use LoRA vs Full Model

**Use LoRA when:**
- ✅ Storage is limited
- ✅ Training multiple domain-specific versions
- ✅ You need fast iteration/experimentation
- ✅ A 0.3 NDCG@10 difference is acceptable

**Use Full Model when:**
- ❌ Maximum performance is required
- ❌ Storage is not a concern
- ❌ You have a single production deployment

## Fine-tuning on Your Data

```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Load base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    num_labels=1
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    # Add gate_proj, up_proj, down_proj to match this adapter's full configuration
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)

# Apply LoRA
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# e.g.: trainable params: ~67M || all params: ~8B || trainable%: ~0.8%

# Train
training_args = TrainingArguments(
    output_dir="./lora-finetuned",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,  # your tokenized query-document dataset
)

trainer.train()

# Save only the LoRA adapter
model.save_pretrained("./lora-adapter")
```
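
The directory written by `save_pretrained` can then be loaded back onto a fresh base model exactly like the published adapter in Option 1; a minimal sketch using the local path from the snippet above:

```python
import torch
from transformers import AutoModelForSequenceClassification
from peft import PeftModel

base = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    num_labels=1,
    torch_dtype=torch.bfloat16,
)
# Load the locally saved adapter and merge it for inference
reloaded = PeftModel.from_pretrained(base, "./lora-adapter")
reloaded = reloaded.merge_and_unload()
```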

## Model Files

This adapter contains:
- `adapter_config.json` - LoRA configuration
- `adapter_model.safetensors` or `adapter_model.bin` - Adapter weights (~100MB)
- `README.md` - This documentation

## Related Models

**Full Model:**
- [DeAR-8B-RankNet](https://huggingface.co/abdoelsayed/dear-8b-reranker-ranknet-v1) - Full fine-tuned version

**Other LoRA Adapters:**
- [DeAR-8B-CE-LoRA](https://huggingface.co/abdoelsayed/dear-8b-reranker-ce-lora-v1) - Binary Cross-Entropy
- [DeAR-8B-Listwise-LoRA](https://huggingface.co/abdoelsayed/dear-8b-reranker-listwise-lora-v1) - Listwise ranking

**Resources:**
- [DeAR-COT Dataset](https://huggingface.co/datasets/abdoelsayed/DeAR-COT)
- [Teacher Model](https://huggingface.co/abdoelsayed/llama2-13b-rankllama-teacher)

## Citation

```bibtex
@article{abdallah2025dear,
  title={DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation},
  author={Abdallah, Abdelrahman and Mozafari, Jamshid and Piryani, Bhawna and Jatowt, Adam},
  journal={arXiv preprint arXiv:2508.16998},
  year={2025}
}
```

## License

MIT License

## More Information

- **GitHub:** [DataScienceUIBK/DeAR-Reranking](https://github.com/DataScienceUIBK/DeAR-Reranking)
- **Paper:** [arXiv:2508.16998](https://arxiv.org/abs/2508.16998)
- **Collection:** [DeAR Models](https://huggingface.co/collections/abdoelsayed/dear-reranking)