---
library_name: peft
base_model: meta-llama/Llama-2-7b-chat-hf
tags:
- legal
- legal-text
- passive-to-active
- voice-transformation
- legal-nlp
- text-simplification
- legal-documents
- sentence-transformation
- lora
- qlora
- peft
- llama-2
- natural-language-processing
- legal-language
license: apache-2.0
language:
- en
pipeline_tag: text-generation
---
|
|
|
|
|
# legal-passive-to-active-llama-7b |
|
|
|
|
|
A specialized LoRA fine-tuned model for transforming legal text from passive voice to active voice, built on Llama-2-7b-Chat. This model simplifies complex legal language while maintaining semantic accuracy and legal precision. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model is a LoRA (Low-Rank Adaptation) fine-tune of Llama-2-7b-chat-hf, optimized for passive-to-active voice transformation in legal documents. It was trained on a curated dataset of 319 legal sentences drawn from authoritative sources, including UN documents, the GDPR, the Fair Work Act, and insurance regulations, so that it learns legal syntax, passive constructions, and voice-transformation patterns.
|
|
|
|
|
### Key Features |
|
|
|
|
|
- **Legal Text Simplification**: Converts passive voice to active voice in legal documents |
|
|
- **Domain-Specific**: Fine-tuned on authentic legal text from multiple jurisdictions |
|
|
- **Efficient Training**: Uses QLoRA for memory-efficient fine-tuning |
|
|
- **Semantic Preservation**: Maintains legal meaning while simplifying sentence structure |
|
|
- **Accessibility**: Makes legal documents more readable and accessible |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Developed by**: Rafi Al Attrach |
|
|
- **Model type**: LoRA fine-tuned Llama-2 |
|
|
- **Language(s)**: English |
|
|
- **License**: Apache 2.0 |
|
|
- **Finetuned from**: [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) |
|
|
- **Training method**: QLoRA (4-bit quantization + LoRA) |
|
|
- **Research Focus**: Legal text simplification and accessibility (2024) |
|
|
|
|
|
### Technical Specifications |
|
|
|
|
|
- **Base Model**: Llama-2-7b-chat-hf
|
|
- **LoRA Rank**: 64 |
|
|
- **Training Samples**: 319 legal sentences |
|
|
- **Data Sources**: UN legal documents, GDPR, Fair Work Act, Insurance regulations |
|
|
- **Evaluation**: BERTScore metrics and human evaluation |
|
|
- **Performance**: ~6% improvement over base model in human evaluation |
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
This model is designed for: |
|
|
- **Legal document simplification**: Converting passive legal text to active voice |
|
|
- **Accessibility improvement**: Making legal documents more readable |
|
|
- **Legal writing assistance**: Helping legal professionals write clearer documents |
|
|
- **Educational purposes**: Teaching legal language transformation |
|
|
- **Document processing**: Batch processing of legal texts |
|
|
|
|
|
### Example Use Cases |
|
|
|
|
|
```python
# Transform a legal passive sentence to active voice
passive_sentence = "The contract shall be executed by both parties within 30 days."
# Model output: "Both parties shall execute the contract within 30 days."
```
|
|
|
|
|
```python
# Simplify GDPR text
passive_sentence = "Personal data may be processed by the controller for legitimate interests."
# Model output: "The controller may process personal data for legitimate interests."
```
|
|
|
|
|
## How to Get Started |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash
pip install transformers torch peft accelerate bitsandbytes
```
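
The base Llama-2 weights are gated on the Hugging Face Hub, so you may need to accept Meta's license on the [base model page](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) and authenticate before downloading:

```bash
huggingface-cli login
```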
|
|
|
|
|
### Loading the Model |
|
|
|
|
|
#### GPU Usage (Recommended) |
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load base model with 4-bit quantization
base_model = "meta-llama/Llama-2-7b-chat-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)

# Load LoRA adapter
model = PeftModel.from_pretrained(model, "rafiaa/legal-passive-to-active-llama-7b")
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Set pad token for generation
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```
|
|
|
|
|
#### CPU Usage (Alternative) |
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model (CPU compatible)
base_model = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float32,
    device_map="cpu",
)

# Load LoRA adapter
model = PeftModel.from_pretrained(model, "rafiaa/legal-passive-to-active-llama-7b")
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Set pad token for generation
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```
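
If you prefer to deploy without the PEFT wrapper, the adapter can be merged into the base weights. A minimal sketch (merging requires unquantized weights, so it assumes the CPU loading path above; the output directory name is illustrative):

```python
# Merge the LoRA weights into the base model and drop the PEFT wrapper
merged_model = model.merge_and_unload()

# Save the merged model and tokenizer to a local directory (name is illustrative)
merged_model.save_pretrained("legal-passive-to-active-merged")
tokenizer.save_pretrained("legal-passive-to-active-merged")
```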
|
|
|
|
|
### Usage Example |
|
|
|
|
|
```python
def transform_passive_to_active(passive_sentence, max_new_tokens=128):
    # Create instruction prompt
    instruction = """You are a legal text transformation expert. Your task is to convert passive voice sentences to active voice while maintaining the exact legal meaning and terminology.

Input: Transform the following legal sentence from passive to active voice.

Legal Sentence: """

    prompt = instruction + passive_sentence
    # Move inputs onto the model's device (GPU when loaded with device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens, skipping the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Example usage
passive = "The agreement shall be signed by the authorized representatives."
active = transform_passive_to_active(passive)
print(active)
```
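
For the batch document-processing use case listed under Direct Use, a minimal sketch built on the helper above (the sentences are illustrative):

```python
passive_sentences = [
    "The notice shall be provided by the employer.",
    "The claim must be lodged by the policyholder within 28 days.",
]

# Transform each sentence in turn and print the active-voice rewrite
for sentence in passive_sentences:
    print(transform_passive_to_active(sentence))
```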
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
- **Dataset Size**: 319 legal sentences |
|
|
- **Source Documents**: |
|
|
- United Nations legal documents |
|
|
- General Data Protection Regulation (GDPR) |
|
|
- Fair Work Act (Australia) |
|
|
- Insurance Council of Australia regulations |
|
|
- **Data Split**: 85% training, 15% testing, with 15% of the training portion held out for validation (see the sketch after this list)
|
|
- **Domain**: Legal text across multiple jurisdictions |
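
A minimal sketch of the nested split described above, assuming `sentences` holds the 319 annotated examples (the seed and the use of scikit-learn are assumptions; the card does not specify the tooling):

```python
from sklearn.model_selection import train_test_split

# 85% train / 15% test, then 15% of the training portion held out for validation
train_val, test = train_test_split(sentences, test_size=0.15, random_state=42)
train, val = train_test_split(train_val, test_size=0.15, random_state=42)
```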
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
- **Method**: QLoRA (4-bit quantization + LoRA) |
|
|
- **LoRA Configuration**: Rank 64, Alpha 16 (see the configuration sketch after this list)
|
|
- **Library**: unsloth (2.2x faster, 43% less VRAM) |
|
|
- **Hardware**: Tesla T4 GPU (Google Colab) |
|
|
- **Loss Curves**: Validation loss trended downward during training, indicating good generalization
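
A minimal `peft` reconstruction of the reported adapter configuration (the target modules and dropout are assumptions; the card only specifies rank and alpha):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,           # reported LoRA rank
    lora_alpha=16,  # reported LoRA alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    lora_dropout=0.05,  # assumption
    bias="none",
    task_type="CAUSAL_LM",
)
```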
|
|
|
|
|
### Evaluation Metrics |
|
|
|
|
|
- **BERTScore**: Semantic similarity evaluation (see the sketch after this list)
|
|
- **Human Evaluation**: Binary correctness assessment by legal evaluators |
|
|
- **Performance Improvement**: ~6% increase over base Llama-2 model |
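
A minimal sketch of the BERTScore computation with the `bert-score` package (the sentence pair is illustrative, not drawn from the actual test set):

```python
from bert_score import score  # pip install bert-score

candidates = ["Both parties shall execute the contract within 30 days."]  # model outputs
references = ["Both parties shall execute the contract within 30 days."]  # gold rewrites
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```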
|
|
|
|
|
## Performance |
|
|
|
|
|
The model was evaluated using both automatic metrics (BERTScore precision, recall, and F1) and human evaluation:
|
|
|
|
|
- **BERTScore F1**: High semantic similarity preservation |
|
|
- **Human Evaluation**: ~6% improvement over base model |
|
|
- **Strengths**: Good transformation of standard passive constructions |
|
|
- **Challenges**: Complex sentences with nuanced word placement (e.g., "only") |
|
|
|
|
|
## Limitations and Bias |
|
|
|
|
|
### Known Limitations |
|
|
|
|
|
- **Word Position Sensitivity**: Struggles with sentences where word position significantly alters meaning |
|
|
- **Dataset Size**: Limited to 319 training samples |
|
|
- **Non-Determinism**: Outputs may vary between runs when sampling is enabled (as in the generation example above)
|
|
- **Domain Coverage**: Primarily trained on English common law and EU legal documents |
|
|
- **'By' Constructions**: Occasionally struggles with sentences containing 'by', which marks the agent in passive constructions
|
|
|
|
|
### Recommendations |
|
|
|
|
|
- Validate transformed sentences for legal accuracy before use |
|
|
- Use human review for critical legal documents |
|
|
- Consider context and jurisdiction when applying transformations |
|
|
- Test with domain-specific legal texts for best results |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in your research, please cite: |
|
|
|
|
|
```bibtex
@misc{legal-passive-active-llama2,
  title={legal-passive-to-active-llama-7b: A LoRA Fine-tuned Model for Legal Voice Transformation},
  author={Rafi Al Attrach},
  year={2024},
  url={https://huggingface.co/rafiaa/legal-passive-to-active-llama-7b}
}
```
|
|
|
|
|
## Related Models |
|
|
|
|
|
- **Base Model**: [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) |
|
|
- **Enhanced Version**: [rafiaa/legal-passive-to-active-mistral-7b](https://huggingface.co/rafiaa/legal-passive-to-active-mistral-7b) (recommended; better performance)
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
- **Author**: Rafi Al Attrach |
|
|
- **Model Repository**: [HuggingFace Model](https://huggingface.co/rafiaa/legal-passive-to-active-llama-7b) |
|
|
- **Issues**: Please report issues through the HuggingFace model page |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- **Research Project**: Legal text simplification and accessibility research (2024) |
|
|
- **Training Data**: Public legal documents and regulations |
|
|
- **Base Model**: Meta's Llama-2-7b-Chat-hf |
|
|
|
|
|
--- |
|
|
|
|
|
*This model is part of a research project on legal text simplification and accessibility, focusing on passive-to-active voice transformation in legal documents.* |
|
|
|