rafiaa committed
Commit 94ab61f · verified · 1 Parent(s): f6f00cd

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +255 -150
README.md CHANGED
@@ -1,202 +1,307 @@
  ---
- library_name: transformers
  tags:
- - unsloth
  ---

- # Model Card for Model ID

- <!-- Provide a quick summary of what the model is/does. -->

- ## Model Details

- ### Model Description

- <!-- Provide a longer summary of what this model is. -->

- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]

- ### Model Sources [optional]

- <!-- Provide the basic links for the model. -->

- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]

  ## Uses

- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

  ### Direct Use

- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

- [More Information Needed]

- ### Downstream Use [optional]

- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

- [More Information Needed]

- ### Out-of-Scope Use

- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

- [More Information Needed]

- ## Bias, Risks, and Limitations

- <!-- This section is meant to convey both technical and sociotechnical limitations. -->

- [More Information Needed]

- ### Recommendations

- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

- ## How to Get Started with the Model

- Use the code below to get started with the model.

- [More Information Needed]

  ## Training Details

  ### Training Data

- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- [More Information Needed]

- ### Training Procedure

- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

- #### Preprocessing [optional]

- [More Information Needed]

- #### Training Hyperparameters

- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

- #### Speeds, Sizes, Times [optional]

- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

- [More Information Needed]

- ## Evaluation

- <!-- This section describes the evaluation protocols and provides the results. -->

- ### Testing Data, Factors & Metrics

- #### Testing Data

- <!-- This should link to a Dataset Card if possible. -->

- [More Information Needed]

- #### Factors

- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

- [More Information Needed]

- #### Metrics

- <!-- These are the evaluation metrics being used, ideally with a description of why. -->

- [More Information Needed]

- ### Results

- [More Information Needed]

- #### Summary

- ## Model Examination [optional]

- <!-- Relevant interpretability work for the model goes here -->

- [More Information Needed]

  ## Environmental Impact

- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]

- ## Technical Specifications [optional]

- ### Model Architecture and Objective

- [More Information Needed]

- ### Compute Infrastructure

- [More Information Needed]

- #### Hardware

- [More Information Needed]

- #### Software

- [More Information Needed]

- ## Citation [optional]

- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

- **BibTeX:**

- [More Information Needed]

- **APA:**

- [More Information Needed]

- ## Glossary [optional]

- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

- [More Information Needed]

- ## More Information [optional]

- [More Information Needed]

- ## Model Card Authors [optional]

- [More Information Needed]

  ## Model Card Contact

- [More Information Needed]
 
  ---
+ library_name: peft
+ base_model: mistralai/Mistral-7B-Instruct-v0.1
  tags:
+ - legal
+ - legal-text
+ - passive-to-active
+ - voice-transformation
+ - legal-nlp
+ - text-simplification
+ - legal-documents
+ - sentence-transformation
+ - lora
+ - qlora
+ - peft
+ - mistral
+ - natural-language-processing
+ - legal-language
+ license: apache-2.0
+ language:
+ - en
+ pipeline_tag: text-generation
  ---

+ # legal-passive-to-active-mistral-7b

+ **RECOMMENDED MODEL** - A LoRA fine-tuned model for transforming legal text from passive voice to active voice, built on Mistral-7B-Instruct-v0.1. Of the two released variants, this is the stronger one: it simplifies complex legal language while preserving semantic accuracy and legal precision.

+ ## Model Description

+ This is the enhanced variant in this project for legal passive-to-active transformation. Built on Mistral-7B-Instruct-v0.1, it outperforms both the base model and the companion Llama-2 variant on this task. It was fine-tuned on a curated dataset of 319 legal sentences from authoritative sources, including UN documents, the GDPR, the Fair Work Act, and insurance regulations.

+ ### Key Features

+ - **Superior Performance**: ~15% improvement over the base model in human evaluation
+ - **Legal Text Simplification**: Converts passive voice to active voice in legal documents
+ - **Domain-Specific**: Fine-tuned on authentic legal text from multiple jurisdictions
+ - **Efficient Training**: Uses QLoRA for memory-efficient fine-tuning
+ - **Semantic Preservation**: Maintains legal meaning while simplifying sentence structure
+ - **Accessibility**: Makes legal documents more readable and accessible

+ ## Model Details

+ - **Developed by**: Rafi Al Attrach
+ - **Model type**: LoRA adapter for Mistral-7B-Instruct (PEFT)
+ - **Language(s)**: English
+ - **License**: Apache 2.0
+ - **Finetuned from**: [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)
+ - **Training method**: QLoRA (4-bit quantization + LoRA)
+ - **Research Focus**: Legal text simplification and accessibility (2024)

+ ### Technical Specifications

+ - **Base Model**: Mistral-7B-Instruct-v0.1
+ - **LoRA Rank**: 64
+ - **Training Samples**: 319 legal sentences
+ - **Data Sources**: UN legal documents, GDPR, Fair Work Act, insurance regulations
+ - **Evaluation**: BERTScore metrics and human evaluation
+ - **Performance**: ~15% improvement over the base model in human evaluation

  ## Uses

  ### Direct Use

+ This model is designed for:
+ - **Legal document simplification**: Converting passive legal text to active voice
+ - **Accessibility improvement**: Making legal documents more readable
+ - **Legal writing assistance**: Helping legal professionals write clearer documents
+ - **Educational purposes**: Teaching legal language transformation
+ - **Document processing**: Batch processing of legal texts
+ - **Regulatory compliance**: Simplifying complex regulatory language

+ ### Example Use Cases

+ ```python
+ # Transform a legal passive sentence to active voice
+ passive_sentence = "The contract shall be executed by both parties within 30 days."
+ # Model output: "Both parties shall execute the contract within 30 days."
+ ```

+ ```python
+ # Simplify GDPR text
+ passive_sentence = "Personal data may be processed by the controller for legitimate interests."
+ # Model output: "The controller may process personal data for legitimate interests."
+ ```

+ ```python
+ # Transform UN legal text
+ passive_sentence = "All necessary measures shall be taken by Member States to ensure compliance."
+ # Model output: "Member States shall take all necessary measures to ensure compliance."
+ ```

+ ## How to Get Started

+ ### Installation

+ ```bash
+ pip install transformers torch peft accelerate bitsandbytes
+ ```

+ ### Loading the Model

+ #### GPU Usage (Recommended)
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ from peft import PeftModel
+ import torch
+
+ # Load the base model with 4-bit quantization
+ base_model = "mistralai/Mistral-7B-Instruct-v0.1"
+ model = AutoModelForCausalLM.from_pretrained(
+     base_model,
+     load_in_4bit=True,
+     torch_dtype=torch.float16,
+     device_map="auto"
+ )
+
+ # Load the LoRA adapter on top of the base model
+ model = PeftModel.from_pretrained(model, "rafiaa/legal-passive-to-active-mistral-7b")
+ tokenizer = AutoTokenizer.from_pretrained(base_model)
+
+ # Set the pad token if the tokenizer does not define one
+ if tokenizer.pad_token is None:
+     tokenizer.pad_token = tokenizer.eos_token
+ ```
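
On recent versions of transformers, passing `load_in_4bit=True` directly to `from_pretrained` is deprecated in favour of an explicit `BitsAndBytesConfig`. A minimal sketch of the equivalent quantized load; the NF4 and compute-dtype settings below are assumptions, not the configuration used for training:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# 4-bit NF4 quantization config (assumed settings; adjust to your hardware)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

base_model = "mistralai/Mistral-7B-Instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "rafiaa/legal-passive-to-active-mistral-7b")
tokenizer = AutoTokenizer.from_pretrained(base_model)
```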

+ #### CPU Usage (Alternative)
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ from peft import PeftModel
+ import torch
+
+ # Load the base model in full precision on CPU
+ base_model = "mistralai/Mistral-7B-Instruct-v0.1"
+ model = AutoModelForCausalLM.from_pretrained(
+     base_model,
+     torch_dtype=torch.float32,
+     device_map="cpu"
+ )
+
+ # Load the LoRA adapter
+ model = PeftModel.from_pretrained(model, "rafiaa/legal-passive-to-active-mistral-7b")
+ tokenizer = AutoTokenizer.from_pretrained(base_model)
+
+ # Set the pad token if the tokenizer does not define one
+ if tokenizer.pad_token is None:
+     tokenizer.pad_token = tokenizer.eos_token
+ ```
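
If you want to serve the model without a PEFT dependency at inference time, the adapter can be folded into the base weights. This is a minimal sketch using PEFT's `merge_and_unload()`; it assumes the base model is loaded in half precision rather than 4-bit, and the save path is illustrative:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model in half precision (merging into quantized weights is not shown here)
base_model = "mistralai/Mistral-7B-Instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, "rafiaa/legal-passive-to-active-mistral-7b")

# Fold the LoRA weights into the base model and save a standalone checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("legal-passive-to-active-mistral-7b-merged")  # illustrative path

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.save_pretrained("legal-passive-to-active-mistral-7b-merged")
```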

+ ### Usage Example

+ ```python
+ def transform_passive_to_active(passive_sentence, max_length=512):
+     # Create the instruction prompt
+     instruction = """You are a legal text transformation expert. Your task is to convert passive voice sentences to active voice while maintaining the exact legal meaning and terminology.
+
+ Input: Transform the following legal sentence from passive to active voice.
+
+ Legal Sentence: """
+
+     prompt = instruction + passive_sentence
+     inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+     with torch.no_grad():
+         outputs = model.generate(
+             **inputs,
+             max_length=max_length,
+             temperature=0.7,
+             do_sample=True,
+             pad_token_id=tokenizer.eos_token_id
+         )
+
+     # Decode only the newly generated tokens, not the prompt
+     generated = outputs[0][inputs["input_ids"].shape[1]:]
+     return tokenizer.decode(generated, skip_special_tokens=True)
+
+ # Example usage
+ passive = "The agreement shall be signed by the authorized representatives."
+ active = transform_passive_to_active(passive)
+ print(active)
+ ```

+ ### Advanced Usage

+ ```python
+ # Batch processing of multiple legal sentences
+ legal_sentences = [
+     "The policy was established by the board of directors.",
+     "All documents must be reviewed by legal counsel.",
+     "The regulations were enacted by Parliament."
+ ]
+
+ for sentence in legal_sentences:
+     transformed = transform_passive_to_active(sentence)
+     print(f"Passive: {sentence}")
+     print(f"Active: {transformed}\n")
+ ```

  ## Training Details

  ### Training Data

+ - **Dataset Size**: 319 legal sentences
+ - **Source Documents**:
+   - United Nations legal documents
+   - General Data Protection Regulation (GDPR)
+   - Fair Work Act (Australia)
+   - Insurance Council of Australia regulations
+ - **Data Split**: 85% training, 15% testing (15% of the training set held out for validation)
+ - **Domain**: Legal text across multiple jurisdictions
+ - **Format**: Alpaca-format instruction records (see the example record below)
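
For illustration, an Alpaca-format record pairs an instruction, an input, and the target output. The field contents below reuse the example sentence from earlier in this card and are illustrative, not an actual sample from the dataset:

```python
# Illustrative Alpaca-format training record (not an actual dataset sample)
example_record = {
    "instruction": "Transform the following legal sentence from passive to active voice.",
    "input": "The contract shall be executed by both parties within 30 days.",
    "output": "Both parties shall execute the contract within 30 days.",
}
```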

+ ### Training Procedure

+ - **Method**: QLoRA (4-bit quantization + LoRA)
+ - **LoRA Configuration**: Rank 64, Alpha 16 (see the configuration sketch below)
+ - **Library**: unsloth (2.2x faster training, 62% less VRAM for Mistral)
+ - **Hardware**: Tesla T4 GPU (Google Colab)
+ - **Training Loss**: Validation loss trended downward throughout training, indicating good generalization
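The training script itself is not included in this repository. The following is a minimal sketch of a QLoRA setup with the stated rank and alpha using the `peft` and `transformers` APIs; the original work used unsloth, and the target modules, dropout, and other settings shown here are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit (QLoRA)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA configuration matching the card: rank 64, alpha 16
# (target_modules and dropout are assumptions, not documented values)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```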
 
+ ### Evaluation Metrics

+ - **BERTScore**: Semantic-similarity evaluation (Precision, Recall, F1; see the snippet below)
+ - **Human Evaluation**: Binary correctness assessment by legal evaluators
+ - **Performance Improvement**: ~15% increase over base Mistral model
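
BERTScore can be computed with the `bert-score` package. A minimal sketch of how per-sentence precision, recall, and F1 might be obtained; the prediction/reference pair below is illustrative:

```python
# pip install bert-score
from bert_score import score

predictions = ["Both parties shall execute the contract within 30 days."]
references = ["Both parties shall execute the contract within 30 days."]

# Returns per-sentence precision, recall, and F1 tensors
P, R, F1 = score(predictions, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```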

+ ## Performance Comparison

+ | Model | Human Eval Score | BERTScore F1 | Performance |
+ |-------|------------------|--------------|-------------|
+ | Mistral-7B Base | Baseline | High | Good |
+ | **legal-passive-to-active-mistral-7b** | +15% | Higher | Excellent |
+ | legal-passive-to-active-llama2-7b | +6% | High | Good |

+ Among the models evaluated in this project, this model gives the strongest results for legal passive-to-active transformation.

+ ## Strengths and Characteristics

+ ### Model Strengths
+ - **High accuracy** in passive-to-active transformations
+ - **Semantic preservation**: maintains legal meaning
+ - **Better generalization** than the Llama-2 variant
+ - **Responsive to prompts**: adapts well to instruction modifications
+ - **Vocabulary diversity**: uses appropriate legal terminology

+ ### Notable Behaviors
+ - Occasionally substitutes words with synonyms (a trade-off for flexibility)
+ - Better precision than the base model after fine-tuning
+ - Strong performance on complex legal constructions

+ ## Limitations and Bias

+ ### Known Limitations

+ - **Word Position Sensitivity**: Struggles with sentences where word position significantly alters meaning
+ - **Dataset Size**: Limited to 319 training samples
+ - **Non-Determinism**: Sampled outputs may vary between runs
+ - **Domain Coverage**: Primarily trained on English common-law and EU legal documents
+ - **Synonym Substitution**: May occasionally use synonyms instead of the exact original words

+ ### Recommendations

+ - Validate transformed sentences for legal accuracy before use
+ - Use human review for critical legal documents
+ - Consider context and jurisdiction when applying transformations
+ - Test with domain-specific legal texts for best results
+ - Review outputs for unintended synonym substitutions in critical documents (a simple check is sketched below)
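
One lightweight way to flag the synonym substitutions mentioned above is to compare the content words of the input and output. The helper below is a hypothetical post-check, not part of the model or its evaluation pipeline, and it is a crude heuristic rather than a legal-accuracy check:

```python
import re

def unexpected_words(passive: str, active: str) -> set[str]:
    # Words in the output that never appear in the source sentence
    # (inflection changes such as "signed" -> "sign" will also show up).
    stopwords = {"the", "a", "an", "by", "of", "to", "in", "for", "and", "or", "shall", "be", "is", "are"}
    src = set(re.findall(r"[a-z']+", passive.lower())) - stopwords
    out = set(re.findall(r"[a-z']+", active.lower())) - stopwords
    return out - src

passive = "The agreement shall be signed by the authorized representatives."
active = "The authorized representatives shall sign the agreement."
print(unexpected_words(passive, active))  # {'sign'}
```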

  ## Environmental Impact

+ - **Training Method**: QLoRA cuts GPU memory requirements by roughly 62% for Mistral fine-tuning
+ - **Hardware**: Efficient training using 4-bit quantization on a single Tesla T4
+ - **Carbon Footprint**: Significantly reduced compared to full fine-tuning

+ ## Citation

+ If you use this model in your research, please cite:

+ ```bibtex
+ @misc{legal-passive-active-mistral,
+   title={legal-passive-to-active-mistral-7b: An Enhanced LoRA Fine-tuned Model for Legal Voice Transformation},
+   author={Rafi Al Attrach},
+   year={2024},
+   url={https://huggingface.co/rafiaa/legal-passive-to-active-mistral-7b}
+ }
+ ```

+ ## Related Models

+ - **Base Model**: [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)
+ - **Alternative**: [rafiaa/legal-passive-to-active-llama2-7b](https://huggingface.co/rafiaa/legal-passive-to-active-llama2-7b)
+ - **This Model**: [rafiaa/legal-passive-to-active-mistral-7b](https://huggingface.co/rafiaa/legal-passive-to-active-mistral-7b) (recommended)

+ ## Model Card Contact

+ - **Author**: Rafi Al Attrach
+ - **Model Repository**: [Hugging Face model page](https://huggingface.co/rafiaa/legal-passive-to-active-mistral-7b)
+ - **Issues**: Please report issues through the Hugging Face model page

+ ## Acknowledgments

+ - **Research Project**: Legal text simplification and accessibility research (2024)
+ - **Training Data**: Public legal documents and regulations
+ - **Base Model**: Mistral AI's Mistral-7B-Instruct-v0.1
+ - **Training Method**: QLoRA for efficient fine-tuning

+ ---

+ *This model is the result of research in legal text simplification and accessibility, focused on passive-to-active voice transformation for legal documents.*