Llama-3.2-3B Medical NER LoRA
A fine-tuned medical Named Entity Recognition (NER) model based on Llama-3.2-3B-Instruct, using LoRA (Low-Rank Adaptation) for parameter-efficient tuning. The model is specialized for extracting medical entities and relationships from biomedical texts.
Model Details
Model Description
This model fine-tunes Llama-3.2-3B-Instruct for medical Named Entity Recognition across three specialized tasks:
- Chemical Extraction: Identifies drug and chemical compound names
- Disease Extraction: Identifies disease and medical condition names
- Relationship Extraction: Identifies chemical-disease interactions (which chemicals influence which diseases)
The model was trained on a curated dataset derived from the ChemProt corpus with 2,994 high-quality medical text samples, achieving balanced performance across all three tasks.
- Developed by: Alberto Clemente (@albyos)
- Model type: Causal Language Model with LoRA adapters
- Language(s): English (medical/biomedical domain)
- License: Llama 3.2 Community License
- Finetuned from model: meta-llama/Llama-3.2-3B-Instruct
Model Sources
- Repository: https://github.com/albertoclemente/medical-ner-fine-tuning
- Training Notebook:
notebooks/training/Medical_NER_Fine_Tuning_run_20251111.ipynb - Evaluation Notebook:
notebooks/evaluation/Medical_NER_Evaluation_BioMistral_7B_SLERP_AWQ_Quantized_20251115.ipynb
Uses
Direct Use
This model is designed for extracting structured medical information from unstructured biomedical texts, including:
- Research papers and clinical studies
- Medical literature reviews
- Drug interaction documentation
- Disease characterization documents
Input format (see the prompt-assembly sketch below):
The following article contains technical terms including diseases, drugs and chemicals.
Create a list only of the [chemicals/diseases/influences] mentioned.
[MEDICAL TEXT]
List of extracted [chemicals/diseases/influences]:
Output format:
- For chemicals/diseases: a bullet list of entities (one per line)
- For relationships: pipe-separated pairs in the form chemical | disease
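As a minimal sketch of how the input format above can be assembled programmatically; the build_prompt helper and task labels are illustrative, not part of the released code:

# Illustrative prompt builder for the three tasks; names are assumptions.
TASK_LABELS = {"chemical": "chemicals", "disease": "diseases", "relationship": "influences"}

def build_prompt(article: str, task: str) -> str:
    """Wrap a medical passage in the instruction format shown above."""
    label = TASK_LABELS[task]
    return (
        "The following article contains technical terms including diseases, drugs and chemicals.\n"
        f"Create a list only of the {label} mentioned.\n\n"
        f"{article}\n\n"
        f"List of extracted {label}:"
    )

print(build_prompt("Aspirin and ibuprofen are commonly used to treat inflammation.", "chemical"))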
Downstream Use
This model can be integrated into:
- Medical literature mining pipelines
- Drug discovery workflows
- Clinical decision support systems
- Pharmacovigilance systems
- Biomedical knowledge graph construction
Out-of-Scope Use
This model is NOT suitable for:
- Clinical diagnosis or treatment recommendations
- Patient-facing medical advice
- Real-time critical healthcare decisions
- Languages other than English
- Non-medical domain NER tasks
Important: This model is for research and information extraction purposes only. It should not be used as a substitute for professional medical judgment.
Bias, Risks, and Limitations
Known Limitations
- Domain Specificity: Trained on scientific/biomedical literature; may not perform well on clinical notes or patient-facing text
- Entity Coverage: Limited to chemicals, diseases, and their relationships; doesn't extract other medical entities (procedures, anatomy, etc.)
- Training Data Bias: Reflects patterns in ChemProt corpus; may not generalize to all medical subdomains
- Hallucination Risk: As with all LLMs, may occasionally generate plausible but incorrect entities
- Format Sensitivity: Performance depends on using the exact prompt format from training
Recommendations
- Always validate extracted entities against authoritative medical databases (ChEBI, MeSH, UMLS)
- Use in conjunction with human expert review for high-stakes applications
- Monitor for false positives (hallucinated entities) and false negatives (missed entities)
- Implement confidence thresholding based on your use case requirements
- Consider ensemble methods with other biomedical NER tools (e.g., BioMistral, PubMedBERT)
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
# Load base model and tokenizer
base_model_id = "meta-llama/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
# Load LoRA adapter
adapter_model_id = "albyos/llama3-medical-ner-lora-{timestamp}" # Replace with actual model ID
model = PeftModel.from_pretrained(model, adapter_model_id)
# Format prompt (example for chemical extraction)
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a medical NER expert specialized in extracting entities from biomedical texts.
Extract entities EXACTLY as they appear in the text.
CRITICAL RULES:
1. Return ONLY entities found verbatim in the article
2. Preserve exact formatting: hyphens, capitalization, special characters
3. Extract complete multi-word terms
4. For relationships: use format 'chemical NAME | disease NAME'
OUTPUT FORMAT:
- One entity per line with leading dash
- No explanations or additional text<|eot_id|><|start_header_id|>user<|end_header_id|>
The following article contains technical terms including diseases, drugs and chemicals.
Create a list only of the chemicals mentioned.
Aspirin and ibuprofen are commonly used to treat inflammation. Recent studies show
that metformin may reduce the risk of type-2 diabetes complications.
List of extracted chemicals:
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    temperature=1.0,
    repetition_penalty=1.15,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Extract assistant response
if "<|start_header_id|>assistant<|end_header_id|>" in response:
result = response.split("<|start_header_id|>assistant<|end_header_id|>")[-1].strip()
print(result)
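The generated text can then be converted into Python structures. A minimal parsing sketch, assuming the bullet-list and pipe formats described earlier (parse_entities and parse_relations are illustrative names, not part of the released code):

def parse_entities(result: str) -> list[str]:
    """Parse a bullet list ('- entity') into a list of entity strings."""
    entities = []
    for line in result.splitlines():
        line = line.strip()
        if line.startswith("-"):
            entities.append(line.lstrip("- ").strip())
    return entities

def parse_relations(result: str) -> list[tuple[str, str]]:
    """Parse 'chemical | disease' lines into (chemical, disease) tuples."""
    pairs = []
    for line in result.splitlines():
        line = line.strip().lstrip("- ").strip()
        if "|" in line:
            chemical, disease = (part.strip() for part in line.split("|", 1))
            pairs.append((chemical, disease))
    return pairs

print(parse_entities(result))  # entities extracted from the example above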
Training Details
Training Data
Dataset: Custom medical NER dataset derived from ChemProt corpus
- Total samples: 2,994 (after cleaning and deduplication)
- Source: Biomedical literature abstracts
- Tasks: Chemical extraction, disease extraction, relationship extraction
- Split: 80% train (2,397), 10% validation (298), 10% test (299)
- Quality: 99.8% retention rate, 0 empty completions, stratified by task
Data Characteristics (from exploration analysis):
- Unique chemicals: 1,578 entities
- Unique diseases: 2,199 entities
- Vocabulary size: 13,710 unique words
- Prompt length: Median 1,357 characters (195 words), range 345-4,018 chars
- Hyphenated entities: ~459 (e.g., "type-2 diabetes", "5-fluorouracil")
- Format conversion: 2,050 relationships converted from sentence to pipe format
Training Procedure
Preprocessing
- Deduplication: Removed duplicate prompts by normalized hash
- Format standardization: Converted relationship format from "chemical X influences disease Y" to "X | Y" (see the conversion sketch after this list)
- Entity normalization: Lowercase, whitespace normalization, hyphen preservation
- Stratified splitting: Ensures 33.3% distribution per task across all splits
- Leakage prevention: Hard assertions verify zero overlap between train/val/test
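A minimal sketch of these preprocessing checks (deduplication hashing, the pipe-format conversion, and the leakage assertion), assuming lists of raw prompt strings; the hashing scheme, regex, and function names are illustrative and may differ from the actual training code:

import hashlib
import re

def normalized_hash(prompt: str) -> str:
    """Hash a lowercased, whitespace-normalized prompt for deduplication."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def to_pipe_format(relation: str) -> str:
    """Convert 'chemical X influences disease Y' into 'X | Y'."""
    match = re.match(r"chemical (.+?) influences disease (.+)", relation.strip(), re.IGNORECASE)
    return f"{match.group(1)} | {match.group(2)}" if match else relation

def assert_no_leakage(train_prompts, val_prompts, test_prompts):
    """Hard assertion that no normalized prompt appears in more than one split."""
    train = {normalized_hash(p) for p in train_prompts}
    val = {normalized_hash(p) for p in val_prompts}
    test = {normalized_hash(p) for p in test_prompts}
    assert not (train & val) and not (train & test) and not (val & test)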
Training Hyperparameters
LoRA Configuration:
- LoRA rank (r): 16
- LoRA alpha: 32
- LoRA dropout: 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Training Parameters:
- Training regime: fp16 mixed precision
- Quantization: 4-bit NF4 (BitsAndBytes)
- Epochs: 5
- Batch size: 4 per device
- Gradient accumulation: 4 steps (effective batch = 16)
- Learning rate: 5e-5
- LR scheduler: Cosine with 3% warmup
- Weight decay: 0.01
- Optimizer: paged_adamw_8bit
- Max sequence length: 2048 tokens
- Gradient checkpointing: Enabled
Data-Driven Justification: All hyperparameters were validated against dataset characteristics (a configuration sketch follows this list):
- Batch size 4-8 optimal for 3,000 samples
- 5 epochs sufficient for format learning without overfitting
- Conservative LR (5e-5) suited to the dataset's 13,710-word vocabulary
- Max length 2048 covers 99%+ of prompts (median 1,357 chars)
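A configuration sketch that mirrors the hyperparameters listed above, using PEFT's LoraConfig and Transformers' BitsAndBytesConfig and TrainingArguments; the output directory is a placeholder and the exact training script may differ:

import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NF4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # fp16 compute
)

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

training_args = TrainingArguments(
    output_dir="llama3-medical-ner-lora",  # placeholder path
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,         # effective batch size 16
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    optim="paged_adamw_8bit",
    fp16=True,
    gradient_checkpointing=True,
    save_steps=50,                         # checkpoint every 50 steps
)
# The 2048-token max sequence length is applied at tokenization time.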
Speeds, Sizes, Times
- Training time: ~2-3 hours on NVIDIA A100 GPU
- Model size: ~3.5 GB (quantized base model + LoRA adapters)
- Trainable parameters: ~1.5% of total model parameters
- Checkpoint frequency: Every 50 steps
- Evaluation frequency: Every 50 steps
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Dataset: Held-out test set from cleaned splits (299 samples)
- Split date: November 13, 2025
- Distribution: 100 chemicals, 99 diseases, 100 relationships
- Source: ChemProt corpus (biomedical literature)
Factors
Evaluation disaggregated by task type:
- Chemical extraction: Drug and chemical compound identification
- Disease extraction: Disease and medical condition identification
- Relationship extraction: Chemical-disease interaction pairs
Metrics
- F1 Score (primary): Harmonic mean of precision and recall
- Precision: Fraction of predicted entities that are correct
- Recall: Fraction of gold standard entities that were found
- Macro-average: Equal weight to each task (chemicals, diseases, relationships)
Evaluation methodology (a scoring sketch follows this list):
- Enhanced filtering to reduce false positives
- Normalized entity matching (lowercase, whitespace)
- Hyphen preservation during normalization
- Task-specific parsing (bullet lists for entities, pipe format for relationships)
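A minimal scoring sketch consistent with the metrics above, assuming exact set matching after normalization; function names are illustrative:

def normalize(entity: str) -> str:
    """Lowercase and collapse whitespace while preserving hyphens."""
    return " ".join(entity.lower().split())

def score(predicted: list[str], gold: list[str]) -> dict:
    """Entity-level precision, recall, and F1 for one task."""
    pred = {normalize(e) for e in predicted}
    ref = {normalize(e) for e in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Macro-average F1: the mean of the per-task F1 scores
# (chemicals, diseases, relationships), each weighted equally.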
Results
Fine-tuned Llama-3.2-3B results (baseline for the planned BioMistral comparison):
- Overall F1: 53.8% (macro-average across 3 tasks)
- Precision: ~52-55%
- Recall: ~54-56%
Key Insights:
- Model successfully learned pipe format for relationships (was 0% before fine-tuning)
- Balanced performance across all three tasks
- Format conversion (2,050 samples) successfully integrated during training
- Clean data (99.8% retention) contributed to stable training
Baseline Comparison:
- Before fine-tuning: 0% F1 on relationships (the base model could not extract pairs)
- After fine-tuning: ~50% F1 on relationships (a significant improvement)
- Chemical/disease extraction improved from generic to domain-specific recognition
Planned Evaluation
Next Step: Baseline evaluation of BioMistral-7B-SLERP-AWQ (quantized, no fine-tuning)
- Hypothesis: Medical domain pre-training may outperform fine-tuned Llama-3.2-3B
- Target: 70-80% F1 (medical domain models typically show 15-20 point advantage)
- Decision criteria:
  - If BioMistral ≥70% F1 → Deploy quantized model as-is
  - If BioMistral 60-70% F1 → Fine-tune BioMistral (expected 75-85% F1)
  - If BioMistral <60% F1 → Fine-tuning mandatory
Tracking: GitHub Issue #3
Model Examination
Error Analysis
Common error patterns observed:
- False positives: Generic medical terms (e.g., "pain", "treatment") occasionally extracted
- False negatives: Complex multi-word entities sometimes partially extracted
- Boundary issues: Entity boundaries unclear for nested or compound terms
- Format sensitivity: Deviations from training prompt format reduce performance
Filtering Strategy
Enhanced filtering applied during evaluation (sketched below):
- Blacklist of generic terms (drug, disease, chemical, etc.)
- Entity type validation (disease markers shouldn't appear in chemical extractions)
- Text grounding (only entities found in source text)
- Minimum length threshold (≥3 characters)
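A minimal sketch of this filtering step, assuming a list of candidate entities and the source passage; the blacklist contents and function name are illustrative:

GENERIC_TERMS = {"drug", "drugs", "disease", "diseases", "chemical", "chemicals"}  # illustrative blacklist

def filter_entities(entities: list[str], source_text: str) -> list[str]:
    """Keep candidates that are specific, long enough, and grounded in the source text."""
    text = source_text.lower()
    kept = []
    for entity in entities:
        cleaned = entity.strip()
        if len(cleaned) < 3:                  # minimum length threshold
            continue
        if cleaned.lower() in GENERIC_TERMS:  # drop generic terms
            continue
        if cleaned.lower() not in text:       # text grounding
            continue
        kept.append(cleaned)
    return kept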
Environmental Impact
Carbon emissions estimated using the Machine Learning Impact calculator.
- Hardware Type: NVIDIA A100 80GB GPU
- Hours used: ~2.5 hours
- Cloud Provider: RunPod (cloud GPU provider)
- Compute Region: US (variable)
- Carbon Emitted: ~0.5 kg CO2eq (estimated)
Note: LoRA fine-tuning is significantly more efficient than full model training, updating only ~1.5% of the model's parameters and requiring ~3 hours of compute vs. days or weeks for full training.
Technical Specifications
Model Architecture and Objective
Base Architecture: Llama-3.2-3B-Instruct (Meta AI)
- Parameters: 3 billion (base model)
- Architecture: Transformer decoder with grouped-query attention
- Context length: 128K tokens
- Vocabulary: 128,256 tokens (tiktoken-based BPE tokenizer)
LoRA Adaptation:
- Trainable parameters: 47 million (1.5% of total); the sketch after this list shows how to print this breakdown with PEFT
- LoRA rank: 16 (low-rank decomposition dimension)
- Adapter placement: All attention and MLP projection layers
- Training objective: Next-token prediction (causal language modeling)
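The trainable-parameter breakdown can be inspected with PEFT's built-in helper; a short sketch using the LoRA settings from Training Hyperparameters (the base model is loaded in full precision here only to count parameters):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()  # prints trainable vs. total parameter counts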
Compute Infrastructure
Hardware
- Training: NVIDIA A100 80GB GPU
- Memory: 80GB VRAM (4-bit quantization reduces to ~7GB usage)
- CPU: High-memory instance (for data preprocessing)
Software
- Framework: Hugging Face Transformers 4.x
- Training: Hugging Face Trainer with PEFT (Parameter-Efficient Fine-Tuning)
- Quantization: BitsAndBytes (4-bit NF4 quantization)
- Monitoring: Weights & Biases
- Python: 3.10+
- PyTorch: 2.x with CUDA 12.x
- Key libraries: transformers (model loading, training), peft (LoRA implementation), bitsandbytes (quantization), accelerate (distributed training), datasets (data loading), wandb (experiment tracking)
Citation
If you use this model in your research, please cite:
BibTeX:
@misc{clemente2025medical-ner-lora,
  author       = {Clemente, Alberto},
  title        = {Llama-3.2-3B Medical NER with LoRA},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/albyos/llama3-medical-ner-lora}},
}
APA:
Clemente, A. (2025). Llama-3.2-3B Medical NER with LoRA [Computer software]. Hugging Face. https://huggingface.co/albyos/llama3-medical-ner-lora
Glossary
- NER (Named Entity Recognition): Task of identifying and classifying named entities in text
- LoRA (Low-Rank Adaptation): Parameter-efficient fine-tuning method that adds trainable low-rank matrices to model layers
- ChemProt: Chemical-protein interaction corpus from biomedical literature
- Stratified splitting: Data splitting that preserves class distribution across splits
- Quantization: Reducing model precision (e.g., 32-bit โ 4-bit) to save memory
- Macro-average: Averaging metrics across classes with equal weight (vs. micro-average)
- Pipe format: Relationship representation as "entity1 | entity2" (used for chemical-disease pairs)
More Information
Project Documentation:
- Quick Start Guide
- Fine-Tuning Plan
- Three-Way Split Guide
- Checkpoint Naming Strategy
- Implementation Summary
- Validation Strategy
Related Work:
- Base Model: Llama-3.2-3B-Instruct
- Alternative: BioMistral-7B-SLERP (medical domain pre-trained)
- Dataset Source: ChemProt Corpus
GitHub Issues:
- Issue #2: Retrain with BioMistral-7B-SLERP (Closed)
- Issue #3: Baseline Evaluation - BioMistral-7B-SLERP-AWQ (Open)
Model Card Authors
- Alberto Clemente (@albyos)
Model Card Contact
- GitHub: https://github.com/albertoclemente/medical-ner-fine-tuning
- Issues: https://github.com/albertoclemente/medical-ner-fine-tuning/issues
Framework Versions
- PEFT: 0.17.1+
- Transformers: 4.40.0+
- PyTorch: 2.2.0+
- BitsAndBytes: 0.42.0+
- Accelerate: 0.27.0+
- Datasets: 2.18.0+
- Tokenizers: 0.19.0+
Last Updated: November 15, 2025