Llama-3.2-3B Medical NER LoRA
A fine-tuned medical Named Entity Recognition (NER) model based on Llama-3.2-3B-Instruct, using LoRA (Low-Rank Adaptation) for parameter-efficient tuning. The model is specialized for extracting medical entities and relationships from biomedical texts.
Model Details
Model Description
This model fine-tunes Llama-3.2-3B-Instruct for medical Named Entity Recognition across three specialized tasks:
- Chemical Extraction: Identifies drug and chemical compound names
- Disease Extraction: Identifies disease and medical condition names
- Relationship Extraction: Identifies chemical-disease interactions (which chemicals influence which diseases)
The model was trained on a curated dataset derived from the ChemProt corpus with 2,994 high-quality medical text samples, achieving balanced performance across all three tasks.
- Developed by: Alberto Clemente (@albyos)
- Model type: Causal Language Model with LoRA adapters
- Language(s): English (medical/biomedical domain)
- License: Llama 3.2 Community License
- Finetuned from model: meta-llama/Llama-3.2-3B-Instruct
Model Sources
- Repository: https://github.com/albertoclemente/medical-ner-fine-tuning
- Training Notebook:
notebooks/training/Medical_NER_Fine_Tuning_run_20251111.ipynb - Evaluation Notebook:
notebooks/evaluation/Medical_NER_Evaluation_BioMistral_7B_SLERP_AWQ_Quantized_20251115.ipynb
Uses
Direct Use
This model is designed for extracting structured medical information from unstructured biomedical texts, including:
- Research papers and clinical studies
- Medical literature reviews
- Drug interaction documentation
- Disease characterization documents
Input format (see the prompt-assembly sketch below):
The following article contains technical terms including diseases, drugs and chemicals.
Create a list only of the [chemicals/diseases/influences] mentioned.
[MEDICAL TEXT]
List of extracted [chemicals/diseases/influences]:
Output format:
- For chemicals/diseases: a bullet list of entities (one per line)
- For relationships: pipe-separated pairs in the form chemical | disease
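As a minimal sketch of how the input format above can be assembled programmatically; the build_prompt helper and task labels are illustrative, not part of the released code:

# Illustrative prompt builder for the three tasks; names are assumptions.
TASK_LABELS = {"chemical": "chemicals", "disease": "diseases", "relationship": "influences"}

def build_prompt(article: str, task: str) -> str:
    """Wrap a medical passage in the instruction format shown above."""
    label = TASK_LABELS[task]
    return (
        "The following article contains technical terms including diseases, drugs and chemicals.\n"
        f"Create a list only of the {label} mentioned.\n\n"
        f"{article}\n\n"
        f"List of extracted {label}:"
    )

print(build_prompt("Aspirin and ibuprofen are commonly used to treat inflammation.", "chemical"))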
Downstream Use
This model can be integrated into:
- Medical literature mining pipelines
- Drug discovery workflows
- Clinical decision support systems
- Pharmacovigilance systems
- Biomedical knowledge graph construction
Out-of-Scope Use
This model is NOT suitable for:
- Clinical diagnosis or treatment recommendations
- Patient-facing medical advice
- Real-time critical healthcare decisions
- Languages other than English
- Non-medical domain NER tasks
Important: This model is for research and information extraction purposes only. It should not be used as a substitute for professional medical judgment.
Bias, Risks, and Limitations
Known Limitations
- Domain Specificity: Trained on scientific/biomedical literature; may not perform well on clinical notes or patient-facing text
- Entity Coverage: Limited to chemicals, diseases, and their relationships; doesn't extract other medical entities (procedures, anatomy, etc.)
- Training Data Bias: Reflects patterns in ChemProt corpus; may not generalize to all medical subdomains
- Hallucination Risk: As with all LLMs, may occasionally generate plausible but incorrect entities
- Format Sensitivity: Performance depends on using the exact prompt format from training
Recommendations
- Always validate extracted entities against authoritative medical databases (ChEBI, MeSH, UMLS)
- Use in conjunction with human expert review for high-stakes applications
- Monitor for false positives (hallucinated entities) and false negatives (missed entities)
- Implement confidence thresholding based on your use case requirements
- Consider ensemble methods with other biomedical NER tools (e.g., BioMistral, PubMedBERT)
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
# Load base model and tokenizer
base_model_id = "meta-llama/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
# Load LoRA adapter
adapter_model_id = "albyos/llama3-medical-ner-lora-{timestamp}" # Replace with actual model ID
model = PeftModel.from_pretrained(model, adapter_model_id)
# Format prompt (example for chemical extraction)
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a medical NER expert specialized in extracting entities from biomedical texts.
Extract entities EXACTLY as they appear in the text.
CRITICAL RULES:
1. Return ONLY entities found verbatim in the article
2. Preserve exact formatting: hyphens, capitalization, special characters
3. Extract complete multi-word terms
4. For relationships: use format 'chemical NAME | disease NAME'
OUTPUT FORMAT:
- One entity per line with leading dash
- No explanations or additional text<|eot_id|><|start_header_id|>user<|end_header_id|>
The following article contains technical terms including diseases, drugs and chemicals.
Create a list only of the chemicals mentioned.
Aspirin and ibuprofen are commonly used to treat inflammation. Recent studies show
that metformin may reduce the risk of type-2 diabetes complications.
List of extracted chemicals:
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    temperature=1.0,
    repetition_penalty=1.15,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Extract assistant response
if "<|start_header_id|>assistant<|end_header_id|>" in response:
result = response.split("<|start_header_id|>assistant<|end_header_id|>")[-1].strip()
print(result)
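The generated text can then be converted into Python structures. A minimal parsing sketch, assuming the bullet-list and pipe formats described earlier (parse_entities and parse_relations are illustrative names, not part of the released code):

def parse_entities(result: str) -> list[str]:
    """Parse a bullet list ('- entity') into a list of entity strings."""
    entities = []
    for line in result.splitlines():
        line = line.strip()
        if line.startswith("-"):
            entities.append(line.lstrip("- ").strip())
    return entities

def parse_relations(result: str) -> list[tuple[str, str]]:
    """Parse 'chemical | disease' lines into (chemical, disease) tuples."""
    pairs = []
    for line in result.splitlines():
        line = line.strip().lstrip("- ").strip()
        if "|" in line:
            chemical, disease = (part.strip() for part in line.split("|", 1))
            pairs.append((chemical, disease))
    return pairs

print(parse_entities(result))  # entities extracted from the example above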
Training Details
Training Data
Dataset: Custom medical NER dataset derived from ChemProt corpus
- Total samples: 2,994 (after cleaning and deduplication)
- Source: Biomedical literature abstracts
- Tasks: Chemical extraction, disease extraction, relationship extraction
- Split: 80% train (2,397), 10% validation (298), 10% test (299)
- Quality: 99.8% retention rate, 0 empty completions, stratified by task
Data Characteristics (from exploration analysis):
- Unique chemicals: 1,578 entities
- Unique diseases: 2,199 entities
- Vocabulary size: 13,710 unique words
- Prompt length: Median 1,357 characters (195 words), range 345-4,018 chars
- Hyphenated entities: ~459 (e.g., "type-2 diabetes", "5-fluorouracil")
- Format conversion: 2,050 relationships converted from sentence to pipe format
Training Procedure
Preprocessing
- Deduplication: Removed duplicate prompts by normalized hash
- Format standardization: Converted relationship format from "chemical X influences disease Y" to "X | Y" (see the conversion sketch after this list)
- Entity normalization: Lowercase, whitespace normalization, hyphen preservation
- Stratified splitting: Ensures 33.3% distribution per task across all splits
- Leakage prevention: Hard assertions verify zero overlap between train/val/test
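A minimal sketch of these preprocessing checks (deduplication hashing, the pipe-format conversion, and the leakage assertion), assuming lists of raw prompt strings; the hashing scheme, regex, and function names are illustrative and may differ from the actual training code:

import hashlib
import re

def normalized_hash(prompt: str) -> str:
    """Hash a lowercased, whitespace-normalized prompt for deduplication."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def to_pipe_format(relation: str) -> str:
    """Convert 'chemical X influences disease Y' into 'X | Y'."""
    match = re.match(r"chemical (.+?) influences disease (.+)", relation.strip(), re.IGNORECASE)
    return f"{match.group(1)} | {match.group(2)}" if match else relation

def assert_no_leakage(train_prompts, val_prompts, test_prompts):
    """Hard assertion that no normalized prompt appears in more than one split."""
    train = {normalized_hash(p) for p in train_prompts}
    val = {normalized_hash(p) for p in val_prompts}
    test = {normalized_hash(p) for p in test_prompts}
    assert not (train & val) and not (train & test) and not (val & test)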
Training Hyperparameters
LoRA Configuration:
- LoRA rank (r): 16
- LoRA alpha: 32
- LoRA dropout: 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Training Parameters:
- Training regime: fp16 mixed precision
- Quantization: 4-bit NF4 (BitsAndBytes)
- Epochs: 5
- Batch size: 4 per device
- Gradient accumulation: 4 steps (effective batch = 16)
- Learning rate: 5e-5
- LR scheduler: Cosine with 3% warmup
- Weight decay: 0.01
- Optimizer: paged_adamw_8bit
- Max sequence length: 2048 tokens
- Gradient checkpointing: Enabled
Data-Driven Justification: All hyperparameters were validated against dataset characteristics (a configuration sketch follows this list):
- Batch size 4-8 optimal for 3,000 samples
- 5 epochs sufficient for format learning without overfitting
- Conservative LR (5e-5) suited to the dataset's 13,710-word vocabulary
- Max length 2048 covers 99%+ of prompts (median 1,357 chars)
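A configuration sketch that mirrors the hyperparameters listed above, using PEFT's LoraConfig and Transformers' BitsAndBytesConfig and TrainingArguments; the output directory is a placeholder and the exact training script may differ:

import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NF4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # fp16 compute
)

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

training_args = TrainingArguments(
    output_dir="llama3-medical-ner-lora",  # placeholder path
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,         # effective batch size 16
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    optim="paged_adamw_8bit",
    fp16=True,
    gradient_checkpointing=True,
    save_steps=50,                         # checkpoint every 50 steps
)
# The 2048-token max sequence length is applied at tokenization time.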
Speeds, Sizes, Times
- Training time: ~2-3 hours on NVIDIA A100 GPU
- Model size: ~3.5 GB (quantized base model + LoRA adapters)
- Trainable parameters: ~1.5% of total model parameters
- Checkpoint frequency: Every 50 steps
- Evaluation frequency: Every 50 steps
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Dataset: Held-out test set from cleaned splits (299 samples)
- Split date: November 13, 2025
- Distribution: 100 chemicals, 99 diseases, 100 relationships
- Source: ChemProt corpus (biomedical literature)
Factors
Evaluation disaggregated by task type:
- Chemical extraction: Drug and chemical compound identification
- Disease extraction: Disease and medical condition identification
- Relationship extraction: Chemical-disease interaction pairs
Metrics
- F1 Score (primary): Harmonic mean of precision and recall
- Precision: Fraction of predicted entities that are correct
- Recall: Fraction of gold standard entities that were found
- Macro-average: Equal weight to each task (chemicals, diseases, relationships)
Evaluation methodology (a scoring sketch follows this list):
- Enhanced filtering to reduce false positives
- Normalized entity matching (lowercase, whitespace)
- Hyphen preservation during normalization
- Task-specific parsing (bullet lists for entities, pipe format for relationships)
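A minimal scoring sketch consistent with the metrics above, assuming exact set matching after normalization; function names are illustrative:

def normalize(entity: str) -> str:
    """Lowercase and collapse whitespace while preserving hyphens."""
    return " ".join(entity.lower().split())

def score(predicted: list[str], gold: list[str]) -> dict:
    """Entity-level precision, recall, and F1 for one task."""
    pred = {normalize(e) for e in predicted}
    ref = {normalize(e) for e in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Macro-average F1: the mean of the per-task F1 scores
# (chemicals, diseases, relationships), each weighted equally.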
Results
Fine-tuned Llama-3.2-3B results (baseline for the planned BioMistral comparison):
- Overall F1: 53.8% (macro-average across 3 tasks)
- Precision: ~52-55%
- Recall: ~54-56%
Key Insights:
- Model successfully learned pipe format for relationships (was 0% before fine-tuning)
- Balanced performance across all three tasks
- Format conversion (2,050 samples) successfully integrated during training
- Clean data (99.8% retention) contributed to stable training
Baseline Comparison:
- Before fine-tuning: 0% F1 on relationships (the base model could not extract pairs)
- After fine-tuning: ~50% F1 on relationships (a significant improvement)
- Chemical/disease extraction improved from generic to domain-specific recognition
Planned Evaluation
Next Step: Baseline evaluation of BioMistral-7B-SLERP-AWQ (quantized, no fine-tuning)
- Hypothesis: Medical domain pre-training may outperform fine-tuned Llama-3.2-3B
- Target: 70-80% F1 (medical domain models typically show 15-20 point advantage)
- Decision criteria:
  - If BioMistral ≥70% F1 → Deploy quantized model as-is
  - If BioMistral 60-70% F1 → Fine-tune BioMistral (expected 75-85% F1)
  - If BioMistral <60% F1 → Fine-tuning mandatory
Tracking: GitHub Issue #3
Model Examination
Error Analysis
Common error patterns observed:
- False positives: Generic medical terms (e.g., "pain", "treatment") occasionally extracted
- False negatives: Complex multi-word entities sometimes partially extracted
- Boundary issues: Entity boundaries unclear for nested or compound terms
- Format sensitivity: Deviations from training prompt format reduce performance
Filtering Strategy
Enhanced filtering applied during evaluation (sketched below):
- Blacklist of generic terms (drug, disease, chemical, etc.)
- Entity type validation (disease markers shouldn't appear in chemical extractions)
- Text grounding (only entities found in source text)
- Minimum length threshold (≥3 characters)
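A minimal sketch of this filtering step, assuming a list of candidate entities and the source passage; the blacklist contents and function name are illustrative:

GENERIC_TERMS = {"drug", "drugs", "disease", "diseases", "chemical", "chemicals"}  # illustrative blacklist

def filter_entities(entities: list[str], source_text: str) -> list[str]:
    """Keep candidates that are specific, long enough, and grounded in the source text."""
    text = source_text.lower()
    kept = []
    for entity in entities:
        cleaned = entity.strip()
        if len(cleaned) < 3:                  # minimum length threshold
            continue
        if cleaned.lower() in GENERIC_TERMS:  # drop generic terms
            continue
        if cleaned.lower() not in text:       # text grounding
            continue
        kept.append(cleaned)
    return kept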
Environmental Impact
Carbon emissions estimated using the Machine Learning Impact calculator.
- Hardware Type: NVIDIA A100 80GB GPU
- Hours used: ~2.5 hours
- Cloud Provider: RunPod (cloud GPU provider)
- Compute Region: US (variable)
- Carbon Emitted: ~0.5 kg CO2eq (estimated)
Note: LoRA fine-tuning is significantly more efficient than full model training, updating only ~1.5% of the model's parameters and requiring ~3 hours of compute vs. days or weeks for full training.
Technical Specifications
Model Architecture and Objective
Base Architecture: Llama-3.2-3B-Instruct (Meta AI)
- Parameters: 3 billion (base model)
- Architecture: Transformer decoder with grouped-query attention
- Context length: 128K tokens
- Vocabulary: 128,256 tokens (tiktoken-based BPE tokenizer)
LoRA Adaptation:
- Trainable parameters: 47 million (1.5% of total); the sketch after this list shows how to print this breakdown with PEFT
- LoRA rank: 16 (low-rank decomposition dimension)
- Adapter placement: All attention and MLP projection layers
- Training objective: Next-token prediction (causal language modeling)
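The trainable-parameter breakdown can be inspected with PEFT's built-in helper; a short sketch using the LoRA settings from Training Hyperparameters (the base model is loaded in full precision here only to count parameters):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()  # prints trainable vs. total parameter counts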
Compute Infrastructure
Hardware
- Training: NVIDIA A100 80GB GPU
- Memory: 80GB VRAM (4-bit quantization reduces to ~7GB usage)
- CPU: High-memory instance (for data preprocessing)
Software
- Framework: Hugging Face Transformers 4.x
- Training: Hugging Face Trainer with PEFT (Parameter-Efficient Fine-Tuning)
- Quantization: BitsAndBytes (4-bit NF4 quantization)
- Monitoring: Weights & Biases
- Python: 3.10+
- PyTorch: 2.x with CUDA 12.x
- Key libraries: transformers (model loading, training), peft (LoRA implementation), bitsandbytes (quantization), accelerate (distributed training), datasets (data loading), wandb (experiment tracking)
Citation
If you use this model in your research, please cite:
BibTeX:
@misc{clemente2025medical-ner-lora,
  author       = {Clemente, Alberto},
  title        = {Llama-3.2-3B Medical NER with LoRA},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/albyos/llama3-medical-ner-lora}},
}
APA:
Clemente, A. (2025). Llama-3.2-3B Medical NER with LoRA [Computer software]. Hugging Face. https://huggingface.co/albyos/llama3-medical-ner-lora
Glossary
- NER (Named Entity Recognition): Task of identifying and classifying named entities in text
- LoRA (Low-Rank Adaptation): Parameter-efficient fine-tuning method that adds trainable low-rank matrices to model layers
- ChemProt: Chemical-protein interaction corpus from biomedical literature
- Stratified splitting: Data splitting that preserves class distribution across splits
- Quantization: Reducing model precision (e.g., 32-bit โ 4-bit) to save memory
- Macro-average: Averaging metrics across classes with equal weight (vs. micro-average)
- Pipe format: Relationship representation as "entity1 | entity2" (used for chemical-disease pairs)
More Information
Project Documentation:
- Quick Start Guide
- Fine-Tuning Plan
- Three-Way Split Guide
- Checkpoint Naming Strategy
- Implementation Summary
- Validation Strategy
Related Work:
- Base Model: Llama-3.2-3B-Instruct
- Alternative: BioMistral-7B-SLERP (medical domain pre-trained)
- Dataset Source: ChemProt Corpus
GitHub Issues:
- Issue #2: Retrain with BioMistral-7B-SLERP (Closed)
- Issue #3: Baseline Evaluation - BioMistral-7B-SLERP-AWQ (Open)
Model Card Authors
- Alberto Clemente (@albyos)
Model Card Contact
- GitHub: https://github.com/albertoclemente/medical-ner-fine-tuning
- Issues: https://github.com/albertoclemente/medical-ner-fine-tuning/issues
Framework Versions
- PEFT: 0.17.1+
- Transformers: 4.40.0+
- PyTorch: 2.2.0+
- BitsAndBytes: 0.42.0+
- Accelerate: 0.27.0+
- Datasets: 2.18.0+
- Tokenizers: 0.19.0+
Last Updated: November 15, 2025