Llama-3.2-3B Medical NER LoRA

A fine-tuned medical Named Entity Recognition (NER) model based on Llama-3.2-3B-Instruct, using LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning. It is specialized for extracting medical entities and relationships from biomedical texts.

Model Details

Model Description

This model fine-tunes Llama-3.2-3B-Instruct for medical Named Entity Recognition across three specialized tasks:

  1. Chemical Extraction: Identifies drug and chemical compound names
  2. Disease Extraction: Identifies disease and medical condition names
  3. Relationship Extraction: Identifies chemical-disease interactions (which chemicals influence which diseases)

The model was trained on a curated dataset derived from the ChemProt corpus with 2,994 high-quality medical text samples, achieving balanced performance across all three tasks.

  • Developed by: Alberto Clemente (@albyos)
  • Model type: Causal Language Model with LoRA adapters
  • Language(s): English (medical/biomedical domain)
  • License: Llama 3.2 Community License
  • Finetuned from model: meta-llama/Llama-3.2-3B-Instruct

Model Sources

Uses

Direct Use

This model is designed for extracting structured medical information from unstructured biomedical texts, including:

  • Research papers and clinical studies
  • Medical literature reviews
  • Drug interaction documentation
  • Disease characterization documents

Input format:

The following article contains technical terms including diseases, drugs and chemicals. 
Create a list only of the [chemicals/diseases/influences] mentioned.

[MEDICAL TEXT]

List of extracted [chemicals/diseases/influences]:

Output format:

  • For chemicals/diseases: Bullet list of entities
  • For relationships: Pipe-separated pairs (chemical | disease)
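
The helper names below (build_user_prompt, parse_output) are not part of the released code; this is a minimal sketch of how the input format above can be assembled and the output parsed, assuming the three task keywords used during training.

from typing import List, Tuple, Union

TASKS = {"chemicals", "diseases", "influences"}

def build_user_prompt(article: str, task: str) -> str:
    """Build the user-turn text in the format used during training."""
    assert task in TASKS, f"unknown task: {task}"
    return (
        "The following article contains technical terms including diseases, drugs and chemicals. \n"
        f"Create a list only of the {task} mentioned.\n\n"
        f"{article}\n\n"
        f"List of extracted {task}:\n"
    )

def parse_output(raw: str, task: str) -> Union[List[str], List[Tuple[str, str]]]:
    """Parse a bullet list of entities, or 'chemical | disease' pairs for relationships."""
    lines = [ln.strip().lstrip("-").strip() for ln in raw.splitlines() if ln.strip()]
    if task == "influences":
        pairs = []
        for ln in lines:
            if "|" in ln:
                chemical, _, disease = ln.partition("|")
                pairs.append((chemical.strip(), disease.strip()))
        return pairs
    return lines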

Downstream Use

This model can be integrated into:

  • Medical literature mining pipelines
  • Drug discovery workflows
  • Clinical decision support systems
  • Pharmacovigilance systems
  • Biomedical knowledge graph construction

Out-of-Scope Use

This model is NOT suitable for:

  • Clinical diagnosis or treatment recommendations
  • Patient-facing medical advice
  • Real-time critical healthcare decisions
  • Languages other than English
  • Non-medical domain NER tasks

Important: This model is for research and information extraction purposes only. It should not be used as a substitute for professional medical judgment.

Bias, Risks, and Limitations

Known Limitations

  1. Domain Specificity: Trained on scientific/biomedical literature; may not perform well on clinical notes or patient-facing text
  2. Entity Coverage: Limited to chemicals, diseases, and their relationships; doesn't extract other medical entities (procedures, anatomy, etc.)
  3. Training Data Bias: Reflects patterns in ChemProt corpus; may not generalize to all medical subdomains
  4. Hallucination Risk: As with all LLMs, may occasionally generate plausible but incorrect entities
  5. Format Sensitivity: Performance depends on using the exact prompt format from training

Recommendations

  • Always validate extracted entities against authoritative medical databases (ChEBI, MeSH, UMLS)
  • Use in conjunction with human expert review for high-stakes applications
  • Monitor for false positives (hallucinated entities) and false negatives (missed entities)
  • Implement confidence thresholding based on your use case requirements
  • Consider ensemble methods with other biomedical NER tools (e.g., BioMistral, PubMedBERT)

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model_id = "meta-llama/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Load LoRA adapter
adapter_model_id = "albyos/llama3-medical-ner-lora-{timestamp}"  # Replace with actual model ID
model = PeftModel.from_pretrained(model, adapter_model_id)

# Format prompt (example for chemical extraction)
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a medical NER expert specialized in extracting entities from biomedical texts.
Extract entities EXACTLY as they appear in the text.

CRITICAL RULES:
1. Return ONLY entities found verbatim in the article
2. Preserve exact formatting: hyphens, capitalization, special characters
3. Extract complete multi-word terms
4. For relationships: use format 'chemical NAME | disease NAME'

OUTPUT FORMAT:
- One entity per line with leading dash
- No explanations or additional text<|eot_id|><|start_header_id|>user<|end_header_id|>

The following article contains technical terms including diseases, drugs and chemicals. 
Create a list only of the chemicals mentioned.

Aspirin and ibuprofen are commonly used to treat inflammation. Recent studies show 
that metformin may reduce the risk of type-2 diabetes complications.

List of extracted chemicals:
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,  # greedy decoding; sampling parameters such as temperature are ignored
    repetition_penalty=1.15,
)

# Decode only the newly generated tokens (the assistant's reply);
# decoding the full sequence with skip_special_tokens=True would strip the
# header markers and make splitting on them unreliable.
generated = outputs[0][inputs["input_ids"].shape[-1]:]
result = tokenizer.decode(generated, skip_special_tokens=True).strip()
print(result)

Training Details

Training Data

Dataset: Custom medical NER dataset derived from ChemProt corpus

  • Total samples: 2,994 (after cleaning and deduplication)
  • Source: Biomedical literature abstracts
  • Tasks: Chemical extraction, disease extraction, relationship extraction
  • Split: 80% train (2,397), 10% validation (298), 10% test (299)
  • Quality: 99.8% retention rate, 0 empty completions, stratified by task

Data Characteristics (from exploration analysis):

  • Unique chemicals: 1,578 entities
  • Unique diseases: 2,199 entities
  • Vocabulary size: 13,710 unique words
  • Prompt length: Median 1,357 characters (195 words), range 345-4,018 chars
  • Hyphenated entities: ~459 (e.g., "type-2 diabetes", "5-fluorouracil")
  • Format conversion: 2,050 relationships converted from sentence to pipe format

Training Procedure

Preprocessing

  1. Deduplication: Removed duplicate prompts by normalized hash
  2. Format standardization: Converted relationship format from "chemical X influences disease Y" to "X | Y"
  3. Entity normalization: Lowercase, whitespace normalization, hyphen preservation
  4. Stratified splitting: Ensures 33.3% distribution per task across all splits
  5. Leakage prevention: Hard assertions verify zero overlap between train/val/test
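
The preprocessing code is not shipped with the adapter; the sketch below illustrates one plausible implementation of the listed steps (normalized-hash deduplication, sentence-to-pipe conversion, stratified splitting with leakage assertions). Function names, the regex, and the random seed are assumptions.

import hashlib
import random
import re
from collections import defaultdict

def normalized_hash(prompt: str) -> str:
    """Step 1: hash a prompt after lowercasing and collapsing whitespace."""
    norm = re.sub(r"\s+", " ", prompt.lower()).strip()
    return hashlib.sha256(norm.encode("utf-8")).hexdigest()

def to_pipe_format(sentence: str) -> str:
    """Step 2: convert 'chemical X influences disease Y' to 'X | Y'."""
    match = re.match(r"chemical\s+(.+?)\s+influences\s+disease\s+(.+)", sentence.strip(), re.IGNORECASE)
    return f"{match.group(1)} | {match.group(2)}" if match else sentence

def stratified_split(samples, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Steps 4-5: split per task into train/val/test and assert zero prompt overlap."""
    by_task = defaultdict(list)
    for sample in samples:
        by_task[sample["task"]].append(sample)
    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for task_samples in by_task.values():
        rng.shuffle(task_samples)
        n = len(task_samples)
        n_train, n_val = int(n * ratios[0]), int(n * ratios[1])
        splits["train"] += task_samples[:n_train]
        splits["validation"] += task_samples[n_train:n_train + n_val]
        splits["test"] += task_samples[n_train + n_val:]
    hashes = {name: {normalized_hash(s["prompt"]) for s in split} for name, split in splits.items()}
    assert not (hashes["train"] & hashes["validation"])
    assert not (hashes["train"] & hashes["test"])
    assert not (hashes["validation"] & hashes["test"])
    return splits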

Training Hyperparameters

LoRA Configuration:

  • LoRA rank (r): 16
  • LoRA alpha: 32
  • LoRA dropout: 0.05
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
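
This configuration maps directly onto a peft.LoraConfig; a minimal sketch, assuming a causal-LM task type and untrained bias terms (both assumptions, not stated above).

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                     # LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",              # assumption: bias terms are not trained
    task_type="CAUSAL_LM",
)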

Training Parameters:

  • Training regime: fp16 mixed precision
  • Quantization: 4-bit NF4 (BitsAndBytes)
  • Epochs: 5
  • Batch size: 4 per device
  • Gradient accumulation: 4 steps (effective batch = 16)
  • Learning rate: 5e-5
  • LR scheduler: Cosine with 3% warmup
  • Weight decay: 0.01
  • Optimizer: paged_adamw_8bit
  • Max sequence length: 2048 tokens
  • Gradient checkpointing: Enabled
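
A sketch of how these parameters could be expressed with the standard transformers Trainer API and bitsandbytes; the output directory is illustrative, and the 50-step save/eval cadence is taken from the "Speeds, Sizes, Times" section below.

import torch
from transformers import BitsAndBytesConfig, TrainingArguments

# 4-bit NF4 quantization for the frozen base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

training_args = TrainingArguments(
    output_dir="./llama3-medical-ner-lora",  # illustrative path
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,           # effective batch size 16
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    optim="paged_adamw_8bit",
    fp16=True,
    gradient_checkpointing=True,
    eval_strategy="steps",                   # "evaluation_strategy" on older transformers releases
    eval_steps=50,
    save_steps=50,
    logging_steps=50,
    report_to="wandb",
)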

Data-Driven Justification: All hyperparameters were validated against dataset characteristics:

  • Batch size 4-8 optimal for 3,000 samples
  • 5 epochs sufficient for format learning without overfitting
  • Conservative LR (5e-5) for 13,710 vocabulary size
  • Max length 2048 covers 99%+ of prompts (median 1,357 chars)

Speeds, Sizes, Times

  • Training time: ~2-3 hours on NVIDIA A100 GPU
  • Model size: ~3.5 GB (quantized base model + LoRA adapters)
  • Trainable parameters: ~1.5% of total model parameters
  • Checkpoint frequency: Every 50 steps
  • Evaluation frequency: Every 50 steps

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • Dataset: Held-out test set from cleaned splits (299 samples)
  • Split date: November 13, 2025
  • Distribution: 100 chemicals, 99 diseases, 100 relationships
  • Source: ChemProt corpus (biomedical literature)

Factors

Evaluation disaggregated by task type:

  • Chemical extraction: Drug and chemical compound identification
  • Disease extraction: Disease and medical condition identification
  • Relationship extraction: Chemical-disease interaction pairs

Metrics

  • F1 Score (primary): Harmonic mean of precision and recall
  • Precision: Fraction of predicted entities that are correct
  • Recall: Fraction of gold standard entities that were found
  • Macro-average: Equal weight to each task (chemicals, diseases, relationships)

Evaluation methodology:

  • Enhanced filtering to reduce false positives
  • Normalized entity matching (lowercase, whitespace)
  • Hyphen preservation during normalization
  • Task-specific parsing (bullet lists for entities, pipe format for relationships)
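
A minimal sketch of this metric computation (set-based matching of normalized entities, per-task precision/recall/F1, macro-average); the normalization shown is an assumption consistent with the description above.

import re

def normalize(entity: str) -> str:
    """Lowercase and collapse whitespace; hyphens are preserved."""
    return re.sub(r"\s+", " ", entity.lower()).strip()

def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 over normalized entities."""
    pred = {normalize(e) for e in predicted}
    ref = {normalize(e) for e in gold}
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def macro_average(per_task_f1):
    """Equal-weight average of per-task F1 (chemicals, diseases, relationships)."""
    return sum(per_task_f1) / len(per_task_f1)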

Results

Fine-tuned Llama-3.2-3B baseline (before evaluating BioMistral):

  • Overall F1: 53.8% (macro-average across 3 tasks)
  • Precision: ~52-55%
  • Recall: ~54-56%

Key Insights:

  • Model successfully learned the pipe format for relationships (0% F1 before fine-tuning)
  • Balanced performance across all three tasks
  • Format conversion (2,050 samples) successfully integrated during training
  • Clean data (99.8% retention) contributed to stable training

Baseline Comparison:

  • Before fine-tuning: 0% F1 on relationships (could not extract pairs)
  • After fine-tuning: ~50% F1 on relationships (significant improvement)
  • Chemical/disease extraction improved from generic to domain-specific recognition

Planned Evaluation

Next Step: Baseline evaluation of BioMistral-7B-SLERP-AWQ (quantized, no fine-tuning)

  • Hypothesis: Medical domain pre-training may outperform fine-tuned Llama-3.2-3B
  • Target: 70-80% F1 (medical-domain models typically show a 15-20 point advantage)
  • Decision criteria:
    • If BioMistral ≥70% F1 → deploy the quantized model as-is
    • If BioMistral 60-70% F1 → fine-tune BioMistral (expected 75-85% F1)
    • If BioMistral <60% F1 → fine-tuning is mandatory

Tracking: GitHub Issue #3

Model Examination

Error Analysis

Common error patterns observed:

  1. False positives: Generic medical terms (e.g., "pain", "treatment") occasionally extracted
  2. False negatives: Complex multi-word entities sometimes partially extracted
  3. Boundary issues: Entity boundaries unclear for nested or compound terms
  4. Format sensitivity: Deviations from training prompt format reduce performance

Filtering Strategy

Enhanced filtering applied during evaluation:

  • Blacklist of generic terms (drug, disease, chemical, etc.)
  • Entity type validation (disease markers shouldn't appear in chemical extractions)
  • Text grounding (only entities found in source text)
  • Minimum length threshold (≥3 characters)
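
A sketch of the kind of post-hoc filtering described above; the blacklist contents and the disease-marker heuristic are illustrative assumptions rather than the exact lists used in evaluation.

GENERIC_BLACKLIST = {"drug", "drugs", "chemical", "chemicals", "disease", "diseases", "treatment"}
DISEASE_MARKERS = ("syndrome", "disease", "itis", "oma")  # crude illustrative heuristic

def filter_entities(entities, source_text, task):
    """Keep entities that are grounded in the source text and pass basic sanity checks."""
    text_lower = source_text.lower()
    kept = []
    for entity in entities:
        ent = entity.strip()
        if len(ent) < 3:                       # minimum length threshold
            continue
        if ent.lower() in GENERIC_BLACKLIST:   # generic-term blacklist
            continue
        if ent.lower() not in text_lower:      # text grounding
            continue
        if task == "chemicals" and any(ent.lower().endswith(m) for m in DISEASE_MARKERS):
            continue                           # entity-type validation
        kept.append(ent)
    return kept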

Environmental Impact

Carbon emissions estimated using the Machine Learning Impact calculator.

  • Hardware Type: NVIDIA A100 80GB GPU
  • Hours used: ~2.5 hours
  • Cloud Provider: RunPod / Cloud GPU provider
  • Compute Region: US (variable)
  • Carbon Emitted: ~0.5 kg CO2eq (estimated)

Note: LoRA fine-tuning is significantly more efficient than full fine-tuning, updating only ~1.5% of the model's parameters and requiring ~3 hours of compute vs. days or weeks for full training.

Technical Specifications

Model Architecture and Objective

Base Architecture: Llama-3.2-3B-Instruct (Meta AI)

  • Parameters: 3 billion (base model)
  • Architecture: Transformer decoder with grouped-query attention
  • Context length: 128K tokens
  • Vocabulary: 128,000 tokens (tiktoken-based BPE tokenizer)

LoRA Adaptation:

  • Trainable parameters: 47 million (1.5% of total)
  • LoRA rank: 16 (low-rank decomposition dimension)
  • Adapter placement: All attention and MLP projection layers
  • Training objective: Next-token prediction (causal language modeling)
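
The trainable-parameter figure can be checked directly with PEFT; a minimal, self-contained sketch (quantization omitted here for brevity).

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)
peft_model = get_peft_model(base, lora)
peft_model.print_trainable_parameters()  # reports trainable vs. total parameter counts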

Compute Infrastructure

Hardware

  • Training: NVIDIA A100 80GB GPU
  • Memory: 80GB VRAM (4-bit quantization reduces actual usage to ~7GB)
  • CPU: High-memory instance (for data preprocessing)

Software

  • Framework: Hugging Face Transformers 4.x
  • Training: Hugging Face Trainer with PEFT (Parameter-Efficient Fine-Tuning)
  • Quantization: BitsAndBytes (4-bit NF4 quantization)
  • Monitoring: Weights & Biases
  • Python: 3.10+
  • PyTorch: 2.x with CUDA 12.x
  • Key libraries:
    • transformers (model loading, training)
    • peft (LoRA implementation)
    • bitsandbytes (quantization)
    • accelerate (distributed training)
    • datasets (data loading)
    • wandb (experiment tracking)

Citation

If you use this model in your research, please cite:

BibTeX:

@misc{clemente2025medical-ner-lora,
  author = {Clemente, Alberto},
  title = {Llama-3.2-3B Medical NER with LoRA},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/albyos/llama3-medical-ner-lora}},
}

APA:

Clemente, A. (2025). Llama-3.2-3B Medical NER with LoRA [Computer software]. Hugging Face. https://huggingface.co/albyos/llama3-medical-ner-lora

Glossary

  • NER (Named Entity Recognition): Task of identifying and classifying named entities in text
  • LoRA (Low-Rank Adaptation): Parameter-efficient fine-tuning method that adds trainable low-rank matrices to model layers
  • ChemProt: Chemical-protein interaction corpus from biomedical literature
  • Stratified splitting: Data splitting that preserves class distribution across splits
  • Quantization: Reducing model precision (e.g., 32-bit โ†’ 4-bit) to save memory
  • Macro-average: Averaging metrics across classes with equal weight (vs. micro-average)
  • Pipe format: Relationship representation as "entity1 | entity2" (used for chemical-disease pairs)

More Information

Project Documentation:

Related Work:

GitHub Issues:

Model Card Authors

  • Alberto Clemente (@albyos)

Model Card Contact


Framework Versions

  • PEFT: 0.17.1+
  • Transformers: 4.40.0+
  • PyTorch: 2.2.0+
  • BitsAndBytes: 0.42.0+
  • Accelerate: 0.27.0+
  • Datasets: 2.18.0+
  • Tokenizers: 0.19.0+

Last Updated: November 15, 2025
