๐Ÿฅ Building MedNER-TR: A Turkish Medical NER Baseline in 7 Hours


A proof of concept demonstrating transfer learning for Turkish medical text, with honest performance metrics and real limitations


Author: Tuğrul Kaya
Date: October 30, 2025
Reading Time: 8 minutes


🎯 What I Actually Built

A baseline Turkish medical named entity recognition system that achieves a 99.49% F1 score on template-generated synthetic data.

Important Context:

  • โš ๏ธ Trained on synthetic data, NOT real clinical notes
  • โš ๏ธ Real-world performance likely 75-85% (industry standard)
  • โœ… Demonstrates feasibility of Turkish medical NER
  • โœ… First open-source Turkish medical NER model
  • โœ… Strong starting point for future work

Try it: the model is available on the Hugging Face Hub as tugrulkaya/medner-tr.


📖 Why This Matters

Turkish is spoken by 80+ million people, yet medical NLP resources are scarce:

  • โŒ No public Turkish medical NER datasets
  • โŒ No pre-trained Turkish medical models
  • โŒ Limited clinical text processing tools

The Goal: Create an open-source baseline that the Turkish NLP community can improve.


๐Ÿ› ๏ธ Technical Approach

Architecture

Text → BERTurk Tokenizer → Token Classification → Entity Extraction

Base Model: BERTurk
Task: Token classification with BIO tagging
Entities: 5 types (medications, diseases, symptoms, organs, tests)
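
For concreteness, the 11 labels expand as in the sketch below. The ILAC (medication) and SEMPTOM (symptom) tags appear in the model's output later in this post; the other three label names are my assumed reconstruction of the scheme.

# BIO scheme: O plus B-/I- for each of the 5 entity types.
# ILAC and SEMPTOM match the outputs shown later; HASTALIK (disease),
# ORGAN, and TEST are assumed names for the remaining types.
ENTITY_TYPES = ["ILAC", "HASTALIK", "SEMPTOM", "ORGAN", "TEST"]
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
# 11 labels in total: O, B-ILAC, I-ILAC, B-HASTALIK, I-HASTALIK, ...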

Why BERTurk?

  • Pre-trained on 35GB Turkish corpus
  • Understands Turkish morphology
  • Easy integration with Transformers
  • Active community support

📊 The Dataset Challenge

Problem: No Turkish medical NER dataset exists.

Solution: Generate synthetic data with validation.

Data Generation Process

1. Templates (200+ patterns)

templates = [
    "Hastaya {drug} başlandı.",        # "The patient was started on {drug}."
    "{disease} için {drug} verildi.",  # "{drug} was given for {disease}."
    "Hasta {symptom} ile başvurdu.",   # "The patient presented with {symptom}."
]
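
A minimal generation sketch, assuming one term list per slot (the full vocabularies are described in step 2; the fill helper is hypothetical):

import random

# Illustrative slot vocabularies; step 2 below lists the full ones
vocab = {
    "drug": ["Parol", "Metformin", "Aspirin"],
    "disease": ["diyabet", "hipertansiyon"],
    "symptom": ["ateş", "öksürük"],
}

def fill(template):
    # Substitute each {slot} with a random term; format() ignores
    # vocabulary keys the template does not reference
    return template.format(**{k: random.choice(v) for k, v in vocab.items()})

print(fill("Hastaya {drug} başlandı."))  # e.g. "Hastaya Metformin başlandı."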

2. Medical Vocabularies

  • 150+ medications: Parol, Metformin, Aspirin...
  • 200+ diseases: diyabet (diabetes), hipertansiyon (hypertension), grip (flu)...
  • 150+ symptoms: ateş (fever), öksürük (cough), baş ağrısı (headache)...
  • 80+ organs: kalp (heart), akciğer (lung), karaciğer (liver)...
  • 120+ tests: EKG, kan tahlili (blood test), MR (MRI)...

3. Automatic Annotation

# Simple longest-first substring matching with overlap handling:
# longer entity mentions win when spans would overlap
spans = []
for entity, entity_type in sorted(entities, key=lambda e: len(e[0]), reverse=True):
    start = text.lower().find(entity.lower())
    end = start + len(entity)
    if start != -1 and not any(s < end and start < e for s, e, _ in spans):
        spans.append((start, end, entity_type))
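
The matcher produces character spans; training a token classifier additionally needs those spans converted to word-level BIO tags. A sketch of that conversion, assuming whitespace tokenization (my reconstruction, not the author's exact code):

def spans_to_bio(text, spans):
    # spans: list of (start, end, entity_type) character offsets
    words, tags, cursor = text.split(), [], 0
    for word in words:
        start = text.index(word, cursor)
        end = cursor = start + len(word)
        tag = "O"  # default: outside any entity
        for s, e, etype in spans:
            if s <= start and end <= e:
                tag = ("B-" if start == s else "I-") + etype
                break
        tags.append(tag)
    return words, tags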

4. Quality Control

  • Removed duplicates
  • Handled overlapping entities
  • Manual review of 200 samples

Final Dataset: 2000 sentences, ~3500 entities


🔬 Training

Setup

from transformers import (
    AutoModelForTokenClassification,
    Trainer,
    TrainingArguments,
)

model = AutoModelForTokenClassification.from_pretrained(
    "dbmdz/bert-base-turkish-cased",
    num_labels=11  # O + 5 entity types × 2 (B-/I-)
)

training_args = TrainingArguments(
    output_dir="medner-tr",  # required output directory
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    eval_strategy="epoch"
)
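
One step the snippet glosses over is subword alignment: BERTurk's tokenizer splits words into WordPiece pieces, so each word-level BIO label has to be attached to the first piece and masked elsewhere. The pattern below is the common Hugging Face recipe, not necessarily the author's exact code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

def tokenize_and_align_labels(words, word_label_ids):
    # Keep the label on the first subword of each word; mask the rest
    # with -100 so the cross-entropy loss ignores them.
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, prev = [], None
    for word_id in enc.word_ids():
        if word_id is None:
            labels.append(-100)      # special tokens: [CLS], [SEP]
        elif word_id != prev:
            labels.append(word_label_ids[word_id])
        else:
            labels.append(-100)      # continuation subword
        prev = word_id
    enc["labels"] = labels
    return enc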

Results

Epoch | Loss  | F1    | Accuracy
------|-------|-------|---------
1     | 0.088 | 95.3% | 98.5%
2     | 0.036 | 99.1% | 99.6%
3     | 0.010 | 99.5% | 99.8%

Training: Google Colab T4 GPU, ~10 minutes


โš ๏ธ Honest Performance Assessment

What the 99.49% F1 Really Means

This high score reflects performance on:

  • ✅ Clean, template-generated text
  • ✅ Controlled vocabulary
  • ✅ Structured sentences

Real Clinical Notes Are Different

Real-world medical text has:

  • โŒ Abbreviations (DM, HT, KOAH)
  • โŒ Misspellings and typos
  • โŒ Turkish-English code-switching
  • โŒ Domain jargon and slang
  • โŒ Incomplete sentences
  • โŒ Ambiguous contexts

Expected Real-World Performance

Based on similar medical NER systems:

Data Type                | Expected F1
-------------------------|------------
Synthetic test (current) | 99.5%
Clean clinical notes     | ~85-90%
Raw clinical notes       | ~75-80%
Diverse medical docs     | ~70-75%

Reality: This model needs validation on real clinical data before production use.


๐Ÿ“ Example Usage

Quick Start

from transformers import pipeline

# Load model
ner = pipeline(
    "token-classification",
    model="tugrulkaya/medner-tr",
    aggregation_strategy="simple"
)

# Predict
text = "Hastaya Parol 500mg başlandı."
results = ner(text)

for entity in results:
    print(f"{entity['entity_group']}: {entity['word']}")
# Output: ILAC: Parol
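
With aggregation_strategy="simple", each entry in results is a plain dict carrying the merged entity span, a confidence score, and character offsets. A representative entry (the score value is illustrative):

# {"entity_group": "ILAC", "score": 0.998, "word": "Parol", "start": 8, "end": 13}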

What Works Well

# ✅ Simple, clean sentences
"Hasta ateş ve öksürük ile başvurdu."  # "The patient presented with fever and cough."
# → SEMPTOM: ateş, SEMPTOM: öksürük

# ✅ Standard medication names
"Metformin 850mg günde 2 kez."  # "Metformin 850mg twice a day."
# → ILAC: Metformin

What Might Not Work

# โŒ Abbreviations
"Hasta DM ve HT tanฤฑlฤฑ."
# โ†’ Might miss "DM" (diabetes) and "HT" (hypertension)

# โŒ Typos
"Hastaya Poral verildi."
# โ†’ Might miss "Poral" (typo of "Parol")

# โŒ Complex medical jargon
"Hasta akut MI geรงirdi."
# โ†’ Might miss "MI" (myocardial infarction)

💡 What This Project Demonstrates

✅ Successes

  1. Feasibility - Turkish medical NER is possible
  2. Transfer Learning - BERTurk adapts well to medical domain
  3. Rapid Prototyping - 7 hours from idea to working system
  4. Open Source - First Turkish medical NER baseline
  5. Reproducible - All code and methodology documented

โŒ Limitations

  1. Synthetic Data - Not validated on real clinical notes
  2. Limited Vocabulary - Missing many medical terms
  3. No Abbreviations - Can't handle "DM", "HT", "KOAH"
  4. No Relations - Doesn't extract drug-disease relationships
  5. No Normalization - Doesn't map to medical codes (ICD-10)

🎯 Real Achievement

This isn't about claiming production-ready accuracy.

The real value:

  • Proves Turkish medical NER is viable
  • Provides baseline for comparison
  • Open-source starting point
  • Educational resource
  • Community can improve it

🚀 Use Cases (With Caveats)

✅ Good For

  • Education - Learning Turkish NLP
  • Research - Baseline comparisons
  • Prototyping - Quick demos
  • Feature Extraction - Simple clinical text

โŒ NOT Ready For

  • Production EHR systems
  • Clinical decision support
  • Medical coding automation
  • Regulatory compliance
  • Patient safety applications

Bottom Line: Use for research and prototyping, NOT clinical production.


🔮 Next Steps for Real-World Readiness

1. Real Data Collection

  • Partner with hospitals (ethics approval)
  • Collect diverse clinical notes
  • Manual annotation by medical professionals

2. Handle Edge Cases

  • Add common abbreviations (see the sketch after this list)
  • Train on noisy text
  • Include code-switching
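
A cheap first mitigation is expanding known abbreviations before inference. A hypothetical pre-processing sketch (the mapping is illustrative, not exhaustive):

import re

# Illustrative Turkish clinical abbreviation map (hypothetical, incomplete)
ABBREV = {
    "DM": "diyabet",
    "HT": "hipertansiyon",
    "KOAH": "kronik obstrüktif akciğer hastalığı",
    "MI": "miyokard enfarktüsü",
}

def expand_abbreviations(text):
    pattern = r"\b(" + "|".join(map(re.escape, ABBREV)) + r")\b"
    return re.sub(pattern, lambda m: ABBREV[m.group(1)], text)

print(expand_abbreviations("Hasta DM ve HT tanılı."))
# -> "Hasta diyabet ve hipertansiyon tanılı."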

3. Evaluation

  • Test on real clinical data
  • Report honest metrics (entity-level F1; see the sketch after this list)
  • Compare with human annotators
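
For entity-level scores on real annotations, seqeval is the standard choice; a minimal sketch with illustrative labels:

# pip install seqeval
from seqeval.metrics import classification_report, f1_score

y_true = [["B-ILAC", "O", "B-SEMPTOM", "I-SEMPTOM"]]
y_pred = [["B-ILAC", "O", "B-SEMPTOM", "O"]]

print(f1_score(y_true, y_pred))              # entity-level micro F1
print(classification_report(y_true, y_pred))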

4. Domain Expertise

  • Validate with doctors
  • Iterate based on feedback
  • Add missing medical terms

5. Production Features

  • Entity normalization (ICD-10 codes)
  • Relation extraction
  • Confidence thresholds (sketched below)
  • Error handling
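
Of these, a confidence threshold is the easiest to bolt onto the existing pipeline output; the 0.90 cutoff below is illustrative and would need tuning against real data:

def confident_entities(results, min_score=0.90):
    # Keep only predictions above the cutoff; in a clinical setting the
    # rest should be routed to human review, not silently dropped.
    return [e for e in results if e["score"] >= min_score]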

📚 Lessons Learned

1. Synthetic Data Has Limits

What worked:

  • Fast iteration
  • Controlled experiments
  • Proof of concept

What didn't:

  • Real-world robustness
  • Edge case handling
  • Production readiness

2. High Metrics ≠ Success

99% F1 on synthetic data doesn't mean:

  • ❌ Production-ready
  • ❌ Solves real problems
  • ❌ Works on messy data

It means:

  • ✅ Model can learn
  • ✅ Approach is promising
  • ✅ Needs more work

3. Transfer Learning is Powerful

BERTurk saved months:

  • Already knows Turkish
  • No morphology learning needed
  • Fast fine-tuning

4. Open Source Matters

Sharing early (even imperfect) helps:

  • Community feedback
  • Collaborative improvement
  • Avoid duplicate work

5. Honesty Builds Trust

Being clear about limitations:

  • Sets realistic expectations
  • Prevents misuse
  • Encourages proper validation

🧪 Try It Yourself

pip install transformers torch

from transformers import pipeline

ner = pipeline("token-classification", 
               model="tugrulkaya/medner-tr",
               aggregation_strategy="simple")

# Test on your data
text = "Your Turkish medical text here"
results = ner(text)

for e in results:
    print(f"{e['entity_group']}: {e['word']} ({e['score']:.2%})")

Expected behavior:

  • ✅ Works well on clean, simple sentences
  • ⚠️ May struggle with abbreviations and typos
  • ❌ Not tested on real clinical notes

🎯 Conclusion

What I Built

A baseline Turkish medical NER system demonstrating that transfer learning from BERTurk can achieve strong results on synthetic data.

What I Didn't Build

A production-ready system validated on real clinical data.

Why It Matters

First open-source Turkish medical NER model. A starting point for the community to improve.

Next Steps

Validation on real clinical data with domain experts.


🔗 Resources

  • Model: tugrulkaya/medner-tr on the Hugging Face Hub

💬 Feedback Welcome

This is a starting point, not a finished product.

How to help:

  • Test on your data
  • Report issues
  • Suggest improvements
  • Contribute medical terms
  • Share real-world results

๐Ÿ™ Acknowledgments

  • Hugging Face - Infrastructure and tools
  • dbmdz - BERTurk model
  • Turkish NLP community - Inspiration and support

Tags: turkish medical ner baseline proof-of-concept bert synthetic-data

Disclaimer: This model is for research and educational purposes. Not validated for clinical use.
