Building MedNER-TR: A Turkish Medical NER Baseline in 7 Hours
A proof-of-concept demonstrating transfer learning for Turkish medical text, with honest performance metrics and real limitations
Author: Tuğrul Kaya
Date: October 30, 2025
Reading Time: 8 minutes
What I Actually Built
A baseline Turkish medical named entity recognition system that achieves 99.49% F1 score on template-generated synthetic data.
Important Context:
- ⚠️ Trained on synthetic data, NOT real clinical notes
- ⚠️ Real-world performance likely 75-85% (industry standard)
- ✅ Demonstrates feasibility of Turkish medical NER
- ✅ First open-source Turkish medical NER model
- ✅ Strong starting point for future work
Try it: demo and model links are listed in the Resources section at the end of this post.
Why This Matters
Turkish is spoken by 80+ million people, yet medical NLP resources are scarce:
- ❌ No public Turkish medical NER datasets
- ❌ No pre-trained Turkish medical models
- ❌ Limited clinical text processing tools
The Goal: Create an open-source baseline that the Turkish NLP community can improve.
Technical Approach
Architecture
Text → BERTurk Tokenizer → Token Classification → Entity Extraction
Base Model: BERTurk
Task: Token classification with BIO tagging
Entities: 5 types (medications, diseases, symptoms, organs, tests)
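With five entity types under BIO tagging, the label space works out to 11 tags. A quick sketch of how that number comes about (the Turkish type names other than ILAC and SEMPTOM, which appear in the example outputs later, are my assumptions):

```python
# BIO label layout for the 5 entity types. Type names besides ILAC and
# SEMPTOM (seen in the model's output) are illustrative assumptions.
ENTITY_TYPES = ["ILAC", "HASTALIK", "SEMPTOM", "ORGAN", "TEST"]

labels = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in enumerate(labels)}

print(len(labels))  # 11 labels: O plus B-/I- per entity type
```

This is where the `num_labels=11` in the training setup below comes from.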
Why BERTurk?
- Pre-trained on 35GB Turkish corpus
- Understands Turkish morphology
- Easy integration with Transformers
- Active community support
The Dataset Challenge
Problem: No Turkish medical NER dataset exists.
Solution: Generate synthetic data with validation.
Data Generation Process
1. Templates (200+ patterns)
```python
templates = [
    "Hastaya {drug} başlandı.",
    "{disease} için {drug} verildi.",
    "Hasta {symptom} ile başvurdu."
]
```
2. Medical Vocabularies
- 150+ medications (Parol, Metformin, Aspirin...)
- 200+ diseases (diyabet, hipertansiyon, grip...)
- 150+ symptoms (ateş, öksürük, baş ağrısı...)
- 80+ organs (kalp, akciğer, karaciğer...)
- 120+ tests (EKG, kan tahlili, MR...)
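Combining the templates with the vocabularies is plain slot filling. A minimal sketch (the tiny vocabulary slices and the helper name are illustrative, not the project's exact code):

```python
import random

templates = [
    "Hastaya {drug} başlandı.",
    "{disease} için {drug} verildi.",
    "Hasta {symptom} ile başvurdu.",
]
# Tiny illustrative slices of the vocabularies listed above
vocab = {
    "drug": ["Parol", "Metformin", "Aspirin"],
    "disease": ["diyabet", "hipertansiyon", "grip"],
    "symptom": ["ateş", "öksürük", "baş ağrısı"],
}

def generate_sentence(template: str) -> str:
    # Fill each {slot} in the template with a random vocabulary entry
    slots = {k: random.choice(v) for k, v in vocab.items() if "{" + k + "}" in template}
    return template.format(**slots)

random.seed(0)
print(generate_sentence(templates[0]))
```

Because the slot values are known at generation time, entity annotations come for free, which is what makes the approach fast.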
3. Automatic Annotation
```python
# Simple substring matching with overlap handling
for entity in sorted(entities, key=len, reverse=True):
    if entity.lower() in text.lower():
        annotate(entity, entity_type)
```
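The overlap handling deserves slightly more detail than the pseudocode shows. A fleshed-out sketch, matching longest entities first and marking claimed characters so shorter terms can't overlap them (the span-tracking approach here is mine, not necessarily the project's exact implementation; only the first occurrence of each entity is matched):

```python
def annotate_spans(text, vocab_by_type):
    """Return (start, end, entity_type) spans via case-insensitive substring matching."""
    spans = []
    lowered = text.lower()
    # Longest entities first so e.g. "baş ağrısı" wins over a shorter "baş"
    entities = sorted(
        ((e, t) for t, es in vocab_by_type.items() for e in es),
        key=lambda x: len(x[0]), reverse=True,
    )
    taken = [False] * len(text)  # characters already claimed by an entity
    for entity, etype in entities:
        start = lowered.find(entity.lower())
        if start != -1 and not any(taken[start:start + len(entity)]):
            end = start + len(entity)
            spans.append((start, end, etype))
            for i in range(start, end):
                taken[i] = True
    return sorted(spans)

text = "Hasta ateş ve öksürük ile başvurdu."
print(annotate_spans(text, {"SEMPTOM": ["ateş", "öksürük"]}))
# [(6, 10, 'SEMPTOM'), (14, 21, 'SEMPTOM')]
```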
4. Quality Control
- Removed duplicates
- Handled overlapping entities
- Manual review of 200 samples
Final Dataset: 2000 sentences, ~3500 entities
Training
Setup
```python
from transformers import (
    AutoModelForTokenClassification,
    Trainer,
    TrainingArguments,
)

model = AutoModelForTokenClassification.from_pretrained(
    "dbmdz/bert-base-turkish-cased",
    num_labels=11  # O + 5 entity types × 2 (B/I)
)

training_args = TrainingArguments(
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    eval_strategy="epoch"
)
```
Results
| Epoch | Loss | F1 | Accuracy |
|---|---|---|---|
| 1 | 0.088 | 95.3% | 98.5% |
| 2 | 0.036 | 99.1% | 99.6% |
| 3 | 0.010 | 99.5% | 99.8% |
Training: Google Colab T4 GPU, ~10 minutes
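The F1 above is entity-level: a prediction counts only when both the span and the type match the gold annotation. Libraries like seqeval compute this for BIO tag sequences; a simplified stand-in shows the idea:

```python
def bio_to_entities(tags):
    """Extract (start, end, type) entities from a BIO tag sequence."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes a trailing entity
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != etype):
            if start is not None:
                entities.append((start, i, etype))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

def entity_f1(true_tags, pred_tags):
    gold = set(bio_to_entities(true_tags))
    pred = set(bio_to_entities(pred_tags))
    tp = len(gold & pred)  # exact span + type matches only
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

true = ["B-ILAC", "I-ILAC", "O", "B-SEMPTOM"]
pred = ["B-ILAC", "I-ILAC", "O", "O"]
print(entity_f1(true, pred))  # ≈ 0.667: one of two gold entities recovered
```

This strictness is why entity-level F1 is usually lower than token accuracy, as in the table above.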
Honest Performance Assessment
What the 99.49% F1 Really Means
This high score reflects performance on:
- ✅ Clean, template-generated text
- ✅ Controlled vocabulary
- ✅ Structured sentences
Real Clinical Notes Are Different
Real-world medical text has:
- ❌ Abbreviations (DM, HT, KOAH)
- ❌ Misspellings and typos
- ❌ Turkish-English code-switching
- ❌ Domain jargon and slang
- ❌ Incomplete sentences
- ❌ Ambiguous contexts
Expected Real-World Performance
Based on similar medical NER systems:
| Data Type | Expected F1 |
|---|---|
| Synthetic test (current) | 99.5% |
| Clean clinical notes | ~85-90% |
| Raw clinical notes | ~75-80% |
| Diverse medical docs | ~70-75% |
Reality: This model needs validation on real clinical data before production use.
Example Usage
Quick Start
```python
from transformers import pipeline

# Load model
ner = pipeline(
    "token-classification",
    model="tugrulkaya/medner-tr",
    aggregation_strategy="simple"
)

# Predict
text = "Hastaya Parol 500mg başlandı."
results = ner(text)

for entity in results:
    print(f"{entity['entity_group']}: {entity['word']}")
# Output: ILAC: Parol
```
What Works Well
```python
# ✅ Simple, clean sentences
"Hasta ateş ve öksürük ile başvurdu."
# → SEMPTOM: ateş, SEMPTOM: öksürük

# ✅ Standard medication names
"Metformin 850mg günde 2 kez."
# → ILAC: Metformin
```
What Might Not Work
```python
# ❌ Abbreviations
"Hasta DM ve HT tanılı."
# → Might miss "DM" (diabetes) and "HT" (hypertension)

# ❌ Typos
"Hastaya Poral verildi."
# → Might miss "Poral" (typo of "Parol")

# ❌ Complex medical jargon
"Hasta akut MI geçirdi."
# → Might miss "MI" (myocardial infarction)
```
What This Project Demonstrates
✅ Successes
- Feasibility - Turkish medical NER is possible
- Transfer Learning - BERTurk adapts well to medical domain
- Rapid Prototyping - 7 hours from idea to working system
- Open Source - First Turkish medical NER baseline
- Reproducible - All code and methodology documented
❌ Limitations
- Synthetic Data - Not validated on real clinical notes
- Limited Vocabulary - Missing many medical terms
- No Abbreviations - Can't handle "DM", "HT", "KOAH"
- No Relations - Doesn't extract drug-disease relationships
- No Normalization - Doesn't map to medical codes (ICD-10)
Real Achievement
This isn't about claiming production-ready accuracy.
The real value:
- Proves Turkish medical NER is viable
- Provides baseline for comparison
- Open-source starting point
- Educational resource
- Community can improve it
Use Cases (With Caveats)
✅ Good For
- Education - Learning Turkish NLP
- Research - Baseline comparisons
- Prototyping - Quick demos
- Feature Extraction - Simple clinical text
❌ NOT Ready For
- Production EHR systems
- Clinical decision support
- Medical coding automation
- Regulatory compliance
- Patient safety applications
Bottom Line: Use for research and prototyping, NOT clinical production.
Next Steps for Real-World Readiness
1. Real Data Collection
- Partner with hospitals (ethics approval)
- Collect diverse clinical notes
- Manual annotation by medical professionals
2. Handle Edge Cases
- Add common abbreviations
- Train on noisy text
- Include code-switching
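For the abbreviation item, a simple pre-processing expansion is one place to start. The mapping below covers only the abbreviations mentioned in this post and would need clinical review before real use; short forms like "MI" are ambiguous and risky to expand blindly:

```python
import re

# Expansions for the abbreviations discussed in this post (illustrative list)
ABBREVIATIONS = {
    "DM": "diabetes mellitus",
    "HT": "hipertansiyon",
    "KOAH": "kronik obstrüktif akciğer hastalığı",
    "MI": "miyokard infarktüsü",
}

def expand_abbreviations(text: str) -> str:
    # Whole-word, case-sensitive matching so "DM" doesn't fire inside other words
    pattern = re.compile(r"\b(" + "|".join(ABBREVIATIONS) + r")\b")
    return pattern.sub(lambda m: ABBREVIATIONS[m.group(1)], text)

print(expand_abbreviations("Hasta DM ve HT tanılı."))
# Hasta diabetes mellitus ve hipertansiyon tanılı.
```

Running this before the NER pipeline turns the "might not work" abbreviation examples above into text the model has a chance of handling.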
3. Evaluation
- Test on real clinical data
- Report honest metrics
- Compare with human annotators
4. Domain Expertise
- Validate with doctors
- Iterate based on feedback
- Add missing medical terms
5. Production Features
- Entity normalization (ICD-10 codes)
- Relation extraction
- Confidence thresholds
- Error handling
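Of these, a confidence threshold is the simplest to sketch: drop low-score predictions from the pipeline output before using them (the 0.90 cutoff below is illustrative, not tuned):

```python
def filter_by_confidence(entities, threshold=0.90):
    """Keep only predictions the model is reasonably sure about.

    `entities` is a list of dicts shaped like the HF pipeline output,
    each with 'entity_group', 'word', and 'score' keys.
    """
    return [e for e in entities if e["score"] >= threshold]

# Hand-written example output; the low-score entry mimics a typo match
raw = [
    {"entity_group": "ILAC", "word": "Parol", "score": 0.99},
    {"entity_group": "ILAC", "word": "Poral", "score": 0.42},
]
print(filter_by_confidence(raw))  # keeps only the high-confidence "Parol"
```

Thresholding trades recall for precision, which is usually the right trade in a medical setting where false positives are costly.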
Lessons Learned
1. Synthetic Data Has Limits
What worked:
- Fast iteration
- Controlled experiments
- Proof of concept
What didn't:
- Real-world robustness
- Edge case handling
- Production readiness
2. High Metrics ≠ Success
99% F1 on synthetic data doesn't mean:
- ❌ Production-ready
- ❌ Solves real problems
- ❌ Works on messy data
It means:
- ✅ Model can learn
- ✅ Approach is promising
- ✅ Needs more work
3. Transfer Learning is Powerful
BERTurk saved months:
- Already knows Turkish
- No morphology learning needed
- Fast fine-tuning
4. Open Source Matters
Sharing early (even imperfect) helps:
- Community feedback
- Collaborative improvement
- Avoid duplicate work
5. Honesty Builds Trust
Being clear about limitations:
- Sets realistic expectations
- Prevents misuse
- Encourages proper validation
Try It Yourself
```
pip install transformers torch
```
```python
from transformers import pipeline

ner = pipeline("token-classification",
               model="tugrulkaya/medner-tr",
               aggregation_strategy="simple")

# Test on your data
text = "Your Turkish medical text here"
results = ner(text)

for e in results:
    print(f"{e['entity_group']}: {e['word']} ({e['score']:.2%})")
```
Expected behavior:
- ✅ Works well on clean, simple sentences
- ⚠️ May struggle with abbreviations and typos
- ❌ Not tested on real clinical notes
Conclusion
What I Built
A baseline Turkish medical NER system demonstrating that transfer learning from BERTurk can achieve strong results on synthetic data.
What I Didn't Build
A production-ready system validated on real clinical data.
Why It Matters
First open-source Turkish medical NER model. A starting point for the community to improve.
Next Steps
Validation on real clinical data with domain experts.
Resources
- Demo: huggingface.co/spaces/tugrulkaya/medner-tr-demo
- Model: huggingface.co/tugrulkaya/medner-tr
- Code: github.com/mtkaya/medner-tr
- License: MIT (free for research and commercial use)
Feedback Welcome
This is a starting point, not a finished product.
How to help:
- Test on your data
- Report issues
- Suggest improvements
- Contribute medical terms
- Share real-world results
Acknowledgments
- Hugging Face - Infrastructure and tools
- dbmdz - BERTurk model
- Turkish NLP community - Inspiration and support
Tags: turkish medical ner baseline proof-of-concept bert synthetic-data
Disclaimer: This model is for research and educational purposes. Not validated for clinical use.