# Turkish Safety - Content Moderation Classifier v5.0

Multi-label classification model for Turkish content moderation, developed by the SiriusAI Tech Brain Team.

## Mission

Empowering digital platforms with AI-driven content safety solutions.
Turkish Safety is an NLP model that analyzes Turkish text in real time and detects harmful content across 7 categories. It provides comprehensive content moderation for social media platforms, messaging applications, in-game chats, and community forums.
## Why This Model Matters

- 7 Risk Categories: Detects SAFE, GROOMING, SEXUAL, OFFENSIVE, BULLYING, SELF_HARM, and THREAT
- Turkish-First Design: Optimized for Turkish linguistics and cultural context using BERTurk
- Production-Ready: <50ms inference, battle-tested architecture, enterprise-grade reliability
- Multi-Label Intelligence: Classification that recognizes content can belong to multiple categories at once
- Expert Validation: Curated training data with clear category boundaries and edge-case handling
Model Overview
| Property |
Value |
| Architecture |
BERT (Bidirectional Encoder Representations from Transformers) |
| Base Model |
dbmdz/bert-base-turkish-uncased (BERTurk) |
| Task |
Multi-label Text Classification |
| Language |
Turkish (tr) |
| Categories |
7 content safety labels |
| Model Size |
443 MB (FP32) |
| Inference Time |
~10-15ms (GPU) / ~40-50ms (CPU) |
## Performance Metrics

### Final Evaluation Results (Epoch 2)

| Metric | Score | Description |
|---|---|---|
| Macro F1 | 0.9165 | Unweighted mean of per-class F1 scores (each the harmonic mean of precision and recall) |
| MCC | 0.9045 | Matthews Correlation Coefficient (robust multi-class metric) |
| Eval Loss | 0.0268 | Focal loss on the validation set |
### Training Progress

| Epoch | Train Loss | Eval Loss | Macro F1 | MCC |
|---|---|---|---|---|
| 1 | 0.038 | 0.0282 | 0.9085 | 0.8957 |
| 2 | 0.038 | 0.0268 | 0.9165 | 0.9045 |
### Validation Test Results (86.4% Accuracy)

| Category | Test Cases | Correct | Notes |
|---|---|---|---|
| SAFE | 5 | 4 | One false positive (compliment → offensive) |
| GROOMING | 4 | 2 | Boundary cases with SEXUAL/THREAT |
| SEXUAL | 3 | 3 | Perfect detection |
| OFFENSIVE | 3 | 3 | Perfect detection |
| THREAT | 3 | 3 | Perfect detection |
| SELF_HARM | 2 | 2 | Perfect detection |
| BULLYING | 2 | 2 | Perfect detection |
| Total | 22 | 19 | 19/22 = 86.4% |
## Dataset

### Dataset Statistics

| Split | Samples | Purpose |
|---|---|---|
| Train | 68,128 | Model training |
| Test | 17,033 | Model evaluation |
| Total | 85,161 | Complete dataset |
### Category Distribution (Full Dataset)

| Category | Samples | Percentage | Description |
|---|---|---|---|
| SAFE | 25,488 | 29.9% | Benign, normal communication |
| SELF_HARM | 14,234 | 16.7% | Self-harm ideation, suicidal thoughts |
| BULLYING | 13,259 | 15.6% | Harassment, exclusion, cyberbullying |
| THREAT | 9,193 | 10.8% | Physical threats, violence, blackmail |
| SEXUAL | 8,642 | 10.1% | Sexual content, body comments |
| GROOMING | 7,517 | 8.8% | Manipulation, trust-building tactics |
| OFFENSIVE | 6,849 | 8.0% | Profanity, slurs, offensive language |
### Subcategory Breakdown

| Category | Subcategories |
|---|---|
| SAFE | greetings (1,958), farewells (1,485), wellbeing_questions (2,900), daily_conversation (2,435), weather_talk (1,445), food_drink (1,481), normal_questions (1,861), school_talk (1,961), family_talk (1,487), hobbies_games (1,455), sports_talk (1,000), tech_internet (994), genuine_compliments (1,000), encouragement (1,000), appreciation (1,000), apology_understanding (998), help_cooperation (1,000) |
| GROOMING | secrecy (953), isolation (729), trust_manipulation (792), meeting_private (701), gift_promise (565), age_questioning (688), private_communication (628), emotional_manipulation (654), normalization (655), excessive_flattery (559), testing_boundaries (583) |
| THREAT | physical_violence (1,307), weapon_threat (936), blackmail (1,168), family_threat (1,071), implicit_threat (906), revenge (947), death_threat (886), social_threat (930), stalking_threat (532), property_threat (500) |
| OFFENSIVE | insults (1,286), cursing_sik (1,535), cursing_am (1,398), cursing_ana_orospu (1,383), derogatory (849), mockery (398) |
| SEXUAL | explicit_content (1,085), sexual_body_focus (1,612), sexual_invitation (1,237), pornographic (1,060), sexual_questions (1,232), romantic_pressure (1,030), inappropriate_comments (856), sexual_fantasy (530) |
| BULLYING | exclusion (1,904), mockery_repeated (1,690), emotional_abuse (1,678), appearance_attack (1,490), public_humiliation (1,091), intimidation (979), cyberbullying (1,138), name_calling (1,178), spreading_rumors (1,000), academic_bullying (1,111) |
| SELF_HARM | hopelessness (1,923), giving_up (1,690), not_waking_up (1,435), suicide_ideation (1,413), self_harm_plan (1,532), burden_feeling (1,018), worthlessness (1,037), isolation_feeling (1,025), goodbye_messages (807), self_blame (894), depression_signs (1,452) |
### Data Generation Methodology

- Synthetic Generation: LLM-based generation with expert-defined category boundaries
- Hard Negative Mining: Difficult edge cases for boundary discrimination
- Quality Filtering: Duplicate detection, minimum word count, forbidden-token filtering (see the sketch after this list)
- Parallel Processing: 20 concurrent workers with a batch size of 50
- Pass Rate: 97.5% average acceptance rate across all categories
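The exact filtering rules are not published; an illustrative version of such a quality filter (the thresholds and the forbidden-token list here are assumptions) might look like:

```python
def passes_quality_filter(text: str, seen: set, min_words: int = 3,
                          forbidden: tuple = ("lorem", "as an ai")) -> bool:
    # Normalize whitespace and case so near-identical samples collide in the dedupe set
    normalized = " ".join(text.casefold().split())
    if normalized in seen:                            # duplicate detection
        return False
    if len(normalized.split()) < min_words:           # minimum word count
        return False
    if any(tok in normalized for tok in forbidden):   # forbidden-token filtering
        return False
    seen.add(normalized)
    return True
```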
## Label Definitions

The model classifies text into 7 categories that are not mutually exclusive:

| Label | ID | Description | Turkish Examples |
|---|---|---|---|
| SAFE | 0 | Benign, normal communication | "Bugün hava güzel", "Oyun oynayalım mı?" |
| OFFENSIVE | 1 | Profanity, slurs, offensive language | "Aptal mısın", "Salak herif" |
| SELF_HARM | 2 | Self-harm ideation, suicidal thoughts | "Ölmek istiyorum", "Kendimi kesmek istiyorum" |
| GROOMING | 3 | Manipulation, trust-building, isolation tactics | "Kimseye söyleme", "Sen özelsin", "Evime gel" |
| BULLYING | 4 | Harassment, exclusion, cyberbullying | "Kimse seninle oynamak istemiyor", "Çirkinsin" |
| SEXUAL | 5 | Sexual content, body comments, inappropriate questions | "Vücudun güzel", "Hiç öpüştün mü?", "Ne giyiyorsun?" |
| THREAT | 6 | Physical threats, violence, blackmail | "Seni döverim", "Fotoğrafını yayarım" |
### Important: Category Boundaries

GROOMING vs SEXUAL distinction:

- GROOMING: non-sexual manipulation tactics (trust-building, secrecy, gift promises, meeting requests)
- SEXUAL: any body-related comments, physical compliments, sexual questions, explicit content

Examples:

- "Kimseye söyleme tamam mı?" → GROOMING (secrecy/isolation)
- "Vücudun çok güzel" → SEXUAL (body comment)
- "Telefon alırım sana" → GROOMING (gift promise)
- "Dudakların çok güzel" → SEXUAL (body-focused compliment)
- "Gel evime yalnızım" → GROOMING (meeting request/isolation)
- "Hiç öpüştün mü?" → SEXUAL (sexual experience question)
## Training Procedure

### Hyperparameters

| Parameter | Value |
|---|---|
| Base Model | dbmdz/bert-base-turkish-uncased |
| Max Sequence Length | 64 tokens |
| Batch Size | 16 (effective 32 with gradient accumulation) |
| Gradient Accumulation | 2 steps |
| Learning Rate | 2e-5 (with cosine restarts) |
| Epochs | 2 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Warmup Ratio | 0.1 |
| Loss Function | Focal Loss (gamma=1.2) |
| Label Smoothing | 0.05 |
| Problem Type | Multi-label Classification |
| Evaluation Strategy | Per epoch |
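The exact loss implementation is not published; a common multi-label focal-loss formulation matching the stated gamma=1.2 and label smoothing 0.05 would be the following sketch (the function name and details are illustrative):

```python
import torch
import torch.nn.functional as F

def focal_bce_loss(logits: torch.Tensor, targets: torch.Tensor,
                   gamma: float = 1.2, smoothing: float = 0.05) -> torch.Tensor:
    # Label smoothing: pull hard 0/1 targets toward 0.5
    targets = targets * (1 - smoothing) + 0.5 * smoothing
    # Per-label binary cross-entropy (multi-label: one sigmoid per category)
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # Focal modulation: down-weight easy examples whose target probability is near 1
    p_t = torch.exp(-bce)
    return ((1.0 - p_t) ** gamma * bce).mean()
```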
### Training Environment

| Resource | Specification |
|---|---|
| Hardware | Apple M1 Pro (MPS) |
| Framework | PyTorch 2.x + Transformers 4.37+ |
| Training Time | ~14 minutes (864 seconds) |
| Throughput | 157.8 samples/second |
| Steps | 4,258 total |
### Learning Rate Schedule

- Peak LR: 2e-5 (after warmup)
- Schedule: Cosine with restarts
- Final LR: ~1.1e-8
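The card does not state which scheduler implementation was used or how many restart cycles were run; a plausible setup with the standard `transformers` scheduler, under those assumptions, would be:

```python
import torch
from transformers import get_cosine_with_hard_restarts_schedule_with_warmup

total_steps = 4_258                    # total optimizer steps, from the table above
warmup_steps = int(0.1 * total_steps)  # warmup ratio 0.1

# `model` is the BertForSequenceClassification instance being fine-tuned
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_cosine_with_hard_restarts_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
    num_cycles=2,  # assumption: the actual number of restarts is not published
)
```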
## Usage

### Installation

```bash
pip install transformers torch
```
### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "hayatiali/turkish-safety"
# The tokenizer comes from the BERTurk base model; the fine-tuned weights from the hub
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-uncased")
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Label order must match the classifier head
LABELS = ["SAFE", "OFFENSIVE", "SELF_HARM", "GROOMING", "BULLYING", "SEXUAL", "THREAT"]

def predict(text):
    # Note: the model was trained with a 64-token max sequence length;
    # longer inputs are truncated here at 128 tokens
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Sigmoid, not softmax: labels are independent in multi-label classification
    probs = torch.sigmoid(outputs.logits)[0].numpy()
    scores = {label: float(prob) for label, prob in zip(LABELS, probs)}
    primary = max(scores, key=scores.get)
    return {"category": primary, "confidence": scores[primary], "all_scores": scores}

print(predict("Vücudun çok güzel"))         # expected: SEXUAL
print(predict("Kimseye söyleme tamam mı"))  # expected: GROOMING
print(predict("Ölmek istiyorum"))           # expected: SELF_HARM
print(predict("Bugün hava güzel"))          # expected: SAFE
```
### Production Class

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class TurkishSafetyClassifier:
    LABELS = ["SAFE", "OFFENSIVE", "SELF_HARM", "GROOMING", "BULLYING", "SEXUAL", "THREAT"]

    def __init__(self, model_path="hayatiali/turkish-safety"):
        self.tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-uncased")
        self.model = AutoModelForSequenceClassification.from_pretrained(model_path)
        # Prefer CUDA, then Apple MPS, then CPU
        self.device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
        self.model.to(self.device).eval()

    def predict(self, text: str) -> dict:
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        with torch.no_grad():
            logits = self.model(**inputs).logits
        probs = torch.sigmoid(logits)[0].cpu().numpy()
        scores = {label: float(p) for label, p in zip(self.LABELS, probs)}
        primary = max(scores, key=scores.get)
        return {
            "category": primary,
            "confidence": scores[primary],
            "scores": scores,
            "action": self._get_action(scores[primary], primary),
        }

    def _get_action(self, score: float, category: str) -> str:
        # Stricter cutoffs for high-severity categories
        if category in ["GROOMING", "SEXUAL", "SELF_HARM", "THREAT"]:
            if score > 0.5:
                return "hard_block"
            if score > 0.3:
                return "soft_block"
        # General cutoffs for the remaining categories
        if score > 0.75:
            return "hard_block"
        if score > 0.60:
            return "soft_block"
        if score > 0.45:
            return "flag"
        if score > 0.30:
            return "allow_log"
        return "allow"
```
### Batch Inference

```python
# Reuses tokenizer, model, and LABELS from the Quick Start above
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

def predict_batch(texts: list, batch_size: int = 32) -> list:
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # Pad within each batch so variable-length sequences stack into one tensor
        inputs = tokenizer(batch, return_tensors="pt", truncation=True,
                           max_length=128, padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        with torch.no_grad():
            probs = torch.sigmoid(model(**inputs).logits).cpu().numpy()
        for prob in probs:
            results.append({label: float(p) for label, p in zip(LABELS, prob)})
    return results
```
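Example:

```python
batch_scores = predict_batch(["Bugün hava güzel", "Seni döverim"])
print(batch_scores[0]["SAFE"], batch_scores[1]["THREAT"])
```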
## Limitations & Known Issues

### ⚠️ Evaluation Limitations

Note: Two separate evaluation sets exist:

- Automated Test Set: 17,033 samples from test.csv → Macro F1: 0.9165, MCC: 0.9045
- Manual Edge Case Test: 22 hand-picked samples → 86.4% accuracy (19/22 correct)

| Limitation | Details | Impact |
|---|---|---|
| Small Manual Test Set | Edge-case validation on only 22 samples (86.4%) | Manual test is not statistically significant; the automated metrics (17K samples) are more reliable |
| No Per-Class Metrics | Only Macro F1 and MCC reported for the 17K test set | Cannot assess individual category performance (e.g., SELF_HARM precision/recall vs SAFE) |
| No Confusion Matrix | Category confusion patterns not documented | Unclear which categories are most confused beyond the GROOMING/SEXUAL boundary |
| No PR/ROC Curves | Precision-recall and ROC analysis not performed | Optimal threshold selection methodology not documented |
| No Calibration Analysis | Model confidence calibration not tested | Unknown whether a 0.7 confidence truly represents a 70% probability |
### ⚠️ Architectural Limitations

| Limitation | Details | Impact |
|---|---|---|
| Short Context Window | Max sequence length: 64 tokens | Long messages may lose critical information; truncation may remove key context |
| Single-Turn Only | No conversation-history analysis | GROOMING patterns often emerge across multiple messages ("Kaç yaşındasın?", "Nerelisin?", "Fotoğraf atar mısın?" may each appear SAFE individually); see the sketch below |
| No Temporal Patterns | No escalation-detection capability | Cannot detect behavior changes over time; user history not considered |
| Static Analysis | Each message analyzed independently | Contextual red flags from message sequences not captured |
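The single-turn limitation above is the most consequential one for GROOMING detection. An illustrative session-level aggregation layer is sketched below (the window size and cutoff are assumptions; this is not part of the released model):

```python
from collections import deque

class ConversationAggregator:
    """Rolling per-conversation view over single-message score dicts."""

    def __init__(self, window: int = 10, cutoff: float = 0.4):
        self.history = deque(maxlen=window)  # last N score dicts for one conversation
        self.cutoff = cutoff

    def update(self, scores: dict) -> list:
        # Flag a category when its rolling mean crosses the cutoff, catching
        # sequences where each message alone stays below single-message thresholds
        self.history.append(scores)
        return [
            label
            for label in scores
            if sum(s[label] for s in self.history) / len(self.history) >= self.cutoff
        ]
```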
### ⚠️ Data & Coverage Limitations

| Limitation | Details | Impact |
|---|---|---|
| Dialect/Slang Gaps | Regional dialects and internet slang underrepresented | Performance may degrade on "napıon", "nbr", "slm", "mrb", and regional variations |
| No Adversarial Testing | Evasion techniques not systematically tested | Unknown robustness against "S 3 x" instead of "sex", character substitution, unicode tricks; see the normalization sketch below |
| Synthetic Data Bias | Training data is LLM-generated (synthetic, with a 97.5% filter pass rate) | May not capture real-world linguistic patterns; potential distribution shift |
| Spelling Error Tolerance | Not explicitly tested | Common typos and intentional misspellings may bypass detection |
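Evasion robustness can be partially addressed by normalizing inputs before classification. An illustrative pre-filter (the substitution map is an assumption, not something shipped with this model):

```python
import unicodedata

# Hypothetical map for common digit/symbol-for-letter substitutions
LEET_MAP = str.maketrans({"3": "e", "1": "i", "0": "o", "4": "a", "5": "s", "@": "a", "$": "s"})

def normalize_for_moderation(text: str) -> str:
    # Fold unicode confusables (NFKC), lowercase, undo simple substitutions,
    # and strip punctuation inserted between letters (e.g. "S.3.x")
    text = unicodedata.normalize("NFKC", text).casefold().translate(LEET_MAP)
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())
```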
### ⚠️ Production Deployment Considerations

| Consideration | Details | Recommendation |
|---|---|---|
| Threshold Selection | Current thresholds (0.3, 0.5, 0.75) are heuristic | Perform PR-curve analysis for your specific use case; adjust based on FP/FN tolerance |
| Confidence Calibration | Model may be over- or under-confident | Consider temperature scaling or Platt calibration before production (see the sketch below) |
| Category Boundaries | The GROOMING ↔ SEXUAL boundary is a known issue | Review flagged content in these categories; implement human review for edge cases |
| Real-Time Context | No session-level analysis | Consider a sliding-window or conversation-aggregation layer |
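For the calibration point above, a minimal temperature-scaling sketch for multi-label sigmoid outputs, fitted on held-out validation logits (illustrative only; `val_logits` and `val_targets` are assumed float tensors of shape `[n, 7]`):

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_targets: torch.Tensor) -> float:
    # Learn a single temperature T > 0 minimizing BCE on the validation split;
    # at inference, use torch.sigmoid(logits / T) instead of torch.sigmoid(logits)
    log_t = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = F.binary_cross_entropy_with_logits(val_logits / log_t.exp(), val_targets)
        loss.backward()
        return loss

    optimizer.step(closure)
    return float(log_t.exp())
```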
## Not Suitable For
- Languages other than Turkish
- Adult content moderation (requires different domain expertise)
- Sole decision-making without human review for high-stakes situations
- Legal evidence or court proceedings
- Detection of sophisticated, multi-turn grooming attempts without additional context layer
- Highly informal/slang-heavy communications without additional preprocessing
## Ethical Considerations

### Intended Use
- Social media content moderation
- Messaging platform safety filters
- Gaming chat moderation
- Community forum monitoring
- Parental control applications
- Research and educational purposes
### Risks
- False Negatives: May miss sophisticated grooming attempts
- False Positives: May flag benign content incorrectly
- Automation Bias: Over-reliance on model predictions
### Recommendations
- Human Oversight: Always combine with human review for critical decisions
- Threshold Calibration: Adjust thresholds based on your risk tolerance
- Monitoring: Track performance metrics in production
- Regular Updates: Retrain with new data periodically
- Transparency: Inform users about automated moderation
## Technical Specifications

### Model Architecture

```
BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings
    (encoder): BertEncoder (12 layers)
    (pooler): BertPooler
  )
  (dropout): Dropout(p=0.1)
  (classifier): Linear(in_features=768, out_features=7)
)
```

- Total Parameters: ~110M
- Trainable Parameters: ~110M (full fine-tuning)

### Input/Output

- Input: Turkish text (the usage examples truncate at 128 tokens; training used a 64-token max sequence length)
- Output: 7-dimensional probability vector (sigmoid-activated)
- Tokenizer: BERTurk WordPiece (32k vocab)
## Citation

```bibtex
@misc{turkish-safety-2025,
  title={Turkish Safety - Content Moderation Classifier},
  author={SiriusAI Tech Brain Team},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/hayatiali/turkish-safety}},
  note={Fine-tuned from dbmdz/bert-base-turkish-uncased, Macro F1: 0.9165}
}
```
## Model Card Authors

SiriusAI Tech Brain Team

## Contact

[email protected]

## Changelog
### v5.0 (Current)
- Major dataset expansion: 85,161 samples (68,128 train / 17,033 test)
- Improved metrics: Macro F1: 0.9165, MCC: 0.9045
- Optimized hyperparameters for large dataset (Focal Loss, cosine restarts)
- 67 subcategories across 7 main categories
- 86.4% validation accuracy on edge cases
### v4.0
- Initial production release
- 7-category multi-label content safety classification
- Macro F1: 0.9076, MCC: 0.8931
- Training on 30,596 samples
- Clear category boundary definitions (GROOMING vs SEXUAL)
- Optimized for real-time inference (<50ms)
## License

License: SiriusAI Tech Premium License v1.0

Commercial Use: Requires a Premium License. Contact: [email protected]

Free Use Allowed For:

- Academic research and education
- Non-profit organizations (with approval)
- Evaluation (30 days)

Disclaimer: This model is designed for content moderation and safety applications. Always implement it with appropriate safeguards and human oversight. Model predictions should inform decisions, not replace human judgment.