Turkish Safety - Content Moderation Classifier v5.0

Multi-label classification model for Turkish content moderation

Developed by SiriusAI Tech Brain Team


Mission

Empowering digital platforms with AI-driven content safety solutions.

Turkish Safety is an advanced NLP model that analyzes Turkish content in real time and classifies it across 7 content-safety categories. It provides comprehensive content moderation for social media platforms, messaging applications, in-game chats, and community forums.

Why This Model Matters

  • 7 Risk Categories: Distinguishes SAFE content from GROOMING, SEXUAL, OFFENSIVE, BULLYING, SELF_HARM, and THREAT
  • Turkish-First Design: Optimized for Turkish linguistics and cultural context using BERTurk
  • Production-Ready: <50ms inference, battle-tested architecture, enterprise-grade reliability
  • Multi-Label Intelligence: Smart classification that understands content can belong to multiple categories
  • Expert Validation: Curated training data with clear category boundaries and edge case handling

Model Overview

| Property | Value |
|---|---|
| Architecture | BERT (Bidirectional Encoder Representations from Transformers) |
| Base Model | dbmdz/bert-base-turkish-uncased (BERTurk) |
| Task | Multi-label Text Classification |
| Language | Turkish (tr) |
| Categories | 7 content safety labels |
| Model Size | 443 MB (FP32) |
| Inference Time | ~10-15 ms (GPU) / ~40-50 ms (CPU) |

Performance Metrics

Final Evaluation Results (Epoch 2)

| Metric | Score | Description |
|---|---|---|
| Macro F1 | 0.9165 | Unweighted mean of per-class F1 scores (F1 = harmonic mean of precision and recall) |
| MCC | 0.9045 | Matthews Correlation Coefficient (robust multi-class metric) |
| Eval Loss | 0.0268 | Focal loss on the validation set |

Training Progress

| Epoch | Train Loss | Eval Loss | Macro F1 | MCC |
|---|---|---|---|---|
| 1 | 0.038 | 0.0282 | 0.9085 | 0.8957 |
| 2 | 0.038 | 0.0268 | 0.9165 | 0.9045 |

Validation Test Results (86.4% Accuracy)

| Category | Test Cases | Correct | Notes |
|---|---|---|---|
| SAFE | 5 | 4 | One false positive (compliment → OFFENSIVE) |
| GROOMING | 4 | 2 | Boundary cases with SEXUAL/THREAT |
| SEXUAL | 3 | 3 | Perfect detection |
| OFFENSIVE | 3 | 3 | Perfect detection |
| THREAT | 3 | 3 | Perfect detection |
| SELF_HARM | 2 | 2 | Perfect detection |
| BULLYING | 2 | 2 | Perfect detection |

Dataset

Dataset Statistics

| Split | Samples | Purpose |
|---|---|---|
| Train | 68,128 | Model training |
| Test | 17,033 | Model evaluation |
| Total | 85,161 | Complete dataset |

Category Distribution (Full Dataset)

| Category | Samples | Percentage | Description |
|---|---|---|---|
| SAFE | 25,488 | 29.9% | Benign, normal communication |
| SELF_HARM | 14,234 | 16.7% | Self-harm ideation, suicidal thoughts |
| BULLYING | 13,259 | 15.6% | Harassment, exclusion, cyberbullying |
| THREAT | 9,193 | 10.8% | Physical threats, violence, blackmail |
| SEXUAL | 8,642 | 10.1% | Sexual content, body comments |
| GROOMING | 7,517 | 8.8% | Manipulation, trust-building tactics |
| OFFENSIVE | 6,849 | 8.0% | Profanity, slurs, offensive language |

Subcategory Breakdown

| Category | Subcategories (sample counts) |
|---|---|
| SAFE | greetings (1,958), farewells (1,485), wellbeing_questions (2,900), daily_conversation (2,435), weather_talk (1,445), food_drink (1,481), normal_questions (1,861), school_talk (1,961), family_talk (1,487), hobbies_games (1,455), sports_talk (1,000), tech_internet (994), genuine_compliments (1,000), encouragement (1,000), appreciation (1,000), apology_understanding (998), help_cooperation (1,000) |
| GROOMING | secrecy (953), isolation (729), trust_manipulation (792), meeting_private (701), gift_promise (565), age_questioning (688), private_communication (628), emotional_manipulation (654), normalization (655), excessive_flattery (559), testing_boundaries (583) |
| THREAT | physical_violence (1,307), weapon_threat (936), blackmail (1,168), family_threat (1,071), implicit_threat (906), revenge (947), death_threat (886), social_threat (930), stalking_threat (532), property_threat (500) |
| OFFENSIVE | insults (1,286), cursing_sik (1,535), cursing_am (1,398), cursing_ana_orospu (1,383), derogatory (849), mockery (398) |
| SEXUAL | explicit_content (1,085), sexual_body_focus (1,612), sexual_invitation (1,237), pornographic (1,060), sexual_questions (1,232), romantic_pressure (1,030), inappropriate_comments (856), sexual_fantasy (530) |
| BULLYING | exclusion (1,904), mockery_repeated (1,690), emotional_abuse (1,678), appearance_attack (1,490), public_humiliation (1,091), intimidation (979), cyberbullying (1,138), name_calling (1,178), spreading_rumors (1,000), academic_bullying (1,111) |
| SELF_HARM | hopelessness (1,923), giving_up (1,690), not_waking_up (1,435), suicide_ideation (1,413), self_harm_plan (1,532), burden_feeling (1,018), worthlessness (1,037), isolation_feeling (1,025), goodbye_messages (807), self_blame (894), depression_signs (1,452) |

Data Generation Methodology

  1. Synthetic Generation: LLM-based generation with expert-defined category boundaries
  2. Hard Negative Mining: Difficult edge cases for boundary discrimination
  3. Quality Filtering: Duplicate detection, minimum word count, forbidden token filtering (see the sketch after this list)
  4. Parallel Processing: 20 concurrent workers with batch size of 50
  5. Pass Rate: 97.5% average acceptance rate across all categories
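
An illustrative version of the quality filter in step 3 (the exact criteria are not published; the deduplication key, the `min_words` default, and the `forbidden` set are assumptions):

def passes_filters(text: str, seen: set, forbidden: set, min_words: int = 3) -> bool:
    key = " ".join(text.lower().split())          # normalized key for deduplication
    if key in seen:                               # duplicate detection
        return False
    if len(key.split()) < min_words:              # minimum word count
        return False
    if any(tok in key for tok in forbidden):      # forbidden token filtering
        return False
    seen.add(key)
    return True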

Label Definitions

The model scores text against 7 categories that are not mutually exclusive (multi-label classification):

| Label | ID | Description | Turkish Examples (with English glosses) |
|---|---|---|---|
| SAFE | 0 | Benign, normal communication | "Bugün hava güzel" ("The weather is nice today"), "Oyun oynayalım mı?" ("Shall we play a game?") |
| OFFENSIVE | 1 | Profanity, slurs, offensive language | "Aptal mısın" ("Are you stupid?"), "Salak herif" ("Stupid jerk") |
| SELF_HARM | 2 | Self-harm ideation, suicidal thoughts | "Ölmek istiyorum" ("I want to die"), "Kendimi kesmek istiyorum" ("I want to cut myself") |
| GROOMING | 3 | Manipulation, trust-building, isolation tactics | "Kimseye söyleme" ("Don't tell anyone"), "Sen özelsin" ("You're special"), "Evime gel" ("Come to my house") |
| BULLYING | 4 | Harassment, exclusion, cyberbullying | "Kimse seninle oynamak istemiyor" ("Nobody wants to play with you"), "Çirkinsin" ("You're ugly") |
| SEXUAL | 5 | Sexual content, body comments, inappropriate questions | "Vücudun güzel" ("Your body is beautiful"), "Hiç öpüştün mü?" ("Have you ever kissed?"), "Ne giyiyorsun?" ("What are you wearing?") |
| THREAT | 6 | Physical threats, violence, blackmail | "Seni döverim" ("I'll beat you up"), "Fotoğrafını yayarım" ("I'll spread your photo") |
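
To guard against label-order mismatches (the Quick Start below hardcodes the list), the mapping can be read from the checkpoint itself, assuming the published config carries id2label metadata:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("hayatiali/turkish-safety")
print(config.id2label)  # expect {0: "SAFE", 1: "OFFENSIVE", ..., 6: "THREAT"}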

Important: Category Boundaries

GROOMING vs SEXUAL Distinction:

  • GROOMING: Non-sexual manipulation tactics (trust-building, secrecy, gift promises, meeting requests)
  • SEXUAL: Any body-related comments, physical compliments, sexual questions, explicit content
"Kimseye söyleme tamam mı?"  → GROOMING (secrecy/isolation)
"Vücudun çok güzel"          → SEXUAL (body comment)
"Telefon alırım sana"        → GROOMING (gift promise)
"Dudakların çok güzel"       → SEXUAL (body-focused compliment)
"Gel evime yalnızım"         → GROOMING (meeting request/isolation)
"Hiç öpüştün mü?"            → SEXUAL (sexual experience question)

Training Procedure

Hyperparameters

| Parameter | Value |
|---|---|
| Base Model | dbmdz/bert-base-turkish-uncased |
| Max Sequence Length | 64 tokens |
| Batch Size | 16 (effective 32 with gradient accumulation) |
| Gradient Accumulation | 2 steps |
| Learning Rate | 2e-5 (with cosine restarts) |
| Epochs | 2 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Warmup Ratio | 0.1 |
| Loss Function | Focal Loss (gamma=1.2) |
| Label Smoothing | 0.05 |
| Problem Type | Multi-label Classification |
| Evaluation Strategy | Per epoch |
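
The exact training script is not published; below is a minimal sketch of the configuration in the table above, where the focal-loss and label-smoothing implementations are illustrative assumptions:

import torch
import torch.nn.functional as F
from transformers import Trainer, TrainingArguments

def focal_bce_loss(logits, labels, gamma=1.2, smoothing=0.05):
    # Label smoothing for multi-label targets: y -> y * (1 - s) + s / 2
    labels = labels.float() * (1 - smoothing) + 0.5 * smoothing
    bce = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    p_t = torch.exp(-bce)                      # probability assigned to the (smoothed) target
    return ((1 - p_t) ** gamma * bce).mean()   # focal modulation with gamma = 1.2

class FocalLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss = focal_bce_loss(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(
    output_dir="turkish-safety",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,       # effective batch size 32
    learning_rate=2e-5,
    lr_scheduler_type="cosine_with_restarts",
    warmup_ratio=0.1,
    weight_decay=0.01,
    num_train_epochs=2,
    evaluation_strategy="epoch",         # renamed eval_strategy in newer transformers releases
)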

Training Environment

| Resource | Specification |
|---|---|
| Hardware | Apple M1 Pro (MPS) |
| Framework | PyTorch 2.x + Transformers 4.37+ |
| Training Time | ~14 minutes (864 seconds) |
| Throughput | 157.8 samples/second |
| Steps | 4,258 total |

Learning Rate Schedule

Peak LR: 2e-5 (after warmup)
Schedule: Cosine with restarts
Final LR: ~1.1e-8
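
For reference, the same schedule can be built with the scheduler helper in transformers; pairing it with AdamW as shown is an assumption about the training script, with values taken from the tables above:

from torch.optim import AdamW
from transformers import (
    AutoModelForSequenceClassification,
    get_cosine_with_hard_restarts_schedule_with_warmup,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-uncased",
    num_labels=7,
    problem_type="multi_label_classification",
)
total_steps = 4258                             # from the training environment table
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_cosine_with_hard_restarts_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),   # warmup ratio 0.1
    num_training_steps=total_steps,
)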

Usage

Installation

pip install transformers torch

Quick Start

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model
model_name = "hayatiali/turkish-safety"
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-uncased")
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Label mapping (MUST match model's id2label)
LABELS = ["SAFE", "OFFENSIVE", "SELF_HARM", "GROOMING", "BULLYING", "SEXUAL", "THREAT"]

def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

    with torch.no_grad():
        outputs = model(**inputs)
        # Multi-label: use sigmoid (NOT softmax!)
        probs = torch.sigmoid(outputs.logits)[0].numpy()

    scores = {label: float(prob) for label, prob in zip(LABELS, probs)}
    primary = max(scores, key=scores.get)

    return {"category": primary, "confidence": scores[primary], "all_scores": scores}

# Examples (expected top category)
print(predict("Vücudun çok güzel"))        # "Your body is very beautiful" → SEXUAL
print(predict("Kimseye söyleme tamam mı")) # "Don't tell anyone, okay?" → GROOMING
print(predict("Ölmek istiyorum"))          # "I want to die" → SELF_HARM
print(predict("Bugün hava güzel"))         # "The weather is nice today" → SAFE

Production Class

# Same imports as the Quick Start example
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class TurkishSafetyClassifier:
    LABELS = ["SAFE", "OFFENSIVE", "SELF_HARM", "GROOMING", "BULLYING", "SEXUAL", "THREAT"]

    def __init__(self, model_path="hayatiali/turkish-safety"):
        self.tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-uncased")
        self.model = AutoModelForSequenceClassification.from_pretrained(model_path)
        self.device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
        self.model.to(self.device).eval()

    def predict(self, text: str) -> dict:
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        with torch.no_grad():
            logits = self.model(**inputs).logits
            probs = torch.sigmoid(logits)[0].cpu().numpy()

        scores = {label: float(p) for label, p in zip(self.LABELS, probs)}
        primary = max(scores, key=scores.get)

        return {
            "category": primary,
            "confidence": scores[primary],
            "scores": scores,
            "action": self._get_action(scores[primary], primary)
        }

    def _get_action(self, score: float, category: str) -> str:
        # Critical categories have lower thresholds
        if category in ["GROOMING", "SEXUAL", "SELF_HARM", "THREAT"]:
            if score > 0.5: return "hard_block"
            if score > 0.3: return "soft_block"

        if score > 0.75: return "hard_block"
        if score > 0.60: return "soft_block"
        if score > 0.45: return "flag"
        if score > 0.30: return "allow_log"
        return "allow"

Batch Inference

# Reuses `tokenizer`, `model`, and `LABELS` from the Quick Start example.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def predict_batch(texts: list, batch_size: int = 32) -> list:
    """Score a list of texts; returns one {label: probability} dict per text."""
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", truncation=True, max_length=128, padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            probs = torch.sigmoid(model(**inputs).logits).cpu().numpy()

        for prob in probs:
            results.append({label: float(p) for label, p in zip(LABELS, prob)})

    return results

Limitations & Known Issues

⚠️ Evaluation Limitations

Note: Two separate evaluation sets exist:

  • Automated Test Set: 17,033 samples from test.csv → Macro F1: 0.9165, MCC: 0.9045
  • Manual Edge Case Test: 22 hand-picked samples → 86.4% accuracy (19/22 correct)

| Limitation | Details | Impact |
|---|---|---|
| Small Manual Test Set | Edge-case validation on only 22 samples (86.4%) | Manual test is not statistically significant; the automated metrics (17K samples) are more reliable |
| No Per-Class Metrics | Only Macro F1 and MCC reported for the 17K test set | Individual category performance (e.g., SELF_HARM precision/recall vs. SAFE) cannot be assessed |
| No Confusion Matrix | Category confusion patterns not documented | Unclear which categories are most confused beyond the GROOMING/SEXUAL boundary |
| No PR/ROC Curves | Precision-recall and ROC analysis not performed | Optimal threshold-selection methodology not documented (see the sketch below) |
| No Calibration Analysis | Model confidence calibration not tested | Unknown whether a confidence of 0.7 truly corresponds to a 70% probability |
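
Since per-class thresholds and PR analysis are not documented, one way to derive thresholds from a labeled held-out set is sketched below; `y_true` (multi-hot labels) and `y_prob` (sigmoid outputs) are assumed to come from your own evaluation run:

import numpy as np
from sklearn.metrics import precision_recall_curve

LABELS = ["SAFE", "OFFENSIVE", "SELF_HARM", "GROOMING", "BULLYING", "SEXUAL", "THREAT"]

def pick_thresholds(y_true: np.ndarray, y_prob: np.ndarray) -> dict:
    # y_true: (n, 7) multi-hot labels; y_prob: (n, 7) sigmoid probabilities
    thresholds = {}
    for i, label in enumerate(LABELS):
        prec, rec, thr = precision_recall_curve(y_true[:, i], y_prob[:, i])
        f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
        best = int(np.argmax(f1[:-1]))   # the last PR point has no threshold
        thresholds[label] = float(thr[best])
    return thresholds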

⚠️ Architectural Limitations

| Limitation | Details | Impact |
|---|---|---|
| Short Context Window | Max sequence length: 64 tokens | Long messages may lose critical information; truncation may remove key context |
| Single-Turn Only | No conversation-history analysis | GROOMING patterns often emerge across multiple messages: "Kaç yaşındasın?" ("How old are you?"), "Nerelisin?" ("Where are you from?"), "Fotoğraf atar mısın?" ("Will you send a photo?") may each look SAFE individually |
| No Temporal Patterns | No escalation-detection capability | Cannot detect behavior changes over time; user history is not considered |
| Static Analysis | Each message analyzed independently | Contextual red flags spread across message sequences are not captured |

⚠️ Data & Coverage Limitations

| Limitation | Details | Impact |
|---|---|---|
| Dialect/Slang Gaps | Regional dialects and internet slang underrepresented | Performance may degrade on "napıon" ("whatcha doing"), "nbr" ("sup"), "slm", "mrb" (shortened greetings), and regional variations |
| No Adversarial Testing | Evasion techniques not systematically tested | Robustness unknown against "S 3 x" instead of "sex", character substitution, Unicode tricks |
| Synthetic Data Bias | Training data is synthetic (LLM-generated; see Data Generation Methodology) | May not capture real-world linguistic patterns; potential distribution shift |
| Spelling Error Tolerance | Not explicitly tested | Common typos and intentional misspellings may bypass detection |
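
One mitigation for the slang and obfuscation gaps above is light input normalization before classification; the substitution map below is an illustrative assumption, not something shipped with the model:

import re

# Hypothetical character-substitution map for common obfuscations
SUBSTITUTIONS = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"})

def normalize(text: str) -> str:
    text = text.lower().translate(SUBSTITUTIONS)
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)   # collapse long repeats: "çooook" -> "çook"
    return re.sub(r"\s+", " ", text).strip()     # normalize whitespace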

⚠️ Production Deployment Considerations

| Consideration | Details | Recommendation |
|---|---|---|
| Threshold Selection | Current thresholds (0.3, 0.5, 0.75) are heuristic | Perform PR-curve analysis for your specific use case; adjust based on FP/FN tolerance |
| Confidence Calibration | Model may be over- or under-confident | Consider temperature scaling or Platt calibration before production |
| Category Boundaries | The GROOMING ↔ SEXUAL boundary is a known issue | Review flagged content in these categories; route edge cases to human review |
| Real-Time Context | No session-level analysis | Consider a sliding-window or conversation-aggregation layer (see the sketch below) |
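
A sliding-window aggregation layer could look like the sketch below (an illustration of the recommendation above, not part of the model); it reuses the TurkishSafetyClassifier class defined earlier, and the window size and session threshold are assumptions:

from collections import deque

class ConversationMonitor:
    CRITICAL = ("GROOMING", "SEXUAL", "SELF_HARM", "THREAT")

    def __init__(self, classifier, window: int = 10, threshold: float = 0.35):
        self.clf = classifier                  # e.g. TurkishSafetyClassifier()
        self.window = deque(maxlen=window)     # rolling buffer of per-message scores
        self.threshold = threshold             # hypothetical session-level threshold

    def add_message(self, text: str) -> dict:
        scores = self.clf.predict(text)["scores"]
        self.window.append(scores)
        # Rolling mean per critical category over the last `window` messages
        rolling = {
            c: sum(s[c] for s in self.window) / len(self.window)
            for c in self.CRITICAL
        }
        flagged = {c: v for c, v in rolling.items() if v >= self.threshold}
        return {"message_scores": scores, "rolling": rolling, "flagged": flagged}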

Not Suitable For

  • Languages other than Turkish
  • Adult content moderation (requires different domain expertise)
  • Sole decision-making without human review for high-stakes situations
  • Legal evidence or court proceedings
  • Detection of sophisticated, multi-turn grooming attempts without additional context layer
  • Highly informal/slang-heavy communications without additional preprocessing

Ethical Considerations

Intended Use

  • Social media content moderation
  • Messaging platform safety filters
  • Gaming chat moderation
  • Community forum monitoring
  • Parental control applications
  • Research and educational purposes

Risks

  • False Negatives: May miss sophisticated grooming attempts
  • False Positives: May flag benign content incorrectly
  • Automation Bias: Over-reliance on model predictions

Recommendations

  1. Human Oversight: Always combine with human review for critical decisions
  2. Threshold Calibration: Adjust thresholds based on your risk tolerance
  3. Monitoring: Track performance metrics in production
  4. Regular Updates: Retrain with new data periodically
  5. Transparency: Inform users about automated moderation

Technical Specifications

Model Architecture

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings
    (encoder): BertEncoder (12 layers)
    (pooler): BertPooler
  )
  (dropout): Dropout(p=0.1)
  (classifier): Linear(in_features=768, out_features=7)
)

Total Parameters: ~110M
Trainable Parameters: ~110M
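
The count can be checked directly against a loaded checkpoint (reusing `model` from the Usage section):

print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")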

Input/Output

  • Input: Turkish text (the usage examples truncate at 128 tokens; training used a 64-token maximum, so very long inputs may lose context)
  • Output: 7-dimensional probability vector (sigmoid activated)
  • Tokenizer: BERTurk WordPiece (32k vocab)

Citation

@misc{turkish-safety-2025,
  title={Turkish Safety - Content Moderation Classifier},
  author={SiriusAI Tech Brain Team},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/hayatiali/turkish-safety}},
  note={Fine-tuned from dbmdz/bert-base-turkish-uncased, Macro F1: 0.9165}
}

Model Card Authors

SiriusAI Tech Brain Team

Contact

[email protected]

Changelog

v5.0 (Current)

  • Major dataset expansion: 85,161 samples (68,128 train / 17,033 test)
  • Improved metrics: Macro F1: 0.9165, MCC: 0.9045
  • Optimized hyperparameters for large dataset (Focal Loss, cosine restarts)
  • 67 subcategories across 7 main categories
  • 86.4% validation accuracy on edge cases

v4.0

  • Initial production release
  • 7-category multi-label content safety classification
  • Macro F1: 0.9076, MCC: 0.8931
  • Training on 30,596 samples
  • Clear category boundary definitions (GROOMING vs SEXUAL)
  • Optimized for real-time inference (<50ms)

License: SiriusAI Tech Premium License v1.0

Commercial Use: Requires Premium License. Contact: [email protected]

Free Use Allowed For:

  • Academic research and education
  • Non-profit organizations (with approval)
  • Evaluation (30 days)

Disclaimer: This model is designed for content moderation and safety applications. Always implement with appropriate safeguards and human oversight. Model predictions should inform decisions, not replace human judgment.
