---
license: mit
language:
- id
tags:
- text-classification
- cyberbullying
- indonesian
- toxicity
- bert
- pytorch
- transformers
datasets:
- nayan90k/cyberbullying-tweets-balanced
- poleval/poleval2019_cyberbullying
- kairaamilanii/cyberbullying-indonesia
- izzulgod/indonesian-conversation
base_model:
- nlptown/bert-base-multilingual-uncased-sentiment
- indobenchmark/indobert-base-p1
---
# bert-cyberbullying-bahasa-classifier
A fine-tuned **BERT multilingual classifier** for detecting **cyberbullying** in Bahasa Indonesia. This model performs **binary classification**:
* **0 → non-bullying**
* **1 → bullying**
---
## ✅ Model Details
| Property | Value |
| -------------- | --------------------------------------------------- |
| **Model Type** | BERT (base multilingual) |
| **Task** | Cyberbullying Detection (Text Classification) |
| **Language** | Bahasa Indonesia |
| **Labels** | `0` — non-bullying, `1` — bullying |
| **Framework** | Hugging Face Transformers |
| **Files** | `model.safetensors`, `config.json`, tokenizer files |
---
## 📚 Dataset
This model was trained on a **combined dataset** consisting of:
* Indonesian cyberbullying dataset
* Additional toxic / abusive comment datasets
* Social media–style and chat–style text

**Preprocessing steps:**
* text normalization
* emoji removal
* punctuation cleanup
* lowercasing
* label encoding (0 / 1)

The combined dataset was balanced to reduce bias.

---
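The preprocessing steps listed in the Dataset section can be sketched roughly as follows. This is a minimal illustration, not the author's actual pipeline: the function names `preprocess` and `encode_label`, the emoji ranges, the use of Unicode NFKC for "text normalization", and the string label names are all assumptions.

```python
import re
import unicodedata

# Assumption: emoji are matched by a few common Unicode blocks; the card does
# not say which ranges or library were actually used.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\U0000FE0F\U0000200D"   # variation selector and zero-width joiner
    "]+"
)


def preprocess(text: str) -> str:
    """Hypothetical cleaner: normalize, strip emoji and punctuation, lowercase."""
    text = unicodedata.normalize("NFKC", text)  # text normalization (assumed NFKC)
    text = EMOJI_PATTERN.sub(" ", text)         # emoji removal
    text = re.sub(r"[^\w\s]", " ", text)        # punctuation cleanup
    text = text.lower()                         # lowercasing
    return re.sub(r"\s+", " ", text).strip()    # collapse leftover whitespace


def encode_label(label: str) -> int:
    """Label encoding: 1 for bullying, 0 otherwise (label names are assumed)."""
    return 1 if label == "bullying" else 0
```

If the training data was cleaned this way, applying a similar cleaner to input text before tokenization may keep inference inputs closer to what the model saw during training.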
## 🧠 Training Information
* **Base model:** `bert-base-multilingual-cased`
* **Epochs:** 3–5
* **Batch size:** 16
* **Optimizer:** AdamW
* **Learning rate:** 2e-5
* **Loss:** Cross Entropy
* **Train/Validation split:** 80 / 20

Training was done on a GPU with **6 GB** of VRAM, with the configuration tuned for low memory use.

---
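To make the 80 / 20 split and the batch size of 16 concrete, here is a small standard-library sketch. The helper names `stratified_split` and `steps_per_epoch`, the fixed seed, and the choice to stratify by label are assumptions for illustration; the card does not say how the split was actually drawn.

```python
import math
import random
from collections import defaultdict


def stratified_split(texts, labels, val_fraction=0.2, seed=42):
    """Split (text, label) pairs into train/validation, keeping the class ratio."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    rng = random.Random(seed)  # assumed seed, only for reproducibility
    train, val = [], []
    for label, items in sorted(by_label.items()):
        rng.shuffle(items)
        n_val = round(len(items) * val_fraction)
        val += [(t, label) for t in items[:n_val]]
        train += [(t, label) for t in items[n_val:]]
    rng.shuffle(train)
    rng.shuffle(val)
    return train, val


def steps_per_epoch(n_train, batch_size=16):
    """Number of optimizer steps in one epoch; the last batch may be smaller."""
    return math.ceil(n_train / batch_size)
```

For example, 100 balanced samples would yield 80 training and 20 validation pairs, and `steps_per_epoch(80)` gives 5 steps at batch size 16.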
## ✅ How to Use
### Python Example
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "zeltera/bert-cyberbullying-bahasa-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # make sure dropout is disabled for inference

text = "anjing lu jelek banget"
# Truncate long inputs to the model's maximum sequence length
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Run a forward pass without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

label = torch.argmax(logits, dim=1).item()
print("Prediction:", label)  # 0 = non-bullying, 1 = bullying
```
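For moderation use cases a confidence score is often more useful than a bare label. The sketch below turns a pair of logits into a bullying probability and applies a cutoff. The helper names `softmax` and `decide` and the `threshold` parameter are hypothetical; the card does not publish a tuned threshold.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits, threshold=0.5):
    """Return (label, bullying_probability) for one pair of logits.

    Index 0 is non-bullying and index 1 is bullying, as on this card.
    `threshold` is an assumed knob, not a value tuned by the author.
    """
    p_bullying = softmax(logits)[1]
    return (1 if p_bullying >= threshold else 0), p_bullying
```

With the example above, the helper could be called as `decide(logits[0].tolist())`. Raising the threshold trades recall for precision, which may suit applications where false accusations are costly.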
### Example Predictions
| Text | Output |
| ---------------------- | ---------------- |
| "mampus lu biarin aja" | 1 (bullying) |
| "kamu lagi dimana?" | 0 (non-bullying) |
| "bodoh banget sih" | 1 (bullying) |
| "nice job bro" | 0 (non-bullying) |

---
## 📈 Evaluation

| Metric | Score |
| ---------- | ----- |
| Accuracy | ~0.90 |
| F1 (macro) | ~0.88 |
| Precision | ~0.89 |
| Recall | ~0.87 |
---
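For readers who want to compute the same kinds of metrics on their own labelled data, the sketch below shows how accuracy and macro-averaged precision, recall, and F1 are defined for the two classes. The function name `binary_metrics` is hypothetical, and the card does not state which evaluation set or which averaging was used for the Precision and Recall rows, so this is a reference for the definitions only.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy plus per-class precision/recall/F1 and their macro averages."""

    def safe_div(num, den):
        # Avoid ZeroDivisionError when a class is never predicted or never present
        return num / den if den else 0.0

    per_class = {}
    for cls in (0, 1):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = safe_div(tp, tp + fp)
        recall = safe_div(tp, tp + fn)
        f1 = safe_div(2 * precision * recall, precision + recall)
        per_class[cls] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = safe_div(sum(t == p for t, p in zip(y_true, y_pred)), len(y_true))
    # Macro average: unweighted mean of the two per-class scores
    macro = {
        k: (per_class[0][k] + per_class[1][k]) / 2
        for k in ("precision", "recall", "f1")
    }
    return {"accuracy": accuracy, "macro": macro, "per_class": per_class}
```

Macro averaging weights both classes equally, so it stays informative even if an evaluation set is not perfectly balanced.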
## 🗂️ Repository Contents
```
config.json
model.safetensors
tokenizer.json
tokenizer_config.json
special_tokens_map.json
vocab.txt
README.md
```
---
## 🔧 Intended Use
* AI chatbots (moderation / filtering)
* Social media comment analysis
* Cyberbullying detection systems
* Student safety applications
* Research on toxicity detection
---
## ⚠️ Limitations
* Limited sarcasm detection
* May misclassify unseen slang
* Works best on Indonesian text
* Not suitable for legal or high-risk decisions
---
## 📜 License
MIT License

---
## 👤 Author
Model trained and published by **@zeltera**. Built using Hugging Face Transformers and PyTorch.

Contact: Instagram @gnwnadiwjy