---
license: mit
language:
- id
tags:
- text-classification
- cyberbullying
- indonesian
- toxicity
- bert
- pytorch
- transformers
datasets:
- nayan90k/cyberbullying-tweets-balanced
- poleval/poleval2019_cyberbullying
- kairaamilanii/cyberbullying-indonesia
- izzulgod/indonesian-conversation
base_model:
- nlptown/bert-base-multilingual-uncased-sentiment
- indobenchmark/indobert-base-p1
---
# bert-cyberbullying-bahasa-classifier

<p align="left">
  <img src="https://img.shields.io/badge/Task-Cyberbullying%20Detection-blue">
  <img src="https://img.shields.io/badge/Language-Bahasa%20Indonesia-green">
</p>

A fine-tuned **BERT multilingual classifier** for detecting **cyberbullying** in Bahasa Indonesia. This model performs **binary classification**:

* **0 → non-bullying**
* **1 → bullying**

---

## ✅ Model Details

| Property       | Value                                               |
| -------------- | --------------------------------------------------- |
| **Model Type** | BERT (base multilingual)                            |
| **Task**       | Cyberbullying Detection (Text Classification)       |
| **Language**   | Bahasa Indonesia                                    |
| **Labels**     | `0` — non-bullying, `1` — bullying                  |
| **Framework**  | Hugging Face Transformers                           |
| **Files**      | `model.safetensors`, `config.json`, tokenizer files |

---

## 📚 Dataset

This model was trained using a **combined dataset**, consisting of:

* Indonesian cyberbullying dataset
* Additional toxic / abusive comment datasets
* Social media–style and chat–style text

**Preprocessing steps:**

* text normalization
* emoji removal
* punctuation cleanup
* lowercasing
* label encoding (0 / 1)

The dataset was balanced across the two classes to reduce label bias.
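
The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the exact pipeline used for training: the emoji ranges, the choice to replace punctuation with spaces, and the names `preprocess`, `EMOJI_PATTERN`, `PUNCT_TABLE`, and `LABEL_TO_ID` are all assumptions. "Text normalization" is reduced here to whitespace collapsing, since the card does not describe it further.

```python
import re
import string

# Assumption: these ranges cover the most common emoji code points; the
# filter actually used during training is not documented.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\U0000FE0F"             # variation selector left behind by some emoji
    "]+"
)

# Assumption: ASCII punctuation is replaced by spaces rather than deleted,
# so that "kata,kata" does not collapse into a single token.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})

# Label encoding as stated in the card: 0 = non-bullying, 1 = bullying.
LABEL_TO_ID = {"non-bullying": 0, "bullying": 1}


def preprocess(text: str) -> str:
    """Lowercase, strip emoji and punctuation, and collapse whitespace."""
    text = text.lower()
    text = EMOJI_PATTERN.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.split())
```

Applying the same cleanup at inference time that was applied at training time usually helps keep predictions consistent, so a function like this could be called on each input before tokenization.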

---

## 🧠 Training Information

* **Base model:** `bert-base-multilingual-cased`
* **Epochs:** 3–5
* **Batch size:** 16
* **Optimizer:** AdamW
* **Learning rate:** 2e-5
* **Loss:** Cross Entropy
* **Train/Validation split:** 80 / 20

Training ran on a GPU with **6 GB** of VRAM, with settings chosen to fit that memory budget.
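
To illustrate the 80 / 20 train/validation split, here is a small standard-library sketch. The card does not say whether the split was stratified or which random seed was used, so the per-class shuffling, the `seed` default, and the helper name `stratified_split` are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(texts, labels, val_fraction=0.2, seed=42):
    """Split (text, label) pairs into train/validation sets, class by class.

    Assumption: stratifying by label keeps the class balance of the
    combined dataset in both splits.
    """
    # Group example indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take val_fraction of each class for validation.
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])

    train = [(texts[i], labels[i]) for i in train_idx]
    val = [(texts[i], labels[i]) for i in val_idx]
    return train, val
```

With a balanced dataset, a split like this leaves both the training and validation sets balanced as well, which makes the validation metrics easier to interpret.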

---

## ✅ How to Use

### Python Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "zeltera/bert-cyberbullying-bahasa-classifier"

# Load the tokenizer and the fine-tuned classifier from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

text = "anjing lu jelek banget"
# Truncate long inputs to the tokenizer's configured maximum length
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():  # no gradients needed for inference
    logits = model(**inputs).logits
    label = torch.argmax(logits, dim=1).item()

print("Prediction:", label)  # 0 = non-bullying, 1 = bullying
```
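
The example above returns only the class index. If a confidence score is also useful, the raw logits can be turned into probabilities with a softmax. The sketch below uses only the standard library so it stays independent of the model; the helper names `softmax` and `describe_prediction` and the `ID_TO_LABEL` mapping are illustrative, not part of this repository.

```python
import math

# Label mapping as documented in this card.
ID_TO_LABEL = {0: "non-bullying", 1: "bullying"}


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe_prediction(logits):
    """Return the predicted class index, its name, and its probability."""
    probs = softmax(logits)
    label_id = max(range(len(probs)), key=probs.__getitem__)
    return {
        "label_id": label_id,
        "label": ID_TO_LABEL[label_id],
        "confidence": probs[label_id],
    }
```

With the `logits` tensor from the example, this could be called as `describe_prediction(logits[0].tolist())`. A moderation pipeline might, for instance, send low-confidence predictions to a human reviewer; any such threshold would need to be tuned on your own data.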

### Example Predictions

| Text                   | Output           |
| ---------------------- | ---------------- |
| "mampus lu biarin aja" | 1 (bullying)     |
| "kamu lagi dimana?"    | 0 (non-bullying) |
| "bodoh banget sih"     | 1 (bullying)     |
| "nice job bro"         | 0 (non-bullying) |

![image](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F67fc7a139808e92cc1fb832d%2Fg6dYwiG69gEjovyEAqhyf.png)

---

## 📈 Evaluation

![image](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F67fc7a139808e92cc1fb832d%2FNGGN8woRNsWjf3MNVB_NX.png)

| Metric     | Score |
| ---------- | ----- |
| Accuracy   | ~0.90 |
| F1 (macro) | ~0.88 |
| Precision  | ~0.89 |
| Recall     | ~0.87 |
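
For readers who want to compute the same kind of metrics on their own labelled data, here is a dependency-free sketch. The card states only that F1 is macro-averaged; treating precision and recall as macro-averaged too is an assumption, as is the helper name `binary_metrics`.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy and macro-averaged precision, recall, and F1
    for labels 0 (non-bullying) and 1 (bullying)."""
    pairs = list(zip(y_true, y_pred))

    def per_class(positive):
        # Treat `positive` as the positive class and count outcomes.
        tp = sum(t == positive and p == positive for t, p in pairs)
        fp = sum(t != positive and p == positive for t, p in pairs)
        fn = sum(t == positive and p != positive for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f1

    scores = [per_class(c) for c in (0, 1)]
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {
        "accuracy": accuracy,
        # Macro average: unweighted mean over the two classes.
        "precision_macro": sum(s[0] for s in scores) / 2,
        "recall_macro": sum(s[1] for s in scores) / 2,
        "f1_macro": sum(s[2] for s in scores) / 2,
    }
```

Here `y_true` holds the gold labels and `y_pred` the model's predicted class indices, both as lists of 0s and 1s.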

---

## 🗂️ Repository Contents

```
config.json
model.safetensors
tokenizer.json
tokenizer_config.json
special_tokens_map.json
vocab.txt
README.md
```

---

## 🔧 Intended Use

* AI chatbots (moderation / filtering)
* Social media comment analysis
* Cyberbullying detection systems
* Student safety applications
* Research on toxicity detection

---

## ⚠️ Limitations

* Limited sarcasm detection
* May misclassify unseen slang
* Works best on Indonesian text
* Not suitable for legal or high-risk decisions

---

## 📜 License

MIT License

---

## 👤 Author

* Model trained and published by **@zeltera**
* Built using Hugging Face Transformers + PyTorch
* Contact: Instagram @gnwnadiwjy