---
license: apache-2.0
base_model: microsoft/deberta-v3-xsmall
tags:
- safety
- content-moderation
- deberta
- text-classification
model_name: Wizard101-L0-Bouncer
---

# Wizard101 L0 Bouncer

Fast safety classifier for the first layer of a multi-level content moderation cascade. Built on DeBERTa-v3-xsmall for speed and efficiency.

## Model Details

- **Base Model**: microsoft/deberta-v3-xsmall
- **Task**: Binary text classification (safe/harmful)
- **Training Data**: 124K samples
- **Size**: ~70 MB
- **Inference**: <10 ms per sample

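The latency figure above depends on hardware and input length, neither of which is stated in this card. The following is a minimal sketch for measuring per-sample latency on your own setup. `measure_latency_ms` is a hypothetical helper, not part of this repository, and `predict` stands for any callable that wraps the tokenizer and model forward pass.

```python
import statistics
import time
from typing import Callable, List


def measure_latency_ms(predict: Callable[[str], object],
                       texts: List[str],
                       warmup: int = 3) -> float:
    """Return the median wall-clock latency per call, in milliseconds.

    `predict` is an assumed wrapper around tokenization + model inference.
    `texts` must be non-empty.
    """
    # Warm-up calls are not timed, so one-time setup costs do not skew results.
    for text in texts[:warmup]:
        predict(text)

    timings = []
    for text in texts:
        start = time.perf_counter()
        predict(text)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)
```

The median is used rather than the mean so that a single slow call has less influence on the reported number.
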
## Description

L0 Bouncer is the first line of defense in a safety cascade system. It quickly filters content that is obviously safe or obviously harmful and passes uncertain cases to more powerful downstream models (L1 GuardReasoner, L2/L3 reasoning models).

**Design Goals:**
- Maximum speed for high-throughput filtering
- High recall on harmful content (minimize false negatives)
- Route uncertain cases to L1+ for deeper analysis

## Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("vincentoh/wizard101-l0-bouncer")
model = AutoModelForSequenceClassification.from_pretrained("vincentoh/wizard101-l0-bouncer")
model.eval()

# Inference
text = "How do I make a cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)

# Index 0 = safe, Index 1 = harmful
safe_prob = probs[0][0].item()
harmful_prob = probs[0][1].item()

if harmful_prob > safe_prob:
    prediction = "harmful"
    confidence = harmful_prob
else:
    prediction = "safe"
    confidence = safe_prob

print(f"Prediction: {prediction} ({confidence:.2%})")
```

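For high-throughput filtering, inputs are usually scored in batches rather than one at a time. The sketch below shows one way to do that with the same tokenizer and model objects. `chunked` and `classify_batch` are hypothetical helper names, the batch size of 32 is an arbitrary assumption, and the label order (index 0 = safe, index 1 = harmful) is taken from the usage snippet.

```python
from typing import Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def classify_batch(texts: Sequence[str], tokenizer, model,
                   batch_size: int = 32) -> List[Dict[str, object]]:
    """Score texts in batches; returns one dict per input text."""
    # Imported here so that `chunked` can be used without torch installed.
    import torch

    results = []
    for batch in chunked(texts, batch_size):
        # padding=True pads each batch to its longest sequence.
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=512)
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1).tolist()
        for text, (safe_prob, harmful_prob) in zip(batch, probs):
            results.append({"text": text, "safe": safe_prob, "harmful": harmful_prob})
    return results
```

Calling `classify_batch(["How do I make a cake?"], tokenizer, model)` returns a one-element list with the safe and harmful probabilities for that text.
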
## Cascade Integration

```python
# `confidence` is the value computed in the Usage snippet.
# Route to L1 if confidence < 0.9
needs_l1 = confidence < 0.9

if needs_l1:
    # Send to GuardReasoner-8B for detailed analysis
    pass
```

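The routing rule can also be wrapped in a small function that returns the label, the confidence, and the escalation flag together. This is a minimal sketch, assuming the 0.9 threshold from the snippet above. The `Decision` class and `route` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    label: str         # "safe" or "harmful"
    confidence: float  # probability of the predicted label
    escalate: bool     # True when the case should be sent to L1


def route(safe_prob: float, harmful_prob: float,
          threshold: float = 0.9) -> Decision:
    """Pick the more likely label and flag low-confidence cases for L1."""
    if harmful_prob > safe_prob:
        label, confidence = "harmful", harmful_prob
    else:
        label, confidence = "safe", safe_prob
    return Decision(label, confidence, escalate=confidence < threshold)
```

For example, `route(0.4, 0.6)` yields a "harmful" label with `escalate=True`, because 0.6 is below the threshold.
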
## Performance

Benchmark results on safety datasets:

| Dataset | Samples | Accuracy |
|---------|---------|----------|
| JailbreakBench | 200 | 68.0% |
| SG-Bench | 500 | 88.8% |
| StrongREJECT | 313 | 96.8% |
| WildGuardMix | 500 | 96.8% |

**Note**: Lower accuracy on adversarial datasets such as JailbreakBench is expected; these cases are meant to route to L1+ for deeper analysis.

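The table reports accuracy only, while the stated design goal is high recall on harmful content. The sketch below computes both from labels and predictions, so the two can be compared on your own evaluation set. `evaluate` is a hypothetical helper, and the encoding 0 = safe, 1 = harmful follows the usage snippet. It does not reproduce the numbers above.

```python
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy and harmful-class recall (0 = safe, 1 = harmful)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    correct = sum(t == p for t, p in zip(y_true, y_pred))
    harmful_total = sum(t == 1 for t in y_true)
    harmful_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

    # Recall is undefined when the set contains no harmful samples.
    recall = harmful_caught / harmful_total if harmful_total else float("nan")
    return {"accuracy": correct / len(y_true), "harmful_recall": recall}
```

A missed harmful sample (a false negative) lowers `harmful_recall`, which is the quantity the design goals ask to keep high.
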
## Cascade Architecture

```
User Input
     │
     ▼
┌─────────┐
│   L0    │ ◄── This model (fast filter)
│ Bouncer │
└────┬────┘
     │ (uncertain cases)
     ▼
┌─────────┐
│   L1    │ GuardReasoner-8B
└────┬────┘
     │
     ▼
┌─────────┐
│  L2/L3  │ GPT-OSS reasoning models
└─────────┘
```

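The diagram can be read as a chain of classifiers in which each level either answers or defers to the next. The following is a rough sketch of that control flow, not the project's actual orchestration code. `run_cascade` and the `Stage` type are hypothetical, and using one shared threshold for every level, with the last level always giving the final verdict, is an assumption.

```python
from typing import Callable, List, Tuple

# A stage takes the input text and returns (label, confidence).
Stage = Callable[[str], Tuple[str, float]]


def run_cascade(text: str, stages: List[Stage],
                threshold: float = 0.9) -> Tuple[str, float, int]:
    """Return (label, confidence, stage_index) from the first confident stage.

    If no stage reaches the threshold, the last stage's verdict is returned.
    """
    if not stages:
        raise ValueError("at least one stage is required")

    for index, stage in enumerate(stages):
        label, confidence = stage(text)
        is_last = index == len(stages) - 1
        if confidence >= threshold or is_last:
            return label, confidence, index

    # Unreachable: the last stage always returns inside the loop.
    raise AssertionError("cascade ended without a verdict")
```

With stub stages such as `l0 = lambda t: ("safe", 0.6)` and `l1 = lambda t: ("harmful", 0.95)`, the call `run_cascade("x", [l0, l1])` defers past L0 and returns the L1 verdict with stage index 1.
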
## Training

- **Dataset**: Combined safety datasets (124K samples)
- **Labels**: Binary (safe/harmful)
- **Method**: Fine-tuned from DeBERTa-v3-xsmall
- **Hardware**: Single GPU

## License

Apache 2.0

## Citation

Part of the Wizard101 Safety Cascade project for efficient multi-level content moderation.