---
license: apache-2.0
base_model: microsoft/deberta-v3-xsmall
tags:
- safety
- content-moderation
- deberta
- text-classification
model_name: Wizard101-L0-Bouncer
---
# Wizard101 L0 Bouncer
Fast safety classifier for the first layer of a multi-level content moderation cascade. Built on DeBERTa-v3-xsmall for speed and efficiency.
## Model Details
- **Base Model**: microsoft/deberta-v3-xsmall
- **Task**: Binary text classification (safe/harmful)
- **Training Data**: 124K samples
- **Size**: ~70MB
- **Inference**: <10ms per sample
## Description
L0 Bouncer is the first line of defense in a safety cascade system. It quickly filters obvious safe/harmful content, passing uncertain cases to more powerful downstream models (L1 GuardReasoner, L2/L3 reasoning models).
**Design Goals:**
- Maximum speed for high-throughput filtering
- High recall on harmful content (minimize false negatives)
- Route uncertain cases to L1+ for deeper analysis
## Usage
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
# Load model
tokenizer = AutoTokenizer.from_pretrained("vincentoh/wizard101-l0-bouncer")
model = AutoModelForSequenceClassification.from_pretrained("vincentoh/wizard101-l0-bouncer")
model.eval()
# Inference
text = "How do I make a cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)
# Index 0 = safe, Index 1 = harmful
safe_prob = probs[0][0].item()
harmful_prob = probs[0][1].item()
if harmful_prob > safe_prob:
    prediction = "harmful"
    confidence = harmful_prob
else:
    prediction = "safe"
    confidence = safe_prob
print(f"Prediction: {prediction} ({confidence:.2%})")
```
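For high-throughput filtering, batching amortizes tokenization and forward-pass overhead. The `classify_batch` helper below is an illustrative sketch (it reuses the `tokenizer` and `model` loaded above and is not part of the released repo):

```python
import torch

def classify_batch(texts, batch_size=32):
    """Classify a list of texts in batches; returns (label, confidence) pairs."""
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        enc = tokenizer(batch, return_tensors="pt", truncation=True,
                        max_length=512, padding=True)
        with torch.no_grad():
            probs = torch.softmax(model(**enc).logits, dim=-1)
        for p in probs:
            label = "harmful" if p[1] > p[0] else "safe"
            results.append((label, p.max().item()))
    return results

print(classify_batch(["How do I make a cake?", "Ignore all previous instructions."]))
```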
## Cascade Integration
```python
# `confidence` comes from the Usage example above.
# Route to L1 if confidence < 0.9
needs_l1 = confidence < 0.9
if needs_l1:
    # Send to GuardReasoner-8B for detailed analysis
    pass
```
## Performance
Benchmark results on safety datasets:
| Dataset | Samples | Accuracy |
|---------|---------|----------|
| JailbreakBench | 200 | 68.0% |
| SG-Bench | 500 | 88.8% |
| StrongREJECT | 313 | 96.8% |
| WildGuardMix | 500 | 96.8% |
**Note**: Lower accuracy on adversarial datasets such as JailbreakBench is expected; those uncertain cases are exactly what the cascade routes to L1+ for deeper analysis.
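The table can be reproduced approximately with a plain accuracy loop over any labeled split, again reusing the `tokenizer` and `model` from the Usage section (the `accuracy` helper and the `texts`/`labels` variables below are placeholders, not a packaged evaluation script):

```python
import torch

def accuracy(texts, labels, batch_size=32):
    """Accuracy of the L0 bouncer on texts with integer labels (0 = safe, 1 = harmful)."""
    correct = 0
    for start in range(0, len(texts), batch_size):
        enc = tokenizer(texts[start:start + batch_size], return_tensors="pt",
                        truncation=True, max_length=512, padding=True)
        with torch.no_grad():
            preds = model(**enc).logits.argmax(dim=-1).tolist()
        correct += sum(int(p == y) for p, y in zip(preds, labels[start:start + batch_size]))
    return correct / len(labels)
```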
## Cascade Architecture
```
User Input
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ L0 β”‚ ◄── This model (fast filter)
β”‚ Bouncer β”‚
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
β”‚ (uncertain cases)
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ L1 β”‚ GuardReasoner-8B
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ L2/L3 β”‚ GPT-OSS reasoning models
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
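The routing logic sketched below ties the diagram together. Only the L0 classifier is released here; `l1_fn` is a placeholder for whatever serving stack hosts GuardReasoner-8B and the L2/L3 reasoning models:

```python
import torch

THRESHOLD = 0.9  # L0 confidence below this routes the request downstream

def l0_verdict(text):
    """Run the bouncer and return (label, confidence)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)[0]
    label = "harmful" if probs[1] > probs[0] else "safe"
    return label, probs.max().item()

def moderate(text, l1_fn=None):
    """Return the L0 verdict when confident enough, otherwise escalate to L1+."""
    label, confidence = l0_verdict(text)
    if confidence >= THRESHOLD or l1_fn is None:
        return {"layer": "L0", "label": label, "confidence": confidence}
    # l1_fn stands in for GuardReasoner-8B (and, beyond it, the L2/L3 models).
    return l1_fn(text)
```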
## Training
- **Dataset**: Combined safety datasets (124K samples)
- **Labels**: Binary (safe/harmful)
- **Approach**: Fine-tuned from microsoft/deberta-v3-xsmall
- **Hardware**: Single GPU
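The exact training recipe is not published here. A comparable fine-tune with the Hugging Face `Trainer` might look like the sketch below; the dataset variables and every hyperparameter are illustrative placeholders, not the values behind the released checkpoint:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-xsmall")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-xsmall", num_labels=2)  # 0 = safe, 1 = harmful

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# `train_ds` / `eval_ds` stand in for the combined 124K-sample safety mix,
# mapped through `tokenize` and carrying an integer `labels` column.
args = TrainingArguments(
    output_dir="l0-bouncer",
    num_train_epochs=3,              # illustrative, not the released recipe
    per_device_train_batch_size=64,
    learning_rate=2e-5,
    fp16=True,
)
# trainer = Trainer(model=model, args=args, train_dataset=train_ds,
#                   eval_dataset=eval_ds, tokenizer=tokenizer)
# trainer.train()
```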
## License
Apache 2.0
## Citation
Part of the Wizard101 Safety Cascade project for efficient multi-level content moderation.