---
license: apache-2.0
base_model: microsoft/deberta-v3-xsmall
tags:
  - safety
  - content-moderation
  - deberta
  - text-classification
model_name: Wizard101-L0-Bouncer
---

# Wizard101 L0 Bouncer

Fast safety classifier for the first layer of a multi-level content moderation cascade. Built on DeBERTa-v3-xsmall for speed and efficiency.

## Model Details

- **Base Model**: microsoft/deberta-v3-xsmall
- **Task**: Binary text classification (safe/harmful)
- **Training Data**: 124K samples
- **Size**: ~70MB
- **Inference**: <10ms per sample

## Description

L0 Bouncer is the first line of defense in a safety cascade system. It quickly filters obvious safe/harmful content, passing uncertain cases to more powerful downstream models (L1 GuardReasoner, L2/L3 reasoning models).

**Design Goals:**
- Maximum speed for high-throughput filtering
- High recall on harmful content (minimize false negatives)
- Route uncertain cases to L1+ for deeper analysis

## Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("vincentoh/wizard101-l0-bouncer")
model = AutoModelForSequenceClassification.from_pretrained("vincentoh/wizard101-l0-bouncer")
model.eval()

# Inference
text = "How do I make a cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)

    # Index 0 = safe, Index 1 = harmful
    safe_prob = probs[0][0].item()
    harmful_prob = probs[0][1].item()

    if harmful_prob > safe_prob:
        prediction = "harmful"
        confidence = harmful_prob
    else:
        prediction = "safe"
        confidence = safe_prob

print(f"Prediction: {prediction} ({confidence:.2%})")
```
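
Since the design target is high-throughput filtering, inputs can also be scored in batches rather than one at a time. A minimal sketch, assuming the same label order as above; the batch contents and device handling are illustrative, not part of the released configuration:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vincentoh/wizard101-l0-bouncer")
model = AutoModelForSequenceClassification.from_pretrained("vincentoh/wizard101-l0-bouncer")
model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

texts = [
    "How do I make a cake?",
    "What's the weather like on Mars?",
]

# Tokenize the whole batch at once; padding keeps tensor shapes uniform
inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512).to(device)

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

# Index 0 = safe, Index 1 = harmful
for text, p in zip(texts, probs):
    label = "harmful" if p[1] > p[0] else "safe"
    print(f"{label:8s} ({p.max().item():.2%})  {text}")
```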

## Cascade Integration

```python
# Route to L1 when the L0 prediction is not confident enough
# (`confidence` comes from the usage example above; 0.9 is the routing threshold)
needs_l1 = confidence < 0.9

if needs_l1:
    # Send to GuardReasoner-8B for detailed analysis
    pass
```

## Performance

Benchmark results on safety datasets:

| Dataset | Samples | Accuracy |
|---------|---------|----------|
| JailbreakBench | 200 | 68.0% |
| SG-Bench | 500 | 88.8% |
| StrongREJECT | 313 | 96.8% |
| WildGuardMix | 500 | 96.8% |

**Note**: Lower accuracy on adversarial datasets such as JailbreakBench is expected; those low-confidence cases are routed to L1+ for deeper analysis.

## Cascade Architecture

```
User Input
    β”‚
    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ L0      β”‚ ◄── This model (fast filter)
β”‚ Bouncer β”‚
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
     β”‚ (uncertain cases)
     β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ L1      β”‚ GuardReasoner-8B
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
     β”‚
     β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ L2/L3   β”‚ GPT-OSS reasoning models
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
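
A minimal sketch of how an L0 routing step could be wired into such a cascade. The 0.9 threshold and the helper name are illustrative assumptions; the L1/L2 models are separate deployments and are not bundled with this checkpoint:

```python
import torch

def l0_route(text, model, tokenizer, threshold=0.9):
    """Classify `text` at L0 and decide whether to escalate.

    Returns (label, confidence, needs_l1). The 0.9 threshold is an
    illustrative value; tune it against your own traffic and latency budget.
    """
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    safe_prob, harmful_prob = probs[0].item(), probs[1].item()
    label = "harmful" if harmful_prob > safe_prob else "safe"
    confidence = max(safe_prob, harmful_prob)
    # Low-confidence cases are handed off to L1 (GuardReasoner-8B) instead of
    # being decided here; the downstream call is deployment-specific.
    return label, confidence, confidence < threshold
```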

## Training

- **Dataset**: Combined safety datasets (124K samples)
- **Labels**: Binary (safe/harmful)
- **Approach**: Fine-tuned from microsoft/deberta-v3-xsmall
- **Hardware**: Single GPU
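
The exact training recipe is not published. The sketch below only illustrates what a standard binary-classification fine-tune of deberta-v3-xsmall could look like with the Hugging Face `Trainer`; the data file, column names, and every hyperparameter are assumptions:

```python
# Illustrative fine-tuning sketch; hyperparameters and dataset are assumed,
# not the settings used for this checkpoint.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-xsmall")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-xsmall", num_labels=2)

# Hypothetical CSV with "text" and "label" (0 = safe, 1 = harmful) columns
dataset = load_dataset("csv", data_files={"train": "safety_train.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="l0-bouncer",
    per_device_train_batch_size=32,  # assumed value
    num_train_epochs=3,              # assumed value
    learning_rate=2e-5,              # assumed value
    fp16=True,
)

Trainer(model=model, args=args, train_dataset=dataset["train"],
        tokenizer=tokenizer).train()
```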

## License

Apache 2.0

## Citation

Part of the Wizard101 Safety Cascade project for efficient multi-level content moderation.