---
license: apache-2.0
base_model: microsoft/deberta-v3-xsmall
tags:
  - safety
  - content-moderation
  - deberta
  - text-classification
model_name: Wizard101-L0-Bouncer
---

# Wizard101 L0 Bouncer

Fast safety classifier for the first layer of a multi-level content moderation cascade. Built on DeBERTa-v3-xsmall for speed and efficiency.

## Model Details

- **Base Model**: microsoft/deberta-v3-xsmall
- **Task**: Binary text classification (safe/harmful)
- **Training Data**: 124K samples
- **Size**: ~70MB
- **Inference**: <10ms per sample

## Description

L0 Bouncer is the first line of defense in a safety cascade system. It quickly filters obvious safe/harmful content, passing uncertain cases to more powerful downstream models (L1 GuardReasoner, L2/L3 reasoning models).

**Design Goals:**
- Maximum speed for high-throughput filtering
- High recall on harmful content (minimize false negatives)
- Route uncertain cases to L1+ for deeper analysis

## Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("vincentoh/wizard101-l0-bouncer")
model = AutoModelForSequenceClassification.from_pretrained("vincentoh/wizard101-l0-bouncer")
model.eval()

# Inference
text = "How do I make a cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)

    # Index 0 = safe, Index 1 = harmful
    safe_prob = probs[0][0].item()
    harmful_prob = probs[0][1].item()

    if harmful_prob > safe_prob:
        prediction = "harmful"
        confidence = harmful_prob
    else:
        prediction = "safe"
        confidence = safe_prob

print(f"Prediction: {prediction} ({confidence:.2%})")
```
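
Since the design target is high-throughput filtering, inputs can also be scored in batches rather than one at a time. A minimal sketch, assuming the same label order as above; the batch contents and device handling are illustrative, not part of the released configuration:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vincentoh/wizard101-l0-bouncer")
model = AutoModelForSequenceClassification.from_pretrained("vincentoh/wizard101-l0-bouncer")
model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

texts = [
    "How do I make a cake?",
    "What's the weather like on Mars?",
]

# Tokenize the whole batch at once; padding keeps tensor shapes uniform
inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512).to(device)

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

# Index 0 = safe, Index 1 = harmful
for text, p in zip(texts, probs):
    label = "harmful" if p[1] > p[0] else "safe"
    print(f"{label:8s} ({p.max().item():.2%})  {text}")
```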

## Cascade Integration

```python
# Route to L1 when the L0 prediction is not confident enough
# (`confidence` comes from the usage example above; 0.9 is the routing threshold)
needs_l1 = confidence < 0.9

if needs_l1:
    # Send to GuardReasoner-8B for detailed analysis
    pass
```

## Performance

Benchmark results on safety datasets:

| Dataset | Samples | Accuracy |
|---------|---------|----------|
| JailbreakBench | 200 | 68.0% |
| SG-Bench | 500 | 88.8% |
| StrongREJECT | 313 | 96.8% |
| WildGuardMix | 500 | 96.8% |

**Note**: Lower accuracy on adversarial datasets such as JailbreakBench is expected; those low-confidence cases are routed to L1+ for deeper analysis.

## Cascade Architecture

```
User Input
    β”‚
    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ L0      β”‚ ◄── This model (fast filter)
β”‚ Bouncer β”‚
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
     β”‚ (uncertain cases)
     β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ L1      β”‚ GuardReasoner-8B
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
     β”‚
     β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ L2/L3   β”‚ GPT-OSS reasoning models
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
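
A minimal sketch of how an L0 routing step could be wired into such a cascade. The 0.9 threshold and the helper name are illustrative assumptions; the L1/L2 models are separate deployments and are not bundled with this checkpoint:

```python
import torch

def l0_route(text, model, tokenizer, threshold=0.9):
    """Classify `text` at L0 and decide whether to escalate.

    Returns (label, confidence, needs_l1). The 0.9 threshold is an
    illustrative value; tune it against your own traffic and latency budget.
    """
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    safe_prob, harmful_prob = probs[0].item(), probs[1].item()
    label = "harmful" if harmful_prob > safe_prob else "safe"
    confidence = max(safe_prob, harmful_prob)
    # Low-confidence cases are handed off to L1 (GuardReasoner-8B) instead of
    # being decided here; the downstream call is deployment-specific.
    return label, confidence, confidence < threshold
```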

## Training

- **Dataset**: Combined safety datasets (124K samples)
- **Labels**: Binary (safe/harmful)
- **Approach**: Fine-tuned from microsoft/deberta-v3-xsmall
- **Hardware**: Single GPU
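
The exact training recipe is not published. The sketch below only illustrates what a standard binary-classification fine-tune of deberta-v3-xsmall could look like with the Hugging Face `Trainer`; the data file, column names, and every hyperparameter are assumptions:

```python
# Illustrative fine-tuning sketch; hyperparameters and dataset are assumed,
# not the settings used for this checkpoint.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-xsmall")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-xsmall", num_labels=2)

# Hypothetical CSV with "text" and "label" (0 = safe, 1 = harmful) columns
dataset = load_dataset("csv", data_files={"train": "safety_train.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="l0-bouncer",
    per_device_train_batch_size=32,  # assumed value
    num_train_epochs=3,              # assumed value
    learning_rate=2e-5,              # assumed value
    fp16=True,
)

Trainer(model=model, args=args, train_dataset=dataset["train"],
        tokenizer=tokenizer).train()
```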

## License

Apache 2.0

## Citation

Part of the Wizard101 Safety Cascade project for efficient multi-level content moderation.