# L0 Bouncer - DeBERTa Safety Classifier
A fast, lightweight safety classifier based on DeBERTa-v3-xsmall (22M parameters) that serves as the first tier (L0) in a multi-tier safety cascade system.
## Model Description
The L0 Bouncer is designed for high-throughput, low-latency safety screening of text inputs. It provides binary classification (safe vs. harmful) with a focus on maximizing recall to catch potentially harmful content.
## Key Features
- Ultra-fast inference: ~5.7ms per input
- High recall: 99% (catches nearly all harmful content)
- Lightweight: Only 22M parameters
- Production-ready: Designed for real-time content moderation
## Performance Metrics
| Metric | Value |
|---|---|
| F1 Score | 93.0% |
| Recall | 99.0% |
| Precision | 87.6% |
| Accuracy | 92.5% |
| Mean Latency | 5.74ms |
| P99 Latency | 5.86ms |
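Latency figures are hardware-dependent, so treat the numbers above as indicative. A minimal timing harness along the lines below (an illustrative sketch, not the benchmark script used for this card) can reproduce mean/P99-style measurements on your own setup:

```python
# Illustrative single-input latency harness; absolute numbers depend on
# hardware, PyTorch version, and batch size.
import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "vincentoh/deberta-v3-xsmall-l0-bouncer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

inputs = tokenizer("How do I reset my password?", return_tensors="pt",
                   truncation=True, max_length=256)

with torch.no_grad():
    for _ in range(10):          # warm-up passes
        model(**inputs)
    latencies = []
    for _ in range(200):         # timed passes
        start = time.perf_counter()
        model(**inputs)
        latencies.append((time.perf_counter() - start) * 1000)  # ms

latencies.sort()
print(f"mean: {sum(latencies) / len(latencies):.2f}ms, "
      f"p99: {latencies[int(len(latencies) * 0.99)]:.2f}ms")
```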
## Training Data
Trained on 12,000 balanced samples from the GuardReasoner dataset, which contains diverse examples of safe and harmful content with reasoning annotations.
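A balanced split of that size can be prepared along these lines; the dataset ID and field names below are hypothetical placeholders, since the exact preprocessing used for this model is not published here:

```python
# Illustrative preprocessing sketch. "guardreasoner/dataset" and the
# "label" field are placeholders, not the actual source.
from datasets import load_dataset, concatenate_datasets

ds = load_dataset("guardreasoner/dataset", split="train")  # placeholder ID

# Keep 6,000 examples per class for a balanced 12,000-sample training set
safe = ds.filter(lambda x: x["label"] == "safe").shuffle(seed=42).select(range(6000))
harmful = ds.filter(lambda x: x["label"] == "harmful").shuffle(seed=42).select(range(6000))
train = concatenate_datasets([safe, harmful]).shuffle(seed=42)
```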
## Training Details
- Base Model: microsoft/deberta-v3-xsmall
- Learning Rate: 2e-5
- Batch Size: 32 (effective, with gradient accumulation)
- Epochs: 3
- Max Sequence Length: 256 tokens
- Class Weighting: 1.5x weight on harmful class for higher recall
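The hyperparameters above map onto a standard `Trainer` run, except the class weighting, which needs a custom loss. The sketch below shows one way to wire it in; the split between per-device batch size and accumulation steps is an assumption, and data loading/tokenization are omitted:

```python
# Sketch of the training setup implied by this card; not the exact
# training script. The 1.5x harmful-class weight is applied through a
# weighted cross-entropy loss in a Trainer subclass.
import torch
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Up-weight the harmful class (label 1) to push recall higher
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=torch.tensor([1.0, 1.5], device=outputs.logits.device)
        )
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-xsmall", num_labels=2
)
args = TrainingArguments(
    output_dir="l0-bouncer",
    learning_rate=2e-5,
    per_device_train_batch_size=16,   # assumption: 16 x 2 accumulation
    gradient_accumulation_steps=2,    # = effective batch size 32
    num_train_epochs=3,
)
# trainer = WeightedTrainer(model=model, args=args,
#                           train_dataset=..., eval_dataset=...)
# trainer.train()
```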
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "vincentoh/deberta-v3-xsmall-l0-bouncer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Classify text
text = "What is the capital of France?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)

# Labels: 0 = safe, 1 = harmful
safe_prob = probs[0][0].item()
harmful_prob = probs[0][1].item()
label = "safe" if safe_prob > harmful_prob else "harmful"
confidence = max(safe_prob, harmful_prob)
print(f"Label: {label}, Confidence: {confidence:.2%}")
```
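For screening many inputs, batching amortizes per-call overhead. This continues from the snippet above, reusing `tokenizer` and `model`:

```python
# Score a batch of texts in one forward pass (padding aligns lengths)
texts = ["What is the capital of France?", "Tell me about photosynthesis."]
batch = tokenizer(texts, return_tensors="pt", padding=True,
                  truncation=True, max_length=256)
with torch.no_grad():
    probs = torch.softmax(model(**batch).logits, dim=-1)
for text, p in zip(texts, probs):
    label = "harmful" if p[1] > p[0] else "safe"
    print(f"{label:8s} ({p.max().item():.2%})  {text}")
```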
## Cascade Architecture
This model is designed to work as the first tier (L0) in a multi-tier safety cascade:
```
Input → L0 Bouncer (6ms) → 70% pass through
             ↓ 30% escalate
         L1 Analyst (50ms) → Deeper reasoning
             ↓
         L2 Gauntlet (200ms) → Expert ensemble
             ↓
         L3 Judge (async) → Final review
```
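In code, the L0 routing decision reduces to a confidence threshold on the softmax output. The cutoff below is illustrative, not a calibrated value shipped with this model:

```python
# Illustrative L0 routing: pass high-confidence "safe" traffic, escalate
# everything else to L1. The 0.95 cutoff is an example, not a tuned value.
def route_l0(safe_prob: float, threshold: float = 0.95) -> str:
    return "pass" if safe_prob >= threshold else "escalate_to_L1"

print(route_l0(0.99))  # pass            (the ~70% fast path)
print(route_l0(0.60))  # escalate_to_L1  (uncertain or harmful)
```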
## Design Philosophy
- Safety-first: Prioritizes catching harmful content (high recall) over avoiding false positives
- Efficient routing: 70% of safe traffic passes at L0, saving compute
- Graceful escalation: Uncertain cases are escalated to more capable models
## Intended Use
### Primary Use Cases
- Content moderation pipelines
- Safety screening for LLM inputs/outputs
- First-pass filtering in multi-stage systems
- Real-time safety classification
### Limitations
- Binary classification only (safe/harmful)
- Optimized for English text
- May require calibration for specific domains (one generic recipe is sketched after this list)
- Should be used with escalation to more capable models for uncertain cases
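On the calibration point, temperature scaling over a held-out labeled set is one standard option. The sketch below is a generic recipe, not part of this model's release:

```python
# Generic temperature scaling (Guo et al., 2017) as one way to calibrate
# the classifier for a new domain; not shipped with this model.
import torch

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fit a scalar temperature T minimizing NLL on held-out (logits, labels)."""
    log_t = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# At inference, divide logits by T before the softmax:
# probs = torch.softmax(logits / T, dim=-1)
```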
## Citation
If you use this model, please cite:
```bibtex
@misc{l0-bouncer-2024,
  author    = {Vincent Oh},
  title     = {L0 Bouncer: A Fast Safety Classifier for Content Moderation},
  year      = {2024},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/vincentoh/deberta-v3-xsmall-l0-bouncer}
}
```
## License
MIT License - Free for commercial and non-commercial use.
## Contact
For questions or issues, please open an issue on the model repository.