L0 Bouncer - DeBERTa Safety Classifier

A fast, lightweight safety classifier based on DeBERTa-v3-xsmall (22M backbone parameters, ~71M total including the embedding layer) that serves as the first tier (L0) in a multi-tier safety cascade system.

Model Description

The L0 Bouncer is designed for high-throughput, low-latency safety screening of text inputs. It provides binary classification (safe vs. harmful) with a focus on maximizing recall to catch potentially harmful content.

Key Features

  • Ultra-fast inference: ~5.7ms per input
  • High recall: 99% (catches nearly all harmful content)
  • Lightweight: only 22M backbone parameters
  • Production-ready: Designed for real-time content moderation

Performance Metrics

Metric         Value
F1 Score       93.0%
Recall         99.0%
Precision      87.6%
Accuracy       92.5%
Mean Latency   5.74ms
P99 Latency    5.86ms
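
These latency figures are hardware-dependent. A minimal sketch for reproducing the mean and P99 measurements locally (single-input CPU inference; the loop count and warm-up step are illustrative choices, not the original benchmark script):

import time
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "vincentoh/deberta-v3-xsmall-l0-bouncer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("What is the capital of France?", return_tensors="pt",
                   truncation=True, max_length=256)

# Warm-up pass so one-time initialization cost doesn't skew the numbers
with torch.no_grad():
    model(**inputs)

latencies_ms = []
for _ in range(200):
    start = time.perf_counter()
    with torch.no_grad():
        model(**inputs)
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"Mean: {np.mean(latencies_ms):.2f}ms  P99: {np.percentile(latencies_ms, 99):.2f}ms")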

Training Data

Trained on 12,000 balanced samples from the GuardReasoner dataset, which contains diverse examples of safe and harmful content with reasoning annotations.
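
As an illustration only, a balanced 12,000-sample split could be drawn as below, assuming the data has been exported to a table with text and label columns; the column names and file path are hypothetical, and the actual GuardReasoner schema may differ:

import pandas as pd

# Hypothetical export with "text" and "label" columns (0 = safe, 1 = harmful);
# the real GuardReasoner schema and file layout may differ.
df = pd.read_csv("guardreasoner_export.csv")

# 6,000 examples per class -> 12,000 balanced samples, then shuffle
balanced = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(n=6000, random_state=42))
      .sample(frac=1, random_state=42)
      .reset_index(drop=True)
)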

Training Details

  • Base Model: microsoft/deberta-v3-xsmall
  • Learning Rate: 2e-5
  • Batch Size: 32 (effective, with gradient accumulation)
  • Epochs: 3
  • Max Sequence Length: 256 tokens
  • Class Weighting: 1.5x weight on harmful class for higher recall
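
The exact training script is not published, but a sketch of how these settings could map onto a Hugging Face Trainer run is shown below; the per-device batch size / gradient-accumulation split is an assumption, the tiny inline dataset is a stand-in for the 12k GuardReasoner split, and the 1.5x class weight is applied through a weighted cross-entropy loss in a Trainer subclass:

import torch
from torch import nn
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-xsmall")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-xsmall", num_labels=2)

# Stand-in rows; the real run uses the 12k balanced GuardReasoner split
train_dataset = Dataset.from_dict(
    {"text": ["example safe prompt", "example harmful prompt"], "label": [0, 1]}
).map(lambda b: tokenizer(b["text"], truncation=True, padding=True,
                          max_length=256), batched=True)

# 1.5x weight on the harmful class (label 1) pushes the model toward recall
class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        weight = torch.tensor([1.0, 1.5], device=outputs.logits.device)
        loss = nn.CrossEntropyLoss(weight=weight)(
            outputs.logits.view(-1, 2), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(
    output_dir="l0-bouncer",
    learning_rate=2e-5,
    per_device_train_batch_size=16,  # assumed split: 16 x 2 accumulation = 32 effective
    gradient_accumulation_steps=2,
    num_train_epochs=3,
)

WeightedTrainer(model=model, args=args, train_dataset=train_dataset).train()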

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "vincentoh/deberta-v3-xsmall-l0-bouncer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify text
text = "What is the capital of France?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)

# Labels: 0 = safe, 1 = harmful
safe_prob = probs[0][0].item()
harmful_prob = probs[0][1].item()

label = "safe" if safe_prob > harmful_prob else "harmful"
confidence = max(safe_prob, harmful_prob)

print(f"Label: {label}, Confidence: {confidence:.2%}")

Cascade Architecture

This model is designed to work as the first tier (L0) in a multi-tier safety cascade:

Input → L0 Bouncer (6ms) → 70% pass through
            ↓ 30% escalate
        L1 Analyst (50ms) → Deeper reasoning
            ↓
        L2 Gauntlet (200ms) → Expert ensemble
            ↓
        L3 Judge (async) → Final review
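
A minimal sketch of the L0 routing decision, reusing the model and tokenizer from the Usage example above; the threshold value and the escalate_to_l1 handler are hypothetical placeholders, with the threshold tuned in practice so that roughly 70% of traffic clears L0:

PASS_THRESHOLD = 0.95  # hypothetical value; tune on held-out traffic

def escalate_to_l1(text: str) -> str:
    # Placeholder for the L1 Analyst call (outside the scope of this card)
    return "escalate"

def route(text: str) -> str:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    if probs[0].item() >= PASS_THRESHOLD:
        return "pass"  # confidently safe: no further checks needed
    return escalate_to_l1(text)  # uncertain or harmful: hand off to L1

print(route("What is the capital of France?"))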

Design Philosophy

  • Safety-first: Prioritizes catching harmful content (high recall) over avoiding false positives
  • Efficient routing: 70% of safe traffic passes at L0, saving compute
  • Graceful escalation: Uncertain cases are escalated to more capable models

Intended Use

Primary Use Cases

  • Content moderation pipelines
  • Safety screening for LLM inputs/outputs
  • First-pass filtering in multi-stage systems
  • Real-time safety classification

Limitations

  • Binary classification only (safe/harmful)
  • Optimized for English text
  • May require calibration for specific domains (see the threshold-tuning sketch below)
  • Should be used with escalation to more capable models for uncertain cases
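
For domain calibration, one simple approach is to sweep the decision threshold on labeled in-domain data and pick an operating point; a sketch using scikit-learn, reusing the tokenizer and model from the Usage example, where texts and labels stand in for a real evaluation set:

import numpy as np
from sklearn.metrics import precision_recall_curve

# Stand-ins for a labeled in-domain evaluation set (0 = safe, 1 = harmful)
texts = ["What is the capital of France?", "Example of a harmful request"]
labels = [0, 1]

batch = tokenizer(texts, return_tensors="pt", padding=True,
                  truncation=True, max_length=256)
with torch.no_grad():
    scores = torch.softmax(model(**batch).logits, dim=-1)[:, 1].numpy()

precision, recall, thresholds = precision_recall_curve(labels, scores)

# Highest threshold that still keeps recall at or above the target
target_recall = 0.99
idx = np.where(recall[:-1] >= target_recall)[0][-1]
print(f"threshold={thresholds[idx]:.3f}  "
      f"precision={precision[idx]:.3f}  recall={recall[idx]:.3f}")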

Citation

If you use this model, please cite:

@misc{l0-bouncer-2024,
  author = {Vincent Oh},
  title = {L0 Bouncer: A Fast Safety Classifier for Content Moderation},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/vincentoh/deberta-v3-xsmall-l0-bouncer}
}

License

MIT License - Free for commercial and non-commercial use.

Contact

For questions or issues, please open an issue on the model repository.
