L0 Bouncer - DeBERTa Safety Classifier

A fast, lightweight safety classifier based on DeBERTa-v3-xsmall (22M backbone parameters, ~71M total including the embedding layer) that serves as the first tier (L0) in a multi-tier safety cascade system.

Model Description

The L0 Bouncer is designed for high-throughput, low-latency safety screening of text inputs. It provides binary classification (safe vs. harmful) with a focus on maximizing recall to catch potentially harmful content.

Key Features

  • Ultra-fast inference: ~5.7ms per input
  • High recall: 99% (catches nearly all harmful content)
  • Lightweight: only 22M backbone parameters
  • Production-ready: Designed for real-time content moderation

Performance Metrics

Metric         Value
F1 Score       93.0%
Recall         99.0%
Precision      87.6%
Accuracy       92.5%
Mean Latency   5.74ms
P99 Latency    5.86ms
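
These latency figures are hardware-dependent. A minimal sketch for reproducing the mean and P99 measurements locally (single-input CPU inference; the loop count and warm-up step are illustrative choices, not the original benchmark script):

import time
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "vincentoh/deberta-v3-xsmall-l0-bouncer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("What is the capital of France?", return_tensors="pt",
                   truncation=True, max_length=256)

# Warm-up pass so one-time initialization cost doesn't skew the numbers
with torch.no_grad():
    model(**inputs)

latencies_ms = []
for _ in range(200):
    start = time.perf_counter()
    with torch.no_grad():
        model(**inputs)
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"Mean: {np.mean(latencies_ms):.2f}ms  P99: {np.percentile(latencies_ms, 99):.2f}ms")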

Training Data

Trained on 12,000 balanced samples from the GuardReasoner dataset, which contains diverse examples of safe and harmful content with reasoning annotations.
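
As an illustration only, a balanced 12,000-sample split could be drawn as below, assuming the data has been exported to a table with text and label columns; the column names and file path are hypothetical, and the actual GuardReasoner schema may differ:

import pandas as pd

# Hypothetical export with "text" and "label" columns (0 = safe, 1 = harmful);
# the real GuardReasoner schema and file layout may differ.
df = pd.read_csv("guardreasoner_export.csv")

# 6,000 examples per class -> 12,000 balanced samples, then shuffle
balanced = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(n=6000, random_state=42))
      .sample(frac=1, random_state=42)
      .reset_index(drop=True)
)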

Training Details

  • Base Model: microsoft/deberta-v3-xsmall
  • Learning Rate: 2e-5
  • Batch Size: 32 (effective, with gradient accumulation)
  • Epochs: 3
  • Max Sequence Length: 256 tokens
  • Class Weighting: 1.5x weight on harmful class for higher recall
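
The exact training script is not published, but a sketch of how these settings could map onto a Hugging Face Trainer run is shown below; the per-device batch size / gradient-accumulation split is an assumption, the tiny inline dataset is a stand-in for the 12k GuardReasoner split, and the 1.5x class weight is applied through a weighted cross-entropy loss in a Trainer subclass:

import torch
from torch import nn
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-xsmall")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-xsmall", num_labels=2)

# Stand-in rows; the real run uses the 12k balanced GuardReasoner split
train_dataset = Dataset.from_dict(
    {"text": ["example safe prompt", "example harmful prompt"], "label": [0, 1]}
).map(lambda b: tokenizer(b["text"], truncation=True, padding=True,
                          max_length=256), batched=True)

# 1.5x weight on the harmful class (label 1) pushes the model toward recall
class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        weight = torch.tensor([1.0, 1.5], device=outputs.logits.device)
        loss = nn.CrossEntropyLoss(weight=weight)(
            outputs.logits.view(-1, 2), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(
    output_dir="l0-bouncer",
    learning_rate=2e-5,
    per_device_train_batch_size=16,  # assumed split: 16 x 2 accumulation = 32 effective
    gradient_accumulation_steps=2,
    num_train_epochs=3,
)

WeightedTrainer(model=model, args=args, train_dataset=train_dataset).train()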

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "vincentoh/deberta-v3-xsmall-l0-bouncer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify text
text = "What is the capital of France?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)

# Labels: 0 = safe, 1 = harmful
safe_prob = probs[0][0].item()
harmful_prob = probs[0][1].item()

label = "safe" if safe_prob > harmful_prob else "harmful"
confidence = max(safe_prob, harmful_prob)

print(f"Label: {label}, Confidence: {confidence:.2%}")

Cascade Architecture

This model is designed to work as the first tier (L0) in a multi-tier safety cascade:

Input → L0 Bouncer (6ms) → 70% pass through
            ↓ 30% escalate
        L1 Analyst (50ms) → Deeper reasoning
            ↓
        L2 Gauntlet (200ms) → Expert ensemble
            ↓
        L3 Judge (async) → Final review
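
A minimal sketch of the L0 routing decision, reusing the model and tokenizer from the Usage example above; the threshold value and the escalate_to_l1 handler are hypothetical placeholders, with the threshold tuned in practice so that roughly 70% of traffic clears L0:

PASS_THRESHOLD = 0.95  # hypothetical value; tune on held-out traffic

def escalate_to_l1(text: str) -> str:
    # Placeholder for the L1 Analyst call (outside the scope of this card)
    return "escalate"

def route(text: str) -> str:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    if probs[0].item() >= PASS_THRESHOLD:
        return "pass"  # confidently safe: no further checks needed
    return escalate_to_l1(text)  # uncertain or harmful: hand off to L1

print(route("What is the capital of France?"))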

Design Philosophy

  • Safety-first: Prioritizes catching harmful content (high recall) over avoiding false positives
  • Efficient routing: 70% of safe traffic passes at L0, saving compute
  • Graceful escalation: Uncertain cases are escalated to more capable models

Intended Use

Primary Use Cases

  • Content moderation pipelines
  • Safety screening for LLM inputs/outputs
  • First-pass filtering in multi-stage systems
  • Real-time safety classification

Limitations

  • Binary classification only (safe/harmful)
  • Optimized for English text
  • May require calibration for specific domains (see the threshold-tuning sketch below)
  • Should be used with escalation to more capable models for uncertain cases
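
For domain calibration, one simple approach is to sweep the decision threshold on labeled in-domain data and pick an operating point; a sketch using scikit-learn, reusing the tokenizer and model from the Usage example, where texts and labels stand in for a real evaluation set:

import numpy as np
from sklearn.metrics import precision_recall_curve

# Stand-ins for a labeled in-domain evaluation set (0 = safe, 1 = harmful)
texts = ["What is the capital of France?", "Example of a harmful request"]
labels = [0, 1]

batch = tokenizer(texts, return_tensors="pt", padding=True,
                  truncation=True, max_length=256)
with torch.no_grad():
    scores = torch.softmax(model(**batch).logits, dim=-1)[:, 1].numpy()

precision, recall, thresholds = precision_recall_curve(labels, scores)

# Highest threshold that still keeps recall at or above the target
target_recall = 0.99
idx = np.where(recall[:-1] >= target_recall)[0][-1]
print(f"threshold={thresholds[idx]:.3f}  "
      f"precision={precision[idx]:.3f}  recall={recall[idx]:.3f}")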

Citation

If you use this model, please cite:

@misc{l0-bouncer-2024,
  author = {Vincent Oh},
  title = {L0 Bouncer: A Fast Safety Classifier for Content Moderation},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/vincentoh/deberta-v3-xsmall-l0-bouncer}
}

License

MIT License - Free for commercial and non-commercial use.

Contact

For questions or issues, please open an issue on the model repository.
