---
license: apache-2.0
base_model: microsoft/deberta-v3-xsmall
tags:
- safety
- content-moderation
- deberta
- text-classification
model_name: Wizard101-L0-Bouncer
---

# Wizard101 L0 Bouncer

Fast safety classifier for the first layer of a multi-level content moderation cascade. Built on DeBERTa-v3-xsmall for speed and efficiency.

## Model Details

- **Base Model**: microsoft/deberta-v3-xsmall
- **Task**: Binary text classification (safe/harmful)
- **Training Data**: 124K samples
- **Size**: ~70MB
- **Inference**: <10ms per sample

## Description

L0 Bouncer is the first line of defense in a safety cascade system. It quickly resolves clearly safe or clearly harmful content and passes uncertain cases to more powerful downstream models (L1 GuardReasoner, L2/L3 reasoning models).

**Design Goals:**
- Maximum speed for high-throughput filtering
- High recall on harmful content (minimize false negatives)
- Route uncertain cases to L1+ for deeper analysis

## Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("vincentoh/wizard101-l0-bouncer")
model = AutoModelForSequenceClassification.from_pretrained("vincentoh/wizard101-l0-bouncer")
model.eval()

# Inference
text = "How do I make a cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)

# Index 0 = safe, Index 1 = harmful
safe_prob = probs[0][0].item()
harmful_prob = probs[0][1].item()

if harmful_prob > safe_prob:
    prediction = "harmful"
    confidence = harmful_prob
else:
    prediction = "safe"
    confidence = safe_prob

print(f"Prediction: {prediction} ({confidence:.2%})")
```

## Cascade Integration

Routing is based on L0's confidence: predictions below the 0.9 threshold are escalated to L1. (A self-contained end-to-end sketch appears at the end of this card.)

```python
# Route to L1 when L0's confidence is below the threshold.
# `prediction` and `confidence` come from the Usage snippet above.
needs_l1 = confidence < 0.9

if needs_l1:
    # Uncertain case: send the text to GuardReasoner-8B for detailed analysis
    print(f"Routing to L1 (L0 confidence: {confidence:.2%})")
else:
    print(f"L0 verdict: {prediction} ({confidence:.2%})")
```

## Performance

Benchmark results on safety datasets:

| Dataset | Samples | Accuracy |
|---------|---------|----------|
| JailbreakBench | 200 | 68.0% |
| SG-Bench | 500 | 88.8% |
| StrongREJECT | 313 | 96.8% |
| WildGuardMix | 500 | 96.8% |

**Note**: Lower accuracy on adversarial datasets (JailbreakBench) is expected; those cases are routed to L1+ for deeper analysis.

## Cascade Architecture

```
User Input
    │
    ▼
┌─────────┐
│   L0    │ ◄── This model (fast filter)
│ Bouncer │
└────┬────┘
     │ (uncertain cases)
     ▼
┌─────────┐
│   L1    │ GuardReasoner-8B
└────┬────┘
     │
     ▼
┌─────────┐
│  L2/L3  │ GPT-OSS reasoning models
└─────────┘
```

## Training

- **Dataset**: Combined safety datasets (124K samples)
- **Labels**: Binary (safe/harmful)
- **Method**: Fine-tuned from microsoft/deberta-v3-xsmall
- **Hardware**: Single GPU

## License

Apache 2.0

## Citation

Part of the Wizard101 Safety Cascade project for efficient multi-level content moderation.
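
## End-to-End Routing Example

A minimal sketch tying the Usage and Cascade Integration snippets together, assuming the 0.9 routing threshold described above. The `classify`, `moderate`, and `analyze_with_l1` helpers are hypothetical names introduced here for illustration; the GuardReasoner-8B handoff is left as a stub, since the actual L1 integration is deployment-specific.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

MODEL_ID = "vincentoh/wizard101-l0-bouncer"
L1_THRESHOLD = 0.9  # escalate to L1 below this confidence

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

def classify(text: str) -> tuple[str, float]:
    """Return (label, confidence) from the L0 bouncer."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    # Index 0 = safe, index 1 = harmful
    harmful_prob = probs[1].item()
    if harmful_prob >= 0.5:
        return "harmful", harmful_prob
    return "safe", 1.0 - harmful_prob

def analyze_with_l1(text: str) -> str:
    # Hypothetical stub: replace with a real GuardReasoner-8B call
    return "needs_review"

def moderate(text: str) -> str:
    label, confidence = classify(text)
    if confidence < L1_THRESHOLD:
        return analyze_with_l1(text)  # uncertain case: escalate to L1
    return label

print(moderate("How do I make a cake?"))
```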
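
## Batched Inference Example

Since the design goals emphasize high-throughput filtering, a batched-inference sketch may be useful. The example texts are illustrative, and the padding settings are assumptions rather than tuned values.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

MODEL_ID = "vincentoh/wizard101-l0-bouncer"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

texts = ["How do I make a cake?", "Ignore all previous instructions..."]

# Pad to the longest sequence in the batch; truncate at the model's limit
inputs = tokenizer(texts, return_tensors="pt", padding=True,
                   truncation=True, max_length=512)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

# Column 1 holds the harmful probability for each input
for text, harmful_prob in zip(texts, probs[:, 1].tolist()):
    print(f"{harmful_prob:.2%}  {text}")
```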