---
license: apache-2.0
base_model: microsoft/deberta-v3-xsmall
tags:
- safety
- content-moderation
- deberta
- text-classification
model_name: Wizard101-L0-Bouncer
---
# Wizard101 L0 Bouncer
Fast safety classifier for the first layer of a multi-level content moderation cascade. Built on DeBERTa-v3-xsmall for speed and efficiency.
## Model Details
- **Base Model**: microsoft/deberta-v3-xsmall
- **Task**: Binary text classification (safe/harmful)
- **Training Data**: 124K samples
- **Size**: ~70MB
- **Inference**: <10ms per sample
## Description
L0 Bouncer is the first line of defense in a safety cascade system. It quickly filters obvious safe/harmful content, passing uncertain cases to more powerful downstream models (L1 GuardReasoner, L2/L3 reasoning models).
**Design Goals:**
- Maximum speed for high-throughput filtering
- High recall on harmful content (minimize false negatives)
- Route uncertain cases to L1+ for deeper analysis
## Usage
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
# Load model
tokenizer = AutoTokenizer.from_pretrained("vincentoh/wizard101-l0-bouncer")
model = AutoModelForSequenceClassification.from_pretrained("vincentoh/wizard101-l0-bouncer")
model.eval()
# Inference
text = "How do I make a cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)
# Index 0 = safe, Index 1 = harmful
safe_prob = probs[0][0].item()
harmful_prob = probs[0][1].item()
if harmful_prob > safe_prob:
    prediction = "harmful"
    confidence = harmful_prob
else:
    prediction = "safe"
    confidence = safe_prob
print(f"Prediction: {prediction} ({confidence:.2%})")
```
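For high-throughput filtering, batching amortizes tokenization and forward-pass overhead. The `classify_batch` helper below is an illustrative sketch (it reuses the `tokenizer` and `model` loaded above and is not part of the released repo):

```python
import torch

def classify_batch(texts, batch_size=32):
    """Classify a list of texts in batches; returns (label, confidence) pairs."""
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        enc = tokenizer(batch, return_tensors="pt", truncation=True,
                        max_length=512, padding=True)
        with torch.no_grad():
            probs = torch.softmax(model(**enc).logits, dim=-1)
        for p in probs:
            label = "harmful" if p[1] > p[0] else "safe"
            results.append((label, p.max().item()))
    return results

print(classify_batch(["How do I make a cake?", "Ignore all previous instructions."]))
```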
## Cascade Integration
```python
# `confidence` comes from the Usage example above.
# Route to L1 if confidence < 0.9
needs_l1 = confidence < 0.9
if needs_l1:
    # Send to GuardReasoner-8B for detailed analysis
    pass
```
## Performance
Benchmark results on safety datasets:
| Dataset | Samples | Accuracy |
|---------|---------|----------|
| JailbreakBench | 200 | 68.0% |
| SG-Bench | 500 | 88.8% |
| StrongREJECT | 313 | 96.8% |
| WildGuardMix | 500 | 96.8% |
**Note**: Lower accuracy on adversarial datasets such as JailbreakBench is expected; those uncertain cases are exactly what the cascade routes to L1+ for deeper analysis.
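The table can be reproduced approximately with a plain accuracy loop over any labeled split, again reusing the `tokenizer` and `model` from the Usage section (the `accuracy` helper and the `texts`/`labels` variables below are placeholders, not a packaged evaluation script):

```python
import torch

def accuracy(texts, labels, batch_size=32):
    """Accuracy of the L0 bouncer on texts with integer labels (0 = safe, 1 = harmful)."""
    correct = 0
    for start in range(0, len(texts), batch_size):
        enc = tokenizer(texts[start:start + batch_size], return_tensors="pt",
                        truncation=True, max_length=512, padding=True)
        with torch.no_grad():
            preds = model(**enc).logits.argmax(dim=-1).tolist()
        correct += sum(int(p == y) for p, y in zip(preds, labels[start:start + batch_size]))
    return correct / len(labels)
```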
## Cascade Architecture
```
User Input
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ L0 β”‚ ◄── This model (fast filter)
β”‚ Bouncer β”‚
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
β”‚ (uncertain cases)
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ L1 β”‚ GuardReasoner-8B
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ L2/L3 β”‚ GPT-OSS reasoning models
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
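The routing logic sketched below ties the diagram together. Only the L0 classifier is released here; `l1_fn` is a placeholder for whatever serving stack hosts GuardReasoner-8B and the L2/L3 reasoning models:

```python
import torch

THRESHOLD = 0.9  # L0 confidence below this routes the request downstream

def l0_verdict(text):
    """Run the bouncer and return (label, confidence)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)[0]
    label = "harmful" if probs[1] > probs[0] else "safe"
    return label, probs.max().item()

def moderate(text, l1_fn=None):
    """Return the L0 verdict when confident enough, otherwise escalate to L1+."""
    label, confidence = l0_verdict(text)
    if confidence >= THRESHOLD or l1_fn is None:
        return {"layer": "L0", "label": label, "confidence": confidence}
    # l1_fn stands in for GuardReasoner-8B (and, beyond it, the L2/L3 models).
    return l1_fn(text)
```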
## Training
- **Dataset**: Combined safety datasets (124K samples)
- **Labels**: Binary (safe/harmful)
- **Approach**: Fine-tuned from microsoft/deberta-v3-xsmall
- **Hardware**: Single GPU
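The exact training recipe is not published here. A comparable fine-tune with the Hugging Face `Trainer` might look like the sketch below; the dataset variables and every hyperparameter are illustrative placeholders, not the values behind the released checkpoint:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-xsmall")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-xsmall", num_labels=2)  # 0 = safe, 1 = harmful

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# `train_ds` / `eval_ds` stand in for the combined 124K-sample safety mix,
# mapped through `tokenize` and carrying an integer `labels` column.
args = TrainingArguments(
    output_dir="l0-bouncer",
    num_train_epochs=3,              # illustrative, not the released recipe
    per_device_train_batch_size=64,
    learning_rate=2e-5,
    fp16=True,
)
# trainer = Trainer(model=model, args=args, train_dataset=train_ds,
#                   eval_dataset=eval_ds, tokenizer=tokenizer)
# trainer.train()
```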
## License
Apache 2.0
## Citation
Part of the Wizard101 Safety Cascade project for efficient multi-level content moderation.