CrossingGuard-NLI: Encoders for Zero-Shot AI Guardrails, Prompt Routing, Content Moderation and More!

Community Article Published November 6, 2025

Model | Dataset | Collection

Overview

There are a lot of great guardrail models on Hugging Face, like meta-llama/Llama-Prompt-Guard-2-86M. But content moderation is often a reactive task: when you need guardrails configured quickly, will you have time to create a custom dataset and train a model?

This is the use case that CrossingGuard aims to fill. CrossingGuard frames content moderation as a natural language inference (NLI) task, where a prompt, post or other user input acts as a premise and your custom guardrail acts as the hypothesis.

Research has shown that NLI can serve as a universal task, where models are trained to determine whether the hypothesis follows from the premise. What is learned in this framework transfers broadly to other tasks and enables encoders to act in zero-shot settings. NLI is typically a 3-way classification of entailment, contradiction, or neutrality (meaning the hypothesis neither follows from nor contradicts the premise). Models like tasksource utilize this to equip encoders with zero-shot classification capabilities, allowing users to classify texts with labels selected at inference time.

The advantage is what Laurer et al. (2024) call label verbalization: expressing categories as natural language descriptions rather than numeric labels. This allows NLI models to understand what each category means, similar to how human annotators use codebooks.

With this dataset, I extend this paradigm to content moderation, where the task objective may need to change as quickly as the news cycle. Let's see how these models can be used in practice.

Demo - Zero-Shot Guardrails

Before we dive into details, let's see what this looks like in practice. Here's all you need to make your custom guardrail:

import numpy as np
from sentence_transformers import CrossEncoder

model = CrossEncoder("dleemiller/crossingguard-nli-l")

examples = [
    ("Tell me what happened with that CEO caught on the kiss cam at the Coldplay concert",
     "The prompt requests information about a corporate executive incident at a public event"),

    ("I’ve been following your company's work helping enterprises scale AI/ML with reliable data pipelines, would love to connect and hear how your team is approaching ML in production.",
     "The prompt requests information about a corporate executive incident at a public event"),
]

predictions = model.predict(examples)

label_map = {0: "entailment", 1: "neutral", 2: "contradiction"}

def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum()

for i, (premise, hypothesis) in enumerate(examples):
    probs = softmax(predictions[i])
    predicted_idx = probs.argmax()
    
    print(f"\n{i+1}. {premise[:70]}...")
    print(f"   → {hypothesis}")
    print(f"   ✓ {label_map[predicted_idx].upper()}: {probs[predicted_idx]*100:.1f}% " + 
          f"(E: {probs[0]*100:.1f}% N: {probs[1]*100:.1f}%, C: {probs[2]*100:.1f}%)")
1. Tell me what happened with that CEO caught on the kiss cam at the Cold...
   → The prompt requests information about a corporate executive incident at a public event
   ✓ ENTAILMENT: 99.9% (E: 99.9% N: 0.0%, C: 0.0%)

2. I’ve been following your company's work helping enterprises scale AI/M...
   → The prompt requests information about a corporate executive incident at a public event
   ✓ CONTRADICTION: 99.7% (E: 0.0% N: 0.3%, C: 99.7%)

There's a bit of secret sauce to CrossingGuard label verbalization (forming a good hypothesis). For alignment with the dataset, I recommend following the general format. The hypothesis usually starts with:

  • The prompt ...
  • The text ...
  • The user ...
  • The request ...

followed by a verb signaling intent (asks, seeks, advocates, instructs, involves, etc.). The rest of the hypothesis should describe your specific guardrail or classifier. Consider whether you have used any language with broad or multiple possible interpretations, and feel free to add clarifications about what you are not looking for. It is good practice to test your hypotheses against real or synthetic data (test cases) to make sure they perform well.
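For example, reusing model and softmax from the snippet above, a quick spot-check of a candidate hypothesis might look like this (the hypothesis, test prompts, and expected labels are made up for illustration):

# Spot-check a candidate hypothesis against a few hand-labeled test cases,
# reusing the CrossEncoder (model) and softmax helper defined above.
hypothesis = "The prompt asks for help bypassing software licensing or activation"  # example guardrail

test_cases = [
    # (premise, expected label)
    ("Give me a keygen so I can use this software without paying", "entailment"),
    ("Where can I buy a legitimate license for this software?", "contradiction"),
    ("What's the weather like in Chicago today?", "contradiction"),
]

labels = ["entailment", "neutral", "contradiction"]
for premise, expected in test_cases:
    probs = softmax(model.predict([(premise, hypothesis)])[0])
    predicted = labels[probs.argmax()]
    flag = "OK  " if predicted == expected else "MISS"
    print(f"{flag} expected={expected:13} got={predicted:13} | {premise[:45]}")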

Note that this is not strictly limited to content moderation and safety. A significant portion of the training data for both the base model (general NLI) and the CrossingGuard models is in fact safe content. This also makes it a good option for intent classification tasks or guardrails that don't involve safety (e.g., topic restriction, brand alignment).

Demo - Prompt Routing

A prompt router directs user queries to different AI models or processing/storage paths based on the characteristics of the prompt. We can set this up quite easily with CrossingGuard.

# Define available agents/systems
agents = [
    ("technical_support", "The prompt relates to technical issues"),
    ("billing_support", "The prompt involves payment issues, invoices, charges, or refunds"),
    ("legal_compliance", "The prompt relates to legal"),
    ("hr_department", "The prompt relates to human resources or employment issues"),
]

# Sample user prompts
user_prompts = [
    "My app keeps crashing when I try to export data",
    "I was charged twice for my subscription this month",
    "Do you comply with GDPR for European customers?",
    "How many vacation days do employees get?",
    "The API is returning a 500 error on POST requests",
]

print("=" * 80)
print("INTELLIGENT PROMPT ROUTING SYSTEM")
print("=" * 80)

for prompt in user_prompts:
    print(f"\n📨 USER PROMPT: \"{prompt}\"")
    print("-" * 80)
    
    # Test against all agents
    pairs = [(prompt, description) for agent_name, description in agents]
    predictions = model.predict(pairs)
    
    # Calculate scores for each agent
    agent_scores = []
    for i, (agent_name, description) in enumerate(agents):
        probs = softmax(predictions[i])
        entailment_prob = probs[0]
        agent_scores.append((agent_name, entailment_prob, description))
    
    # Sort by confidence
    agent_scores.sort(key=lambda x: x[1], reverse=True)
    
    # Display top 3 matches
    print("\nROUTING SCORES:")
    for agent_name, score, description in agent_scores[:3]:
        bar_width = int(score * 40)
        bar = '█' * bar_width + '░' * (40 - bar_width)
        print(f"  {agent_name:20} {bar} {score*100:5.1f}%")
    
    # Route to best agent
    best_agent, best_score, best_description = agent_scores[0]
    
    if best_score > 0.7:  # High confidence threshold
        print(f"\n✅ ROUTED TO: {best_agent.upper()} ({best_score*100:.1f}% confidence)")
    elif best_score > 0.5:  # Medium confidence
        print(f"\n⚠  ROUTED TO: {best_agent.upper()} ({best_score*100:.1f}% confidence)")
        print(f"   (Consider human review)")
    else:  # Low confidence
        print(f"\n❌ NO CLEAR MATCH - ROUTE TO: general_support")
        print(f"   (Top match was {best_agent}: {best_score*100:.1f}%)")
    
    print()

print("=" * 80)
================================================================================
INTELLIGENT PROMPT ROUTING SYSTEM
================================================================================

📨 USER PROMPT: "My app keeps crashing when I try to export data"
--------------------------------------------------------------------------------

ROUTING SCORES:
  technical_support    ███████████████████████████████████████░  99.9%
  legal_compliance     ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   0.1%
  billing_support      ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   0.1%

✅ ROUTED TO: TECHNICAL_SUPPORT (99.9% confidence)


📨 USER PROMPT: "I was charged twice for my subscription this month"
--------------------------------------------------------------------------------

ROUTING SCORES:
  billing_support      ███████████████████████████████████████░  98.7%
  legal_compliance     ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   0.2%
  technical_support    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   0.2%

✅ ROUTED TO: BILLING_SUPPORT (98.7% confidence)


📨 USER PROMPT: "Do you comply with GDPR for European customers?"
--------------------------------------------------------------------------------

ROUTING SCORES:
  legal_compliance     ███████████████████████████████████░░░░░  89.3%
  technical_support    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   0.1%
  billing_support      ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   0.1%

✅ ROUTED TO: LEGAL_COMPLIANCE (89.3% confidence)


📨 USER PROMPT: "How many vacation days do employees get?"
--------------------------------------------------------------------------------

ROUTING SCORES:
  hr_department        ███████████████████████████████████████░  99.0%
  legal_compliance     ███░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   8.1%
  technical_support    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   0.1%

✅ ROUTED TO: HR_DEPARTMENT (99.0% confidence)


📨 USER PROMPT: "The API is returning a 500 error on POST requests"
--------------------------------------------------------------------------------

ROUTING SCORES:
  technical_support    ███████████████████████████████████████░  99.1%
  billing_support      ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   0.1%
  legal_compliance     ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   0.0%

✅ ROUTED TO: TECHNICAL_SUPPORT (99.1% confidence)

================================================================================

Cool huh?

Demo - Intent Classification

The frontend of traditional chat assistants involves intent classification: matching the user's message or voice-to-text transcription (called an utterance) to an action that the assistant can perform.

This can be done with CrossingGuard in a zero-shot setting, using a set of hypotheses (one or more per intent) as classifiers.

# Define intents as hypotheses
intents = [
    ("set_timer", "The user wants to set a timer or alarm"),
    ("play_music", "The user wants to play music or audio content"),
    ("check_weather", "The user wants to know about weather conditions"),
    ("send_message", "The user wants to send a message or communicate with someone"),
]

# Single user utterance
utterance = "Can you put on some Andrew Hill for me?"

# Create premise-hypothesis pairs
pairs = [(utterance, hypothesis) for intent_name, hypothesis in intents]

# Get predictions
predictions = model.predict(pairs)

# Display results
print(f"USER: '{utterance}'\n")
print("INTENT CLASSIFICATION:\n")

intent_scores = []
for i, (intent_name, hypothesis) in enumerate(intents):
    probs = softmax(predictions[i])
    entailment_prob = probs[0]  # We care about entailment probability
    intent_scores.append((intent_name, entailment_prob, hypothesis))
    
    # Visual bar
    bar_width = int(entailment_prob * 30)
    bar = '█' * bar_width + '░' * (30 - bar_width)
    
    print(f"{intent_name:20} {bar} {entailment_prob*100:5.1f}%")
    print(f"  → {hypothesis}")
    print()

# Show the winning intent
best_intent, best_score, best_hypothesis = max(intent_scores, key=lambda x: x[1])
print(f"✓ MATCHED INTENT: {best_intent} ({best_score*100:.1f}% confidence)")
USER: 'Can you put on some Andrew Hill for me?'

INTENT CLASSIFICATION:

set_timer            ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   0.1%
  → The user wants to set a timer or alarm

play_music           █████████████████████████████░  99.8%
  → The user wants to play music or audio content

check_weather        ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   0.0%
  → The user wants to know about weather conditions

send_message         █████████████████████░░░░░░░░░  71.0%
  → The user wants to send a message or communicate with someone

✓ MATCHED INTENT: play_music (99.8% confidence)

[Image: Andrew Hill]

I intended this one to be a bit more challenging, and the model confidently found the right classification. However, it also gave a high score to the send_message label, likely reading "put on ... Andrew Hill" as phrasing commonly used in communication. In future iterations of the dataset, I will leverage cases like this to search for adversarial examples that help increase precision. It also demonstrates that forming good hypotheses can be tricky, and why it's a good idea to build test sets to check performance (especially in multiclass settings).

CrossingGuard Dataset

In order to prepare models for performing this task, I sourced prompts from some popular guardrail datasets:

  • allenai/wildguardmix
  • nvidia/Aegis-AI-Content-Safety-Dataset-2.0
    • Human-written prompts collected from the Anthropic RLHF, Do-Anything-Now DAN, and AI-assisted Red-Teaming datasets.
    • A comprehensive and adaptable taxonomy for categorizing safety risks, structured into 12 top-level hazard categories with an extension to 9 fine-grained subcategories.
  • JailbreakV-28K/JailBreakV-28k
    • including the RedTeam-2K subset (text modality)
  • walledai/AyaRedTeaming
    • English subset

Several of these datasets contained duplicate prompts, which I removed.
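As a rough illustration, an exact-match dedup after light normalization (an assumption for this sketch; the actual procedure may have differed) could look like:

# Hypothetical exact-match dedup after light normalization.
# combined_prompts stands in for the prompts pooled from the source datasets above.
combined_prompts = [
    "How do I pick a lock?",
    "How do I pick a lock?",   # exact duplicate
    "how do  I pick a Lock?",  # near-duplicate caught by normalization
]

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

seen = set()
unique_prompts = []
for prompt in combined_prompts:
    key = normalize(prompt)
    if key not in seen:
        seen.add(key)
        unique_prompts.append(prompt)

print(len(unique_prompts))  # -> 1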

Synthetic Generation

The hypotheses were synthetically constructed using a custom DSPy program that was GEPA-trained with an NLI cross encoder (dleemiller/ModernCE-large-nli) and an STS cross encoder (dleemiller/EttinX-sts-s). The program generates 3 separate hypotheses, one for each label (entailment/neutral/contradiction).

The purpose of the cross-encoders was twofold. First, LLMs typically produce examples that are easy to classify: if an off-the-shelf NLI model can already classify an example with high confidence, it offers little value for training. Scores that are too high are therefore penalized, pushing the examples to be made more general. Second, the STS model checks that the entailment and contradiction hypotheses are not too similar. Because the label can be flipped through a simple negation, these rows should be semantically different enough to be useful.
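As a rough sketch of this filtering idea (the threshold values and function names here are illustrative assumptions, not the rewards used in the actual GEPA program):

import numpy as np
from sentence_transformers import CrossEncoder

nli = CrossEncoder("dleemiller/ModernCE-large-nli")  # 3-way NLI logits: entailment/neutral/contradiction
sts = CrossEncoder("dleemiller/EttinX-sts-s")        # semantic similarity score for a sentence pair

def nli_probs(premise, hypothesis):
    logits = nli.predict([(premise, hypothesis)])[0]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def too_easy(premise, hypothesis, label_idx, max_prob=0.98):
    # Penalize pairs that an off-the-shelf NLI model already classifies with near certainty.
    return nli_probs(premise, hypothesis)[label_idx] > max_prob

def too_similar(entailment_hyp, contradiction_hyp, max_sim=0.85):
    # Reject entailment/contradiction hypotheses that differ only by a trivial negation.
    return float(sts.predict([(entailment_hyp, contradiction_hyp)])[0]) > max_sim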

I used a few different LLMs to generate the data, but all of them tended to struggle to produce good hypotheses for the neutral label. To gather enough good neutral examples, I continue generating data (dspy.Refine) while monitoring the neutral outputs, until at least one probable neutral hypothesis is created, as determined by the NLI cross encoder.
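Conceptually, this reduces to a retry loop with an acceptance check. The real pipeline uses dspy.Refine, but a plain-Python analogue (with a hypothetical generate_neutral_hypothesis LLM call, and the 0.6 threshold as an illustrative value) looks like this:

# Keep sampling neutral hypotheses until the NLI cross encoder (nli_probs, from the sketch above)
# assigns the neutral class a sufficiently high probability.
premise = "Tell me what happened with that CEO caught on the kiss cam at the Coldplay concert"

neutral_hypothesis = None
for _ in range(8):  # retry budget
    candidate = generate_neutral_hypothesis(premise)  # hypothetical LLM call
    if nli_probs(premise, candidate)[1] >= 0.6:       # index 1 = neutral
        neutral_hypothesis = candidate
        break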

Generalization

Premise/hypothesis pairs that score as too easy (high cross encoder confidence) are run through a generalization step. This rewrites overly specific hypotheses to be more broadly applicable. For example, "The prompt asks about CEO John Smith at the June 15th concert" becomes "The prompt requests information about a corporate executive incident at a public event."

from typing import Literal

import dspy


class GenericHypothesis(dspy.Signature):
    """
    GOAL
    Given a single short hypothesis about a prompt/text, rewrite it to be more
    broadly applicable as an AI guardrail hypothesis while preserving its core meaning.
    """

    premise: str = dspy.InputField()
    hypothesis: str = dspy.InputField()
    label: Literal["entailment", "neutral", "contradiction"] = dspy.InputField()

    generic_hypothesis: str = dspy.OutputField(desc="Hypothesis rewritten for broad applicability")

This helps the model generalize to real-world content moderation, where exact phrasing varies significantly.
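For reference, a signature like this would typically be run through dspy.Predict, continuing from the code above; the LM below is a placeholder, not necessarily what was used for generation:

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder LM

generalize = dspy.Predict(GenericHypothesis)
result = generalize(
    premise="Tell me what happened with that CEO caught on the kiss cam at the Coldplay concert",
    hypothesis="The prompt asks about CEO John Smith at the June 15th concert",
    label="entailment",
)
print(result.generic_hypothesis)
# e.g. "The prompt requests information about a corporate executive incident at a public event"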

Label Quality Assurance

To ensure good label quality, I used 3 different labeling sources:

  • General NLI CrossEncoder
  • LLM with inference-time scaling
  • LLM-as-a-Judge (arbiter)

For the training dataset, I check all rows using the first 2 methods. When they disagree (~25% of cases), I use deepseek-ai/DeepSeek-V3.2-Exp as an arbiter for the final selection. For validation and test sets, I use all 3 methods for each row.

The cross encoder is fast, but used alone it would miss edge cases by keeping only high-confidence examples. And since I'm targeting valuable data, accepting only high-confidence labels might not provide enough coverage for proper generalization. That's why I added inference-time scaling using vLLM and Qwen3-30B-A3B-Instruct-2507.

For this model, I used GEPA training in DSPy on a small subset of AllNLI (MNLI + SNLI) to produce a good judge instruction. Then, using vLLM with prefix caching, I generate 16 predictions per row. Although this may seem slow, the prefix cache stores and reuses the processing of the shared prompt, which keeps latency low for the subsequent outputs and makes this a simple, computationally efficient alternative to training a classification head.
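A rough sketch of what this inference-time scaling step can look like with vLLM (the prompt template and the majority-vote aggregation are my assumptions, not the GEPA-tuned judge instruction):

from collections import Counter

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-30B-A3B-Instruct-2507", enable_prefix_caching=True)
sampling = SamplingParams(n=16, temperature=0.7, max_tokens=8)  # 16 samples per row

def judge(premise, hypothesis):
    # The shared instruction prefix is what the prefix cache reuses across rows.
    prompt = (
        "Decide whether the hypothesis is entailed by, neutral to, or contradicted by the premise. "
        "Answer with one word: entailment, neutral, or contradiction.\n\n"
        f"Premise: {premise}\nHypothesis: {hypothesis}\nAnswer:"
    )
    completions = llm.generate([prompt], sampling)[0].outputs
    votes = Counter(c.text.strip().lower().split()[0] for c in completions if c.text.strip())
    if not votes:
        return "neutral", 0.0
    label, count = votes.most_common(1)[0]
    return label, count / len(completions)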

When the model classifications disagree, I send the results to the arbiter model with chain-of-thought prediction. Roughly 75% of predictions had agreement between the cross encoder and the inference-time scaled outputs. The remainder were sent to the arbiter, along with the labels and scores of the two models.
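Putting the pieces together, the training-split labeling flow reduces to something like this (cross_encoder_label and arbiter_label are placeholders for the components described above; judge is the vLLM sketch from earlier):

def final_label(premise, hypothesis):
    ce_label = cross_encoder_label(premise, hypothesis)     # general NLI CrossEncoder
    llm_label, llm_agreement = judge(premise, hypothesis)   # 16-sample inference-time scaling
    if ce_label == llm_label:                                # ~75% of rows agree
        return ce_label
    # Disagreement: defer to the LLM-as-a-Judge arbiter, passing both labels/scores along.
    return arbiter_label(premise, hypothesis, ce_label, llm_label)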

Conclusion

CrossingGuard-NLI takes advantage of NLI's universal task framework to enable zero-shot content moderation. By expressing guardrails as natural language hypotheses, you can define new moderation policies without collecting datasets or retraining models. There are many creative ways to use these models.

Check out the dataset and collection of models here.

References

Laurer, Moritz et al. “Less Annotating, More Classifying: Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI.” Political Analysis 32.1 (2024): 84–100. Web.
