Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems
Abstract
The CJE framework improves the accuracy and efficiency of LLM-as-judge evaluation by addressing reward calibration, importance-weight stabilization, and oracle uncertainty in confidence intervals.
LLM-as-judge evaluation has become the de facto standard for scaling model assessment, but the practice is statistically unsound: uncalibrated scores can invert preferences, naive confidence intervals on uncalibrated scores achieve near-0% coverage, and importance-weighted estimators collapse under limited overlap despite high effective sample size (ESS). We introduce Causal Judge Evaluation (CJE), a framework that fixes all three failures. On n=4,961 Chatbot Arena prompts (after filtering from 5k), CJE achieves 99% pairwise ranking accuracy at full sample size (94% averaged across configurations), matching oracle quality, at 14x lower cost (for ranking 5 policies) by calibrating a 16x cheaper judge on just a 5% oracle slice (~250 labels). CJE combines three components: (i) AutoCal-R, reward calibration via mean-preserving isotonic regression; (ii) SIMCal-W, weight stabilization via stacking of S-monotone candidates; and (iii) Oracle-Uncertainty Aware (OUA) inference, which propagates calibration uncertainty into confidence intervals. We formalize the Coverage-Limited Efficiency (CLE) diagnostic, which explains why IPS-style estimators fail even when ESS exceeds 90%: the logger rarely visits the regions where target policies concentrate. Key findings: SNIPS inverts rankings even with reward calibration (38% pairwise, negative Kendall's tau) due to weight instability; calibrated IPS remains near-random (47%) despite weight stabilization, consistent with CLE; and OUA improves coverage from near-0% to ~86% (Direct) and ~96% (stacked-DR), where naive intervals severely under-cover.
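To see the calibration idea in component (i) concretely, here is a minimal sketch, not the authors' AutoCal-R implementation: it assumes hypothetical arrays `judge_scores` (cheap-judge scores for every prompt), `oracle_idx` and `oracle_labels` (the ~5% labeled slice), uses scikit-learn's plain isotonic regression, and substitutes a naive mean-matching shift for the paper's mean-preserving construction.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression


def calibrate_judge(judge_scores, oracle_idx, oracle_labels):
    """Map cheap-judge scores onto the oracle scale (illustrative sketch only).

    Fits isotonic regression on the small oracle-labeled slice, then applies
    the monotone map to every judge score. The naive mean-matching shift below
    only gestures at AutoCal-R's mean-preservation property; the paper's actual
    construction differs.
    """
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(judge_scores[oracle_idx], oracle_labels)   # monotone map: judge score -> oracle label
    rewards = iso.predict(judge_scores)                # calibrated rewards for all prompts

    # Naive mean-matching on the labeled slice (assumption, for illustration only).
    return rewards + (oracle_labels.mean() - rewards[oracle_idx].mean())


# Hypothetical usage: ~5k prompts with a 5% oracle slice (~250 labels).
rng = np.random.default_rng(0)
judge_scores = rng.uniform(0.0, 1.0, size=5000)
oracle_idx = rng.choice(5000, size=250, replace=False)
oracle_labels = np.clip(judge_scores[oracle_idx] ** 2 + rng.normal(0.0, 0.05, size=250), 0.0, 1.0)
calibrated_rewards = calibrate_judge(judge_scores, oracle_idx, oracle_labels)
```

One reason isotonic regression is a natural choice here: the fitted map is monotone by construction, so it can re-scale the cheap judge onto the oracle scale without re-ordering its scores.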
Community
LLM-as-judge evals are convenient, but real (and fixable) failure modes lurk beneath the surface.
CJE treats LLM-judge evaluation as a statistics problem:
• calibrate a cheap judge to a small oracle slice of high-quality labels
• quantify uncertainty
• flag when the method is breaking
On Chatbot Arena prompts, we match oracle-quality pairwise policy ranking (99%) while cutting oracle labeling cost by ~14×.
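To make the last bullet above ("flag when the method is breaking") concrete, here is one common weight diagnostic, the effective sample size (ESS) of self-normalized importance weights, as a minimal sketch with hypothetical inputs (per-sample log-probabilities under the target and logging policies). As the paper's CLE diagnostic stresses, a high ESS does not by itself guarantee overlap, so treat this as a necessary but not sufficient check.

```python
import numpy as np


def ess_fraction(logp_target, logp_logger):
    """Effective sample size fraction of self-normalized importance weights.

    Hypothetical helper: inputs are per-sample log-probabilities of the logged
    responses under the target policy and the logging policy. A value near 1
    means flat weights; a low value flags weight collapse. Per the CLE argument,
    even a high ESS can coexist with poor overlap, so this is a first check,
    not a guarantee.
    """
    w = np.exp(logp_target - logp_logger)      # importance weights
    w = w / w.sum()                            # self-normalize
    return 1.0 / (len(w) * np.sum(w ** 2))     # ESS / n, in (0, 1]


# Hypothetical usage with simulated log-probabilities.
rng = np.random.default_rng(0)
logp_logger = rng.normal(-40.0, 5.0, size=1000)
logp_target = logp_logger + rng.normal(0.0, 0.5, size=1000)
frac = ess_fraction(logp_target, logp_logger)
if frac < 0.2:  # illustrative threshold, not from the paper
    print(f"warning: ESS fraction {frac:.2f} -- importance weighting looks unreliable")
```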
If you run an eval pipeline: what are the most important failure modes you’ve seen?
I’d love to hear where this breaks first.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- LRT-Diffusion: Calibrated Risk-Aware Guidance for Diffusion Policies (2025)
- What Does It Take to Build a Performant Selective Classifier? (2025)
- Approximating Human Preferences Using a Multi-Judge Learned System (2025)
- BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers? (2025)
- Ask a Strong LLM Judge when Your Reward Model is Uncertain (2025)
- Distribution-Calibrated Inference time compute for Thinking LLM-as-a-Judge (2025)
- LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval (2025)