OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
Abstract
The Outcome-based Process Verifier (OPV) improves the verification of complex reasoning chains in large language models by combining outcome-based and process-based verification with iterative active learning and Rejection Fine-Tuning, achieving state-of-the-art performance on various benchmarks.
Large language models (LLMs) have made significant progress in solving complex reasoning tasks through Reinforcement Learning with Verifiable Rewards (RLVR). This advancement also depends on the automated oversight provided by reliable verifiers. However, current outcome-based verifiers (OVs) cannot inspect the unreliable intermediate steps in long reasoning chains of thought (CoTs), while current process-based verifiers (PVs) struggle to reliably detect errors in complex long CoTs, limited by the scarcity of high-quality annotations caused by the prohibitive cost of human annotation. We therefore propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of outcomes summarized from long CoTs, achieving both accurate and efficient verification and enabling large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations that progressively improves OPV's verification capability at lower annotation cost. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessments. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
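Below is a minimal sketch of the iterative active-learning loop described in the abstract, assuming uncertainty is estimated as disagreement among sampled verifier verdicts. The function names (`select_uncertain_cases`, `annotate`, `train_opv`) and the toy stand-ins in the demo are hypothetical placeholders for illustration only; the paper's actual RFT/RLVR training and annotation pipeline is not specified at this level of detail.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple

def disagreement(verdicts: List[str]) -> float:
    """Uncertainty as disagreement among sampled verifier verdicts (0 = unanimous)."""
    top = Counter(verdicts).most_common(1)[0][1]
    return 1.0 - top / len(verdicts)

def select_uncertain_cases(
    judge: Callable[[str], str],   # current best OPV: rationale -> "correct" / "incorrect"
    pool: List[str],               # unlabeled summarized rationales from long CoTs
    budget: int,                   # expert-annotation budget for this round
    n_samples: int = 8,            # sampled verdicts per case
) -> List[str]:
    """Rank the pool by verifier disagreement and return the most uncertain cases."""
    scored = [(disagreement([judge(c) for _ in range(n_samples)]), c) for c in pool]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in scored[:budget]]

def active_learning_loop(
    judge: Callable[[str], str],
    annotate: Callable[[str], str],                                      # expert annotation (costly)
    train_opv: Callable[[List[Tuple[str, str]]], Callable[[str], str]],  # stub for RFT + RLVR training
    pool: List[str],
    rounds: int,
    budget: int,
) -> Callable[[str], str]:
    """Iteratively annotate the current OPV's most uncertain cases and retrain for the next round."""
    labeled: List[Tuple[str, str]] = []
    for _ in range(rounds):
        selected = select_uncertain_cases(judge, pool, budget)
        labeled += [(case, annotate(case)) for case in selected]
        pool = [c for c in pool if c not in selected]
        judge = train_opv(labeled)  # next-round OPV, trained on the expert-labeled cases
    return judge

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; the real components are LLM-based.
    toy_pool = [f"rationale-{i}" for i in range(50)]
    noisy_judge = lambda case: random.choice(["correct", "incorrect"])
    oracle = lambda case: "correct" if hash(case) % 2 == 0 else "incorrect"
    retrain = lambda labeled: (lambda case: oracle(case))  # pretend training yields a better verifier
    final_opv = active_learning_loop(noisy_judge, oracle, retrain, toy_pool, rounds=3, budget=5)
    print(final_opv("rationale-0"))
```

At inference time, the trained verifier can also be used to rerank or filter policy-model samples (e.g., best-of-N selection), which is one plausible way to realize the reported accuracy gains as the compute budget scales.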
Community
The following similar papers were recommended by the Semantic Scholar API:
- Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads (2025)
- CoSineVerifier: Tool-Augmented Answer Verification for Computation-Oriented Scientific Questions (2025)
- Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math (2025)
- DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning (2025)
- Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning (2025)
- CoT-Evo: Evolutionary Distillation of Chain-of-Thought for Scientific Reasoning (2025)
- Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection (2025)