Results
Table 1. Results on the eval set

| Verifier Model | Rubric Precision | Rubric Recall | Rubric F1 | Sample Precision | Sample Recall | Sample F1 | Avg. F1 |
|---|---|---|---|---|---|---|---|
| Qwen3-1.7B | 0.41 | 0.49 | 0.34 | 0.48 | 0.40 | 0.32 | 0.33 |
| Qwen2.5-3B | 0.42 | 0.47 | 0.43 | 0.49 | 0.46 | 0.43 | 0.43 |
| Qwen3-4B | 0.56 | 0.62 | 0.57 | 0.61 | 0.58 | 0.58 | 0.58 |
| Qwen3-8B | 0.54 | 0.66 | 0.55 | 0.62 | 0.61 | 0.57 | 0.56 |
| LLaMA-3.1-8B | 0.45 | 0.54 | 0.42 | 0.34 | 0.41 | 0.32 | 0.37 |
| Qwen3-30B-A3B | 0.56 | 0.66 | 0.56 | 0.63 | 0.62 | 0.62 | 0.58 |
| Qwen2.5-32B-Instruct | 0.60 | 0.67 | 0.60 | 0.67 | 0.68 | 0.64 | 0.62 |
| Search-Gen-V-1.7B (SFT) | 0.63 | 0.62 | 0.62 | 0.66 | 0.66 | 0.66 | 0.64 |
| Search-Gen-V-4B (SFT) | 0.70 | 0.66 | 0.68 | 0.72 | 0.72 | 0.71 | 0.70 |
| Search-Gen-V-4B (SFT+RL) | 0.71 | 0.68 | 0.70 | 0.74 | 0.74 | 0.73 | 0.72 |
| Qwen3-235B-A22B-Instruct-2507 | 0.72 | 0.73 | 0.73 | 0.76 | 0.76 | 0.76 | 0.74 |

Table 2. Accuracy comparison on verifying rubrics in long-form answers from DeepResearch Bench

| Verifier Model | Precision | Recall | F1 |
|---|---|---|---|
| Qwen3-4B | 0.42 | 0.56 | 0.42 |
| Search-Gen-V-4B | 0.59 | 0.57 | 0.57 |
| Qwen3-235B-A22B | 0.57 | 0.67 | 0.61 |

Table 3. Results on the short-form workload, HotpotQA

| Verifier Model | Precision | Recall | F1 |
|---|---|---|---|
| EM | 0.84 | 0.80 | 0.82 |
| Qwen3-4B | 0.83 | 0.70 | 0.71 |
| Search-Gen-V-4B | 0.86 | 0.76 | 0.77 |
| Qwen3-235B-A22B | 0.87 | 0.78 | 0.80 |
| EM + Qwen3-4B | 0.94 | 0.92 | 0.93 |
| EM + Search-Gen-V-4B | 0.95 | 0.93 | 0.94 |
| EM + Qwen3-235B-A22B | 0.96 | 0.94 | 0.95 |
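For reference, the numbers above are standard precision/recall/F1 scores over binary rubric verdicts, reported at two granularities. The sketch below is not the authors' evaluation code: the micro (rubric-level) vs. macro (sample-level) aggregation split and all function names are assumptions, shown only to make the two column groups in Table 1 concrete.

```python
# Minimal sketch, NOT the authors' evaluation code: one common way to score
# binary (predicted, gold) rubric verdicts at two granularities. The
# micro-vs-macro split and all names here are assumptions.

def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 from confusion counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def confusion(verdicts: list[tuple[bool, bool]]) -> tuple[int, int, int]:
    """Count (TP, FP, FN) over (predicted, gold) rubric verdicts."""
    tp = sum(1 for pred, gold in verdicts if pred and gold)
    fp = sum(1 for pred, gold in verdicts if pred and not gold)
    fn = sum(1 for pred, gold in verdicts if gold and not pred)
    return tp, fp, fn

def evaluate(samples: list[list[tuple[bool, bool]]]):
    """samples[i] holds one (predicted, gold) pair per rubric of sample i."""
    # Rubric level: pool every rubric across all samples (micro average).
    rubric = prf1(*confusion([v for s in samples for v in s]))
    # Sample level: score each sample separately, then average (macro).
    per_sample = [prf1(*confusion(s)) for s in samples]
    sample = tuple(sum(m[i] for m in per_sample) / len(per_sample)
                   for i in range(3))
    return rubric, sample

# Example: two answers, three rubrics each.
rubric_scores, sample_scores = evaluate([
    [(True, True), (True, False), (False, True)],
    [(True, True), (False, False), (True, True)],
])
print(rubric_scores, sample_scores)
```

As a sanity check on the F1 identity, the EM row of Table 3 gives 2 · 0.84 · 0.80 / (0.84 + 0.80) ≈ 0.82, matching the reported value; averaged scores, such as the Table 1 columns, need not satisfy the identity row by row.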
Related links
- paper: https://arxiv.org/abs/2510.14660
- code:
- model:
- datasets:
Citation
@article{ma2025searchgenv,
title={An Efficient Rubric-Based Generative Verifier for Search-Augmented LLMs},
author={Ma, Linyue and Xu, Yilong and Long, Xiang and Zheng, Zhi},
journal={arXiv preprint arXiv:2510.14660},
year={2025},
url={https://arxiv.org/abs/2510.14660}
}