JP-TL-Bench: Anchored Pairwise LLM Evaluation for Bidirectional Japanese-English Translation Paper • 2601.00223 • Published 7 days ago • 1
The Open Evaluation Standard: Benchmarking NVIDIA Nemotron 3 Nano with NeMo Evaluator Article • Published 22 days ago • 43
If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs Paper • 2412.04144 • Published Dec 5, 2024 • 6
Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model Paper • 2510.18855 • Published Oct 21, 2025 • 71
Ovis2.5 Collection Our next-generation MLLMs for native-resolution vision and advanced reasoning • 5 items • Updated Aug 19, 2025 • 57
SYNTHETIC-1 Collection A collection of tasks & verifiers for reasoning datasets • 9 items • Updated Oct 7, 2025 • 67
Quartet: Native FP4 Training Can Be Optimal for Large Language Models Paper • 2505.14669 • Published May 20, 2025 • 78
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning Paper • 2504.07128 • Published Apr 2, 2025 • 87
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens Paper • 2504.07096 • Published Apr 9, 2025 • 76
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks Paper • 2502.08235 • Published Feb 12, 2025 • 58
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU Paper • 2502.08910 • Published Feb 13, 2025 • 148
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models Paper • 2410.07985 • Published Oct 10, 2024 • 32