Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B Paper • 2511.06221 • Published 6 days ago • 88
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention Paper • 2510.04212 • Published Oct 5 • 23
The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix Article • 12 days ago • 37