arXiv:2511.08567

The Path Not Taken: RLVR Provably Learns Off the Principals

Published on Nov 11 · Submitted by Hanqing Zhu on Nov 12
Abstract

AI-generated summary: Reinforcement Learning with Verifiable Rewards (RLVR) improves large language models while appearing to modify only a small fraction of parameters; the paper attributes this to a KL-constrained update that is steered off principal directions into low-curvature subspaces, with finite numerical precision hiding the remaining micro-updates, a regime distinct from supervised fine-tuning.

Reinforcement Learning with Verifiable Rewards (RLVR) reliably improves the reasoning performance of large language models, yet it appears to modify only a small fraction of parameters. We revisit this paradox and show that sparsity is a surface artifact of a model-conditioned optimization bias: for a fixed pretrained model, updates consistently localize to preferred parameter regions, highly consistent across runs and largely invariant to datasets and RL recipes. We mechanistically explain these dynamics with a Three-Gate Theory: Gate I (KL Anchor) imposes a KL-constrained update; Gate II (Model Geometry) steers the step off principal directions into low-curvature, spectrum-preserving subspaces; and Gate III (Precision) hides micro-updates in non-preferred regions, making the off-principal bias appear as sparsity. We then validate this theory and, for the first time, provide a parameter-level characterization of RLVR's learning dynamics: RLVR learns off principal directions in weight space, achieving gains via minimal spectral drift, reduced principal-subspace rotation, and off-principal update alignment. In contrast, SFT targets principal weights, distorts the spectrum, and even lags RLVR. Together, these results provide the first parameter-space account of RLVR's training dynamics, revealing clear regularities in how parameters evolve. Crucially, we show that RL operates in a distinct optimization regime from SFT, so directly adapting SFT-era parameter-efficient fine-tuning (PEFT) methods can be flawed, as evidenced by our case studies on advanced sparse fine-tuning and LoRA variants. We hope this work charts a path toward a white-box understanding of RLVR and the design of geometry-aware, RLVR-native learning algorithms, rather than repurposed SFT-era heuristics.
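To make the parameter-space claims concrete, below is a minimal sketch (not the paper's code) of how one might measure spectral drift, principal-subspace rotation, and the fraction of an update that lands inside the pretrained weights' top-k principal subspace. It assumes access to a layer's pretrained and fine-tuned weight matrices as PyTorch tensors; the function name `principal_diagnostics` and the specific metrics are illustrative stand-ins for the paper's diagnostics.

```python
# Illustrative sketch (not the paper's code): quantify how much an update
# W1 - W0 lives inside vs. outside the top-k principal (singular) subspace
# of the pretrained weight W0, and how much that subspace rotates.
import torch

def principal_diagnostics(W0: torch.Tensor, W1: torch.Tensor, k: int = 32):
    """Compare a pretrained matrix W0 with a fine-tuned matrix W1.

    Returns:
      spectral_drift: relative change of the top-k singular values
      subspace_overlap: mean squared cosine of the principal angles between
                        the top-k left singular subspaces (1.0 = no rotation)
      in_principal_frac: fraction of the update's energy inside the top-k
                         principal row/column subspaces of W0
    """
    U0, S0, V0h = torch.linalg.svd(W0, full_matrices=False)
    U1, S1, V1h = torch.linalg.svd(W1, full_matrices=False)

    # Spectrum preservation: how much do the leading singular values move?
    spectral_drift = (S1[:k] - S0[:k]).norm() / S0[:k].norm()

    # Principal-subspace rotation: overlap of the top-k left singular subspaces.
    M = U0[:, :k].T @ U1[:, :k]            # k x k matrix of inner products
    subspace_overlap = (M ** 2).sum() / k  # mean cos^2 of principal angles

    # Off-principal alignment: project the update onto W0's top-k subspaces.
    dW = W1 - W0
    P_left = U0[:, :k] @ U0[:, :k].T       # projector onto top-k column space
    P_right = V0h[:k].T @ V0h[:k]          # projector onto top-k row space
    dW_principal = P_left @ dW @ P_right
    in_principal_frac = (dW_principal.norm() / dW.norm()) ** 2

    return spectral_drift.item(), subspace_overlap.item(), in_principal_frac.item()

# Toy usage with random matrices standing in for a layer's weights.
if __name__ == "__main__":
    torch.manual_seed(0)
    W0 = torch.randn(1024, 1024)
    W1 = W0 + 1e-3 * torch.randn_like(W0)  # stand-in for a small RLVR-sized update
    print(principal_diagnostics(W0, W1, k=32))
```

Along the same lines, Gate III's claim that finite precision masks micro-updates can be illustrated with a toy bfloat16 experiment (again an assumption-laden sketch, not the paper's procedure): counting "changed" parameters after bf16 rounding can drastically understate the true update, which then looks sparse.

```python
# Illustrative sketch: tiny updates can vanish when weights are stored in
# bfloat16, so counting "changed" parameters after rounding looks sparse.
import torch

torch.manual_seed(0)
w = torch.randn(1_000_000)              # stand-in for a weight tensor
delta = 1e-4 * torch.randn_like(w)      # micro-update, far below bf16 resolution

updated_fp32 = w + delta
updated_bf16 = (w.bfloat16() + delta.bfloat16()).float()

changed_fp32 = (updated_fp32 != w).float().mean().item()
changed_bf16 = (updated_bf16 != w.bfloat16().float()).float().mean().item()
print(f"fraction changed in fp32: {changed_fp32:.3f}")  # close to 1.0
print(f"fraction changed in bf16: {changed_bf16:.3f}")  # noticeably smaller
```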

Community

Paper submitter

The paper provides the first parameter-space account of RLVR's training dynamics, revealing clear regularities in how parameters evolve. Crucially, it shows that RL operates in a distinct optimization regime from SFT, so directly adapting SFT-era parameter-efficient fine-tuning (PEFT) methods can be flawed, as evidenced by their case studies on advanced sparse fine-tuning and LoRA variants.


