DRIVE: Data Curation Best Practices for Reinforcement Learning wIth VErifiable Reward in Competitive Code Generation

Hunyuan Team, Tencent

📖 Paper • 📙 SFT Model • 📘 RL Model • 📜 Citation


Abstract

Recent reasoning-first models have spurred a resurgence of interest in RLVR (Reinforcement Learning with Verifiable Reward). However, advances are dominated by mathematics, with competitive-programming code generation being relatively underexplored. This work investigates how to construct RLVR datasets and presents practical training techniques that yield strong performance.

Our pipeline begins with Supervised Fine-Tuning (SFT) on data distilled from strong open-source models. This is followed by a two-stage RL process using executable, testcase-driven rewards:

  1. Stage 1 (Entropy Expansion): Training on a large, uniformly distributed set of problems with moderate rollouts (8) and a shorter context (24k) to expand entropy and mitigate repetition.
  2. Stage 2 (Hard-Focus Curriculum): Continuing training on a small, high-quality set of challenging problems using Pre-GRPO with a large rollout budget (64) under a hard-focus curriculum.

We implement our method on Qwen2.5-32B and achieve state-of-the-art performance among models of similar scale, comparable to leading systems like DeepSeek v3.1.

🚀 The DRIVE Pipeline

Our training pipeline consists of two main phases: Supervised Fine-Tuning (SFT) and a Two-Stage Reinforcement Learning process, as illustrated below.


Figure 2: The training pipeline of our models.

Phase 1: Supervised Fine-Tuning (SFT)

We begin by fine-tuning Qwen2.5-32B. The key innovation in this stage is Difficulty-Aware Sampling:

  • We first classify all competitive programming prompts into three categories: easy, medium, and hard.
  • To force the model to focus on more challenging problems, we include each hard sample twice in the final SFT dataset (see the sketch below).
  • We also augment this with general-purpose coding and reasoning-intensive data to improve overall capabilities.
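
The sketch below illustrates this sampling scheme. It is not the authors' released code; the function name and toy prompts are placeholders. Hard problems are appended twice before shuffling, so they are seen roughly twice as often during SFT.

```python
import random

def build_sft_mix(easy, medium, hard, general_purpose, seed=0):
    """Difficulty-aware sampling sketch: hard problems are included twice so
    the SFT mix skews toward challenging cases, then everything is shuffled."""
    mix = list(easy) + list(medium) + 2 * list(hard) + list(general_purpose)
    random.Random(seed).shuffle(mix)
    return mix

# Toy usage with placeholder prompts.
sft_dataset = build_sft_mix(
    easy=["two-sum"],
    medium=["merge-intervals"],
    hard=["min-cost-max-flow"],
    general_purpose=["explain quicksort step by step"],
)
print(sft_dataset)
```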

Phase 2: Two-Stage Reinforcement Learning

After SFT, the model still suffers from low entropy, repetitive generation, and poor performance on hard problems. Our two-stage RL process directly addresses these issues.
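
The reward signal itself is verifiable: a candidate program is judged by executing it against the problem's test cases. The paper does not describe a public judging harness, so the snippet below is only a minimal sketch of such a testcase-driven reward: binary pass/fail, Python-only, run via a `python3` interpreter assumed to be on the PATH, and with no sandboxing or memory limits.

```python
import subprocess

def testcase_reward(solution_code: str, test_cases, time_limit: float = 2.0) -> float:
    """Binary testcase-driven reward sketch: run the candidate program on each
    (stdin, expected_stdout) pair; return 1.0 only if every case passes."""
    for stdin_text, expected in test_cases:
        try:
            result = subprocess.run(
                ["python3", "-c", solution_code],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=time_limit,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # time-limit exceeded counts as a failed rollout
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return 0.0
    return 1.0

# Example: a correct A+B solution earns reward 1.0 on both cases.
code = "a, b = map(int, input().split()); print(a + b)"
print(testcase_reward(code, [("1 2", "3"), ("10 -4", "6")]))
```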

Stage 1: Entropy Expansion

  • Goal: Increase output diversity and reduce repetitive patterns.
  • Data: A large, uniformly distributed set of ~9k problems.
  • Method: We use 8 rollouts per prompt and a shorter 24k context length. As shown in Figure 3, this "24k-style" training (blue line) steadily increases entropy, while standard 32k-style training (orange line) leads to entropy collapse.


Figure 3: The entropy comparison of 24k-style training and 32k-style training.
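
Entropy here is the policy's average token-level entropy over its own rollouts. As a hedged sketch of how such a curve can be tracked (assuming access to the model's logits and a mask over generated tokens; this is not the authors' logging code):

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> float:
    """Average per-token entropy (in nats) over generated tokens.
    logits: [batch, seq, vocab]; mask: 1.0 for generated (non-prompt) positions."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # [batch, seq]
    return (entropy * mask).sum().item() / mask.sum().clamp(min=1).item()

# Toy check: uniform logits over a 32-token vocab give the maximum entropy ln(32) ≈ 3.47.
logits = torch.zeros(2, 5, 32)
mask = torch.ones(2, 5)
print(mean_token_entropy(logits, mask))
```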

Stage 2: Hard-Focus Curriculum

  • Goal: Master the most challenging problems.
  • Data: A small, high-quality set of difficult problems (e.g., the 72, 50, and 32 hardest cases from LiveCode V6).
  • Method: We apply a "hard-focus curriculum" that progressively retains only the most difficult instances (see the sketch below). Crucially, we use a large rollout budget (64-80 rollouts) in this stage, which we found essential for stable gains on hard problems.
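
A minimal sketch of one curriculum step is given below. It assumes a hypothetical `rollout_fn(problem)` that samples a solution from the current policy and reports whether it passes all tests; the default sizes echo the 64-rollout budget and shrinking problem counts above but are otherwise illustrative.

```python
import random

def hard_focus_filter(problems, rollout_fn, n_rollouts=64, keep=50):
    """One curriculum step: estimate each problem's pass rate with a large
    rollout budget, then retain only the `keep` problems the current policy
    solves least often (the hardest ones)."""
    scored = []
    for problem in problems:
        passes = sum(rollout_fn(problem) for _ in range(n_rollouts))
        scored.append((passes / n_rollouts, problem))
    scored.sort(key=lambda item: item[0])  # lowest pass rate first
    return [problem for _, problem in scored[:keep]]

# Toy usage with a stand-in policy that solves shorter problem names more often.
toy_rollout = lambda p: random.random() < 1.0 / len(p)
hardest = hard_focus_filter(["ab", "abcd", "abcdefgh"], toy_rollout, n_rollouts=16, keep=2)
print(hardest)
```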

📊 Key Results

Our final 32B model, DRIVE-RL, achieves state-of-the-art performance among similarly sized models and is competitive with larger 64k-context models.

Figure 1: Performance of our models on various benchmarks.

Pass@1 Performance Comparison

The two-stage RL pipeline provides significant improvements over the SFT baseline, particularly on challenging benchmarks. We see a +58.3% relative improvement on Codeforces OJ.

| Model | LiveCode 08-11 | LiveCode V5 | LiveCode V6 | LeetCode Weekly (32) | Codeforces OJ (33) |
| --- | --- | --- | --- | --- | --- |
| DeepseekV3.1 (64k) | 0.692 | 0.713 | 0.693 | 0.688 | 0.161 |
| Seed1.6-0715 (64k) | 0.803 | 0.824 | 0.770 | 0.743 | 0.188 |
| Qwen3-235B-2507 (64k) | 0.681 | 0.713 | 0.646 | 0.688 | 0.200 |
| SFT model (32k) | 0.602 | 0.594 | 0.549 | 0.578 | 0.115 |
| RL Stage 1 model (24k) | 0.625 | 0.627 | 0.634 | 0.603 | 0.112 |
| DRIVE-RL model (32k) | 0.699 | 0.697 | 0.703 | 0.653 | 0.182 |
| Rel. Improvement (RL vs SFT) | +16.1% | +17.3% | +28.1% | +13.0% | +58.3% |

(Data sourced from Table 2 in our paper)
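
All numbers above are Pass@1. For reference, pass@k can be estimated from n sampled solutions with c passing using the standard unbiased estimator below (a generic formula, not code from this paper); with k = 1 it reduces to c / n.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled solutions of which c pass:
    pass@k = 1 - C(n - c, k) / C(n, k); with k = 1 this reduces to c / n."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 64 rollouts on one problem, 8 of them pass all tests.
print(pass_at_k(64, 8, 1))    # 0.125, i.e. c / n
print(pass_at_k(64, 8, 10))   # estimated chance that at least 1 of 10 samples passes
```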

Key Findings

  1. Difficulty-aware training is crucial: Standard RL struggles with hard problems. Our hard-focus curriculum (Stage 2) is essential for pushing the model's capabilities.
  2. Entropy expansion is necessary: Skipping Stage 1 (Entropy Expansion) and training only on hard cases hurts generalization to out-of-distribution benchmarks. Both stages are necessary.
  3. Large rollouts for hard problems: A large rollout budget (e.g., 64+) is essential for mastering challenging cases.
  4. Scaling: The DRIVE strategy shows strong, positive scaling trends when applied to a large-scale internal MoE model.

📜 Citation

If you find this work useful, please cite our paper:

@misc{zhu2025drivedatacurationbest,
      title={DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation}, 
      author={Speed Zhu and Jianwei Cai and Guang Chen and Lulu Wu and Saiyong Yang and Wiggin Zhou},
      year={2025},
      eprint={2511.06307},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.06307}, 
}

License

This repository contains two separate licenses for different models. Please refer to the respective license file for the model you are using.
