---
base_model:
- allenai/OLMo-2-1124-7B-SFT
datasets:
- math
language:
- en
license: apache-2.0
metrics:
- accuracy
pipeline_tag: text-generation
library_name: transformers
---
# OLMo-2-7B-SFT-GRPO-MATH-1EPOCH
This model is a GRPO-fine-tuned version of allenai/OLMo-2-1124-7B-SFT trained on the MATH dataset.
This model is associated with the paper Learning to Reason without External Rewards, which introduces Intuitor, a reinforcement learning method that fine-tunes large language models (LLMs) using self-certainty—the model’s own internal confidence—as the sole reward. This approach is built on a novel paradigm called Reinforcement Learning from Internal Feedback (RLIF), enabling models to learn without external rewards, gold labels, or verifiers by optimizing intrinsic signals.
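For intuition only, here is a minimal sketch of the kind of intrinsic signal described above: a self-certainty score computed as the average KL divergence of the model's next-token distributions from the uniform distribution over the vocabulary. The exact reward used by Intuitor is defined in the paper; the function name `self_certainty_score` and the KL-from-uniform formulation below are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def self_certainty_score(logits: torch.Tensor) -> torch.Tensor:
    """Illustrative self-certainty: mean KL(U || p) over generated positions.

    logits: [seq_len, vocab_size] next-token logits for a generated response.
    Higher values mean the model's distributions are farther from uniform,
    i.e. the model is more confident in its own continuations.
    """
    log_probs = F.log_softmax(logits, dim=-1)  # log p(token | prefix)
    vocab_size = logits.size(-1)
    # KL(U || p) = (1/V) * sum_j [log(1/V) - log p_j] = -mean_j(log p_j) - log V
    kl_from_uniform = (-log_probs.mean(dim=-1)) - torch.log(
        torch.tensor(float(vocab_size))
    )
    return kl_from_uniform.mean()
```

In the RLIF setting described above, an intrinsic score of this kind stands in for an external reward or verifier signal during policy optimization.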
## Project Page & Code
- Project Page: https://sunblaze-ucb.github.io/Intuitor/
- GitHub Repository: https://github.com/sunblaze-ucb/Intuitor
## Usage

You can load and use this model with the `transformers` library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH"

# Loading in bfloat16 is recommended for OLMo-2 models if your hardware supports it
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Example usage: greedy decoding on a simple question
prompt = "Question: What is 2 + 2?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
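If a GPU is available, the model can also be placed on it at load time. The snippet below is a convenience sketch rather than part of the original instructions; the `device_map="auto"` option additionally requires the `accelerate` package.

```python
# Optional: dispatch the model weights to available GPU(s) (requires `accelerate`)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Move the tokenized prompt to the model's device before generating
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```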
## Citation
```bibtex
@article{zhao2025learning,
  title={Learning to Reason without External Rewards},
  author={Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal={arXiv preprint arXiv:2505.19590},
  year={2025}
}
```