---
base_model:
- allenai/OLMo-2-1124-7B-SFT
datasets:
- math
language:
- en
license: apache-2.0
metrics:
- accuracy
pipeline_tag: text-generation
library_name: transformers
---
# OLMo-2-7B-SFT-GRPO-MATH-1EPOCH
This model is a GRPO-fine-tuned version of allenai/OLMo-2-1124-7B-SFT trained on the MATH dataset.
This model is associated with the paper Learning to Reason without External Rewards, which introduces Intuitor, a reinforcement learning method that fine-tunes large language models (LLMs) using self-certainty—the model’s own internal confidence—as the sole reward. This approach is built on a novel paradigm called Reinforcement Learning from Internal Feedback (RLIF), enabling models to learn without external rewards, gold labels, or verifiers by optimizing intrinsic signals.
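For intuition only, here is a minimal sketch of the kind of intrinsic signal described above: a self-certainty score computed as the average KL divergence of the model's next-token distributions from the uniform distribution over the vocabulary. The exact reward used by Intuitor is defined in the paper; the function name `self_certainty_score` and the KL-from-uniform formulation below are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def self_certainty_score(logits: torch.Tensor) -> torch.Tensor:
    """Illustrative self-certainty: mean KL(U || p) over generated positions.

    logits: [seq_len, vocab_size] next-token logits for a generated response.
    Higher values mean the model's distributions are farther from uniform,
    i.e. the model is more confident in its own continuations.
    """
    log_probs = F.log_softmax(logits, dim=-1)  # log p(token | prefix)
    vocab_size = logits.size(-1)
    # KL(U || p) = (1/V) * sum_j [log(1/V) - log p_j] = -mean_j(log p_j) - log V
    kl_from_uniform = (-log_probs.mean(dim=-1)) - torch.log(
        torch.tensor(float(vocab_size))
    )
    return kl_from_uniform.mean()
```

In the RLIF setting described above, an intrinsic score of this kind stands in for an external reward or verifier signal during policy optimization.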
## Project Page & Code
- Project Page: https://sunblaze-ucb.github.io/Intuitor/
- GitHub Repository: https://github.com/sunblaze-ucb/Intuitor
## Usage

You can load and use this model with the `transformers` library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH"

# Loading in bfloat16 is recommended for OLMo-2 models if your hardware supports it
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Example usage: greedy decoding on a simple question
prompt = "Question: What is 2 + 2?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
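If a GPU is available, the model can also be placed on it at load time. The snippet below is a convenience sketch rather than part of the original instructions; the `device_map="auto"` option additionally requires the `accelerate` package.

```python
# Optional: dispatch the model weights to available GPU(s) (requires `accelerate`)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Move the tokenized prompt to the model's device before generating
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```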
## Citation
```bibtex
@article{zhao2025learning,
  title={Learning to Reason without External Rewards},
  author={Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal={arXiv preprint arXiv:2505.19590},
  year={2025}
}
```