This model is an Intuitor-fine-tuned version of Qwen3-14B trained on the MATH dataset, as presented in the paper Learning to Reason without External Rewards.
Intuitor is a reinforcement learning method that fine-tunes large language models (LLMs) using self-certainty—the model’s own internal confidence—as the sole reward. It is built on a novel paradigm called Reinforcement Learning from Internal Feedback (RLIF), which enables LLMs to learn from intrinsic signals without external rewards or labeled data. RLIF offers a scalable and domain-agnostic fine-tuning approach for LLMs in settings where external supervision is expensive or unavailable.
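As a rough illustration, the self-certainty signal can be understood as the average per-token KL divergence from a uniform distribution over the vocabulary to the model's predicted next-token distribution: the more peaked the model's predictions, the higher the score. The sketch below computes that quantity from a tensor of logits; it is an illustrative approximation, not code from the Intuitor training pipeline, and the function name and tensor shapes are assumptions.

import math

import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Illustrative self-certainty score for one generated sequence.

    logits: tensor of shape (seq_len, vocab_size) holding the model's
    next-token logits at each generated position.
    Returns the mean per-token KL(Uniform || p), which grows as the
    model's predictions become more confident (peaked).
    """
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    # KL(U || p) at one position = -log|V| - mean_j log p_j
    per_token_kl = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return per_token_kl.mean()

In the paper, an intrinsic score of this kind replaces the external reward in the policy-optimization loop, so no verifier or labeled data is needed during fine-tuning.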
You can use this model with the Hugging Face transformers library.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "sunblaze-ucb/Qwen3-14B-Intuitor-MATH-1EPOCH"

# Load the tokenizer and model (bfloat16, automatically placed on available devices)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Build a chat-style prompt (example problem; substitute your own)
messages = [
    {"role": "user", "content": "Solve the following problem: 2x + 3 = 7. What is x?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate a response; passing **model_inputs also forwards the attention mask
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens
response = tokenizer.batch_decode(generated_ids[:, model_inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(response)
@article{zhao2025learning,
  title   = {Learning to Reason without External Rewards},
  author  = {Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal = {arXiv preprint arXiv:2505.19590},
  year    = {2025}
}