Xuandong Zhao

Xuandong

https://xuandongzhao.github.io/

AI & ML interests

None yet

Recent Activity

replied to Kseniase's post about 1 month ago

8 Emerging trends in Reinforcement Learning

Reinforcement learning is having a moment - and not just this week. Some of its directions are already showing huge promise, while others are still early but exciting. Here’s a look at what’s happening right now in RL:

1. Reinforcement Pre-Training (RPT) → https://huggingface.co/papers/2506.08007
Reframes next-token pretraining as RL with verifiable rewards, yielding scalable reasoning gains

2. Reinforcement Learning from Human Feedback (RLHF) → https://huggingface.co/papers/1706.03741
The top approach. It trains a model using human preference feedback, building a reward model and then optimizing the policy to generate outputs people prefer

3. Reinforcement Learning with Verifiable Rewards (RLVR) → https://huggingface.co/papers/2506.14245
Moves from subjective (human-labeled) rewards to objective ones that can be automatically verified, like in math, code, or rubrics as reward, for example → https://huggingface.co/papers/2508.12790, https://huggingface.co/papers/2507.17746

4. Multi-objective RL → https://huggingface.co/papers/2508.07768
Trains LMs to balance multiple goals at once, like being helpful but also concise or creative, ensuring that improving one goal doesn’t ruin another

5. Parallel thinking RL → https://huggingface.co/papers/2509.07980
Trains parallel chains of thought, boosting math accuracy and final ceilings. It first teaches the model “parallel thinking” skill on easier problems, then uses RL to refine it on harder ones

Read further below ⬇️

And if you like this, subscribe to the Turing post: https://www.turingpost.com/subscribe

Also, check out our recent guide about the past, present and future of RL: https://www.turingpost.com/p/rlguide

new activity 3 months ago

sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH: Improve model card: Add transformers library, expand description, links, and usage

new activity 3 months ago

sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH: Improve model card: Add library, links, and usage example

New activity in sunblaze-ucb/Qwen2.5-1.5B-Intuitor-MATH-1EPOCH 3 months ago

Improve model card: Add transformers library, expand description, links, and usage

#1 opened 3 months ago by

New activity in sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH 3 months ago

Improve model card: Add library, links, and usage example

#1 opened 3 months ago by

New activity in sunblaze-ucb/OLMo-2-7B-SFT-Intuitor-MATH-1EPOCH 3 months ago

Improve model card: Add library, update pipeline tag, link to code

#1 opened 3 months ago by

New activity in sunblaze-ucb/Qwen3-14B-Intuitor-MATH-1EPOCH 3 months ago

Improve model card: Add library_name, paper/code links, and usage example

#1 opened 3 months ago by

New activity in sunblaze-ucb/Qwen2.5-1.5B-GRPO-MATH-1EPOCH 3 months ago

Improve model card: Add library, GitHub link, paper details, and usage example

#1 opened 3 months ago by

New activity in sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH 3 months ago

Improve model card: Add library, links, and usage example

#1 opened 3 months ago by

New activity in sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH 3 months ago

Improve model card: Add `library_name`, expanded description, GitHub link, and usage

#1 opened 3 months ago by

New activity in sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH 3 months ago

Improve model card: Add library, usage, tags, and links

#1 opened 3 months ago by

commented 2 papers 4 months ago

Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models

Paper • 2507.07484 • Published Jul 10 • 17

The Landscape of Memorization in LLMs: Mechanisms, Measurement, and Mitigation

Paper • 2507.05578 • Published Jul 8 • 5

New activity in sunblaze-ucb/AgentSynth 5 months ago

Improve dataset card: update task category, add descriptive tags, abstract, and code link

#2 opened 5 months ago by

New activity in LLM360/guru-RL-92k 5 months ago

Test set is empty?

#10 opened 5 months ago by

commented a paper 5 months ago

AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents

Paper • 2506.14205 • Published Jun 17 • 7

New activity in sunblaze-ucb/AgentSynth 5 months ago

Add reinforcement-learning task category, link to paper, and project page

#1 opened 5 months ago by

commented a paper 6 months ago

Learning to Reason without External Rewards

Paper • 2505.19590 • Published May 26 • 29

commented a paper 7 months ago

Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs

Paper • 2504.04715 • Published Apr 7 • 13