Xiaosen Zheng

xszheng2020

https://xszheng2020.github.io

AI & ML interests

Code AI and Data-Centric AI.

Recent Activity

upvoted a paper about 7 hours ago

Diffusion Language Models are Super Data Learners

liked a dataset 1 day ago

tokyotech-llm/swallow-math

liked a dataset 1 day ago

tokyotech-llm/swallow-code

upvoted a paper about 7 hours ago

Diffusion Language Models are Super Data Learners

Paper • 2511.03276 • Published 1 day ago • 53

upvoted 6 papers about 1 month ago

upvoted a collection about 1 month ago

cwm

Collection

Collection for Code World Model, an agentic coding model from FAIR. • 3 items • Updated Sep 24 • 17

upvoted 2 papers 2 months ago

SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Paper • 2509.02479 • Published Sep 2 • 83

TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training

Paper • 2508.17677 • Published Aug 25 • 14

upvoted a paper 3 months ago

CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction

Paper • 2502.07316 • Published Feb 11 • 50

upvoted 2 collections 3 months ago

Pre-training Dataset Samples

Collection

A collection of pre-training datasets samples of sizes 10M, 100M and 1B tokens. Ideal for use in quick experimentation and ablations. • 16 items • Updated 1 day ago • 7

Seed-Coder

Collection

4 items • Updated May 13 • 21

upvoted 2 papers 3 months ago

Geometric-Mean Policy Optimization

Paper • 2507.20673 • Published Jul 28 • 31

Group Sequence Policy Optimization

Paper • 2507.18071 • Published Jul 24 • 307

upvoted 3 papers 4 months ago

MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

Paper • 2507.16812 • Published Jul 22 • 63

SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?

Paper • 2507.12415 • Published Jul 16 • 42

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 75

upvoted 2 papers 5 months ago

Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training

Paper • 2506.10952 • Published Jun 12 • 22

SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond

Paper • 2505.19641 • Published May 26 • 67
