-
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
Paper • 2506.20920 • Published • 75 -
HuggingFaceFW/finewiki
Viewer • Updated • 61.6M • 19.6k • 263 -
nhagar/fineweb_urls
Viewer • Updated • 24.5B • 239 • 1 -
PleIAs/common_corpus
Viewer • Updated • 470M • 42.7k • 320
Collections
Discover the best community collections!
Collections including paper arxiv:2506.20920
-
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
Paper • 2506.20920 • Published • 75 -
SmolVLM: Redefining small and efficient multimodal models
Paper • 2504.05299 • Published • 200 -
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Paper • 2303.03915 • Published • 7 -
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
Paper • 2502.02737 • Published • 250
-
ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
Paper • 2506.18095 • Published • 66 -
FreedomIntelligence/ShareGPT-4o-Image
Viewer • Updated • 92.3k • 22.2k • 92 -
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
Paper • 2506.20920 • Published • 75
-
MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings
Paper • 2405.19504 • Published • 3 -
HiWave: Training-Free High-Resolution Image Generation via Wavelet-Based Diffusion Sampling
Paper • 2506.20452 • Published • 19 -
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
Paper • 2506.20920 • Published • 75 -
The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm
Paper • 2507.18553 • Published • 40
-
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Paper • 2503.14734 • Published • 5 -
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
Paper • 2401.02117 • Published • 33 -
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Paper • 2506.01844 • Published • 142 -
Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding
Paper • 2506.16035 • Published • 88
-
Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated
Paper • 2509.05739 • Published • 2 -
Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers
Paper • 2509.03059 • Published • 24 -
Universal Deep Research: Bring Your Own Model and Strategy
Paper • 2509.00244 • Published • 13 -
<think> So let's replace this phrase with insult... </think> Lessons learned from generation of toxic texts with LLMs
Paper • 2509.08358 • Published • 13
-
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
Paper • 2506.07044 • Published • 113 -
ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
Paper • 2506.09513 • Published • 99 -
AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
Paper • 2505.24863 • Published • 97 -
Seedance 1.0: Exploring the Boundaries of Video Generation Models
Paper • 2506.09113 • Published • 102
-
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
Paper • 2506.20920 • Published • 75 -
SmolVLM: Redefining small and efficient multimodal models
Paper • 2504.05299 • Published • 200 -
YourBench: Easy Custom Evaluation Sets for Everyone
Paper • 2504.01833 • Published • 22 -
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
Paper • 2502.02737 • Published • 250
-
The Curse of Depth in Large Language Models
Paper • 2502.05795 • Published • 40 -
Transformers without Normalization
Paper • 2503.10622 • Published • 172 -
Parallel Scaling Law for Language Models
Paper • 2505.10475 • Published • 83 -
Learning to Skip the Middle Layers of Transformers
Paper • 2506.21103 • Published • 18
-
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
Paper • 2506.20920 • Published • 75 -
HuggingFaceFW/finewiki
Viewer • Updated • 61.6M • 19.6k • 263 -
nhagar/fineweb_urls
Viewer • Updated • 24.5B • 239 • 1 -
PleIAs/common_corpus
Viewer • Updated • 470M • 42.7k • 320
-
Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated
Paper • 2509.05739 • Published • 2 -
Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers
Paper • 2509.03059 • Published • 24 -
Universal Deep Research: Bring Your Own Model and Strategy
Paper • 2509.00244 • Published • 13 -
<think> So let's replace this phrase with insult... </think> Lessons learned from generation of toxic texts with LLMs
Paper • 2509.08358 • Published • 13
-
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
Paper • 2506.20920 • Published • 75 -
SmolVLM: Redefining small and efficient multimodal models
Paper • 2504.05299 • Published • 200 -
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Paper • 2303.03915 • Published • 7 -
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
Paper • 2502.02737 • Published • 250
-
ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
Paper • 2506.18095 • Published • 66 -
FreedomIntelligence/ShareGPT-4o-Image
Viewer • Updated • 92.3k • 22.2k • 92 -
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
Paper • 2506.20920 • Published • 75
-
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
Paper • 2506.07044 • Published • 113 -
ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
Paper • 2506.09513 • Published • 99 -
AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
Paper • 2505.24863 • Published • 97 -
Seedance 1.0: Exploring the Boundaries of Video Generation Models
Paper • 2506.09113 • Published • 102
-
MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings
Paper • 2405.19504 • Published • 3 -
HiWave: Training-Free High-Resolution Image Generation via Wavelet-Based Diffusion Sampling
Paper • 2506.20452 • Published • 19 -
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
Paper • 2506.20920 • Published • 75 -
The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm
Paper • 2507.18553 • Published • 40
-
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
Paper • 2506.20920 • Published • 75 -
SmolVLM: Redefining small and efficient multimodal models
Paper • 2504.05299 • Published • 200 -
YourBench: Easy Custom Evaluation Sets for Everyone
Paper • 2504.01833 • Published • 22 -
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
Paper • 2502.02737 • Published • 250
-
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Paper • 2503.14734 • Published • 5 -
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
Paper • 2401.02117 • Published • 33 -
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Paper • 2506.01844 • Published • 142 -
Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding
Paper • 2506.16035 • Published • 88
-
The Curse of Depth in Large Language Models
Paper • 2502.05795 • Published • 40 -
Transformers without Normalization
Paper • 2503.10622 • Published • 172 -
Parallel Scaling Law for Language Models
Paper • 2505.10475 • Published • 83 -
Learning to Skip the Middle Layers of Transformers
Paper • 2506.21103 • Published • 18