Craw4LLM: Efficient Web Crawling for LLM Pretraining Paper • 2502.13347 • Published Feb 19 • 30
Article Releasing Common Corpus: the largest public domain dataset for training LLMs Mar 20, 2024 • 29
Essential-Web v1.0: 24T tokens of organized web data Paper • 2506.14111 • Published Jun 17 • 46
Article 🥬 TinyLettuce: Efficient Hallucination Detection with 17–68M Encoders Aug 31 • 15