The Smol Training Playbook: The Secrets to Building World-Class LLMs • 1.68k
StarCoder 2 and The Stack v2: The Next Generation Paper • 2402.19173 • Published Feb 29, 2024 • 149
The Ultimate Collection of Code Classifiers Collection • 15 classifiers, 124M parameters, one per programming language, for assessing the educational value of GitHub code • 15 items • Updated May 5 • 15
Essential-Web v1.0: 24T tokens of organized web data Paper • 2506.14111 • Published Jun 17 • 46
Article • nanoJAXGPT: A pedagogical introduction to JAX/Equinox • By sachithgunasekara and 2 others • Oct 23, 2024 • 5
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training Paper • 2504.13161 • Published Apr 17 • 93
TxT360: Trillion Extracted Text • Explore and utilize a large, deduplicated text dataset for LLM training • 125
The Ultra-Scale Playbook • The ultimate guide to training LLMs on large GPU clusters • 3.45k
Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks • Evaluate multilingual models using FineTasks • 79