Kyle Pearson PRO

pearsonkyle

AI & ML interests

Exoplanets, Atmospheric Characterization, Observational Astronomy

Recent Activity

reacted to codelion's post with 🚀 2 days ago
Want to experiment with pre-training dataset mixtures but don't want to process terabytes of data? We've got you covered. We're releasing a collection of several carefully curated 1B-token dataset samples specifically designed for rapid prototyping and pretraining experiments: https://huggingface.co/collections/codelion/pre-training-dataset-samples

These samples were created using reservoir sampling, an algorithm that guarantees statistically unbiased random samples from massive source datasets. This means results you get at the 1B-token scale are representative of how these datasets behave at 100B+ token scales, letting you iterate quickly without the computational overhead.

The collection includes:
- finePDFs-1B: high-quality, textbook-style educational content
- DCLM-baseline-1B: filtered, diverse web content
- FineWeb-Edu-1B: curated educational web resources

We used these exact samples to run 50+ systematic experiments on dataset mixing strategies, ultimately discovering that a 50-30-20 mixture of finePDFs + DCLM-baseline + FineWeb-Edu achieves 90%+ of GPT-2's performance with just 1/10th the training data.

Whether you're researching optimal data mixtures, testing curriculum learning strategies, or just want to quickly prototype a pretraining run, these samples give you a solid foundation to start experimenting immediately.

Read the full story of how we used these datasets to find the optimal pretraining recipe: https://huggingface.co/blog/codelion/optimal-dataset-mixing
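The post leans on two techniques worth sketching: reservoir sampling for building the 1B-token samples, and probabilistic interleaving for the 50-30-20 mixture. Below is a minimal reservoir-sampling sketch (classic Algorithm R) in Python; the function name `reservoir_sample` and the streaming interface are illustrative, not taken from the released code.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")

def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Draw k items uniformly at random from a stream of unknown length.

    Classic Algorithm R: keep the first k items, then replace entries with
    decreasing probability so every element ends up in the sample with
    probability k / n -- the unbiasedness property the post relies on.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

And a sketch of the 50-30-20 mixture using `datasets.interleave_datasets`; the repository ids below are assumptions based on the sample names in the post, so check the linked collection for the exact paths.

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical repo ids -- the collection page lists the real ones.
fine_pdfs = load_dataset("codelion/finePDFs-1B", split="train", streaming=True)
dclm = load_dataset("codelion/DCLM-baseline-1B", split="train", streaming=True)
fineweb = load_dataset("codelion/FineWeb-Edu-1B", split="train", streaming=True)

# 50-30-20 finePDFs / DCLM-baseline / FineWeb-Edu mixture described in the post.
mixed = interleave_datasets(
    [fine_pdfs, dclm, fineweb],
    probabilities=[0.5, 0.3, 0.2],
    seed=42,
)
```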
liked a Space 9 days ago
linoyts/Qwen-Image-Edit-next-scene

Organizations

None yet