Kyle Pearson
PRO
pearsonkyle
0 followers · 1 following
AI & ML interests
Exoplanets, Atmospheric Characterization, Observational Astronomy
Recent Activity
upvoted an article 2 days ago: The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix
reacted to codelion's post 2 days ago:
Want to experiment with pre-training dataset mixtures but don't want to process terabytes of data? We've got you covered. We're releasing a collection of several carefully curated 1B token dataset samples specifically designed for rapid prototyping and pretraining experiments: https://huggingface.co/collections/codelion/pre-training-dataset-samples

These samples were created using reservoir sampling - an algorithm that guarantees statistically unbiased random samples from massive source datasets. This means results you get at the 1B token scale are representative of how these datasets behave at 100B+ token scales, letting you iterate quickly without the computational overhead.

The collection includes:
- finePDFs-1B: High-quality textbook-style educational content
- DCLM-baseline-1B: Filtered, diverse web content
- FineWeb-Edu-1B: Curated educational web resources

We used these exact samples to run 50+ systematic experiments on dataset mixing strategies, ultimately discovering that a 50-30-20 mixture of finePDFs + DCLM-baseline + FineWeb-Edu achieves 90%+ of GPT-2's performance with just 1/10th the training data.

Whether you're researching optimal data mixtures, testing curriculum learning strategies, or just want to quickly prototype a pretraining run, these samples give you a solid foundation to start experimenting immediately.

Read the full story of how we used these datasets to find the optimal pretraining recipe: https://huggingface.co/blog/codelion/optimal-dataset-mixing
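The post names reservoir sampling but does not show how it was applied. As a rough illustration only, here is a minimal sketch of the classic single-pass variant (often called Algorithm R), which keeps a uniform random sample of `k` items from a stream of unknown length. The function name `reservoir_sample` and its parameters are hypothetical, and the sketch samples whole items (for example, documents) rather than enforcing a token budget, so it is not necessarily how the 1B token samples were built.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from stream in one pass.

    Hypothetical helper for illustration (Algorithm R); not taken from the post.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 "documents" from a stream of 1000 without loading them all at once.
    docs = (f"doc-{n}" for n in range(1000))
    print(reservoir_sample(docs, k=5, seed=0))
```

Because only the reservoir is held in memory, the same loop works over a streamed dataset that is far too large to load at once.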
liked a Space 9 days ago: linoyts/Qwen-Image-Edit-next-scene
Organizations
None yet
spaces
1
Running
mtg-card-explorer
models
3
pearsonkyle/ArtPrompter • Text Generation • Updated Feb 7, 2023 • 23 • 2
pearsonkyle/gpt2-arxiv • Text Generation • Updated Jan 20, 2023 • 3 • 1
pearsonkyle/gpt2-exomachina • Text Generation • Updated May 23, 2021 • 7 • 1
datasets
0
None public yet