These samples were created with reservoir sampling, an algorithm that draws a uniform (unbiased) random sample from a massive source dataset in a single pass. Because the sample mirrors the composition of the full dataset, results you get at the 1B-token scale are representative of how these datasets behave at 100B+ token scales, letting you iterate quickly without the computational overhead.
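For reference, reservoir sampling (Algorithm R) keeps a fixed-size uniform sample while streaming over data of unknown length, using only O(k) memory. The sketch below is illustrative of the technique, not the exact code used to build these samples:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: return a uniform random sample of k items from an
    iterable of unknown length, in a single pass with O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # pick a slot in [0, i], inclusive
            if j < k:
                reservoir[j] = item     # item survives with probability k/(i+1)
    return reservoir

# Example: sample 1,000 documents from a 10M-document stream.
sample = reservoir_sample(range(10_000_000), k=1_000)
```

At every step, each item seen so far sits in the reservoir with equal probability, which is what makes the resulting sample unbiased regardless of the source size.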
The collection includes:

- finePDFs-1B: High-quality, textbook-style educational content
- DCLM-baseline-1B: Filtered, diverse web content
- FineWeb-Edu-1B: Curated educational web resources
We used these exact samples to run 50+ systematic experiments on dataset mixing strategies, ultimately discovering that a 50-30-20 mixture of finePDFs + DCLM-baseline + FineWeb-Edu achieves 90%+ of GPT-2's performance with just 1/10th the training data.
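If you want to try that mixture yourself, one way to do it is to stream the three samples and interleave them at the 50/30/20 ratio with the 🤗 `datasets` library. This is a minimal sketch: the repo IDs are placeholders for wherever you host the samples, and it assumes each dataset exposes a `text` column.

```python
from datasets import load_dataset, interleave_datasets

# Placeholder repo IDs -- substitute the actual dataset paths you use.
finepdfs = load_dataset("your-org/finePDFs-1B", split="train", streaming=True)
dclm = load_dataset("your-org/DCLM-baseline-1B", split="train", streaming=True)
fineweb = load_dataset("your-org/FineWeb-Edu-1B", split="train", streaming=True)

# Draw from the three streams at the 50/30/20 ratio described above.
mixed = interleave_datasets(
    [finepdfs, dclm, fineweb],
    probabilities=[0.5, 0.3, 0.2],
    seed=42,
    stopping_strategy="all_exhausted",
)

# Peek at a few mixed examples (assumes a "text" field in each dataset).
for example in mixed.take(3):
    print(example["text"][:80])
```

Streaming mode keeps memory flat, and the fixed seed makes the interleaving reproducible across runs.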
Whether you're researching optimal data mixtures, testing curriculum learning strategies, or simply prototyping a pretraining run, these samples give you a solid foundation to start experimenting immediately.