FineData

Team

AI & ML interests

We release large pre-training datasets to accelerate open LLM development. We are part of the Hugging Face Science team (hf.co/science).

Recent Activity

guipenedo updated a dataset 9 days ago

HuggingFaceFW/fineweb-2

guipenedo new activity 9 days ago

HuggingFaceFW/fineweb-2:Synthetic Data Generator

guipenedo new activity 9 days ago

HuggingFaceFW/fineweb-2:Number of rows not available for all configs.

Papers

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

HuggingFaceFW's datasets (12)

HuggingFaceFW/fineweb-2

Viewer • Updated 9 days ago • 4.48B • 93.9k • 681

HuggingFaceFW/finewiki

Viewer • Updated 15 days ago • 61.6M • 13.6k • 217

HuggingFaceFW/clean-wikipedia

Viewer • Updated 15 days ago • 61.2M • 1.65k • 23

HuggingFaceFW/finepdfs_lang_classification_tmp

Updated 16 days ago • 12

HuggingFaceFW/ocr-annotations

Viewer • Updated 16 days ago • 1.62k • 233 • 14

HuggingFaceFW/finepdfs_lang_classification

Viewer • Updated 20 days ago • 3.08M • 6.15k • 4

HuggingFaceFW/finepdfs

Viewer • Updated Sep 8 • 475M • 56.5k • 651

HuggingFaceFW/fineweb

Viewer • Updated Jul 11 • 52.5B • 305k • 2.42k

HuggingFaceFW/fineweb-edu

Viewer • Updated Jul 11 • 3.5B • 233k • 797

HuggingFaceFW/fineweb-edu-score-2

Viewer • Updated Jul 11 • 13.9B • 39.3k • 81

HuggingFaceFW/admin

Viewer • Updated Dec 7, 2024 • 16 • 16.5k • 3

HuggingFaceFW/fineweb-edu-llama3-annotations

Viewer • Updated Jun 3, 2024 • 467k • 260 • 45