AI & ML interests
We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science).
Papers
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Organization Card
🍷 FineData
This is the home of the 🍷 FineData team, a branch of the 🤗 Hugging Face Science Team that releases large-scale pre-training datasets to accelerate open LLM development.
- 🍷 FineWeb: a 15T-token English dataset for LLM pre-training. See the blog post and paper.
- 📚 FineWeb-Edu: a filtered subset of the most educational content from FineWeb.
- 🥂 FineWeb2: an extension of FineWeb to over 1000 languages. See the paper.
- 📄 FinePDFs: 3T tokens of text data extracted from PDFs sourced from the web.
- 🌐 FineWiki: an updated, better-extracted version of Wikipedia in 300+ languages.
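The datasets listed above are far too large to download casually, so a quick way to inspect one is to stream a few rows. The snippet below is a minimal sketch, not an official recipe: it assumes the `datasets` library is installed, the `peek` helper and the `FINEDATA_REPOS` mapping are hypothetical names introduced here, and the repo ids are taken from this organization's dataset listing. Some repos (the multilingual ones in particular) may require a config name, so check the dataset card for the right value.

```python
from itertools import islice

# Repo ids as they appear in this organization's dataset listing.
# The mapping itself is a convenience invented for this example.
FINEDATA_REPOS = {
    "FineWeb": "HuggingFaceFW/fineweb",
    "FineWeb-Edu": "HuggingFaceFW/fineweb-edu",
    "FineWeb2": "HuggingFaceFW/fineweb-2",
    "FinePDFs": "HuggingFaceFW/finepdfs",
    "FineWiki": "HuggingFaceFW/finewiki",
}


def peek(dataset_name, n=3, config=None, split="train"):
    """Stream the first n rows of a FineData dataset without downloading it all.

    `config` is passed through as the dataset config name; which configs
    exist for a given repo is an assumption to verify on its dataset card.
    """
    repo_id = FINEDATA_REPOS[dataset_name]  # KeyError for unknown names

    # Imported lazily so the mapping above is usable without `datasets`.
    from datasets import load_dataset  # pip install datasets

    stream = load_dataset(repo_id, name=config, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `peek("FineWeb", n=2)` needs network access and returns a list of row dictionaries. Streaming keeps memory and disk use small because rows are fetched lazily instead of materializing the full dataset.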
Spaces (6)
- 🌐 "FineWiki Viewer" (Running • 8): Viewer to explore the finewiki dataset
- 🍷 "FineWeb: decanting the web for the finest text data at scale" (Running • 1.14k): Generate high-quality text data for LLMs using FineWeb
- 📝 "Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks" (Running • 74): Evaluate multilingual models using FineTasks
- 🏢 "Tasks Explorer" (Sleeping): Explore and analyze experiment results
- 📊 "Datasets Metrics Explorer" (Running • 4): Launch an interactive demo interface
Models (30)
- HuggingFaceFW/fineweb-edu-classifier • Text Classification • 0.1B • 1.59k • 198
- HuggingFaceFW/Datasets-Metrics-Viewer-Data
- HuggingFaceFW/ablation-model-fineweb-edu • Text Generation • 2B • 425 • 16
- HuggingFaceFW/ablation-exp-filter-custom-all_filters-28BT • Text Generation • 2B • 1 • 1
- HuggingFaceFW/ablation-exp-filter-custom-line_char_duplicated_0.01-28BT • Text Generation • 2B • 1 • 2
- HuggingFaceFW/ablation-exp-filter-custom-line_ratio_0.67-28BT • Text Generation • 2B • 3
- HuggingFaceFW/ablation-exp-filter-custom-lines_punct_0.12-28BT • Text Generation • 2B • 7 • 3
- HuggingFaceFW/ablation-exp-filter-baseline_c4-28BT • Text Generation • 2B • 4 • 2
- HuggingFaceFW/ablation-exp-filter-baseline_cc-28BT • Text Generation • 2B • 3 • 4
- HuggingFaceFW/ablation-exp-filter-c4-word_lengths-28BT • Text Generation • 2B • 3 • 2
Datasets (12)
- HuggingFaceFW/fineweb-2 • 4.48B • 99k • 678
- HuggingFaceFW/finewiki • 61.6M • 12.2k • 197
- HuggingFaceFW/clean-wikipedia • 61.2M • 1.61k • 23
- HuggingFaceFW/finepdfs_lang_classification_tmp • 11
- HuggingFaceFW/ocr-annotations • 1.62k • 222 • 13
- HuggingFaceFW/finepdfs_lang_classification • 3.08M • 4.95k • 4
- HuggingFaceFW/finepdfs • 475M • 53.5k • 647
- HuggingFaceFW/fineweb • 52.5B • 303k • 2.42k
- HuggingFaceFW/fineweb-edu • 3.5B • 243k • 793
- HuggingFaceFW/fineweb-edu-score-2 • 13.9B • 37.4k • 81