Update README.md
README.md
CHANGED
@@ -1,15 +1,15 @@
----
-title: AI‑Culture‑Commons
-emoji: 📚
-colorFrom: indigo
-colorTo: gray
-sdk: static
-pinned: true
-thumbnail: >-
-  https://cdn-uploads.huggingface.co/production/uploads/678d64ee7967054e64970908/
-short_description: Multilingual cultural corpora for AI research
-license: cc-by-4.0
----
+---
+title: AI‑Culture‑Commons
+emoji: 📚
+colorFrom: indigo
+colorTo: gray
+sdk: static
+pinned: true
+thumbnail: >-
+  https://cdn-uploads.huggingface.co/production/uploads/678d64ee7967054e64970908/PHTcXWQoX7_2_9CjFoHlJ.jpeg
+short_description: Multilingual cultural corpora for AI research
+license: cc-by-4.0
+---
 
 # AI‑Culture‑Commons
 AI‑Culture‑Commons curates multilingual cultural corpora for language‑model research.
@@ -24,7 +24,7 @@ Our repositories provide models with deep philosophical-intellectual context, di
 | **Multilingual Culture Corpus** | 16M words | 12 ALIGNED languages | HTML · CSV · DOLMA JSONL | CC‑BY‑4.0 | [](https://doi.org/10.5281/zenodo.16001657) |
 | **Project Websites Raw** | 160MB | 12 ALIGNED languages | ZIP (HTML + images + CSS) | CC‑BY‑4.0 | [](https://doi.org/10.5281/zenodo.16001641) |
 
-
+## Key Features
 - **Perfect Alignment**: All 12 languages contain identical content with exact same complex HTML structure. All datasets include both pure text and HTML source files
 - **AI-Optimized**: Designed specifically for training multilingual AI systems
 - **Truly Open**: [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/) - use freely, even commercially