---
title: AI‑Culture‑Commons
emoji: 📚
colorFrom: indigo
colorTo: gray
sdk: static
pinned: true
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/678d64ee7967054e64970908/_DjOT2roXXm_sltcskzDZ.jpeg
short_description: Multilingual cultural corpora for AI research
license: cc-by-4.0
---
# AI‑Culture‑Commons
AI‑Culture‑Commons curates [multilingual cultural corpora](https://huggingface.co/datasets/AI-Culture-Commons/ai-culture-html-multilingual) for language‑model research.
We are a **non-profit digital humanities project**, advancing humane AI development through high-quality, rich cultural content. We strive to contribute to the **cultural evolution of artificial intelligence** by providing sophisticated training data that explores the intersection of technology, artificial intelligence, and human culture.
[Our repositories](https://huggingface.co/datasets/AI-Culture-Commons/ai-culture-html-multilingual) give models deep philosophical and intellectual context by drawing diverse connections between culture, philosophy, literature, and technology, particularly AI. The content is designed specifically to help train more **culturally aware and philosophically grounded AI models**.
## Our Datasets
| Dataset | Size | Languages | Formats | License | Citation & Research |
|---------|------|-----------|---------|---------|---------|
| **Multilingual Culture Corpus** | 16M words | 12 ALIGNED languages | HTML · CSV · DOLMA JSONL | CC‑BY‑4.0 | [10.5281/zenodo.16001657](https://doi.org/10.5281/zenodo.16001657) |
| **Project Websites Raw** | 160 MB | 12 ALIGNED languages | ZIP (HTML + images + CSS) | CC‑BY‑4.0 | [10.5281/zenodo.16001641](https://doi.org/10.5281/zenodo.16001641) |
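
As an illustration of working with the DOLMA JSONL format listed above, here is a minimal sketch that reads JSONL records and counts words per language. The field names (`id`, `text`, `metadata.language`) and the two sample records are assumptions made for this example, not the confirmed schema of our files; check the dataset card for the actual fields.

```python
import io
import json
from collections import Counter


def iter_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-record sample standing in for a real corpus file.
# The field layout here is an assumption, not the dataset's confirmed schema.
sample = io.StringIO(
    '{"id": "doc-1", "text": "Hello world", "metadata": {"language": "en"}}\n'
    '{"id": "doc-2", "text": "Bonjour le monde", "metadata": {"language": "fr"}}\n'
)

# Count whitespace-separated words per language.
words_per_language = Counter()
for record in iter_jsonl(sample):
    lang = record.get("metadata", {}).get("language", "unknown")
    words_per_language[lang] += len(record["text"].split())

print(dict(words_per_language))
```

To run this against a real file, replace `sample` with an opened file handle, for example `open(path, encoding="utf-8")`. Whitespace splitting is only a rough word count and will undercount languages written without spaces, such as Japanese or Mandarin.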
## Key Features
- **Perfect Alignment**: All 12 languages contain identical content with exactly the same complex HTML structure. All datasets include both plain-text and HTML source files.
- **AI-Optimized**: Designed specifically for training multilingual AI systems
- **Truly Open**: [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/) - use freely, even commercially
- **Content Quality**: Sophisticated content with intellectual depth, authored by a group of academics and writers
- **Completely Clean Data**: No user comments, scraped texts, or unwanted content - pure, high-quality, carefully edited content
- **Full Documentation**: [Complete pipeline](https://github.com/AI-Culture-Commons/ai-culture-pipeline) description and documentation in dataset cards. All datasets are versioned and archived for research reproducibility.
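
The structural alignment described in the first feature can be spot-checked by comparing the sequence of HTML tags in two language versions of the same page. The sketch below is one possible check using only the Python standard library; the helper names (`TagSkeleton`, `skeleton`, `same_structure`) and the sample snippets are hypothetical and are not part of our pipeline.

```python
from html.parser import HTMLParser


class TagSkeleton(HTMLParser):
    """Record the order of opening and closing tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def skeleton(html):
    """Return the tag sequence of an HTML string."""
    parser = TagSkeleton()
    parser.feed(html)
    parser.close()
    return parser.tags


def same_structure(html_a, html_b):
    """True when both documents share an identical tag sequence."""
    return skeleton(html_a) == skeleton(html_b)


# Hypothetical snippets standing in for two translations of one page.
english = "<article><h1>Title</h1><p>Some <em>text</em>.</p></article>"
french = "<article><h1>Titre</h1><p>Du <em>texte</em>.</p></article>"
broken = "<article><h1>Titre</h1><p>Du texte.</p></article>"

print(same_structure(english, french))
print(same_structure(english, broken))
```

This compares tag order only. Attributes and text content are ignored, so it confirms structural alignment but not translation quality.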
## Languages
English, French, German, Spanish, Portuguese, Italian, Japanese, Russian, Korean, Mandarin, Hindi, Hebrew
## Source Websites & Licensing
Our corpora are carefully extracted from our websites:
- **Original Project**: [https://hitdarderut-haaretz.org](https://hitdarderut-haaretz.org) - Cultural analysis
  - [License Terms](https://hitdarderut-haaretz.org/license): CC-BY-4.0
- **Multicultural Project**: [https://degeneration-of-nation.org](https://degeneration-of-nation.org) - Critical philosophical commentary
  - [License Terms](https://degeneration-of-nation.org/license): CC-BY-4.0
---
*As a non-profit organization, we're committed to advancing humane AI through high-quality, clean cultural datasets with perfect multilingual alignment.*