---
title: AI‑Culture‑Commons
emoji: 📚
colorFrom: indigo
colorTo: gray
sdk: static
pinned: true
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/678d64ee7967054e64970908/_DjOT2roXXm_sltcskzDZ.jpeg
short_description: Multilingual cultural corpora for AI research
license: cc-by-4.0
---

# AI‑Culture‑Commons
AI‑Culture‑Commons curates [multilingual cultural corpora](https://huggingface.co/datasets/AI-Culture-Commons/ai-culture-html-multilingual) for language‑model research.

We are a **non-profit digital humanities project**, advancing humane AI development through high-quality, rich cultural content. We strive to contribute to the **cultural evolution of artificial intelligence** by providing sophisticated training data that explores the intersection of technology, artificial intelligence, and human culture.

[Our repositories](https://huggingface.co/datasets/AI-Culture-Commons/ai-culture-html-multilingual) give models deep philosophical and intellectual context by drawing connections between culture, philosophy, literature, and technology, particularly AI. The content is designed to help train more **culturally aware and philosophically grounded AI models**.

## Our Datasets
| Dataset | Size | Languages | Formats | License | Citation & Research |
|---------|------|-----------|---------|---------|---------|
| **Multilingual Culture Corpus** | 16M words | 12 ALIGNED languages | HTML · CSV · DOLMA JSONL | CC‑BY‑4.0 | [![DOI](https://zenodo.org/badge/1021100370.svg)](https://doi.org/10.5281/zenodo.16001657) |
| **Project Websites Raw** | 160MB | 12 ALIGNED languages | ZIP (HTML + images + CSS) | CC‑BY‑4.0 | [![DOI](https://zenodo.org/badge/1021100223.svg)](https://doi.org/10.5281/zenodo.16001641) |
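
The JSONL format listed above stores one JSON document per line. As a minimal sketch of reading such a file, the example below groups records by language. The field names (`id`, `text`, `source`, `metadata.lang`) and the sample records are assumptions for illustration only; check the dataset card for the actual schema.

```python
import io
import json


def read_jsonl(stream):
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical records; the real corpus schema may differ.
sample = io.StringIO(
    '{"id": "doc-1", "text": "Hello", "source": "example", "metadata": {"lang": "en"}}\n'
    '{"id": "doc-1", "text": "Bonjour", "source": "example", "metadata": {"lang": "fr"}}\n'
)

# Map each (assumed) language code to the text of its record.
by_lang = {rec["metadata"]["lang"]: rec["text"] for rec in read_jsonl(sample)}
print(by_lang)
```

For a real file, replace the in-memory `sample` with `open(path, encoding="utf-8")`.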

## Key Features
- **Perfect Alignment**: All 12 languages contain identical content with the exact same HTML structure. All datasets include both plain text and HTML source files.
- **AI-Optimized**: Designed specifically for training multilingual AI systems
- **Truly Open**: [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/): free to use, including commercially, with attribution
- **Content Quality**: Sophisticated content with intellectual depth, authored by a group of academics and writers
- **Completely Clean Data**: No user comments, scraped texts, or unwanted content; only carefully edited, high-quality content
- **Full Documentation**: The [complete pipeline](https://github.com/AI-Culture-Commons/ai-culture-pipeline) is described and documented in the dataset cards. All datasets are versioned and archived for research reproducibility.
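
The alignment claim above (identical HTML structure across languages) can be spot-checked by comparing the tag sequences of two translations of the same page. The sketch below uses only the Python standard library; the two sample pages are hypothetical, and this is not the project's own validation code.

```python
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Record opening and closing tags in order, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(("open", tag))

    def handle_endtag(self, tag):
        self.tags.append(("close", tag))


def tag_sequence(html: str) -> list:
    """Return the ordered list of (kind, tag) pairs found in an HTML string."""
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags


def same_structure(html_a: str, html_b: str) -> bool:
    """True when both documents have the same tag sequence."""
    return tag_sequence(html_a) == tag_sequence(html_b)


# Hypothetical English and French versions of one page.
english = "<article><h1>Title</h1><p>Some <em>text</em>.</p></article>"
french = "<article><h1>Titre</h1><p>Du <em>texte</em>.</p></article>"
print(same_structure(english, french))  # True
```

Comparing tag sequences ignores attributes and text, so it checks structural alignment only, not translation quality.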

## Languages
English, French, German, Spanish, Portuguese, Italian, Japanese, Russian, Korean, Mandarin, Hindi, Hebrew

## Source Websites & Licensing
Our corpora are carefully extracted from our websites:
- **Original Project**: [https://hitdarderut-haaretz.org](https://hitdarderut-haaretz.org) - Cultural analysis
  - [License Terms](https://hitdarderut-haaretz.org/license): CC-BY-4.0
- **Multicultural Project**: [https://degeneration-of-nation.org](https://degeneration-of-nation.org) - Critical philosophical commentary  
  - [License Terms](https://degeneration-of-nation.org/license): CC-BY-4.0

---
*As a non-profit organization, we're committed to advancing humane AI through high-quality, clean cultural datasets with perfect multilingual alignment.*