| title: AI‑Culture‑Commons | |
| emoji: 📚 | |
| colorFrom: indigo | |
| colorTo: gray | |
| sdk: static | |
| pinned: true | |
| thumbnail: >- | |
| https://cdn-uploads.huggingface.co/production/uploads/678d64ee7967054e64970908/_DjOT2roXXm_sltcskzDZ.jpeg | |
| short_description: Multilingual cultural corpora for AI research | |
| license: cc-by-4.0 | |
| # AI‑Culture‑Commons | |
| AI‑Culture‑Commons curates [multilingual cultural corpora](https://huggingface.co/datasets/AI-Culture-Commons/ai-culture-html-multilingual) for language‑model research. | |
| We are a **non-profit digital humanities project**, advancing humane AI development through high-quality, rich cultural content. We strive to contribute to the **cultural evolution of artificial intelligence** by providing sophisticated training data that explores the intersection of technology, artificial intelligence, and human culture. | |
| [Our repositories](https://huggingface.co/datasets/AI-Culture-Commons/ai-culture-html-multilingual) give models deep philosophical and intellectual context and trace diverse connections between culture, philosophy, literature, and technology, particularly AI. The content is designed to help train more **culturally aware and philosophically grounded AI models**. | |
| ## Our Datasets | |
| | Dataset | Size | Languages | Formats | License | Citation & Research | | |
| |---------|------|-----------|---------|---------|---------| | |
| **Multilingual Culture Corpus** | 16M words | 12 ALIGNED languages | HTML · CSV · DOLMA JSONL | CC‑BY‑4.0 | [10.5281/zenodo.16001657](https://doi.org/10.5281/zenodo.16001657) | |
| **Project Websites Raw** | 160MB | 12 ALIGNED languages | ZIP (HTML + images + CSS) | CC‑BY‑4.0 | [10.5281/zenodo.16001641](https://doi.org/10.5281/zenodo.16001641) | |
| ## Key Features | |
| - **Perfect Alignment**: All 12 languages contain identical content with exactly the same complex HTML structure. All datasets include both pure text and HTML source files. | |
| - **AI-Optimized**: Designed specifically for training multilingual AI systems | |
| - **Truly Open**: [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/) - use freely, even commercially, with attribution | |
| - **Content Quality**: Sophisticated content with intellectual depth, authored by a group of academics and writers | |
| - **Completely Clean Data**: No user comments, scraped texts, or unwanted content - pure, high-quality, carefully edited content | |
| - **Full Documentation**: [Complete pipeline](https://github.com/AI-Culture-Commons/ai-culture-pipeline) description and documentation in dataset cards. All datasets are versioned and archived for research reproducibility | |
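The alignment claim above can be spot-checked by comparing the tag sequence of two language versions of the same page. The following is a minimal sketch using only the Python standard library; the helper names (`TagSkeleton`, `tag_skeleton`, `same_structure`) and the sample pages are hypothetical illustrations, not part of the project's pipeline.

```python
from html.parser import HTMLParser


class TagSkeleton(HTMLParser):
    """Collects the sequence of opening and closing tags, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored here; only the tag name is recorded.
        self.tags.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")


def tag_skeleton(html: str) -> list[str]:
    """Return the ordered list of tags found in an HTML string."""
    parser = TagSkeleton()
    parser.feed(html)
    parser.close()
    return parser.tags


def same_structure(html_a: str, html_b: str) -> bool:
    """True when both documents contain the same tags in the same order."""
    return tag_skeleton(html_a) == tag_skeleton(html_b)


# Hypothetical sample pages, not taken from the corpus.
english = "<article><h1>Culture</h1><p>On <em>AI</em>.</p></article>"
french = "<article><h1>Culture</h1><p>Sur l'<em>IA</em>.</p></article>"
broken = "<article><h1>Culture</h1><p>Sur l'IA.</p></article>"

print(same_structure(english, french))
print(same_structure(english, broken))
```

This compares tag order only. A stricter check could also compare attributes such as `class` or `id`, depending on how much of the markup is expected to match across languages.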
| ## Languages | |
| English, French, German, Spanish, Portuguese, Italian, Japanese, Russian, Korean, Mandarin, Hindi, Hebrew | |
| ## Source Websites & Licensing | |
| Our corpora are carefully extracted from our websites: | |
| - **Original Project**: [https://hitdarderut-haaretz.org](https://hitdarderut-haaretz.org) - Cultural analysis | |
| - [License Terms](https://hitdarderut-haaretz.org/license): CC-BY-4.0 | |
| - **Multicultural Project**: [https://degeneration-of-nation.org](https://degeneration-of-nation.org) - Critical philosophical commentary | |
| - [License Terms](https://degeneration-of-nation.org/license): CC-BY-4.0 | |
| --- | |
| *As a non-profit organization, we're committed to advancing humane AI through high-quality, clean cultural datasets with perfect multilingual alignment.* |