	---
	title: AI‑Culture‑Commons
	emoji: 📚
	colorFrom: indigo
	colorTo: gray
	sdk: static
	pinned: true
	thumbnail: >-
	  https://cdn-uploads.huggingface.co/production/uploads/678d64ee7967054e64970908/_DjOT2roXXm_sltcskzDZ.jpeg
	short_description: Multilingual cultural corpora for AI research
	license: cc-by-4.0
	---

	# AI‑Culture‑Commons
	AI‑Culture‑Commons curates [multilingual cultural corpora](https://huggingface.co/datasets/AI-Culture-Commons/ai-culture-html-multilingual) for language‑model research.

	We are a non-profit digital humanities project, advancing humane AI development through high-quality, rich cultural content. We strive to contribute to the cultural evolution of artificial intelligence by providing sophisticated training data that explores the intersection of technology, artificial intelligence, and human culture.

	[Our repositories](https://huggingface.co/datasets/AI-Culture-Commons/ai-culture-html-multilingual) give models deep philosophical and intellectual context and diverse connections between culture, philosophy, literature, and technology, particularly AI. The content is designed to help train more culturally aware and philosophically grounded AI models.

	## Our Datasets
	| Dataset | Size | Languages | Formats | License | Citation & Research |
	|---------|------|-----------|---------|---------|---------|
	| Multilingual Culture Corpus | 16M words | 12 aligned languages | HTML · CSV · DOLMA JSONL | CC‑BY‑4.0 | [![DOI](https://zenodo.org/badge/1021100370.svg)](https://doi.org/10.5281/zenodo.16001657) |
	| Project Websites Raw | 160 MB | 12 aligned languages | ZIP (HTML + images + CSS) | CC‑BY‑4.0 | [![DOI](https://zenodo.org/badge/1021100223.svg)](https://doi.org/10.5281/zenodo.16001641) |
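
As a rough illustration of working with the JSONL format listed above, the sketch below reads one JSON object per line and counts words per language. The record layout is an assumption: Dolma-style files generally carry `id` and `text` fields, but the `metadata.lang` key and the sample records here are hypothetical, so check the dataset card for the actual schema before relying on it.

```python
import io
import json


def iter_documents(lines):
    """Yield one parsed JSON object per non-empty line of a JSONL stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical records for illustration; they are not taken from the corpus,
# and the "metadata.lang" key is an assumed field name.
sample_jsonl = io.StringIO(
    '{"id": "example-en-001", "text": "Culture and machines.", "metadata": {"lang": "en"}}\n'
    '{"id": "example-fr-001", "text": "Culture et machines.", "metadata": {"lang": "fr"}}\n'
)

# Count whitespace-separated words per language.
word_counts = {}
for doc in iter_documents(sample_jsonl):
    lang = doc["metadata"]["lang"]
    word_counts[lang] = word_counts.get(lang, 0) + len(doc["text"].split())
```

To run this against a real file, pass an open file handle to `iter_documents` in place of the in-memory sample.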

	## Key Features
	- Perfect Alignment: All 12 languages contain identical content with the exact same complex HTML structure. All datasets include both plain text and HTML source files.
	- AI-Optimized: Designed specifically for training multilingual AI systems
	- Truly Open: [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/) - use freely, even commercially
	- Content Quality: Sophisticated content with intellectual depth, authored by a group of academics and writers
	- Completely Clean Data: No user comments, scraped texts, or unwanted content - pure, high-quality, carefully edited content
	- Full Documentation: [Complete pipeline](https://github.com/AI-Culture-Commons/ai-culture-pipeline) description and documentation in dataset cards. All datasets are versioned and archived for research reproducibility
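
The alignment claim in the list above can be spot-checked by comparing the tag sequence of the same page across languages. The following is a minimal sketch using only the Python standard library, not the project's own pipeline; the helper names and the sample pages are hypothetical.

```python
from html.parser import HTMLParser


class TagSkeleton(HTMLParser):
    """Collect the sequence of opening and closing tags, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")


def skeleton(html: str) -> list[str]:
    """Return the ordered list of tags found in an HTML string."""
    parser = TagSkeleton()
    parser.feed(html)
    parser.close()
    return parser.tags


def same_structure(pages: dict[str, str]) -> bool:
    """True if every language version has an identical tag sequence."""
    skeletons = [skeleton(html) for html in pages.values()]
    return all(s == skeletons[0] for s in skeletons[1:])


# Hypothetical sample pages, not taken from the corpus.
sample = {
    "en": "<article><h1>Title</h1><p>Hello <em>world</em>.</p></article>",
    "fr": "<article><h1>Titre</h1><p>Bonjour <em>le monde</em>.</p></article>",
}
aligned = same_structure(sample)
```

This compares tag names and order only; attributes and text content are deliberately ignored, so it is a structural check rather than a content check.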

	## Languages
	English, French, German, Spanish, Portuguese, Italian, Japanese, Russian, Korean, Mandarin, Hindi, Hebrew

	## Source Websites & Licensing
	Our corpora are carefully extracted from our websites:
	- Original Project: [https://hitdarderut-haaretz.org](https://hitdarderut-haaretz.org) - Cultural analysis
	  - [License Terms](https://hitdarderut-haaretz.org/license): CC-BY-4.0
	- Multicultural Project: [https://degeneration-of-nation.org](https://degeneration-of-nation.org) - Critical philosophical commentary
	  - [License Terms](https://degeneration-of-nation.org/license): CC-BY-4.0

	---
	As a non-profit organization, we're committed to advancing humane AI through high-quality, clean cultural datasets with perfect multilingual alignment.