---
license: mit
language:
- ar
base_model:
- Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2
pipeline_tag: sentence-similarity
---

# Bayaan - Advanced Quran Tafseer Search with AI Vector Models

[![Python](https://img.shields.io/badge/Python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![Flask](https://img.shields.io/badge/Flask-2.0+-green.svg)](https://flask.palletsprojects.com/)
[![License](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![API](https://img.shields.io/badge/API-REST-orange.svg)](http://localhost:5001)
[![Dataset](https://img.shields.io/badge/Dataset-HuggingFace-orange.svg)](https://huggingface.co/datasets/MohamedRashad/Quran-Tafseer)

## 📖 Overview

**Bayaan** is an AI-powered Quran Tafseer search system that uses multiple machine learning models to find relevant Islamic interpretations from 219,000 records across 84 scholarly books. It automatically picks the best AI approach for your query - simple keywords use TF-IDF, contextual searches use Word2Vec/BERT, making it like having an intelligent Islamic library at your fingertips.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/67a6b9cf049b2a1bd3bc97f8/n3rEB2Q3_eXB4AOYb96Wl.png)

## 🛠️ Tech Stack

- **Flask** - REST API framework
- **scikit-learn** - TF-IDF, cosine similarity
- **SentenceTransformers** - Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2 for semantic search
- **BERT/Word2Vec** - Semantic embeddings
- **pandas/numpy** - Data processing
- **Dataset**: [219K Tafseer records](https://huggingface.co/datasets/MohamedRashad/Quran-Tafseer) from Altafsir.com

## 🗃️ The Dataset: A Treasure Trove of Islamic Knowledge

Bayaan is powered by the comprehensive **Quran-Tafseer dataset** from [Hugging Face](https://huggingface.co/datasets/MohamedRashad/Quran-Tafseer), created by MohamedRashad.
This dataset is a goldmine for anyone interested in Islamic studies, natural language processing, or understanding the Quran's deeper meanings.

### **Dataset Highlights:**

- 📚 **84 Different Tafseer Books** - From classical to contemporary scholars
- 📊 **219,000 Rows** of rich interpretative content
- 🌍 **Source**: All data collected from [Altafsir.com](https://altafsir.com)
- 🔤 **Language**: Arabic (with English query support through AI)

### **What's Inside:**

| Column | Description | Example |
|--------|-------------|---------|
| `surah_name` | Name of the Quran chapter | "Al-Fatiha", "Al-Baqarah" |
| `revelation_type` | Where the Surah was revealed | "Meccan" or "Medinan" |
| `ayah` | The specific Quranic verse | "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ" |
| `tafsir_book` | Source of the interpretation | "Ibn Kathir", "Al-Jalalayn" |
| `tafsir_content` | The actual scholarly commentary | Detailed Arabic interpretation |

## 🤖 How Bayaan Makes It Smart

Bayaan doesn't just do keyword matching - it understands context, meaning, and relationships between concepts using multiple AI approaches:

## 🌟 Key Features

### 🤖 **Multi-Model AI Search**

- **TF-IDF Vectorization**: Optimized for short queries (≤2 words)
- **Word2Vec Embeddings**: Perfect for medium-length queries (≤10 words)
- **BERT Transformers**: Advanced semantic understanding for long queries (>10 words)
- **SentenceTransformers**: State-of-the-art Arabic language model for advanced search

### 🎯 **Intelligent Query Routing**

- **Hybrid Search Algorithm**: Automatically selects the best AI model based on query characteristics
- **Fallback Mechanisms**: Ensures reliable results even when specific models are unavailable
- **Contextual Understanding**: Semantic similarity matching beyond keyword matching

### 🔍 **Advanced Search Capabilities**

- **Semantic Search**: Find conceptually similar content, not just keyword matches
- **Multi-field Search**: Search across
Ayahs, Tafseer content, Surah names, and more
- **Similarity Scoring**: Ranked results with confidence scores
- **Flexible Result Limits**: Configurable result counts (1-50 results)

## 🏗️ System Architecture

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Query Input   │───▶│  Hybrid Router   │───▶│   AI Models     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │                       │
                                ▼                       ▼
                       ┌──────────────────┐    ┌─────────────────────┐
                       │ Query Analysis   │    │ • TF-IDF Matrix     │
                       │ - Length Check   │    │ • Word2Vec Vectors  │
                       │ - Complexity     │    │ • BERT Embeddings   │
                       │ - Language       │    │ • SentenceTransform │
                       └──────────────────┘    └─────────────────────┘
                                │                       │
                                ▼                       ▼
                       ┌──────────────────┐    ┌─────────────────────┐
                       │ Result Ranking   │◀───│ Similarity Engine   │
                       │ - Cosine Sim     │    │ - Vector Matching   │
                       │ - Score Fusion   │    │ - Context Analysis  │
                       └──────────────────┘    └─────────────────────┘
```

## 📊 Required Data Files

| File | Description | Required | Size |
|------|-------------|----------|------|
| `tafseer.csv` | Main Tafseer dataset | ✅ Yes | Variable |
| `w2v_vectors.npy` | Pre-computed Word2Vec embeddings | ⚠️ Optional | ~100MB |
| `bert_vectors.npy` | Pre-computed BERT embeddings | ⚠️ Optional | ~200MB |
| `tafsir_embeddings.npy` | SentenceTransformer embeddings | ⚠️ Optional | ~300MB |
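
The word-count thresholds listed under Multi-Model AI Search, combined with the optional files in the table above, suggest a simple selection rule with fallbacks. The block below is a minimal sketch of that idea, not Bayaan's actual implementation: the names `available_models` and `choose_model`, the fallback order, and the assumption that TF-IDF is always available because it is built from `tafseer.csv` are all illustrative. Only the thresholds (≤2, ≤10, >10 words) and the file names come from this README.

```python
from pathlib import Path

# Optional embedding files from the table above, keyed by the model that is
# assumed to need them. TF-IDF has no entry: this sketch assumes it is built
# from tafseer.csv at startup and is therefore always available.
OPTIONAL_FILES = {
    "word2vec": "w2v_vectors.npy",
    "bert": "bert_vectors.npy",
}


def available_models(data_dir="."):
    """Return the set of models whose embedding files exist (hypothetical helper)."""
    found = {"tfidf"}
    for model, filename in OPTIONAL_FILES.items():
        if (Path(data_dir) / filename).exists():
            found.add(model)
    return found


def choose_model(query, available):
    """Pick a model from the query's word count, falling back to simpler ones."""
    n_words = len(query.split())
    if n_words <= 2:
        preferred = ["tfidf"]
    elif n_words <= 10:
        preferred = ["word2vec", "tfidf"]
    else:
        preferred = ["bert", "word2vec", "tfidf"]
    # Assumed fallback order: try the preferred model first, then simpler ones.
    for model in preferred:
        if model in available:
            return model
    return "tfidf"


if __name__ == "__main__":
    models = available_models(".")
    print(choose_model("الصبر", models))
    print(choose_model("what do scholars say about patience in hardship", models))
```

The SentenceTransformer model is left out of the length-based routing here because the README describes it separately as the "advanced search" path; a real router might expose it as an explicit option instead.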