# Enhanced RAG System Setup Guide

This guide walks through setting up the Enhanced RAG (Retrieval-Augmented Generation) system, which saves high-confidence news entries to Google Drive.

## Overview

The Enhanced RAG system automatically saves news with **95%+ confidence** from Gemini analysis to Google Drive, allowing you to:

- View all high-confidence news entries
- Use them for better RAG analysis
- Track user input patterns
- Build a comprehensive knowledge base
## Setup Steps

### Step 1: Google Cloud Console Setup

1. **Go to Google Cloud Console**
   - Visit: https://console.cloud.google.com/
2. **Create or Select Project**
   - Create a new project or select an existing one
   - Note your project ID
3. **Enable Google Drive API**
   - Go to "APIs & Services" → "Library"
   - Search for "Google Drive API"
   - Click "Enable"
4. **Create OAuth 2.0 Credentials**
   - Go to "APIs & Services" → "Credentials"
   - Click "Create Credentials" → "OAuth client ID"
   - Choose "Desktop app" as the application type
   - Download the JSON file
   - Rename it to `credentials.json`
   - Place it in your project directory
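Before running the setup script, it can help to confirm that the downloaded file is a Desktop-type client. The sketch below is not part of the project. `check_client_secrets` is a hypothetical helper that relies only on the layout Google uses for Desktop clients, where the fields sit under a top-level `installed` key.

```python
import json
from pathlib import Path

# Fields a Desktop-type OAuth client file is expected to carry.
REQUIRED_KEYS = ("client_id", "client_secret", "auth_uri", "token_uri")


def check_client_secrets(path="credentials.json"):
    """Return a list of problems found in an OAuth client secrets file."""
    file = Path(path)
    if not file.is_file():
        return [f"{path} not found"]
    try:
        data = json.loads(file.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{path} is not valid JSON: {exc}"]
    # Desktop clients are stored under "installed"; a "web" key means
    # a different application type was selected in the console.
    section = data.get("installed") if isinstance(data, dict) else None
    if not isinstance(section, dict):
        return ['missing "installed" section (was the Desktop type chosen?)']
    return [f"missing field: {key}" for key in REQUIRED_KEYS if key not in section]
```

An empty list means the file looks usable; anything else points to the step that needs repeating.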
### Step 2: Local Setup

1. **Run the Setup Script**
   ```bash
   python setup_google_drive_rag.py
   ```
2. **Follow the Authentication Process**
   - A browser window will open
   - Log in with your Google account
   - Grant permissions for Google Drive access
   - The script will save your credentials
3. **Verify Setup**
   - The script will test Google Drive access
   - It will create the RAG folder and file
   - You'll see confirmation messages
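For readers curious about what the authentication step involves, here is a minimal sketch of the usual installed-app OAuth flow with `google-auth-oauthlib`. It is an illustration, not the contents of `setup_google_drive_rag.py`. The `authenticate` function and the `drive.file` scope are assumptions, and the real script may request a different scope.

```python
from pathlib import Path

# Assumption: a narrow scope limited to files the app itself creates.
SCOPES = ["https://www.googleapis.com/auth/drive.file"]


def authenticate(credentials_path="credentials.json", token_path="token.json"):
    """Run the browser-based OAuth flow once and cache the resulting token."""
    # Imported lazily so this module loads before the libraries are installed
    # (pip install google-auth google-auth-oauthlib).
    from google.auth.transport.requests import Request
    from google.oauth2.credentials import Credentials
    from google_auth_oauthlib.flow import InstalledAppFlow

    creds = None
    if Path(token_path).exists():
        creds = Credentials.from_authorized_user_file(token_path, SCOPES)
    if creds and creds.expired and creds.refresh_token:
        creds.refresh(Request())  # silent refresh, no browser needed
    elif not creds or not creds.valid:
        flow = InstalledAppFlow.from_client_secrets_file(credentials_path, SCOPES)
        creds = flow.run_local_server(port=0)  # opens the browser window
    # Persist the token so later runs can skip the browser step.
    Path(token_path).write_text(creds.to_json(), encoding="utf-8")
    return creds
```

The cached `token.json` is what lets later runs, and the Hugging Face deployment, authenticate without a browser.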
### Step 3: Hugging Face Spaces Setup (Optional)

If you want to run this on Hugging Face Spaces:

1. **Get the Refresh Token**
   - Run the setup script locally first
   - Open the `token.json` file
   - Copy the `refresh_token` value
2. **Add Secrets to Hugging Face**
   - Go to your Space settings
   - Add these secrets:
     - `GOOGLE_CLIENT_ID`: Your OAuth client ID
     - `GOOGLE_CLIENT_SECRET`: Your OAuth client secret
     - `GOOGLE_REFRESH_TOKEN`: The `refresh_token` value from your local `token.json`
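The two helpers below sketch how the refresh token can be read locally and how the three secrets might be reassembled inside a Space. Both function names are hypothetical, and how the project itself consumes the secrets is not documented here. The returned dict has the shape accepted by `google.oauth2.credentials.Credentials.from_authorized_user_info`.

```python
import json
import os

# Maps credential fields to the secret names used in the Space settings.
SECRET_NAMES = {
    "client_id": "GOOGLE_CLIENT_ID",
    "client_secret": "GOOGLE_CLIENT_SECRET",
    "refresh_token": "GOOGLE_REFRESH_TOKEN",
}


def read_refresh_token(token_path="token.json"):
    """Return the refresh_token stored in a locally generated token file."""
    with open(token_path, encoding="utf-8") as fh:
        return json.load(fh)["refresh_token"]


def user_info_from_env(env=os.environ):
    """Build an authorized-user dict from the three Space secrets."""
    missing = [name for name in SECRET_NAMES.values() if not env.get(name)]
    if missing:
        raise RuntimeError("missing secrets: " + ", ".join(missing))
    info = {key: env[name] for key, name in SECRET_NAMES.items()}
    info["token_uri"] = "https://oauth2.googleapis.com/token"
    return info
```

Failing early with the names of the missing secrets makes a misconfigured Space easier to diagnose than a later authentication error.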
## File Structure

After setup, you'll have:

```
your-project/
├── credentials.json           # Google OAuth credentials
├── token.json                 # Saved authentication token
├── rag_news_manager.py        # Main RAG system
├── setup_google_drive_rag.py  # Setup script
├── view_rag_news.py           # News viewer
└── app.py                     # Your main app (updated)
```
## Google Drive Structure

The system creates:

```
Google Drive/
└── Vietnamese_Fake_News_RAG/
    └── high_confidence_news.json
```
## How It Works

### Automatic Saving

- When a user submits news, the system analyzes it
- If the Gemini confidence is above 95%, the entry is automatically saved to Google Drive
- Each entry includes:
  - News text
  - Prediction (REAL/FAKE)
  - Confidence score
  - Gemini analysis
  - Search results
  - Timestamp
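The save decision and the entry fields listed above can be sketched as follows. This is not the code in `rag_news_manager.py`. The function names, the SHA-256 hash over whitespace-normalized text, and the inclusive `>=` comparison are all assumptions, since the guide says both "95%+" and "> 95%".

```python
import hashlib
from datetime import datetime, timezone

RAG_CONFIDENCE_THRESHOLD = 0.95


def should_save(gemini_confidence, threshold=RAG_CONFIDENCE_THRESHOLD):
    """True when the confidence reaches the threshold (inclusive by assumption)."""
    return gemini_confidence >= threshold


def build_entry(news_text, prediction, gemini_confidence, gemini_analysis, search_results):
    """Assemble one entry with the fields described in this guide."""
    # Normalize whitespace and case so trivially different copies hash alike.
    normalized = " ".join(news_text.split()).lower()
    return {
        "content_hash": hashlib.sha256(normalized.encode("utf-8")).hexdigest(),
        "news_text": news_text,
        "prediction": prediction,
        "gemini_confidence": gemini_confidence,
        "gemini_analysis": gemini_analysis,
        "search_results": search_results,
        "created_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "source": "user_input",
    }
```

A caller would check `should_save(...)` first and only build and upload the entry when it returns `True`.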
### Data Format

```json
{
  "metadata": {
    "created_at": "2024-01-01T00:00:00",
    "description": "High-confidence Vietnamese fake news for RAG",
    "threshold": 0.95,
    "total_entries": 10,
    "last_updated": "2024-01-01T12:00:00"
  },
  "news_entries": [
    {
      "id": 1,
      "content_hash": "abc123...",
      "news_text": "Argentina vô địch World Cup 2022...",
      "prediction": "REAL",
      "gemini_confidence": 0.98,
      "gemini_analysis": "1. KẾT LUẬN: THẬT...",
      "distilbert_confidence": 0.85,
      "search_results": [...],
      "created_at": "2024-01-01T10:00:00",
      "source": "user_input",
      "verified": true
    }
  ]
}
```
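Given this layout, appending an entry involves three things: skipping duplicates, assigning the next `id`, and refreshing the metadata counters. The sketch below shows one way to do that on the in-memory document. `new_store` and `add_entry` are hypothetical names, and deduplicating on `content_hash` is an assumption about what that field is for.

```python
from datetime import datetime, timezone


def _now():
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


def new_store(threshold=0.95):
    """Create an empty document with the metadata block shown above."""
    now = _now()
    return {
        "metadata": {
            "created_at": now,
            "description": "High-confidence Vietnamese fake news for RAG",
            "threshold": threshold,
            "total_entries": 0,
            "last_updated": now,
        },
        "news_entries": [],
    }


def add_entry(store, entry):
    """Append entry unless its content_hash already exists; return True if added."""
    entries = store["news_entries"]
    if any(e["content_hash"] == entry["content_hash"] for e in entries):
        return False
    next_id = max((e["id"] for e in entries), default=0) + 1
    entries.append({**entry, "id": next_id})
    store["metadata"]["total_entries"] = len(entries)
    store["metadata"]["last_updated"] = _now()
    return True
```

The updated dict can then be serialized with `json.dumps(store, ensure_ascii=False)` so the Vietnamese text stays readable in the uploaded file.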
## Viewing Saved News

### Option 1: Command Line Viewer

```bash
python view_rag_news.py
```

Features:

- View all saved news
- Filter by prediction (REAL/FAKE)
- Search through entries
- View statistics
- Open Google Drive directly
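The filter and search features amount to simple list operations over `news_entries`. The sketch below is not the code in `view_rag_news.py`. The function names are hypothetical, and searching both `news_text` and `gemini_analysis` is an assumption.

```python
def filter_by_prediction(entries, prediction):
    """Keep entries whose prediction matches REAL or FAKE, ignoring case."""
    wanted = prediction.strip().upper()
    return [e for e in entries if e.get("prediction", "").upper() == wanted]


def search_entries(entries, query):
    """Case-insensitive substring search over the text and the analysis."""
    needle = query.casefold()
    return [
        e for e in entries
        if needle in e.get("news_text", "").casefold()
        or needle in e.get("gemini_analysis", "").casefold()
    ]
```

`str.casefold` is used instead of `lower` because it handles non-ASCII text such as Vietnamese more consistently.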
### Option 2: Google Drive Web Interface

- Go to your Google Drive
- Find the "Vietnamese_Fake_News_RAG" folder
- Open "high_confidence_news.json"
- View the raw JSON data

### Option 3: Direct Google Drive Links

The system provides direct links:

- Folder: `https://drive.google.com/drive/folders/{folder_id}`
- File: `https://drive.google.com/file/d/{file_id}/view`
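Filling in the two templates is a one-liner each. The helpers below are hypothetical and only show where the IDs go; the IDs themselves come from the Drive API responses or from the statistics output.

```python
from urllib.parse import quote


def folder_url(folder_id):
    """Browser link to a Drive folder."""
    return f"https://drive.google.com/drive/folders/{quote(folder_id, safe='')}"


def file_url(file_id):
    """Browser link to the viewer page of a Drive file."""
    return f"https://drive.google.com/file/d/{quote(file_id, safe='')}/view"
```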
## Configuration

### In app.py

```python
# Enhanced RAG System Configuration
ENABLE_ENHANCED_RAG = True  # Enable/disable the system
RAG_CONFIDENCE_THRESHOLD = 0.95  # 95% threshold for saving
```

### Thresholds

Lowering `RAG_CONFIDENCE_THRESHOLD` saves more entries at the cost of more uncertain ones:

- **95%** (`0.95`): Only very high-confidence predictions are saved
- **90%** (`0.90`): More entries are saved, still of high quality
- **85%** (`0.85`): Even more entries, but with some uncertainty
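To see the trade-off concretely, the snippet below counts how many of a handful of confidence scores would be kept at each threshold. The scores are made up for illustration and do not come from the system, and the inclusive comparison is an assumption.

```python
def count_saved(confidences, threshold):
    """Number of scores that would be saved at the given threshold."""
    return sum(1 for c in confidences if c >= threshold)


# Made-up Gemini confidence scores, for illustration only.
sample = [0.99, 0.97, 0.93, 0.91, 0.88, 0.86, 0.80]

for threshold in (0.95, 0.90, 0.85):
    kept = count_saved(sample, threshold)
    print(f"threshold {threshold:.2f}: {kept} of {len(sample)} saved")
```

With these sample values, 2, 4 and 6 of the 7 scores pass at 0.95, 0.90 and 0.85 respectively.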
## Statistics

The system tracks:

- Total entries saved
- Real vs Fake news count
- Average confidence score
- Latest entry timestamp
- Google Drive folder/file IDs
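Most of these figures can be derived directly from the stored JSON document. The sketch below is not the project's implementation. `compute_stats` is a hypothetical helper, and it assumes the field names from the Data Format section and ISO 8601 timestamps in one consistent format, so that string comparison orders them correctly.

```python
def compute_stats(store):
    """Summarize a loaded high_confidence_news.json document."""
    entries = store.get("news_entries", [])
    confidences = [e["gemini_confidence"] for e in entries]
    return {
        "total": len(entries),
        "real": sum(e["prediction"] == "REAL" for e in entries),
        "fake": sum(e["prediction"] == "FAKE" for e in entries),
        # None instead of a division by zero when nothing is stored yet.
        "average_confidence": sum(confidences) / len(confidences) if confidences else None,
        "latest": max((e["created_at"] for e in entries), default=None),
    }
```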
## Troubleshooting

### Common Issues

1. **"credentials.json not found"**
   - Make sure you downloaded the OAuth credentials
   - Rename the file to exactly `credentials.json`
   - Place it in the project directory
2. **"Authentication failed"**
   - Check your internet connection
   - Make sure Google Drive API is enabled
   - Try running the setup script again
3. **"Permission denied"**
   - Make sure you granted all required permissions
   - Check if your Google account has Drive access
4. **"RAG system not available"**
   - Check if all dependencies are installed
   - Make sure `rag_news_manager.py` is in the same directory
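Several of these checks can be automated with a small preflight function. The sketch below is a hypothetical helper, not part of the project. The module names in `REQUIRED_MODULES` are assumptions about the dependencies, based on the packages normally used for Drive access.

```python
import importlib.util
from pathlib import Path

REQUIRED_FILES = ("credentials.json", "rag_news_manager.py")
# Assumption: import names of the Google client libraries the project needs.
REQUIRED_MODULES = ("googleapiclient", "google_auth_oauthlib")


def preflight(project_dir="."):
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    root = Path(project_dir)
    for name in REQUIRED_FILES:
        if not (root / name).is_file():
            problems.append(f"missing file: {name}")
    for module in REQUIRED_MODULES:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            problems.append(f"missing dependency: {module}")
    return problems
```

Printing the result of `preflight()` from the project directory narrows down which of the issues above applies.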
### Debug Mode

Add this to see detailed logs:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
## Integration with Existing System

The Enhanced RAG system works alongside your existing knowledge base:

- **Local Knowledge Base**: Still works as before
- **Enhanced RAG**: Additional Google Drive storage
- **Both systems**: Can be used together for comprehensive RAG
## Usage Examples

### View Recent News

```bash
python view_rag_news.py
# Select option 2: View Recent News
```

### Search for Specific Topics

```bash
python view_rag_news.py
# Select option 6: Search News
# Enter: "COVID-19"
```

### Check Statistics

```bash
python view_rag_news.py
# Select option 1: View Statistics
```
## Benefits

1. **Automatic Collection**: No manual intervention needed
2. **High Quality**: Only entries with 95%+ confidence are saved
3. **Easy Access**: View through multiple interfaces
4. **Scalable**: Google Drive handles large datasets
5. **Searchable**: Find specific news entries quickly
6. **Analytics**: Track patterns and statistics
## Security

- Authentication uses OAuth 2.0
- Credentials are stored locally in `credentials.json` and `token.json`; keep both files out of version control
- Only your Google account can access the saved data
- No sensitive data is exposed
## Support

If you encounter issues:

1. Check the troubleshooting section
2. Verify that all setup steps were completed
3. Check Google Cloud Console for API quotas
4. Ensure proper file permissions

---

**Congratulations!** You now have a comprehensive RAG system that automatically saves high-confidence news to Google Drive for analysis and viewing!