🚀 Enhanced RAG System Setup Guide

This guide will help you set up the Enhanced RAG (Retrieval-Augmented Generation) system for saving high-confidence news to Google Drive.

📋 Overview

The Enhanced RAG system automatically saves news with 95%+ confidence from Gemini analysis to Google Drive, allowing you to:

  • View all high-confidence news entries
  • Use them for better RAG analysis
  • Track user input patterns
  • Build a comprehensive knowledge base

🔧 Setup Steps

Step 1: Google Cloud Console Setup

  1. Go to Google Cloud Console

  2. Create or Select Project

    • Create a new project or select existing one
    • Note your project ID
  3. Enable Google Drive API

    • Go to "APIs & Services" → "Library"
    • Search for "Google Drive API"
    • Click "Enable"
  4. Create OAuth 2.0 Credentials

    • Go to "APIs & Services" → "Credentials"
    • Click "Create Credentials" → "OAuth client ID"
    • Choose "Desktop app" as the application type
    • Download the JSON file
    • Rename it to credentials.json
    • Place it in your project directory
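
Before running the setup script, it can help to confirm that the downloaded file really is a Desktop-type OAuth client file. The sketch below assumes the usual layout of such a file: a top-level "installed" object holding client_id, client_secret and token_uri. The helper name check_client_secrets is made up for illustration and is not part of this project.

```python
import json
from pathlib import Path

# Fields this sketch expects inside the "installed" section (an assumption
# based on the usual layout of a Desktop-type client file).
REQUIRED_KEYS = ("client_id", "client_secret", "token_uri")


def check_client_secrets(path="credentials.json"):
    """Return a list of problems with an OAuth client file (empty if none)."""
    file = Path(path)
    if not file.is_file():
        return [f"{path} not found"]
    try:
        data = json.loads(file.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{path} is not valid JSON: {exc}"]
    # Desktop clients are stored under "installed"; web clients use "web".
    section = data.get("installed") if isinstance(data, dict) else None
    if not isinstance(section, dict):
        return ['expected a top-level "installed" object (Desktop client type)']
    return [f"missing field: {key}" for key in REQUIRED_KEYS if not section.get(key)]


if __name__ == "__main__":
    problems = check_client_secrets()
    print("\n".join(problems) if problems else "credentials.json looks usable")
```

An empty result only means the file has the expected shape. It does not prove that the Drive API is enabled or that the client is still valid.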

Step 2: Local Setup

  1. Run the Setup Script

    python setup_google_drive_rag.py
    
  2. Follow the Authentication Process

    • A browser window will open
    • Log in with your Google account
    • Grant permissions for Google Drive access
    • The script will save your credentials
  3. Verify Setup

    • The script will test Google Drive access
    • It will create the RAG folder and file
    • You'll see confirmation messages

Step 3: Hugging Face Spaces Setup (Optional)

If you want to use this on Hugging Face Spaces:

  1. Get Refresh Token

    • Run the setup script locally first
    • Open the generated token.json file
    • Copy the refresh_token value
  2. Add Secrets to Hugging Face

    • Go to your Space settings
    • Add these secrets:
      • GOOGLE_CLIENT_ID: Your OAuth client ID
      • GOOGLE_CLIENT_SECRET: Your OAuth client secret
      • GOOGLE_REFRESH_TOKEN: The refresh_token value copied from token.json
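
As a rough illustration of how token.json relates to the three secrets, the sketch below reads the token file and maps its fields to the secret names. It then rebuilds the same information from environment variables, which is how Spaces exposes secrets to a running app. It assumes token.json was written by the google-auth library's Credentials.to_json(), so that it contains client_id, client_secret and refresh_token. Both function names are hypothetical and not taken from this project.

```python
import json
import os

SECRET_NAMES = ("GOOGLE_CLIENT_ID", "GOOGLE_CLIENT_SECRET", "GOOGLE_REFRESH_TOKEN")


def secrets_from_token_file(path="token.json"):
    """Map the fields of a saved token file to the three Space secret names."""
    with open(path, encoding="utf-8") as handle:
        token = json.load(handle)
    return {
        "GOOGLE_CLIENT_ID": token["client_id"],
        "GOOGLE_CLIENT_SECRET": token["client_secret"],
        "GOOGLE_REFRESH_TOKEN": token["refresh_token"],
    }


def authorized_user_info(env=None):
    """Rebuild the credential fields from environment variables (e.g. on Spaces)."""
    env = os.environ if env is None else env
    missing = [name for name in SECRET_NAMES if not env.get(name)]
    if missing:
        raise RuntimeError("missing secrets: " + ", ".join(missing))
    return {
        "client_id": env["GOOGLE_CLIENT_ID"],
        "client_secret": env["GOOGLE_CLIENT_SECRET"],
        "refresh_token": env["GOOGLE_REFRESH_TOKEN"],
    }
```

The dictionary returned by authorized_user_info has the shape that google.oauth2.credentials.Credentials.from_authorized_user_info accepts. Whether rag_news_manager.py builds its credentials this way is not stated in this guide.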

📁 File Structure

After setup, you'll have:

your-project/
├── credentials.json          # Google OAuth credentials
├── token.json                # Saved authentication token
├── rag_news_manager.py       # Main RAG system
├── setup_google_drive_rag.py # Setup script
├── view_rag_news.py          # News viewer
└── app.py                    # Your main app (updated)

🔍 Google Drive Structure

The system creates:

Google Drive/
└── Vietnamese_Fake_News_RAG/
    └── high_confidence_news.json

📊 How It Works

Automatic Saving

  • When users input news, the system analyzes it
  • If Gemini confidence > 95%, it's automatically saved to Google Drive
  • Each entry includes:
    • News text
    • Prediction (REAL/FAKE)
    • Confidence score
    • Gemini analysis
    • Search results
    • Timestamp
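
The saving rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in rag_news_manager.py: the function names should_save and build_entry are hypothetical, and the strict "greater than" comparison simply follows the "> 95%" wording above (the real code may treat exactly 95% as a match).

```python
from datetime import datetime

RAG_CONFIDENCE_THRESHOLD = 0.95  # mirrors the setting in app.py


def should_save(gemini_confidence, threshold=RAG_CONFIDENCE_THRESHOLD):
    """Strict comparison, following the "> 95%" wording above."""
    return gemini_confidence > threshold


def build_entry(news_text, prediction, gemini_confidence,
                gemini_analysis, search_results):
    """Collect the fields listed above into one record."""
    return {
        "news_text": news_text,
        "prediction": prediction,  # "REAL" or "FAKE"
        "gemini_confidence": gemini_confidence,
        "gemini_analysis": gemini_analysis,
        "search_results": search_results,
        "created_at": datetime.now().isoformat(timespec="seconds"),
    }
```

A caller would check should_save(confidence) first and only then build the entry and upload it.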

Data Format

{
  "metadata": {
    "created_at": "2024-01-01T00:00:00",
    "description": "High-confidence Vietnamese fake news for RAG",
    "threshold": 0.95,
    "total_entries": 10,
    "last_updated": "2024-01-01T12:00:00"
  },
  "news_entries": [
    {
      "id": 1,
      "content_hash": "abc123...",
      "news_text": "Argentina vô địch World Cup 2022...",
      "prediction": "REAL",
      "gemini_confidence": 0.98,
      "gemini_analysis": "1. KẾT LUẬN: THẬT...",
      "distilbert_confidence": 0.85,
      "search_results": [...],
      "created_at": "2024-01-01T10:00:00",
      "source": "user_input",
      "verified": true
    }
  ]
}
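
To make the role of content_hash, id and the metadata counters concrete, here is a small sketch of appending an entry to a structure of this shape while skipping duplicates. The hashing scheme (SHA-256 over lowercased, whitespace-normalized text) and the function names are assumptions made for illustration. This guide does not say how the project actually computes the hash.

```python
import hashlib
from datetime import datetime


def content_hash(news_text):
    # Assumption: SHA-256 over lowercased, whitespace-normalized text.
    normalized = " ".join(news_text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def add_entry(store, entry):
    """Append entry to store["news_entries"] unless the same text is stored.

    Returns True if the entry was added, False if it was a duplicate.
    """
    digest = content_hash(entry["news_text"])
    entries = store["news_entries"]
    if any(existing.get("content_hash") == digest for existing in entries):
        return False
    next_id = max((e["id"] for e in entries), default=0) + 1
    entries.append({**entry, "id": next_id, "content_hash": digest})
    # Keep the metadata block in step with the entry list.
    store["metadata"]["total_entries"] = len(entries)
    store["metadata"]["last_updated"] = datetime.now().isoformat(timespec="seconds")
    return True
```

Here store is the parsed content of high_confidence_news.json. Writing the result back to Google Drive is left out of the sketch.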

🖥️ Viewing Saved News

Option 1: Command Line Viewer

python view_rag_news.py

Features:

  • View all saved news
  • Filter by prediction (REAL/FAKE)
  • Search through entries
  • View statistics
  • Open Google Drive directly
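
The filter and search features boil down to simple list operations over the saved entries. The sketch below shows one way they could work on the news_entries list. It is an illustration with hypothetical function names, not the actual code of view_rag_news.py.

```python
def filter_by_prediction(entries, prediction):
    """Keep entries whose prediction matches (case-insensitive), e.g. "FAKE"."""
    wanted = prediction.upper()
    return [e for e in entries if e.get("prediction", "").upper() == wanted]


def search_entries(entries, query):
    """Keep entries whose news text or Gemini analysis contains the query."""
    needle = query.casefold()
    return [
        e for e in entries
        if needle in e.get("news_text", "").casefold()
        or needle in e.get("gemini_analysis", "").casefold()
    ]
```

For example, search_entries(entries, "COVID-19") returns every entry that mentions the term in either field, regardless of letter case.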

Option 2: Google Drive Web Interface

  • Go to your Google Drive
  • Find the "Vietnamese_Fake_News_RAG" folder
  • Open "high_confidence_news.json"
  • View the raw JSON data

Option 3: Direct Google Drive Links

The system provides direct links:

  • Folder: https://drive.google.com/drive/folders/{folder_id}
  • File: https://drive.google.com/file/d/{file_id}/view

🔧 Configuration

In app.py

# Enhanced RAG System Configuration
ENABLE_ENHANCED_RAG = True  # Enable/disable the system
RAG_CONFIDENCE_THRESHOLD = 0.95  # 95% threshold for saving

Thresholds

  • 95% (default): Only very high-confidence predictions are saved
  • 90%: More entries are saved, and quality stays high
  • 85%: Even more entries are saved, but with more uncertainty
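
The trade-off between these thresholds is easy to see on a handful of scores. The snippet below counts how many confidence values would pass each threshold. The sample scores are invented for illustration, and the strict comparison follows the "> 95%" wording used earlier in this guide.

```python
def saved_counts(confidences, thresholds=(0.95, 0.90, 0.85)):
    """Count how many confidence scores exceed each threshold."""
    return {t: sum(1 for c in confidences if c > t) for t in thresholds}


# Made-up Gemini confidence scores for six analyzed news items.
sample = [0.99, 0.97, 0.93, 0.91, 0.88, 0.80]
print(saved_counts(sample))
# {0.95: 2, 0.9: 4, 0.85: 5}
```

Lowering RAG_CONFIDENCE_THRESHOLD from 0.95 to 0.85 would save five of these six items instead of two.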

📈 Statistics

The system tracks:

  • Total entries saved
  • Real vs Fake news count
  • Average confidence score
  • Latest entry timestamp
  • Google Drive folder/file IDs
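
Most of these figures can be derived directly from the saved JSON. The sketch below computes them from a parsed high_confidence_news.json structure. The function name summarize is hypothetical, and the folder/file IDs are omitted because they come from Google Drive rather than from the file contents.

```python
def summarize(store):
    """Compute basic statistics from the parsed JSON store."""
    entries = store.get("news_entries", [])
    confidences = [e["gemini_confidence"] for e in entries]
    return {
        "total_entries": len(entries),
        "real_count": sum(1 for e in entries if e["prediction"] == "REAL"),
        "fake_count": sum(1 for e in entries if e["prediction"] == "FAKE"),
        # None when there are no entries, to avoid dividing by zero.
        "average_confidence": (
            sum(confidences) / len(confidences) if confidences else None
        ),
        # ISO 8601 timestamps in the same format sort correctly as strings.
        "latest_entry": max((e["created_at"] for e in entries), default=None),
    }
```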

🚨 Troubleshooting

Common Issues

  1. "credentials.json not found"

    • Make sure you downloaded the OAuth credentials
    • Rename the file to exactly credentials.json
    • Place it in the project directory
  2. "Authentication failed"

    • Check your internet connection
    • Make sure Google Drive API is enabled
    • Try running the setup script again
  3. "Permission denied"

    • Make sure you granted all required permissions
    • Check if your Google account has Drive access
  4. "RAG system not available"

    • Check if all dependencies are installed
    • Make sure rag_news_manager.py is in the same directory
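
Several of the checks above can be automated with a small preflight script. The sketch below looks for the expected files and tests whether the required modules can be imported. The default module names assume the project depends on google-api-python-client (imported as googleapiclient) and google-auth-oauthlib (imported as google_auth_oauthlib), which this guide does not confirm. The function name preflight is made up.

```python
import importlib.util
from pathlib import Path

# Assumption: import names of the Google client packages the project uses.
REQUIRED_MODULES = ("googleapiclient", "google_auth_oauthlib")
REQUIRED_FILES = ("credentials.json", "rag_news_manager.py")


def preflight(project_dir=".", modules=REQUIRED_MODULES, files=REQUIRED_FILES):
    """Return a list of setup problems found in project_dir (empty if none)."""
    problems = []
    for name in files:
        if not (Path(project_dir) / name).is_file():
            problems.append(f"missing file: {name}")
    for module in modules:
        try:
            found = importlib.util.find_spec(module) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            problems.append(f"module not importable: {module}")
    return problems


if __name__ == "__main__":
    issues = preflight()
    print("\n".join(issues) if issues else "No setup problems found")
```

Run it from the project directory. Any line it prints points to one of the issues listed above.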

Debug Mode

Add this to see detailed logs:

import logging
logging.basicConfig(level=logging.DEBUG)

🔄 Integration with Existing System

The Enhanced RAG system works alongside your existing knowledge base:

  • Local Knowledge Base: Still works as before
  • Enhanced RAG: Additional Google Drive storage
  • Both systems: Can be used together for comprehensive RAG

📱 Usage Examples

View Recent News

python view_rag_news.py
# Select option 2: View Recent News

Search for Specific Topics

python view_rag_news.py
# Select option 6: Search News
# Enter: "COVID-19"

Check Statistics

python view_rag_news.py
# Select option 1: View Statistics

🎯 Benefits

  1. Automatic Collection: No manual intervention needed
  2. High Quality: Only 95%+ confidence entries saved
  3. Easy Access: View through multiple interfaces
  4. Scalable: Google Drive handles large datasets
  5. Searchable: Find specific news entries quickly
  6. Analytics: Track patterns and statistics

🔐 Security

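  • Access to Google Drive uses OAuth 2.0 authentication
  • credentials.json and token.json stay in your project directory; keep them out of version control and public repositories
  • Only the Google account you authorized can access the saved data
  • On Hugging Face Spaces, the client secret and refresh token are stored as Space secrets rather than in the code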

📞 Support

If you encounter issues:

  1. Check the troubleshooting section
  2. Verify all setup steps completed
  3. Check Google Cloud Console for API quotas
  4. Ensure proper file permissions

🎉 Congratulations! You now have a comprehensive RAG system that automatically saves high-confidence news to Google Drive for analysis and viewing!