---
title: MCP Video Agent
emoji: πŸŽ₯
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.0.1
app_file: app.py
pinned: false
license: mit
tags:
  - mcp
  - model-context-protocol
  - mcp-in-action-track-consumer
  - mcp-in-action-track-creative
  - video-analysis
  - gemini
  - multimodal
  - agents
  - rag
  - context-caching
---

# πŸŽ₯ MCP Video Agent

## πŸ† MCP 1st Birthday Hackathon Submission

**Track:** MCP in Action - Consumer & Creative Categories

**Tech Stack:** Gradio 6.0 + Gemini 2.5 Flash + ElevenLabs TTS + Modal + Context Caching


## 🎬 Demo Video

Watch the Video Agent in action - upload a video, ask questions, and receive voice responses!


## 🎯 What Makes This Special?

An intelligent video analysis agent that combines multimodal AI, voice interaction, and smart context caching to create a natural conversation experience with your videos.

### ⚑ Key Innovation: Smart Frame Caching

Unlike traditional video analysis that processes the entire video for every question, this agent uses Gemini's Context Caching to:

1. **First Query:** Uploads and deeply analyzes your video (5-10 seconds)
2. **Subsequent Queries:** Uses cached video context (2-3 seconds, 90% cost reduction!)
3. **Smart Reuse:** Cache persists for 1 hour - ask multiple questions without reprocessing

**Real-world Impact:** Turn a 10-minute video into a queryable knowledge base. Ask multiple questions in rapid succession, get instant answers with voice responses.
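The first-query / follow-up split can be sketched as a tiny in-memory registry mapping each video to its cache handle (all names here are illustrative, not the app's actual code):

```python
import time

CACHE_TTL_SECONDS = 3600      # caches expire after 1 hour
_cache_registry = {}          # video_id -> (cache_name, created_at)

def get_or_create_cache(video_id, create_fn):
    """Reuse an unexpired cache for this video, else create a new one."""
    entry = _cache_registry.get(video_id)
    if entry is not None:
        cache_name, created_at = entry
        if time.time() - created_at < CACHE_TTL_SECONDS:
            return cache_name, True   # cache hit: the fast, cheap path
    cache_name = create_fn(video_id)  # slow path: deep analysis + cache creation
    _cache_registry[video_id] = (cache_name, time.time())
    return cache_name, False
```

Calling this twice for the same video within the TTL returns the same cache name, with the second call reporting a hit.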


## πŸš€ Core Features

### 🎬 1. Multimodal Video Analysis

- Upload any video (MP4, max 100MB)
- Powered by Gemini 2.5 Flash - Google's latest multimodal model
- Understands visual content, actions, scenes, objects, and context

πŸ—£οΈ 2. Voice-First Interaction

  • Natural language responses via ElevenLabs TTS
  • Audio-first experience (hear answers immediately)
  • Full text transcripts available on demand
  • Supports conversational follow-up questions

### ⚑ 3. Intelligent Context Caching

- **First query:** Deep video analysis with full context extraction
- **Follow-up queries:** Lightning-fast responses using cached context
- **Cost optimization:** 90% reduction in API costs for repeated queries
- **Automatic management:** No manual cache setup required

### πŸ”Œ 4. MCP Server Integration

This application is designed to work as an MCP server for Claude Desktop and other MCP clients.

**Note:** The public MCP endpoint is currently disabled to prevent unauthorized API usage. If you need MCP access for evaluation, please contact the developer directly.

The primary way to use this application is through the HF Space Gradio interface.
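For reference, when a Gradio Space does expose its MCP server, it is typically reachable at an SSE endpoint under `/gradio_api/mcp/sse`, which Claude Desktop can connect to via the `mcp-remote` bridge. A hypothetical client configuration (the Space URL is a placeholder, and as noted above this endpoint is disabled here):

```json
{
  "mcpServers": {
    "video-agent": {
      "command": "npx",
      "args": ["mcp-remote", "https://<your-space>.hf.space/gradio_api/mcp/sse"]
    }
  }
}
```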

πŸ›‘οΈ 5. Fair Usage & Rate Limiting

  • Built-in rate limiting (10 requests/hour per user)
  • 100MB file size limit
  • Designed for responsible shared resource usage

## πŸŽ“ How It Works

### The Smart Caching Pipeline

```
1. Video Upload β†’ Modal Volume (Persistent Storage)
                  ↓
2. First Analysis β†’ Gemini 2.5 Flash (Deep Processing)
                  ↓
3. Context Cache β†’ Stored for 1 hour (Automatic)
                  ↓
4. Follow-up Questions β†’ Instant responses from cache ⚑
                  ↓
5. TTS Generation β†’ ElevenLabs (Natural Voice)
```

### Why This Matters

**Problem:** Traditional video analysis processes the entire video for every single question, causing:

- 🐌 Slow response times (10-30 seconds per query)
- πŸ’Έ High API costs (full video processing each time)
- 😫 Poor user experience for exploratory queries

**Solution:** Context Caching enables:

- ⚑ Fast follow-up queries (2-3 seconds)
- πŸ’° 90% cost reduction for subsequent questions
- 😊 Natural conversation flow with your videos

## πŸ“– Use Cases

### For Consumers

- πŸ“Ί **Content Understanding:** "What's the main message of this video?"
- πŸ” **Scene Search:** "At what point does the speaker mention AI?"
- πŸ“ **Summarization:** "Give me a 3-sentence summary"
- πŸŽ“ **Learning:** Turn educational videos into interactive Q&A sessions

### For Creatives

- 🎬 **Content Analysis:** Analyze video aesthetics, composition, and style
- 🎨 **Creative Inspiration:** "What visual techniques are used here?"
- πŸ“Š **Feedback:** Get AI feedback on your video content
- πŸ”„ **Iteration:** Ask multiple questions to refine your understanding

πŸ› οΈ Technical Architecture

Full Source Code

πŸ“¦ GitHub Repository: mcp-video-agent

πŸ“– Detailed Architecture: See ARCHITECTURE.md for in-depth technical documentation

This HF Space contains the frontend application. The complete project includes:

  • hf_space/ - This Gradio frontend (you're looking at it!)
  • backend/ - Modal serverless backend (view on GitHub)
  • frontend/ - Alternative frontend for direct Modal integration

For Evaluators: All backend code and deployment instructions are available in the GitHub repository.

### Tech Stack

- **Frontend:** Gradio 6.0 with custom components
- **Backend:** Modal for serverless compute
- **AI Models:**
  - Gemini 2.5 Flash (multimodal video analysis + context caching)
  - ElevenLabs Multilingual v2 (neural TTS)
- **Storage:** Modal Volume (persistent video storage)
- **Caching:** Gemini Context Caching API (1-hour TTL)
- **Rate Limiting:** In-memory rate limiter (10 req/hr per user)

### Architecture Highlights

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Gradio UI      β”‚  ← User uploads video + asks questions
β”‚  (This Space)   β”‚  ← Rate limiting & session management
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Modal Backend (Serverless Functions)   β”‚
β”‚                                          β”‚
β”‚  _internal_analyze_video():              β”‚
β”‚    β€’ Upload video to Gemini Files API   β”‚
β”‚    β€’ Create context cache (first query) β”‚
β”‚    β€’ Use cached context (follow-ups)    β”‚
β”‚    β€’ Return analysis text               β”‚
β”‚                                          β”‚
β”‚  _internal_speak_text():                 β”‚
β”‚    β€’ Convert text to speech             β”‚
β”‚    β€’ Store audio in Modal Volume        β”‚
β”‚    β€’ Return audio file                  β”‚
β”‚                                          β”‚
β”‚  Modal Volume:                           β”‚
β”‚    β€’ Persistent video storage           β”‚
β”‚    β€’ Generated audio files              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Gemini 2.5 API  β”‚  ← Multimodal video analysis
β”‚ Context Cache   β”‚  ← Automatic caching (min 1024 tokens)
β”‚                 β”‚  ← 90% cost reduction on cache hits
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ElevenLabs API  β”‚  ← Neural voice synthesis
β”‚ Model: v2       β”‚  ← Multilingual support
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

### Key Implementation Details

**Backend Code** (`backend/modal_app.py`):

```python
# Context caching with Gemini (google-genai SDK)
from google import genai
from google.genai import types

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

@app.function(timeout=600, volumes={"/data": vol})
def _internal_analyze_video(query: str, video_filename: str):
    video_path = f"/data/{video_filename}"  # stored in the Modal Volume

    # Upload to Gemini Files API
    # (production code should poll until video_file.state is ACTIVE)
    video_file = client.files.upload(file=video_path)

    # Create cache (first query) with a 1-hour TTL
    cache = client.caches.create(
        model="gemini-2.5-flash",
        config=types.CreateCachedContentConfig(
            contents=[video_file],
            system_instruction=system_instruction,
            ttl="3600s",  # 1 hour
        ),
    )

    # Use cache for queries - reuse the cached video context!
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=query,
        config=types.GenerateContentConfig(cached_content=cache.name),
    )
    return response.text
```

**Frontend Code** (`hf_space/app_with_modal.py`):

```python
import time
from collections import defaultdict, deque
import modal

# Rate limiting: sliding one-hour window per user
class RateLimiter:
    def __init__(self, max_per_hour=10):
        self.max_per_hour = max_per_hour
        self.requests = defaultdict(deque)  # user_id -> request timestamps

    def is_allowed(self, user_id):
        now = time.time()
        window = self.requests[user_id]
        while window and now - window[0] > 3600:
            window.popleft()                 # clean requests older than 1 hour
        if len(window) >= self.max_per_hour:
            return False                     # over limit
        window.append(now)                   # record new request
        return True

# Modal function calls
analyze_fn = modal.Function.from_name("mcp-video-agent", "_internal_analyze_video")
text_response = analyze_fn.remote(query, video_filename=unique_filename)
```

### Performance Metrics

| Metric        | First Query | Cached Query | Improvement   |
|---------------|-------------|--------------|---------------|
| Response Time | 8-12s       | 2-3s         | 75% faster    |
| API Cost      | $0.10       | $0.01        | 90% cheaper   |
| Token Usage   | ~10,000     | ~1,000       | 90% reduction |

πŸ† Hackathon Submission Details

Categories

  • MCP in Action - Consumer Track: Practical video Q&A for everyday users
  • MCP in Action - Creative Track: Tool for content creators and analysts

Sponsor Technologies Used

  • βœ… Modal: Serverless backend infrastructure
  • βœ… Google Gemini: Multimodal AI + Context Caching
  • βœ… ElevenLabs: Neural text-to-speech
  • βœ… Gradio 6.0: Modern UI framework

Innovation Points

  1. Smart Caching Strategy: Pioneering use of Gemini's Context Caching for video analysis
  2. Voice-First UX: Natural conversation experience with videos
  3. MCP Integration: Extensible as a tool for AI agents
  4. Fair Usage Design: Built-in rate limiting for shared resources

βš™οΈ Setup & Configuration

For Evaluators (Quick Test)

No setup needed! Just:

  1. Upload a video (MP4, max 100MB)
  2. Ask questions
  3. Experience the caching speed on follow-up queries

### For Developers (Self-Hosting)

**Required Secrets** (in Space Settings β†’ Secrets):

1. `GOOGLE_API_KEY` (Required)
2. `ELEVENLABS_API_KEY` (Optional but recommended)
   - Get from ElevenLabs
   - Used for voice synthesis
   - Without it, only text responses will be generated
3. `MODAL_TOKEN_ID` & `MODAL_TOKEN_SECRET` (For Modal backend)
   - Generate by running `modal token new`
   - Required if deploying with Modal backend
4. `MAX_REQUESTS_PER_HOUR` (Optional)
   - Default: 10 requests/hour per user
   - Adjust based on your usage needs

### Duplicate for Personal Use

Want to use this without limits?

1. Click the "Duplicate this Space" button
2. Add your own API keys in Settings β†’ Secrets
3. Adjust rate limits as needed
4. You're good to go!

## πŸ“± Social Media & Community

### πŸ“ Project Announcement

πŸ”— LinkedIn Post

### πŸ’¬ Discussions

Have questions or feedback? Visit the Discussions tab on this Space!

## πŸ‘₯ Team


## πŸ“Š Project Stats

- **Built in:** MCP 1st Birthday Hackathon (Nov 14-30, 2025)
- **Tech Stack:** 5 integrated technologies
- **Performance:** 90% cost reduction, 75% speed improvement
- **License:** MIT Open Source

πŸ™ Acknowledgments

Sponsors & Technologies

  • πŸš€ Modal - Serverless infrastructure
  • πŸ€– Google Gemini - Multimodal AI + Context Caching
  • πŸ—£οΈ ElevenLabs - Neural voice synthesis
  • 🎨 Gradio - UI framework
  • πŸ€— Hugging Face - Hosting platform

Special Thanks

  • MCP 1st Birthday Hackathon organizers
  • The Gradio team for excellent documentation
  • The open-source community

## πŸ“„ License

MIT License - See LICENSE file for details.

Open source and free to use, modify, and distribute!