---
title: MCP Video Agent
emoji: πŸŽ₯
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.0.1
app_file: app.py
pinned: false
license: mit
tags:
  - mcp
  - model-context-protocol
  - mcp-in-action-track-consumer
  - mcp-in-action-track-creative
  - video-analysis
  - gemini
  - multimodal
  - agents
  - rag
  - context-caching
---

# πŸŽ₯ MCP Video Agent

## πŸ† MCP 1st Birthday Hackathon Submission

**Track:** MCP in Action - Consumer & Creative Categories

**Tech Stack:** Gradio 6.0 + Gemini 2.5 Flash + ElevenLabs TTS + Modal + Context Caching


## 🎬 Demo Video

Watch the Video Agent in action - upload a video, ask questions, and receive voice responses!


## 🎯 What Makes This Special?

An intelligent video analysis agent that combines multimodal AI, voice interaction, and smart context caching to create a natural conversation experience with your videos.

### ⚑ Key Innovation: Smart Frame Caching

Unlike traditional video analysis that processes the entire video for every question, this agent uses Gemini's Context Caching to:

1. **First Query:** Uploads and deeply analyzes your video (5-10 seconds)
2. **Subsequent Queries:** Uses cached video context (2-3 seconds, 90% cost reduction!)
3. **Smart Reuse:** Cache persists for 1 hour - ask multiple questions without reprocessing

**Real-world Impact:** Turn a 10-minute video into a queryable knowledge base. Ask multiple questions in rapid succession, get instant answers with voice responses.
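The first-query / follow-up split can be sketched as a tiny in-memory registry mapping each video to its cache handle (all names here are illustrative, not the app's actual code):

```python
import time

CACHE_TTL_SECONDS = 3600      # caches expire after 1 hour
_cache_registry = {}          # video_id -> (cache_name, created_at)

def get_or_create_cache(video_id, create_fn):
    """Reuse an unexpired cache for this video, else create a new one."""
    entry = _cache_registry.get(video_id)
    if entry is not None:
        cache_name, created_at = entry
        if time.time() - created_at < CACHE_TTL_SECONDS:
            return cache_name, True   # cache hit: the fast, cheap path
    cache_name = create_fn(video_id)  # slow path: deep analysis + cache creation
    _cache_registry[video_id] = (cache_name, time.time())
    return cache_name, False
```

Calling this twice for the same video within the TTL returns the same cache name, with the second call reporting a hit.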


## πŸš€ Core Features

### 🎬 1. Multimodal Video Analysis

- Upload any video (MP4, max 100MB)
- Powered by Gemini 2.5 Flash - Google's latest multimodal model
- Understands visual content, actions, scenes, objects, and context

πŸ—£οΈ 2. Voice-First Interaction

  • Natural language responses via ElevenLabs TTS
  • Audio-first experience (hear answers immediately)
  • Full text transcripts available on demand
  • Supports conversational follow-up questions

### ⚑ 3. Intelligent Context Caching

- **First query:** Deep video analysis with full context extraction
- **Follow-up queries:** Lightning-fast responses using cached context
- **Cost optimization:** 90% reduction in API costs for repeated queries
- **Automatic management:** No manual cache setup required

### πŸ”Œ 4. MCP Server Integration

This application is designed to work as an MCP server for Claude Desktop and other MCP clients.

**Note:** The public MCP endpoint is currently disabled to prevent unauthorized API usage. If you need MCP access for evaluation, please contact the developer directly.

The primary way to use this application is through the HF Space Gradio interface.
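For reference, when a Gradio Space does expose its MCP server, it is typically reachable at an SSE endpoint under `/gradio_api/mcp/sse`, which Claude Desktop can connect to via the `mcp-remote` bridge. A hypothetical client configuration (the Space URL is a placeholder, and as noted above this endpoint is disabled here):

```json
{
  "mcpServers": {
    "video-agent": {
      "command": "npx",
      "args": ["mcp-remote", "https://<your-space>.hf.space/gradio_api/mcp/sse"]
    }
  }
}
```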

πŸ›‘οΈ 5. Fair Usage & Rate Limiting

  • Built-in rate limiting (10 requests/hour per user)
  • 100MB file size limit
  • Designed for responsible shared resource usage

## πŸŽ“ How It Works

### The Smart Caching Pipeline

```
1. Video Upload β†’ Modal Volume (Persistent Storage)
                  ↓
2. First Analysis β†’ Gemini 2.5 Flash (Deep Processing)
                  ↓
3. Context Cache β†’ Stored for 1 hour (Automatic)
                  ↓
4. Follow-up Questions β†’ Instant responses from cache ⚑
                  ↓
5. TTS Generation β†’ ElevenLabs (Natural Voice)
```

### Why This Matters

**Problem:** Traditional video analysis processes the entire video for every single question, causing:

- 🐌 Slow response times (10-30 seconds per query)
- πŸ’Έ High API costs (full video processing each time)
- 😫 Poor user experience for exploratory queries

**Solution:** Context Caching enables:

- ⚑ Fast follow-up queries (2-3 seconds)
- πŸ’° 90% cost reduction for subsequent questions
- 😊 Natural conversation flow with your videos

## πŸ“– Use Cases

### For Consumers

- πŸ“Ί **Content Understanding:** "What's the main message of this video?"
- πŸ” **Scene Search:** "At what point does the speaker mention AI?"
- πŸ“ **Summarization:** "Give me a 3-sentence summary"
- πŸŽ“ **Learning:** Turn educational videos into interactive Q&A sessions

### For Creatives

- 🎬 **Content Analysis:** Analyze video aesthetics, composition, and style
- 🎨 **Creative Inspiration:** "What visual techniques are used here?"
- πŸ“Š **Feedback:** Get AI feedback on your video content
- πŸ”„ **Iteration:** Ask multiple questions to refine your understanding

πŸ› οΈ Technical Architecture

Full Source Code

πŸ“¦ GitHub Repository: mcp-video-agent

πŸ“– Detailed Architecture: See ARCHITECTURE.md for in-depth technical documentation

This HF Space contains the frontend application. The complete project includes:

  • hf_space/ - This Gradio frontend (you're looking at it!)
  • backend/ - Modal serverless backend (view on GitHub)
  • frontend/ - Alternative frontend for direct Modal integration

For Evaluators: All backend code and deployment instructions are available in the GitHub repository.

### Tech Stack

- **Frontend:** Gradio 6.0 with custom components
- **Backend:** Modal for serverless compute
- **AI Models:**
  - Gemini 2.5 Flash (multimodal video analysis + context caching)
  - ElevenLabs Multilingual v2 (neural TTS)
- **Storage:** Modal Volume (persistent video storage)
- **Caching:** Gemini Context Caching API (1-hour TTL)
- **Rate Limiting:** In-memory rate limiter (10 req/hr per user)

### Architecture Highlights

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Gradio UI      β”‚  ← User uploads video + asks questions
β”‚  (This Space)   β”‚  ← Rate limiting & session management
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Modal Backend (Serverless Functions)   β”‚
β”‚                                          β”‚
β”‚  _internal_analyze_video():              β”‚
β”‚    β€’ Upload video to Gemini Files API   β”‚
β”‚    β€’ Create context cache (first query) β”‚
β”‚    β€’ Use cached context (follow-ups)    β”‚
β”‚    β€’ Return analysis text               β”‚
β”‚                                          β”‚
β”‚  _internal_speak_text():                 β”‚
β”‚    β€’ Convert text to speech             β”‚
β”‚    β€’ Store audio in Modal Volume        β”‚
β”‚    β€’ Return audio file                  β”‚
β”‚                                          β”‚
β”‚  Modal Volume:                           β”‚
β”‚    β€’ Persistent video storage           β”‚
β”‚    β€’ Generated audio files              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Gemini 2.5 API  β”‚  ← Multimodal video analysis
β”‚ Context Cache   β”‚  ← Automatic caching (min 1024 tokens)
β”‚                 β”‚  ← 90% cost reduction on cache hits
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ElevenLabs API  β”‚  ← Neural voice synthesis
β”‚ Model: v2       β”‚  ← Multilingual support
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

### Key Implementation Details

**Backend Code** (`backend/modal_app.py`):

```python
# Context caching with Gemini (google-genai SDK)
from google import genai
from google.genai import types

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

@app.function(timeout=600, volumes={"/data": vol})
def _internal_analyze_video(query: str, video_filename: str):
    video_path = f"/data/{video_filename}"  # stored in the Modal Volume

    # Upload to Gemini Files API
    # (production code should poll until video_file.state is ACTIVE)
    video_file = client.files.upload(file=video_path)

    # Create cache (first query) with a 1-hour TTL
    cache = client.caches.create(
        model="gemini-2.5-flash",
        config=types.CreateCachedContentConfig(
            contents=[video_file],
            system_instruction=system_instruction,
            ttl="3600s",  # 1 hour
        ),
    )

    # Use cache for queries - reuse the cached video context!
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=query,
        config=types.GenerateContentConfig(cached_content=cache.name),
    )
    return response.text
```

**Frontend Code** (`hf_space/app_with_modal.py`):

```python
import time
from collections import defaultdict, deque
import modal

# Rate limiting: sliding one-hour window per user
class RateLimiter:
    def __init__(self, max_per_hour=10):
        self.max_per_hour = max_per_hour
        self.requests = defaultdict(deque)  # user_id -> request timestamps

    def is_allowed(self, user_id):
        now = time.time()
        window = self.requests[user_id]
        while window and now - window[0] > 3600:
            window.popleft()                 # clean requests older than 1 hour
        if len(window) >= self.max_per_hour:
            return False                     # over limit
        window.append(now)                   # record new request
        return True

# Modal function calls
analyze_fn = modal.Function.from_name("mcp-video-agent", "_internal_analyze_video")
text_response = analyze_fn.remote(query, video_filename=unique_filename)
```

### Performance Metrics

| Metric        | First Query | Cached Query | Improvement   |
|---------------|-------------|--------------|---------------|
| Response Time | 8-12s       | 2-3s         | 75% faster    |
| API Cost      | $0.10       | $0.01        | 90% cheaper   |
| Token Usage   | ~10,000     | ~1,000       | 90% reduction |

πŸ† Hackathon Submission Details

Categories

  • MCP in Action - Consumer Track: Practical video Q&A for everyday users
  • MCP in Action - Creative Track: Tool for content creators and analysts

Sponsor Technologies Used

  • βœ… Modal: Serverless backend infrastructure
  • βœ… Google Gemini: Multimodal AI + Context Caching
  • βœ… ElevenLabs: Neural text-to-speech
  • βœ… Gradio 6.0: Modern UI framework

Innovation Points

  1. Smart Caching Strategy: Pioneering use of Gemini's Context Caching for video analysis
  2. Voice-First UX: Natural conversation experience with videos
  3. MCP Integration: Extensible as a tool for AI agents
  4. Fair Usage Design: Built-in rate limiting for shared resources

βš™οΈ Setup & Configuration

For Evaluators (Quick Test)

No setup needed! Just:

  1. Upload a video (MP4, max 100MB)
  2. Ask questions
  3. Experience the caching speed on follow-up queries

### For Developers (Self-Hosting)

**Required Secrets** (in Space Settings β†’ Secrets):

1. `GOOGLE_API_KEY` (Required)
2. `ELEVENLABS_API_KEY` (Optional but recommended)
   - Get from ElevenLabs
   - Used for voice synthesis
   - Without it, only text responses will be generated
3. `MODAL_TOKEN_ID` & `MODAL_TOKEN_SECRET` (For Modal backend)
   - Generate by running `modal token new`
   - Required if deploying with Modal backend
4. `MAX_REQUESTS_PER_HOUR` (Optional)
   - Default: 10 requests/hour per user
   - Adjust based on your usage needs

### Duplicate for Personal Use

Want to use this without limits?

1. Click the "Duplicate this Space" button
2. Add your own API keys in Settings β†’ Secrets
3. Adjust rate limits as needed
4. You're good to go!

## πŸ“± Social Media & Community

### πŸ“ Project Announcement

πŸ”— LinkedIn Post

### πŸ’¬ Discussions

Have questions or feedback? Visit the Discussions tab on this Space!

## πŸ‘₯ Team


## πŸ“Š Project Stats

- **Built in:** MCP 1st Birthday Hackathon (Nov 14-30, 2025)
- **Tech Stack:** 5 integrated technologies
- **Performance:** 90% cost reduction, 75% speed improvement
- **License:** MIT Open Source

πŸ™ Acknowledgments

Sponsors & Technologies

  • πŸš€ Modal - Serverless infrastructure
  • πŸ€– Google Gemini - Multimodal AI + Context Caching
  • πŸ—£οΈ ElevenLabs - Neural voice synthesis
  • 🎨 Gradio - UI framework
  • πŸ€— Hugging Face - Hosting platform

Special Thanks

  • MCP 1st Birthday Hackathon organizers
  • The Gradio team for excellent documentation
  • The open-source community

## πŸ“„ License

MIT License - See LICENSE file for details.

Open source and free to use, modify, and distribute!