---
title: MCP Video Agent
emoji: 🎥
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.0.1
app_file: app.py
pinned: false
license: mit
tags:
  - mcp
  - model-context-protocol
  - mcp-in-action-track-consumer
  - mcp-in-action-track-creative
  - video-analysis
  - gemini
  - multimodal
  - agents
  - rag
  - context-caching
---
# 🎥 MCP Video Agent

## 🏆 MCP 1st Birthday Hackathon Submission

**Track:** MCP in Action - Consumer & Creative Categories

**Tech Stack:** Gradio 6.0 + Gemini 2.5 Flash + ElevenLabs TTS + Modal + Context Caching
## 🎬 Demo Video

Watch the Video Agent in action - upload a video, ask questions, and receive voice responses!

## 🎯 What Makes This Special?

An intelligent video analysis agent that combines multimodal AI, voice interaction, and smart context caching to create a natural conversational experience with your videos.
### ⚡ Key Innovation: Smart Frame Caching

Unlike traditional video analysis, which processes the entire video for every question, this agent uses Gemini's Context Caching to:

- **First query:** Uploads and deeply analyzes your video (5-10 seconds)
- **Subsequent queries:** Uses the cached video context (2-3 seconds, 90% cost reduction!)
- **Smart reuse:** The cache persists for 1 hour - ask multiple questions without reprocessing

**Real-world impact:** Turn a 10-minute video into a queryable knowledge base. Ask multiple questions in rapid succession and get instant answers with voice responses.
## 🚀 Core Features

### 🎬 1. Multimodal Video Analysis

- Upload any video (MP4, max 100MB)
- Powered by Gemini 2.5 Flash - Google's latest multimodal model
- Understands visual content, actions, scenes, objects, and context

### 🗣️ 2. Voice-First Interaction

- Natural language responses via ElevenLabs TTS
- Audio-first experience (hear answers immediately)
- Full text transcripts available on demand
- Supports conversational follow-up questions

### ⚡ 3. Intelligent Context Caching

- First query: deep video analysis with full context extraction
- Follow-up queries: lightning-fast responses using the cached context
- Cost optimization: 90% reduction in API costs for repeated queries
- Automatic management: no manual cache setup required
### 🔌 4. MCP Server Integration

This application is designed to work as an MCP server for Claude Desktop and other MCP clients; a minimal sketch of the pattern follows below.

**Note:** The public MCP endpoint is currently disabled to prevent unauthorized API usage. If you need MCP access for evaluation, please contact the developer directly.

The primary way to use this application is through the HF Space Gradio interface.
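For reference, this is the general pattern by which a Gradio app exposes its functions as MCP tools - a minimal sketch assuming `gradio[mcp]` is installed; `analyze_video` here is an illustrative stand-in, not this Space's actual handler:

```python
# Minimal sketch: serving a Gradio function as an MCP tool
import gradio as gr

def analyze_video(video_path: str, query: str) -> str:
    """Answer a natural-language question about an uploaded video."""
    # ... forward to the Modal backend here ...
    return "analysis result"

demo = gr.Interface(
    fn=analyze_video,
    inputs=[gr.Video(), gr.Textbox(label="Question")],
    outputs="text",
)
demo.launch(mcp_server=True)  # serves an MCP endpoint alongside the web UI
```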
### 🛡️ 5. Fair Usage & Rate Limiting

- Built-in rate limiting (10 requests/hour per user)
- 100MB file size limit
- Designed for responsible shared-resource usage
## 🔄 How It Works

### The Smart Caching Pipeline

```
1. Video Upload         → Modal Volume (Persistent Storage)
          ↓
2. First Analysis       → Gemini 2.5 Flash (Deep Processing)
          ↓
3. Context Cache        → Stored for 1 hour (Automatic)
          ↓
4. Follow-up Questions  → Instant responses from cache ⚡
          ↓
5. TTS Generation       → ElevenLabs (Natural Voice)
```
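From the frontend, this whole pipeline reduces to two remote calls. A hypothetical sketch - the function names match the backend functions described later, and the sample `query` and filename are placeholders:

```python
# Hypothetical end-to-end flow from the Gradio frontend
import modal

analyze_fn = modal.Function.from_name("mcp-video-agent", "_internal_analyze_video")
speak_fn = modal.Function.from_name("mcp-video-agent", "_internal_speak_text")

query, unique_filename = "What happens in this video?", "demo_video.mp4"

# The first call analyzes the video and creates the cache; repeat calls hit the cache
text = analyze_fn.remote(query, video_filename=unique_filename)
audio_path = speak_fn.remote(text)  # ElevenLabs TTS, saved to the Modal Volume
```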
### Why This Matters

**Problem:** Traditional video analysis processes the entire video for every single question, causing:

- 🐌 Slow response times (10-30 seconds per query)
- 💸 High API costs (full video processing each time)
- 🚫 Poor user experience for exploratory queries

**Solution:** Context Caching enables:

- ⚡ Fast follow-up queries (2-3 seconds)
- 💰 90% cost reduction for subsequent questions
- 🔄 Natural conversation flow with your videos
## 💡 Use Cases

### For Consumers

- 📺 **Content Understanding:** "What's the main message of this video?"
- 🔍 **Scene Search:** "At what point does the speaker mention AI?"
- 📝 **Summarization:** "Give me a 3-sentence summary"
- 📚 **Learning:** Turn educational videos into interactive Q&A sessions

### For Creatives

- 🎬 **Content Analysis:** Analyze video aesthetics, composition, and style
- 🎨 **Creative Inspiration:** "What visual techniques are used here?"
- 📝 **Feedback:** Get AI feedback on your video content
- 🔄 **Iteration:** Ask multiple questions to refine your understanding
## 🛠️ Technical Architecture

### Full Source Code

📦 **GitHub Repository:** mcp-video-agent

📖 **Detailed Architecture:** See ARCHITECTURE.md for in-depth technical documentation

This HF Space contains the frontend application. The complete project includes:

- `hf_space/` - This Gradio frontend (you're looking at it!)
- `backend/` - Modal serverless backend (view on GitHub)
- `frontend/` - Alternative frontend for direct Modal integration

**For Evaluators:** All backend code and deployment instructions are available in the GitHub repository.
### Tech Stack

- **Frontend:** Gradio 6.0 with custom components
- **Backend:** Modal for serverless compute
- **AI Models:**
  - Gemini 2.5 Flash (multimodal video analysis + context caching)
  - ElevenLabs Multilingual v2 (neural TTS)
- **Storage:** Modal Volume (persistent video storage)
- **Caching:** Gemini Context Caching API (1-hour TTL)
- **Rate Limiting:** In-memory rate limiter (10 req/hr per user)
### Architecture Highlights

```
┌──────────────────┐
│    Gradio UI     │ ← User uploads video + asks questions
│   (This Space)   │ ← Rate limiting & session management
└────────┬─────────┘
         │
         ▼
┌──────────────────────────────────────────┐
│  Modal Backend (Serverless Functions)    │
│                                          │
│  _internal_analyze_video():              │
│   • Upload video to Gemini Files API     │
│   • Create context cache (first query)   │
│   • Use cached context (follow-ups)      │
│   • Return analysis text                 │
│                                          │
│  _internal_speak_text():                 │
│   • Convert text to speech               │
│   • Store audio in Modal Volume          │
│   • Return audio file                    │
│                                          │
│  Modal Volume:                           │
│   • Persistent video storage             │
│   • Generated audio files                │
└────────┬─────────────────────────────────┘
         │
         ▼
┌──────────────────┐
│  Gemini 2.5 API  │ ← Multimodal video analysis
│  Context Cache   │ ← Automatic caching (min 1024 tokens)
│                  │ ← 90% cost reduction on cache hits
└──────────────────┘
         │
         ▼
┌──────────────────┐
│  ElevenLabs API  │ ← Neural voice synthesis
│   Model: v2      │ ← Multilingual support
└──────────────────┘
```
### Key Implementation Details

**Backend Code** (`backend/modal_app.py`):

```python
# Context caching with Gemini (simplified; assumes `app`, `vol`, and
# `system_instruction` are defined elsewhere in the module)
from google import genai
from google.genai import types

@app.function(timeout=600, volumes={"/data": vol})
def _internal_analyze_video(query: str, video_filename: str) -> str:
    client = genai.Client()  # reads GOOGLE_API_KEY from the environment
    video_path = f"/data/{video_filename}"

    # Upload the video to the Gemini Files API
    # (production code should wait for the file to become ACTIVE)
    video_file = client.files.upload(file=video_path)

    # Create a context cache on the first query (TTL: 1 hour)
    cache = client.caches.create(
        model="gemini-2.5-flash",
        config=types.CreateCachedContentConfig(
            contents=[video_file],
            system_instruction=system_instruction,
            ttl="3600s",
        ),
    )

    # Answer the query against the cached video context
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=query,
        config=types.GenerateContentConfig(
            cached_content=cache.name,  # reuse cached context!
        ),
    )
    return response.text
```
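The TTS counterpart, `_internal_speak_text()`, is summarized in the architecture diagram but not excerpted in this README. A minimal sketch using the ElevenLabs Python SDK - the voice ID and output path here are placeholder assumptions, not the Space's actual values:

```python
# Hypothetical sketch of _internal_speak_text(); voice_id and paths are placeholders
import os
from elevenlabs.client import ElevenLabs

@app.function(timeout=300, volumes={"/data": vol})
def _internal_speak_text(text: str) -> str:
    client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])
    # convert() streams audio chunks; write them to the shared Modal Volume
    audio = client.text_to_speech.convert(
        voice_id="EXAVITQu4vr4xnSDxMaL",  # placeholder voice ID
        model_id="eleven_multilingual_v2",
        text=text,
    )
    out_path = "/data/response.mp3"
    with open(out_path, "wb") as f:
        for chunk in audio:
            f.write(chunk)
    return out_path
```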
**Frontend Code** (`hf_space/app_with_modal.py`):

```python
import time, modal
from collections import defaultdict

# Rate limiting (in-memory, per user)
class RateLimiter:
    def __init__(self, max_requests=10, window=3600):
        self.max_requests, self.window = max_requests, window
        self.requests = defaultdict(list)  # user_id -> request timestamps

    def is_allowed(self, user_id):
        now = time.time()
        # Clean requests older than 1 hour, check the limit, record the new one
        recent = [t for t in self.requests[user_id] if now - t < self.window]
        within_limit = len(recent) < self.max_requests
        if within_limit:
            recent.append(now)
        self.requests[user_id] = recent
        return within_limit

# Modal function calls
analyze_fn = modal.Function.from_name("mcp-video-agent", "_internal_analyze_video")
text_response = analyze_fn.remote(query, video_filename=unique_filename)
```
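One design note: an in-memory rate limiter is per-process, so counts reset whenever the Space restarts and are not shared across replicas - a reasonable trade-off for a single-instance demo, though a persistent store would be needed at larger scale.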
### Performance Metrics

| Metric | First Query | Cached Query | Improvement |
|---|---|---|---|
| Response Time | 8-12s | 2-3s | 75% faster |
| API Cost | $0.10 | $0.01 | 90% cheaper |
| Token Usage | ~10,000 | ~1,000 | 90% reduction |
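To put the table in session terms: a five-question session on one video costs about $0.10 + 4 × $0.01 = $0.14 with caching versus 5 × $0.10 = $0.50 without it - roughly a 72% saving, and the gap widens with every additional question.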
## 🏆 Hackathon Submission Details

### Categories

- **MCP in Action - Consumer Track:** Practical video Q&A for everyday users
- **MCP in Action - Creative Track:** Tool for content creators and analysts

### Sponsor Technologies Used

- ✅ **Modal:** Serverless backend infrastructure
- ✅ **Google Gemini:** Multimodal AI + Context Caching
- ✅ **ElevenLabs:** Neural text-to-speech
- ✅ **Gradio 6.0:** Modern UI framework

### Innovation Points

1. **Smart Caching Strategy:** Pioneering use of Gemini's Context Caching for video analysis
2. **Voice-First UX:** Natural conversation experience with videos
3. **MCP Integration:** Extensible as a tool for AI agents
4. **Fair Usage Design:** Built-in rate limiting for shared resources
## ⚙️ Setup & Configuration

### For Evaluators (Quick Test)

No setup needed! Just:

1. Upload a video (MP4, max 100MB)
2. Ask questions
3. Experience the caching speed on follow-up queries

### For Developers (Self-Hosting)

Required secrets (in Space Settings → Secrets):

1. `GOOGLE_API_KEY` (required)
   - Get from Google AI Studio
   - Used for Gemini 2.5 Flash video analysis
2. `ELEVENLABS_API_KEY` (optional but recommended)
   - Get from ElevenLabs
   - Used for voice synthesis
   - Without it, only text responses will be generated
3. `MODAL_TOKEN_ID` & `MODAL_TOKEN_SECRET` (for the Modal backend)
   - Get from `modal token new`
   - Required if deploying with the Modal backend
4. `MAX_REQUESTS_PER_HOUR` (optional)
   - Default: 10 requests/hour per user
   - Adjust based on your usage needs
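For reference, a hedged sketch of how such settings are typically read at startup (illustrative, not the Space's actual code; the names mirror the secrets above):

```python
# Illustrative startup configuration
import os

GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]              # required: fail fast if missing
ELEVENLABS_API_KEY = os.environ.get("ELEVENLABS_API_KEY")  # optional: text-only fallback
MAX_REQUESTS_PER_HOUR = int(os.environ.get("MAX_REQUESTS_PER_HOUR", "10"))
```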
### Duplicate for Personal Use

Want to use this without limits?

1. Click the "Duplicate this Space" button
2. Add your own API keys in Settings → Secrets
3. Adjust rate limits as needed
4. You're good to go!
## 📱 Social Media & Community

### 📣 Project Announcement

### 💬 Discussions

Have questions or feedback? Visit the Discussions tab on this Space!

## 👥 Team

- **Built by:** Yu Cheng Lin
- **GitHub:** ycsmiley
## 📊 Project Stats

- **Built in:** MCP 1st Birthday Hackathon (Nov 14-30, 2025)
- **Tech Stack:** 5 integrated technologies
- **Performance:** 90% cost reduction, 75% speed improvement
- **License:** MIT Open Source
## 🙏 Acknowledgments

### Sponsors & Technologies

- 🚀 Modal - Serverless infrastructure
- 🤖 Google Gemini - Multimodal AI + Context Caching
- 🗣️ ElevenLabs - Neural voice synthesis
- 🎨 Gradio - UI framework
- 🤗 Hugging Face - Hosting platform

### Special Thanks

- MCP 1st Birthday Hackathon organizers
- The Gradio team for excellent documentation
- The open-source community
## 📄 License

MIT License - see the LICENSE file for details.

Open source and free to use, modify, and distribute!