|
|
--- |
|
|
title: Falconz - Red teamers |
|
|
emoji: 🚀 |
|
|
colorFrom: blue |
|
|
colorTo: yellow |
|
|
sdk: gradio |
|
|
sdk_version: 5.49.1 |
|
|
app_file: app.py |
|
|
pinned: true |
|
|
thumbnail: >- |
|
|
/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F621c88aca7d6c7e0563256ae%2FsCv6mFixuQLmzhTJuzgXG.png%3C%2Fspan%3E%3C!-- HTML_TAG_END --> |
|
|
short_description: MCP Powered Redteaming tool to Safegaurd your Agentic Apps!! |
|
|
tags: |
|
|
- building-mcp-track-enterprise |
|
|
- mcp-in-action-track-enterprise |
|
|
- security |
|
|
- red-teaming |
|
|
- ai-safety |
|
|
--- |
|
|
|
|
|
# 🛡️ Falconz – Unified LLM Security & Red Teaming Platform |
|
|
|
|
|
Welcome to our submission for the **Hugging Face GenAI Agents & MCP Hackathon**! |
|
|
Falconz is a **multi-model AI security platform** built with **Gradio & MCP** and ANthropic Claude models, designed to detect **jailbreaks, prompt injections, and unsafe LLM outputs in Agentic pipelines / LLM based workflows across multiple foundation models** in real time. |
|
|
|
|
|
|
|
|
|
|
|
🎥 **Demo working Video:** |
|
|
Main Falconz demo showcasing core features with MCP in Action in Claude Desktop. |
|
|
https://www.youtube.com/watch?v=wZ9RQjpoMYo |
|
|
|
|
|
|
|
|
🌐 **Social media -LinkedIn Post:** |
|
|
Public announcement and shareable link. |
|
|
https://www.linkedin.com/posts/sallu-mandya_ai-aiagents-mcp-activity-7399436956662841344-3o1I?utm_source=share&utm_medium=member_desktop&rcm=ACoAACD-K8sBnXZWALlW2yw-AnT_4KptCJFJs7M |
|
|
|
|
|
🌐 **Google CO:lab:** |
|
|
https://colab.research.google.com/drive/1PSuPQ35UZntKcUBd43QtjrsRLVvHJYlm?usp=sharing |
|
|
|
|
|
## 🏷️ Hackathon Track Tags |
|
|
|
|
|
This project is officially submitted to the following MCP Hackathon tracks: |
|
|
|
|
|
- **building-mcp-track-enterprise** |
|
|
- **mcp-in-action-track-enterprise** |
|
|
- **security** |
|
|
- **red-teaming** |
|
|
- **ai-safety** |
|
|
## 🌐 Platform Overview |
|
|
|
|
|
Falconz provides a unified security layer for LLM-based apps by combining: |
|
|
|
|
|
- 🔐 **Real-time jailbreak & prompt-injection detection using CLaude Model** |
|
|
- 🧠 **Multi-model testing across Anthropic, OpenAI, Gemini, Mistral, Phi & more** |
|
|
- 🖼️ **Image-based prompt injection scanning** |
|
|
- 📊 **Analytics dashboard for threat trends** |
|
|
- 🪝 **MCP integration for agentic workflows** |
|
|
|
|
|
This platform helps developers validate and harden LLM systems against manipulation and unsafe outputs. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🧩 Core Modules |
|
|
|
|
|
### 💬 Chat & Response Analysis |
|
|
- Interact with multiple LLMs |
|
|
- Automatically evaluates model responses for: |
|
|
- Jailbreak signals |
|
|
- Policy violations |
|
|
- Manipulation attempts |
|
|
- Outputs structured JSON + visual risk scoring |
|
|
|
|
|
### 📝 Prompt Tester |
|
|
- Test known or custom jailbreak prompts |
|
|
- Compare how different models respond |
|
|
- Ideal for red-teaming and benchmarking model safety |
|
|
|
|
|
### 🖼️ Image Scanner |
|
|
- Detects hidden prompt instructions within images |
|
|
- Flags potential injection attempts (SAFE / UNSAFE) |
|
|
|
|
|
### ⚙️ Prompt Library (Customizable) |
|
|
- Built-in top 10 jailbreak templates (OWASP-inspired) |
|
|
- Users can update and auto-modify prompt templates |
|
|
- Supports CSV import + dynamic replacements |
|
|
|
|
|
### 📊 Analytics Dashboard |
|
|
- Trends of SAFE vs UNSAFE detections |
|
|
- Risk score visualization |
|
|
- Model performance insights |
|
|
|
|
|
--- |
|
|
|
|
|
## 🔗 Multi-Model Support |
|
|
|
|
|
Falconz integrates with (With openAI like Endpoints): |
|
|
- ✅ Anthropic |
|
|
- ✅ openai |
|
|
- ✅ Google Gemini |
|
|
- ✅ Mistral |
|
|
- ✅ Microsoft Phi |
|
|
- ✅ Meta (Guard Models) |
|
|
- ✅ Meta (Guard Models) |
|
|
- Any Custom model from OpenRouter or OpenAI like endpoints |
|
|
|
|
|
Each model can be tested independently for safety robustness. |
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
High-level components: |
|
|
- **Frontend:** Gradio UI (Multi-tab interaction) |
|
|
- **Middleware:** MCP-powered routing & agent logic |
|
|
- **Backend:** Multi-model OpenRouter API |
|
|
- **Analytics:** Local CSV logging + dashboards |
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 How It Works (Full App Flow Across All Tabs) |
|
|
|
|
|
### ✅ 1️⃣ Chat & Analysis Flow |
|
|
1. User enters a message in the **Chat** tab |
|
|
2. Falconz sends the message to the selected LLM model |
|
|
3. The model responds normally |
|
|
4. The response is passed through the **risk analysis engine** |
|
|
5. A JSON risk score + visual report is generated |
|
|
6. Conversation & analysis logs are stored for analytics |
|
|
|
|
|
--- |
|
|
|
|
|
### ✅ 2️⃣ Text Prompt Tester Flow |
|
|
1. User inputs a jailbreak/prompt-injection test prompt |
|
|
2. Falconz sends it directly to the selected guard model |
|
|
3. The raw model response is returned (no chat history) |
|
|
4. Users compare responses to evaluate model safety behavior |
|
|
|
|
|
--- |
|
|
|
|
|
### ✅ 3️⃣ Image Scanner Flow |
|
|
1. User uploads an image containing text or hidden instructions |
|
|
2. Falconz extracts image content and sends it to a vision model |
|
|
3. The model evaluates the content for injection threats |
|
|
4. Output is classified as **SAFE** or **UNSAFE** |
|
|
|
|
|
## 🧑💻 Authors |
|
|
|
|
|
- [Mohammed Arsalan](http://linkedin.com/in/sallu-mandya/) |
|
|
|
|
|
## 📝 License |
|
|
|
|
|
This project is licensed under the **MIT License**. |
|
|
|
|
|
--- |
|
|
## 📝 Architecture |
|
|
|
|
|
╔════════════════════════════════════════════════════════════════════════════════════╗ |
|
|
║ FALCONZ - ARCHITECTURE DIAGRAM ║ |
|
|
║ Unified LLM Security & Red Teaming Platform ║ |
|
|
╚════════════════════════════════════════════════════════════════════════════════════╝ |
|
|
|
|
|
|
|
|
┌──────────────────────────────────────────────────────────────────────────────────┐ |
|
|
│ 🖥️ FRONTEND LAYER │ |
|
|
│ (Gradio UI) │ |
|
|
├──────────────────────────────────────────────────────────────────────────────────┤ |
|
|
│ │ |
|
|
│ ┌─────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │ |
|
|
│ │ 💬 Chat & │ │ 🖼️ Image │ │ 📝 Text Prompt │ │ |
|
|
│ │ Analysis Tab │ │ Scanner Tab │ │ Tester Tab │ │ |
|
|
│ └────────┬────────┘ └────────┬─────────┘ └────────┬────────┘ │ |
|
|
│ │ │ │ │ |
|
|
│ ┌────────┴─────────┬──────────┴──────────┬───────────┴────────┐ │ |
|
|
│ │ │ │ │ │ |
|
|
│ └──────────────────┴─────────────────────┴────────────────────┘ │ |
|
|
│ │ │ │ │ |
|
|
│ ▼ ▼ ▼ │ |
|
|
│ ┌───────────────────────────────────────────────────────────┐ │ |
|
|
│ │ 📊 Analytics Dashboard Tab │ 📚 Learning Hub Tab │ │ |
|
|
│ └───────────────────────────────────────────────────────────┘ │ |
|
|
│ │ │ │ │ |
|
|
└───────────┼────────────────────┼──────────────────────┼─────────────────────────┘ |
|
|
│ │ │ |
|
|
▼ ▼ ▼ |
|
|
┌──────────────────────────────────────────────────────────────────────────────────┐ |
|
|
│ 🔗 REQUEST ROUTER LAYER │ |
|
|
│ (Message Handling & Orchestration) │ |
|
|
├──────────────────────────────────────────────────────────────────────────────────┤ |
|
|
│ │ |
|
|
│ ┌──────────────────┐ ┌─────────────────┐ ┌──────────────────┐ │ |
|
|
│ │ Chat Handler │ │ Image Handler │ │ Prompt Handler │ │ |
|
|
│ │ - Format msgs │ │ - Extract B64 │ │ - Parse templates│ │ |
|
|
│ │ - Build history │ │ - Send to vision│ │ - Route to guard │ │ |
|
|
│ └────────┬─────────┘ └────────┬────────┘ └────────┬─────────┘ │ |
|
|
│ │ │ │ │ |
|
|
│ └─────────────────────┼─────────────────────┘ │ |
|
|
│ │ │ |
|
|
└─────────────────────────────────┼──────────────────────────────────────────────┘ |
|
|
▼ |
|
|
┌──────────────────────────────────────────────────────────────────────────────────┐ |
|
|
│ 🧠 DETECTION ENGINE LAYER (Claude) │ |
|
|
│ (Falconz Prompt Processors) │ |
|
|
├──────────────────────────────────────────────────────────────────────────────────┤ |
|
|
│ │ |
|
|
│ ┌──────────────────────────────────────────────────────────┐ │ |
|
|
│ │ falcon_prompt_text (Text Analysis) │ │ |
|
|
│ │ - Detect jailbreaks, prompt injections │ │ |
|
|
│ │ - Output: risk_score, policy_break_points, attack_used │ │ |
|
|
│ └────────┬─────────────────────────────────────────────────┘ │ |
|
|
│ │ │ |
|
|
│ ┌────────▼─────────────────────────────────────────────────┐ │ |
|
|
│ │ Falcon_prompt_image (Vision Analysis) │ │ |
|
|
│ │ - Extract text from images │ │ |
|
|
│ │ - Compare against injection templates │ │ |
|
|
│ │ - Output: SAFE / UNSAFE │ │ |
|
|
│ └────────┬─────────────────────────────────────────────────┘ │ |
|
|
│ │ │ |
|
|
│ ┌────────▼─────────────────────────────────────────────────┐ │ |
|
|
│ │ prompt_injection_templates │ │ |
|
|
│ │ - Top 10 jailbreak patterns (OWASP-inspired) │ │ |
|
|
│ │ - Customizable & updatable via CSV │ │ |
|
|
│ └────────┬─────────────────────────────────────────────────┘ │ |
|
|
│ │ │ |
|
|
└───────────┼──────────────────────────────────────────────────────────────────────┘ |
|
|
│ |
|
|
▼ |
|
|
┌──────────────────────────────────────────────────────────────────────────────────┐ |
|
|
│ 🌐 MULTI-MODEL API LAYER │ |
|
|
│ (OpenRouter API - Model Abstraction) │ |
|
|
├──────────────────────────────────────────────────────────────────────────────────┤ |
|
|
│ │ |
|
|
│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │ |
|
|
│ │ DETECTION MODELS │ │ CHAT MODELS │ │ VISION MODELS │ │ |
|
|
│ ├──────────────────┤ ├──────────────────┤ ├──────────────────┤ │ |
|
|
│ │ • Claude Sonnet │ │ • Gemini 2.5 │ │ • Claude Sonnet │ │ |
|
|
│ │ 4.5 │ │ • GPT-4o │ │ • Gemini 2.5 │ │ |
|
|
│ │ • Claude Opus │ │ • Mistral Med │ │ • GPT-4o │ │ |
|
|
│ │ • Claude Haiku │ │ • Phi-4 │ │ • Phi-4 │ │ |
|
|
│ │ • Llama Guard │ │ • Gemma-3 │ │ • Mistral Med │ │ |
|
|
│ └──────────────────┘ └──────────────────┘ └──────────────────┘ │ |
|
|
│ │ |
|
|
│ ▼ ▼ ▼ ▼ │ |
|
|
│ ┌──────────────────────────────────────────────────────┐ │ |
|
|
│ │ OpenRouter.ai/api/v1 (Multi-Model Gateway) │ │ |
|
|
│ │ - Unified endpoint for all LLM providers │ │ |
|
|
│ │ - API Key: YOUR__API_KEY (env var) │ │ |
|
|
│ └──────────────────┬───────────────────────────────────┘ │ |
|
|
│ │ │ |
|
|
└─────────────────────┼────────────────────────────────────────────────────────────┘ |
|
|
│ |
|
|
┌─────────────┼─────────────┐ |
|
|
▼ ▼ ▼ |
|
|
┌──────┐ ┌──────┐ ┌──────┐ |
|
|
│Google│ │OpenAI│ │Meta │ |
|
|
│Gemini│ │ APIs │ │Guard │ |
|
|
└──────┘ └──────┘ └──────┘ |
|
|
|
|
|
|
|
|
┌──────────────────────────────────────────────────────────────────────────────────┐ |
|
|
│ 💾 DATA & STORAGE LAYER │ |
|
|
├──────────────────────────────────────────────────────────────────────────────────┤ |
|
|
│ │ |
|
|
│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │ |
|
|
│ │ analytics.csv │ │ Prompts.csv │ │ Prompts_ │ │ |
|
|
│ │ │ │ (Prompt │ │ updated.csv │ │ |
|
|
│ │ • timestamp │ │ Templates) │ │ (Modified │ │ |
|
|
│ │ • result │ │ │ │ Templates) │ │ |
|
|
│ │ • model_used │ │ • prompt │ │ │ │ |
|
|
│ │ │ │ • category │ │ CSV Import/ │ │ |
|
|
│ │ Logging & Track │ │ │ │ Export Support │ │ |
|
|
│ │ Detection History│ │ Customizable │ │ │ │ |
|
|
│ └──────────────────┘ └──────────────────┘ └──────────────────┘ │ |
|
|
│ │ |
|
|
└──────────────────────────────────────────────────────────────────────────────────┘ |
|
|
|
|
|
|
|
|
┌──────────────────────────────────────────────────────────────────────────────────┐ |
|
|
│ 📈 ANALYSIS & OUTPUT PROCESSING LAYER │ |
|
|
├──────────────────────────────────────────────────────────────────────────────────┤ |
|
|
│ │ |
|
|
│ ┌────────────────────────────────────────────────────────┐ │ |
|
|
│ │ JSON Parser & Formatter │ │ |
|
|
│ │ - Extract risk_score (0-100) │ │ |
|
|
│ │ - Parse potential_jailbreak (bool) │ │ |
|
|
│ │ - Extract policy_break_points [array] │ │ |
|
|
│ │ - Identify attack_used (string) │ │ |
|
|
│ └────────────────┬─────────────────────────────────────┘ │ |
|
|
│ │ │ |
|
|
│ ┌────────────────▼─────────────────────────────────────┐ │ |
|
|
│ │ Visual Output Formatter │ │ |
|
|
│ │ - Color-coded risk display (Green/Orange/Red) │ │ |
|
|
│ │ - Markdown rendering │ │ |
|
|
│ │ - HTML formatted output │ │ |
|
|
│ └────────────────┬─────────────────────────────────────┘ │ |
|
|
│ │ │ |
|
|
│ ┌────────────────▼─────────────────────────────────────┐ │ |
|
|
│ │ Dashboard Aggregator │ │ |
|
|
│ │ - Risk score trends (line chart) │ │ |
|
|
│ │ - Result frequency (bar chart) │ │ |
|
|
│ │ - KPI computation (unsafe rate, top model) │ │ |
|
|
│ │ - Recommendations generation │ │ |
|
|
│ └────────────────┬─────────────────────────────────────┘ │ |
|
|
│ │ │ |
|
|
└───────────────────┼────────────────────────────────────────────────────────────┘ |
|
|
▼ |
|
|
┌──────────────────────────────────────────────────────────────────────────────────┐ |
|
|
│ 📊 OUTPUT LAYER │ |
|
|
├──────────────────────────────────────────────────────────────────────────────────┤ |
|
|
│ │ |
|
|
│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │ |
|
|
│ │ Raw JSON │ │ Visual Analysis │ │ Analytics │ │ |
|
|
│ │ Output │ │ Report │ │ Dashboard │ │ |
|
|
│ │ │ │ (Markdown) │ │ │ │ |
|
|
│ │ - Structured │ │ - Risk Score │ │ - Trend Lines │ │ |
|
|
│ │ threat data │ │ - Jailbreak Flag │ │ - Bar Charts │ │ |
|
|
│ │ - Machine │ │ - Policy Breaks │ │ - KPIs │ │ |
|
|
│ │ readable │ │ - Attack Type │ │ - Logs │ │ |
|
|
│ └──────────────────┘ └──────────────────┘ └──────────────────┘ │ |
|
|
│ │ |
|
|
└──────────────────────────────────────────────────────────────────────────────────┘ |
|
|
|
|
|
|
|
|
╔════════════════════════════════════════════════════════════════════════════════════╗ |
|
|
║ 🔄 DATA FLOW EXAMPLES ║ |
|
|
╠════════════════════════════════════════════════════════════════════════════════════╣ |
|
|
║ ║ |
|
|
║ FLOW 1: Chat & Analysis Tab ║ |
|
|
║ User Input → Router → Claude Chat → Response → Detection Engine → ║ |
|
|
║ Risk Analysis → JSON Output + Visual Report → Display ║ |
|
|
║ ║ |
|
|
║ FLOW 2: Image Scanner Tab ║ |
|
|
║ Image Upload → Extract B64 → Vision Model → Template Matching → ║ |
|
|
║ SAFE/UNSAFE Classification → Display & Log ║ |
|
|
║ ║ |
|
|
║ FLOW 3: Text Prompt Tester Tab ║ |
|
|
║ Jailbreak Prompt → Guard Model (Llama Guard / Claude) → ║ |
|
|
║ Raw Response → JSON Parse → Display & Log ║ |
|
|
║ ║ |
|
|
║ FLOW 4: Analytics Dashboard ║ |
|
|
║ Load analytics.csv → DataFrame → Risk Aggregation → ║ |
|
|
║ Render Charts + KPIs → Display Dashboard ║ |
|
|
║ ║ |
|
|
╚════════════════════════════════════════════════════════════════════════════════════╝ |
|
|
|
|
|
|
|
|
┌──────────────────────────────────────────────────────────────────────────────────┐ |
|
|
│ 🛠️ TECHNOLOGY STACK │ |
|
|
├──────────────────────────────────────────────────────────────────────────────────┤ |
|
|
│ │ |
|
|
│ Frontend: Gradio 5.49.1 (Glass Theme) │ |
|
|
│ Backend: Python 3.x + OpenAI Python Client │ |
|
|
│ API Gateway: OpenRouter.ai/api/v1 │ |
|
|
│ Detection: Anthropic Claude Models (Prompt-based) │ |
|
|
│ Data Format: JSON, CSV, Pandas DataFrame │ |
|
|
│ Visualization: Matplotlib (Charts), Markdown (Reports) │ |
|
|
│ Logging: IST Timezone Logging, CSV Storage │ |
|
|
│ Interface: Gradio Blocks (Multi-tab UI) │ |
|
|
│ Deployment: Gradio Share (share=True) + MCP Server Support │ |
|
|
│ │ |
|
|
└──────────────────────────────────────────────────────────────────────────────────┘ |
|
|
|
|
|
|
|
|
╔════════════════════════════════════════════════════════════════════════════════════╗ |
|
|
║ 📋 COMPONENT INTERACTIONS ║ |
|
|
╠════════════════════════════════════════════════════════════════════════════════════╣ |
|
|
║ ║ |
|
|
║ ┌──────────────┐ ┌─────────────────┐ ┌──────────────────┐ ║ |
|
|
║ │ User Input │────────▶│ Gradio Frontend │───────▶│ Request Router │ ║ |
|
|
║ └──────────────┘ └─────────────────┘ └────────┬─────────┘ ║ |
|
|
║ │ ║ |
|
|
║ ┌────────────────────────────┘ ║ |
|
|
║ │ ║ |
|
|
║ ┌──────────────▼──────────────┐ ║ |
|
|
║ │ Detection Engine (Claude) │ ║ |
|
|
║ └──────────────┬──────────────┘ ║ |
|
|
║ │ ║ |
|
|
║ ┌──────────────▼──────────────┐ ║ |
|
|
║ │ OpenRouter Multi-Model API │ ║ |
|
|
║ └──────────────┬──────────────┘ ║ |
|
|
║ │ ║ |
|
|
║ ┌──────────────▼──────────────┐ ║ |
|
|
║ │ Analysis & Formatting Layer │ ║ |
|
|
║ └──────────────┬──────────────┘ ║ |
|
|
║ │ ║ |
|
|
║ ┌──────────────▼──────────────┐ ║ |
|
|
║ │ CSV Logging & Storage │ ║ |
|
|
║ └──────────────┬──────────────┘ ║ |
|
|
║ │ ║ |
|
|
║ ┌──────────────▼──────────────┐ ║ |
|
|
║ │ Dashboard & Output Display │ ║ |
|
|
║ └─────────────────────────────┘ ║ |
|
|
║ ║ |
|
|
╚════════════════════════════════════════════════════════════════════════════════════╝ |
|
|
|
|
|
|
|
|
|
|
|
## ✅ Reminder |
|
|
|
|
|
Falconz is intended **only for ethical security testing** and **AI safety research** as part of MCP Gradio Hackathon. |
|
|
Users are responsible for complying with all laws, policies, and platform terms. |
|
|
|
|
|
🛡️ Build safe. Test responsibly. Protect the future of AI , contact me to [Xhaheen](http://linkedin.com/in/sallu-mandya/) for Collab . |