Bodega-Solomon-9B
Bodega-Solomon-9B brings multimodal understanding to the Bodega ecosystem. Built on a vision-language architecture, this model processes images, screenshots, documents, and text in a single unified framework. It runs entirely on your hardware as part of Bodega OS, ensuring your visual data never leaves your machine.
Native Multimodal Capabilities
The architecture maintains a 128K token context window for multimodal content.
First-of-Its-Kind Features
Solomon supports native multimodal function calling. Images and documents serve as direct tool inputs—you can pass a screenshot to a function that analyzes UI elements, or feed a diagram to a function that extracts structured data. No text conversion required, no information loss from trying to describe visual content in words.
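To make this concrete, here is a minimal sketch of a request carrying a screenshot and the kind of tool call the model can emit, assuming an OpenAI-style chat schema. The message layout, tool name, and visual-token reference are illustrative, not the documented Bodega interface.

```python
# Minimal sketch of multimodal function calling, assuming an
# OpenAI-style chat schema. The tool name and visual-token
# reference are illustrative, not Bodega's documented interface.
import base64
import json
from pathlib import Path

# "screenshot.png" stands in for any local image.
image_b64 = base64.b64encode(Path("screenshot.png").read_bytes()).decode()

request = {
    "model": "bodega-solomon-9b",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Audit the UI in this screenshot."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
}

# Instead of describing the image in words, the model can answer with a
# tool call whose argument *is* the visual content (shape illustrative):
tool_call = {
    "name": "analyze_ui_elements",  # hypothetical analysis tool
    "arguments": json.dumps({"image": "<visual-token-ref>"}),
}
```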
We extended the Model Context Protocol (MCP) to handle URL-based multimodal content. This allows Solomon to fetch and process images from web sources, local file paths, or retrieval systems as part of its agentic workflows. The model can reason about visual content, decide what additional information it needs, and retrieve that information through the appropriate tools.
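As a rough illustration, a tool result under this extension might carry a URL reference rather than inlined bytes. Standard MCP returns images as base64 `image` content items, so the `image_url` item below is an assumption about the shape of Bodega's extension.

```python
# Sketch of an MCP tool result under the URL-based extension. Standard
# MCP inlines images as base64 "image" items; the "image_url" item here
# is an assumed shape for Bodega's extension, not the official spec.
fetch_result = {
    "content": [
        {"type": "text", "text": "Latest dashboard capture:"},
        # The model resolves this reference itself -- a web source,
        # local file path, or retrieval handle -- as part of its
        # agentic workflow, instead of receiving inlined bytes.
        {"type": "image_url", "url": "file:///var/bodega/captures/dashboard.png"},
    ],
    "isError": False,
}
```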
Training Methodology
Solomon is based on the GLM-4V Flash model, which we continually pre-trained for Bodega's specific use cases. Rather than building a vision-language model from scratch, we took an existing strong foundation and steered it toward the workflows that matter for on-premises AI systems.
Our continued pre-training focused on understanding webviews—the model needed to get good at recognizing UI elements, identifying interactive components, and understanding the structure of web applications. We trained it on agentic workflows for image recognition and automatic tagging, which means the model can look at images and generate structured metadata without human intervention.
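For a sense of what automatic tagging produces, the sketch below shows one way to request and parse structured metadata. The prompt contract and field names are examples, not a fixed schema shipped with the model.

```python
# Illustrative auto-tagging call: ask for structured metadata as JSON
# and parse the reply. Prompt wording and field names are examples.
import json

TAGGING_PROMPT = (
    "Describe this image as JSON with keys: tags (list of strings), "
    "ui_elements (list of {role, label} objects), and contains_text (bool)."
)

def tag_image(chat, image_path: str) -> dict:
    """`chat` is any callable taking (prompt, image_path) and returning
    the model's text reply; client wiring is left to your setup."""
    return json.loads(chat(TAGGING_PROMPT, image_path))

# The kind of reply the workflow expects for a settings screenshot:
example_reply = {
    "tags": ["screenshot", "settings", "dark-mode"],
    "ui_elements": [{"role": "toggle", "label": "Enable notifications"}],
    "contains_text": True,
}
```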
OCR over files became a core capability. The model needed to handle documents in various formats, extract text accurately, and understand the relationship between visual layout and semantic content. This goes beyond simple text extraction—the model understands tables, forms, diagrams, and other structured visual information.
We optimized for the kinds of visual understanding that enable autonomous agents: recognizing when a task succeeded or failed based on visual feedback, extracting structured data from unstructured visual inputs, and making decisions based on what it sees rather than just what it is told.
Architecture and Performance
Solomon is a 9-billion-parameter model, specifically the flash variant optimized for speed. It combines a vision-language architecture with integrated multimodal understanding, a 128K token context window that handles both text and visual content, and MXFP4 quantization for efficient deployment on consumer hardware.
The MLX-based inference engine handles multimodal processing through Apple's unified memory architecture. Images and text share the same memory space, eliminating costly data transfers. The system supports streaming for progressive rendering—you see partial results as the model processes, rather than waiting for complete generation.
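A minimal streaming consumer, assuming Bodega OS exposes an OpenAI-compatible endpoint on localhost (the URL and port below are placeholders), might look like this:

```python
# Progressive rendering from a streaming chat endpoint. The URL and
# port are placeholders for whatever Bodega OS actually exposes; only
# `requests` and the server-sent-events framing are standard here.
import json
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # hypothetical local server
    json={
        "model": "bodega-solomon-9b",
        "messages": [{"role": "user", "content": "Describe this screenshot."}],
        "stream": True,  # request incremental chunks
    },
    stream=True,
)
for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    chunk = line[len(b"data: "):]
    if chunk == b"[DONE]":
        break
    delta = json.loads(chunk)["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)  # partial tokens
```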
Latency remains low for interactive applications. This matters for agentic workflows where the model needs to examine visual feedback, make decisions, and take actions in rapid iteration. Slow vision-language models break the interactive loop that makes agents useful.
Running On-Premises
Solomon runs on your hardware as part of Bodega OS. Your screenshots, your documents, your images—none of it gets uploaded to cloud services for processing. This is critical for visual data, which often contains sensitive information: internal application interfaces, confidential documents, design mockups, personal photos.
The model integrates with Bodega's retrieval engines, allowing it to search through your local image and document collections. You can ask it to find screenshots of specific application states, locate documents containing certain visual elements, or retrieve images related to a concept. All retrieval happens locally, all processing happens locally, all results stay local.
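A local retrieval query could look roughly like the following; the route and response fields are assumptions, shown only to make the point that the entire round trip stays on localhost.

```python
# Hypothetical local retrieval query -- the route and response fields
# are assumptions, not a published Bodega API. Nothing leaves localhost.
import requests

hits = requests.post(
    "http://localhost:8080/v1/retrieval/search",  # assumed route
    json={
        "query": "screenshots of the checkout error state",
        "modalities": ["image", "document"],
        "top_k": 5,
    },
).json()

for hit in hits.get("results", []):
    print(hit["path"], hit["score"])  # local file paths, never remote URLs
```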
What Solomon Does
Solomon excels at image analysis and description with technical precision. It can interpret screenshots and identify UI elements, bugs, and usability issues. Document processing and OCR happen natively—the model reads text from images without requiring separate OCR pipelines. Visual question answering works across technical and general domains.
For multimodal agents, Solomon enables browser automation with visual feedback. The agent can see what the browser displays, verify that actions succeeded, and adapt when things do not go as planned. UI and UX analysis happens automatically—feed it screenshots and it identifies design issues, accessibility problems, and inconsistencies. Design review and feedback become systematic rather than ad-hoc.
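The control flow behind that visual verification is simple. The sketch below keeps the browser driver and model client as injectable callables, since that wiring depends on your setup; only the act-observe-verify loop itself is the point.

```python
# Act-observe-verify loop for browser automation with visual feedback.
# The action, screenshot capture, and model client are passed in as
# callables; this sketch only pins down the control flow.
from typing import Callable

def perform_with_visual_check(
    action: Callable[[], None],              # e.g. click a button via your driver
    capture: Callable[[], bytes],            # returns fresh screenshot bytes
    ask_model: Callable[[str, bytes], str],  # (question, image) -> answer
    expected: str,
    retries: int = 3,
) -> bool:
    """Perform `action`, then let the model judge success from a screenshot."""
    for _ in range(retries):
        action()
        answer = ask_model(
            f"Does this screenshot show: {expected}? Answer yes or no.",
            capture(),
        )
        if answer.strip().lower().startswith("yes"):
            return True   # visual feedback confirms the action landed
    return False          # adapt or escalate after exhausting retries
```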
The model supports visual data extraction from charts, diagrams, tables, and forms. It understands the structure of visual information, not just the pixels. This makes it useful for research assistance across domains where visual information carries meaning that cannot be fully captured in text.
Technical Notes on Vision-Language Architecture
Function calling with visual inputs required extending the standard function calling framework. We modified the tool use protocol to accept visual tokens as function arguments, and trained the model to construct function calls that reference visual content. This enables sophisticated agentic behaviors where the model can pass screenshots to specialized analysis tools, retrieve visual data through search functions, or request specific image processing operations.
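A tool declaration under this extended protocol might mark a parameter as visual. The `visual_ref` format marker below is an assumption about how such a reference could be expressed, not the actual wire format.

```python
# Assumed shape of a tool whose argument is visual content. Standard
# JSON-schema tool parameters are text-only; the "visual_ref" format
# marker is a guess at how a visual-token handle might be declared.
extract_chart_data = {
    "type": "function",
    "function": {
        "name": "extract_chart_data",
        "description": "Turn a chart image into rows of (label, value).",
        "parameters": {
            "type": "object",
            "properties": {
                "chart": {
                    "type": "string",
                    "format": "visual_ref",  # assumed: handle to visual
                                             # tokens already in context
                },
                "units": {"type": "string"},
            },
            "required": ["chart"],
        },
    },
}
```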
Disclaimer
SRSWTI is not the creator or owner of the underlying foundation model architecture. The foundation model is created and provided by third parties. SRSWTI has trained this model on top of the foundation model but does not endorse, support, represent, or guarantee the completeness, truthfulness, accuracy, or reliability of any outputs. You understand that this model can produce content that might be offensive, harmful, inaccurate, deceptive, or otherwise inappropriate. SRSWTI may not monitor or control all model outputs and cannot, and does not, take responsibility for any such outputs. SRSWTI disclaims all warranties and guarantees about the accuracy, reliability, or benefits of this model. SRSWTI further disclaims any warranty that the model will meet your requirements, be secure, uninterrupted, or available at any time or location, or be error-free or virus-free, or that any errors will be corrected. You are solely responsible for any damage resulting from your use of or access to this model, your downloading of this model, or your use of this model provided by or through SRSWTI.
Crafted by the Bodega team at SRSWTI Research Labs
Building the world's fastest inference and retrieval engines
Making AI accessible, efficient, and powerful for everyone
