---
library_name: nanovlm
license: mit
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- research
- twin-tower
---
# Twin-Tower VLM

**Twin-Tower VLM** is a vision-language model built on a twin-tower architecture: a separate vision tower processes images and produces per-layer contexts, which are then fed into a frozen language tower for text generation.
## Architecture
The twin-tower architecture consists of:
1. **Vision Tower**: Processes images through vision encoder → modality projector → decoder layers to create per-layer contexts
2. **Language Tower**: A frozen language model that receives the per-layer vision contexts and generates text (a toy sketch of this flow follows the list)
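
The snippet below is a minimal PyTorch sketch of this idea, not the code shipped with this checkpoint: the module names, layer count, context length, and hidden sizes (`ToyVisionTower`, `ToyLanguageTower`, 30 layers, 576-dim states) are illustrative placeholders. It shows a vision tower emitting one small block of context tokens per language-model layer, and a frozen decoder stack that prepends each layer's context to the text hidden states.

```python
import torch
import torch.nn as nn


class ToyVisionTower(nn.Module):
    """Encodes pooled image features and emits one context block per LM layer (toy example)."""

    def __init__(self, img_dim=768, lm_dim=576, n_lm_layers=30, ctx_tokens=4):
        super().__init__()
        self.encoder = nn.Linear(img_dim, img_dim)  # stand-in for the vision encoder
        # One modality projector per language-model layer
        self.projectors = nn.ModuleList(
            [nn.Linear(img_dim, lm_dim * ctx_tokens) for _ in range(n_lm_layers)]
        )
        self.ctx_tokens = ctx_tokens
        self.lm_dim = lm_dim

    def forward(self, image_feats):  # image_feats: (B, img_dim)
        h = torch.tanh(self.encoder(image_feats))
        # Per-layer contexts: a list of (B, ctx_tokens, lm_dim) tensors, one per decoder layer
        return [p(h).view(-1, self.ctx_tokens, self.lm_dim) for p in self.projectors]


class ToyLanguageTower(nn.Module):
    """A frozen decoder stack; each layer sees its own vision context prepended (toy example)."""

    def __init__(self, lm_dim=576, n_layers=30):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True) for _ in range(n_layers)]
        )
        for p in self.parameters():  # freeze every language-tower weight
            p.requires_grad_(False)

    def forward(self, text_hidden, per_layer_ctx):  # text_hidden: (B, T, lm_dim)
        for layer, ctx in zip(self.layers, per_layer_ctx):
            x = torch.cat([ctx, text_hidden], dim=1)   # inject this layer's vision context
            text_hidden = layer(x)[:, ctx.size(1):]    # keep only the text positions
        return text_hidden
```

Because the contexts enter the frozen tower as ordinary differentiable inputs, the training loss can still backpropagate into the vision tower even though every language-tower weight is frozen.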
## Key Features
- **Twin-Tower Design**: Separate processing of vision and language with per-layer context integration
- **Frozen Language Tower**: The language model's parameters are frozen; gradients flow back through the vision contexts
- **Per-Layer Contexts**: Vision tower generates contexts for each language model layer
- **Efficient Training**: Only the vision tower components are trainable (a toy training step is sketched below)
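
Continuing the toy modules from the sketch above (again illustrative, not this repository's training loop), a single training step hands only vision-tower parameters to the optimizer:

```python
import torch

# Only vision-tower parameters are handed to the optimizer; the language tower stays frozen.
vision_tower = ToyVisionTower()
language_tower = ToyLanguageTower()
optimizer = torch.optim.AdamW(
    [p for p in vision_tower.parameters() if p.requires_grad], lr=1e-4
)

image_feats = torch.randn(2, 768)       # dummy pooled image features
text_hidden = torch.randn(2, 16, 576)   # dummy text hidden states

per_layer_ctx = vision_tower(image_feats)
out = language_tower(text_hidden, per_layer_ctx)

loss = out.pow(2).mean()                # placeholder loss for illustration
loss.backward()                         # gradients reach the vision tower via the contexts
optimizer.step()
```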
## Usage
```python
from PIL import Image

from config import VLMConfig
from twin_tower import VisionLanguageTwinTowerModel

# Load the model
cfg = VLMConfig()
model = VisionLanguageTwinTowerModel.from_pretrained(cfg)

# Generate text from an image
image = Image.open("your_image.jpg")
result = model.generate_from_text("What is in this image?", image)
print(result)
```
## Model Details
- **Base Model**: patrickamadeus/nanoVLM-230M-8k-twin-maxxing-15000
- **Architecture**: Twin-Tower VLM
- **Vision Encoder**: SigLIP-based
- **Language Model**: SmolLM2-based
- **Parameters**: ~230M total (vision tower trainable, language tower frozen; see the quick check below)
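
As a quick sanity check of that split, assuming `model` is the object loaded in the usage example and behaves like a standard `torch.nn.Module`:

```python
# Hypothetical check: compare total vs. trainable parameter counts
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total: {total / 1e6:.1f}M, trainable: {trainable / 1e6:.1f}M")
```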
For more information, check out the base nanoVLM model: https://huggingface.co/lusxvr/nanoVLM-222M.