---
library_name: nanovlm
license: mit
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- research
- twin-tower
---

**Twin-Tower VLM** is a vision-language model built on a twin-tower architecture: a separate vision tower processes images and produces per-layer contexts, which are then injected into a frozen language tower for text generation.

## Architecture

The twin-tower architecture consists of two components, sketched in code below:

1. **Vision Tower**: processes images through a vision encoder → modality projector → decoder layers to produce per-layer contexts
2. **Language Tower**: a frozen language model that receives the vision contexts and generates text
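
A minimal PyTorch sketch of this flow, under simplifying assumptions: the class names, dimensions, and context-injection scheme below are illustrative stand-ins, not the repository's actual API.

```python
import torch
import torch.nn as nn

class VisionTower(nn.Module):
    """Illustrative only: encodes an image and emits one context per LM layer."""
    def __init__(self, img_dim: int, lm_dim: int, n_lm_layers: int, n_ctx_tokens: int = 16):
        super().__init__()
        self.encoder = nn.Linear(img_dim, lm_dim)    # stand-in for a SigLIP-style encoder
        self.projector = nn.Linear(lm_dim, lm_dim)   # modality projector
        # One head per language-model layer produces that layer's context tokens
        self.per_layer_heads = nn.ModuleList(
            nn.Linear(lm_dim, lm_dim) for _ in range(n_lm_layers)
        )
        self.n_ctx_tokens = n_ctx_tokens

    def forward(self, image_feats: torch.Tensor) -> list[torch.Tensor]:
        # image_feats: (batch, n_patches, img_dim)
        h = self.projector(self.encoder(image_feats))
        ctx = h[:, : self.n_ctx_tokens]              # keep a fixed number of context tokens
        return [head(ctx) for head in self.per_layer_heads]

class FrozenLanguageTower(nn.Module):
    """Illustrative only: frozen layers, each prepending its own vision context."""
    def __init__(self, lm_dim: int, n_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )
        self.requires_grad_(False)                   # language tower stays frozen

    def forward(self, tok_embeds: torch.Tensor, contexts: list[torch.Tensor]) -> torch.Tensor:
        h = tok_embeds
        for layer, ctx in zip(self.layers, contexts):
            # Prepend this layer's context, run the layer, then drop the context slots
            h = layer(torch.cat([ctx, h], dim=1))[:, ctx.shape[1]:]
        return h

vision = VisionTower(img_dim=768, lm_dim=512, n_lm_layers=4)
frozen_lm = FrozenLanguageTower(lm_dim=512, n_layers=4)
contexts = vision(torch.randn(1, 196, 768))            # one context per LM layer
hidden = frozen_lm(torch.randn(1, 10, 512), contexts)  # (1, 10, 512)
```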

## Key Features

- **Twin-Tower Design**: vision and language are processed separately, with per-layer context integration
- **Frozen Language Tower**: the language model's parameters stay frozen; gradients flow only through the vision contexts (see the sketch after this list)
- **Per-Layer Contexts**: the vision tower generates one context per language-model layer
- **Efficient Training**: only the vision-tower components are trainable
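
The gradient flow can be demonstrated in isolation. This sketch abstracts each tower to a single linear layer purely for illustration; only the freeze-and-optimize pattern is the point.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: any trainable vision tower and frozen language tower
vision_tower = nn.Linear(768, 512)                          # trainable
language_tower = nn.Linear(512, 512).requires_grad_(False)  # frozen

# Only the vision tower's parameters go to the optimizer
optimizer = torch.optim.AdamW(vision_tower.parameters(), lr=1e-4)

contexts = vision_tower(torch.randn(2, 196, 768))  # vision contexts
out = language_tower(contexts)                     # frozen tower consumes them

# Dummy loss: the frozen tower's weights receive no gradient, but the graph
# still carries gradients back through `contexts` into the vision tower.
out.sum().backward()
optimizer.step()

assert all(p.grad is None for p in language_tower.parameters())
assert all(p.grad is not None for p in vision_tower.parameters())
```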

## Usage

```python
from PIL import Image

from twin_tower import VisionLanguageTwinTowerModel
from config import VLMConfig

# Load the model with its configuration
cfg = VLMConfig()
model = VisionLanguageTwinTowerModel.from_pretrained(cfg)

# Generate text from an image
image = Image.open("your_image.jpg")
result = model.generate_from_text("What is in this image?", image)
print(result)
```

## Model Details

- **Base Model**: patrickamadeus/nanoVLM-230M-8k-twin-maxxing-15000
- **Architecture**: Twin-Tower VLM
- **Vision Encoder**: SigLIP-based
- **Language Model**: SmolLM2-based
- **Parameters**: ~230M total (vision tower trainable, language tower frozen)

For more information, check out the base nanoVLM model: https://huggingface.co/lusxvr/nanoVLM-222M