---
library_name: nanovlm
license: mit
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- research
- twin-tower
---
# Twin-Tower VLM

**Twin-Tower VLM** is a vision-language model built on a twin-tower architecture: a separate vision tower processes images and produces per-layer contexts, which are then fed into a frozen language tower for text generation.
## Architecture
The twin-tower architecture consists of:
1. **Vision Tower**: Processes images through vision encoder → modality projector → decoder layers to create per-layer contexts
2. **Language Tower**: A frozen language model that receives the per-layer vision contexts and generates text (a toy sketch of this flow follows the list)
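
The snippet below is a minimal PyTorch sketch of this idea, not the code shipped with this checkpoint: the module names, layer count, context length, and hidden sizes (`ToyVisionTower`, `ToyLanguageTower`, 30 layers, 576-dim states) are illustrative placeholders. It shows a vision tower emitting one small block of context tokens per language-model layer, and a frozen decoder stack that prepends each layer's context to the text hidden states.

```python
import torch
import torch.nn as nn


class ToyVisionTower(nn.Module):
    """Encodes pooled image features and emits one context block per LM layer (toy example)."""

    def __init__(self, img_dim=768, lm_dim=576, n_lm_layers=30, ctx_tokens=4):
        super().__init__()
        self.encoder = nn.Linear(img_dim, img_dim)  # stand-in for the vision encoder
        # One modality projector per language-model layer
        self.projectors = nn.ModuleList(
            [nn.Linear(img_dim, lm_dim * ctx_tokens) for _ in range(n_lm_layers)]
        )
        self.ctx_tokens = ctx_tokens
        self.lm_dim = lm_dim

    def forward(self, image_feats):  # image_feats: (B, img_dim)
        h = torch.tanh(self.encoder(image_feats))
        # Per-layer contexts: a list of (B, ctx_tokens, lm_dim) tensors, one per decoder layer
        return [p(h).view(-1, self.ctx_tokens, self.lm_dim) for p in self.projectors]


class ToyLanguageTower(nn.Module):
    """A frozen decoder stack; each layer sees its own vision context prepended (toy example)."""

    def __init__(self, lm_dim=576, n_layers=30):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True) for _ in range(n_layers)]
        )
        for p in self.parameters():  # freeze every language-tower weight
            p.requires_grad_(False)

    def forward(self, text_hidden, per_layer_ctx):  # text_hidden: (B, T, lm_dim)
        for layer, ctx in zip(self.layers, per_layer_ctx):
            x = torch.cat([ctx, text_hidden], dim=1)   # inject this layer's vision context
            text_hidden = layer(x)[:, ctx.size(1):]    # keep only the text positions
        return text_hidden
```

Because the contexts enter the frozen tower as ordinary differentiable inputs, the training loss can still backpropagate into the vision tower even though every language-tower weight is frozen.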
## Key Features
- **Twin-Tower Design**: Separate processing of vision and language with per-layer context integration
- **Frozen Language Tower**: The language model's parameters are frozen; gradients flow back through the vision contexts
- **Per-Layer Contexts**: Vision tower generates contexts for each language model layer
- **Efficient Training**: Only the vision tower components are trainable (a toy training step is sketched below)
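
Continuing the toy modules from the sketch above (again illustrative, not this repository's training loop), a single training step hands only vision-tower parameters to the optimizer:

```python
import torch

# Only vision-tower parameters are handed to the optimizer; the language tower stays frozen.
vision_tower = ToyVisionTower()
language_tower = ToyLanguageTower()
optimizer = torch.optim.AdamW(
    [p for p in vision_tower.parameters() if p.requires_grad], lr=1e-4
)

image_feats = torch.randn(2, 768)       # dummy pooled image features
text_hidden = torch.randn(2, 16, 576)   # dummy text hidden states

per_layer_ctx = vision_tower(image_feats)
out = language_tower(text_hidden, per_layer_ctx)

loss = out.pow(2).mean()                # placeholder loss for illustration
loss.backward()                         # gradients reach the vision tower via the contexts
optimizer.step()
```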
## Usage
```python
from PIL import Image

from config import VLMConfig
from twin_tower import VisionLanguageTwinTowerModel

# Load the model
cfg = VLMConfig()
model = VisionLanguageTwinTowerModel.from_pretrained(cfg)

# Generate text from an image
image = Image.open("your_image.jpg")
result = model.generate_from_text("What is in this image?", image)
print(result)
```
## Model Details
- **Base Model**: patrickamadeus/nanoVLM-230M-8k-twin-maxxing-15000
- **Architecture**: Twin-Tower VLM
- **Vision Encoder**: SigLIP-based
- **Language Model**: SmolLM2-based
- **Parameters**: ~230M total (vision tower trainable, language tower frozen; see the quick check below)
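
As a quick sanity check of that split, assuming `model` is the object loaded in the usage example and behaves like a standard `torch.nn.Module`:

```python
# Hypothetical check: compare total vs. trainable parameter counts
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total: {total / 1e6:.1f}M, trainable: {trainable / 1e6:.1f}M")
```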
For more information, check out the base nanoVLM model: https://huggingface.co/lusxvr/nanoVLM-222M.