MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
Paper • 2509.16197
Similarity, Classification
Visualize image patch similarity, as in the DINOv3 presentation
Create and enrich datasets with AI
Real-time video captioning powered by FastVLM
Demo of Talk2DINO, a model presented at ICCV 2025.
Detect and segment objects in images using text prompts, visual prompts, or no prompt at all
Generate depth maps from images
Run code and analyze data in a Jupyter notebook
Describe masked parts of images