Collections including paper arxiv:2509.16197

- MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
  Paper • 2509.16197 • Published • 54
- Qwen/Qwen3-Omni-30B-A3B-Instruct
  Any-to-Any • 35B • Updated • 289k • 720
- facebook/dinov3-vitb16-pretrain-lvd1689m
  Image Feature Extraction • 85.7M • Updated • 391k • 77
- nvidia/NV-Embed-v2
  Feature Extraction • 8B • Updated • 196k • 482

- MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
  Paper • 2509.16197 • Published • 54
- InternRobotics/VLAC
  Robotics • 2B • Updated • 41 • 37
- LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence
  Paper • 2509.12203 • Published • 19
- A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning
  Paper • 2509.15937 • Published • 20

- R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
  Paper • 2508.21113 • Published • 109
- Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning
  Paper • 2508.16949 • Published • 22
- EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control
  Paper • 2508.21112 • Published • 75
- UItron: Foundational GUI Agent with Advanced Perception and Planning
  Paper • 2508.21767 • Published • 12

- DocLLM: A layout-aware generative language model for multimodal document understanding
  Paper • 2401.00908 • Published • 189
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
  Paper • 2401.00849 • Published • 17
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 51
- LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
  Paper • 2311.00571 • Published • 43

- Reconstruction Alignment Improves Unified Multimodal Models
  Paper • 2509.07295 • Published • 40
- F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
  Paper • 2509.06951 • Published • 31
- UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
  Paper • 2509.06818 • Published • 29
- Interleaving Reasoning for Better Text-to-Image Generation
  Paper • 2509.06945 • Published • 14

- Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
  Paper • 2508.09789 • Published • 5
- MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
  Paper • 2508.13186 • Published • 18
- ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents
  Paper • 2508.04038 • Published • 1
- Prompt Orchestration Markup Language
  Paper • 2508.13948 • Published • 48

- iVideoGPT: Interactive VideoGPTs are Scalable World Models
  Paper • 2405.15223 • Published • 17
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
  Paper • 2405.15574 • Published • 55
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 90
- Matryoshka Multimodal Models
  Paper • 2405.17430 • Published • 34