MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research Paper • 2503.13399 • Published Mar 17 • 22
Temporal Preference Optimization for Long-Form Video Understanding Paper • 2501.13919 • Published Jan 23 • 23
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature Paper • 2501.07171 • Published Jan 13 • 55
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration Paper • 2412.13180 • Published Dec 17, 2024 • 13
Apollo: An Exploration of Video Understanding in Large Multimodal Models Paper • 2412.10360 • Published Dec 13, 2024 • 147
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision Paper • 2407.06189 • Published Jul 8, 2024 • 26