Revisiting Multimodal Positional Encoding in Vision-Language Models • Paper • 2510.23095 • Published 12 days ago • 18
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation • Paper • 2511.02778 • Published 4 days ago • 95
ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks • Paper • 2510.18455 • Published 18 days ago • 17
StreamingVLM: Real-Time Understanding for Infinite Video Streams • Paper • 2510.09608 • Published 29 days ago • 49
Paper2Video: Automatic Video Generation from Scientific Papers • Paper • 2510.05096 • Published Oct 6 • 110
Code2Video: A Code-centric Paradigm for Educational Video Generation • Paper • 2510.01174 • Published Oct 1 • 33
Robix: A Unified Model for Robot Interaction, Reasoning and Planning • Paper • 2509.01106 • Published Sep 1 • 48
Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination • Paper • 2509.01986 • Published Sep 2 • 4
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning • Paper • 2509.02544 • Published Sep 2 • 123
HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context • Paper • 2506.21277 • Published Jun 26 • 15
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding • Paper • 2505.22618 • Published May 28 • 44
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning • Paper • 2505.23380 • Published May 29 • 22