Collections
Discover the best community collections!
Collections including paper arxiv:2507.01006
-
GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies
Paper • 2506.14477 • Published -
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
Paper • 2406.10227 • Published • 9 -
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
Paper • 2506.03143 • Published • 53 -
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
Paper • 2503.21620 • Published • 62
-
WorldVLA: Towards Autoregressive Action World Model
Paper • 2506.21539 • Published • 40 -
Fast and Simplex: 2-Simplicial Attention in Triton
Paper • 2507.02754 • Published • 26 -
IntFold: A Controllable Foundation Model for General and Specialized Biomolecular Structure Prediction
Paper • 2507.02025 • Published • 35 -
Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact
Paper • 2507.00951 • Published • 24
-
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning
Paper • 2506.04207 • Published • 48 -
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
Paper • 2504.11468 • Published • 30 -
RLPR: Extrapolating RLVR to General Domains without Verifiers
Paper • 2506.18254 • Published • 31 -
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
Paper • 2507.05255 • Published • 74
-
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
Paper • 2506.23918 • Published • 88 -
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
Paper • 2504.16030 • Published • 36 -
Time Blindness: Why Video-Language Models Can't See What Humans Can?
Paper • 2505.24867 • Published • 80 -
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
Paper • 2507.01006 • Published • 238
-
MiMo-VL Technical Report
Paper • 2506.03569 • Published • 80 -
MMSearch-R1: Incentivizing LMMs to Search
Paper • 2506.20670 • Published • 64 -
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
Paper • 2507.01006 • Published • 238 -
Radial Attention: O(nlog n) Sparse Attention with Energy Decay for Long Video Generation
Paper • 2506.19852 • Published • 41
-
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
Paper • 2505.15966 • Published • 53 -
GRIT: Teaching MLLMs to Think with Images
Paper • 2505.15879 • Published • 12 -
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
Paper • 2505.16854 • Published • 11 -
VLM-R^3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
Paper • 2505.16192 • Published • 12
-
GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies
Paper • 2506.14477 • Published -
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
Paper • 2406.10227 • Published • 9 -
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
Paper • 2506.03143 • Published • 53 -
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
Paper • 2503.21620 • Published • 62
-
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
Paper • 2506.23918 • Published • 88 -
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
Paper • 2504.16030 • Published • 36 -
Time Blindness: Why Video-Language Models Can't See What Humans Can?
Paper • 2505.24867 • Published • 80 -
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
Paper • 2507.01006 • Published • 238
-
WorldVLA: Towards Autoregressive Action World Model
Paper • 2506.21539 • Published • 40 -
Fast and Simplex: 2-Simplicial Attention in Triton
Paper • 2507.02754 • Published • 26 -
IntFold: A Controllable Foundation Model for General and Specialized Biomolecular Structure Prediction
Paper • 2507.02025 • Published • 35 -
Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact
Paper • 2507.00951 • Published • 24
-
MiMo-VL Technical Report
Paper • 2506.03569 • Published • 80 -
MMSearch-R1: Incentivizing LMMs to Search
Paper • 2506.20670 • Published • 64 -
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
Paper • 2507.01006 • Published • 238 -
Radial Attention: O(nlog n) Sparse Attention with Energy Decay for Long Video Generation
Paper • 2506.19852 • Published • 41
-
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning
Paper • 2506.04207 • Published • 48 -
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
Paper • 2504.11468 • Published • 30 -
RLPR: Extrapolating RLVR to General Domains without Verifiers
Paper • 2506.18254 • Published • 31 -
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
Paper • 2507.05255 • Published • 74
-
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
Paper • 2505.15966 • Published • 53 -
GRIT: Teaching MLLMs to Think with Images
Paper • 2505.15879 • Published • 12 -
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
Paper • 2505.16854 • Published • 11 -
VLM-R^3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
Paper • 2505.16192 • Published • 12