Collections
Collections including paper arxiv:2501.12326
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents
  Paper • 2501.12326 • Published • 65
- MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
  Paper • 2401.04081 • Published • 73
- Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
  Paper • 2501.16295 • Published • 8
- BlackMamba: Mixture of Experts for State-Space Models
  Paper • 2402.01771 • Published • 25

- Video Creation by Demonstration
  Paper • 2412.09551 • Published • 9
- DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
  Paper • 2412.07589 • Published • 48
- Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
  Paper • 2412.06531 • Published • 72
- APOLLO: SGD-like Memory, AdamW-level Performance
  Paper • 2412.05270 • Published • 38

- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
  Paper • 2404.05719 • Published • 83
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent
  Paper • 2411.17465 • Published • 90
- Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms
  Paper • 2410.18967 • Published • 1
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
  Paper • 2410.23218 • Published • 49

- MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
  Paper • 2501.02955 • Published • 44
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
  Paper • 2501.00958 • Published • 107
- MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
  Paper • 2501.12380 • Published • 85
- VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
  Paper • 2501.09781 • Published • 28

- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
  Paper • 2412.04454 • Published • 71
- Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
  Paper • 2410.05243 • Published • 20
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents
  Paper • 2501.12326 • Published • 65

- MobA: A Two-Level Agent System for Efficient Mobile Task Automation
  Paper • 2410.13757 • Published • 33
- Agent S: An Open Agentic Framework that Uses Computers Like a Human
  Paper • 2410.08164 • Published • 26
- WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration
  Paper • 2408.15978 • Published
- Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents
  Paper • 2409.17140 • Published