Models
Datasets
Spaces
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2503.16365

Perception and abstraction. Each modality is tokenized and embedded into vectors for model to comprehend.

VILA^2: VILA Augmented VILA

Paper • 2407.17453 • Published Jul 24, 2024 • 41
Octopus v4: Graph of language models

Paper • 2404.19296 • Published Apr 30, 2024 • 118
Octo-planner: On-device Language Model for Planner-Action Agents

Paper • 2406.18082 • Published Jun 26, 2024 • 48
Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models

Paper • 2408.15518 • Published Aug 28, 2024 • 42

about 8 hours ago

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 28
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 14
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 44
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 23

Vision Language Models for Robotics

Unified Vision-Language-Action Model

Paper • 2506.19850 • Published Jun 24 • 27
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Paper • 2506.01844 • Published Jun 2 • 141
3D-VLA: A 3D Vision-Language-Action Generative World Model

Paper • 2403.09631 • Published Mar 14, 2024 • 11
QUAR-VLA: Vision-Language-Action Model for Quadruped Robots

Paper • 2312.14457 • Published Dec 22, 2023 • 1

Vision-Language-Action Models in Minecraft.

CraftJarvis/JarvisVLA-Qwen2-VL-7B

Image-Text-to-Text • 8B • Updated Mar 21 • 182 • 10
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

Paper • 2503.16365 • Published Mar 20 • 40
Running

11

Minecraft VLM Leaderboard

🏢

11

Search and view Minecraft LLM leaderboard
CraftJarvis/minecraft-vla-sft

Viewer • Updated Mar 21 • 3.78M • 1.83k • 8

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Paper • 2310.11441 • Published Oct 17, 2023 • 29
UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Paper • 2501.12326 • Published Jan 21 • 65
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

Paper • 2406.08451 • Published Jun 12, 2024 • 25
GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

Paper • 2406.10819 • Published Jun 16, 2024 • 2

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Paper • 2405.15223 • Published May 24, 2024 • 17
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Paper • 2405.15574 • Published May 24, 2024 • 55
An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27, 2024 • 90
Matryoshka Multimodal Models

Paper • 2405.17430 • Published May 27, 2024 • 34

about 4 hours ago

DocLLM: A layout-aware generative language model for multimodal document understanding

Paper • 2401.00908 • Published Dec 31, 2023 • 189
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Paper • 2401.00849 • Published Jan 1, 2024 • 17
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Paper • 2311.05437 • Published Nov 9, 2023 • 51
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

Paper • 2311.00571 • Published Nov 1, 2023 • 43

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Paper • 2504.05118 • Published Apr 7 • 26
T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models

Paper • 2504.04718 • Published Apr 7 • 42
SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement

Paper • 2504.03561 • Published Apr 4 • 18
Concept Lancet: Image Editing with Compositional Representation Transplant

Paper • 2504.02828 • Published Apr 3 • 16

RuCCoD: Towards Automated ICD Coding in Russian

Paper • 2502.21263 • Published Feb 28 • 133
Unified Reward Model for Multimodal Understanding and Generation

Paper • 2503.05236 • Published Mar 7 • 123
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching

Paper • 2503.05179 • Published Mar 7 • 46
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

Paper • 2503.05592 • Published Mar 7 • 27

LLM Pruning and Distillation in Practice: The Minitron Approach

Paper • 2408.11796 • Published Aug 21, 2024 • 57
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Paper • 2408.09174 • Published Aug 17, 2024 • 52
To Code, or Not To Code? Exploring Impact of Code in Pre-training

Paper • 2408.10914 • Published Aug 20, 2024 • 44
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications

Paper • 2408.11878 • Published Aug 20, 2024 • 63

Perception and abstraction. Each modality is tokenized and embedded into vectors for model to comprehend.

VILA^2: VILA Augmented VILA

Paper • 2407.17453 • Published Jul 24, 2024 • 41
Octopus v4: Graph of language models

Paper • 2404.19296 • Published Apr 30, 2024 • 118
Octo-planner: On-device Language Model for Planner-Action Agents

Paper • 2406.18082 • Published Jun 26, 2024 • 48
Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models

Paper • 2408.15518 • Published Aug 28, 2024 • 42

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Paper • 2405.15223 • Published May 24, 2024 • 17
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Paper • 2405.15574 • Published May 24, 2024 • 55
An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27, 2024 • 90
Matryoshka Multimodal Models

Paper • 2405.17430 • Published May 27, 2024 • 34

about 8 hours ago

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 28
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 14
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 44
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 23

about 4 hours ago

DocLLM: A layout-aware generative language model for multimodal document understanding

Paper • 2401.00908 • Published Dec 31, 2023 • 189
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Paper • 2401.00849 • Published Jan 1, 2024 • 17
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Paper • 2311.05437 • Published Nov 9, 2023 • 51
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

Paper • 2311.00571 • Published Nov 1, 2023 • 43

Vision Language Models for Robotics

Unified Vision-Language-Action Model

Paper • 2506.19850 • Published Jun 24 • 27
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Paper • 2506.01844 • Published Jun 2 • 141
3D-VLA: A 3D Vision-Language-Action Generative World Model

Paper • 2403.09631 • Published Mar 14, 2024 • 11
QUAR-VLA: Vision-Language-Action Model for Quadruped Robots

Paper • 2312.14457 • Published Dec 22, 2023 • 1

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Paper • 2504.05118 • Published Apr 7 • 26
T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models

Paper • 2504.04718 • Published Apr 7 • 42
SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement

Paper • 2504.03561 • Published Apr 4 • 18
Concept Lancet: Image Editing with Compositional Representation Transplant

Paper • 2504.02828 • Published Apr 3 • 16

Vision-Language-Action Models in Minecraft.

CraftJarvis/JarvisVLA-Qwen2-VL-7B

Image-Text-to-Text • 8B • Updated Mar 21 • 182 • 10
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

Paper • 2503.16365 • Published Mar 20 • 40
Running

11

Minecraft VLM Leaderboard

🏢

11

Search and view Minecraft LLM leaderboard
CraftJarvis/minecraft-vla-sft

Viewer • Updated Mar 21 • 3.78M • 1.83k • 8

RuCCoD: Towards Automated ICD Coding in Russian

Paper • 2502.21263 • Published Feb 28 • 133
Unified Reward Model for Multimodal Understanding and Generation

Paper • 2503.05236 • Published Mar 7 • 123
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching

Paper • 2503.05179 • Published Mar 7 • 46
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

Paper • 2503.05592 • Published Mar 7 • 27

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Paper • 2310.11441 • Published Oct 17, 2023 • 29
UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Paper • 2501.12326 • Published Jan 21 • 65
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

Paper • 2406.08451 • Published Jun 12, 2024 • 25
GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

Paper • 2406.10819 • Published Jun 16, 2024 • 2

LLM Pruning and Distillation in Practice: The Minitron Approach

Paper • 2408.11796 • Published Aug 21, 2024 • 57
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Paper • 2408.09174 • Published Aug 17, 2024 • 52
To Code, or Not To Code? Exploring Impact of Code in Pre-training

Paper • 2408.10914 • Published Aug 20, 2024 • 44
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications

Paper • 2408.11878 • Published Aug 20, 2024 • 63

Previous
1
2
Next

Company

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs