Models
Datasets
Spaces
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2412.05271

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 29
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 14
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 44
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 23

LinFusion: 1 GPU, 1 Minute, 16K Image

Paper • 2409.02097 • Published Sep 3, 2024 • 34
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion

Paper • 2409.11406 • Published Sep 17, 2024 • 27
Diffusion Models Are Real-Time Game Engines

Paper • 2408.14837 • Published Aug 27, 2024 • 126
Segment Anything with Multiple Modalities

Paper • 2408.09085 • Published Aug 17, 2024 • 22

One-Minute Video Generation with Test-Time Training

Paper • 2504.05298 • Published Apr 7 • 110
MoCha: Towards Movie-Grade Talking Character Synthesis

Paper • 2503.23307 • Published Mar 30 • 138
Towards Understanding Camera Motions in Any Video

Paper • 2504.15376 • Published Apr 21 • 158
Antidistillation Sampling

Paper • 2504.13146 • Published Apr 17 • 59

lang-uk/recruitment-dataset-job-descriptions-english

Viewer • Updated Jun 2, 2024 • 142k • 525 • 15
lang-uk/recruitment-dataset-candidate-profiles-english

Viewer • Updated Jun 2, 2024 • 210k • 153 • 10
cnamuangtoun/resume-job-description-fit

Viewer • Updated Jul 25, 2024 • 8k • 635 • 72
shashu2325/resume-job-matcher-lora

Updated Apr 15 • 1.73k • 11

Vision Language Models

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 159
Video ReCap: Recursive Captioning of Hour-Long Videos

Paper • 2402.13250 • Published Feb 20, 2024 • 26

DocLLM: A layout-aware generative language model for multimodal document understanding

Paper • 2401.00908 • Published Dec 31, 2023 • 189
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Paper • 2401.00849 • Published Jan 1, 2024 • 17
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Paper • 2311.05437 • Published Nov 9, 2023 • 51
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

Paper • 2311.00571 • Published Nov 1, 2023 • 43

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 159

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 159
naver-clova-ix/cord-v2

Viewer • Updated Jul 19, 2022 • 1k • 3.15k • 107
naver-clova-ix/synthdog-en

Viewer • Updated Jan 31, 2024 • 66k • 1.91k • 23
impira/layoutlm-invoices

Document Question Answering • 0.1B • Updated Mar 25, 2023 • 1.92k • 222

papers about VLM reasoning

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Paper • 2504.08837 • Published Apr 10 • 43
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement

Paper • 2503.17352 • Published Mar 21 • 24
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 159
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcing Learning

Paper • 2503.05379 • Published Mar 7 • 38

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 159

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 29
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 14
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 44
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 23

DocLLM: A layout-aware generative language model for multimodal document understanding

Paper • 2401.00908 • Published Dec 31, 2023 • 189
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Paper • 2401.00849 • Published Jan 1, 2024 • 17
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Paper • 2311.05437 • Published Nov 9, 2023 • 51
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

Paper • 2311.00571 • Published Nov 1, 2023 • 43

LinFusion: 1 GPU, 1 Minute, 16K Image

Paper • 2409.02097 • Published Sep 3, 2024 • 34
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion

Paper • 2409.11406 • Published Sep 17, 2024 • 27
Diffusion Models Are Real-Time Game Engines

Paper • 2408.14837 • Published Aug 27, 2024 • 126
Segment Anything with Multiple Modalities

Paper • 2408.09085 • Published Aug 17, 2024 • 22

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 159

One-Minute Video Generation with Test-Time Training

Paper • 2504.05298 • Published Apr 7 • 110
MoCha: Towards Movie-Grade Talking Character Synthesis

Paper • 2503.23307 • Published Mar 30 • 138
Towards Understanding Camera Motions in Any Video

Paper • 2504.15376 • Published Apr 21 • 158
Antidistillation Sampling

Paper • 2504.13146 • Published Apr 17 • 59

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 159
naver-clova-ix/cord-v2

Viewer • Updated Jul 19, 2022 • 1k • 3.15k • 107
naver-clova-ix/synthdog-en

Viewer • Updated Jan 31, 2024 • 66k • 1.91k • 23
impira/layoutlm-invoices

Document Question Answering • 0.1B • Updated Mar 25, 2023 • 1.92k • 222

lang-uk/recruitment-dataset-job-descriptions-english

Viewer • Updated Jun 2, 2024 • 142k • 525 • 15
lang-uk/recruitment-dataset-candidate-profiles-english

Viewer • Updated Jun 2, 2024 • 210k • 153 • 10
cnamuangtoun/resume-job-description-fit

Viewer • Updated Jul 25, 2024 • 8k • 635 • 72
shashu2325/resume-job-matcher-lora

Updated Apr 15 • 1.73k • 11

papers about VLM reasoning

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Paper • 2504.08837 • Published Apr 10 • 43
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement

Paper • 2503.17352 • Published Mar 21 • 24
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 159
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcing Learning

Paper • 2503.05379 • Published Mar 7 • 38

Vision Language Models

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 159
Video ReCap: Recursive Captioning of Hour-Long Videos

Paper • 2402.13250 • Published Feb 20, 2024 • 26

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 159

Previous
1
2
3
Next

Company

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs