Collections
Collections including paper arxiv:2408.15998

- Flowing from Words to Pixels: A Framework for Cross-Modality Evolution (Paper • 2412.15213 • Published • 28)
- No More Adam: Learning Rate Scaling at Initialization is All You Need (Paper • 2412.11768 • Published • 43)
- Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference (Paper • 2412.13663 • Published • 157)
- Autoregressive Video Generation without Vector Quantization (Paper • 2412.14169 • Published • 14)

- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders (Paper • 2408.15998 • Published • 87)
- General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model (Paper • 2409.01704 • Published • 83)
- Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers (Paper • 2408.06195 • Published • 73)
- Self-Reflection in LLM Agents: Effects on Problem-Solving Performance (Paper • 2405.06682 • Published • 3)

- Building and better understanding vision-language models: insights and future directions (Paper • 2408.12637 • Published • 133)
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (Paper • 2408.11039 • Published • 63)
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming (Paper • 2408.16725 • Published • 52)
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders (Paper • 2408.15998 • Published • 87)

- LongVILA: Scaling Long-Context Visual Language Models for Long Videos (Paper • 2408.10188 • Published • 52)
- xGen-MM (BLIP-3): A Family of Open Large Multimodal Models (Paper • 2408.08872 • Published • 100)
- Building and better understanding vision-language models: insights and future directions (Paper • 2408.12637 • Published • 133)
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (Paper • 2408.12528 • Published • 51)

- PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation (Paper • 2408.07547 • Published • 8)
- DeepSpeak Dataset v1.0 (Paper • 2408.05366 • Published • 14)
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders (Paper • 2408.15998 • Published • 87)
- Zero-shot Cross-lingual Voice Transfer for TTS (Paper • 2409.13910 • Published • 10)

- Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities (Paper • 2401.14405 • Published • 13)
- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs (Paper • 2406.18521 • Published • 29)
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations (Paper • 2408.12590 • Published • 36)
- Law of Vision Representation in MLLMs (Paper • 2408.16357 • Published • 95)