OnePiece123
's Collections
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large
Language Models
Paper
•
2406.17294
•
Published
•
11
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and
Understanding
Paper
•
2406.19389
•
Published
•
54
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything
Model
Paper
•
2406.20076
•
Published
•
10
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of
Audio Events in Text-to-audio Generation
Paper
•
2407.02869
•
Published
•
21
Unveiling Encoder-Free Vision-Language Models
Paper
•
2406.11832
•
Published
•
54
FunAudioLLM: Voice Understanding and Generation Foundation Models for
Natural Interaction Between Humans and LLMs
Paper
•
2407.04051
•
Published
•
39
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for
Interleaved Image-Text Generation
Paper
•
2407.06135
•
Published
•
23
Vision language models are blind
Paper
•
2407.06581
•
Published
•
84
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large
Multimodal Models
Paper
•
2407.07895
•
Published
•
42
SEED-Story: Multimodal Long Story Generation with Large Language Model
Paper
•
2407.08683
•
Published
•
24
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
Paper
•
2505.23762
•
Published
•
45