Update README.md
README.md
CHANGED
---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: any-to-any
library_name: bagel-mot
---

<p align="center">
  <img src="https://lf3-static.bytednsdoc.com/obj/eden-cn/nuhojubrps/banner.png" alt="BAGEL" width="480"/>
</p>

# 🥯 BAGEL: Unified Model for Multimodal Understanding and Generation

<p align="center">
  <a href="https://bagel-ai.org/">
    <img src="https://img.shields.io/badge/BAGEL-Website-0A66C2?logo=safari&logoColor=white" alt="BAGEL Website" />
  </a>
  <a href="https://arxiv.org/abs/2505.14683">
    <img src="https://img.shields.io/badge/BAGEL-Paper-red?logo=arxiv&logoColor=red" alt="BAGEL Paper on arXiv" />
  </a>
  <a href="https://github.com/bytedance-seed/BAGEL">
    <img src="https://img.shields.io/badge/BAGEL-Codebase-536af5?logo=github" alt="BAGEL Codebase" />
  </a>
  <a href="https://demo.bagel-ai.org/">
    <img src="https://img.shields.io/badge/BAGEL-Demo-blue?logo=googleplay&logoColor=white" alt="BAGEL Demo" />
  </a>
  <a href="https://discord.com/invite/Z836xxzy">
    <img src="https://img.shields.io/badge/BAGEL-Discord-green?logo=discord&logoColor=white" alt="BAGEL Discord" />
  </a>
</p>

---

We present **BAGEL**, an open‑source multimodal foundation model with **7B active parameters (14B total)** trained on large‑scale interleaved multimodal data.

**BAGEL** outperforms leading open‑source VLMs like **Qwen2.5-VL** and **InternVL-2.5** on standard benchmarks and delivers text‑to‑image quality competitive with specialist generators such as **SD3**.

It supports:

- Free-form **visual manipulation**
- **Multiview synthesis**
- **World navigation**
- Advanced **image editing** beyond traditional models

---

### 🔧 Installation & Usage

Please refer to our [GitHub repository](https://github.com/bytedance-seed/BAGEL) for:

- Setup instructions
- Example scripts
- Demo usage
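
If you only need the checkpoint, it can be fetched from the Hugging Face Hub with `huggingface_hub`. A minimal sketch; the repo id `ByteDance-Seed/BAGEL-7B-MoT` is an assumption here, so substitute the id shown at the top of this model card:

```python
# Minimal download sketch. Only the repo id is an assumption; substitute the
# actual id of this model card. Requires `pip install huggingface_hub`.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ByteDance-Seed/BAGEL-7B-MoT",  # assumed repo id of these weights
    local_dir="BAGEL-7B-MoT",               # target folder for the checkpoint
)
print(f"Checkpoint downloaded to: {local_dir}")
```

Loading and running the model (understanding, generation, editing) is handled by the scripts in the GitHub codebase linked above; this snippet only fetches the weights.
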

---

<p align="center">
  <img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/teaser.webp" width="80%"/>
</p>

---

## 🧠 Method

**BAGEL** uses a **Mixture-of-Transformer-Experts (MoT)** architecture (see the toy sketch below) with:

- Dual encoders capturing **pixel-level** and **semantic-level** features
- Training objective: **Next Group of Token Prediction**
- Vision token compression via the [FLUX.1 VAE](https://huggingface.co/black-forest-labs/FLUX.1-schnell)
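
To make the MoT routing concrete, here is a toy, self-contained PyTorch sketch (not BAGEL's actual implementation; all names and sizes are illustrative): every token attends to the full interleaved sequence through shared self-attention, while each modality's hidden states pass through their own feed-forward expert.

```python
# Toy Mixture-of-Transformer-Experts block: shared attention, per-modality FFN
# experts. Purely illustrative; this is not BAGEL's actual code or configuration.
import torch
import torch.nn as nn


class MoTBlock(nn.Module):
    def __init__(self, dim: int = 64, n_heads: int = 4, n_experts: int = 2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # One feed-forward expert per modality (e.g. 0 = text, 1 = vision).
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # Shared self-attention over the whole interleaved sequence.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Hard routing by token type: each token goes to its modality's expert.
        h = self.norm2(x)
        out = torch.zeros_like(x)
        for idx, expert in enumerate(self.experts):
            mask = modality == idx
            if mask.any():
                out[mask] = expert(h[mask])
        return x + out


# Example: a sequence of 6 text tokens followed by 10 vision tokens.
tokens = torch.randn(1, 16, 64)
modality = torch.tensor([[0] * 6 + [1] * 10])
print(MoTBlock()(tokens, modality).shape)  # torch.Size([1, 16, 64])
```

In the real model the interleaved sequence is fed by the two encoders named above (pixel-level features via the FLUX.1 VAE, semantic-level features via the SigLIP encoder listed under License), and training predicts the next group of tokens; the sketch only shows the shared-attention / per-modality-expert routing pattern.
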

<p align="center">
  <img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/arch.png" width="50%"/>
</p>

---

## 🌱 Emerging Properties

<p align="center">
  <img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/emerging_curves.png" width="50%"/>
</p>

Performance improves as pretraining scales, with capabilities emerging in sequence:

- Multimodal understanding
- Generation
- Basic image editing
- Advanced multimodal reasoning and 3D/world modeling

---

## 📊 Benchmarks

### 🖼️ Visual Understanding

| Model | MME ↑ | MMBench ↑ | MMMU ↑ | MM-Vet ↑ | MathVista ↑ |
|------------------|-------:|-----------:|--------:|----------:|-------------:|
| Janus-Pro-7B | – | 79.2 | 41.0 | 50.0 | – |
| Qwen2.5-VL-7B | 2347 | 83.5 | **58.6** | 67.1 | 68.2 |
| **BAGEL** | **2388** | **85.0** | 55.3 | **67.2** | **73.1** |

---

### 🖌️ Text-to-Image Generation (GenEval)

| Model | Overall ↑ |
|--------------|-----------|
| FLUX-1-dev | 0.82 |
| SD3-Medium | 0.74 |
| Janus-Pro-7B | 0.80 |
| **BAGEL** | **0.88** |

---

### 🪄 Image Editing

| Model | GEdit-Bench-EN (SC) ↑ | GEdit-Bench-EN (PQ) ↑ | GEdit-Bench-EN (O) ↑ | IntelligentBench ↑ |
|---------------|-----------------------|------------------------|----------------------|---------------------|
| Step1X-Edit | 7.09 | 6.76 | **6.70** | 14.9 |
| Gemini-2-exp. | 6.73 | 6.61 | 6.32 | **57.6** |
| **BAGEL** | **7.36** | **6.83** | 6.52 | 44.0 |
| **BAGEL+CoT** | – | – | – | 55.3 |

---

## ⚖️ License

BAGEL is licensed under the **Apache 2.0 License**. It builds on the following components, all released under Apache 2.0:

- Finetuned from [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
- Finetuned from [siglip-so400m-14-384-flash-attn2](https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2)
- Uses the [FLUX.1-schnell VAE](https://huggingface.co/black-forest-labs/FLUX.1-schnell)

---

## 📚 Citation

```bibtex
@article{deng2025bagel,
  title   = {Emerging Properties in Unified Multimodal Pretraining},
  journal = {arXiv preprint arXiv:2505.14683},
  year    = {2025}
}
```