nielsr (HF Staff) committed
Commit 98a4f05 · verified · parent: 96d1e4c

Add model card with pipeline tag, library name and GitHub README content


This PR adds a model card, linking it to the paper and the code. It also adds the appropriate pipeline tag and library name.

Files changed (1)
  1. README.md +140 -3
README.md CHANGED
@@ -1,3 +1,140 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ pipeline_tag: text-to-image
+ library_name: transformers
+ ---
+
+ <div align="center">
+
+ <img src="assets/logo.png" width="30%"/>
+
+ <h3>UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding</h3>
+
+ [Yang Jiao](https://sxjyjay.github.io/)<sup>1,2</sup>, &nbsp; [Haibo Qiu](https://haibo-qiu.github.io/)<sup>3</sup>, &nbsp; [Zequn Jie](https://scholar.google.com/citations?user=4sKGNB0AAAAJ&hl=zh-CN&oi=sra)<sup>3</sup>, &nbsp; [Shaoxiang Chen](https://scholar.google.com/citations?user=WL5mbfEAAAAJ&hl=zh-CN)<sup>3</sup>, &nbsp; [Jingjing Chen](https://jingjing1.github.io/)<sup>1,2</sup>, &nbsp; <br>
+ [Lin Ma](https://forestlinma.com/)<sup>3</sup>, &nbsp; [Yu-Gang Jiang](https://fvl.fudan.edu.cn/)<sup>1,2</sup>
+
+ <sup>1</sup>Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University &nbsp; <br>
+ <sup>2</sup>Shanghai Collaborative Innovation Center on Intelligent Visual Computing &nbsp; <br>
+ <sup>3</sup>Meituan
+
+ [![UniToken](https://img.shields.io/badge/Paper-UniToken-d32f2f.svg?logo=arXiv)](https://arxiv.org/abs/2504.04423)&#160;
+ <a href='https://huggingface.co/OceanJay/UniToken-AnyRes-StageII'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face%20-models-blue'></a><br>
+
+ </div>
+
+ <img src="assets/demo.png">
+
+ ## 📣 News
+ - **[2025-04-02] 🎉🎉🎉 UniToken [paper](https://arxiv.org/abs/2504.04423) is accepted to a CVPR 2025 workshop! 🎉🎉🎉**
+ - **[2025-04-01] 🎉🎉🎉 We release the [recaptioned text prompts](https://huggingface.co/datasets/OceanJay/rewrite_geneval_t2icompbench) of GenEval and T2I-CompBench! 🎉🎉🎉**
+ - **[2025-02-16] 🎉🎉🎉 The UniToken [paper](https://arxiv.org/abs/2504.04423) and training code are released! 🎉🎉🎉**
+ ## 🛠️ Installation
+
+ See [INSTALL.md](./INSTALL.md) for detailed instructions.
+
+ ## 🎓 Training
+ See [unitoken/TRAIN.md](unitoken/TRAIN.md).
+
+ ## 🤖 Inference
+
+ ### Preparation
+
+ Download the original [VQ-VAE weights](https://github.com/facebookresearch/chameleon), [Lumina-mGPT-512](https://huggingface.co/Alpha-VLLM/Lumina-mGPT-7B-512), and [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384), and place them in the following directory layout:
+ ```
+ UniToken
+ - unitoken/
+   - ckpts/
+     - chameleon/
+       - tokenizer/
+         - text_tokenizer.json
+         - vqgan.yaml
+         - vqgan.ckpt
+     - Lumina-mGPT-7B-512/
+     - SigLIP/
+ - xllmx/
+ - ...
+ ```
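+
+ The two Hugging Face checkpoints can also be fetched programmatically, e.g. with `huggingface_hub` (a minimal sketch; the Chameleon VQ-VAE weights are distributed through Meta's request form and must be obtained separately):
+
+ ```python
+ # Sketch: download the Lumina-mGPT and SigLIP checkpoints into the layout above.
+ # The Chameleon tokenizer/VQ-VAE files are NOT on the Hub and must be requested from Meta.
+ from huggingface_hub import snapshot_download
+
+ snapshot_download(
+     repo_id="Alpha-VLLM/Lumina-mGPT-7B-512",
+     local_dir="unitoken/ckpts/Lumina-mGPT-7B-512",
+ )
+ snapshot_download(
+     repo_id="google/siglip-so400m-patch14-384",
+     local_dir="unitoken/ckpts/SigLIP",
+ )
+ ```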
+
+ ### Simple Inference
+
+ The simplest code for UniToken inference:
+
+ ```python
+ from inference_solver_anyres import FlexARInferenceSolverAnyRes
+ from PIL import Image
+
+ # ******************** Image Generation ********************
+ inference_solver = FlexARInferenceSolverAnyRes(
+     model_path="OceanJay/UniToken-AnyRes-StageII",
+     precision="bf16",
+     target_size=512,
+ )
+
+ q1 = "Generate an image according to the following prompt:\n" \
+      "A majestic phoenix with fiery wings soaring above a tranquil mountain lake, casting shimmering reflections on the water. Sparks and embers trail behind it as the sky glows with hues of orange and gold."
+
+ # generated: tuple of (generated response, list of generated images)
+ generated = inference_solver.generate_img(
+     images=[],
+     qas=[[q1, None]],
+     max_gen_len=1536,
+     temperature=1.0,
+     logits_processor=inference_solver.create_logits_processor(cfg=3.0, image_top_k=4000),
+ )
+
+ a1, new_image = generated[0], generated[1][0]
+
+ # ******************* Image Understanding ******************
+ inference_solver = FlexARInferenceSolverAnyRes(
+     model_path="OceanJay/UniToken-AnyRes-StageII",
+     precision="bf16",
+     target_size=512,
+ )
+
+ # The "<|image|>" symbol will be replaced with a sequence of image tokens before being fed to the LLM.
+ q1 = "<|image|>Please describe the details of the image as much as possible."
+
+ images = [Image.open("../assets/1.png").convert('RGB')]
+ qas = [[q1, None]]
+
+ # `len(images)` should equal the number of occurrences of "<|image|>" in qas.
+ generated = inference_solver.generate(
+     images=images,
+     qas=qas,
+     max_gen_len=512,
+     temperature=1.0,
+     logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
+ )
+
+ a1 = generated[0]
+ # generated[1], namely the list of newly generated images, should typically be empty in this case.
+ ```
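+
+ `new_image` is a standard `PIL.Image`, so the generation result can be saved with ordinary Pillow calls (a small usage sketch; the output filename is illustrative):
+
+ ```python
+ # Persist the generated image and inspect the accompanying text response.
+ new_image.save("generated_phoenix.png")
+ print(a1)
+ ```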
+
+ ## 🤗 Checkpoints
+
+ | Model | Hugging Face |
+ | ----- | ------------ |
+ | UniToken-base-StageI | [OceanJay/UniToken-base-StageI](https://huggingface.co/OceanJay/UniToken-base-StageI) |
+ | UniToken-base-StageII | [OceanJay/UniToken-base-StageII](https://huggingface.co/OceanJay/UniToken-base-StageII) |
+ | UniToken-AnyRes-StageI | [OceanJay/UniToken-AnyRes-StageI](https://huggingface.co/OceanJay/UniToken-AnyRes-StageI) |
+ | UniToken-AnyRes-StageII | [OceanJay/UniToken-AnyRes-StageII](https://huggingface.co/OceanJay/UniToken-AnyRes-StageII) |
+
+ ## 📚 Datasets
+ We've observed that existing text-to-image generation models struggle with the short text prompts used in benchmarks such as GenEval and T2I-CompBench++. To address this, we have rewritten these prompts to be more descriptive and shared the enhanced version on [Hugging Face](https://huggingface.co/datasets/OceanJay/rewrite_geneval_t2icompbench). We encourage you to try it out and see the improvements for your own model!
+
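+ The rewritten prompts can be loaded with the `datasets` library (a sketch; the split and field names depend on the dataset's actual schema, so inspect the printed structure first):
+
+ ```python
+ from datasets import load_dataset
+
+ # Load the recaptioned GenEval / T2I-CompBench prompts and inspect the schema.
+ ds = load_dataset("OceanJay/rewrite_geneval_t2icompbench")
+ print(ds)                # available splits and columns
+ split = next(iter(ds))   # first available split name
+ print(ds[split][0])      # one rewritten prompt record
+ ```
+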
+ ## 🙏 Acknowledgement
+
+ We sincerely appreciate [Lumina-mGPT](https://github.com/Alpha-VLLM/Lumina-mGPT) for providing high-quality training code, as well as [Emu3](https://github.com/baaivision/Emu3) and [Janus](https://github.com/deepseek-ai/Janus) for releasing pretrained checkpoints for evaluation.
+
+ ## 📄 Citation
+
+ ```bibtex
+ @misc{jiao2025unitoken,
+       title={UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding},
+       author={Yang Jiao and Haibo Qiu and Zequn Jie and Shaoxiang Chen and Jingjing Chen and Lin Ma and Yu-Gang Jiang},
+       year={2025},
+       eprint={2504.04423},
+       archivePrefix={arXiv}
+ }
+ ```