jianwenzh committed on
Commit
42bd386
·
verified ·
1 Parent(s): 263bf32

Upload folder using huggingface_hub

Browse files
LICENSE.md ADDED
@@ -0,0 +1,67 @@
1
+ # Model License, Data Attribution & Disclaimer
2
+
3
+ ## Model License
4
+ This model and its associated weights are released under the
5
+ **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)**.
6
+
7
+ You are free to:
8
+ - **Share** — copy and redistribute the model in any medium or format
9
+ - **Adapt** — remix, transform, and build upon the model
10
+
11
+ Under the following terms:
12
+ - **Attribution** — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
13
+ - **NonCommercial** — You may not use the model for commercial purposes.
14
+ - **ShareAlike** — If you modify, fine-tune, or build upon the model, you must distribute your contributions under the same license.
15
+
16
+ Full license text: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)
17
+
18
+ ---
19
+
20
+ ## Data Sources & Attribution
21
+ This model was trained on **derived data** based on the following publicly available datasets. **No original dataset content is included in this release.**
22
+
23
+ - Datasets under `CC BY-NC-SA 4.0`
24
+
25
+ - [UGroundV1](https://huggingface.co/datasets/osunlp/UGround-V1-Data)
26
+ - [UGround-V1-Data-Box](https://huggingface.co/datasets/osunlp/UGround-V1-Data-Box)
27
+ - [GTA1 grounding dataset](https://huggingface.co/datasets/HelloKKMe/grounding_dataset)
28
+
29
+ - Datasets under `MIT License`
30
+ - [AgentNet](https://huggingface.co/datasets/xlangai/AgentNet)
31
+ - [GUI-R1](https://huggingface.co/datasets/ritzzai/GUI-R1): only used in evaluation.
32
+
33
+ - Datasets under `Apache License 2.0`
34
+ - [Jedi](https://huggingface.co/datasets/xlangai/Jedi)
35
+ - [GUI-Net-Mini](https://huggingface.co/datasets/Bofeee5675/GUI-Net-Mini)
36
+ - [GUI-Net-1M](https://huggingface.co/datasets/Bofeee5675/GUI-Net-1M)
37
+ - [Aguvis-stage1](https://huggingface.co/datasets/xlangai/aguvis-stage1)
38
+ - [Aguvis-stage2](https://huggingface.co/datasets/xlangai/aguvis-stage2)
39
+ - [OS-Atlas-data](https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data)
40
+
41
+ - Unlicensed datasets
42
+ - [DocVQA](https://www.docvqa.org/): Publicly available, no license restrictions.
43
+
44
+
45
+ All rights for these datasets remain with their respective authors and licensors.
46
+
47
+ ---
48
+
49
+ ## Combined Licensing Context
50
+ Because several training datasets are licensed under **CC BY-NC-SA 4.0**, this model must also be distributed under that license.
51
+ Datasets under the MIT and Apache 2.0 licenses are license-compatible and impose no additional share-alike obligations.
52
+
53
+ ---
54
+
55
+ ## Disclaimer
56
+ This model and documentation are provided **“as is”**, without warranty of any kind, express or implied, including but not limited to merchantability, fitness for a particular purpose, and non-infringement.
57
+
58
+ The authors and contributors assume no responsibility for how this model or any derivative works are used.
59
+ Users are solely responsible for ensuring compliance with all applicable dataset licenses, laws, and regulations.
60
+ Commercial use of this model is **not permitted** under the CC BY-NC-SA 4.0 license.
61
+
62
+ ---
63
+
64
+ ## © Copyright
65
+ © Vocaela AI, 2025
66
+
67
+ All rights reserved except as granted under the licenses above.
README.md CHANGED
@@ -1,3 +1,413 @@
1
- ---
2
- license: cc-by-nc-sa-4.0
3
- ---
1
+ ---
2
+ license: cc-by-nc-sa-4.0
3
+ language:
4
+ - en
5
+ base_model:
6
+ - HuggingFaceTB/SmolVLM2-500M-Video-Instruct
7
+ ---
8
+
9
+ # Vocaela-500M: A Tiny Mighty GUI Agent Model
10
+
11
+ **TL;DR:**
12
+ A compact 500M-parameter Vision-Language Model (VLM) designed for GUI agents. Given a screenshot and a simple instruction (e.g., “click the submit button”), it outputs structured JSON actions with precise pixel coordinates. Despite its small size, it delivers surprisingly strong performance—often surpassing much larger models—while running smoothly on laptops and even mobile devices.
13
+
14
+ ## Model description
15
+
16
+ A growing number of models can now operate computer and mobile GUIs on behalf of users. However, most are massive and impractical for everyday devices like laptops or phones. While many GUI agent models chase higher autonomy, Vocaela-500M explores a different path: a smaller, efficient model focused on precise low-level control.
17
+
18
+ Given a screenshot and an explicit instruction such as “click the submit button,” it produces structured JSON actions with pixel coordinates. By narrowing the scope, we maximize efficiency, achieving smooth performance on laptops and even mobile devices.
19
+
20
+ Despite its compact 500M parameters, Vocaela-500M performs surprisingly well on grounding and GUI control tasks—often matching or surpassing larger models. This marks a new step in scaling GUI agent models downward toward lightweight, practical deployment.
21
+
22
+ - Type: Vision-Language Model (VLM) for Computer GUI Agents
23
+ - Size: 500M parameters
24
+ - Input: Screenshot + natural language instruction (specific GUI action)
25
+ - Output: Structured JSON describing GUI action(s), including pixel coordinates
26
+ - Recommended image resolution: longer edge < 2048
27
+ - Fine-tuned from: [HuggingFaceTB/SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) (referred to as `SmolVLM2-500M` below)
28
+ - License: [CC BY-NC-SA 4.0](./LICENSE.md)
29
+ - Developed by: [Vocaela AI](https://vocaela.ai/)
30
+
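To respect the recommended resolution above, screenshots can be downscaled before preprocessing. Below is a minimal sketch of the aspect-preserving resize arithmetic; the helper name is ours, and actual resizing would use an image library:

```python
def vocaela_target_size(w: int, h: int, max_edge: int = 2047):
    """Return (w, h) downscaled so the longer edge is at most max_edge,
    preserving aspect ratio; images already small enough are untouched."""
    longest = max(w, h)
    if longest <= max_edge:
        return w, h
    scale = max_edge / longest
    return max(1, round(w * scale)), max(1, round(h * scale))

print(vocaela_target_size(4096, 1024))  # longer edge capped at 2047
```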
31
+ ## Action space
32
+
33
+ The following table lists the default action schema used during training. Users may extend or redefine it via system prompts.
34
+
35
+ | | Action | Parameters | Parameters' Values | Example | Meaning |
36
+ |:-------------|:--------------|:------------------------|:----------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------|:-|
37
+ | Common | type | text | string, the text to type in | {"action": "type", "text": "example"} | Typing in specified text |
38
+ | | click | coordinate | [x,y], scaled [0, 1), position to click on | {"action": "click", "coordinate": [0.1,0.5]} | Click using mouse or tap using finger at specified position |
39
+ | Desktop Only | mouse_move | coordinate | [x,y], scaled [0, 1), position to move to | {"action": "mouse_move", "coordinate": [0.1,0.5]} | Move mouse to specified position |
40
+ | | drag | coordinate, coordinate2 | [x,y], scaled [0, 1), start (`coordinate`) and end (`coordinate2`) position to drag | {"action": "drag", "coordinate": [0.1,0.5], "coordinate2": [0.2,0.6]} | Drag mouse (click left button and hold) from specified start position to end position |
41
+ | | right_click | coordinate | [x,y], scaled [0, 1), position to click on | {"action": "right_click", "coordinate": [0.1,0.5]} | Click right mouse button at specified position |
42
+ | | middle_click | coordinate | [x,y], scaled [0, 1), position to click on | {"action": "middle_click", "coordinate": [0.1,0.5]} | Click middle mouse button at specified position |
43
+ | | double_click | coordinate | [x,y], scaled [0, 1), position to click on | {"action": "double_click", "coordinate": [0.1,0.5]} | Double click left mouse button at specified position |
44
+ | | scroll | scroll_direction | enum: {'up', 'down'} | {"action": "scroll", "scroll_direction": "up"} | Scroll mouse wheel with specified direction |
45
+ | | press_key | key, presses | `key`: string, the single key to press; `presses`: integer, number of times to press | {"action": "press_key", "key": "enter"} | Press a single key |
46
+ | | hotkey | hotkeys | list of strings, combination of keys to press, e.g., ["ctrl", "c"] | {"action": "hotkey", "hotkeys": ["ctrl", "c"]} | Press a hotkey combination, e.g., Ctrl+C |
47
+ | Mobile Only | long_press | coordinate, time | `coordinate`: [x,y], scaled [0, 1), position to press on; `time`: seconds to hold | {"action": "long_press", "coordinate": [0.1,0.5], "time": 5} | Press at specified position and hold for specified time (s) |
48
+ | | swipe | swipe_direction, swipe_from, coordinate | `swipe_direction`: direction to swipe towards, enum {'up', 'down', 'left', 'right'}; `swipe_from`: general area to swipe from, enum {'top', 'bottom', 'left', 'right', 'center', 'top_left', 'top_right', 'bottom_left', 'bottom_right'}; `coordinate`: [x,y], scaled [0, 1), exact position to swipe from. `swipe_from` and `coordinate` are optional. | {"action": "swipe", "coordinate": [0.1,0.5], "swipe_direction": "up"} | Swipe from the specified start position towards the specified direction |
49
+ | | system_button | button | string, system button to press, enum: {'Back', 'Home', 'Menu', 'Enter'} | {"action": "system_button", "button": "Home"} | Press a specified system button |
50
+ | | open | text | string, name of app to open | {"action": "open", "text": "Google Chrome"} | Open a specified app |
51
+
52
+ See Section [System messages](#system-messages) below for examples of how to specify the action space in the prompt.
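Downstream, the model's `<Action>`-wrapped JSON output must be parsed and its normalized coordinates mapped to pixels. A minimal sketch, assuming a known screen size; the helper name and error handling are ours:

```python
import json
import re

def parse_actions(model_output: str, screen_w: int, screen_h: int):
    """Extract the JSON action array from an <Action>...</Action> response
    and convert normalized [0, 1) coordinates to pixel positions."""
    match = re.search(r"<Action>(.*?)</Action>", model_output, re.DOTALL)
    if match is None:
        raise ValueError("no <Action> block found")
    actions = json.loads(match.group(1))
    for act in actions:
        for key in ("coordinate", "coordinate2"):
            if key in act:
                x, y = act[key]
                act[key] = [round(x * screen_w), round(y * screen_h)]
    return actions

out = '<Action>[{"action": "click", "coordinate": [0.1, 0.5]}]</Action>'
print(parse_actions(out, 1920, 1080))  # [{'action': 'click', 'coordinate': [192, 540]}]
```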
53
+
54
+ ## How to use
55
+
56
+ The model is used in the same way as SmolVLM2-500M. The example below shows how to load the model and processor, construct multimodal messages, and perform inference. For system messages, please refer to Section [System messages](#system-messages).
57
+ For a complete runnable example of loading the model, creating messages, preprocessing, and running inference, see the simple demo [vocaela-500m-demo](https://github.com/vocaela/vocaela-500m-demo/blob/main/readme.md).
58
+
59
+
60
+ ```python
61
+ from transformers import AutoProcessor, AutoModelForImageTextToText
62
+ import torch
63
+
64
+ model_path = "Vocaela/Vocaela-500M"
65
+ processor = AutoProcessor.from_pretrained(model_path)
66
+ torch_dtype = torch.float16 # use torch.bfloat16 if your device supports it
67
+ device = 'cuda' # use 'cpu' to run inference on CPU
68
+ _attn_implementation = 'sdpa' # use "flash_attention_2" if it is available in your environment
69
+ model = AutoModelForImageTextToText.from_pretrained(model_path, torch_dtype=torch_dtype, _attn_implementation=_attn_implementation).to(device)
70
+
71
+ # Ensure the 'content' field uses a list format for every message, even for single items; otherwise, apply_chat_template's result will be wrong without raising any exception.
72
+ messages = [
73
+ {
74
+ "role": "system",
75
+ "content": [
76
+ { "type": "text", "text": "<SYSTEM_MESSAGE>"}, # see Section [System messages](#system-messages) for the computer-use or mobile-use message
77
+ ]
78
+ },
79
+ {
80
+ "role": "user",
81
+ "content": [
82
+ {"type": "image", "url": "<image full path>"},
83
+ {"type": "text", "text": "Click the ..."},
84
+ ]
85
+ },
86
+ ]
87
+
88
+ inputs = processor.apply_chat_template(
89
+ messages,
90
+ add_generation_prompt=True,
91
+ tokenize=True,
92
+ return_dict=True,
93
+ return_tensors="pt",
94
+ ).to(model.device, dtype=torch_dtype)
95
+
96
+ generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
97
+ generated_texts = processor.batch_decode(
98
+ generated_ids,
99
+ skip_special_tokens=True,
100
+ )
101
+ print(generated_texts[0])
102
+ ```
103
+
104
+ ## Evaluation results
105
+
106
+ We evaluated the model on two levels of tasks:
107
+ - Grounding: the model is asked to directly output the screen coordinate (x,y) of a target GUI element. Related work has mostly improved this low-level capability by scaling up model size, which makes it all the more impressive how well a tiny model can still perform on grounding.
108
+ - Low-level GUI agent task: the model is asked to execute low-level GUI instructions such as "click the submit button", "type 'diet food' in the search box", "scroll up the page", or "open chrome". Although these are not the popular highly autonomous agentic tasks, the model remains a self-contained "agent" model rather than a "grounding"-only model.
109
+
110
+ ### Grounding evaluation
111
+
112
+ #### Screenspot
113
+
114
+ The following table compares Vocaela-500M with other small (<=4B) specialized GUI models on the ScreenSpot benchmark. Numbers are from original papers/pages unless otherwise noted.
115
+
116
+ | Model | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Overall |
117
+ |:---------------------------------:|:------------:|:------------:|:------------:|:------------:|:--------:|:--------:|:---------:|
118
+ | *General purpose models* |
119
+ | QWen2.5-VL-3B [[1]](#ref1) | - | - | - | - | - | - | 55.5 (from [[10]](#ref10)) |
120
+ | InternVL3.5-4B [[2]](#ref2) | - | - | - | - | - | - | 83.6 |
121
+ | QWen3-VL-4B Instruct [[3]](#ref3) | - | - | - | - | - | - | 94.0 |
122
+ | *Specialized GUI models* |
123
+ | OS-Atlas-4B [[4]](#ref4) | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 70.1 |
124
+ | ShowUI-2B [[5]](#ref5) | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
125
+ | UGround-V1-2B [[6]](#ref6) | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
126
+ | UI-Tars-2B [[7]](#ref7) | 93.0 | 75.5 | 90.7 | 68.6 | 84.3 | 74.8 | 82.3 |
127
+ | <u>**Vocaela-500M**</u> | 92.7 | 70.3 | 90.7 | 75.0 | 87.4 | 78.2 | 83.1 |
128
+ | TongUI-3B [[8]](#ref8) | 92.6 | 77.7 | 92.3 | 77.1 | 87.8 | 74.8 | 83.6 |
129
+ | GUI-Actor-2B [[9]](#ref9) | 93.0 | 79.9 | 88.1 | 78.6 | 90.9 | 84.0 | 86.5 |
130
+ | InfiGUI-R1-3B [[10]](#ref10) | 97.1 | 81.2 | 94.3 | 77.1 | 91.7 | 77.6 | 87.5 |
131
+ | UI-R1-E-3B | 97.1 | 83.0 | 95.4 | 77.9 | 91.7 | 85.0 | 89.2 |
132
+
133
+ #### ScreenspotV2
134
+
135
+ The following table compares Vocaela-500M with other small (<=4B) specialized GUI models on the ScreenSpotV2 benchmark. Numbers are from the [ScreenSpotV2/ScreenSpotPro leaderboard page](https://gui-agent.github.io/grounding-leaderboard/screenspot.html) except Phi-Ground-4B and GUI-Actor-2B from the original papers.
136
+
137
+ | Model | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Overall |
138
+ |:-----------------------------------:|:------------:|:------------:|:------------:|:------------:|:--------:|:--------:|:---------:|
139
+ | *General purpose models* |
140
+ | QWen2.5-VL-3B [[1]](#ref1) | 62.1 | 46.4 | 54.1 | 30.0 | 31.2 | 48.3 | 46.9 |
141
+ | *Specialized GUI models* |
142
+ | OS-Atlas-4B [[4]](#ref4) | 82.8 | 64.0 | 64.4 | 46.4 | 78.6 | 60.1 | 68.5 |
143
+ | ShowUI-2B [[5]](#ref5) | 92.1 | 75.4 | 78.9 | 59.3 | 84.2 | 61.1 | 77.3 |
144
+ | Phi-Ground-4B [[11]](#ref11) | 94.1 | 62.0 | 91.7 | 77.1 | 94.4 | 78.3 | 84.1 |
145
+ | TongUI-3B [[8]](#ref8) | 94.4 | 79.6 | 92.8 | 75.0 | 87.6 | 77.8 | 85.5 |
146
+ | <u>**Vocaela-500M**</u> | 95.9 | 73.93 | 95.4 | 75.7 | 91.0 | 75.4 | 85.8 |
147
+ | ZonUI-3B [[14]](#ref14) | 98.6 | 82.9 | 92.3 | 74.3 | 88.0 | 74.4 | 86.6 |
148
+ | GUI-Actor-2B [[9]](#ref9) | 95.0 | 82.2 | 92.2 | 81.8 | 92.9 | 82.7 | 88.6 |
149
+ | UI-R1-E-3B | 98.2 | 83.9 | 94.8 | 75.0 | 93.2 | 83.7 | 89.5 |
150
+ | Holo1.5-3B [[12]](#ref12) | 99.2 | 88.0 | 95.0 | 89.7 | 91.8 | 84.8 | 91.7 |
151
+
152
+
153
+ #### Showdown
154
+
155
+ The following table compares Vocaela-500M with other small specialized GUI models on the Showdown benchmark. Numbers are from the Phi-Ground-4B [[11]](#ref11) paper except Holo1.5-3B/Holo1.5-7B from their model pages. Compared with ScreenSpot and ScreenSpotV2, the Showdown dataset includes many examples requiring app-specific knowledge rather than pure visual grounding—posing a greater challenge for compact models like Vocaela-500M. Despite this, it still outperforms several 4B–7B models.
156
+
157
+ | Model | Acc |
158
+ |:------------------------------:|:-----------:|
159
+ | OS-Atlas-4B [[4]](#ref4) | 15.8 |
160
+ | SeeClick-9.6B [[15]](#ref15) | 24.6 |
161
+ | OS-Atlas-7B [[4]](#ref4) | 41.1 |
162
+ | UGround-7B [[6]](#ref6) | 46.5 |
163
+ | <u>**Vocaela-500M**</u> | 52.1 |
164
+ | UGround-v1-7B [[6]](#ref6) | 57.8 |
165
+ | Phi-Ground-4B [[11]](#ref11) | 58.2 |
166
+ | UI-TARS-2B [[7]](#ref7) | 59.8 |
167
+ | Phi-Ground-7B [[11]](#ref11) | 62.5 |
168
+ | UI-TARS-7B [[7]](#ref7) | 66.1 |
169
+ | UI-TARS-1.5-7B [[7]](#ref7) | 67.2 |
170
+ | Holo1.5-3B [[12]](#ref12) | 67.5 |
171
+ | Holo1.5-7B [[13]](#ref13) | 72.2 |
172
+
173
+
174
+ ### Low-level agent evaluation
175
+
176
+ Following the convention of related work, we report three metrics:
177
+ - `Type`: Accuracy of predicting the action type, e.g., 'click', 'type', etc.
178
+ - `Grounding`: Coordinate accuracy for actions that require outputting a screen coordinate, such as 'click'.
179
+ - `SR`: The step success rate.
180
+
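As a rough illustration only (the fixed coordinate tolerance and field names below are our assumptions; the benchmarks generally check grounding against the target element's bounding box), per-step scoring might look like:

```python
def step_metrics(pred: dict, gold: dict, tol: float = 0.14):
    """Toy per-step scoring: action-type match, coordinate match within a
    fixed tolerance (a stand-in for bounding-box checks), and step success."""
    type_ok = pred.get("action") == gold.get("action")
    ground_ok = True
    if "coordinate" in gold:
        # A missing predicted coordinate can never fall within tolerance.
        (px, py), (gx, gy) = pred.get("coordinate", (2.0, 2.0)), gold["coordinate"]
        ground_ok = abs(px - gx) <= tol and abs(py - gy) <= tol
    return {"type": type_ok, "grounding": ground_ok, "sr": type_ok and ground_ok}

print(step_metrics({"action": "click", "coordinate": [0.10, 0.50]},
                   {"action": "click", "coordinate": [0.12, 0.50]}))
```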
181
+ #### AndroidControl-Low
182
+
183
+
184
+ | Model | Type | Grounding | SR |
185
+ |:----------------------------:|:------:|:---------:|:------:|
186
+ | OS-Atlas-4B [[4]](#ref4) | 64.58 | 71.19 | 40.62 |
187
+ | OS-Atlas-7B [[4]](#ref4) | 73.00 | 73.37 | 50.94 |
188
+ | GUI-R1-3B [[16]](#ref16) | 83.68 | 81.59 | 64.41 |
189
+ | GUI-R1-7B [[16]](#ref16) | 85.17 | 84.02 | 66.52 |
190
+ | Aria-UI (3.5B act. of 24.9B) [[17]](#ref17) | - | 87.7 | 67.3 |
191
+ | <u>**Vocaela-500M**</u> | 83.98 | 81.52 | 69.68 |
192
+ | Aguvis-7B [[18]](#ref18) | – | – | 80.5 |
193
+ | UI-R1-3B [[19]](#ref19) | 94.3 | 82.6 | 88.5 |
194
+ | UI-TARS-2B [[7]](#ref7) | 98.1 | 87.3 | 89.3 (from [[10]](#ref10)) |
195
+ | InfiGUI-R1-3B [[10]](#ref10) | 96.0 | 93.2 | 92.1 |
196
+
197
+ Numbers other than Vocaela-500M's are taken from their respective references, except UI-TARS-2B, which is from InfiGUI-R1-3B [[10]](#ref10).
198
+
199
+ #### GUI-Act-Web
200
+
201
+ | Model | Type | Grounding | SR |
202
+ |:----------------------------:|:------:|:---------:|:------:|
203
+ | OS-Atlas-4B [[4]](#ref4) | 79.22 | 58.57 | 42.62 |
204
+ | OS-Atlas-7B [[4]](#ref4) | 86.95 | 75.61 | 57.02 |
205
+ | UI-R1-3B [[19]](#ref19) | 75.89 | 79.43 | 67.31 (from [[10]](#ref10)) |
206
+ | GUI-R1-3B [[16]](#ref16) | 89.86 | 87.42 | 76.31 |
207
+ | GUI-R1-7B [[16]](#ref16) | 90.85 | 88.06 | 80.31 |
208
+ | <u>**Vocaela-500M**</u> | 90.28 | 79.71 | 80.43 |
209
+
210
+
211
+ #### OmniAct-Web
212
+ | Model | Type | Grounding | SR |
213
+ |:----------------------------:|:------:|:---------:|:------:|
214
+ | OS-Atlas-4B [[4]](#ref4) | 46.74 | 49.24 | 22.99 |
215
+ | OS-Atlas-7B [[4]](#ref4) | 85.63 | 69.35 | 59.15 |
216
+ | UI-R1-3B [[19]](#ref19) | 75.42 | 61.35 | 61.33 (from [[16]](#ref16)) |
217
+ | <u>**Vocaela-500M**</u> | 88.16 | 72.42 | 67.13 |
218
+ | GUI-R1-3B [[16]](#ref16) | 88.58 | 75.10 | 75.08 |
219
+ | GUI-R1-7B [[16]](#ref16) | 91.16 | 77.29 | 77.35 |
220
+
221
+ #### OmniAct-Desktop
222
+ | Model | Type | Grounding | SR |
223
+ |:----------------------------:|:------:|:---------:|:------:|
224
+ | OS-Atlas-4B [[4]](#ref4) | 63.30 | 42.55 | 26.94 |
225
+ | OS-Atlas-7B [[4]](#ref4) | 90.24 | 62.87 | 56.73 |
226
+ | UI-R1-3B [[19]](#ref19) | 73.41 | 64.12 | 63.98 (from [[16]](#ref16)) |
227
+ | GUI-R1-3B [[16]](#ref16) | 91.86 | 78.37 | 78.31 |
228
+ | <u>**Vocaela-500M**</u> | 89.23 | 83.05 | 79.12 |
229
+ | GUI-R1-7B [[16]](#ref16) | 92.20 | 83.36 | 83.33 |
230
+
231
+
232
+ ## Training strategy
233
+
234
+ The model architecture and configuration remain identical to the base model, except for a slightly customized chat template (see Section [Special tokens & chat template](#special-tokens--chat-template)). Training proceeded in three stages: two stages of Supervised Fine-Tuning (SFT), followed by Reinforcement Fine-Tuning (RFT) using GRPO.
235
+
236
+ SFT Stage 1: ~7M examples from public datasets, after extensive preprocessing, action-space unification, synthesis, and balancing.
237
+
238
+ SFT Stage 2: ~256K examples, after filtering, re-sampling to balance the action distribution, and synthesis to enrich rare actions.
239
+
240
+ RFT: ~40K examples
241
+
242
+ ## Limitations
243
+
244
+ - **Not suitable for high-resolution images**
245
+
246
+ Two factors in the base model [SmolVLM2-500M](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) limit its ability to handle high-resolution images:
247
+ 1. Aggressive pixel shuffle (r=4), compressing 64 pixels into one token.
248
+ 2. Fixed scaling to 2048px on the longest side.
249
+
250
+ Together, they severely impact grounding on high-resolution screens. We evaluated Vocaela-500M on ScreenSpotPro, where its average score is 15.1. Although still better than several 2B/3B/7B models, the relative advantage is much smaller than on ScreenSpot/ScreenSpotV2, which confirms this limitation.
251
+
252
+ - **Not suitable for high-level agentic tasks / lacks reasoning capability**
253
+
254
+ The base model SmolVLM2-500M does not exhibit reasoning capabilities. Given this model's focus on low-level command execution and its very compact size, it was not trained for reasoning either. We evaluated Vocaela-500M on the high-level AndroidControl-High task; as shown below, the result confirms this limitation.
255
+
256
+ | Task | Model | Type | Grounding | SR |
257
+ |:-------------------:|:-------------:|:------:|:---------:|:-----:|
258
+ | AndroidControl-High | Vocaela-500M | 25.9 | 13.1 | 13.1 |
259
+
260
+ - **Loss of general-purpose capabilities**
261
+
262
+ The model is heavily tuned for this specific scenario and hence loses general-purpose capabilities such as chat, QA, and instruction following.
263
+
264
+ - **No video support**
265
+
266
+ The model was not trained with any video data in SFT/RFT.
267
+
268
+ ## System messages
269
+ The system messages below were used in training and are therefore recommended for inference.
270
+
271
+ ### System message for computer use
272
+ ```python
273
+ Vocaela_Computer_Use_System_Message = """
274
+ You are an assistant trained to navigate the computer screen.
275
+ Given a task instruction, a screen observation, and an action history sequence,
276
+ output the next actions and wait for the next observation.
277
+
278
+ ## Allowed ACTION_TYPEs and parameters:
279
+ 1. `PRESS_KEY`: Press one specified key. Two parameters: `key`, string, the single key to press; `presses`, integer, the number of times to press the key (default is 1).
280
+ 2. `TYPE`: Type a string into an element. Parameter: `text`, string, the text to type.
281
+ 3. `MOUSE_MOVE`: Move the mouse cursor to a specified position. Parameter: `coordinate`, formatted as [x,y], the position to move the cursor to.
282
+ 4. `CLICK`: Click left mouse button once on an element. Parameter: `coordinate`, formatted as [x,y], the position to click on.
283
+ 5. `DRAG`: Drag the cursor with the left mouse button pressed, start and end positions are specified. Two parameters: `coordinate`, formatted as [x,y], the start position to drag from; `coordinate2`, formatted as [x2,y2], the end position to drag to.
284
+ 6. `RIGHT_CLICK`: Click right mouse button once on an element. Parameter: `coordinate`, formatted as [x,y], the position to right click on.
285
+ 7. `MIDDLE_CLICK`: Click middle mouse button once on an element. Parameter: `coordinate`, formatted as [x,y], the position to middle click on.
286
+ 8. `DOUBLE_CLICK`: Click left mouse button twice on an element. Parameter: `coordinate`, formatted as [x,y], the position to double click on.
287
+ 9. `SCROLL`: Scroll the screen (via mouse wheel). Parameter: `scroll_direction`, the direction (`up`/`down`/`left`/`right`) to scroll.
288
+ 10. `WAIT`: Wait for several seconds. Parameter: `time`, duration in seconds to wait.
289
+ 11. `TERMINATE`: Terminate the task. Parameter: `status`, the status of the task, `success`/`failure`.
290
+ 12. `REFUSE`: Refuse to perform the task if not feasible. No parameter.
291
+ 13. `HOTKEY`: Press a combination of keys simultaneously. Parameter: `hotkeys`, list of strings, the keys to press together.
292
+
293
+ * NOTE *: The `coordinate` and `coordinate2` parameters (formatted as [x,y]) are the relative coordinates on the screenshot scaled to range of 0-1, [0,0] is the top-left corner and [1,1] is the bottom-right corner.
294
+
295
+ ## Format your response as
296
+ <Action>the next actions</Action>
297
+
298
+ `The next actions` can be one or multiple actions. Format `the next actions` as a JSON array of objects as below, each object is an action:
299
+ [{"action": "<ACTION_TYPE>", "key": "<key>", "presses": <presses>, "hotkeys": ["<hotkeys>"], "text": "<text>", "coordinate": [x,y], "coordinate2": [x2,y2], "time": <time>, "status": "<status>", "scroll_direction": "<scroll_direction>"}]
300
+
301
+ If a parameter is not applicable, don't include it in the JSON object.
302
+ """
303
+ ```
304
+
305
+ ### System message for mobile phone use
306
+ ```python
307
+ Vocaela_Mobile_Use_System_Message = """
308
+ You are an assistant trained to navigate the mobile phone.
309
+ Given a task instruction, a screen observation, and an action history sequence,
310
+ output the next actions and wait for the next observation.
311
+
312
+ ## Allowed ACTION_TYPEs and parameters:
313
+ 1. `CLICK`: Click/tap on the screen. Parameter: `coordinate`, formatted as [x,y], the position to click on.
314
+ 2. `LONG_PRESS`: Long press on the screen. Two parameters: `coordinate`, formatted as [x,y], the position to long press on; `time`, duration in seconds to long press.
315
+ 3. `SWIPE`: Swipe on the screen. Two parameters: `swipe_from`, the start area to swipe from, only allowed value in {'top', 'bottom', 'left', 'right', 'center', `top_left`, `top_right`, `bottom_left`, `bottom_right`}; `swipe_direction`, the direction (`up`/`down`/`left`/`right`) to swipe towards.
316
+ 4. `TYPE`: Type a string into an element. Parameter: `text`, string, the text to type.
317
+ 5. `SYSTEM_BUTTON`: Press a system button. Parameter: `button`, the system button to press, allowed button values: 'Back', 'Home', 'Menu', 'Enter'.
318
+ 6. `OPEN`: Open an app. Parameter: `text`, string, the app name to open.
319
+
320
+ * NOTE *: The `coordinate` parameter (formatted as [x,y]) is the relative coordinates on the screenshot scaled to range of 0-1, [0,0] is the top-left corner and [1,1] is the bottom-right corner.
321
+
322
+ ## Format your response as
323
+ <Action>the next actions</Action>
324
+
325
+ `The next actions` can be one or multiple actions. Format `the next actions` as a JSON array of objects as below, each object is an action:
326
+ [{"action": "<ACTION_TYPE>", "text": "<text>", "coordinate": [x,y], "swipe_from": "<swipe_from>", "swipe_direction": "<swipe_direction>", "button": "<button>"}]
327
+
328
+ If a parameter is not applicable, don't include it in the JSON object.
329
+ """
330
+ ```
331
+
332
+ ## Special tokens & chat template
333
+
334
+ The base model SmolVLM2-500M does not provide special tokens to identify the user or assistant role. To accurately mask user-turn messages in SFT, two existing special tokens were repurposed to mark the beginning and end of an assistant message: `<|reserved_special_token_50|>` for the beginning and `<|reserved_special_token_51|>` for the end. Consequently, if you look into the `chat_template.jinja` file in the model folder, you will find that the chat template adds the prefix token `<|reserved_special_token_50|>` for inference:
335
+ ```
336
+ <|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>
337
+ {% endfor %}{% if add_generation_prompt %}{{ 'Assistant:<|reserved_special_token_50|>' }}{% endif %}
338
+ ```
339
+
340
+ For a normal generation, if you configure tokenizer decoding to keep special tokens, a completed sequence ends with two consecutive special tokens `<|reserved_special_token_51|><end_of_utterance>`, where `<end_of_utterance>` is the base model's default end token and `<|reserved_special_token_51|>` is introduced by our training process.
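If you decode with `skip_special_tokens=False`, these markers can be stripped before parsing the response; a small sketch (the helper name is ours):

```python
ASSISTANT_BEGIN = "<|reserved_special_token_50|>"
ASSISTANT_END = "<|reserved_special_token_51|>"
END_OF_UTTERANCE = "<end_of_utterance>"

def strip_markers(decoded: str) -> str:
    """Remove the assistant-turn marker tokens and the base model's end
    token from a decoded sequence, leaving only the response payload."""
    for tok in (ASSISTANT_BEGIN, ASSISTANT_END, END_OF_UTTERANCE):
        decoded = decoded.replace(tok, "")
    return decoded.strip()

raw = ASSISTANT_BEGIN + '<Action>[{"action": "click"}]</Action>' + ASSISTANT_END + END_OF_UTTERANCE
print(strip_markers(raw))  # <Action>[{"action": "click"}]</Action>
```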
341
+
342
+
343
+ ## License
344
+
345
+ This model is made available under the [CC BY-NC-SA 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/). To comply with the license, you may use, modify, and share the model or derivative works for non-commercial purposes only. **Any derivative works must be shared under the same license.**
346
+
347
+ Reason for adopting the `CC BY-NC-SA 4.0` license: model training used datasets released under `CC BY-NC-SA 4.0`.
348
+
349
+ Please see the full license [here](./LICENSE.md).
350
+
351
+ ## Acknowledgements
352
+
353
+ - Thanks to the Microsoft [*Azure startup credit offer*](https://learn.microsoft.com/en-us/azure/signups/overview) for partially funding the compute
354
+ - Thanks to related projects [Jedi](https://osworld-grounding.github.io/), [TongUI](https://tongui-agent.github.io/), [UGround](https://osu-nlp-group.github.io/UGround/), [Aguvis](https://aguvis-project.github.io/), [OS-ATLAS](https://osatlas.github.io/), [GTA-1](https://github.com/Yan98/GTA1), [OpenCUA](https://github.com/xlang-ai/OpenCUA), [GUI-R1](https://github.com/ritzz-ai/GUI-R1), and others. We leveraged the datasets, code, and insights they shared.
355
+
356
+ ## References
357
+
358
+ <a id="ref1">[1]</a>
359
+ Bai, Shuai, et al. "Qwen2.5-VL technical report." arXiv preprint arXiv:2502.13923 (2025).
360
+
361
+ <a id="ref2">[2]</a>
362
+ Wang, Weiyun, et al. "InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency." arXiv preprint arXiv:2508.18265 (2025).
363
+
364
+ <a id="ref3">[3]</a>
365
+ Qwen3-VL. https://github.com/QwenLM/Qwen3-VL
366
+
367
+ <a id="ref4">[4]</a>
368
+ Wu, Zhiyong, et al. "Os-atlas: A foundation action model for generalist gui agents." arXiv preprint arXiv:2410.23218 (2024).
369
+
370
+ <a id="ref5">[5]</a>
371
+ Lin, Kevin Qinghong, et al. "Showui: One vision-language-action model for gui visual agent." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
372
+
373
+ <a id="ref6">[6]</a>
374
+ Gou, Boyu, et al. "Navigating the digital world as humans do: Universal visual grounding for gui agents." arXiv preprint arXiv:2410.05243 (2024).
375
+
376
+ <a id="ref7">[7]</a>
377
+ Qin, Yujia, et al. "Ui-tars: Pioneering automated gui interaction with native agents." arXiv preprint arXiv:2501.12326 (2025).
378
+
379
+ <a id="ref8">[8]</a>
380
+ Zhang, Bofei, et al. "TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials." arXiv preprint arXiv:2504.12679 (2025).
381
+
382
+ <a id="ref9">[9]</a>
383
+ Wu, Qianhui, et al. "GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents." arXiv preprint arXiv:2506.03143 (2025).
384
+
385
+ <a id="ref10">[10]</a>
386
+ Liu, Yuhang, et al. "Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners." arXiv preprint arXiv:2504.14239 (2025).
387
+
388
+ <a id="ref11">[11]</a>
389
+ Zhang, Miaosen, et al. "Phi-ground tech report: Advancing perception in gui grounding." arXiv preprint arXiv:2507.23779 (2025).
390
+
391
+ <a id="ref12">[12]</a>
392
+ Holo1.5-3B. https://huggingface.co/Hcompany/Holo1.5-3B
393
+
394
+ <a id="ref13">[13]</a>
395
+ Holo1.5-7B. https://huggingface.co/Hcompany/Holo1.5-7B
396
+
397
+ <a id="ref14">[14]</a>
398
+ Hsieh, ZongHan, and Tzer-Jen Wei. "ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding" arXiv preprint arXiv:2506.23491 (2025).
399
+
400
+ <a id="ref15">[15]</a>
401
+ Cheng, Kanzhi, et al. "Seeclick: Harnessing gui grounding for advanced visual gui agents." arXiv preprint arXiv:2401.10935 (2024).
402
+
403
+ <a id="ref16">[16]</a>
404
+ Luo, Run, et al. "Gui-r1: A generalist r1-style vision-language action model for gui agents." arXiv preprint arXiv:2504.10458 (2025).
405
+
406
+ <a id="ref17">[17]</a>
407
+ Yang, Yuhao, et al. "Aria-ui: Visual grounding for gui instructions." arXiv preprint arXiv:2412.16256 (2024).
408
+
409
+ <a id="ref18">[18]</a>
410
+ Xu, Yiheng, et al. "Aguvis: Unified pure vision agents for autonomous gui interaction." arXiv preprint arXiv:2412.04454 (2024).
411
+
412
+ <a id="ref19">[19]</a>
413
+ Lu, Zhengxi, et al. "UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning." arXiv preprint arXiv:2503.21620 (2025).
added_tokens.json ADDED
@@ -0,0 +1,130 @@
+ {
+ "<end_of_utterance>": 49279,
+ "<fake_token_around_image>": 49189,
+ "<global-img>": 49152,
+ "<image>": 49190,
+ "<row_1_col_1>": 49153,
+ "<row_1_col_2>": 49154,
+ "<row_1_col_3>": 49155,
+ "<row_1_col_4>": 49156,
+ "<row_1_col_5>": 49157,
+ "<row_1_col_6>": 49158,
+ "<row_2_col_1>": 49159,
+ "<row_2_col_2>": 49160,
+ "<row_2_col_3>": 49161,
+ "<row_2_col_4>": 49162,
+ "<row_2_col_5>": 49163,
+ "<row_2_col_6>": 49164,
+ "<row_3_col_1>": 49165,
+ "<row_3_col_2>": 49166,
+ "<row_3_col_3>": 49167,
+ "<row_3_col_4>": 49168,
+ "<row_3_col_5>": 49169,
+ "<row_3_col_6>": 49170,
+ "<row_4_col_1>": 49171,
+ "<row_4_col_2>": 49172,
+ "<row_4_col_3>": 49173,
+ "<row_4_col_4>": 49174,
+ "<row_4_col_5>": 49175,
+ "<row_4_col_6>": 49176,
+ "<row_5_col_1>": 49177,
+ "<row_5_col_2>": 49178,
+ "<row_5_col_3>": 49179,
+ "<row_5_col_4>": 49180,
+ "<row_5_col_5>": 49181,
+ "<row_5_col_6>": 49182,
+ "<row_6_col_1>": 49183,
+ "<row_6_col_2>": 49184,
+ "<row_6_col_3>": 49185,
+ "<row_6_col_4>": 49186,
+ "<row_6_col_5>": 49187,
+ "<row_6_col_6>": 49188,
+ "<|reserved_special_token_0|>": 49191,
+ "<|reserved_special_token_10|>": 49201,
+ "<|reserved_special_token_11|>": 49202,
+ "<|reserved_special_token_12|>": 49203,
+ "<|reserved_special_token_13|>": 49204,
+ "<|reserved_special_token_14|>": 49205,
+ "<|reserved_special_token_15|>": 49206,
+ "<|reserved_special_token_16|>": 49207,
+ "<|reserved_special_token_17|>": 49208,
+ "<|reserved_special_token_18|>": 49209,
+ "<|reserved_special_token_19|>": 49210,
+ "<|reserved_special_token_1|>": 49192,
+ "<|reserved_special_token_20|>": 49211,
+ "<|reserved_special_token_21|>": 49212,
+ "<|reserved_special_token_22|>": 49213,
+ "<|reserved_special_token_23|>": 49214,
+ "<|reserved_special_token_24|>": 49215,
+ "<|reserved_special_token_25|>": 49216,
+ "<|reserved_special_token_26|>": 49217,
+ "<|reserved_special_token_27|>": 49218,
+ "<|reserved_special_token_28|>": 49219,
+ "<|reserved_special_token_29|>": 49220,
+ "<|reserved_special_token_2|>": 49193,
+ "<|reserved_special_token_30|>": 49221,
+ "<|reserved_special_token_31|>": 49222,
+ "<|reserved_special_token_32|>": 49223,
+ "<|reserved_special_token_33|>": 49224,
+ "<|reserved_special_token_34|>": 49225,
+ "<|reserved_special_token_35|>": 49226,
+ "<|reserved_special_token_36|>": 49227,
+ "<|reserved_special_token_37|>": 49228,
+ "<|reserved_special_token_38|>": 49229,
+ "<|reserved_special_token_39|>": 49230,
+ "<|reserved_special_token_3|>": 49194,
+ "<|reserved_special_token_40|>": 49231,
+ "<|reserved_special_token_41|>": 49232,
+ "<|reserved_special_token_42|>": 49233,
+ "<|reserved_special_token_43|>": 49234,
+ "<|reserved_special_token_44|>": 49235,
+ "<|reserved_special_token_45|>": 49236,
+ "<|reserved_special_token_46|>": 49237,
+ "<|reserved_special_token_47|>": 49238,
+ "<|reserved_special_token_48|>": 49239,
+ "<|reserved_special_token_49|>": 49240,
+ "<|reserved_special_token_4|>": 49195,
+ "<|reserved_special_token_50|>": 49241,
+ "<|reserved_special_token_51|>": 49242,
+ "<|reserved_special_token_52|>": 49243,
+ "<|reserved_special_token_53|>": 49244,
+ "<|reserved_special_token_54|>": 49245,
+ "<|reserved_special_token_55|>": 49246,
+ "<|reserved_special_token_56|>": 49247,
+ "<|reserved_special_token_57|>": 49248,
+ "<|reserved_special_token_58|>": 49249,
+ "<|reserved_special_token_59|>": 49250,
+ "<|reserved_special_token_5|>": 49196,
+ "<|reserved_special_token_60|>": 49251,
+ "<|reserved_special_token_61|>": 49252,
+ "<|reserved_special_token_62|>": 49253,
+ "<|reserved_special_token_63|>": 49254,
+ "<|reserved_special_token_64|>": 49255,
+ "<|reserved_special_token_65|>": 49256,
+ "<|reserved_special_token_66|>": 49257,
+ "<|reserved_special_token_67|>": 49258,
+ "<|reserved_special_token_68|>": 49259,
+ "<|reserved_special_token_69|>": 49260,
+ "<|reserved_special_token_6|>": 49197,
+ "<|reserved_special_token_70|>": 49261,
+ "<|reserved_special_token_71|>": 49262,
+ "<|reserved_special_token_72|>": 49263,
+ "<|reserved_special_token_73|>": 49264,
+ "<|reserved_special_token_74|>": 49265,
+ "<|reserved_special_token_75|>": 49266,
+ "<|reserved_special_token_76|>": 49267,
+ "<|reserved_special_token_77|>": 49268,
+ "<|reserved_special_token_78|>": 49269,
+ "<|reserved_special_token_79|>": 49270,
+ "<|reserved_special_token_7|>": 49198,
+ "<|reserved_special_token_80|>": 49271,
+ "<|reserved_special_token_81|>": 49272,
+ "<|reserved_special_token_82|>": 49273,
+ "<|reserved_special_token_83|>": 49274,
+ "<|reserved_special_token_84|>": 49275,
+ "<|reserved_special_token_85|>": 49276,
+ "<|reserved_special_token_86|>": 49277,
+ "<|reserved_special_token_87|>": 49278,
+ "<|reserved_special_token_8|>": 49199,
+ "<|reserved_special_token_9|>": 49200
+ }
chat_template.jinja ADDED
@@ -0,0 +1,2 @@
+ <|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>
+ {% endfor %}{% if add_generation_prompt %}{{ 'Assistant:<|reserved_special_token_50|>' }}{% endif %}
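For illustration, the template above can be mirrored by a small pure-Python renderer. This is a sketch of the template's logic only, not the actual code path (in practice the processor's `apply_chat_template` renders the Jinja file itself):

```python
def render_chat(messages, add_generation_prompt=True):
    # Minimal re-implementation of chat_template.jinja for illustration.
    out = "<|im_start|>"
    for msg in messages:
        role = msg["role"].capitalize()
        # The template omits the space after ':' when the turn starts with an image.
        sep = ":" if msg["content"][0]["type"] == "image" else ": "
        body = "".join(
            c["text"] if c["type"] == "text" else "<image>"
            for c in msg["content"]
        )
        out += f"{role}{sep}{body}<end_of_utterance>\n"
    if add_generation_prompt:
        out += "Assistant:<|reserved_special_token_50|>"
    return out

msgs = [{"role": "user",
         "content": [{"type": "image"},
                     {"type": "text", "text": "Click the OK button."}]}]
print(render_chat(msgs))
# <|im_start|>User:<image>Click the OK button.<end_of_utterance>
# Assistant:<|reserved_special_token_50|>
```

Note that generation is primed with `<|reserved_special_token_50|>` rather than a plain `Assistant:` prefix.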
config.json ADDED
@@ -0,0 +1,158 @@
+ {
+ "architectures": [
+ "SmolVLMForConditionalGeneration"
+ ],
+ "bos_token_id": 1,
+ "dtype": "bfloat16",
+ "eos_token_id": 49279,
+ "image_token_id": 49190,
+ "model_type": "smolvlm",
+ "pad_token_id": 2,
+ "scale_factor": 4,
+ "text_config": {
+ "_flash_attn_2_enabled": true,
+ "_name_or_path": "None",
+ "architectures": [
+ "VLlama3ForCausalLM"
+ ],
+ "attention_bias": false,
+ "attention_dropout": 0.0,
+ "dtype": "bfloat16",
+ "head_dim": 64,
+ "hidden_act": "silu",
+ "hidden_size": 960,
+ "initializer_range": 0.02,
+ "intermediate_size": 2560,
+ "is_llama_config": true,
+ "max_position_embeddings": 8192,
+ "mlp_bias": false,
+ "model_type": "llama",
+ "neftune_noise_alpha": 0.0,
+ "num_attention_heads": 15,
+ "num_hidden_layers": 32,
+ "num_key_value_heads": 5,
+ "pad_token_id": 2,
+ "perceiver_config": {
+ "_name_or_path": "",
+ "add_cross_attention": false,
+ "architectures": null,
+ "attention_dropout": 0.0,
+ "bad_words_ids": null,
+ "begin_suppress_tokens": null,
+ "bos_token_id": null,
+ "chunk_size_feed_forward": 0,
+ "cross_attention_hidden_size": null,
+ "decoder_start_token_id": null,
+ "diversity_penalty": 0.0,
+ "do_sample": false,
+ "early_stopping": false,
+ "encoder_no_repeat_ngram_size": 0,
+ "eos_token_id": null,
+ "exponential_decay_length_penalty": null,
+ "finetuning_task": null,
+ "forced_bos_token_id": null,
+ "forced_eos_token_id": null,
+ "hidden_act": "silu",
+ "id2label": {
+ "0": "LABEL_0",
+ "1": "LABEL_1"
+ },
+ "is_decoder": false,
+ "is_encoder_decoder": false,
+ "label2id": {
+ "LABEL_0": 0,
+ "LABEL_1": 1
+ },
+ "length_penalty": 1.0,
+ "max_length": 20,
+ "min_length": 0,
+ "model_type": "vllama3",
+ "no_repeat_ngram_size": 0,
+ "num_beam_groups": 1,
+ "num_beams": 1,
+ "num_key_value_heads": 1,
+ "num_return_sequences": 1,
+ "output_attentions": false,
+ "output_hidden_states": false,
+ "output_scores": false,
+ "pad_token_id": null,
+ "prefix": null,
+ "problem_type": null,
+ "pruned_heads": {},
+ "qk_layer_norms_perceiver": false,
+ "remove_invalid_values": false,
+ "repetition_penalty": 1.0,
+ "resampler_depth": 6,
+ "resampler_head_dim": 96,
+ "resampler_n_heads": 16,
+ "resampler_n_latents": 64,
+ "return_dict": true,
+ "return_dict_in_generate": false,
+ "sep_token_id": null,
+ "suppress_tokens": null,
+ "task_specific_params": null,
+ "temperature": 1.0,
+ "tf_legacy_loss": false,
+ "tie_encoder_decoder": false,
+ "tie_word_embeddings": true,
+ "tokenizer_class": null,
+ "top_k": 50,
+ "top_p": 1.0,
+ "torch_dtype": null,
+ "torchscript": false,
+ "transformers_version": "4.46.0",
+ "typical_p": 1.0,
+ "use_bfloat16": false
+ },
+ "pixel_shuffle_factor": 4,
+ "pretraining_tp": 1,
+ "qk_layer_norms": false,
+ "rms_norm_eps": 1e-05,
+ "rope_interleaved": false,
+ "rope_scaling": null,
+ "rope_theta": 100000,
+ "transformers.js_config": {
+ "kv_cache_dtype": {
+ "fp16": "float16",
+ "q4f16": "float16"
+ }
+ },
+ "use_cache": true,
+ "use_resampler": false,
+ "vocab_size": 49280
+ },
+ "tie_word_embeddings": false,
+ "transformers.js_config": {
+ "kv_cache_dtype": {
+ "fp16": "float16",
+ "q4f16": "float16"
+ }
+ },
+ "transformers_version": "4.56.1",
+ "use_cache": false,
+ "use_reentrant_checkpointing": false,
+ "vision_config": {
+ "attention_dropout": 0.0,
+ "dtype": "bfloat16",
+ "hidden_act": "gelu_pytorch_tanh",
+ "hidden_size": 768,
+ "image_size": 512,
+ "initializer_range": 0.02,
+ "intermediate_size": 3072,
+ "layer_norm_eps": 1e-06,
+ "max_image_size": {
+ "longest_edge": 512
+ },
+ "model_type": "smolvlm_vision",
+ "num_attention_heads": 12,
+ "num_channels": 3,
+ "num_hidden_layers": 12,
+ "patch_size": 16,
+ "size": {
+ "longest_edge": 2048
+ },
+ "tie_word_embeddings": false,
+ "use_base_siglip": false
+ },
+ "vocab_size": 49280
+ }
generation_config.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "_from_model_config": true,
+ "bos_token_id": 1,
+ "eos_token_id": [
+ 49279,
+ 49279
+ ],
+ "pad_token_id": 2,
+ "transformers_version": "4.56.1"
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b56a3303e3d4bbe2cc619d2a59b5f059fc2675adcdf8cfbd1a537e63d2668dca
+ size 1015025832
preprocessor_config.json ADDED
@@ -0,0 +1,35 @@
+ {
+ "do_convert_rgb": true,
+ "do_image_splitting": true,
+ "do_normalize": true,
+ "do_pad": true,
+ "do_rescale": true,
+ "do_resize": true,
+ "image_mean": [
+ 0.5,
+ 0.5,
+ 0.5
+ ],
+ "image_processor_type": "SmolVLMImageProcessor",
+ "image_std": [
+ 0.5,
+ 0.5,
+ 0.5
+ ],
+ "max_image_size": {
+ "longest_edge": 512
+ },
+ "processor_class": "SmolVLMProcessor",
+ "resample": 1,
+ "rescale_factor": 0.00392156862745098,
+ "size": {
+ "longest_edge": 2048
+ },
+ "video_sampling": {
+ "fps": 1,
+ "max_frames": 64,
+ "video_size": {
+ "longest_edge": 512
+ }
+ }
+ }
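The rescale and normalize settings above map raw uint8 pixel values into roughly [-1, 1]: `rescale_factor` is 1/255, and with mean and std both 0.5 the standardization doubles and recenters the [0, 1] range. A minimal sketch of that arithmetic (the real work happens inside `SmolVLMImageProcessor`):

```python
MEAN = STD = 0.5
RESCALE = 1 / 255  # matches "rescale_factor": 0.00392156862745098

def normalize_pixel(p):
    # Rescale from [0, 255] to [0, 1], then standardize per preprocessor_config.json.
    return (p * RESCALE - MEAN) / STD

print(normalize_pixel(0))    # -1.0
print(normalize_pixel(255))  # ~1.0 (up to float rounding)
```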
processor_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+ "image_seq_len": 64,
+ "processor_class": "SmolVLMProcessor"
+ }
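The `image_seq_len` of 64 is consistent with the vision settings in config.json: a 512×512 tile at patch size 16 yields 32×32 = 1024 visual patches, and the pixel-shuffle factor of 4 merges each 4×4 patch group into one language-model token. This is my reading of the config values, not an official derivation:

```python
# Values taken from config.json in this commit.
image_size = 512            # vision_config.image_size
patch_size = 16             # vision_config.patch_size
pixel_shuffle_factor = 4    # pixel_shuffle_factor / scale_factor

patches_per_side = image_size // patch_size            # 32
visual_patches = patches_per_side ** 2                 # 1024
tokens_per_tile = visual_patches // pixel_shuffle_factor ** 2
print(tokens_per_tile)  # 64 == "image_seq_len" in processor_config.json
```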
special_tokens_map.json ADDED
@@ -0,0 +1,39 @@
+ {
+ "additional_special_tokens": [
+ "<fake_token_around_image>",
+ "<image>",
+ "<end_of_utterance>"
+ ],
+ "bos_token": {
+ "content": "<|im_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "end_of_utterance_token": "<end_of_utterance>",
+ "eos_token": {
+ "content": "<end_of_utterance>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "fake_image_token": "<fake_token_around_image>",
+ "global_image_token": "<global-img>",
+ "image_token": "<image>",
+ "pad_token": {
+ "content": "<|im_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "unk_token": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,1191 @@
+ {
+ "add_prefix_space": false,
+ "added_tokens_decoder": {
+ "0": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "1": {
+ "content": "<|im_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "2": {
+ "content": "<|im_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "3": {
+ "content": "<repo_name>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "4": {
+ "content": "<reponame>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "5": {
+ "content": "<file_sep>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "6": {
+ "content": "<filename>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "7": {
+ "content": "<gh_stars>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "8": {
+ "content": "<issue_start>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "9": {
+ "content": "<issue_comment>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "10": {
+ "content": "<issue_closed>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "11": {
+ "content": "<jupyter_start>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "12": {
+ "content": "<jupyter_text>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "13": {
+ "content": "<jupyter_code>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "14": {
+ "content": "<jupyter_output>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "15": {
+ "content": "<jupyter_script>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "16": {
+ "content": "<empty_output>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49152": {
+ "content": "<global-img>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49153": {
+ "content": "<row_1_col_1>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49154": {
+ "content": "<row_1_col_2>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49155": {
+ "content": "<row_1_col_3>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49156": {
+ "content": "<row_1_col_4>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49157": {
+ "content": "<row_1_col_5>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49158": {
+ "content": "<row_1_col_6>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49159": {
+ "content": "<row_2_col_1>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49160": {
+ "content": "<row_2_col_2>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49161": {
+ "content": "<row_2_col_3>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49162": {
+ "content": "<row_2_col_4>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49163": {
+ "content": "<row_2_col_5>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49164": {
+ "content": "<row_2_col_6>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49165": {
+ "content": "<row_3_col_1>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49166": {
+ "content": "<row_3_col_2>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49167": {
+ "content": "<row_3_col_3>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49168": {
+ "content": "<row_3_col_4>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49169": {
+ "content": "<row_3_col_5>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49170": {
+ "content": "<row_3_col_6>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49171": {
+ "content": "<row_4_col_1>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49172": {
+ "content": "<row_4_col_2>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49173": {
+ "content": "<row_4_col_3>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49174": {
+ "content": "<row_4_col_4>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49175": {
+ "content": "<row_4_col_5>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49176": {
+ "content": "<row_4_col_6>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49177": {
+ "content": "<row_5_col_1>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49178": {
+ "content": "<row_5_col_2>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49179": {
+ "content": "<row_5_col_3>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49180": {
+ "content": "<row_5_col_4>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49181": {
+ "content": "<row_5_col_5>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49182": {
+ "content": "<row_5_col_6>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49183": {
+ "content": "<row_6_col_1>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49184": {
+ "content": "<row_6_col_2>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49185": {
+ "content": "<row_6_col_3>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49186": {
+ "content": "<row_6_col_4>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49187": {
+ "content": "<row_6_col_5>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49188": {
+ "content": "<row_6_col_6>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49189": {
+ "content": "<fake_token_around_image>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49190": {
+ "content": "<image>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49191": {
+ "content": "<|reserved_special_token_0|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49192": {
+ "content": "<|reserved_special_token_1|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49193": {
+ "content": "<|reserved_special_token_2|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49194": {
+ "content": "<|reserved_special_token_3|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49195": {
+ "content": "<|reserved_special_token_4|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49196": {
+ "content": "<|reserved_special_token_5|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49197": {
+ "content": "<|reserved_special_token_6|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49198": {
+ "content": "<|reserved_special_token_7|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49199": {
+ "content": "<|reserved_special_token_8|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49200": {
+ "content": "<|reserved_special_token_9|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49201": {
+ "content": "<|reserved_special_token_10|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49202": {
+ "content": "<|reserved_special_token_11|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49203": {
+ "content": "<|reserved_special_token_12|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49204": {
+ "content": "<|reserved_special_token_13|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49205": {
+ "content": "<|reserved_special_token_14|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49206": {
+ "content": "<|reserved_special_token_15|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49207": {
+ "content": "<|reserved_special_token_16|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49208": {
+ "content": "<|reserved_special_token_17|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49209": {
+ "content": "<|reserved_special_token_18|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49210": {
+ "content": "<|reserved_special_token_19|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49211": {
+ "content": "<|reserved_special_token_20|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49212": {
+ "content": "<|reserved_special_token_21|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49213": {
+ "content": "<|reserved_special_token_22|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49214": {
+ "content": "<|reserved_special_token_23|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49215": {
+ "content": "<|reserved_special_token_24|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49216": {
+ "content": "<|reserved_special_token_25|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49217": {
+ "content": "<|reserved_special_token_26|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49218": {
+ "content": "<|reserved_special_token_27|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49219": {
+ "content": "<|reserved_special_token_28|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49220": {
+ "content": "<|reserved_special_token_29|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49221": {
+ "content": "<|reserved_special_token_30|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49222": {
+ "content": "<|reserved_special_token_31|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49223": {
+ "content": "<|reserved_special_token_32|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49224": {
+ "content": "<|reserved_special_token_33|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49225": {
+ "content": "<|reserved_special_token_34|>",
+ "lstrip": false,
727
+ "normalized": false,
728
+ "rstrip": false,
729
+ "single_word": false,
730
+ "special": true
731
+ },
732
+ "49226": {
733
+ "content": "<|reserved_special_token_35|>",
734
+ "lstrip": false,
735
+ "normalized": false,
736
+ "rstrip": false,
737
+ "single_word": false,
738
+ "special": true
739
+ },
740
+ "49227": {
741
+ "content": "<|reserved_special_token_36|>",
742
+ "lstrip": false,
743
+ "normalized": false,
744
+ "rstrip": false,
745
+ "single_word": false,
746
+ "special": true
747
+ },
748
+ "49228": {
749
+ "content": "<|reserved_special_token_37|>",
750
+ "lstrip": false,
751
+ "normalized": false,
752
+ "rstrip": false,
753
+ "single_word": false,
754
+ "special": true
755
+ },
756
+ "49229": {
757
+ "content": "<|reserved_special_token_38|>",
758
+ "lstrip": false,
759
+ "normalized": false,
760
+ "rstrip": false,
761
+ "single_word": false,
762
+ "special": true
763
+ },
764
+ "49230": {
765
+ "content": "<|reserved_special_token_39|>",
766
+ "lstrip": false,
767
+ "normalized": false,
768
+ "rstrip": false,
769
+ "single_word": false,
770
+ "special": true
771
+ },
772
+ "49231": {
773
+ "content": "<|reserved_special_token_40|>",
774
+ "lstrip": false,
775
+ "normalized": false,
776
+ "rstrip": false,
777
+ "single_word": false,
778
+ "special": true
779
+ },
780
+ "49232": {
781
+ "content": "<|reserved_special_token_41|>",
782
+ "lstrip": false,
783
+ "normalized": false,
784
+ "rstrip": false,
785
+ "single_word": false,
786
+ "special": true
787
+ },
788
+ "49233": {
789
+ "content": "<|reserved_special_token_42|>",
790
+ "lstrip": false,
791
+ "normalized": false,
792
+ "rstrip": false,
793
+ "single_word": false,
794
+ "special": true
795
+ },
796
+ "49234": {
797
+ "content": "<|reserved_special_token_43|>",
798
+ "lstrip": false,
799
+ "normalized": false,
800
+ "rstrip": false,
801
+ "single_word": false,
802
+ "special": true
803
+ },
804
+ "49235": {
805
+ "content": "<|reserved_special_token_44|>",
806
+ "lstrip": false,
807
+ "normalized": false,
808
+ "rstrip": false,
809
+ "single_word": false,
810
+ "special": true
811
+ },
812
+ "49236": {
813
+ "content": "<|reserved_special_token_45|>",
814
+ "lstrip": false,
815
+ "normalized": false,
816
+ "rstrip": false,
817
+ "single_word": false,
818
+ "special": true
819
+ },
820
+ "49237": {
821
+ "content": "<|reserved_special_token_46|>",
822
+ "lstrip": false,
823
+ "normalized": false,
824
+ "rstrip": false,
825
+ "single_word": false,
826
+ "special": true
827
+ },
828
+ "49238": {
829
+ "content": "<|reserved_special_token_47|>",
830
+ "lstrip": false,
831
+ "normalized": false,
832
+ "rstrip": false,
833
+ "single_word": false,
834
+ "special": true
835
+ },
836
+ "49239": {
837
+ "content": "<|reserved_special_token_48|>",
838
+ "lstrip": false,
839
+ "normalized": false,
840
+ "rstrip": false,
841
+ "single_word": false,
842
+ "special": true
843
+ },
844
+ "49240": {
845
+ "content": "<|reserved_special_token_49|>",
846
+ "lstrip": false,
847
+ "normalized": false,
848
+ "rstrip": false,
849
+ "single_word": false,
850
+ "special": true
851
+ },
852
+ "49241": {
853
+ "content": "<|reserved_special_token_50|>",
854
+ "lstrip": false,
855
+ "normalized": false,
856
+ "rstrip": false,
857
+ "single_word": false,
858
+ "special": true
859
+ },
860
+ "49242": {
861
+ "content": "<|reserved_special_token_51|>",
862
+ "lstrip": false,
863
+ "normalized": false,
864
+ "rstrip": false,
865
+ "single_word": false,
866
+ "special": true
867
+ },
868
+ "49243": {
869
+ "content": "<|reserved_special_token_52|>",
870
+ "lstrip": false,
871
+ "normalized": false,
872
+ "rstrip": false,
873
+ "single_word": false,
874
+ "special": true
875
+ },
876
+ "49244": {
877
+ "content": "<|reserved_special_token_53|>",
878
+ "lstrip": false,
879
+ "normalized": false,
880
+ "rstrip": false,
881
+ "single_word": false,
882
+ "special": true
883
+ },
884
+ "49245": {
885
+ "content": "<|reserved_special_token_54|>",
886
+ "lstrip": false,
887
+ "normalized": false,
888
+ "rstrip": false,
889
+ "single_word": false,
890
+ "special": true
891
+ },
892
+ "49246": {
893
+ "content": "<|reserved_special_token_55|>",
894
+ "lstrip": false,
895
+ "normalized": false,
896
+ "rstrip": false,
897
+ "single_word": false,
898
+ "special": true
899
+ },
900
+ "49247": {
901
+ "content": "<|reserved_special_token_56|>",
902
+ "lstrip": false,
903
+ "normalized": false,
904
+ "rstrip": false,
905
+ "single_word": false,
906
+ "special": true
907
+ },
908
+ "49248": {
909
+ "content": "<|reserved_special_token_57|>",
910
+ "lstrip": false,
911
+ "normalized": false,
912
+ "rstrip": false,
913
+ "single_word": false,
914
+ "special": true
915
+ },
916
+ "49249": {
917
+ "content": "<|reserved_special_token_58|>",
918
+ "lstrip": false,
919
+ "normalized": false,
920
+ "rstrip": false,
921
+ "single_word": false,
922
+ "special": true
923
+ },
924
+ "49250": {
925
+ "content": "<|reserved_special_token_59|>",
926
+ "lstrip": false,
927
+ "normalized": false,
928
+ "rstrip": false,
929
+ "single_word": false,
930
+ "special": true
931
+ },
932
+ "49251": {
933
+ "content": "<|reserved_special_token_60|>",
934
+ "lstrip": false,
935
+ "normalized": false,
936
+ "rstrip": false,
937
+ "single_word": false,
938
+ "special": true
939
+ },
940
+ "49252": {
941
+ "content": "<|reserved_special_token_61|>",
942
+ "lstrip": false,
943
+ "normalized": false,
944
+ "rstrip": false,
945
+ "single_word": false,
946
+ "special": true
947
+ },
948
+ "49253": {
949
+ "content": "<|reserved_special_token_62|>",
950
+ "lstrip": false,
951
+ "normalized": false,
952
+ "rstrip": false,
953
+ "single_word": false,
954
+ "special": true
955
+ },
956
+ "49254": {
957
+ "content": "<|reserved_special_token_63|>",
958
+ "lstrip": false,
959
+ "normalized": false,
960
+ "rstrip": false,
961
+ "single_word": false,
962
+ "special": true
963
+ },
964
+ "49255": {
965
+ "content": "<|reserved_special_token_64|>",
966
+ "lstrip": false,
967
+ "normalized": false,
968
+ "rstrip": false,
969
+ "single_word": false,
970
+ "special": true
971
+ },
972
+ "49256": {
973
+ "content": "<|reserved_special_token_65|>",
974
+ "lstrip": false,
975
+ "normalized": false,
976
+ "rstrip": false,
977
+ "single_word": false,
978
+ "special": true
979
+ },
980
+ "49257": {
981
+ "content": "<|reserved_special_token_66|>",
982
+ "lstrip": false,
983
+ "normalized": false,
984
+ "rstrip": false,
985
+ "single_word": false,
986
+ "special": true
987
+ },
988
+ "49258": {
989
+ "content": "<|reserved_special_token_67|>",
990
+ "lstrip": false,
991
+ "normalized": false,
992
+ "rstrip": false,
993
+ "single_word": false,
994
+ "special": true
995
+ },
996
+ "49259": {
997
+ "content": "<|reserved_special_token_68|>",
998
+ "lstrip": false,
999
+ "normalized": false,
1000
+ "rstrip": false,
1001
+ "single_word": false,
1002
+ "special": true
1003
+ },
1004
+ "49260": {
1005
+ "content": "<|reserved_special_token_69|>",
1006
+ "lstrip": false,
1007
+ "normalized": false,
1008
+ "rstrip": false,
1009
+ "single_word": false,
1010
+ "special": true
1011
+ },
1012
+ "49261": {
1013
+ "content": "<|reserved_special_token_70|>",
1014
+ "lstrip": false,
1015
+ "normalized": false,
1016
+ "rstrip": false,
1017
+ "single_word": false,
1018
+ "special": true
1019
+ },
1020
+ "49262": {
1021
+ "content": "<|reserved_special_token_71|>",
1022
+ "lstrip": false,
1023
+ "normalized": false,
1024
+ "rstrip": false,
1025
+ "single_word": false,
1026
+ "special": true
1027
+ },
1028
+ "49263": {
1029
+ "content": "<|reserved_special_token_72|>",
1030
+ "lstrip": false,
1031
+ "normalized": false,
1032
+ "rstrip": false,
1033
+ "single_word": false,
1034
+ "special": true
1035
+ },
1036
+ "49264": {
1037
+ "content": "<|reserved_special_token_73|>",
1038
+ "lstrip": false,
1039
+ "normalized": false,
1040
+ "rstrip": false,
1041
+ "single_word": false,
1042
+ "special": true
1043
+ },
1044
+ "49265": {
1045
+ "content": "<|reserved_special_token_74|>",
1046
+ "lstrip": false,
1047
+ "normalized": false,
1048
+ "rstrip": false,
1049
+ "single_word": false,
1050
+ "special": true
1051
+ },
1052
+ "49266": {
1053
+ "content": "<|reserved_special_token_75|>",
1054
+ "lstrip": false,
1055
+ "normalized": false,
1056
+ "rstrip": false,
1057
+ "single_word": false,
1058
+ "special": true
1059
+ },
1060
+ "49267": {
1061
+ "content": "<|reserved_special_token_76|>",
1062
+ "lstrip": false,
1063
+ "normalized": false,
1064
+ "rstrip": false,
1065
+ "single_word": false,
1066
+ "special": true
1067
+ },
1068
+ "49268": {
1069
+ "content": "<|reserved_special_token_77|>",
1070
+ "lstrip": false,
1071
+ "normalized": false,
1072
+ "rstrip": false,
1073
+ "single_word": false,
1074
+ "special": true
1075
+ },
1076
+ "49269": {
1077
+ "content": "<|reserved_special_token_78|>",
1078
+ "lstrip": false,
1079
+ "normalized": false,
1080
+ "rstrip": false,
1081
+ "single_word": false,
1082
+ "special": true
1083
+ },
1084
+ "49270": {
1085
+ "content": "<|reserved_special_token_79|>",
1086
+ "lstrip": false,
1087
+ "normalized": false,
1088
+ "rstrip": false,
1089
+ "single_word": false,
1090
+ "special": true
1091
+ },
1092
+ "49271": {
1093
+ "content": "<|reserved_special_token_80|>",
1094
+ "lstrip": false,
1095
+ "normalized": false,
1096
+ "rstrip": false,
1097
+ "single_word": false,
1098
+ "special": true
1099
+ },
1100
+ "49272": {
1101
+ "content": "<|reserved_special_token_81|>",
1102
+ "lstrip": false,
1103
+ "normalized": false,
1104
+ "rstrip": false,
1105
+ "single_word": false,
1106
+ "special": true
1107
+ },
1108
+ "49273": {
1109
+ "content": "<|reserved_special_token_82|>",
1110
+ "lstrip": false,
1111
+ "normalized": false,
1112
+ "rstrip": false,
1113
+ "single_word": false,
1114
+ "special": true
1115
+ },
1116
+ "49274": {
1117
+ "content": "<|reserved_special_token_83|>",
1118
+ "lstrip": false,
1119
+ "normalized": false,
1120
+ "rstrip": false,
1121
+ "single_word": false,
1122
+ "special": true
1123
+ },
1124
+ "49275": {
1125
+ "content": "<|reserved_special_token_84|>",
1126
+ "lstrip": false,
1127
+ "normalized": false,
1128
+ "rstrip": false,
1129
+ "single_word": false,
1130
+ "special": true
1131
+ },
1132
+ "49276": {
1133
+ "content": "<|reserved_special_token_85|>",
1134
+ "lstrip": false,
1135
+ "normalized": false,
1136
+ "rstrip": false,
1137
+ "single_word": false,
1138
+ "special": true
1139
+ },
1140
+ "49277": {
1141
+ "content": "<|reserved_special_token_86|>",
1142
+ "lstrip": false,
1143
+ "normalized": false,
1144
+ "rstrip": false,
1145
+ "single_word": false,
1146
+ "special": true
1147
+ },
1148
+ "49278": {
1149
+ "content": "<|reserved_special_token_87|>",
1150
+ "lstrip": false,
1151
+ "normalized": false,
1152
+ "rstrip": false,
1153
+ "single_word": false,
1154
+ "special": true
1155
+ },
1156
+ "49279": {
1157
+ "content": "<end_of_utterance>",
1158
+ "lstrip": false,
1159
+ "normalized": false,
1160
+ "rstrip": false,
1161
+ "single_word": false,
1162
+ "special": true
1163
+ }
1164
+ },
1165
+ "additional_special_tokens": [
1166
+ "<fake_token_around_image>",
1167
+ "<image>",
1168
+ "<end_of_utterance>"
1169
+ ],
1170
+ "bos_token": "<|im_start|>",
1171
+ "clean_up_tokenization_spaces": false,
1172
+ "end_of_utterance_token": "<end_of_utterance>",
1173
+ "eos_token": "<end_of_utterance>",
1174
+ "extra_special_tokens": {
1175
+ "end_of_utterance_token": "<end_of_utterance>",
1176
+ "fake_image_token": "<fake_token_around_image>",
1177
+ "global_image_token": "<global-img>",
1178
+ "image_token": "<image>"
1179
+ },
1180
+ "fake_image_token": "<fake_token_around_image>",
1181
+ "global_image_token": "<global-img>",
1182
+ "image_token": "<image>",
1183
+ "legacy": false,
1184
+ "model_max_length": 8192,
1185
+ "pad_token": "<|im_end|>",
1186
+ "processor_class": "SmolVLMProcessor",
1187
+ "tokenizer_class": "GPT2Tokenizer",
1188
+ "truncation_side": "left",
1189
+ "unk_token": "<|endoftext|>",
1190
+ "vocab_size": 49152
1191
+ }
video_preprocessor_config.json ADDED
@@ -0,0 +1,48 @@
+ {
+ "crop_size": null,
+ "data_format": "channels_first",
+ "default_to_square": true,
+ "device": null,
+ "do_center_crop": null,
+ "do_convert_rgb": true,
+ "do_image_splitting": true,
+ "do_normalize": true,
+ "do_pad": true,
+ "do_rescale": true,
+ "do_resize": true,
+ "do_sample_frames": false,
+ "fps": 1,
+ "image_mean": [
+ 0.5,
+ 0.5,
+ 0.5
+ ],
+ "image_processor_type": "SmolVLMImageProcessor",
+ "image_std": [
+ 0.5,
+ 0.5,
+ 0.5
+ ],
+ "input_data_format": null,
+ "max_image_size": {
+ "longest_edge": 512
+ },
+ "num_frames": 64,
+ "processor_class": "SmolVLMProcessor",
+ "resample": 1,
+ "rescale_factor": 0.00392156862745098,
+ "return_metadata": false,
+ "size": {
+ "longest_edge": 2048
+ },
+ "size_divisor": null,
+ "video_metadata": null,
+ "video_processor_type": "SmolVLMVideoProcessor",
+ "video_sampling": {
+ "fps": 1,
+ "max_frames": 64,
+ "video_size": {
+ "longest_edge": 2048
+ }
+ }
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff
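A few of the preprocessor values added above can be sanity-checked offline. The sketch below assumes the values from `video_preprocessor_config.json` in this commit (`rescale_factor`, `size.longest_edge`); the `resize_longest_edge` helper is a hypothetical illustration of longest-edge resizing, not the SmolVLM image processor's actual implementation:

```python
# Sanity-check values from video_preprocessor_config.json (this commit).
# resize_longest_edge is a hypothetical helper, not the SmolVLM code path.

def resize_longest_edge(width: int, height: int, longest_edge: int) -> tuple[int, int]:
    """Scale (width, height) so the longer side equals longest_edge,
    preserving aspect ratio (rounded to whole pixels)."""
    scale = longest_edge / max(width, height)
    return round(width * scale), round(height * scale)

# rescale_factor in the config is 1/255, i.e. uint8 pixels -> [0, 1].
rescale_factor = 0.00392156862745098
assert abs(rescale_factor - 1 / 255) < 1e-15

# A 4096x2160 frame resized to the config's size.longest_edge of 2048.
print(resize_longest_edge(4096, 2160, 2048))
```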